[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-05 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100054#comment-17100054
 ] 

Hudson commented on TIKA-3094:
--

SUCCESS: Integrated in Jenkins build Tika-trunk #1812 (See 
[https://builds.apache.org/job/Tika-trunk/1812/])
TIKA-3094 -- new metadata for every parse :( (tallison: 
[https://github.com/apache/tika/commit/4a558303d1ed9b352e519d35a48a4e30367ebfff])
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java


> Apache Tika fails to extract text for pptx extension.
> -
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24, 1.24.1
> Reporter: Abhishek Chauhan
> Assignee: Bob Paulin
> Priority: Critical
> Attachments: Sample PPT.pptx
>
>
> This is a regression from Apache Tika 1.23. Text extraction for the .pptx 
> extension, which worked with Apache Tika 1.23, no longer works in 1.24.
> For the .ppt extension, it works fine in both 1.23 and 1.24.
>  
> According to the release notes [https://tika.apache.org/1.24/index.html], 
> POI was updated to 4.1.2, which might be the root cause of this problem. 
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2], 
> which I guess is not present in the bundle.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099964#comment-17099964
 ] 

Tim Allison commented on TIKA-3094:
---

Thank you, [~bob]!  On #3, that was my idiocy in not initializing a fresh 
Metadata object for each file.  Fixed in master.
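
For readers following along, here is a minimal sketch of why reusing one metadata object across files can produce wrong results. It uses only the JDK: the {{parse}} method and the map-based metadata are hypothetical stand-ins, not Tika's API and not the actual BundleIT code. The assumption illustrated is that values written for one file remain visible when the same object is handed to the next parse.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: a stand-in for a parse loop, not Tika code. */
final class FreshMetadataSketch {

    // Hypothetical parser stand-in: always records the file name, and sets an
    // "encrypted" flag only for files whose name contains "encrypted".
    static void parse(String fileName, Map<String, String> metadata) {
        metadata.put("resourceName", fileName);
        if (fileName.contains("encrypted")) {
            metadata.put("encrypted", "true");
        }
    }

    // Problem pattern: one metadata object is created once and reused for every file.
    static List<Map<String, String>> parseAllShared(List<String> files) {
        Map<String, String> metadata = new LinkedHashMap<>();
        List<Map<String, String>> results = new ArrayList<>();
        for (String file : files) {
            parse(file, metadata);
            // Snapshot of what the shared object holds after this file.
            results.add(new LinkedHashMap<>(metadata));
        }
        return results;
    }

    // Corrected pattern: a fresh metadata object is created for each file.
    static List<Map<String, String>> parseAllFresh(List<String> files) {
        List<Map<String, String>> results = new ArrayList<>();
        for (String file : files) {
            Map<String, String> metadata = new LinkedHashMap<>();
            parse(file, metadata);
            results.add(metadata);
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("testAccess2_encrypted.accdb", "Sample PPT.pptx");
        System.out.println("shared: " + parseAllShared(files));
        System.out.println("fresh:  " + parseAllFresh(files));
    }
}
```

In the shared variant, the second file's snapshot still carries the flag set while handling the first file; in the fresh variant it does not.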



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-05 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099868#comment-17099868
 ] 

Bob Paulin commented on TIKA-3094:
--

Thanks [~tallison].  For #2, JAXB was removed from the JDK distribution in Java 
11, so I'm not surprised there.  We should be able to correct that by bringing in 
the dependencies.  I'm a little surprised you hit it on Java 8, since it's still in 
the JDK there, but OSGi might be hiding it.  Will take a look.
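
One quick way to check whether a class is visible from a given class loader is a small probe like the sketch below. It uses only the JDK; the class and method names ({{ClassProbeSketch}}, {{isLoadable}}) are made up for illustration. Note the assumption: to say anything about the OSGi case, the probe would have to run with the bundle's own class loader, since a plain JVM class loader can see classes that a bundle cannot.

```java
/** Illustration only: probes whether a class name resolves via a class loader. */
final class ClassProbeSketch {

    // Returns true if the named class can be found by the given loader.
    // The class is located but not initialized (second argument is false).
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumption: this loader stands in for whichever loader is being diagnosed.
        ClassLoader loader = ClassProbeSketch.class.getClassLoader();
        String name = "javax.xml.bind.JAXBException";
        System.out.println(name + " loadable: " + isLoadable(name, loader));
    }
}
```

Run on a plain JVM, this reports whether the JDK or the classpath supplies the class; the result inside an OSGi container can differ.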



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-05 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099859#comment-17099859
 ] 

Tim Allison commented on TIKA-3094:
---

Hi [~bob], I'll take #3.

On #2, if you comment out the following in master, that's the triggering file:

{noformat}
needToFix.add("testAccess2_encrypted.accdb");
{noformat}

You should be able to reproduce it at least on Java 8.  I _think_ I got it on 
both 8 and 11 last night, but may be mistaken.  Wait, yes, I got it on at least 
8 last night, and I can reproduce it on 11 this morning.

{noformat}
java.lang.ClassNotFoundException: javax.xml.bind.JAXBException not found by org.apache.tika.bundle [19]
    at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1639)
    at org.apache.felix.framework.BundleWiringImpl.access$200(BundleWiringImpl.java:80)
    at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:2053)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    at com.healthmarketscience.jackcess.impl.office.AgileEncryptionProvider.<init>(AgileEncryptionProvider.java:70)
    at com.healthmarketscience.jackcess.impl.OfficeCryptCodecHandler.create(OfficeCryptCodecHandler.java:89)
    at com.healthmarketscience.jackcess.CryptCodecProvider.createHandler(CryptCodecProvider.java:116)
    at com.healthmarketscience.jackcess.impl.PageChannel.initialize(PageChannel.java:105)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.<init>(DatabaseImpl.java:554)
    at com.healthmarketscience.jackcess.impl.DatabaseImpl.open(DatabaseImpl.java:415)
    at com.healthmarketscience.jackcess.DatabaseBuilder.open(DatabaseBuilder.java:267)
    at org.apache.tika.parser.microsoft.JackcessParser.parse(JackcessParser.java:95)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
{noformat}



[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.

2020-05-05 Thread Bob Paulin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099848#comment-17099848
 ] 

Bob Paulin commented on TIKA-3094:
--

Hey [~tallison], I ran a build on Java 8 and Java 11 and was unable to 
recreate #2.  Can you provide more details, perhaps the output of the stack 
trace you're getting?

For #3, I do get errors running the build, but I'm not sure which are expected 
and which are the ones with the wrong metadata.  Can you provide an example?

Also, it might be helpful to separate these out into different JIRAs.  This one 
is snowballing a bit.
