[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100054#comment-17100054 ]

Hudson commented on TIKA-3094:
------------------------------

SUCCESS: Integrated in Jenkins build Tika-trunk #1812 (See [https://builds.apache.org/job/Tika-trunk/1812/])
TIKA-3094 -- new metadata for every parse :( (tallison: [https://github.com/apache/tika/commit/4a558303d1ed9b352e519d35a48a4e30367ebfff])
* (edit) tika-bundle/src/test/java/org/apache/tika/bundle/BundleIT.java

> Apache Tika fails to extract text for pptx extension.
> -----------------------------------------------------
>
> Key: TIKA-3094
> URL: https://issues.apache.org/jira/browse/TIKA-3094
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24, 1.24.1
> Reporter: Abhishek Chauhan
> Assignee: Bob Paulin
> Priority: Critical
> Attachments: Sample PPT.pptx
>
>
> This is a regression from Apache Tika 1.23. Text extraction for the .pptx
> extension, which worked with Apache Tika 1.23, no longer works in 1.24.
> For the .ppt extension it works fine in both 1.23 and 1.24.
>
> According to the release notes [https://tika.apache.org/1.24/index.html],
> POI was updated to 4.1.2. That might be the root cause of this problem.
> POI requires [https://mvnrepository.com/artifact/com.zaxxer/SparseBitSet/1.2],
> which is not present in the bundle, I guess.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099964#comment-17099964 ]

Tim Allison commented on TIKA-3094:
-----------------------------------

Thank you, [~bob]! On 3, that was my idiocy in not initializing a fresh Metadata object for each file. Fixed in master.
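For readers who want to see why reusing one metadata object across files produces wrong results, here is a minimal sketch in plain Java. It is not the code from BundleIT.java: a `HashMap` stands in for Tika's `Metadata` object, and `fakeParse`, the file names, and the keys are all hypothetical names chosen for illustration.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

public class FreshMetadataSketch {

    // Hypothetical stand-in for a parser: copies only the properties this "file" has.
    static void fakeParse(Map<String, String> fileProps, Map<String, String> metadata) {
        metadata.putAll(fileProps);
    }

    // Buggy pattern: one metadata map shared by every file, so earlier keys leak forward.
    static Map<String, Map<String, String>> parseAllShared(Map<String, Map<String, String>> files) {
        Map<String, Map<String, String>> results = new LinkedHashMap<>();
        Map<String, String> metadata = new HashMap<>();
        for (Map.Entry<String, Map<String, String>> e : files.entrySet()) {
            fakeParse(e.getValue(), metadata);
            // Snapshot of what a caller would observe after parsing this file.
            results.put(e.getKey(), new HashMap<>(metadata));
        }
        return results;
    }

    // Fixed pattern: a fresh metadata map for each file.
    static Map<String, Map<String, String>> parseAllFresh(Map<String, Map<String, String>> files) {
        Map<String, Map<String, String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, String>> e : files.entrySet()) {
            Map<String, String> metadata = new HashMap<>();
            fakeParse(e.getValue(), metadata);
            results.put(e.getKey(), metadata);
        }
        return results;
    }

    // Two hypothetical files; only the first one carries an "Author" property.
    static Map<String, Map<String, String>> sampleFiles() {
        Map<String, String> first = new HashMap<>();
        first.put("Content-Type", "application/pdf");
        first.put("Author", "someone");
        Map<String, String> second = new HashMap<>();
        second.put("Content-Type", "text/plain");

        Map<String, Map<String, String>> files = new LinkedHashMap<>();
        files.put("a.pdf", first);
        files.put("b.txt", second);
        return files;
    }

    public static void main(String[] args) {
        // Shared map: b.txt wrongly appears to have an Author.
        System.out.println(parseAllShared(sampleFiles()).get("b.txt").containsKey("Author"));
        // Fresh map per file: b.txt has no Author.
        System.out.println(parseAllFresh(sampleFiles()).get("b.txt").containsKey("Author"));
    }
}
```

With a shared map, the second file inherits `Author` from the first; with a fresh map per file it does not. The same reasoning applies to any test loop that asserts on per-file metadata.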
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099868#comment-17099868 ]

Bob Paulin commented on TIKA-3094:
----------------------------------

Thanks, [~tallison]. For #2, JAXB was removed from the JDK distribution in Java 11, so I'm not surprised there. We should be able to correct that by bringing in the dependencies. I'm a little surprised you hit it on Java 8, since it's still in the JDK there, but OSGi might be hiding it. Will take a look.
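A quick way to check whether the JAXB classes are visible to a given class loader is to try loading one of them by name. The sketch below is only an illustration: the class name `JaxbProbe` and the helper `isLoadable` are made up for this example, and it probes the plain application class loader, not an OSGi bundle class loader, so it would not by itself reproduce the behaviour seen inside the bundle.

```java
public class JaxbProbe {

    // Returns true if the named class can be loaded (without initializing it)
    // by the given class loader, false otherwise.
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = JaxbProbe.class.getClassLoader();
        System.out.println("java.version = " + System.getProperty("java.version"));
        // Expected to be loadable on a stock Java 8 JDK; on Java 11+ only if a
        // JAXB API dependency is on the class path.
        System.out.println("javax.xml.bind.JAXBException loadable: "
                + isLoadable("javax.xml.bind.JAXBException", loader));
    }
}
```

Inside an OSGi container the result depends on the bundle's own wiring (imported packages and boot delegation), so the same probe would need to be run with the bundle's class loader to be meaningful there.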
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099859#comment-17099859 ]

Tim Allison commented on TIKA-3094:
-----------------------------------

Hi [~bob], I'll take #3. On #2, if you comment out the following in master, that's the triggering file:
{noformat}
needToFix.add("testAccess2_encrypted.accdb");
{noformat}
You should be able to reproduce it at least in 8. I _think_ I got it in both 8 and 11 last night, but may be mistaken. Wait, yes, I got it in at least 8 last night, and I can reproduce in 11 this morning.
{noformat}
java.lang.ClassNotFoundException: javax.xml.bind.JAXBException not found by org.apache.tika.bundle [19]
	at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1639)
	at org.apache.felix.framework.BundleWiringImpl.access$200(BundleWiringImpl.java:80)
	at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:2053)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
	at com.healthmarketscience.jackcess.impl.office.AgileEncryptionProvider.(AgileEncryptionProvider.java:70)
	at com.healthmarketscience.jackcess.impl.OfficeCryptCodecHandler.create(OfficeCryptCodecHandler.java:89)
	at com.healthmarketscience.jackcess.CryptCodecProvider.createHandler(CryptCodecProvider.java:116)
	at com.healthmarketscience.jackcess.impl.PageChannel.initialize(PageChannel.java:105)
	at com.healthmarketscience.jackcess.impl.DatabaseImpl.(DatabaseImpl.java:554)
	at com.healthmarketscience.jackcess.impl.DatabaseImpl.open(DatabaseImpl.java:415)
	at com.healthmarketscience.jackcess.DatabaseBuilder.open(DatabaseBuilder.java:267)
	at org.apache.tika.parser.microsoft.JackcessParser.parse(JackcessParser.java:95)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:277)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
{noformat}
[jira] [Commented] (TIKA-3094) Apache Tika fails to extract text for pptx extension.
[ https://issues.apache.org/jira/browse/TIKA-3094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17099848#comment-17099848 ]

Bob Paulin commented on TIKA-3094:
----------------------------------

Hey [~tallison], I ran a build on Java 8 and Java 11 and I was unable to recreate #2. Can you provide more details, perhaps the stack trace you're getting? For #3, I do get errors running the build, but I'm not sure which are expected and which are the ones with the wrong metadata. Can you provide an example? Also, it might be helpful to separate these out into different JIRAs. This one is snowballing a bit.