[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479715#comment-17479715 ] Hudson commented on TIKA-3164: -- UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #430 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/430/]) TIKA-3164 -- avoid deprecated SAXHelper (tallison: [https://github.com/apache/tika/commit/7ed25f2e61994c51e2ba38e11bdd1ec15ed1f625]) * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java TIKA-3164 -- avoid deprecated SAXHelper -- fix checkstyle (tallison: [https://github.com/apache/tika/commit/c8804ad5d0a5c48a7947018b7d319c00980bbbcb]) * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17479623#comment-17479623 ] Hudson commented on TIKA-3164: -- UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #429 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/429/]) TIKA-3164, take two -- still broken in the bundle (tallison: [https://github.com/apache/tika/commit/4933fc4750ef200b19a2839fc1249ecdd23ec67f]) * (edit) tika-bundles/tika-bundle-standard/pom.xml * (edit) tika-bundles/tika-bundle-standard/src/test/java/org/apache/tika/bundle/BundleIT.java TIKA-3164: better but still broken (tallison: [https://github.com/apache/tika/commit/e3ff863058d3bb8036fb618c314966824e295c99]) * (edit) tika-bundles/tika-bundle-standard/pom.xml TIKA-3164 -- Upgrade to Apache POI 5.2.0. Many thanks to PJ Fanning for fixing the osgi integration. (tallison: [https://github.com/apache/tika/commit/b30ef77763d54a742143fa6af139d532b99297d6]) * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java * (edit) CHANGES.txt * (edit) tika-bundles/tika-bundle-standard/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSTextExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java * (edit) tika-parent/pom.xml * (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java * (edit) tika-eval/tika-eval-app/src/main/resources/log4j2.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.2.2 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478095#comment-17478095 ] Tim Allison commented on TIKA-3164: --- Will give these a try. Thank you! > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478079#comment-17478079 ] PJ Fanning commented on TIKA-3164: -- Again, I know nothing about OSGi, but com.sun.org.apache.xpath.internal.jaxp is the package that contains the default XPathFactory, so you probably need to add it to [https://github.com/apache/tika/blob/TIKA-3164-v2/tika-bundles/tika-bundle-standard/pom.xml|https://github.com/apache/tika/blob/TIKA-3164-v2/tika-bundles/tika-bundle-standard/pom.xml#L209] > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
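As a hedged illustration of the suggestion above: one way to check whether a default XPathFactory can be created from a given class-loading context is a small probe like the one below. The class name XPathFactoryProbe and its method are hypothetical and are not part of Tika or POI. The sketch only calls the standard JAXP method javax.xml.xpath.XPathFactory.newInstance(), which, as the trace quoted elsewhere in this thread shows, throws a RuntimeException when no implementation can be found.

```java
import javax.xml.xpath.XPathFactory;

// Hypothetical diagnostic helper (not part of Tika or POI).
final class XPathFactoryProbe {

    /**
     * Returns the class name of the default XPathFactory implementation,
     * or null if XPathFactory.newInstance() cannot create one.
     */
    static String defaultFactoryClassName() {
        try {
            return XPathFactory.newInstance().getClass().getName();
        } catch (RuntimeException e) {
            // newInstance() wraps the XPathFactoryConfigurationException
            // in a RuntimeException for the default object model.
            return null;
        }
    }

    public static void main(String[] args) {
        String name = defaultFactoryClassName();
        System.out.println(name == null
                ? "No default XPathFactory could be created"
                : "Default XPathFactory: " + name);
    }
}
```

On a plain JVM this should report the JDK's built-in factory class. If the same call were made from inside the OSGi bundle and returned null, that would be consistent with the implementation package not being visible there, although this probe alone does not prove which package is missing.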
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17478068#comment-17478068 ] PJ Fanning commented on TIKA-3164: -- [~tallison] I don't know much about the Tika build. I notice, though, that [https://github.com/apache/tika/blob/main/tika-bundles/tika-bundle-standard/pom.xml#L146] makes no reference to log4j-api, which is an important dependency for POI. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
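A minimal sketch, under stated assumptions, of how one might confirm whether the log4j-api class named in the NoClassDefFoundError is visible to a particular class loader. ClassVisibilityProbe and isVisible are hypothetical names introduced for illustration; the only library call used is the standard Class.forName(String, boolean, ClassLoader). The class name being probed is the one reported in the stack trace in this thread.

```java
// Hypothetical diagnostic helper (not part of Tika or POI).
final class ClassVisibilityProbe {

    /**
     * Returns true if the named class can be located through the given
     * class loader. The class is not initialized by this check.
     */
    static boolean isVisible(String className, ClassLoader loader) {
        try {
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ClassVisibilityProbe.class.getClassLoader();
        // Class reported as missing in the ForkParser trace in this thread.
        String name = "org.apache.logging.log4j.spi.LoggerContextFactory";
        System.out.println(name + " visible: " + isVisible(name, loader));
    }
}
```

Passing a bundle's class loader instead of the probe's own loader would show whether that bundle can see log4j-api; the result depends entirely on the classpath or bundle wiring in use.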
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17477990#comment-17477990 ] Tim Allison commented on TIKA-3164: ---
Thank you [~pj.fanning] ! I'm still not able to figure out how to configure our tika-bundle-standard (branch TIKA-3164-v2) to work with POI 5.x. If you or anyone else can help with this, I'd appreciate it. I'd really like to move to POI 5.x.
{noformat}
org.ops4j.pax.logging.pax-logging-api[org.ops4j.pax.logging.internal.Activator] : Disabling JULI Logger API support.
[ERROR] SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
[ERROR] SLF4J: Defaulting to no-operation (NOP) logger implementation
[ERROR] SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[ERROR] java.lang.NoClassDefFoundError: org/apache/logging/log4j/spi/LoggerContextFactory
[ERROR] at org.apache.poi.openxml4j.util.ZipSecureFile.(ZipSecureFile.java:37)
[ERROR] at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.(OOXMLParser.java:103)
[ERROR] at sun.misc.Unsafe.ensureClassInitialized(Native Method)
[ERROR] at sun.reflect.UnsafeFieldAccessorFactory.newFieldAccessor(UnsafeFieldAccessorFactory.java:43)
[ERROR] at sun.reflect.ReflectionFactory.newFieldAccessor(ReflectionFactory.java:156)
[ERROR] at java.lang.reflect.Field.acquireFieldAccessor(Field.java:1088)
[ERROR] at java.lang.reflect.Field.getFieldAccessor(Field.java:1069)
[ERROR] at java.lang.reflect.Field.getLong(Field.java:611)
[ERROR] at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1875)
[ERROR] at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:79)
[ERROR] at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:506)
[ERROR] at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
[ERROR] at java.security.AccessController.doPrivileged(Native Method)
[ERROR] at java.io.ObjectStreamClass.(ObjectStreamClass.java:494)
[ERROR] at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
[ERROR] at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:681)
[ERROR] at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2003)
[ERROR] at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1850)
[ERROR] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2160)
[ERROR] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
[ERROR] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
[ERROR] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
[ERROR] at java.util.ArrayList.readObject(ArrayList.java:799)
[ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[ERROR] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[ERROR] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[ERROR] at java.lang.reflect.Method.invoke(Method.java:498)
[ERROR] at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1184)
[ERROR] at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2296)
[ERROR] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
[ERROR] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
[ERROR] at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2405)
[ERROR] at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2329)
[ERROR] at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2187)
[ERROR] at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1667)
[ERROR] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
[ERROR] at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
[ERROR] at org.apache.tika.fork.ForkObjectInputStream.readObject(ForkObjectInputStream.java:97)
[ERROR] at org.apache.tika.fork.ForkServer.readObject(ForkServer.java:293)
[ERROR] at org.apache.tika.fork.ForkServer.initializeParserAndLoader(ForkServer.java:209)
[ERROR] at org.apache.tika.fork.ForkServer.processRequests(ForkServer.java:147)
[ERROR] at org.apache.tika.fork.ForkServer.main(ForkServer.java:121)
[ERROR] Caused by: java.lang.ClassNotFoundException: Unable to find class org.apache.logging.log4j.spi.LoggerContextFactory
[ERROR] at org.apache.tika.fork.ClassLoaderProxy.findClass(ClassLoaderProxy.java:119)
[ERROR] at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
[ERROR] at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
[ERROR] ... 42 more
{noformat}
and
{noformat}
[ERROR] org.apache.tika.bundle.BundleIT.testPoiTikaBundle Time elapsed: 2.648 s <<< ERROR!
java.lang.RuntimeException: XPathFactory#newInstance() failed to create an
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17476495#comment-17476495 ] PJ Fanning commented on TIKA-3164: -- POI 5.2.0 is out - it has a fix for [https://bz.apache.org/bugzilla/show_bug.cgi?id=65676] - so Tika won't need its own forked version of XSSFReader > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458686#comment-17458686 ] Tim Allison commented on TIKA-3164: --- Oh my goodness, thank you [~bob]! There's no rush. POI 5.x will go out with an upgraded PDFBox early in the new year. Thank you! > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458683#comment-17458683 ] Bob Paulin commented on TIKA-3164: -- Hey [~tallison]. Saw the mention, but I will likely not get to this for a few days. I did a few tests yesterday and I'm able to recreate your results on my machine, but I don't have any specific recommendations yet. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17458428#comment-17458428 ] Tim Allison commented on TIKA-3164: ---
I'm still not able to get the bundle to work; see the TIKA-3164-v2 branch. I'm now getting two exceptions, one in the ForkParser test, one in the poi bundle test.
ForkParser test
{noformat}
java.lang.NoClassDefFoundError: org/apache/logging/log4j/spi/LoggerContextFactory
[ERROR] at org.apache.poi.openxml4j.util.ZipSecureFile.(ZipSecureFile.java:37)
[ERROR] at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.(OOXMLParser.java:103)
[ERROR] at sun.misc.Unsafe.ensureClassInitialized(Native Method)
[ERROR] at sun.reflect.UnsafeFieldAccessorFactory.newFieldAccessor(UnsafeFieldAccessorFactory.java:43)
[ERROR] at sun.reflect.ReflectionFactory.newFieldAccessor(ReflectionFactory.java:156)
[ERROR] at java.lang.reflect.Field.acquireFieldAccessor(Field.java:1088)
[ERROR] at java.lang.reflect.Field.getFieldAccessor(Field.java:1069)
[ERROR] at java.lang.reflect.Field.getLong(Field.java:611)
{noformat}
and testPoiTikaBundle
{noformat}
java.lang.RuntimeException: XPathFactory#newInstance() failed to create an XPathFactory for the default object model: http://java.sun.com/jaxp/xpath/dom with the XPathFactoryConfigurationException: javax.xml.xpath.XPathFactoryConfigurationException: No XPathFctory implementation found for the object model: http://java.sun.com/jaxp/xpath/dom
at org.apache.tika.bundle.BundleIT.testPoiTikaBundle(BundleIT.java:313)
{noformat}
> Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457841#comment-17457841 ] ASF GitHub Bot commented on TIKA-3164: -- tballison opened a new pull request #462: URL: https://github.com/apache/tika/pull/462 Thanks for your contribution to [Apache Tika](https://tika.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Tika issue tracker](https://issues.apache.org/jira/projects/TIKA) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes. * the issue ID (`TIKA-`) - is referenced in the title of the pull request - and placed in front of your commit messages surrounded by square brackets (`[TIKA-] Issue or pull request title`) * commits are squashed into a single one (or few commits for larger changes) * Tika is successfully built and unit tests pass by running `mvn clean test` * there should be no conflicts when merging the pull request branch into the *recent* `main` branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled `main` branch. We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Tika in general, please sign up for the [Tika mailing list](http://tika.apache.org/mail-lists.html). Thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457821#comment-17457821 ] ASF GitHub Bot commented on TIKA-3164: -- tballison merged pull request #462: URL: https://github.com/apache/tika/pull/462 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457688#comment-17457688 ] Hudson commented on TIKA-3164: -- UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #381 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/381/]) TIKA-3164, revert back to poi 4.x in main (tallison: [https://github.com/apache/tika/commit/10d925439cd862f74679ec5fa9a9b5863f50ce2c]) * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java * (delete) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OPCPackageWrapper.java * (edit) tika-bundles/tika-bundle-standard/pom.xml * (edit) CHANGES.txt * (edit) tika-parent/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java * (delete) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/TikaXSSFSheetXMLHandler.java * (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSTextExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457626#comment-17457626 ] Tim Allison commented on TIKA-3164: --- I've created a TIKA-3164-v2 branch to use until we can get POI 5.1.0 working in the bundle. I'll revert main to POI 4.1.2 for now. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457604#comment-17457604 ] Tim Allison commented on TIKA-3164: --- I'm really hoping we don't have to do this: https://craftsmen.nl/getting-log4j2-to-work-in-an-osgi-context/ > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457495#comment-17457495 ] Hudson commented on TIKA-3164: -- UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #380 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/380/]) TIKA-3164 update POI to 5.1.0 -- try to fix bundle (tallison: [https://github.com/apache/tika/commit/c2ee0234700519e95aefe4199d91a4d6b56b5ec6]) * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pdf-module/pom.xml * (edit) tika-parsers/tika-parsers-ml/tika-parser-nlp-module/pom.xml * (edit) tika-pipes/tika-fetchers/tika-fetcher-http/pom.xml * (edit) tika-langdetect/tika-langdetect-tika/pom.xml * (edit) tika-pipes/tika-emitters/tika-emitter-fs/pom.xml * (edit) tika-pipes/pom.xml * (edit) tika-pipes/tika-fetchers/tika-fetcher-s3/pom.xml * (edit) tika-pipes/tika-pipes-iterators/tika-pipes-iterator-solr/pom.xml * (edit) tika-translate/pom.xml * (edit) tika-integration-tests/tika-pipes-s3-integration-tests/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-audiovideo-module/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-jdbc-commons/pom.xml * (edit) tika-parsers/tika-parsers-extended/pom.xml * (edit) tika-parsers/tika-parsers-ml/tika-parser-advancedmedia-module/pom.xml * (edit) tika-pipes/tika-httpclient-commons/pom.xml * (edit) tika-server/pom.xml * (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-package/pom.xml * (edit) pom.xml * (edit) tika-pipes/tika-pipes-iterators/tika-pipes-iterator-gcs/pom.xml * (edit) tika-parsers/tika-parsers-ml/tika-transcribe-aws/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-commons/pom.xml * (edit) 
tika-integration-tests/tika-pipes-solr-integration-tests/pom.xml * (edit) tika-parsers/pom.xml * (edit) tika-serialization/pom.xml * (edit) tika-fuzzing/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-crypto-module/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-zip-commons/pom.xml * (edit) tika-server/tika-server-core/pom.xml * (edit) tika-app/pom.xml * (edit) tika-langdetect/tika-langdetect-lingo24/pom.xml * (edit) tika-parsers/tika-parsers-ml/pom.xml * (edit) tika-eval/tika-eval-core/pom.xml * (edit) tika-java7/pom.xml * (edit) tika-example/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-apple-module/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-commons/pom.xml * (edit) tika-integration-tests/pom.xml * (edit) tika-integration-tests/tika-pipes-opensearch-integration-tests/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-digest-commons/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-mail-module/pom.xml * (edit) tika-bundles/pom.xml * (edit) tika-langdetect/tika-langdetect-opennlp/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-xmp-commons/pom.xml * (edit) tika-langdetect/tika-langdetect-optimaize/pom.xml * (edit) tika-parsers/tika-parsers-extended/tika-parser-sqlite3-module/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-package/pom.xml * (edit) tika-parsers/tika-parsers-ml/tika-age-recogniser/pom.xml * (edit) tika-pipes/tika-emitters/tika-emitter-gcs/pom.xml * (edit) tika-server/tika-server-standard/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-font-module/pom.xml 
* (edit) tika-pipes/tika-emitters/pom.xml * (edit) tika-parent/pom.xml * (edit) tika-pipes/tika-emitters/tika-emitter-s3/pom.xml * (edit) tika-eval/pom.xml * (edit) CHANGES.txt * (edit) tika-pipes/tika-pipes-iterators/tika-pipes-iterator-s3/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-miscoffice-module/pom.xml * (edit) tika-langdetect/tika-langdetect-test-commons/pom.xml * (edit) tika-pipes/tika-fetchers/tika-fetcher-gcs/pom.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-news-module/pom.xml * (edit) tika-parsers/tika-parsers-ml/tika-dl/pom.xml * (edit) tika-bundles/tika-bundle-standard/pom.xml * (edit) tika-pipes/tika-emitters/tika-emitter-opensearch/pom.xml * (edit) tika-parsers/tika-parsers-extended/tika-parser-scientific-package/pom.xml * (edit) tika-langdetect/tika-langdetect-mitll-text/pom.xml * (edit) tika-pipes/tika-emitters/tika-emitter-solr/pom.xml * (
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457482#comment-17457482 ] Tim Allison commented on TIKA-3164: --- [~bobpaulin], I broke the bundle. Please help if you can. POI now requires log4j2. How do we handle that in the bundle? > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457427#comment-17457427 ] Hudson commented on TIKA-3164: -- UNSTABLE: Integrated in Jenkins build Tika » tika-main-jdk8 #379 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/379/]) TIKA-3164 -- upgrade to POI 5.1.0 (#462) (github: [https://github.com/apache/tika/commit/22261ab09b2809847da87f24252dad2dfde81978]) * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/MetadataExtractor.java * (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/TikaXSSFSheetXMLHandler.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OOXMLExtractorFactory.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java * (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/OPCPackageWrapper.java * (edit) CHANGES.txt * (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/log4j2.xml * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/xps/XPSTextExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSLFPowerPointExtractorDecorator.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/OutlookExtractor.java * (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java * (edit) tika-parent/pom.xml TIKA-3164 update POI to 5.1.0 -- fix convergence checks (tallison: [https://github.com/apache/tika/commit/dbc680f500b83621b06deb7bb7aa23f9bda39efa]) * (edit) tika-parent/pom.xml > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457365#comment-17457365 ] Tim Allison commented on TIKA-3164: --- Many thanks to [~kiwiwings], [~pj.fanning] and the POI team for their help in this upgrade! > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Assignee: Tim Allison >Priority: Major > Fix For: 2.1.1 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457364#comment-17457364 ] ASF GitHub Bot commented on TIKA-3164: -- tballison merged pull request #462: URL: https://github.com/apache/tika/pull/462 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457362#comment-17457362 ] ASF GitHub Bot commented on TIKA-3164: -- tballison opened a new pull request #462: URL: https://github.com/apache/tika/pull/462 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@tika.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457357#comment-17457357 ] Tim Allison commented on TIKA-3164: --- https://bz.apache.org/bugzilla/show_bug.cgi?id=65326 ? Y, I agree that logging for security vulnerabilities is important which is why in the proposal above, I carved out info level reporting from the XMLHelper. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457355#comment-17457355 ] PJ Fanning commented on TIKA-3164: -- [~tallison] I added a comment on https://bz.apache.org/bugzilla/show_bug.cgi?id=65683 > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457352#comment-17457352 ] Tim Allison commented on TIKA-3164: --- I reran the large scale regression tests after making a local patch for 65676, and everything looks to be in good shape. I also ran a hefty set of multithreading tests and found no problems. I'll merge this into main shortly. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17457332#comment-17457332 ] Tim Allison commented on TIKA-3164: --- I'm really grateful that POI has moved to log4j2 (today's news notwithstanding)... The amount of new, effective logging is several orders of magnitude larger than 4.x. I had 36MB of logs with 4.x on ~400k MSOffice files, and my log for 5.x will probably be around 5GB once the run is complete. I'm wondering if we should configure default logging in tika-app and tika-server to turn off POI's logging or if we should add massive warnings in the release notes? Something like this that would allow XMLHelper's warning? [~pj.fanning] and fellow Tika devs, what do you think? {noformat} {noformat} > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456772#comment-17456772 ] PJ Fanning commented on TIKA-3164: -- [~tallison] we try to set as many settings as possible to prevent the XML parser or transformer from being susceptible to XXE issues and if a user's JAXP setup loads implementations that are less safe then they could be susceptible to XXE. From what I have seen, it will be unpopular for us to force uptake of particular parser and transformer implementations. These days, xerces is not regularly released and the forks of xerces that are built into the Java runtime probably are safer. You could say the same for xalan. On the transformer side, you have saxon as an alternative. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
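The "set as many settings as possible" approach described in the comment above can be sketched roughly as follows. This is only an illustration, not POI's XMLHelper code: the class and method names (HardenedSaxSketch, newHardenedParser, trySetFeature, trySetProperty, rejectsDoctype) are hypothetical, and only standard JAXP/SAX feature and property identifiers are used. Each setting is attempted individually, so a JAXP implementation that does not recognize one of them (as with the accessExternalSchema message quoted in this thread) does not prevent the others from being applied.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: defensive SAX parser setup that tolerates unsupported settings.
final class HardenedSaxSketch {

    /** Builds a SAXParser with every XXE-related protection the loaded implementation accepts. */
    static SAXParser newHardenedParser() {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            trySetFeature(factory, XMLConstants.FEATURE_SECURE_PROCESSING, true);
            trySetFeature(factory, "http://apache.org/xml/features/disallow-doctype-decl", true);
            trySetFeature(factory, "http://xml.org/sax/features/external-general-entities", false);
            trySetFeature(factory, "http://xml.org/sax/features/external-parameter-entities", false);
            SAXParser parser = factory.newSAXParser();
            // JAXP 1.5 properties; older or third-party parsers may not recognize them.
            trySetProperty(parser, XMLConstants.ACCESS_EXTERNAL_DTD, "");
            trySetProperty(parser, XMLConstants.ACCESS_EXTERNAL_SCHEMA, "");
            return parser;
        } catch (Exception e) {
            throw new IllegalStateException("Could not create SAX parser", e);
        }
    }

    /** Returns true if the feature was accepted, false if the implementation rejected it. */
    static boolean trySetFeature(SAXParserFactory factory, String name, boolean value) {
        try {
            factory.setFeature(name, value);
            return true;
        } catch (Exception e) {
            return false; // a real caller would log this once, not on every parse
        }
    }

    /** Returns true if the property was accepted, false if the implementation rejected it. */
    static boolean trySetProperty(SAXParser parser, String name, Object value) {
        try {
            parser.setProperty(name, value);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    /** Parses the given XML and reports whether the hardened parser refused it. */
    static boolean rejectsDoctype(String xml) {
        try {
            newHardenedParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return false;
        } catch (SAXException e) {
            return true;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xxe = "<!DOCTYPE a [<!ENTITY x SYSTEM \"file:///nonexistent\">]><a>&x;</a>";
        System.out.println("DOCTYPE rejected: " + rejectsDoctype(xxe));
    }
}
```

Whether a given setting is accepted depends on which parser implementation JAXP loads, which is the consistency concern raised in this thread.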
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456760#comment-17456760 ] Tim Allison commented on TIKA-3164: --- Y. 5.1.0 > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456755#comment-17456755 ] Tim Allison commented on TIKA-3164: --- All sounds good. Thank you, [~pj.fanning]. I’ll open issues and share files when I’m back to a keyboard. In the external schema issue, two questions: 1) can we force xerces or, frankly, any specific implementation? We had issues before where users had different default xml parsing than I did and debugging was a pain, and we couldn’t guarantee consistency across platforms. 2) I see in the comments in the issue that the logging is benign. I wanted to confirm that we are not vulnerable to xxe. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456741#comment-17456741 ] PJ Fanning commented on TIKA-3164: -- POI 4 and below had custom logging that took effort to enable, so most users never saw POI logging. With POI 5, logging events that no one ever saw are now being seen. I guess we need to start looking at the more annoying log messages and downgrading them to info or even debug. For 1 and 2, could you create POI issues and attach files that cause the logging? For the accessExternalSchema issue, we have [https://bz.apache.org/bugzilla/show_bug.cgi?id=65326]. I've been reluctant to entirely remove this logging, but if you could add the full stacktrace there, we can look into it. You are using POI 5.1.0? > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456737#comment-17456737 ] Tim Allison commented on TIKA-3164: --- Y. Thank you [~pj.fanning]! That's exactly it. I can fix it on the Tika side for now by copy/pasting XSSFSheetXMLHandler. Three other points of interest: 1) I'm getting this on quite a few files in our regression set. Warnings are great, but is something else going on? org.apache.poi.hpsf.CodePageString String terminator (\0) for CodePageString property value occurred before the end of string. Trimming and hope for the best. 2) I'm getting a lot of these warnings. Should we be checking if an entry is a directory before adding them to the parts list: org.apache.poi.openxml4j.exceptions.InvalidFormatException: A part name shall not have a forward slash as the last character [M1.5]: /word/_rels/ 3) How can I avoid this and make sure that we are not vulnerable to xxe? org.apache.poi.util.XMLHelper SAX Feature unsupported [log suppressed for 5 minutes]http://javax.xml.XMLConstants/property/accessExternalSchema java.lang.IllegalArgumentException: Property 'http://javax.xml.XMLConstants/property/accessExternalSchema' is not recognized. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
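Point 2 in the comment above (checking whether an entry is a directory before adding it to the parts list) can be illustrated with plain java.util.zip. This is a minimal sketch under stated assumptions, not POI's openxml4j code: PartNameFilterSketch, partNames and sampleZip are hypothetical names, and the sample archive is built in memory with a directory entry named word/_rels/ like the one in the warning.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical sketch: skip zip directory entries when collecting OPC-style part names.
final class PartNameFilterSketch {

    /** Returns "/"-prefixed names of all non-directory entries in the zip. */
    static List<String> partNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue; // names ending in "/" would be invalid part names
                }
                names.add("/" + entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny in-memory zip containing one directory entry and one file entry. */
    static byte[] sampleZip() {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(bos)) {
            zout.putNextEntry(new ZipEntry("word/_rels/"));
            zout.closeEntry();
            zout.putNextEntry(new ZipEntry("word/_rels/document.xml.rels"));
            zout.write("<Relationships/>".getBytes(StandardCharsets.UTF_8));
            zout.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(partNames(sampleZip()));
    }
}
```

Whether silently skipping such entries is the right behavior for POI is a separate question for the POI issue tracker; the sketch only shows the check itself.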
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456722#comment-17456722 ] PJ Fanning commented on TIKA-3164: -- [~tallison] could [https://bz.apache.org/bugzilla/show_bug.cgi?id=65676] be the same issue that you are running into with numbers in last column being merged with 1st column on next row? > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456718#comment-17456718 ] PJ Fanning commented on TIKA-3164: -- [~tallison] I committed this to POI svn just now - [https://github.com/apache/poi/commit/077815b37b9d325e2cf576c64ec5dd6a6f77fff4] poi-ooxml-lite is only a subset of all the poi-ooxml-full and most of the stuff that is added is based on what classes and xsbs are loaded while we run our tests - but we have hacks to add some missing xsbs and classes. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17456688#comment-17456688 ] Tim Allison commented on TIKA-3164: --- I finally had time to run the regression tests against ~400k files. The reports are here: https://corpora.tika.apache.org/base/share/reports-poi-5.x.tgz There are ~20 fixed exceptions. Two files have this new exception: {noformat} Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/ctcustomxmlblockd3c1type.xsb {noformat} There's a very small regression in that in a handful of xlsx files, if there's a number in the last column of a row, it is not cleared before the content in the first cell of the next row. So we get: {noformat} ...1.5 1.5kultur... {noformat} where we previously got: {noformat} ...1.5 kultur... {noformat} I'll open an issue with POI and see if I can patch this at the Tika level for now. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440733#comment-17440733 ] Tim Allison commented on TIKA-3164: --- Thank you [~pj.fanning]! I've started a new TIKA-3164 branch based on {{main}} to give this a try. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438135#comment-17438135 ] PJ Fanning commented on TIKA-3164: -- POI 5.1.0 is released - maybe Tika will be able to use this version without running into the problems with POI 5.0.0. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17371382#comment-17371382 ] Tim Allison commented on TIKA-3164: --- I hand-checked some of the content diffs in {{spreadsheetml}} with a recent build of POI. Tika is now extracting the previously missed content when single-threaded. My guess is that the fix above to {{setAllThreadsPreferEventExtractors}} actually fixed the issue, but I'll rerun in batch mode after we cut the 1.27 rc1. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340873#comment-17340873 ] Tim Allison commented on TIKA-3164: --- NPE in wmf: https://bz.apache.org/bugzilla/show_bug.cgi?id=65293 > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340862#comment-17340862 ] Tim Allison commented on TIKA-3164: --- There was a multithreading, erm, feature in the Tika code that led to all the missing attachments... we have to call {{POIXMLExtractorFactory.setAllThreadsPreferEventExtractors(true);}} not {{POIXMLExtractorFactory.setThreadPrefersEventExtractors(true);}}. Will rerun shortly. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
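The difference between the two calls named above can be modelled with a small self-contained sketch. It assumes, based on the method names, that the per-thread setting is only visible to the thread that set it while the all-threads setting applies everywhere; it is not POI's implementation, and ExtractorPreferenceSketch and all of its members are hypothetical. A value set on the configuring thread is not seen by a worker thread, which is consistent with the missing attachments in a multithreaded run.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of a per-thread preference versus a global override.
final class ExtractorPreferenceSketch {

    // Visible only to the thread that set it.
    private static final ThreadLocal<Boolean> THREAD_PREFERS =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // When non-null, overrides the per-thread value for every thread.
    private static volatile Boolean allThreadsPrefer = null;

    static void setThreadPrefers(boolean value) {
        THREAD_PREFERS.set(value);
    }

    static void setAllThreadsPrefer(Boolean value) {
        allThreadsPrefer = value;
    }

    static boolean prefersEventExtractors() {
        Boolean global = allThreadsPrefer;
        return global != null ? global : THREAD_PREFERS.get();
    }

    /** Reads the effective preference from a freshly started worker thread. */
    static boolean seenFromNewThread() {
        AtomicBoolean seen = new AtomicBoolean();
        Thread worker = new Thread(() -> seen.set(prefersEventExtractors()));
        worker.start();
        try {
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        setThreadPrefers(true);
        System.out.println("worker sees per-thread setting: " + seenFromNewThread());
        setAllThreadsPrefer(true);
        System.out.println("worker sees global setting: " + seenFromNewThread());
    }
}
```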
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340249#comment-17340249 ] Tim Allison commented on TIKA-3164: --- Reports are here: https://corpora.tika.apache.org/base/reports/poi-5.0.1-snapshot-reports.tgz These compare the latest 4.x vs. 5.0.1-snapshot. There's a new NPE in WMF parsing, and it looks like we're missing a bunch of attachments. I also need to look into why there's less content coming out of application/vnd.openxmlformats-officedocument.spreadsheetml.sheet ... this could be a Tika item, not POI... Parse times seem to be slower for ooxml than in 4.x, but that could be an artifact of the mood of the vm at the time of running... > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17339749#comment-17339749 ] Tim Allison commented on TIKA-3164: --- [~kiwiwings], I got the build to work with the latest. I'm sorry for my delay. I'm running the regression tests against MSOffice files now... Thank you! > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Task >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325401#comment-17325401 ] Tim Allison commented on TIKA-3164: --- Thank you [~kiwiwings] ! > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17325386#comment-17325386 ] Andreas Beeker commented on TIKA-3164: -- Added more .xsbs and classes - the old POI integration test code only processed every 2nd hierarchy ... m( [http://svn.apache.org/viewvc?view=revision&revision=1888985] [~tallison] Please give it a try when the code has been tested by Jenkins and the tests are green > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307829#comment-17307829 ] Tim Allison commented on TIKA-3164: --- Two files that cause this complaint: testPDFEmbeddingAndEmbedded.docx and test_recursive_embedded.docx. Both here: https://github.com/apache/tika/tree/branch_1x/tika-parsers/src/test/resources/test-documents > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307826#comment-17307826 ] Tim Allison commented on TIKA-3164: --- Progress! New missing xsb: {noformat} XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/stoletype716btype.xsb (org.apache.poi.schemas.ooxml.system.ooxml.stoletype716btype) {noformat} > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307472#comment-17307472 ] Andreas Beeker commented on TIKA-3164: -- I'm now recursing through .xlsx and .docx in our integration tests [1]. Please regenerate the lite jar via "ant clean test test-integration test-ooxml-lite" in POI and try again in TIKA. [1] http://svn.apache.org/viewvc?view=revision&revision=1887978 > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307139#comment-17307139 ] Tim Allison commented on TIKA-3164: --- Yep, that's exactly what's going on. I found that if I uncomment {{ }} in the build, the necessary files are included. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307135#comment-17307135 ] PJ Fanning commented on TIKA-3164: -- [~tallison] I don't know for certain, but we have 2 jars with the XMLBeans-generated classes (for ooxml schemas), renamed in POI 5 to poi-ooxml-lite and poi-ooxml-full. I suspect that some of what you need is missing from poi-ooxml-lite, so it might be worth checking poi-ooxml-full. If we know what's missing in poi-ooxml-lite, we can see about fixing the POI build to include the missing bits. > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17307099#comment-17307099 ] Tim Allison commented on TIKA-3164: --- [~fanningpj], many thanks for your help on this. I'm now getting a clean build on 5.0.1-SNAPSHOT. With the Tika integration, though, I'm still getting the following exception on several unit tests. When I look inside the {{ooxml-lite}} jar for both 5.0.0 and 5.0.1-SNAPSHOT (even after I add Tika's {{EmbeddedDocument.docx}}), I see {{org/apache/poi/schemas/ooxml/system/oleobjelement.xsb}} but not {{/oleobjectelement.xsb}}. Any idea how to fix this?
{noformat}
Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/oleobjectelement.xsb (org.apache.poi.schemas.ooxml.system.ooxml.oleobjectelement) - code 0
    at org.apache.xmlbeans.impl.schema.SchemaTypeSystemImpl$XsbReader.<init>(SchemaTypeSystemImpl.java:1315)
    at org.apache.xmlbeans.impl.schema.SchemaTypeSystemImpl.resolveHandle(SchemaTypeSystemImpl.java:3138)
    at org.apache.xmlbeans.SchemaComponent$Ref.getComponent(SchemaComponent.java:113)
    at org.apache.xmlbeans.SchemaGlobalElement$Ref.get(SchemaGlobalElement.java:76)
    at org.apache.xmlbeans.impl.schema.SchemaTypeLoaderBase.findElement(SchemaTypeLoaderBase.java:103)
    at org.apache.xmlbeans.impl.schema.SchemaTypeImpl.createElementType(SchemaTypeImpl.java:988)
    at org.apache.xmlbeans.impl.values.XmlObjectBase.create_element_user(XmlObjectBase.java:913)
    at org.apache.xmlbeans.impl.store.Xobj.getUser(Xobj.java:1597)
    at org.apache.xmlbeans.impl.store.Cur.getUser(Cur.java:2571)
    at org.apache.xmlbeans.impl.store.Cur.getObject(Cur.java:2565)
    at org.apache.xmlbeans.impl.store.Cursor._getObject(Cursor.java:819)
    at org.apache.xmlbeans.impl.store.Cursor.syncWrapHelper(Cursor.java:2522)
    at org.apache.xmlbeans.impl.store.Cursor.syncWrap(Cursor.java:2453)
    at org.apache.xmlbeans.impl.store.Cursor.getObject(Cursor.java:2080)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractParagraph(XWPFWordExtractorDecorator.java:236)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.extractIBodyText(XWPFWordExtractorDecorator.java:161)
    at org.apache.tika.parser.microsoft.ooxml.XWPFWordExtractorDecorator.buildXHTML(XWPFWordExtractorDecorator.java:124)
    at org.apache.tika.parser.microsoft.ooxml.AbstractOOXMLExtractor.getXHTML(AbstractOOXMLExtractor.java:136)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:214)
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:113)
{noformat}
> Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
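A quick way to confirm whether a compiled schema resource such as the one named in the exception above is actually on the classpath is to ask the class loader for it directly. The following is a small diagnostic sketch, not part of Tika or POI; XsbResourceCheckSketch and onClasspath are hypothetical names, and the resource path is copied from the exception message.

```java
// Hypothetical diagnostic: check whether a resource path is visible to the class loader.
final class XsbResourceCheckSketch {

    /** Returns true if the given resource path can be found on the classpath. */
    static boolean onClasspath(String resourcePath) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourcePath) != null;
    }

    public static void main(String[] args) {
        String xsb = "org/apache/poi/schemas/ooxml/system/ooxml/oleobjectelement.xsb";
        System.out.println(xsb + " present: " + onClasspath(xsb));
    }
}
```

Running this with the lite jar and then the full jar on the classpath would show which of the two carries the resource.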
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289198#comment-17289198 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784360595 Please don't waste time on this... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289196#comment-17289196 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784360102 Y, see my branch. We have to do a coupla handfuls of stuff on the tika side. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289086#comment-17289086 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784212148 I made a mistake while building this. Looks like there are a few issues with the upgrade. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289085#comment-17289085 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning closed pull request #404: URL: https://github.com/apache/tika/pull/404 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Upgrade to POI 5.0.0 when available > --- > > Key: TIKA-3164 > URL: https://issues.apache.org/jira/browse/TIKA-3164 > Project: Tika > Issue Type: Bug >Reporter: Tim Allison >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289060#comment-17289060 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784174494 K. jdk 8 _should_ work, right? I'll ping the dev list. Thank you!
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289059#comment-17289059 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784173672 I'm not getting that issue - I'm using zulu jdk 11.0.7 and ant 1.10.8
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289053#comment-17289053 ] ASF GitHub Bot commented on TIKA-3164: -- tballison edited a comment on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784167038
Clean checkout. `ant compile` appears to work. `ant test` fails with:
```
[echo] Using Ant: Apache Ant(TM) version 1.10.9 compiled on September 27 2020 from /apache/apache-ant-1.10.9, Ant detected Java 1.8 (may be different than actual Java sometimes...)
[echo] Using Java: 1.8.0_282/1.8.0_282-b08/25.282-b08/OpenJDK 64-Bit Server VM from AdoptOpenJDK on Linux: 5.8.0-43-generic
[echo] Building Apache POI version 5.0.1-SNAPSHOT and RC: RC1

test-main:
    [javac] Compiling 1 source file to /home/tallison/Intellij/poi-trunk/build/poi-ant-contrib

-test-main-write-testfile:

-test-scratchpad-check:

test-scratchpad-download-resources:

test-scratchpad:

-test-scratchpad-write-testfile:

-test-ooxml-check:

test-ooxml:

-test-ooxml-write-testfile:

compile-ooxml-lite:
    [echo] Create ooxml-lite schemas

BUILD FAILED
/home/tallison/Intellij/poi-trunk/build.xml:1812: /home/tallison/Intellij/poi-trunk/build/ooxml-lite-report.clazz doesn't exist
```
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289047#comment-17289047 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784155288 I'm happy to do so for testing, but I'm hesitant to add even more to tika. The point of 2.x is to modularize and make dependencies smaller. I wouldn't rule it out, necessarily... Any recs on the above build failure? Thank you!
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289048#comment-17289048 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784156188 I'd suggest a clean checkout - there could be some stuff hanging around that `ant clean` is not removing
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289041#comment-17289041 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning edited a comment on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784152065 @tballison would it be worth just using ooxml-schemas-full on tika? - tika is big so the benefit of ooxml-schemas-lite is lower
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289020#comment-17289020 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784129952 I'm sure the above is user error. I've been away from POI for too long...argh...
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289019#comment-17289019 ] ASF GitHub Bot commented on TIKA-3164: -- tballison edited a comment on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784128134
```
openjdk version "1.8.0_282"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_282-b08)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.282-b08, mixed mode)
```
Ubuntu
Uninstalled old ant
Installed new ant
```
ant -f fetch.xml -Ddest=system
```
```
echo ANT_HOME
/apache/apache-ant-1.10.9
```
```
ant -v
Apache Ant(TM) version 1.10.9 compiled on September 27 2020
```
ant clean test
```
BUILD FAILED
BUILD FAILED
/home/tallison/Intellij/poi-trunk/build.xml:1812: /home/tallison/Intellij/poi-trunk/build/ooxml-lite-report.clazz doesn't exist

Total time: 2 minutes 58 seconds
```
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289013#comment-17289013 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784117840 the gradle build depends quite a bit on the ant one - I would suggest getting ant build working and then the gradle build will probably start working
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289010#comment-17289010 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784115989
Fresh checkout... ./gradlew build
Results:
```
> Task :ooxml:compileJava
/home/tallison/Intellij/poi-trunk/src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFCell.java:564: error: cannot access DocumentFactory
    f = CTCellFormula.Factory.newInstance();
    ^
  class file for org.apache.xmlbeans.impl.schema.DocumentFactory not found
/home/tallison/Intellij/poi-trunk/src/ooxml/java/org/apache/poi/xssf/usermodel/XSSFColor.java:117: error: recursive constructor invocation
    public XSSFColor(byte[] rgb, IndexedColorMap colorMap) {
    ^
/home/tallison/Intellij/poi-trunk/src/ooxml/java/org/apache/poi/xddf/usermodel/XDDFLineProperties.java:42: error: recursive constructor invocation
    public XDDFLineProperties(XDDFFillProperties fill) {
    ^
/home/tallison/Intellij/poi-trunk/src/ooxml/java/org/apache/poi/xddf/usermodel/text/XDDFHyperlink.java:29: error: recursive constructor invocation
    public XDDFHyperlink(String id) {
```
I'm guessing I need to run ant first to pull in the dependencies?
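Editor's note: the first javac error in the comment above says the class file for `org.apache.xmlbeans.impl.schema.DocumentFactory` was not found. One plausible reading, which is an assumption and is not confirmed anywhere in the thread, is that the xmlbeans jar on the compile classpath does not contain that class. A minimal sketch for checking this from the same classpath follows; the helper class `XmlBeansClassCheck` is hypothetical and not part of POI or Tika, and only the class name is taken from the error message.

```java
// Hypothetical helper (not part of POI or Tika): reports whether a class
// can be loaded from the current classpath without initializing it.
final class XmlBeansClassCheck {

    // Class name copied verbatim from the javac error quoted above.
    static final String DOCUMENT_FACTORY =
            "org.apache.xmlbeans.impl.schema.DocumentFactory";

    static boolean isPresent(String className) {
        try {
            // initialize=false: we only want to know whether the class file is reachable.
            Class.forName(className, false, XmlBeansClassCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(DOCUMENT_FACTORY + " present: " + isPresent(DOCUMENT_FACTORY));
    }
}
```

Running it with the build's dependency jars on the classpath would show whether the class is reachable; it says nothing about why it might be missing.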
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289009#comment-17289009 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784114343
there is the ooxml-schemas-full jar for cases where ooxml-schemas-lite is missing stuff
I thought all the xsb stuff was in ooxml-schemas-lite jar
definitely worth adding a test case to poi code base
POI 6.0.0 is probably going to be next release and it could be a couple of months (fairly big logging changes just merged and probably an uptake of a refactored xmlbeans jar)
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289006#comment-17289006 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784110289 I'm adding "test_recursive_embedded.docx" to a unit test in POI locally to see if I can get it to add the oleobjectelement.xsb in schemas-lite.
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289001#comment-17289001 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784102052 See: https://github.com/apache/tika/tree/TIKA-3164-1.x
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289000#comment-17289000 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784099481
Aside from including the full schemas jar, is there a solution to this:
```
[ERROR] Tests run: 13, Failures: 0, Errors: 12, Skipped: 0, Time elapsed: 0.548 s <<< FAILURE! - in org.apache.tika.parser.RecursiveParserWrapperTest
[ERROR] org.apache.tika.parser.RecursiveParserWrapperTest.testMaxEmbedded Time elapsed: 0.16 s <<< ERROR!
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.ooxml.OOXMLParser@626c569b
	at org.apache.tika.parser.RecursiveParserWrapperTest.testMaxEmbedded(RecursiveParserWrapperTest.java:191)
Caused by: org.apache.xmlbeans.SchemaTypeLoaderException: XML-BEANS compiled schema: Could not locate compiled schema resource org/apache/poi/schemas/ooxml/system/ooxml/oleobjectelement.xsb (org.apache.poi.schemas.ooxml.system.ooxml.oleobjectelement) - code 0
	at org.apache.tika.parser.RecursiveParserWrapperTest.testMaxEmbedded(RecursiveParserWrapperTest.java:191)
```
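Editor's note: the `SchemaTypeLoaderException` in the comment above names a classpath resource that could not be located. A rough way to confirm whether a given classpath (for example, one containing only the lite schemas jar) provides that `.xsb` file is to ask the class loader for it directly. The sketch below is illustrative only: the class `XsbResourceCheck` is a hypothetical helper, not part of POI, XMLBeans, or Tika, and the resource path is copied from the exception message.

```java
import java.net.URL;

// Hypothetical helper (not part of POI, XMLBeans, or Tika): looks up a
// compiled-schema resource on the classpath and reports where it was found.
final class XsbResourceCheck {

    // Resource path copied verbatim from the SchemaTypeLoaderException above.
    static final String OLE_XSB =
            "org/apache/poi/schemas/ooxml/system/ooxml/oleobjectelement.xsb";

    /** Returns the URL the resource resolves to, or null if it is not on the classpath. */
    static URL locate(String resourcePath) {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        return cl.getResource(resourcePath);
    }

    public static void main(String[] args) {
        URL url = locate(OLE_XSB);
        System.out.println(url == null ? "MISSING: " + OLE_XSB : "FOUND: " + url);
    }
}
```

When the resource is found, the printed URL also shows which jar supplies it, which may help when comparing the lite and full schema jars discussed in this thread.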
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17288994#comment-17288994 ] ASF GitHub Bot commented on TIKA-3164: -- tballison commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-784055893 Sorry for my delay! Not clear on why that test failed for you. Let me take a look. Working on this today.
[jira] [Commented] (TIKA-3164) Upgrade to POI 5.0.0 when available
[ https://issues.apache.org/jira/browse/TIKA-3164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17288661#comment-17288661 ] ASF GitHub Bot commented on TIKA-3164: -- pjfanning commented on pull request #404: URL: https://github.com/apache/tika/pull/404#issuecomment-783721207
@tballison I tried this on my laptop - the tika-parser microsoft tests passed but job later failed with
```
[ERROR] Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 13.181 s <<< FAILURE! - in org.apache.tika.parser.gdal.TestGDALParser
[ERROR] org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo Time elapsed: 12.795 s <<< FAILURE!
java.lang.AssertionError
	at org.apache.tika.parser.gdal.TestGDALParser.testParseBasicInfo(TestGDALParser.java:82)
```