[jira] [Commented] (TIKA-3690) upgrade to poi 5.2.1
[ https://issues.apache.org/jira/browse/TIKA-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17504983#comment-17504983 ] Andreas Hubold commented on TIKA-3690: -- FYI, I got an OutOfMemoryError when I tried the update to POI 5.2.1. I've asked about it on the POI mailing list, which may also be of interest to you: https://lists.apache.org/thread/fmb746gypgfpj8k0lmcvtn89zppwb95p > upgrade to poi 5.2.1 > > > Key: TIKA-3690 > URL: https://issues.apache.org/jira/browse/TIKA-3690 > Project: Tika > Issue Type: Improvement > Components: parser >Reporter: PJ Fanning >Priority: Major > Fix For: 2.3.1, 1.28.2 > > > There is a POI CVE that may or may not affect Tika - > https://lists.apache.org/thread/hqc0ohg0z1j0p4ysm3y4ct6g2d8sjc2b > Generally, probably a good idea to upgrade anyway -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config
[ https://issues.apache.org/jira/browse/TIKA-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17429131#comment-17429131 ]

Andreas Hubold commented on TIKA-3575:
--

Thanks [~tallison], I'd suggest
* either changing the default for loadErrorHandler in TikaConfig#serviceLoaderFromDomElement back to IGNORE (this would be my preferred choice, and a very simple change),
* or keeping the default at THROW but extending #serviceLoaderFromDomElement to check for a value of "ignore" in the attribute and respect it.

And if the default is THROW now, it should also be the default if no service-loader element is specified; otherwise it feels inconsistent and could surprise users. If you search for org.apache.tika.config.LoadErrorHandler#IGNORE, you can see that it's still the default in some places.

{quote}The goal was to allow finer-grained module selection so that you'd never have load errors that you'd want to ignore. {quote}

I really like the separation into modules in Tika 2.x. That's a great improvement!

Our use case for LoadErrorHandler#IGNORE: It can still be useful to include a module but exclude some of its parsers/dependencies. For example, we're using tika-parser-code-module but don't need the Matlab parser and SAS7BDATParser, so we want to exclude the parso and jmatio dependencies to reduce the number of dependencies. It's a nice feature that this disables the parsers without any additional configuration in the tika config (and our downstream users can simply add the dependencies to enable the parsers without touching the configuration).

I think it's a good idea to bundle different parsers into logical modules, like the different code parsers in tika-parser-code-module. But sometimes that may not be fine-grained enough, and that's where LoadErrorHandler#IGNORE plays a nice role, IMHO.
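The second option suggested above could look roughly like the following. This is only a sketch of the attribute handling, not the actual TikaConfig code: the class LoadErrorHandlerAttributeSketch, the nested HandlerKind enum and the parse method are hypothetical stand-ins for org.apache.tika.config.LoadErrorHandler and #serviceLoaderFromDomElement, and falling back to the default for unknown values is an assumption.

```java
import java.util.Locale;

// Hypothetical sketch; none of these names exist in Tika.
class LoadErrorHandlerAttributeSketch {

    // Stand-in for the three handlers discussed in this issue.
    enum HandlerKind { IGNORE, WARN, THROW }

    // Maps the value of the loadErrorHandler attribute to a handler kind.
    // A missing or unrecognised value falls back to the supplied default
    // (an assumption made for this sketch).
    static HandlerKind parse(String attributeValue, HandlerKind defaultKind) {
        if (attributeValue == null) {
            return defaultKind;
        }
        switch (attributeValue.trim().toLowerCase(Locale.ROOT)) {
            case "ignore":
                return HandlerKind.IGNORE;
            case "warn":
                return HandlerKind.WARN;
            case "throw":
                return HandlerKind.THROW;
            default:
                return defaultKind;
        }
    }

    public static void main(String[] args) {
        // "ignore" is respected even when the default is THROW.
        System.out.println(parse("ignore", HandlerKind.THROW));
        // No attribute at all: the default applies.
        System.out.println(parse(null, HandlerKind.THROW));
    }
}
```

Passing the default in as a parameter keeps the "which default?" question separate from the "which values are accepted?" question, which are the two points raised in the comment.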
> Cannot use loadErrorHandler="ignore" in tika config > --- > > Key: TIKA-3575 > URL: https://issues.apache.org/jira/browse/TIKA-3575 > Project: Tika > Issue Type: Bug > Components: config >Affects Versions: 2.0.0, 2.1.0 >Reporter: Andreas Hubold >Priority: Major > Labels: regression > > Tika 2.0.0 changed the default error handler to throw exceptions, and does > not ignore errors when loading parsers anymore as it was the case with Tika > 1.x. > See > [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470] > There's no configuration option to restore the previous behavior. It should > be possible to set > {code} > > {code} > but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement > only considers "warn" and "throw" as possible values. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config
[ https://issues.apache.org/jira/browse/TIKA-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andreas Hubold updated TIKA-3575: - Description: Tika 2.0.0 changed the default error handler to throw exceptions, and does not ignore errors when loading parsers anymore as it was the case with Tika 1.x. See [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470] There's no configuration option to restore the previous behavior. It should be possible to set {code} {code} but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement only considers "warn" and "throw" as possible values. was: Tika 2.0.0 changed the default error handler to throw exceptions, and does not ignore errors when loading parsers anymore as it was the case with Tika 1.x. See [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470)] There's no configuration option to restore the previous behavior. It should be possible to set {code} {code} but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement only considers "warn" and "throw" as possible values. 
[jira] [Commented] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config
[ https://issues.apache.org/jira/browse/TIKA-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17428718#comment-17428718 ] Andreas Hubold commented on TIKA-3575: -- After looking into this further, I saw that LoadErrorHandler.THROW is only used as the default if a `<service-loader>` element is specified. Otherwise, the default is still IGNORE. So maybe the default should just be changed back to IGNORE. BTW, I ran into this with the following declaration {code:java} {code} But it seems I can simply remove the whole service-loader element to avoid the problem. IIUC, the InitializableProblemHandler isn't called by any predefined class anymore anyway. I had this declaration to avoid warnings from the PDFParser in previous Tika versions, but that's not necessary anymore with Tika 2.x. > Cannot use loadErrorHandler="ignore" in tika config > --- > > Key: TIKA-3575 > URL: https://issues.apache.org/jira/browse/TIKA-3575 > Project: Tika > Issue Type: Bug > Components: config >Affects Versions: 2.0.0, 2.1.0 >Reporter: Andreas Hubold >Priority: Major > Labels: regression > > Tika 2.0.0 changed the default error handler to throw exceptions, and does > not ignore errors when loading parsers anymore as it was the case with Tika > 1.x. > See > [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470)] > There's no configuration option to restore the previous behavior. It should > be possible to set > {code} > > {code} > but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement > only considers "warn" and "throw" as possible values. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config
Andreas Hubold created TIKA-3575: Summary: Cannot use loadErrorHandler="ignore" in tika config Key: TIKA-3575 URL: https://issues.apache.org/jira/browse/TIKA-3575 Project: Tika Issue Type: Bug Components: config Affects Versions: 2.1.0, 2.0.0 Reporter: Andreas Hubold Tika 2.0.0 changed the default error handler to throw exceptions, and does not ignore errors when loading parsers anymore as it was the case with Tika 1.x. See [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470)] There's no configuration option to restore the previous behavior. It should be possible to set {code} {code} but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement only considers "warn" and "throw" as possible values. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (TIKA-2802) Out of memory issues when extracting large files (pst)
[ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841008#comment-16841008 ] Andreas Hubold edited comment on TIKA-2802 at 5/16/19 6:40 AM: --- I wonder if the addition of Xerces is still recommended for Java 9+ projects. Since Java 9, the JDK contains bugfixes from Xerces 2.11.0, see https://bugs.openjdk.java.net/browse/JDK-8044086 For Java 13, an update to Xerces 2.12.0 is in progress according to https://bugs.openjdk.java.net/browse/JDK-8214064 Do you know which Xerces issue was causing the problem? was (Author: ahubold): I wonder if the addition of Xerces is still recommended for Java 9+ projects. Since Java 9, the JDK contains bugfixes from Xerces 2.11.0, see lhttps://bugs.openjdk.java.net/browse/JDK-8044086 For Java 13, an update to Xerces 2.12.0 is in progress according to https://bugs.openjdk.java.net/browse/JDK-8214064 Do you know which Xerces issue was causing the problem? > Out of memory issues when extracting large files (pst) > -- > > Key: TIKA-2802 > URL: https://issues.apache.org/jira/browse/TIKA-2802 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.20, 1.19.1 > Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04. > Java: jdk1.8.0_151 > >Reporter: Caleb Ott >Priority: Critical > Attachments: Selection_111.png, Selection_117.png > > > I have an application that extracts text from multiple files on a file share. > I've been running into issues with the application running out of memory > (~26g dedicated to the heap). > I found in the heap dumps there is a "fDTDDecl" buffer which is creating very > large char arrays and never releasing that memory. In the picture you can see > the heap dump with 4 SAXParsers holding onto a large chunk of memory. The > fourth one is expanded to show it is all being held by the "fDTDDecl" field. > This dump is from a scaled down execution (not a 26g heap). 
> It looks like that DTD field should never be that large, I'm wondering if > this is a bug with xerces instead? I can easily reproduce the issue by > attempting to extract text from large .pst files. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)
[ https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16841008#comment-16841008 ] Andreas Hubold commented on TIKA-2802: -- I wonder if the addition of Xerces is still recommended for Java 9+ projects. Since Java 9, the JDK contains bugfixes from Xerces 2.11.0, see https://bugs.openjdk.java.net/browse/JDK-8044086 For Java 13, an update to Xerces 2.12.0 is in progress according to https://bugs.openjdk.java.net/browse/JDK-8214064 Do you know which Xerces issue was causing the problem? > Out of memory issues when extracting large files (pst) > -- > > Key: TIKA-2802 > URL: https://issues.apache.org/jira/browse/TIKA-2802 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.20, 1.19.1 > Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04. > Java: jdk1.8.0_151 > >Reporter: Caleb Ott >Priority: Critical > Attachments: Selection_111.png, Selection_117.png > > > I have an application that extracts text from multiple files on a file share. > I've been running into issues with the application running out of memory > (~26g dedicated to the heap). > I found in the heap dumps there is a "fDTDDecl" buffer which is creating very > large char arrays and never releasing that memory. In the picture you can see > the heap dump with 4 SAXParsers holding onto a large chunk of memory. The > fourth one is expanded to show it is all being held by the "fDTDDecl" field. > This dump is from a scaled down execution (not a 26g heap). > It looks like that DTD field should never be that large, I'm wondering if > this is a bug with xerces instead? I can easily reproduce the issue by > attempting to extract text from large .pst files. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
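To see which SAX implementation an application actually ends up with, something like the following can be run on the application's classpath. It only uses the standard JAXP API; the class name SaxImplementationCheck is made up for this example. As far as I know, the JDK's built-in implementation reports a class under com.sun.org.apache.xerces.internal, while a separately added Xerces jar reports one under org.apache.xerces.

```java
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

// Illustrative helper, not part of Tika or Xerces.
class SaxImplementationCheck {

    // Name of the factory class that JAXP selects on the current classpath.
    static String factoryClassName() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Name of the parser class created by that factory.
    static String parserClassName() throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        return parser.getClass().getName();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("SAXParserFactory: " + factoryClassName());
        System.out.println("SAXParser:        " + parserClassName());
    }
}
```

Running it once with and once without the Xerces dependency shows whether the added jar is picked up at all on a given Java version.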
[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076162#comment-16076162 ] Andreas Hubold commented on TIKA-1367: -- Yes, true. I was referring to [~gus_heck]'s comment, where he wrote that dependencies were completely lost in Maven when upgrading from 1.12 to 1.15. > Tika documentation should list tika-parsers parser dependencies > --- > > Key: TIKA-1367 > URL: https://issues.apache.org/jira/browse/TIKA-1367 > Project: Tika > Issue Type: Improvement > Components: documentation >Reporter: Sergey Beryozkin > Fix For: 1.16 > > > tika-parsers module has many strong transitive parser dependencies. Maven > users of tika-parsers have to exclude all the transitive dependencies > manually. Documenting the list of the existing transitive dependencies and > keeping the list up to date will help developers exclude the libraries not > needed for a given project. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies
[ https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16076137#comment-16076137 ] Andreas Hubold commented on TIKA-1367: -- mvn dependency:tree lists the dependencies of tika-parsers 1.15 for me. They are also correctly listed in the pom available on Maven Central: http://central.maven.org/maven2/org/apache/tika/tika-parsers/1.15/tika-parsers-1.15.pom Maybe you've somehow installed a wrong pom into your local Maven repository? > Tika documentation should list tika-parsers parser dependencies > --- > > Key: TIKA-1367 > URL: https://issues.apache.org/jira/browse/TIKA-1367 > Project: Tika > Issue Type: Improvement > Components: documentation >Reporter: Sergey Beryozkin > Fix For: 1.16 > > > tika-parsers module has many strong transitive parser dependencies. Maven > users of tika-parsers have to exclude all the transitive dependencies > manually. Documenting the list of the existing transitive dependencies and > keeping the list up to date will help developers exclude the libraries not > needed for a given project. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (TIKA-967) Tika comes with transitive Maven dependency to a test artifact of vorbis-java-core
[ https://issues.apache.org/jira/browse/TIKA-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Hubold updated TIKA-967:

Affects Version/s: 1.3

Tika comes with transitive Maven dependency to a test artifact of vorbis-java-core
---
Key: TIKA-967
URL: https://issues.apache.org/jira/browse/TIKA-967
Project: Tika
Issue Type: Bug
Components: packaging
Affects Versions: 1.1, 1.2, 1.3
Reporter: Andreas Hubold
Priority: Minor

Tika 1.2 has the following dependencies:
{noformat}
\- org.apache.tika:tika-parsers:jar:1.2:compile
[INFO]    +- org.apache.tika:tika-core:jar:1.2:compile
[INFO]    +- org.gagravarr:vorbis-java-tika:jar:0.1:compile
[INFO]    |  \- org.gagravarr:vorbis-java-core:jar:tests:0.1:runtime
..
[INFO]    +- org.gagravarr:vorbis-java-core:jar:0.1:compile
{noformat}
The transitive dependency to {{org.gagravarr:vorbis-java-core:jar:tests}} is wrong. It only contains test resources for vorbis (note the Maven classifier 'tests').

It seems this is caused by a bug in the pom.xml of org.gagravarr:vorbis-java-tika. It contains a dependency to vorbis-java-core with classifier {{tests}} and scope {{test,provided}}. This is not a valid scope (you can't enumerate multiple scopes for a Maven dependency).

Tika could work around this by excluding the transitive dependency in the dependency declaration of {{vorbis-java-tika}}.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-967) Tika comes with transitive Maven dependency to a test artifact of vorbis-java-core
[ https://issues.apache.org/jira/browse/TIKA-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427271#comment-13427271 ]

Andreas Hubold commented on TIKA-967:
-

I've created a pull request here: https://github.com/Gagravarr/VorbisJava/pull/1

Tika comes with transitive Maven dependency to a test artifact of vorbis-java-core
---
Key: TIKA-967
URL: https://issues.apache.org/jira/browse/TIKA-967
Project: Tika
Issue Type: Bug
Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Andreas Hubold
Priority: Minor

Tika 1.2 has the following dependencies:
{noformat}
\- org.apache.tika:tika-parsers:jar:1.2:compile
[INFO]    +- org.apache.tika:tika-core:jar:1.2:compile
[INFO]    +- org.gagravarr:vorbis-java-tika:jar:0.1:compile
[INFO]    |  \- org.gagravarr:vorbis-java-core:jar:tests:0.1:runtime
..
[INFO]    +- org.gagravarr:vorbis-java-core:jar:0.1:compile
{noformat}
The transitive dependency to {{org.gagravarr:vorbis-java-core:jar:tests}} is wrong. It only contains test resources for vorbis (note the Maven classifier 'tests').

It seems this is caused by a bug in the pom.xml of org.gagravarr:vorbis-java-tika. It contains a dependency to vorbis-java-core with classifier {{tests}} and scope {{test,provided}}. This is not a valid scope (you can't enumerate multiple scopes for a Maven dependency).

Tika could work around this by excluding the transitive dependency in the dependency declaration of {{vorbis-java-tika}}.

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira