[jira] [Commented] (TIKA-3690) upgrade to poi 5.2.1

2022-03-11 Thread Andreas Hubold (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504983#comment-17504983
 ] 

Andreas Hubold commented on TIKA-3690:
--

FYI, I got an OutOfMemoryError when I tried the update to POI 5.2.1. I've asked 
on the POI mailing list, maybe that's also interesting for you: 
https://lists.apache.org/thread/fmb746gypgfpj8k0lmcvtn89zppwb95p

> upgrade to poi 5.2.1
> 
>
> Key: TIKA-3690
> URL: https://issues.apache.org/jira/browse/TIKA-3690
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: PJ Fanning
>Priority: Major
> Fix For: 2.3.1, 1.28.2
>
>
> There is a POI CVE that may or may not affect Tika - 
> https://lists.apache.org/thread/hqc0ohg0z1j0p4ysm3y4ct6g2d8sjc2b
> Generally, probably a good idea to upgrade anyway



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config

2021-10-15 Thread Andreas Hubold (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429131#comment-17429131
 ] 

Andreas Hubold commented on TIKA-3575:
--

Thanks [~tallison], I'd suggest one of the following:
 * Change the default for loadErrorHandler in 
TikaConfig#serviceLoaderFromDomElement back to IGNORE. (This would be my 
preferred choice, and it's a very simple change.)
 * Or keep the default at THROW, but extend #serviceLoaderFromDomElement to 
check for a value of "ignore" in the attribute and respect it. And if the 
default is THROW now, it should also be the default if no service-loader 
element is specified; otherwise it feels inconsistent and could surprise 
users. If you search for org.apache.tika.config.LoadErrorHandler#IGNORE, you 
can see that it's still the default in some places.

{quote}The goal was to allow finer-grained module selection so that you'd never 
have load errors that you'd want to ignore.
{quote}
I really like the separation into modules in Tika 2.x. That's a great 
improvement!

Our use case for LoadErrorHandler#IGNORE: It can still be useful to include a 
module but exclude some of its parsers/dependencies. For example, we're using 
tika-parser-code-module but don't need the Matlab parser and SAS7BDATParser, 
so we exclude the parso and jmatio dependencies to reduce the number of 
dependencies. It's a nice feature that this disables the parsers without 
requiring additional configuration in the tika config (and our downstream 
users can simply add dependencies to enable parsers without touching 
configuration).

I think it's a good idea to bundle different parsers into logical modules, like 
the different code parsers in tika-parser-code-module. But sometimes that may 
not be fine-grained enough, and that's where LoadErrorHandler#IGNORE plays a 
nice role, IMHO.
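
A minimal sketch of such an exclusion in a Maven pom, for illustration only. 
The coordinates com.epam:parso and org.tallison:jmatio and the tika.version 
property are assumptions; they should be checked against the output of 
mvn dependency:tree for the Tika version in use.
{code:xml}
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parser-code-module</artifactId>
  <!-- assumed property; set it to the Tika 2.x version in use -->
  <version>${tika.version}</version>
  <exclusions>
    <!-- assumed coordinates of the library behind SAS7BDATParser -->
    <exclusion>
      <groupId>com.epam</groupId>
      <artifactId>parso</artifactId>
    </exclusion>
    <!-- assumed coordinates of the library behind the Matlab parser -->
    <exclusion>
      <groupId>org.tallison</groupId>
      <artifactId>jmatio</artifactId>
    </exclusion>
  </exclusions>
</dependency>
{code}
With these libraries missing from the classpath, the two parsers presumably 
fail to load, which is why this setup depends on load errors being ignored 
rather than thrown.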

> Cannot use loadErrorHandler="ignore" in tika config
> ---
>
> Key: TIKA-3575
> URL: https://issues.apache.org/jira/browse/TIKA-3575
> Project: Tika
>  Issue Type: Bug
>  Components: config
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Andreas Hubold
>Priority: Major
>  Labels: regression
>
> Tika 2.0.0 changed the default error handler to throw exceptions, and no 
> longer ignores errors when loading parsers, as was the case with Tika 1.x.
> See  
> [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470]
> There's no configuration option to restore the previous behavior. It should 
> be possible to set
> {code}
> 
> {code}
> but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement 
> only considers "warn" and "throw" as possible values.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config

2021-10-14 Thread Andreas Hubold (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Hubold updated TIKA-3575:
-
Description: 
Tika 2.0.0 changed the default error handler to throw exceptions, and no 
longer ignores errors when loading parsers, as was the case with Tika 1.x.

See  
[https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470]

There's no configuration option to restore the previous behavior. It should be 
possible to set

{code}

{code}

but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement 
only considers "warn" and "throw" as possible values.



 

 

 

  was:
Tika 2.0.0 changed the default error handler to throw exceptions, and does not 
ignore errors when loading parsers anymore as it was the case with Tika 1.x.

See  
[https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470)]

There's no configuration option to restore the previous behavior. It should be 
possible to set

{code}

{code}

but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement 
only considers "warn" and "throw" as possible values.



 

 

 


> Cannot use loadErrorHandler="ignore" in tika config
> ---
>
> Key: TIKA-3575
> URL: https://issues.apache.org/jira/browse/TIKA-3575
> Project: Tika
>  Issue Type: Bug
>  Components: config
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Andreas Hubold
>Priority: Major
>  Labels: regression
>
> Tika 2.0.0 changed the default error handler to throw exceptions, and no 
> longer ignores errors when loading parsers, as was the case with Tika 1.x.
> See  
> [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470]
> There's no configuration option to restore the previous behavior. It should 
> be possible to set
> {code}
> 
> {code}
> but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement 
> only considers "warn" and "throw" as possible values.
>  
>  
>  





[jira] [Commented] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config

2021-10-14 Thread Andreas Hubold (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17428718#comment-17428718
 ] 

Andreas Hubold commented on TIKA-3575:
--

After looking more into this, I saw that LoadErrorHandler.THROW is only 
used as the default if a `<service-loader>` element is specified. Otherwise, 
the default is still IGNORE. So maybe the default should just be changed back 
to IGNORE.

BTW, I ran into this with the following declaration
{code:java}
 {code}
But as it seems, I can simply remove the whole service-loader element to avoid 
the problem. IIUC, the InitializableProblemHandler isn't called by any 
predefined class anymore anyway. I had this declaration to avoid warnings from 
the PDFParser in previous Tika versions, but that's not necessary anymore with 
Tika 2.x.

> Cannot use loadErrorHandler="ignore" in tika config
> ---
>
> Key: TIKA-3575
> URL: https://issues.apache.org/jira/browse/TIKA-3575
> Project: Tika
>  Issue Type: Bug
>  Components: config
>Affects Versions: 2.0.0, 2.1.0
>Reporter: Andreas Hubold
>Priority: Major
>  Labels: regression
>
> Tika 2.0.0 changed the default error handler to throw exceptions, and no 
> longer ignores errors when loading parsers, as was the case with Tika 1.x.
> See  
> [https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470]
> There's no configuration option to restore the previous behavior. It should 
> be possible to set
> {code}
> 
> {code}
> but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement 
> only considers "warn" and "throw" as possible values.
>  
>  
>  





[jira] [Created] (TIKA-3575) Cannot use loadErrorHandler="ignore" in tika config

2021-10-14 Thread Andreas Hubold (Jira)
Andreas Hubold created TIKA-3575:


 Summary: Cannot use loadErrorHandler="ignore" in tika config
 Key: TIKA-3575
 URL: https://issues.apache.org/jira/browse/TIKA-3575
 Project: Tika
  Issue Type: Bug
  Components: config
Affects Versions: 2.1.0, 2.0.0
Reporter: Andreas Hubold


Tika 2.0.0 changed the default error handler to throw exceptions, and no 
longer ignores errors when loading parsers, as was the case with Tika 1.x.

See  
[https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470|https://github.com/apache/tika/commit/e47c6cd62e587fdaae7e2e999f37122d09449754#diff-3955d56f4d95c6e600966c486c58f92483c900d32d553d18b3cf2940cbf2c768R470]

There's no configuration option to restore the previous behavior. It should be 
possible to set

{code}

{code}

but the code in org.apache.tika.config.TikaConfig#serviceLoaderFromDomElement 
only considers "warn" and "throw" as possible values.
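
A config along the following lines is presumably what is meant. This is only a 
sketch: the element and attribute names are inferred from the issue title and 
from TikaConfig#serviceLoaderFromDomElement, and the surrounding properties 
root element is an assumption about the layout of the tika config file.
{code:xml}
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- intended effect: skip parsers that fail to load, as Tika 1.x did -->
  <service-loader loadErrorHandler="ignore"/>
</properties>
{code}
In the affected versions the value "ignore" is not recognized, so a 
declaration like this does not restore the Tika 1.x behavior.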



 

 

 





[jira] [Comment Edited] (TIKA-2802) Out of memory issues when extracting large files (pst)

2019-05-16 Thread Andreas Hubold (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841008#comment-16841008
 ] 

Andreas Hubold edited comment on TIKA-2802 at 5/16/19 6:40 AM:
---

I wonder if the addition of Xerces is still recommended for Java 9+ projects. 

Since Java 9, the JDK contains bugfixes from Xerces 2.11.0, see 
https://bugs.openjdk.java.net/browse/JDK-8044086
For Java 13, an update to Xerces 2.12.0 is in progress according to 
https://bugs.openjdk.java.net/browse/JDK-8214064

Do you know which Xerces issue was causing the problem?


was (Author: ahubold):
I wonder if the addition of Xerces is still  recommended for Java 9+ projects. 

Since Java 9, the JDK contains bugfixes from Xerces 2.11.0, see  
lhttps://bugs.openjdk.java.net/browse/JDK-8044086
For Java 13, an update to Xerces 2.12.0 is in progress according to 
https://bugs.openjdk.java.net/browse/JDK-8214064

Do you know which Xerces issue was causing the problem?

> Out of memory issues when extracting large files (pst)
> --
>
> Key: TIKA-2802
> URL: https://issues.apache.org/jira/browse/TIKA-2802
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.20, 1.19.1
> Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>  
>Reporter: Caleb Ott
>Priority: Critical
> Attachments: Selection_111.png, Selection_117.png
>
>
> I have an application that extracts text from multiple files on a file share. 
> I've been running into issues with the application running out of memory 
> (~26g dedicated to the heap).
> I found in the heap dumps there is a "fDTDDecl" buffer which is creating very 
> large char arrays and never releasing that memory. In the picture you can see 
> the heap dump with 4 SAXParsers holding onto a large chunk of memory. The 
> fourth one is expanded to show it is all being held by the "fDTDDecl" field. 
> This dump is from a scaled down execution (not a 26g heap).
> It looks like that DTD field should never be that large, I'm wondering if 
> this is a bug with xerces instead? I can easily reproduce the issue by 
> attempting to extract text from large .pst files.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2802) Out of memory issues when extracting large files (pst)

2019-05-16 Thread Andreas Hubold (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16841008#comment-16841008
 ] 

Andreas Hubold commented on TIKA-2802:
--

I wonder if the addition of Xerces is still recommended for Java 9+ projects. 

Since Java 9, the JDK contains bugfixes from Xerces 2.11.0, see 
https://bugs.openjdk.java.net/browse/JDK-8044086
For Java 13, an update to Xerces 2.12.0 is in progress according to 
https://bugs.openjdk.java.net/browse/JDK-8214064

Do you know which Xerces issue was causing the problem?

> Out of memory issues when extracting large files (pst)
> --
>
> Key: TIKA-2802
> URL: https://issues.apache.org/jira/browse/TIKA-2802
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.20, 1.19.1
> Environment: Reproduced on Windows 2012 R2 and Ubuntu 18.04.
> Java: jdk1.8.0_151
>  
>Reporter: Caleb Ott
>Priority: Critical
> Attachments: Selection_111.png, Selection_117.png
>
>
> I have an application that extracts text from multiple files on a file share. 
> I've been running into issues with the application running out of memory 
> (~26g dedicated to the heap).
> I found in the heap dumps there is a "fDTDDecl" buffer which is creating very 
> large char arrays and never releasing that memory. In the picture you can see 
> the heap dump with 4 SAXParsers holding onto a large chunk of memory. The 
> fourth one is expanded to show it is all being held by the "fDTDDecl" field. 
> This dump is from a scaled down execution (not a 26g heap).
> It looks like that DTD field should never be that large, I'm wondering if 
> this is a bug with xerces instead? I can easily reproduce the issue by 
> attempting to extract text from large .pst files.





[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2017-07-06 Thread Andreas Hubold (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076162#comment-16076162
 ] 

Andreas Hubold commented on TIKA-1367:
--

Yes, true. I was referring to [~gus_heck]'s comment, where he wrote that 
dependencies were completely lost in Maven when upgrading from 1.12 to 1.15.

> Tika documentation should list tika-parsers parser dependencies
> ---
>
> Key: TIKA-1367
> URL: https://issues.apache.org/jira/browse/TIKA-1367
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sergey Beryozkin
> Fix For: 1.16
>
>
> tika-parsers module has many strong transitive parser dependencies. Maven 
> users of tika-parsers have to exclude all the transitive dependencies 
> manually. Documenting the list of the existing transitive dependencies and 
> keeping the list up to date will help developers exclude the libraries not 
> needed for a given project.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (TIKA-1367) Tika documentation should list tika-parsers parser dependencies

2017-07-06 Thread Andreas Hubold (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16076137#comment-16076137
 ] 

Andreas Hubold commented on TIKA-1367:
--

mvn dependency:tree lists the dependencies of tika-parsers 1.15 for me. They 
are also correctly listed in the pom available on Maven Central: 
http://central.maven.org/maven2/org/apache/tika/tika-parsers/1.15/tika-parsers-1.15.pom
Maybe you've somehow installed a wrong pom into your local Maven repository?

> Tika documentation should list tika-parsers parser dependencies
> ---
>
> Key: TIKA-1367
> URL: https://issues.apache.org/jira/browse/TIKA-1367
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Sergey Beryozkin
> Fix For: 1.16
>
>
> tika-parsers module has many strong transitive parser dependencies. Maven 
> users of tika-parsers have to exclude all the transitive dependencies 
> manually. Documenting the list of the existing transitive dependencies and 
> keeping the list up to date will help developers exclude the libraries not 
> needed for a given project.





[jira] [Updated] (TIKA-967) Tika comes with transitive Maven dependency to a test artifact of vorbis-java-core

2013-05-13 Thread Andreas Hubold (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Hubold updated TIKA-967:


Affects Version/s: 1.3

 Tika comes with transitive Maven dependency to a test artifact of 
 vorbis-java-core 
 ---

 Key: TIKA-967
 URL: https://issues.apache.org/jira/browse/TIKA-967
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.1, 1.2, 1.3
Reporter: Andreas Hubold
Priority: Minor

 Tika 1.2 has the following dependencies:
 {noformat}
 \- org.apache.tika:tika-parsers:jar:1.2:compile
 [INFO]+- org.apache.tika:tika-core:jar:1.2:compile
 [INFO]+- org.gagravarr:vorbis-java-tika:jar:0.1:compile
 [INFO]|  \- org.gagravarr:vorbis-java-core:jar:tests:0.1:runtime
 ..
 [INFO]+- org.gagravarr:vorbis-java-core:jar:0.1:compile
 {noformat}
 The transitive dependency to {{org.gagravarr:vorbis-java-core:jar:tests}} is 
 wrong. It only contains test resources for vorbis (note the Maven classifier 
 'tests').
 It seems this is caused by a bug in the pom.xml of 
 org.gagravarr:vorbis-java-tika. It contains a dependency to vorbis-java-core 
 with classifier {{tests}} and scope {{test,provided}}. This is not a valid 
 scope (you can't enumerate multiple scopes here for a Maven dependency).
 Tika could work around this by excluding the transitive dependency in the 
 dependency declaration of {{vorbis-java-tika}}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-967) Tika comes with transitive Maven dependency to a test artifact of vorbis-java-core

2012-08-02 Thread Andreas Hubold (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13427271#comment-13427271
 ] 

Andreas Hubold commented on TIKA-967:
-

I've created a pull request here: https://github.com/Gagravarr/VorbisJava/pull/1

 Tika comes with transitive Maven dependency to a test artifact of 
 vorbis-java-core 
 ---

 Key: TIKA-967
 URL: https://issues.apache.org/jira/browse/TIKA-967
 Project: Tika
  Issue Type: Bug
  Components: packaging
Affects Versions: 1.1, 1.2
Reporter: Andreas Hubold
Priority: Minor

 Tika 1.2 has the following dependencies:
 {noformat}
 \- org.apache.tika:tika-parsers:jar:1.2:compile
 [INFO]+- org.apache.tika:tika-core:jar:1.2:compile
 [INFO]+- org.gagravarr:vorbis-java-tika:jar:0.1:compile
 [INFO]|  \- org.gagravarr:vorbis-java-core:jar:tests:0.1:runtime
 ..
 [INFO]+- org.gagravarr:vorbis-java-core:jar:0.1:compile
 {noformat}
 The transitive dependency to {{org.gagravarr:vorbis-java-core:jar:tests}} is 
 wrong. It only contains test resources for vorbis (note the Maven classifier 
 'tests').
 It seems this is caused by a bug in the pom.xml of 
 org.gagravarr:vorbis-java-tika. It contains a dependency to vorbis-java-core 
 with classifier {{tests}} and scope {{test,provided}}. This is not a valid 
 scope (you can't enumerate multiple scopes here for a Maven dependency).
 Tika could work around this by excluding the transitive dependency in the 
 dependency declaration of {{vorbis-java-tika}}.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira