[jira] [Updated] (TIKA-3839) Property com.ctc.wstx.maxEntityCount is not supported

2022-08-18 Thread Lakatos Gyula (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lakatos Gyula updated TIKA-3839:

Description: 
First of all, this might not even be a bug, just a slight annoyance.

Whenever I try to parse the attached PDF, I get the following error:

 
{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
    at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
    at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
    at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
    at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
    at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
{code}
 

After a couple of hours of Googling, I realized that there is an XML parser 
implementation called woodstox. If I include that dependency on the classpath, 
this exception is no longer present, because it understands the 
_com.ctc.wstx.maxEntityCount_ property.
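
For anyone hitting the same warning, here is a minimal sketch of how a StAX factory can be probed before a vendor-specific property is set. It uses only the standard javax.xml.stream API. The class and method names (StaxPropertyProbe, trySetProperty) and the value 100_000 are made up for illustration; this is not Tika's code.

```java
import javax.xml.stream.XMLInputFactory;

final class StaxPropertyProbe {
    // Property name copied from the stack trace; it belongs to woodstox.
    static final String MAX_ENTITY_COUNT = "com.ctc.wstx.maxEntityCount";

    /**
     * Tries to set a property and reports whether the factory accepted it.
     * XMLInputFactory.setProperty throws IllegalArgumentException for
     * property names the implementation does not support.
     */
    static boolean trySetProperty(XMLInputFactory factory, String name, Object value) {
        if (!factory.isPropertySupported(name)) {
            return false;
        }
        try {
            factory.setProperty(name, value);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Which implementation is returned depends on what is on the classpath.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // 100_000 is an arbitrary illustrative limit (assumption).
        boolean accepted = trySetProperty(factory, MAX_ENTITY_COUNT, 100_000);
        System.out.println(factory.getClass().getName()
                + " accepted " + MAX_ENTITY_COUNT + ": " + accepted);
    }
}
```

Running this with and without woodstox on the classpath should show which factory implementation is picked up and whether it accepts the property.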

As far as I can see, version 1.28.4 of tika-parsers included this library as a 
compile-time dependency, but 2.0.0+ doesn't. I'm not sure why this is the 
case, but there must be a good reason for it.

However, I think it would be a good idea to change the exception's message from:
{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes] {code}
to something more meaningful.

Something that mentions woodstox would be good (especially if the only property 
that Tika tries to set is woodstox-specific). Also, printing the message every 
5 minutes is pointless in my opinion: if woodstox is not on the classpath, 
setting the property will fail anyway. The suppression is also not 
synchronized, so if you are parsing a lot of documents in parallel, the message 
can still be printed more than once.
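
To illustrate the synchronization point, here is a rough sketch of a log suppressor in which at most one thread per interval is allowed to print. RateLimitedWarning and shouldLog are hypothetical names; this is not how Tika implements its suppression, just one way a race-free check could look.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RateLimitedWarning {
    private final long intervalMillis;
    // Timestamp of the last emitted warning; Long.MIN_VALUE means "never logged".
    private final AtomicLong lastLogged = new AtomicLong(Long.MIN_VALUE);

    RateLimitedWarning(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /**
     * Returns true for at most one caller per interval. Threads that read the
     * same old timestamp race on compareAndSet, and only the winner logs.
     */
    boolean shouldLog(long nowMillis) {
        long last = lastLogged.get();
        if (last != Long.MIN_VALUE && nowMillis - last < intervalMillis) {
            return false;
        }
        return lastLogged.compareAndSet(last, nowMillis);
    }

    public static void main(String[] args) {
        RateLimitedWarning warning = new RateLimitedWarning(5 * 60 * 1000L);
        for (int i = 0; i < 3; i++) {
            if (warning.shouldLog(System.currentTimeMillis())) {
                System.out.println("SAX Security Manager could not be setup");
            }
        }
    }
}
```

Passing a very large interval would approximate "log once", which is closer to what I would prefer here.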


> Property com.ctc.wstx.maxEntityCount is not supported
> -
>
> Key: TIKA-3839
> URL: https://issues.apache.org/jira/browse/TIKA-3839
> Project: Tika
>  Issue 

[jira] [Created] (TIKA-3839) Property com.ctc.wstx.maxEntityCount is not supported

2022-08-17 Thread Lakatos Gyula (Jira)
Lakatos Gyula created TIKA-3839:
---

 Summary: Property com.ctc.wstx.maxEntityCount is not supported
 Key: TIKA-3839
 URL: https://issues.apache.org/jira/browse/TIKA-3839
 Project: Tika
  Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Lakatos Gyula
 Attachments: 8a4b2154-b6c1-4e0e-b8be-8ce4e68c454f.pdf

First of all, this might not even be a bug, just a slight annoyance.

Whenever I try to parse the attached PDF, I get the following error:

 
{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
    at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
    at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
    at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
    at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
    at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
{code}
 

After a couple of hours of Googling, I realized that there is an XML parser 
implementation called woodstox. If I include that dependency on the classpath, 
this exception is no longer present, because it understands the 
_com.ctc.wstx.maxEntityCount_ property.

As far as I can see, version 1.28.4 of tika-parsers included this library as a 
compile-time dependency, but 2.0.0+ doesn't. I'm not sure why this is the 
case, but there must be a good reason for it.

However, I think it would be a good idea to change the exception's message from:
{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes] {code}
to something more meaningful.

Something that mentions woodstox would be good (especially if the only property 
that Tika tries to set is woodstox-specific). Also, printing the message every 
5 minutes is pointless in my opinion: if woodstox is not on the classpath, 
setting the property will fail anyway.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file

2022-08-05 Thread Lakatos Gyula (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575951#comment-17575951
 ] 

Lakatos Gyula commented on TIKA-3832:
-

[~tallison] Thanks a lot for fixing the problem! Tika is awesome. :)

> Required array length is too large (OOM) error when reading a PDF file
> --
>
> Key: TIKA-3832
> URL: https://issues.apache.org/jira/browse/TIKA-3832
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Lakatos Gyula
> Priority: Major
> Fix For: 1.28.5, 2.4.2
>
> Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf
>
>
> I'm working on a web crawler and it got obliterated with an OutOfMemory error 
> by a random PDF from the internet.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483638 + 14 is too large
>   at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
>   at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
>   at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
>   at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
>   at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>   at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>   at java.base/java.io.StringWriter.write(StringWriter.java:99)
>   at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>   at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>   at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>   at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>   at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>   at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
>   at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
>   at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
>   at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
>   at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
>   at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
> {code}
> I reproduced the error in this repository:
> https://github.com/laxika/apache-tika-oom-reproduction
> Uploaded the PDF into the attachments as well. It can be opened and read by 
> the PDF readers I tried (Edge, Adobe, Chrome).





[jira] [Created] (TIKA-3832) Required array length is too large when reading a PDF file

2022-08-05 Thread Lakatos Gyula (Jira)
Lakatos Gyula created TIKA-3832:
---

 Summary: Required array length is too large when reading a PDF file
 Key: TIKA-3832
 URL: https://issues.apache.org/jira/browse/TIKA-3832
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 2.4.1
Reporter: Lakatos Gyula
 Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf

I'm working on a web crawler and it got obliterated with an OutOfMemory error 
by a random PDF from the internet.
{code:java}
Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483638 + 14 is too large
    at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
    at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
    at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
    at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
    at java.base/java.io.StringWriter.write(StringWriter.java:99)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
    at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
    at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
{code}
I reproduced the error in this repository:
https://github.com/laxika/apache-tika-oom-reproduction

Uploaded the PDF into the attachments as well. It can be opened and read by the 
PDF readers I tried (Edge, Adobe, Chrome).
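
For context on the numbers in the error message: Java arrays are indexed by int, so no array can be longer than Integer.MAX_VALUE. The small sketch below just redoes that arithmetic with the two values from the message. Reading them as "current length + required growth" is my interpretation of the wording, and the class name ArrayLengthOverflow is made up.

```java
final class ArrayLengthOverflow {
    /** Returns true when oldLength + minGrowth no longer fits in a Java int. */
    static boolean overflowsInt(int oldLength, int minGrowth) {
        // Do the addition in long so the sum itself cannot wrap around.
        return (long) oldLength + minGrowth > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int oldLength = 2147483638; // first number in the error message
        int minGrowth = 14;         // second number in the error message
        long required = (long) oldLength + minGrowth;
        // prints "required = 2147483652, max int = 2147483647"
        System.out.println("required = " + required + ", max int = " + Integer.MAX_VALUE);
        // prints "overflows: true"
        System.out.println("overflows: " + overflowsInt(oldLength, minGrowth));
    }
}
```

So the buffer behind the StringWriter was already within a few characters of the largest possible array when the next append was attempted.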





[jira] [Updated] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file

2022-08-05 Thread Lakatos Gyula (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lakatos Gyula updated TIKA-3832:

Summary: Required array length is too large (OOM) error when reading a PDF 
file  (was: Required array length is too large when reading a PDF file)



