[jira] [Updated] (TIKA-3839) Property com.ctc.wstx.maxEntityCount is not supported
[ https://issues.apache.org/jira/browse/TIKA-3839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lakatos Gyula updated TIKA-3839:

Description:
First of all, this might not even be a bug, just a slight annoyance. Whenever I try to parse the attached PDF, I get the following error:
{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
    at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
    at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
    at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
    at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
    at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
{code}
After a couple of hours of Googling, I realized that there is an XML parser implementation called woodstox. If I include that dependency on the classpath, this exception is no longer present, because it understands the _com.ctc.wstx.maxEntityCount_ property.

As far as I see, the 1.28.4 version of tika-parsers included this library as a compile-time dependency, however, 2.0.0+ doesn't. I'm not sure why this is the case, but there must be a good reason for it.

However, I think it would be a good idea to change the exception's message from:
{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes]
{code}
to something more meaningful. Something that mentions woodstox would be good (especially if the only property that Tika tries to set is woodstox specific).

Also, spamming/printing the message every 5 minutes is pointless in my opinion. If woodstox is not on the classpath, it will fail anyways (and also it is not synchronized, so if you are parsing a lot of documents at the same time in parallel, it still can print it more than once).

was:
First of all, this might not even be a bug, just a slight annoyance. Whenever I try to parse the attached PDF, I get the following error:
{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
    at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
    at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
    at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
    at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
    at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
{code}
After a couple of hours of Googling, I realized that there is an XML parser implementation called woodstox. If I include that dependency on the classpath, this exception is no longer present, because it understands the _com.ctc.wstx.maxEntityCount_ property.

As far as I see, the 1.28.4 version of tika-parsers included this library as a compile-time dependency, however, 2.0.0+ doesn't. I'm not sure why this is the case, but there must be a good reason for it.

However, I think it would be a good idea to change the exception's message from:
{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes]
{code}
to something more meaningful. Something that mentions woodstox would be good (especially if the only property that Tika tries to set is woodstox specific).

Also, spamming/printing the message every 5 minutes is pointless in my opinion. If woodstox is not on the classpath, it will fail anyways.

> Property com.ctc.wstx.maxEntityCount is not supported
> -
>
> Key: TIKA-3839
> URL: https://issues.apache.org/jira/browse/TIKA-3839
> Project: Tika
> Issue
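The point about the warning not being synchronized can be illustrated with a one-time guard. This is only a minimal sketch and not Tika's actual code: the class `OneTimeWarning`, its methods, and the message text are hypothetical, and it assumes that a single warning per JVM is acceptable. `AtomicBoolean.compareAndSet` lets exactly one thread win the right to log, even when many documents are parsed in parallel.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper for illustration only; not part of Tika.
public class OneTimeWarning {
    private final AtomicBoolean warned = new AtomicBoolean(false);
    private final AtomicInteger emitted = new AtomicInteger();

    /** Emits the message for the first caller only; returns true if this call emitted it. */
    public boolean warnOnce(String message) {
        // compareAndSet flips false -> true atomically, so only one thread can succeed.
        if (warned.compareAndSet(false, true)) {
            System.err.println("WARN " + message);
            emitted.incrementAndGet();
            return true;
        }
        return false;
    }

    /** Number of warnings actually written (0 or 1). */
    public int emittedCount() {
        return emitted.get();
    }

    public static void main(String[] args) throws InterruptedException {
        OneTimeWarning warning = new OneTimeWarning();
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            // The message text is an assumed example of a more descriptive warning.
            threads[i] = new Thread(() -> warning.warnOnce(
                    "StAX factory rejected com.ctc.wstx.maxEntityCount; is woodstox on the classpath?"));
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("warnings emitted: " + warning.emittedCount());
    }
}
```

Whether a permanent one-time warning or a timed suppression is preferable depends on how long-running the process is; the sketch only shows that the race itself is easy to avoid.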
[jira] [Created] (TIKA-3839) Property com.ctc.wstx.maxEntityCount is not supported
Lakatos Gyula created TIKA-3839:
---

Summary: Property com.ctc.wstx.maxEntityCount is not supported
Key: TIKA-3839
URL: https://issues.apache.org/jira/browse/TIKA-3839
Project: Tika
Issue Type: Bug
Affects Versions: 2.4.1
Reporter: Lakatos Gyula
Attachments: 8a4b2154-b6c1-4e0e-b8be-8ce4e68c454f.pdf

First of all, this might not even be a bug, just a slight annoyance. Whenever I try to parse the attached PDF, I get the following error:
{code:java}
[main] WARN org.apache.tika.utils.XMLReaderUtils - SAX Security Manager could not be setup [log suppressed for 5 minutes]
java.lang.IllegalArgumentException: Property com.ctc.wstx.maxEntityCount is not supported
    at java.xml/com.sun.xml.internal.stream.XMLInputFactoryImpl.setProperty(XMLInputFactoryImpl.java:246)
    at org.apache.tika.utils.XMLReaderUtils.trySetStaxSecurityManager(XMLReaderUtils.java:732)
    at org.apache.tika.utils.XMLReaderUtils.getXMLInputFactory(XMLReaderUtils.java:303)
    at org.apache.tika.parser.ParseContext.getXMLInputFactory(ParseContext.java:229)
    at org.apache.tika.parser.pdf.XFAExtractor.extract(XFAExtractor.java:90)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractAcroForm(AbstractPDF2XHTML.java:863)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:772)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:270)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:97)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:170)
{code}
After a couple of hours of Googling, I realized that there is an XML parser implementation called woodstox. If I include that dependency on the classpath, this exception is no longer present, because it understands the _com.ctc.wstx.maxEntityCount_ property.

As far as I see, the 1.28.4 version of tika-parsers included this library as a compile-time dependency, however, 2.0.0+ doesn't. I'm not sure why this is the case, but there must be a good reason for it.

However, I think it would be a good idea to change the exception's message from:
{code:java}
SAX Security Manager could not be setup [log suppressed for 5 minutes]
{code}
to something more meaningful. Something that mentions woodstox would be good (especially if the only property that Tika tries to set is woodstox specific).

Also, spamming/printing the message every 5 minutes is pointless in my opinion. If woodstox is not on the classpath, it will fail anyways.

--
This message was sent by Atlassian Jira (v8.20.10#820010)
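The behaviour described above can be reproduced outside Tika with the plain StAX API. The following is a minimal sketch, not Tika's `XMLReaderUtils` code: the class name `StaxPropertyProbe`, the helper `trySetProperty`, and the value `20` are assumptions made for illustration. It asks whichever `XMLInputFactory` is on the classpath whether it knows the woodstox-specific property before trying to set it.

```java
import javax.xml.stream.XMLInputFactory;

// Hypothetical probe for illustration only; not part of Tika.
public class StaxPropertyProbe {
    // Property name taken from the exception message in the report.
    static final String WOODSTOX_MAX_ENTITY_COUNT = "com.ctc.wstx.maxEntityCount";

    /** Tries to set a property; returns false instead of throwing when the factory rejects it. */
    static boolean trySetProperty(XMLInputFactory factory, String name, Object value) {
        if (!factory.isPropertySupported(name)) {
            return false;
        }
        try {
            factory.setProperty(name, value);
            return true;
        } catch (IllegalArgumentException e) {
            // Implementations signal unsupported properties or values this way.
            return false;
        }
    }

    public static void main(String[] args) {
        // Picks up woodstox if it is on the classpath, otherwise the JDK's built-in factory.
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // The limit of 20 is an arbitrary example value.
        boolean supported = trySetProperty(factory, WOODSTOX_MAX_ENTITY_COUNT, 20);
        System.out.println(factory.getClass().getName()
                + " accepts " + WOODSTOX_MAX_ENTITY_COUNT + ": " + supported);
    }
}
```

A check like this also gives a natural place to log a message that names both the property and the factory class, which is the kind of more meaningful warning the report asks for.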
[jira] [Commented] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file
[ https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17575951#comment-17575951 ]

Lakatos Gyula commented on TIKA-3832:
-

[~tallison] Thanks a lot for fixing the problem! Tika is awesome. :)

> Required array length is too large (OOM) error when reading a PDF file
> --
>
> Key: TIKA-3832
> URL: https://issues.apache.org/jira/browse/TIKA-3832
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Lakatos Gyula
> Priority: Major
> Fix For: 1.28.5, 2.4.2
>
> Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf
>
>
> I'm working on a web crawler and it got obliterated with an OutOfMemory error by a random PDF from the internet.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483638 + 14 is too large
>     at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
>     at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
>     at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
>     at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
>     at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>     at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>     at java.base/java.io.StringWriter.write(StringWriter.java:99)
>     at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>     at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
>     at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
> {code}
> I reproduced the error in this repository: [https://github.com/laxika/apache-tika-oom-reproduction|http://example.com/]
> Uploaded the PDF into the attachments as well. It can be opened and read by the PDF readers I tried (Edge, Adobe, Chrome).
[jira] [Created] (TIKA-3832) Required array length is too large when reading a PDF file
Lakatos Gyula created TIKA-3832:
---

Summary: Required array length is too large when reading a PDF file
Key: TIKA-3832
URL: https://issues.apache.org/jira/browse/TIKA-3832
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.4.1
Reporter: Lakatos Gyula
Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf

I'm working on a web crawler and it got obliterated with an OutOfMemory error by a random PDF from the internet.
{code:java}
Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483638 + 14 is too large
    at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
    at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
    at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
    at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
    at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
    at java.base/java.io.StringWriter.write(StringWriter.java:99)
    at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
    at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
    at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
    at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
    at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
    at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
    at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
    at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
    at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
    at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
{code}
I reproduced the error in this repository: [https://github.com/laxika/apache-tika-oom-reproduction|http://example.com/]

Uploaded the PDF into the attachments as well. It can be opened and read by the PDF readers I tried (Edge, Adobe, Chrome).
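The two numbers in the error message can be checked with plain integer arithmetic. The sketch below is only an illustration and not the JDK's source: the class name `ArrayLengthOverflow` and the helper `exceedsIntRange` are made up here. It shows that the requested length (the current buffer size plus the 14 characters being appended) no longer fits in a Java `int`, which is the index type of arrays, so the text buffer cannot grow regardless of how much heap is available.

```java
// Illustration of the arithmetic behind the error message; not JDK or Tika code.
public class ArrayLengthOverflow {

    /** True if oldLength + minGrowth does not fit in a Java int. */
    static boolean exceedsIntRange(int oldLength, int minGrowth) {
        // Widen to long first so the sum itself cannot overflow.
        long required = (long) oldLength + minGrowth;
        return required > Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        int oldLength = 2147483638; // current buffer length, from the error message
        int minGrowth = 14;         // characters being appended, from the error message

        System.out.println((long) oldLength + minGrowth);          // prints 2147483652
        System.out.println(oldLength + minGrowth);                 // int sum wraps: prints -2147483644
        System.out.println(exceedsIntRange(oldLength, minGrowth)); // prints true
    }
}
```

In other words, roughly 2 GiB of characters had already been accumulated in a single `StringWriter` by the time the append failed.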
[jira] [Updated] (TIKA-3832) Required array length is too large (OOM) error when reading a PDF file
[ https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lakatos Gyula updated TIKA-3832:

Summary: Required array length is too large (OOM) error when reading a PDF file (was: Required array length is too large when reading a PDF file)

> Required array length is too large (OOM) error when reading a PDF file
> --
>
> Key: TIKA-3832
> URL: https://issues.apache.org/jira/browse/TIKA-3832
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Lakatos Gyula
> Priority: Major
> Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf