[ https://issues.apache.org/jira/browse/TIKA-3832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17575847#comment-17575847 ]
Tim Allison commented on TIKA-3832:
-----------------------------------

Thank you for sharing the file. PDFBox's ExtractText has no problem with this. Tika is entering an infinite loop here:
{noformat}
<p>Abweichender Beschluss</p>
<p/>
</div>
<ul>
<li>Bekanntmachung 18.5.2021 Gemeinderat</li>
<li>1 Beschlussvorlage 43/2021 - Verkehrsplanung; Sanierung der Zornedinger Straße in Harthausen - Variantenuntersuchung</li>
<li> > Zwischenbericht Vorplanung</li>
<li> > Lageplan (Teil Süd)</li>
<li> -->Pläne und Ansichten</li>
<ul>
<li>140102_Unterlage 5 utm-Blatt 1</li>
<li>140102_Unterlage 5 utm-Blatt 1</li>
{noformat}

> Required array length is too large (OOM) error when reading a PDF file
> ----------------------------------------------------------------------
>
> Key: TIKA-3832
> URL: https://issues.apache.org/jira/browse/TIKA-3832
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.1
> Reporter: Lakatos Gyula
> Priority: Major
> Attachments: 7581cfbf-8c1e-4154-bfbb-4e633d858d5f.pdf
>
>
> I'm working on a web crawler and it got obliterated with an OutOfMemory error
> by a random PDF from the internet.
> {code:java}
> Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483638 + 14 is too large
>     at java.base/jdk.internal.util.ArraysSupport.hugeLength(ArraysSupport.java:649)
>     at java.base/jdk.internal.util.ArraysSupport.newLength(ArraysSupport.java:642)
>     at java.base/java.lang.AbstractStringBuilder.newCapacity(AbstractStringBuilder.java:257)
>     at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:229)
>     at java.base/java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:740)
>     at java.base/java.lang.StringBuffer.append(StringBuffer.java:410)
>     at java.base/java.io.StringWriter.write(StringWriter.java:99)
>     at org.apache.tika.sax.ToTextContentHandler.characters(ToTextContentHandler.java:108)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.WriteOutContentHandler.characters(WriteOutContentHandler.java:160)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.xpath.MatchingContentHandler.characters(MatchingContentHandler.java:81)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.ContentHandlerDecorator.characters(ContentHandlerDecorator.java:141)
>     at org.apache.tika.sax.SafeContentHandler.access$201(SafeContentHandler.java:47)
>     at org.apache.tika.sax.SafeContentHandler.lambda$new$0(SafeContentHandler.java:57)
>     at org.apache.tika.sax.SafeContentHandler.filter(SafeContentHandler.java:106)
>     at org.apache.tika.sax.SafeContentHandler.characters(SafeContentHandler.java:250)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:270)
>     at org.apache.tika.sax.XHTMLContentHandler.characters(XHTMLContentHandler.java:295)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:977)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:981)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractBookmarkText(AbstractPDF2XHTML.java:959)
>     at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:907)
>     at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:239)
>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:196)
>     at com.example.TikaOOMExample.main(TikaOOMExample.java:31)
> {code}
> I reproduced the error in this repository:
> [https://github.com/laxika/apache-tika-oom-reproduction|http://example.com/]
> Uploaded the PDF into the attachments as well. It can be opened and read by
> the PDF readers I tried (Edge, Adobe, Chrome).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
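
For illustration only: the repeated extractBookmarkText frames and the duplicated list items above would be consistent with an outline whose child or sibling links point back at an earlier entry, although this thread does not confirm that as the cause. A minimal sketch of guarding such a traversal with a visited set follows. The OutlineWalk and Node types and the collectTitles method are hypothetical stand-ins, not PDFBox or Tika classes, and this is not presented as the fix that was applied in Tika.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class OutlineWalk {

    /** Hypothetical outline entry: a title plus first-child and next-sibling links. */
    static final class Node {
        final String title;
        Node firstChild;
        Node nextSibling;

        Node(String title) {
            this.title = title;
        }
    }

    /** Collects titles depth-first, visiting each node object at most once. */
    static List<String> collectTitles(Node first) {
        List<String> out = new ArrayList<>();
        // Identity-based set: two distinct nodes with equal titles are both kept,
        // but the same node object reached twice is detected.
        Set<Node> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        walk(first, seen, out);
        return out;
    }

    private static void walk(Node first, Set<Node> seen, List<String> out) {
        for (Node n = first; n != null; n = n.nextSibling) {
            if (!seen.add(n)) {
                // The link leads back to a node already visited: stop following
                // this chain instead of appending text forever.
                return;
            }
            out.add(n.title);
            walk(n.firstChild, seen, out);
        }
    }
}
```

With two nodes A and B where A's next sibling is B and B's next sibling is A again, collectTitles(A) returns [A, B] and terminates, where an unguarded loop would keep appending until the buffer overflows.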