[ https://issues.apache.org/jira/browse/PDFBOX-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317606#comment-17317606 ]
Jason Pyeron commented on PDFBOX-5158:
--------------------------------------

I have a valid (non-corrupt) PDF which seems to show similar behavior. I cannot share the PDF because it contains customer data and PII, but I can run tests against it.

[pdinc bug 2386|https://projects.pdinc.us/show_bug.cgi?id=2386]

getEOFEndingOffsets got caught in an endless loop due to too many PDF xref found by XrefTrailerResolver

{noformat}
type    pos (after EOF or start of xref)
======= ======
xref    217557
xref    220101
%%EOF   220298
xref    339064
xref    339461
%%EOF   339660
xref    533303
xref    533947
%%EOF   534146
xref    546909
xref    547346
%%EOF   547545
xref    625816
xref    626372
%%EOF   626571
xref    635964
xref    636321
%%EOF   636520
xref    657314
xref    657838
%%EOF   658037
xref    667430
xref    667787
%%EOF   667986
{noformat}

> Infinite loop on corrupted PDF in 3.0.0-SNAPSHOT
> ------------------------------------------------
>
> Key: PDFBOX-5158
> URL: https://issues.apache.org/jira/browse/PDFBOX-5158
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 3.0.0 PDFBox
> Reporter: Tim Allison
> Priority: Critical
>
> I found a bunch of files that had a "read too many EOFs", which is a safety
> check we now do in TikaInputStream to identify parsers that read an EOF > 1000
> times, which may be a sign of an infinite loop.
> When I turn off this safety check in TikaInputStream, I get an infinite loop.
> This is one of the triggering files:
> https://corpora.tika.apache.org/base/docs/commoncrawl3/OE/OELHPKYAQPDNDWC535NE23Z6FKYRMN7W
> It's a truncated file from Common Crawl.
> The stacktrace when this is thrown is:
> {noformat}
> afterRead:809, TikaInputStream (org.apache.tika.io)
> read:82, ProxyInputStream (org.apache.commons.io.input)
> <init>:113, RandomAccessReadBuffer (org.apache.pdfbox.io)
> loadPDF:454, Loader (org.apache.pdfbox)
> loadPDF:430, Loader (org.apache.pdfbox)
> getPDDocument:189, PDFParser (org.apache.tika.parser.pdf)
> parse:148, PDFParser (org.apache.tika.parser.pdf)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:288, CompositeParser (org.apache.tika.parser)
> parse:150, AutoDetectParser (org.apache.tika.parser)
> parse:157, RecursiveParserWrapper (org.apache.tika.parser)
> getRecursiveMetadata:379, TikaTest (org.apache.tika)
> getRecursiveMetadata:369, TikaTest (org.apache.tika)
> getRecursiveMetadata:357, TikaTest (org.apache.tika)
> getRecursiveMetadata:351, TikaTest (org.apache.tika)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
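
For readers unfamiliar with the "read too many EOFs" safety check described in the quoted issue, the idea can be sketched roughly as below. This is only an illustration, not Tika's actual TikaInputStream code: the class name EofGuardInputStream, the eofReads() accessor, and the exact exception message are hypothetical, and the limit of 1000 is taken from the issue text. The wrapper counts how often a consumer reads past end-of-stream and fails once that count exceeds the limit, so a parser stuck in a loop at EOF gets an exception instead of spinning forever.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical guard: fails once a consumer has hit end-of-stream too many times. */
class EofGuardInputStream extends FilterInputStream {
    private final int maxEofReads; // allowed number of reads that return -1
    private int eofReads;          // how many reads have returned -1 so far

    EofGuardInputStream(InputStream in, int maxEofReads) {
        super(in);
        this.maxEofReads = maxEofReads;
    }

    @Override
    public int read() throws IOException {
        return check(super.read());
    }

    // FilterInputStream.read(byte[]) delegates here, so it is covered as well.
    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        return check(super.read(buf, off, len));
    }

    /** Counts EOF results and throws once the limit is exceeded. */
    private int check(int result) throws IOException {
        if (result == -1 && ++eofReads > maxEofReads) {
            throw new IOException("read too many EOFs: " + eofReads);
        }
        return result;
    }

    int eofReads() {
        return eofReads;
    }
}

class EofGuardDemo {
    public static void main(String[] args) {
        InputStream in = new EofGuardInputStream(
                new ByteArrayInputStream(new byte[] {1}), 1000);
        try {
            // Simulates a buggy parser that never stops reading at EOF.
            while (true) {
                in.read();
            }
        } catch (IOException e) {
            System.out.println("guard tripped: " + e.getMessage());
        }
    }
}
```

Such a guard only turns an endless loop into a visible failure. The underlying parser loop (here, the handling of repeated xref and %%EOF positions) still has to be fixed separately.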