[ https://issues.apache.org/jira/browse/TIKA-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741578#comment-17741578 ]
Nick Burch commented on TIKA-4098:
----------------------------------

The more bytes beyond the start we check for the PDF marker, the more likely we are to mis-identify a different file as a PDF. The %PDF- marker is pretty unique at the start of a file, but progressively less so as the content continues. (Consider a markdown file of a talk on file formats, which could easily have the text "Look for %PDF- at the start" on page 10, and we don't want to mark the whole thing as a PDF!)

If you know for sure that a file is a PDF, just skip detection and tell Tika, and we'll hand it off to the PDF parser! If your use case has very few text-based formats, you can fairly safely bump the search window up. Out-of-the-box, I'd be very worried to push it much further due to the false positive risk.

> Detection fails on PDF with garbage before header
> -------------------------------------------------
>
>                 Key: TIKA-4098
>                 URL: https://issues.apache.org/jira/browse/TIKA-4098
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.8.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>         Attachments: garbageBeforeHeader.pdf
>
>
> PDF detection fails on files that contain too much garbage before the header
> '%PDF-'.
> Those PDFs do not respect the specification, but are nonetheless correctly
> handled by PDF viewers.
> The attached PDF is an example of the garbage found in a real-life PDF (looks
> like email headers that 'leaked' into the PDF file). The PDF itself is one
> that I generated so that the example is small.
> The current magic for PDFs limits the search for the '%PDF-' header to 512
> bytes, and in the attached PDF it's located after 702 garbage bytes.
> I looked at the sources of PdfBox and Ghostscript to see how they handle this
> case:
> * Ghostscript searches through the entire file (see
> [https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c]
> lines 1323-1339)
> * PdfBox reads the file line by line, and stops looking for the header when
> it encounters a line that starts with a digit (see
> [https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java]
> lines 1561-)
> From the doc in tika-mimetypes.xml for the application/pdf MIME type, I
> understand that increasing the maximum offset can trigger false positives. I
> increased it to 768, and the unit tests pass, but I didn't find any PDF that
> tests this particular case, so either it doesn't exist or there are
> integration tests that aren't part of this repo?
> How can I go about testing for regressions? I can provide a pull request for
> this change, but where do I put the test PDF and a unit test?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
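
As a rough illustration of the trade-off discussed in this thread, here is a minimal Python sketch of a bounded search for the %PDF- marker. It is not Tika's implementation: the function name `marker_within`, the sample byte strings, and the exact window semantics (the marker may start at any offset from 0 up to and including `max_offset`) are assumptions made for the example. Only the 512 and 768 limits and the 702 bytes of leading garbage come from the issue.

```python
PDF_MARKER = b"%PDF-"


def marker_within(data: bytes, max_offset: int) -> bool:
    """Return True if the marker starts at an offset in [0, max_offset].

    Hypothetical helper for illustration only; the window semantics are
    an assumption, not a description of how Tika evaluates its magic.
    """
    # bytes.find only reports matches that lie entirely inside [start, end),
    # so the end bound is extended by the marker length.
    end = max_offset + len(PDF_MARKER)
    return data.find(PDF_MARKER, 0, end) != -1


# A PDF with 702 bytes of garbage before the header, as in the issue.
garbage_pdf = b"X" * 702 + b"%PDF-1.4\n"

# A plain-text file that merely mentions the marker (made-up content).
talk_notes = (
    b"# Talk on file formats\n"
    + b"filler text\n" * 50
    + b"Look for %PDF- at the start\n"
)

for limit in (512, 768):
    print(limit, marker_within(garbage_pdf, limit), marker_within(talk_notes, limit))
```

With these made-up inputs, raising the limit from 512 to 768 makes the garbage-prefixed PDF match, but it also makes the text file match, which is the false positive risk described in the comment.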