[jira] [Commented] (TIKA-4098) Detection fails on PDF with garbage before header

Nick Burch (Jira) Mon, 10 Jul 2023 05:01:05 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17741578#comment-17741578
 ]


Nick Burch commented on TIKA-4098:
----------------------------------

The more bytes beyond the start we check for the PDF marker, the more likely we 
are to mis-identify a different file as a PDF. The %PDF- marker is pretty 
unique at the start of a file, but progressively less so as the content 
continues. (Consider a Markdown file of a talk on file formats: it could 
easily contain the text "Look for %PDF- at the start" on page 10, and we 
don't want to mark the whole thing as a PDF!)

If you know for sure that a file is a PDF, just skip detection and tell Tika 
the type explicitly, and we'll hand it off to the PDF parser!

If your use case has very few text-based formats, you can fairly safely bump 
the search window up. Out of the box, I'd be very worried to push it much 
further due to the false positive risk.
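
To make the trade-off concrete, here is a minimal Java sketch of a windowed 
marker search. It is not Tika's actual magic-matching code: the class and 
method names (PdfMarkerWindow, hasMarker, withPrefix) are made up for 
illustration, and the 512 and 768 offsets are simply the ones mentioned in 
the issue.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Tika.
class PdfMarkerWindow {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if "%PDF-" starts at any offset in [0, maxOffset]. */
    static boolean hasMarker(byte[] data, int maxOffset) {
        int last = Math.min(maxOffset, data.length - MARKER.length);
        for (int start = 0; start <= last; start++) {
            boolean match = true;
            for (int i = 0; i < MARKER.length; i++) {
                if (data[start + i] != MARKER[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return true;
            }
        }
        return false;
    }

    /** Builds a buffer of prefixLen filler bytes followed by tail. */
    static byte[] withPrefix(int prefixLen, String tail) {
        byte[] t = tail.getBytes(StandardCharsets.US_ASCII);
        byte[] out = new byte[prefixLen + t.length];
        Arrays.fill(out, 0, prefixLen, (byte) 'x');
        System.arraycopy(t, 0, out, prefixLen, t.length);
        return out;
    }

    public static void main(String[] args) {
        // A PDF with 702 bytes of garbage before the header, as in the report.
        byte[] leakyPdf = withPrefix(702, "%PDF-1.4\n");
        // A plain-text file that merely mentions the marker further in.
        byte[] talkNotes = withPrefix(600, "Look for %PDF- at the start\n");

        for (int window : new int[] {512, 768}) {
            System.out.println("window=" + window
                    + " leakyPdf=" + hasMarker(leakyPdf, window)
                    + " talkNotes=" + hasMarker(talkNotes, window));
        }
    }
}
```

With the 512 window neither buffer matches; with 768 both do, so the wider 
window rescues the broken PDF and mislabels the text file in the same step.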

> Detection fails on PDF with garbage before header
> -------------------------------------------------
>
>                 Key: TIKA-4098
>                 URL: https://issues.apache.org/jira/browse/TIKA-4098
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.8.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>         Attachments: garbageBeforeHeader.pdf
>
>
> PDF detection fails on files that contain too much garbage before the header 
> '%PDF-'.
> Those PDFs do not respect the specification, but are nonetheless correctly 
> handled by PDF viewers.
> The attached PDF is an example of the garbage found in a real-life PDF (looks 
> like email headers that 'leaked' into the PDF file). The PDF itself is one 
> that I generated so that the example is small.
> The current magic for PDFs limits the search for the '%PDF-' header to 512 
> bytes, and in the attached PDF it's located after 702 garbage bytes.
> I looked at the sources of PdfBox and Ghostscript to see how they handle this 
> case and:
>  * Ghostscript searches through the entire file (see 
> [https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c] 
> lines 1323-1339)
>  * PdfBox reads the file line by line, and stops looking for the header when  
> it encounters a line that starts with a digit (see 
> [https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java]
>  lines 1561-)
> From the doc in tika-mimetypes.xml for the application/pdf MIME type, I 
> understand that increasing the maximum offset can trigger false positives. I 
> increased it to 768, and the unit tests pass, but I didn't find any PDF that 
> tests this particular case, so either one doesn't exist or there are 
> integration tests that aren't part of this repo?
> How can I go about testing for regressions? I can provide a pull request for 
> this change, but where do I put the test PDF and a unit test?
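
For comparison, the line-based strategy the report attributes to PdfBox can 
be sketched as follows. This is a simplified illustration written from that 
one-sentence description, not PdfBox's actual code: the class and method 
names are hypothetical, and it assumes the marker may appear anywhere within 
a line and is checked before the digit test.

```java
// Hypothetical sketch of a line-by-line header scan; not PdfBox code.
class HeaderLineScan {

    /**
     * Returns the zero-based index of the first line containing "%PDF-",
     * or -1 if a line starting with a digit (or the end of input) is
     * reached first.
     */
    static int findHeaderLine(String content) {
        String[] lines = content.split("\\r?\\n");
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.contains("%PDF-")) {
                return i;
            }
            // A leading digit suggests object data such as "1 0 obj",
            // so give up instead of scanning the rest of the file.
            if (!line.isEmpty() && Character.isDigit(line.charAt(0))) {
                return -1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String leaky = "Received: from example\nSubject: hello\n%PDF-1.4\n1 0 obj\n";
        String noHeader = "junk\n1 0 obj\n%PDF-1.4\n";
        System.out.println(findHeaderLine(leaky));
        System.out.println(findHeaderLine(noHeader));
    }
}
```

Here the stop condition depends on the content rather than on a fixed byte 
offset, which is reasonable for a parser that already expects a PDF but is a 
different problem from deciding the type of an arbitrary file.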



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
