Thierry Guérin created TIKA-4098:
------------------------------------

             Summary: Detection fail on PDF with garbage before header
                 Key: TIKA-4098
                 URL: https://issues.apache.org/jira/browse/TIKA-4098
             Project: Tika
          Issue Type: Bug
          Components: core
    Affects Versions: 2.8.0
            Reporter: Thierry Guérin
         Attachments: garbageBeforeHeader.pdf

PDF detection fails on files that contain too much garbage before the 
'%PDF-' header.

Those PDFs do not respect the specification, but are nonetheless correctly 
handled by PDF viewers.

The attached PDF is an example of the garbage found in a real-life PDF (it 
looks like email headers that 'leaked' into the PDF file). The PDF itself is 
one that I generated so that the example is small.

The current magic for PDFs limits the search for the '%PDF-' header to the 
first 512 bytes, and in the attached PDF it is located after 702 garbage bytes.
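
To make the numbers concrete, here is a minimal sketch of a bounded header 
search. It is not Tika's implementation: the function name `has_pdf_header` 
and the `max_offset` parameter are hypothetical, and it assumes the limit 
applies to the offset at which the marker starts.

```python
PDF_MARKER = b"%PDF-"


def has_pdf_header(data: bytes, max_offset: int = 512) -> bool:
    """Return True if the marker starts at an offset in [0, max_offset].

    Hypothetical helper for illustration only; the inclusive upper bound
    is an assumption, not something confirmed by the Tika sources.
    """
    # bytes.find only reports a match that lies entirely inside [start, end),
    # so extend the end by the marker length to allow a start at max_offset.
    return data.find(PDF_MARKER, 0, max_offset + len(PDF_MARKER)) != -1


# 702 garbage bytes followed by a PDF header, as in the attached file.
sample = b"X" * 702 + b"%PDF-1.4\n"

found_with_512 = has_pdf_header(sample, max_offset=512)
found_with_768 = has_pdf_header(sample, max_offset=768)
```

With this sample, the marker starts at offset 702, so the 512-byte window 
misses it while a 768-byte window finds it.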

I looked at the sources of PDFBox and Ghostscript to see how they handle this 
case:
 * Ghostscript searches through the entire file (see 
[https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c] lines 
1323-1339)
 * PDFBox reads the file line by line, and stops looking for the header when 
it encounters a line that starts with a digit (see 
[https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java]
 lines 1561-)
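
The two strategies above can be sketched roughly as follows. This is only an 
illustration of the behaviour as described in this issue, not a port of the 
Ghostscript or PDFBox code; both function names are hypothetical, and checking 
a line for the marker before testing for a leading digit is an assumption.

```python
PDF_MARKER = b"%PDF-"


def find_header_whole_file(data: bytes) -> int:
    """Search the entire input; return the marker offset or -1."""
    return data.find(PDF_MARKER)


def find_header_line_by_line(data: bytes) -> int:
    """Scan line by line; give up at the first line starting with a digit.

    Assumption: a line is checked for the marker first, and the digit test
    only applies to lines that do not contain it.
    """
    offset = 0
    for line in data.splitlines(keepends=True):
        pos = line.find(PDF_MARKER)
        if pos != -1:
            return offset + pos
        if line[:1].isdigit():
            # Looks like the start of an object ("1 0 obj"): stop searching.
            return -1
        offset += len(line)
    return -1


garbage = b"Received: from mail\nSubject: hi\n"
leaked = garbage + b"%PDF-1.4\n1 0 obj\n"
late_header = b"1 0 obj\n%PDF-1.4\n"
```

On `leaked`, both functions return the offset just past the garbage. On 
`late_header`, the whole-file search still finds the marker, while the 
line-by-line scan stops at the first line because it starts with a digit.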

From the doc in tika-mimetypes.xml for the application/pdf MIME type, I 
understand that increasing the maximum offset can trigger false positives. I 
increased it to 768, and the unit tests pass, but I didn't find any PDF that 
tests this particular case, so either such a test doesn't exist or there are 
integration tests that aren't part of this repo?

How can I go about testing for regressions? I can provide a pull request for 
this change, but where should I put the test PDF and a unit test?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
