[jira] [Commented] (TIKA-4098) Detection fails on PDF with garbage before header

Jira Tue, 11 Jul 2023 07:30:46 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17742035#comment-17742035
 ]


Thierry Guérin commented on TIKA-4098:
--------------------------------------

[~nick] I use Tika to detect the MIME type of files that come from emails, 
and experience has shown that you can't trust email headers to correctly 
report the MIME type of attachments. So I really can't make any assumptions 
about the type, as it can range from images to Word documents.

I understand the concern about markdown files, though. I just found it 
interesting that I encountered such a PDF in the wild (an e-invoice) and 
thought that raising the limit a bit might be worth it. But then again, I have 
no hard statistics on how many such files are generated or corrupted by poorly 
written software, so if you think it's too risky, I'll just use 
custom-mimetypes.xml to tune the limit for my own setup.

[~tilman] in an email that contains a PDF, the PDF will be encoded (most 
likely in base64), so it's not an issue. In theory you can have a PDF with only 
ASCII characters, but in practice all the libraries that I know of compress by 
default, making the PDF non-ASCII, and email clients don't take the risk and 
encode PDFs anyway. But then again, it COULD happen.

> Detection fails on PDF with garbage before header
> -------------------------------------------------
>
>                 Key: TIKA-4098
>                 URL: https://issues.apache.org/jira/browse/TIKA-4098
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 2.8.0
>            Reporter: Thierry Guérin
>            Priority: Minor
>         Attachments: garbageBeforeHeader.pdf
>
>
> PDF detection fails on files that contain too much garbage before the header 
> '%PDF-'.
> Those PDFs do not respect the specification, but are nonetheless correctly 
> handled by PDF viewers.
> The attached PDF is an example of the garbage found in a real-life PDF (looks 
> like email headers that 'leaked' into the PDF file). The PDF itself is one 
> that I generated so that the example is small.
> The current magic for PDFs limits the search for the '%PDF-' header to 512 
> bytes, and in the attached PDF it's located after 702 garbage bytes.
> I looked at the sources of PdfBox and Ghostscript to see how they handle this 
> case and:
>  * Ghostscript searches through the entire file (see 
> [https://github.com/ArtifexSoftware/ghostpdl/blob/master/pdf/ghostpdf.c] 
> lines 1323-1339)
>  * PdfBox reads the file line by line, and stops looking for the header when  
> it encounters a line that starts with a digit (see 
> [https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdfparser/COSParser.java]
>  lines 1561-)
> From the doc in tika-mimetypes.xml for the application/pdf MIME type, I 
> understand that increasing the maximum offset can trigger false positives. I 
> increased it to 768, and the unit tests pass, but I didn't find any PDF that 
> tests this particular case, so either no such test exists or there are 
> integration tests that aren't part of this repo?
> How can I go about testing for regressions? I can provide a pull request for 
> this change, but where do I put the test PDF and a unit test?
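
For illustration only, here is a minimal Java sketch of the bounded header 
search discussed in the description above. It is not Tika's implementation: 
the class name PdfHeaderScan and the method findHeader are hypothetical, and it 
assumes the limit bounds the offset at which the '%PDF-' marker may start. The 
702-byte filler stands in for the garbage in the attached file.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Tika, PDFBox or Ghostscript.
class PdfHeaderScan {
    private static final byte[] MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Returns the offset of the first "%PDF-" marker that starts at or before
     * maxOffset, or -1 if there is none in that range.
     * Assumption: the limit applies to the marker's start offset.
     */
    static int findHeader(byte[] data, int maxOffset) {
        // Never read past the end of the buffer.
        int lastStart = Math.min(maxOffset, data.length - MARKER.length);
        for (int i = 0; i <= lastStart; i++) {
            if (matchesAt(data, i)) {
                return i;
            }
        }
        return -1;
    }

    // True if MARKER occurs in data starting exactly at index pos.
    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < MARKER.length; j++) {
            if (data[pos + j] != MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // 702 filler bytes followed by a PDF header line.
        byte[] header = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] file = new byte[702 + header.length];
        Arrays.fill(file, 0, 702, (byte) 'x');
        System.arraycopy(header, 0, file, 702, header.length);

        System.out.println(findHeader(file, 512)); // prints -1
        System.out.println(findHeader(file, 768)); // prints 702
    }
}
```

Passing Integer.MAX_VALUE as maxOffset turns this into a whole-buffer search, 
which is roughly the Ghostscript behaviour described above; the larger the 
limit, the more likely an unrelated text file that merely mentions '%PDF-' is 
matched.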



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
