Julio J. Gomez Diaz created TIKA-3965:
-----------------------------------------

             Summary: Detector for valid PDF files
                 Key: TIKA-3965
                 URL: https://issues.apache.org/jira/browse/TIKA-3965
             Project: Tika
          Issue Type: Bug
          Components: tika-core
    Affects Versions: 2.6.0
            Reporter: Julio J. Gomez Diaz
         Attachments: test2.pdf

When we use MagicDetector, or content-based detection via DefaultDetector, 
Tika identifies an invalid file such as the attached one as a PDF. The file 
has this simple content:

 
{code:java}
<script>alert(1)</script>
%PDF-1.7{code}
 

Is there an alternative detector in Tika that reads the whole file content, so 
that a non-valid PDF file is not detected as PDF?

If there is not, would it make sense to implement one? Which would be the 
right Java package for it?
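
To illustrate the kind of stricter check meant here, a minimal sketch follows. 
It is not Tika API: StrictPdfCheck, looksLikeStructuredPdf and the 1024-byte 
tail window are hypothetical names and values chosen for this example. It only 
requires the %PDF- header at offset 0 and a %%EOF marker near the end of the 
data, so it is a heuristic and not full PDF validation.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of Tika: a structural check beyond a magic-byte match. */
final class StrictPdfCheck {

    private static final byte[] HEADER = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    // Assumption for this sketch: search for the end marker only in the last 1024 bytes.
    private static final int TAIL_WINDOW = 1024;

    private StrictPdfCheck() {
    }

    /** Returns true only if the data starts with "%PDF-" and has "%%EOF" near the end. */
    static boolean looksLikeStructuredPdf(byte[] data) {
        if (data == null || data.length < HEADER.length + EOF_MARKER.length) {
            return false;
        }
        // The header must sit at offset 0; leading bytes such as a <script> tag fail here.
        for (int i = 0; i < HEADER.length; i++) {
            if (data[i] != HEADER[i]) {
                return false;
            }
        }
        // Look for the end-of-file marker in the tail of the data, after the header.
        int from = Math.max(HEADER.length, data.length - TAIL_WINDOW);
        return indexOf(data, EOF_MARKER, from) >= 0;
    }

    /** Naive byte-array search starting at index 'from'; returns -1 if not found. */
    private static int indexOf(byte[] haystack, byte[] needle, int from) {
        for (int i = from; i <= haystack.length - needle.length; i++) {
            int j = 0;
            while (j < needle.length && haystack[i + j] == needle[j]) {
                j++;
            }
            if (j == needle.length) {
                return i;
            }
        }
        return -1;
    }
}
```

With the attached sample, the header check fails at the first byte, so the 
data would not be reported as PDF. Requiring the header at offset 0 is a 
deliberately strict choice and may reject files that more lenient readers 
accept.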

 

This sample file is rejected as invalid by Adobe Reader and by every online 
PDF processor we found, but Tika identifies it as PDF.

 

Thanks in advance



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
