[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tilman Hausherr closed TIKA-4276.
---------------------------------
    Resolution: Not A Bug

> Tika fails to detect damaged pdf
> --------------------------------
>
>                 Key: TIKA-4276
>                 URL: https://issues.apache.org/jira/browse/TIKA-4276
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Xiaohong Yang
>            Priority: Major
>
> We use Tika to determine the type and extension of files. However, Tika 
> detects some damaged PDF files as text files.
> Could Tika be made to detect such damaged files as PDF, with the matching 
> extension?
> The tika-config.xml and the sample PDF file are available at 
> [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]; the sample 
> code follows.
> The operating system is Ubuntu 20.04, the Java version is 21, the Tika 
> version is 2.9.2, and the POI version is 5.2.3.
>  
>  
> {code:java}
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.mime.MimeType;
>  
> import java.io.FileInputStream;
>  
> public class DetectDamagedPDF {
>  
>     public static void main(String[] args) {
>         String filePath = "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
>         try {
>             TikaConfig config = new TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
>             Detector detector = config.getDetector();
>             Metadata metadata = new Metadata();
>             // Pass the file name so the detector can use it as a hint alongside the content.
>             metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
>             // try-with-resources closes both streams even if detection throws.
>             try (FileInputStream fis = new FileInputStream(filePath);
>                  TikaInputStream stream = TikaInputStream.get(fis)) {
>                 MediaType mediaType = detector.detect(stream, metadata);
>                 // Look up the registered MIME type to get its preferred extension.
>                 MimeType mimeType = config.getMimeRepository().forName(mediaType.toString());
>                 String tikaExtension = mimeType.getExtension();
>                 System.out.println("tikaExtension = " + tikaExtension);
>             }
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> }
> {code}
>  
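
When a file like the one described above is reported as text, it can help to check whether it still carries a PDF signature at all. Well-formed PDF files begin with the marker "%PDF-", and content-based detection generally depends on finding such a marker. The following is a minimal sketch, not Tika's implementation: the class name PdfHeaderCheck, its methods, and the 1024-byte scan window are all assumptions made for illustration, and a missing marker is only one possible reason for the reported result.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Tika): checks whether the start of a
// stream contains the "%PDF-" marker that well-formed PDF files begin with.
public class PdfHeaderCheck {

    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Assumption of this sketch: only the first 1024 bytes are inspected.
    private static final int SCAN_LIMIT = 1024;

    // Reads up to SCAN_LIMIT bytes and reports whether the marker occurs in them.
    public static boolean hasPdfHeader(InputStream in) throws IOException {
        byte[] head = in.readNBytes(SCAN_LIMIT); // requires Java 11 or later
        return indexOf(head, PDF_MAGIC) >= 0;
    }

    // Naive byte-sequence search; returns the first match index or -1.
    public static int indexOf(byte[] haystack, byte[] needle) {
        outer:
        for (int i = 0; i + needle.length <= haystack.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) {
                    continue outer;
                }
            }
            return i;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // In-memory samples; a real check would open the file under test instead.
        byte[] intact = "%PDF-1.7\n...".getBytes(StandardCharsets.US_ASCII);
        byte[] damaged = "1 0 obj\n<< /Type /Catalog >>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("intact:  " + hasPdfHeader(new ByteArrayInputStream(intact)));
        System.out.println("damaged: " + hasPdfHeader(new ByteArrayInputStream(damaged)));
    }
}
```

If the marker is absent, the file no longer identifies itself as a PDF, so a content-based detector has little to distinguish it from plain text; in that case the file name passed via RESOURCE_NAME_KEY is the only remaining hint.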



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
