[ https://issues.apache.org/jira/browse/TIKA-4276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed TIKA-4276.
---------------------------------
    Resolution: Not A Bug

> Tika fails to detect damaged pdf
> --------------------------------
>
>                 Key: TIKA-4276
>                 URL: https://issues.apache.org/jira/browse/TIKA-4276
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 2.9.2
>            Reporter: Xiaohong Yang
>            Priority: Major
>
> We use Tika to check file type and extension. However, with some damaged pdf
> files Tika detects them as text file.
> Wonder if you can make Tika detect the damaged pdf file as pdf file type and
> extension.
> Following is the sample code and the link to the tika-config.xml and the
> sample PDF file is
> [https://1drv.ms/u/s!AvHwMs711s9lgfhtXqh0ycQyzqfG2w?e=q6y2es]
> The operating system is Ubuntu 20.04. Java version is 21. Tika version is
> 2.9.2 and POI version is 5.2.3.
>
>
> {code:java}
> import org.apache.tika.config.TikaConfig;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.metadata.TikaCoreProperties;
> import org.apache.tika.mime.MediaType;
> import org.apache.tika.mime.MimeType;
>
> import java.io.FileInputStream;
>
> public class DetectDamagedPDF {
>
>     public static void main(String args[]) {
>         try {
>             String filePath =
>                 "/home/ubuntu/testdirs/testdir_damaged_pdf/DamagedPDF.pdf";
>             TikaConfig config = new
>                 TikaConfig("/home/ubuntu/testdirs/testdir_damaged_pdf/tika-config.xml");
>             Detector detector = config.getDetector();
>             Metadata metadata = new Metadata();
>             FileInputStream fis = new FileInputStream(filePath);
>             TikaInputStream stream = TikaInputStream.get(fis);
>             metadata.add(TikaCoreProperties.RESOURCE_NAME_KEY, filePath);
>             MediaType mediaType = detector.detect(stream, metadata);
>             MimeType mimeType =
>                 config.getMimeRepository().forName(mediaType.toString());
>             String tikaExtension = mimeType.getExtension();
>             System.out.println("tikaExtension = " + tikaExtension);
>         }
>         catch(Exception ex) {
>             ex.printStackTrace();
>         }
>     }
> }
> {code}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
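
The notification does not say how the sample file is damaged or why the issue was resolved as Not A Bug. As an illustrative aid for anyone triaging a similar file, the sketch below checks whether the `%PDF-` signature that opens a well-formed PDF appears near the start of a file. It uses only the JDK, not Tika. The names `PdfHeaderCheck`, `hasPdfSignature` and `WINDOW` are hypothetical, and the 1024-byte search window is an assumption for illustration, not Tika's actual detection rule.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class PdfHeaderCheck {

    // "%PDF-" is the signature that opens a well-formed PDF file.
    private static final byte[] SIGNATURE =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Assumed search window; NOT taken from Tika's configuration.
    public static final int WINDOW = 1024;

    /**
     * Returns true if the signature occurs entirely within the first
     * `window` bytes of `data`.
     */
    public static boolean hasPdfSignature(byte[] data, int window) {
        int limit = Math.min(data.length, window) - SIGNATURE.length;
        for (int start = 0; start <= limit; start++) {
            int i = 0;
            while (i < SIGNATURE.length && data[start + i] == SIGNATURE[i]) {
                i++;
            }
            if (i == SIGNATURE.length) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfHeaderCheck <file>");
            return;
        }
        Path path = Paths.get(args[0]);
        byte[] head;
        // Read at most WINDOW bytes; the stream is closed automatically.
        try (InputStream in = Files.newInputStream(path)) {
            head = in.readNBytes(WINDOW);
        }
        System.out.println("PDF signature in first " + WINDOW + " bytes: "
                + hasPdfSignature(head, WINDOW));
    }
}
```

If this prints `false` for a file, the file carries no PDF signature in the inspected range, so a detector that relies on file content has no PDF marker to match there. Whether that is the case for the reporter's sample is not stated in this message.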