[ https://issues.apache.org/jira/browse/TIKA-1770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178184#comment-17178184 ]
Peter Lee commented on TIKA-1770:
---------------------------------

I tested the 3 attached files with tika-1.24.1. Here are Tika's content-type detection results:
||File Name||Content Type||
|the-acl-rd-tec_chunk_15.txt|audio/mpeg|
|the-acl-rd-tec_chunk_9113.txt|image/x-portable-bitmap|
|the-acl-rd-tec_chunk_10228.txt|image/x-portable-bitmap|

Reason:
The content of `the-acl-rd-tec_chunk_15.txt` starts with the string "ID3", which is the magic byte sequence for audio/mpeg.
The content of `the-acl-rd-tec_chunk_9113.txt` starts with the string "P1", which is a magic byte sequence for image/x-portable-bitmap.
The content of `the-acl-rd-tec_chunk_10228.txt` starts with the string "P4", which is a magic byte sequence for image/x-portable-bitmap.

After searching online for these two formats, I couldn't find a way to improve their magic-byte match configuration. Maybe we should set up a rule: for some formats, both the file extension and the magic bytes must match.

> AutoDetectParser wrongly detects plain text as images/audio
> -----------------------------------------------------------
>
>                 Key: TIKA-1770
>                 URL: https://issues.apache.org/jira/browse/TIKA-1770
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.10
>         Environment: OS independent (tested on both Windows, MAC OS)
>            Reporter: Ziqi
>            Priority: Minor
>         Attachments: the-acl-rd-tec_chunk_10228.txt,
> the-acl-rd-tec_chunk_15.txt, the-acl-rd-tec_chunk_9113.txt
>
>
> AutoDetectParser fails to recognize certain plain-text files as plain text.
> In the attachment are three testing files, as you can see they are all plain
> text.
> The following code is used for testing:
> ----------------
> AutoDetectParser parser = new AutoDetectParser();
> for (File f : new File("path").listFiles()) {
>     InputStream in = new BufferedInputStream(new FileInputStream(f.toString()));
>     BodyContentHandler handler = new BodyContentHandler(-1);
>     Metadata metadata = new Metadata();
>     try {
>         parser.parse(in, handler, metadata);
>         String content = handler.toString();
>         System.out.println(metadata); //line A
>     }catch (Exception e){
>         e.printStackTrace();
>     }
> }
> ----------------
> for the three testing files, line A prints the following:
> X-Parsed-By=org.apache.tika.parser.EmptyParser
> Content-Type=image/x-portable-bitmap
> X-Parsed-By=org.apache.tika.parser.DefaultParser
> X-Parsed-By=org.apache.tika.parser.mp3.Mp3Parser xmpDM:audioCompressor=MP3
> Content-Type=audio/mpeg
> X-Parsed-By=org.apache.tika.parser.EmptyParser
> Content-Type=image/x-portable-bitmap
> And as a result, variable "content" is always empty.

--
This message was sent by Atlassian Jira (v8.3.4#803005)
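
To make the rule proposed in the comment concrete (some formats must match on both file extension and magic bytes), here is a minimal sketch in plain Java. It does not use or reflect Tika's detector API: the class StrictMagicSketch, its detect method, the table of "weak" magics, the extension mapping, and the fallback types are all hypothetical and chosen only for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical sketch of "extension AND magic must match"; not a Tika API. */
final class StrictMagicSketch {

    // Assumed table: {magic prefix, MIME type, required extension}.
    // These short prefixes are the ones named in the comment above.
    private static final String[][] WEAK_MAGICS = {
        {"ID3", "audio/mpeg", "mp3"},
        {"P1", "image/x-portable-bitmap", "pbm"},
        {"P4", "image/x-portable-bitmap", "pbm"},
    };

    /** Returns a MIME type; a weak magic only counts if the extension agrees. */
    static String detect(String fileName, byte[] head) {
        String ext = extensionOf(fileName);
        for (String[] m : WEAK_MAGICS) {
            byte[] prefix = m[0].getBytes(StandardCharsets.US_ASCII);
            if (startsWith(head, prefix) && m[2].equals(ext)) {
                return m[1];
            }
        }
        // Assumed fallback for this sketch: trust ".txt", otherwise give up.
        return "txt".equals(ext) ? "text/plain" : "application/octet-stream";
    }

    /** Lower-cased text after the last dot, or "" if there is none. */
    private static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True if data begins with all bytes of prefix. */
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Made-up sample content; only the leading "ID3" mirrors the report.
        byte[] head = "ID3 made-up sample text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect("the-acl-rd-tec_chunk_15.txt", head));
        System.out.println(detect("song.mp3", head));
    }
}
```

Under this rule a .txt file that happens to start with "ID3", "P1" or "P4" is no longer claimed by the audio or bitmap type, while a file with the matching extension still is. The trade-off is that a real MP3 or PBM file with a wrong or missing extension would be missed, so a rule like this would presumably need to be limited to very short magics.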