[ 
https://issues.apache.org/jira/browse/TIKA-851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194813#comment-13194813
 ] 

Nick Burch commented on TIKA-851:
---------------------------------

It looks like most files (though I'm not sure it's all of them) have an ftyp 
atom whose type field starts at byte 4. This is "ftyp" followed by a 4-byte 
string (space padded if needed) giving the main type. There's a list of the 
common ones at http://www.ftyps.com/
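
As a rough illustration of that layout (not the actual change, which is done 
with magic matches in Tika's mime type definitions), reading the main type by 
hand could look like the sketch below. FtypSniffer and majorBrand are made-up 
names for this example, and it assumes the caller has already read at least 
the first 12 bytes of the file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika.
final class FtypSniffer {
    private FtypSniffer() {}

    /**
     * Returns the 4-character main type (e.g. "M4V ") if bytes 4..7 of the
     * header spell "ftyp", or null if the header is shorter than 12 bytes
     * or has something other than ftyp at offset 4.
     */
    static String majorBrand(byte[] header) {
        if (header == null || header.length < 12) {
            return null;
        }
        // Bytes 0..3 hold the atom size; the atom type starts at byte 4.
        String type = new String(header, 4, 4, StandardCharsets.US_ASCII);
        if (!"ftyp".equals(type)) {
            return null;
        }
        // The space-padded main type follows immediately at bytes 8..11.
        return new String(header, 8, 4, StandardCharsets.US_ASCII);
    }
}
```

A null result here corresponds to the case where only the lower priority 
fallback can match.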

I've added more specific matches for the common types in r1236700. Using the 
tika-app jar, I can now correctly detect mp4 video, Apple m4v video, mp4 audio, 
and old QuickTime movs (using the lower-priority fallback).

I'm not sure whether the ftyp atom has to come first. If it doesn't, this 
detection won't work for files where another atom precedes it. Longer term, a 
proper format-aware detector would be best, ideally one that can also 
understand the rest of the format and report on the different streams etc.
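
To give an idea of what such a detector might start from, here is a minimal 
sketch that walks the top-level atoms instead of looking only at byte 4. It 
assumes the usual atom header (a 4-byte big-endian size that includes the 
header, then a 4-byte type, with size 1 meaning a 64-bit size follows and size 
0 meaning the atom runs to the end of the file). AtomWalker and topLevelTypes 
are hypothetical names, not anything in Tika.

```java
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch for illustration only; not part of Tika.
final class AtomWalker {
    private AtomWalker() {}

    /** Lists the 4-character types of the top-level atoms, in file order. */
    static List<String> topLevelTypes(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<String> types = new ArrayList<>();
        byte[] type = new byte[4];
        try {
            while (true) {
                // 32-bit big-endian size, read as unsigned.
                long size = data.readInt() & 0xFFFFFFFFL;
                data.readFully(type);
                types.add(new String(type, StandardCharsets.US_ASCII));

                long headerLength = 8;
                if (size == 1) {
                    // A 64-bit size follows the type field.
                    size = data.readLong();
                    headerLength = 16;
                } else if (size == 0) {
                    // This atom extends to the end of the file.
                    break;
                }

                long toSkip = size - headerLength;
                if (toSkip < 0) {
                    break; // malformed size, stop here
                }
                while (toSkip > 0) {
                    long skipped = data.skip(toSkip);
                    if (skipped <= 0) {
                        return types; // truncated file
                    }
                    toSkip -= skipped;
                }
            }
        } catch (EOFException e) {
            // End of stream (or a truncated header): return what was found.
        }
        return types;
    }
}
```

A detector built on this could check whether "ftyp" appears anywhere in the 
returned list rather than only at the start, and descend into the other atoms 
to report on streams.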
                
> M4V and M4A detection invalid
> -----------------------------
>
>                 Key: TIKA-851
>                 URL: https://issues.apache.org/jira/browse/TIKA-851
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.0
>            Reporter: Alexander Chow
>             Fix For: 1.1
>
>
> When the mime type of an M4V file is detected using its name only, it returns 
> video/x-m4v.  When it is detected using the InputStream (hence utilising the 
> MagicDetector), it incorrectly returns video/quicktime.
> Using the sample M4V file from Apple's [knowledge base|http://support.apple.com/kb/HT1425]:
> {code:title=TikaTest.java}
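> import java.io.File;
> import java.io.InputStream;
> import org.apache.tika.detect.DefaultDetector;
> import org.apache.tika.detect.Detector;
> import org.apache.tika.io.TikaInputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.mime.MimeTypes;
>
> public class TikaTest {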
>       public static void main(String[] args) throws Exception {
>               String userHome = System.getProperty("user.home");
>               File file = new File(userHome + "/Desktop/sample_iPod.m4v");
>               InputStream is = TikaInputStream.get(file);
>               Detector detector = new DefaultDetector(
>                       MimeTypes.getDefaultMimeTypes());
>               Metadata metadata = new Metadata();
>               metadata.set(Metadata.RESOURCE_NAME_KEY, file.getName());
>               System.out.println("File + filename: " + detector.detect(is, metadata));
>               System.out.println("File only:       " + detector.detect(is, new Metadata()));
>               System.out.println("Filename only:   " + detector.detect(null, metadata));
>       }
> }
> {code}
> Renders the output:
> {code}
> File + filename: video/quicktime
> File only:       video/quicktime
> Filename only:   video/x-m4v
> {code}
> Moreover, if the same test is run against an M4A file, the results are even 
> more incorrect:
> {code}
> File + filename: video/quicktime
> File only:       video/quicktime
> Filename only:   application/octet-stream
> {code}

