[ 
https://issues.apache.org/jira/browse/TIKA-1180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17750493#comment-17750493
 ] 

Wladimir Leite commented on TIKA-1180:
--------------------------------------

I ran a test on a set of ~5,000 files collected from many different sources, 
and only ~{*}4%{*} were correctly identified by the signatures defined in 
*tika-mimetypes.xml* (file extensions were hidden for the test).

Inspecting the content of these files and going through the format 
specification ([https://www.matroska.org/index.html]), I created a modified 
configuration (shown below) that seems to work better: *100%* of the WEBM and 
MKV files were correctly identified; MKA still relies on the file extension, 
but these audio files are extremely rare (while both video formats are widely 
used).

This would improve the current configuration without requiring any 
additional code or libraries.

By the way, I also tested the detector mentioned above 
([https://github.com/OmarAssadi/matroska-tika]). It worked fine, but it missed 
25 videos (~0.5%) that are correctly identified by the signatures described 
below. The detector also doesn't handle MKA files.

 
{code:xml}
    <mime-type type="video/x-matroska">
        <magic priority="60">
            <match value="0x1A45DFA3" type="string" offset="0">
                <match value="matroska" type="string" offset="4:64">
                </match>
            </match>
        </magic>
        <glob pattern="*.mkv" />
    </mime-type>    

    <mime-type type="audio/x-matroska">
        <sub-class-of type="video/x-matroska" />
        <glob pattern="*.mka" />
    </mime-type>    
    
    <mime-type type="video/webm">
        <magic priority="60">
            <match value="0x1A45DFA3" type="string" offset="0">
                <match value="webm" type="string" offset="4:64">
                </match>
            </match>
        </magic>
        <glob pattern="*.webm" />
    </mime-type>{code}
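
For illustration only (this is not Tika code): a minimal Python sketch of what the nested matches above check. It assumes that offset="4:64" means the inner string may start at any offset from 4 to 64 inclusive; the function names sniff_matroska and match_at_range are hypothetical.

```python
from typing import Optional

# EBML header magic shared by all Matroska-based files (MKV, MKA, WEBM).
EBML_MAGIC = bytes.fromhex("1A45DFA3")


def match_at_range(data: bytes, value: bytes, start: int, end: int) -> bool:
    """Return True if `value` begins at some offset in [start, end] (inclusive)."""
    # Assumption: this mirrors an offset range such as "4:64" in the config.
    window = data[start:end + len(value)]
    return value in window


def sniff_matroska(data: bytes) -> Optional[str]:
    """Classify a file prefix by the EBML magic plus the DocType string."""
    if not data.startswith(EBML_MAGIC):
        return None
    if match_at_range(data, b"matroska", 4, 64):
        # MKA audio carries the same DocType, so it also lands here;
        # only the *.mka glob can tell it apart.
        return "video/x-matroska"
    if match_at_range(data, b"webm", 4, 64):
        return "video/webm"
    return None
```

Reading the first 72 bytes of a file is enough for this check, since the longest string ("matroska", 8 bytes) can start no later than offset 64.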

> Matroska (mkv, mka, webm) Detector
> ----------------------------------
>
>                 Key: TIKA-1180
>                 URL: https://issues.apache.org/jira/browse/TIKA-1180
>             Project: Tika
>          Issue Type: New Feature
>          Components: detector
>    Affects Versions: 1.5
>            Reporter: Nick Burch
>            Priority: Major
>              Labels: new-parser
>
> Following the work on TIKA-1177, we now have mimetype entries for the various 
> formats which are based on the Matroska container (mkv, mka, webm etc). 
> However, we are unable to properly identify the specific type just from some 
> mime magic.
> Instead, for fully accurate detection, we'll need a new Detector for the 
> Matroska family, which does some very simple container/stream processing to 
> work out what the container contains.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)