[ 
https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Jackson updated TIKA-1170:
---------------------------------

    Attachment: 0001-Added-CGM-test-file-test-and-improved-magic.patch

Patch containing test file, test, and improved magic.
                
> Insufficiently specific magic for binary image/cgm files
> --------------------------------------------------------
>
>                 Key: TIKA-1170
>                 URL: https://issues.apache.org/jira/browse/TIKA-1170
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.4
>            Reporter: Andrew Jackson
>            Priority: Minor
>         Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, 
> plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archive files, and I'm 
> seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>       <match value="BEGMF" type="string" offset="0"/>
>       <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
> {code}
> The issue seems to be that the second magic matcher is not very specific: 
> the mask ignores the low five bits of the second byte, so it also matches 
> files that start with e.g. 0x002a. To be fair, this is only c. 700 false 
> matches out of >300 million resources, but it would be nice if this could be 
> tightened up. 
> Looking at the PRONOM signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each 
> version. Therefore, a more robust signature should be:
> {code}
>       <match value="BEGMF" type="string" offset="0"/>
>       <match value="0x0020" mask="0xffe0" type="string" offset="0">
>         <match value="0x10220001" type="string" offset="2:64"/>
>         <match value="0x10220002" type="string" offset="2:64"/>
>         <match value="0x10220003" type="string" offset="2:64"/>
>         <match value="0x10220004" type="string" offset="2:64"/>
>       </match>
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 
> characters long.
> Could this magic be considered for inclusion?
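
For illustration only, the difference between the mask-only matcher and the proposed nested matcher can be sketched in Python. This is not Tika's code: the function names are hypothetical, and the sketch assumes that "0x"-prefixed values are compared as raw bytes and that an offset of "2:64" means the nested value may start at any offset from 2 to 64 inclusive.

```python
# Sketch of the two matching rules discussed in the issue.
# Hypothetical helper names; this mimics the rules, it is not Tika's detector.

OLD_VALUE = 0x0020
OLD_MASK = 0xFFE0

# The four nested values from the proposed magic: 0x10220001 .. 0x10220004.
VERSION_MARKERS = [bytes.fromhex("1022000%d" % v) for v in range(1, 5)]


def matches_loose(data: bytes) -> bool:
    """Mask-only rule: the first two bytes, masked with 0xffe0, equal 0x0020."""
    if len(data) < 2:
        return False
    head = int.from_bytes(data[:2], "big")
    return (head & OLD_MASK) == OLD_VALUE


def matches_tight(data: bytes, first: int = 2, last: int = 64) -> bool:
    """Proposed rule: mask match plus one of the markers starting at
    an offset in [first, last] (range assumed to be inclusive)."""
    if not matches_loose(data):
        return False
    for start in range(first, last + 1):
        if data[start:start + 4] in VERSION_MARKERS:
            return True
    return False


if __name__ == "__main__":
    # Synthetic inputs, not real files.
    not_cgm = bytes([0x00, 0x2A]) + b"unrelated payload"
    cgm_like = (bytes([0x00, 0x2B, 0x0A]) + b"sample.cgm" + b"\x00"
                + bytes.fromhex("10220001"))
    for name, blob in (("not_cgm", not_cgm), ("cgm_like", cgm_like)):
        print(name, matches_loose(blob), matches_tight(blob))
```

Under these assumptions, a file that merely starts with 0x002a passes the loose rule but fails the tight one, while input that also carries one of the four markers within the first 64 bytes passes both. A marker that starts beyond offset 64 would be missed, which is the trade-off behind the 64-character filename assumption.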
