[ https://issues.apache.org/jira/browse/TIKA-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Gauss II reassigned TIKA-1170:
----------------------------------

    Assignee: Ray Gauss II

> Insufficiently specific magic for binary image/cgm files
> --------------------------------------------------------
>
>                 Key: TIKA-1170
>                 URL: https://issues.apache.org/jira/browse/TIKA-1170
>             Project: Tika
>          Issue Type: Bug
>          Components: mime
>    Affects Versions: 1.4
>            Reporter: Andrew Jackson
>            Assignee: Ray Gauss II
>            Priority: Minor
>         Attachments: 0001-Added-CGM-test-file-test-and-improved-magic.patch, plotutils-example.cgm
>
>
> I've been running Tika against a large corpus of web archive files, and I'm seeing a number of false positives for image/cgm. The Tika magic is
> {code}
>   <match value="BEGMF" type="string" offset="0"/>
>   <match value="0x0020" mask="0xffe0" type="string" offset="0"/>
> {code}
> The issue seems to be that the second magic matcher is not very specific: the mask ignores the low five bits, so it also matches files that start with, e.g., 0x002a. To be fair, this is only c. 700 false matches out of >300 million resources, but it would be nice if this could be tightened up.
> Looking at the PRONOM signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1048&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1049&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1050&strPageToDisplay=signatures
> * http://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=1051&strPageToDisplay=signatures
> it seems we have a variable position marker that changes slightly for each version.
> Therefore, a more robust signature should be:
> {code}
>   <match value="BEGMF" type="string" offset="0"/>
>   <match value="0x0020" mask="0xffe0" type="string" offset="0">
>     <match value="0x10220001" type="string" offset="2:64"/>
>     <match value="0x10220002" type="string" offset="2:64"/>
>     <match value="0x10220003" type="string" offset="2:64"/>
>     <match value="0x10220004" type="string" offset="2:64"/>
>   </match>
> {code}
> Where I have assumed the filename part of the CGM file will be less than 64 characters long.
> Could this magic be considered for inclusion?
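
To make the difference between the two rules concrete, here is a minimal Python sketch of the matching logic described in the issue. It is not Tika's implementation: the function names are hypothetical, the sample byte strings are synthetic rather than taken from the attached test file, and it assumes that offset="2:64" means the nested value may start at any offset from 2 to 64 inclusive.

```python
# Illustrative sketch only; helper names are hypothetical, not Tika APIs.

HEADER_MASK = 0xFFE0   # mask from the existing magic: ignores the low 5 bits
HEADER_VALUE = 0x0020  # value from the existing magic

# The four version markers from the proposed nested matches:
# 0x10220001 .. 0x10220004
VERSION_MARKERS = [bytes([0x10, 0x22, 0x00, v]) for v in (1, 2, 3, 4)]


def matches_current_binary_magic(data: bytes) -> bool:
    """Existing binary rule: masked 16-bit big-endian compare at offset 0."""
    if len(data) < 2:
        return False
    word = int.from_bytes(data[:2], "big")
    return (word & HEADER_MASK) == HEADER_VALUE


def matches_proposed_binary_magic(data: bytes, lo: int = 2, hi: int = 64) -> bool:
    """Proposed rule: the masked header AND a version marker that starts
    somewhere in offsets lo..hi (assumed inclusive)."""
    if not matches_current_binary_magic(data):
        return False
    # bytes.find(sub, start, end) only reports matches fully inside
    # data[start:end], so the end bound is extended by the marker length.
    return any(data.find(m, lo, hi + len(m)) != -1 for m in VERSION_MARKERS)


def looks_like_cgm(data: bytes) -> bool:
    """Top-level alternatives: the text rule (BEGMF) or the binary rule."""
    return data.startswith(b"BEGMF") or matches_proposed_binary_magic(data)


# Synthetic inputs for illustration.
false_positive = bytes([0x00, 0x2A]) + b"\x00" * 30
binary_like = b"\x00\x25" + b"\x04test" + b"\x00" + b"\x10\x22\x00\x01"

print(matches_current_binary_magic(false_positive))   # current rule accepts it
print(matches_proposed_binary_magic(false_positive))  # proposed rule rejects it
print(looks_like_cgm(binary_like))
```

The nested matches act as an additional condition on the masked header, so a file that merely starts with 0x002a no longer qualifies unless one of the four version markers also appears within the assumed 2..64 window.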