[ 
https://issues.apache.org/jira/browse/TIKA-86?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187018#comment-13187018
 ] 

Andrew Jackson commented on TIKA-86:
------------------------------------

We've done some work in this area and noticed that other identification tools 
(including file(1)) use a wider range of matching methods than Tika currently 
supports, e.g. regular expressions. To this end, we've extended Tika to support 
regex magic (see this commit on our GitHub repo: 
https://github.com/openplanets/tika/commit/b8de9e77c8b432788e3f73a4dbccca8ea044b503).
We'd be happy to tidy this code up and submit it here if re-using regex magic 
from other tools is of interest to the core Tika project.
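
To make "regex magic" concrete, here is a minimal sketch of the idea in Python. 
It is not the code from the commit above and not Tika's API: the rule table, 
the patterns, the scan windows, and the detect() function are all hypothetical 
and chosen only for illustration.

```python
import re

# Hypothetical rule table: (MIME type, scan window in bytes, compiled bytes regex).
# The patterns are illustrative assumptions; they are not copied from Tika
# or from the file(1) magic database.
REGEX_MAGIC = [
    ("application/pdf", 1024, re.compile(rb"%PDF-\d\.\d")),
    ("text/html", 1024,
     re.compile(rb"<(?:!DOCTYPE\s+html|html)[\s>]", re.IGNORECASE)),
    ("image/gif", 6, re.compile(rb"GIF8[79]a")),
]

DEFAULT_TYPE = "application/octet-stream"


def detect(data: bytes) -> str:
    """Return the MIME type of the first rule whose regex matches.

    Each rule only looks at the first `window` bytes, so a 6-byte pattern
    with a 6-byte window behaves like a fixed-offset magic number, while a
    larger window lets the pattern float (e.g. past leading whitespace).
    """
    for mime, window, pattern in REGEX_MAGIC:
        if pattern.search(data[:window]):
            return mime
    return DEFAULT_TYPE


if __name__ == "__main__":
    print(detect(b"%PDF-1.4\n"))
    print(detect(b"  \n<!doctype html><html>"))
    print(detect(b"GIF89a\x01\x00"))
```

Rules are tried in table order, so in this sketch an earlier floating pattern 
can shadow a later, more specific one; a real implementation would need some 
notion of priority.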

However, to get back to the point: I agree that simply having a parser for 
file(1) magic would not work, because porting the magic is necessarily a manual 
process. Even when an entry carries a MIME type, you can't reliably tell which 
parts of the magic identify the format and which parts do 'set-up' or extract 
properties. This implies that this feature request should be turned down.

                
> Support magic(5) files
> ----------------------
>
>                 Key: TIKA-86
>                 URL: https://issues.apache.org/jira/browse/TIKA-86
>             Project: Tika
>          Issue Type: New Feature
>          Components: general
>            Reporter: Jukka Zitting
>
> Tika should have a parser for the magic(5) file format used by the file(1) 
> command. Then we could use existing magic rules from places like 
> http://svn.apache.org/repos/asf/httpd/httpd/trunk/docs/conf/magic.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
