[ 
https://issues.apache.org/jira/browse/TIKA-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated TIKA-1208:
---------------------------------------

    Attachment: TIKA-1208.patch

Hi [~p_ansell], I have been working on a patch for this issue... which I did 
not wish to push to Jira... however I've been taken off course by bugs in a 
Gora branch.

I attach a patch for migrating the Any23 mime package to Tika which retains the 
Purifier concept of cleaning documents prior to them being processed for 
mime/mediaType detection. I've not touched the Tika API or the Detect API 
within this implementation as (I personally think) it would be more of a task 
to succeed in the code migration if we attempt to change the well-known and 
well-designed 'detect' and base 'Tika' APIs.

This therefore means that Purifier implementations are detector specific... 
right now all we can offer is the WhiteSpacePurifier, which is OK... but it is 
NOT configurable e.g. if someone wished to pass a Purifier as a parameter to 
detect(InputStream, Metadata, Purifier) ... and I think that if other 
Purifiers were to be introduced then we could revisit this issue.
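
To make the Purifier idea concrete for anyone who has not seen the Any23 code, 
here is a minimal, self-contained sketch of what I mean. Please note it is NOT 
the code in the attached patch and it does not use the Tika or Any23 APIs: the 
class name PurifierSketch, the nested Purifier interface and the 
peekAfterPurify helper are all hypothetical names made up for illustration. 
It only shows leading whitespace being skipped so that whatever does the 
pattern matching starts at the first real byte.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class PurifierSketch {

    /** Hypothetical stand-in for the Purifier concept; not the Any23 or Tika API. */
    interface Purifier {
        /** Advances a mark-supporting stream past content that would confuse detection. */
        void purify(InputStream in) throws IOException;
    }

    /** Skips leading whitespace so that pattern matching starts at real content. */
    static final class WhiteSpacePurifier implements Purifier {
        @Override
        public void purify(InputStream in) throws IOException {
            if (!in.markSupported()) {
                throw new IllegalArgumentException("stream must support mark/reset");
            }
            while (true) {
                in.mark(1);                      // remember the position before each byte
                int b = in.read();
                if (b == -1) {
                    return;                      // stream was empty or all whitespace
                }
                if (!Character.isWhitespace(b)) {
                    in.reset();                  // un-read the first non-blank byte
                    return;
                }
            }
        }
    }

    /** Hypothetical helper: purify, then peek at the first n bytes a detector would see. */
    static String peekAfterPurify(InputStream raw, Purifier purifier, int n) throws IOException {
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);
        purifier.purify(in);
        in.mark(n);
        byte[] head = new byte[n];
        int read = 0;
        while (read < n) {
            int r = in.read(head, read, n - read);
            if (r == -1) {
                break;                           // fewer than n bytes available
            }
            read += r;
        }
        in.reset();                              // leave the stream at the first real byte
        return new String(head, 0, read, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = " \n\t@prefix ex: <http://example.org/> ."
                .getBytes(StandardCharsets.US_ASCII);
        String head = peekAfterPurify(new ByteArrayInputStream(doc), new WhiteSpacePurifier(), 7);
        System.out.println(head);                // prints "@prefix"
    }
}
```

In this sketch the Purifier is passed in as a parameter, which is the 
configurable variant discussed above; in the attached patch the purifier is 
instead fixed inside the detector.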

Apart from that, this (WIP) patch introduces an Any23Detector which basically 
stems from the Tika detector we maintained in Any23... please comment on this 
as I am not sure if this is the right way to proceed...

THIS PATCH IS MERELY A START... I need input from the Any23 team to see if I am 
'attempting' to implement the Any23 mime code in the correct way.

It should also be noted that the last time I ran this patch with Tika trunk 
there were issues with detection of 'semantic' mime types.

Hopefully this is a start which we can build from. I am committed to getting 
this code suitable for proposal to Tika.

Any comments are VERY appreciated.

> Migrate Any23 mime contributions to Tika
> ----------------------------------------
>
>                 Key: TIKA-1208
>                 URL: https://issues.apache.org/jira/browse/TIKA-1208
>             Project: Tika
>          Issue Type: Sub-task
>          Components: mime
>            Reporter: Lewis John McGibbney
>             Fix For: 1.5
>
>         Attachments: TIKA-1208.patch
>
>
> We begin with one of the most obvious areas in which there
> is overlap.
> In short, the appeal of this package is the addition of detection 
> for the following types:
>  - text/n3
>  - text/rdf+n3
>  - application/n3
>  - text/x-nquads
>  - text/rdf+nq
>  - text/nq
>  - application/nq
>  - text/turtle
>  - application/x-turtle
>  - application/turtle
>  - application/trix
>  
> Therefore although both Tika and Any23 carry out Mimetype-related
> tasks, there is a contribution to be made. This involves the transferral of
> code pertaining to pattern recognition, Mimetype XML definitions within 
> tika-mimetypes.xml and a Purifier implementation that removes any 
> blank characters at the start of a file that might 
> prevent its MIME Type detection.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
