[jira] Resolved: (TIKA-154) Better detection of plain text versus binary formats with a text header

Jukka Zitting (JIRA) Fri, 16 Jan 2009 17:18:31 -0800

     [ https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-154.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 0.3
         Assignee: Jukka Zitting

In revision 735193 I implemented the plain text detection mechanism described 
in section 4 of the content type sniffing draft [1] I mentioned earlier on the 
mailing list.

This seems to work well, and it finally allows us to detect plain text 
documents that come with no file name or type hints. :-)
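
For readers who have not seen the draft, the core of the check can be sketched 
roughly as follows. This is only an illustration, not the code committed in 
revision 735193: the class and method names are hypothetical, the 512-byte 
prefix length is an assumption, and the byte ranges are the "binary data 
bytes" as I recall them from the draft (control bytes other than tab, LF, FF, 
CR and ESC).

```java
// Hypothetical sketch; not the actual Tika implementation.
class PlainTextSniffer {

    // Assumed number of leading bytes to inspect.
    private static final int PREFIX_LENGTH = 512;

    // True for control bytes that should not occur in plain text:
    // 0x00-0x08, 0x0B, 0x0E-0x1A and 0x1C-0x1F. Tab (0x09), LF (0x0A),
    // FF (0x0C), CR (0x0D) and ESC (0x1B) are allowed.
    static boolean isBinaryByte(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    // Returns true if none of the inspected prefix bytes is a binary byte.
    // Note that an empty input also returns true.
    static boolean looksLikePlainText(byte[] data) {
        int n = Math.min(data.length, PREFIX_LENGTH);
        for (int i = 0; i < n; i++) {
            if (isBinaryByte(data[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "Hello, Tika!\n".getBytes(
                java.nio.charset.StandardCharsets.US_ASCII);
        // Starts with the ASCII letters "PK" but continues with control bytes.
        byte[] zipLike = {'P', 'K', 0x03, 0x04, 0x14, 0x00};

        System.out.println(looksLikePlainText(text));    // true
        System.out.println(looksLikePlainText(zipLike)); // false
    }
}
```

The point of scanning a prefix rather than only the first few bytes is that a 
header made of ASCII characters is not enough on its own to call something 
text; the bytes that follow the header have to look like text as well.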

Resolving as Fixed.

[1] http://webblaze.cs.berkeley.edu/2009/mime-sniff/mime-sniff.txt

> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
>                 Key: TIKA-154
>                 URL: https://issues.apache.org/jira/browse/TIKA-154
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.3
>
>
> Antoni Mylka noted on the mailing list:
>     Many binary formats begin with magic byte sequences composed of ASCII 
> characters, e.g.
>     ZIP files begin with PK
>     PDF files begin with %PDF-
>     CHM help files begin with ITSF
>     etc.
> Tika should do a better job of detecting such cases.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
