[ https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-154.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.3
Assignee: Jukka Zitting
In revision 735193 I implemented the plain text detection mechanism described
in section 4 of the content type sniffing draft [1] I mentioned earlier on the
mailing list.
This seems to work quite well, and it finally allows us to detect plain text
documents that come with no file name or type hints. :-)
Resolving as Fixed.
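
For readers who have not looked at the draft, the core idea of its plain text
check can be sketched roughly as below. This is a minimal illustration and not
the code committed in the revision mentioned above: the class name, the method
names and the 512 byte scan limit are made up for this example, and the set of
"binary" control bytes is written down as I understand section 4 of the draft
[1], so please check it against the draft before relying on it.

```java
import java.nio.charset.StandardCharsets;

public class PlainTextSniffer {

    // Assumed scan limit for this sketch; not taken from the draft or from Tika.
    private static final int MAX_SCAN_BYTES = 512;

    /**
     * Returns true for control bytes that this sketch treats as evidence of
     * binary content. Tab (0x09), LF (0x0A), FF (0x0C), CR (0x0D) and
     * ESC (0x1B) are deliberately allowed, since they occur in real text.
     * (Byte set assumed from the sniffing draft; verify before reuse.)
     */
    public static boolean isBinaryByte(int b) {
        return (b >= 0x00 && b <= 0x08)
                || b == 0x0B
                || (b >= 0x0E && b <= 0x1A)
                || (b >= 0x1C && b <= 0x1F);
    }

    /**
     * Returns true if none of the scanned prefix bytes is a binary control
     * byte. An empty prefix is reported as text by this sketch.
     */
    public static boolean looksLikePlainText(byte[] prefix) {
        int n = Math.min(prefix.length, MAX_SCAN_BYTES);
        for (int i = 0; i < n; i++) {
            // Mask to get the unsigned value of the (signed) Java byte.
            if (isBinaryByte(prefix[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] text = "Hello, Tika!\n".getBytes(StandardCharsets.US_ASCII);
        // Start of a ZIP local file header: "PK" followed by 0x03 0x04.
        byte[] zip = {'P', 'K', 0x03, 0x04, 0x14, 0x00};

        System.out.println(looksLikePlainText(text)); // prints "true"
        System.out.println(looksLikePlainText(zip));  // prints "false"
    }
}
```

Note that the ZIP prefix is rejected even though it starts with the ASCII
characters "PK", because the bytes that follow are control bytes.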
[1] http://webblaze.cs.berkeley.edu/2009/mime-sniff/mime-sniff.txt
> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
> Key: TIKA-154
> URL: https://issues.apache.org/jira/browse/TIKA-154
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.3
>
>
> Antoni Mylka noted on the mailing list:
> Many binary formats begin with magic byte sequences composed of ASCII
> characters, e.g.
> ZIP files begin with PK
> PDF files begin with %PDF-
> CHM help files begin with ITSF
> etc.
> Tika should do a better job of detecting such cases.
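
To illustrate the cases listed in the description above, here is a small sketch
of checking a byte prefix against those ASCII magic sequences. It only shows
the general technique; the class name, the method names and the short format
labels are hypothetical, and this is not how Tika's own magic matching is
implemented.

```java
import java.nio.charset.StandardCharsets;

public class AsciiMagicSniffer {

    // Magic prefixes from the issue description, paired with labels that are
    // invented for this example (they are not MIME types).
    private static final String[][] MAGIC = {
        {"PK", "zip"},
        {"%PDF-", "pdf"},
        {"ITSF", "chm"},
    };

    /** Returns the label of the first matching magic prefix, or null. */
    public static String sniff(byte[] prefix) {
        for (String[] entry : MAGIC) {
            byte[] magic = entry[0].getBytes(StandardCharsets.US_ASCII);
            if (startsWith(prefix, magic)) {
                return entry[1];
            }
        }
        return null;
    }

    /** Returns true if data begins with all bytes of magic. */
    private static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        byte[] text = "Just some notes\n".getBytes(StandardCharsets.US_ASCII);

        System.out.println(sniff(pdf));  // prints "pdf"
        System.out.println(sniff(text)); // prints "null"
    }
}
```

A two-character prefix such as "PK" can of course also start an ordinary text
file, which is exactly why a header match alone is not enough to decide
between text and binary.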