[
https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664767#action_12664767
]
Jukka Zitting commented on TIKA-154:
------------------------------------
Re: Andrzej. The implementation I committed is based on an idea similar to the
one you suggest. It looks at the first few bytes of the document and treats the
document as text only if it finds no non-printable control characters among them.
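
For readers who want to see the shape of that check, here is a minimal sketch of
the heuristic. It is not the code committed to Tika. The class name, the method
names, the 512-byte prefix limit, and the set of control characters tolerated as
ordinary text whitespace (tab, line feed, form feed, carriage return) are all
assumptions made for illustration.

```java
// Illustrative sketch only; not the actual Tika implementation.
final class TextHeuristic {

    // Assumed limit on how many leading bytes are inspected.
    static final int PREFIX_LENGTH = 512;

    // Returns true if none of the inspected leading bytes is a
    // non-printable control character. An empty input is treated as text.
    static boolean looksLikeText(byte[] data) {
        int n = Math.min(data.length, PREFIX_LENGTH);
        for (int i = 0; i < n; i++) {
            if (isDisallowedControl(data[i] & 0xFF)) {
                return false;
            }
        }
        return true;
    }

    // Assumption: tab, LF, FF and CR are common in plain text and are
    // allowed; every other byte below 0x20, plus DEL (0x7F), is not.
    static boolean isDisallowedControl(int b) {
        if (b == '\t' || b == '\n' || b == '\f' || b == '\r') {
            return false;
        }
        return b < 0x20 || b == 0x7F;
    }
}
```

Under these assumptions, a ZIP header such as the bytes 'P', 'K', 0x03, 0x04 is
rejected because of the two control bytes that follow the ASCII magic, while an
ordinary ASCII sentence is accepted. A binary format whose inspected prefix
contains no control bytes at all would still pass this check.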
> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
> Key: TIKA-154
> URL: https://issues.apache.org/jira/browse/TIKA-154
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Priority: Minor
> Fix For: 0.3
>
>
> Antoni Mylka noted on the mailing list:
> Many binary formats begin with magic byte sequences composed of ASCII
> characters, e.g.
> ZIP files begin with PK
> PDFs begin with %PDF-
> CHM help files begin with ITSF
> etc.
> Tika should do a better job of detecting such cases.