[ 
https://issues.apache.org/jira/browse/TIKA-154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12664767#action_12664767
 ] 

Jukka Zitting commented on TIKA-154:
------------------------------------

Re: Andrzej. The implementation I committed is based on an idea similar to the 
one you suggest. It looks at the first few bytes of the document and treats the 
document as text if no non-printable control characters are found.
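
To illustrate the idea, here is a minimal sketch of such a check. It is not the 
code committed to Tika. The class name, method name, the 512-byte prefix length, 
and the exact set of bytes treated as "non-printable control characters" are all 
assumptions made for the example.

```java
final class TextHeuristic {
    // Hypothetical limit; the comment above only says "the first few bytes".
    static final int PREFIX_LENGTH = 512;

    /** Returns true if the inspected prefix contains no control bytes. */
    static boolean looksLikeText(byte[] data) {
        int n = Math.min(data.length, PREFIX_LENGTH);
        for (int i = 0; i < n; i++) {
            if (isControl(data[i] & 0xFF)) {
                return false; // a control byte suggests binary content
            }
        }
        return true; // also true for empty input in this sketch
    }

    // Assumption: tab, line feed, and carriage return are ordinary text;
    // every other byte below 0x20, plus DEL (0x7F), counts as a control byte.
    // Bytes >= 0x80 are allowed, since they may belong to UTF-8 or Latin-1 text.
    private static boolean isControl(int b) {
        if (b == '\t' || b == '\n' || b == '\r') {
            return false;
        }
        return b < 0x20 || b == 0x7F;
    }
}
```

With this sketch, `looksLikeText("Hello\n".getBytes())` returns true, while a zip 
header such as `{'P', 'K', 3, 4}` is rejected because of the bytes 0x03 and 0x04 
that follow the ASCII magic.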

> Better detection of plain text versus binary formats with a text header
> -----------------------------------------------------------------------
>
>                 Key: TIKA-154
>                 URL: https://issues.apache.org/jira/browse/TIKA-154
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.3
>
>
> Antoni Mylka noted on the mailing list:
>     Many binary formats begin with magic byte sequences composed of ASCII 
> characters, e.g.
>     zipfiles begin with PK
>     pdfs begin with %PDF-
>     chms help files begin with ITSF
>     etc.
> Tika should do a better job of detecting such cases.
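
The examples listed above can be checked with a plain prefix comparison. The 
following is only a sketch under assumptions: the class name, the method name, 
and the short labels returned are hypothetical, and it does not reflect how Tika 
itself registers or evaluates magic patterns.

```java
import java.nio.charset.StandardCharsets;

final class MagicPrefix {
    // Prefixes taken from the examples in the issue description;
    // the labels on the right are illustrative names, not MIME types.
    private static final String[][] MAGIC = {
        {"PK", "zip"},
        {"%PDF-", "pdf"},
        {"ITSF", "chm"},
    };

    /** Returns the label of the first matching prefix, or null if none matches. */
    static String detect(byte[] data) {
        for (String[] entry : MAGIC) {
            byte[] magic = entry[0].getBytes(StandardCharsets.US_ASCII);
            if (startsWith(data, magic)) {
                return entry[1];
            }
        }
        return null;
    }

    // True if data begins with every byte of prefix, in order.
    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A short ASCII prefix like "PK" can also start an ordinary text file, so a check 
like this on its own may misclassify plain text; that ambiguity is the problem 
this issue describes.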

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
