[ 
https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoni Mylka updated TIKA-814:
------------------------------

    Attachment: tika-textdetector.patch

A patch that makes the text detector work on the entire array supplied by 
MimeTypes.
                
> Increase the amount of bytes read by TextDetector
> -------------------------------------------------
>
>                 Key: TIKA-814
>                 URL: https://issues.apache.org/jira/browse/TIKA-814
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>         Attachments: tika-textdetector.patch
>
>
> In TIKA-688 Jukka implemented a plain text detector. It is invoked 
> automatically inside MimeTypes. A number of files in my collections are 
> binary but are still detected as plain text. They would not be if the plain 
> text detector were allowed to look at more than the initial 512 bytes. I 
> think that the TextDetector should look at MimeTypes.getMinLength bytes. It 
> is given a ByteArrayInputStream backed by an array, and it should read all 
> bytes in that array. 
> The performance impact should be negligible (no I/O, no allocations, just 
> pure array lookups), while my experiments show that there are cases where 
> 512 bytes are not enough.
> If anyone objects for performance reasons, I'll create another patch that 
> lets users decouple the TextDetector from MimeTypes and supply their own 
> detector with different settings.
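
To illustrate the effect described above, here is a minimal sketch, not the
attached patch and not Tika's actual TextDetector. It assumes a deliberately
simplified heuristic (any control byte other than tab, LF or CR means
"binary"), and the names LookaheadSketch, looksLikeText, check and sample are
hypothetical. It shows how a buffer whose first 600 bytes are text passes a
512-byte check but fails once the whole array is examined.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

final class LookaheadSketch {

    /**
     * Simplified, hypothetical heuristic: the stream "looks like text" if
     * none of the first {@code lookahead} bytes is a control byte other
     * than tab, line feed or carriage return.
     */
    static boolean looksLikeText(InputStream in, int lookahead) throws IOException {
        byte[] buf = new byte[lookahead];
        int n = 0;
        int r;
        // Fill the buffer until it is full or the stream is exhausted.
        while (n < lookahead && (r = in.read(buf, n, lookahead - n)) != -1) {
            n += r;
        }
        for (int i = 0; i < n; i++) {
            int b = buf[i] & 0xFF;
            if (b < 0x20 && b != '\t' && b != '\n' && b != '\r') {
                return false; // control byte found: treat as binary
            }
        }
        return true;
    }

    /** Convenience wrapper around an in-memory array (no real I/O involved). */
    static boolean check(byte[] data, int lookahead) {
        try {
            return looksLikeText(new ByteArrayInputStream(data), lookahead);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** 1024 bytes: 600 bytes of 'a' followed by 424 zero bytes. */
    static byte[] sample() {
        byte[] data = new byte[1024];
        Arrays.fill(data, 0, 600, (byte) 'a');
        return data;
    }
}
```

With this sample, check(data, 512) returns true while check(data, data.length)
returns false, because the first zero byte sits at offset 600, beyond the
512-byte window.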

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
