[ 
https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoni Mylka closed TIKA-814.
-----------------------------

       Resolution: Fixed
    Fix Version/s: 1.1

Committed in r1220698.

This change theoretically affects all Tika users who invoke MimeTypes. I 
believe it has negligible performance overhead, and it yields better results 
on 5 broken BMP files I have in my collections. 

If you disagree, revert the change and reopen this issue. I'll then create a 
second solution, with customizable plain text detection.

For now, I'm closing this.
                
> Increase the amount of bytes read by TextDetector
> -------------------------------------------------
>
>                 Key: TIKA-814
>                 URL: https://issues.apache.org/jira/browse/TIKA-814
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>             Fix For: 1.1
>
>         Attachments: tika-textdetector.patch
>
>
> In TIKA-688 Jukka implemented a plain text detector. It is invoked 
> automatically inside MimeTypes. I have found a number of files in my 
> collections that are binary but are still detected as plain text. They 
> wouldn't be if the plain text detector were allowed to look at more than the 
> initial 512 bytes. I think the TextDetector should look at 
> MimeTypes.getMinLength bytes. It is given a ByteArrayInputStream backed by an 
> array, and it should read all bytes in that array. 
> The performance impact should be negligible (no I/O, no allocations, just 
> pure array lookups), while my experiments show that there are cases where 512 
> bytes is not enough.
> If anyone objects for performance reasons, I'll create another patch that 
> allows users to decouple the TextDetector from MimeTypes and supply their 
> own, with different settings.
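
To illustrate the point of the description above, here is a minimal sketch in 
plain Java. It does not use Tika and is not Tika's actual detection rule: the 
class TextPrefixSketch, the methods looksLikeText and sample, and the "no 
control bytes" heuristic are all assumptions made for illustration. It only 
shows how a file whose binary content starts after byte 512 passes a 512-byte 
check but fails when the whole in-memory array is examined.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

class TextPrefixSketch {

    // Simplified stand-in for a plain text heuristic (NOT Tika's real rule):
    // the prefix "looks like text" if it contains no control bytes other than
    // tab, line feed, form feed, carriage return and escape.
    static boolean looksLikeText(InputStream in, int bytesToTest) throws IOException {
        byte[] buffer = new byte[bytesToTest];
        int total = 0;
        int n;
        // Read up to bytesToTest bytes, stopping early at end of stream.
        while (total < bytesToTest
                && (n = in.read(buffer, total, bytesToTest - total)) != -1) {
            total += n;
        }
        for (int i = 0; i < total; i++) {
            int b = buffer[i] & 0xFF;
            boolean allowedControl =
                    b == '\t' || b == '\n' || b == '\f' || b == '\r' || b == 0x1B;
            if (b < 0x20 && !allowedControl) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical sample: 600 bytes of ASCII 'A' followed by 424 NUL bytes.
    static byte[] sample() {
        byte[] data = new byte[1024];
        Arrays.fill(data, 0, 600, (byte) 'A');
        return data;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = sample();
        // Only the first 512 bytes are checked: the NUL bytes are never seen.
        System.out.println(looksLikeText(new ByteArrayInputStream(data), 512));         // true
        // The whole backing array is checked: the NUL bytes are found.
        System.out.println(looksLikeText(new ByteArrayInputStream(data), data.length)); // false
    }
}
```

Both calls work on an array that is already in memory, so the larger limit 
adds only extra array lookups, which matches the argument that the overhead 
should be small.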

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
