[ https://issues.apache.org/jira/browse/TIKA-814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka closed TIKA-814.
-----------------------------

    Resolution: Fixed
    Fix Version/s: 1.1

Committed in r1220698.

This is a change, which theoretically impacts all users of Tika invoking MimeTypes. I say it has negligible performance overhead and yields better results on 5 broken BMP files I have in my collections. If you disagree: revert the change and reopen this issue. I'll create a second solution, with customizable plain text detection. For now, I close this.

> Increase the amount of bytes read by TextDetector
> -------------------------------------------------
>
>                 Key: TIKA-814
>                 URL: https://issues.apache.org/jira/browse/TIKA-814
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 1.1
>            Reporter: Antoni Mylka
>             Fix For: 1.1
>
>         Attachments: tika-textdetector.patch
>
>
> In TIKA-688 Jukka implemented a plain text detector. It is fired
> automatically inside MimeTypes. I find a number of files in my collections,
> which are binary but are still detected as plain text. They wouldn't be if
> the plain text detector were allowed to look at more than the initial 512
> bytes. I think that the TextDetector should look at MimeTypes.getMinLength
> bytes. It is given a ByteArrayInputStream backed by an Array. It should read
> all bytes in that array.
> The performance impact should be negligible (no I/O, no allocations, just
> pure array lookups), while my experiments show that there are cases when 512
> bytes is not enough.
> If anyone objects due to performance reasons, I'll create another patch,
> which will allow the users to decouple the TextDetector from MimeTypes and
> supply their own, with different settings.
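
To illustrate why the size of the look-ahead matters, here is a minimal sketch in plain Java. It is not Tika's TextDetector and does not use any Tika API: the class `LookaheadSketch`, the method `looksLikeText`, and the rule "any control byte other than tab, LF, or CR means binary" are all assumptions made up for this example. It builds a 1024-byte buffer whose first 512 bytes are printable and whose remainder is binary, then classifies it with two different look-ahead limits.

```java
import java.util.Arrays;

public class LookaheadSketch {

    /**
     * Hypothetical heuristic (an assumption, not Tika's actual rule):
     * the examined prefix counts as text if it contains no control bytes
     * other than tab, line feed, and carriage return.
     */
    public static boolean looksLikeText(byte[] data, int lookahead) {
        // Only array lookups: no I/O and no allocations.
        int limit = Math.min(data.length, lookahead);
        for (int i = 0; i < limit; i++) {
            int b = data[i] & 0xFF;
            boolean allowedControl = b == '\t' || b == '\n' || b == '\r';
            if (b < 0x20 && !allowedControl) {
                return false; // control byte found: treat as binary
            }
        }
        return true;
    }

    /** Builds a buffer with 512 printable bytes followed by 512 zero bytes. */
    public static byte[] sample() {
        byte[] data = new byte[1024];
        Arrays.fill(data, 0, 512, (byte) 'A');
        Arrays.fill(data, 512, 1024, (byte) 0x00);
        return data;
    }

    public static void main(String[] args) {
        byte[] data = sample();
        // Looking at the first 512 bytes only: the binary tail is never seen.
        System.out.println(looksLikeText(data, 512));         // prints "true"
        // Looking at the whole array: the zero bytes are found.
        System.out.println(looksLikeText(data, data.length)); // prints "false"
    }
}
```

With the 512-byte limit the buffer is misclassified as text. Scanning the whole array costs one extra pass over memory that is already allocated, which matches the "just pure array lookups" argument in the issue description.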