[
https://issues.apache.org/jira/browse/TIKA-322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778137#action_12778137
]
Luke Nezda commented on TIKA-322:
---------------------------------
http://code.google.com/p/juniversalchardet/ has a pretty good, efficient
charset detector, a Java port of the Mozilla universalchardet algorithms. It
is licensed under the Mozilla Public License Version 1.1. I am not sure
whether the MPL is ASF compatible; it appears to be, but IANAL. AFAIK, it does
not provide the detection confidence or language detection features that ICU4J
does, and I think it has code/data files for fewer encodings, but it is
primarily statistical, so more could be added. I am also not sure how it
handles input that matches multiple encodings. In theory, it should detect
what Firefox detects for a given URL/file.
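
For the easy-to-detect cases mentioned in the issue description, a cheap
pre-check could run before any statistical detector (ICU4J-derived or
juniversalchardet). The following is only a rough sketch, not code from Tika
or from either library: the class name QuickCharsetCheck and its detect()
method are hypothetical, and it assumes the complete input is available as a
byte array. It looks for a byte order mark, then for pure 7-bit input, then
tries a strict UTF-8 decode, and returns null when none of these apply so the
caller can fall back to the slower detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Tika, ICU4J,
// or juniversalchardet.
public class QuickCharsetCheck {

    /**
     * Returns a charset name for the easy cases, or null when the input
     * needs a statistical detector. Assumes data holds the whole input;
     * a buffer cut in the middle of a multi-byte sequence fails step 3.
     */
    public static String detect(byte[] data) {
        if (data.length == 0) {
            return null;
        }

        // 1. Byte order marks. The UTF-32 checks come first because the
        //    UTF-32LE mark begins with the UTF-16LE mark (FF FE).
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";

        // 2. Pure 7-bit input. This sketch reports US-ASCII and ignores
        //    7-bit escape-based encodings such as ISO-2022-JP.
        boolean sevenBit = true;
        for (byte b : data) {
            if (b < 0) {          // high bit set
                sevenBit = false;
                break;
            }
        }
        if (sevenBit) return "US-ASCII";

        // 3. Strict UTF-8: report malformed input instead of replacing it.
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return null;          // not an easy case, use the slow path
        }
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

A caller would try QuickCharsetCheck.detect(bytes) first and only hand the
bytes to the statistical detector when it returns null. Whether the strict
UTF-8 test is reliable enough on very short inputs is an open question.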
> Improve encoding detection speed and accuracy
> ---------------------------------------------
>
> Key: TIKA-322
> URL: https://issues.apache.org/jira/browse/TIKA-322
> Project: Tika
> Issue Type: Improvement
> Components: mime
> Reporter: Jukka Zitting
> Priority: Minor
>
> The encoding detection code we took from ICU4J is not very efficient and
> sometimes produces odd results when more than one encoding matches the given
> input data. It would be good to refactor the code to be faster for
> easy-to-detect encodings and to have better heuristics in case multiple
> matches are found.
--
This message is automatically generated by JIRA.