Tim Allison created TIKA-4685:
---------------------------------

             Summary: Add a new charset detector for 4.x
                 Key: TIKA-4685
                 URL: https://issues.apache.org/jira/browse/TIKA-4685
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


While building out the maxent model for the updated language detector, I 
realized we already had the resources (text files for each language) and a 
maxent model on hand, ready to be used to build a new charset detector based on 
byte ngrams.
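
To make the idea concrete, here is a minimal sketch of a maxent (multinomial
logistic regression) classifier over byte bigrams. It is not the code proposed
in this issue and is not based on Tika's sources: the function names, the toy
training text, the three candidate encodings, and the plain gradient-ascent
training loop are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def bigram_features(data):
    """Relative frequencies of overlapping byte bigrams in raw bytes."""
    counts = Counter(zip(data, data[1:]))
    total = sum(counts.values()) or 1
    return {bigram: n / total for bigram, n in counts.items()}


def predict(weights, labels, data):
    """Maxent inference: softmax over one linear score per charset label."""
    feats = bigram_features(data)
    scores = {
        label: sum(weights[label].get(f, 0.0) * v for f, v in feats.items())
        for label in labels
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    norm = sum(exps.values())
    return {label: e / norm for label, e in exps.items()}


def train(samples, epochs=50, rate=1.0):
    """Fit weights by simple gradient ascent on the log-likelihood.

    `samples` is a list of (charset_label, raw_bytes) pairs.
    """
    labels = sorted({label for label, _ in samples})
    weights = {label: defaultdict(float) for label in labels}
    for _ in range(epochs):
        for gold, data in samples:
            probs = predict(weights, labels, data)
            for f, v in bigram_features(data).items():
                for label in labels:
                    target = 1.0 if label == gold else 0.0
                    weights[label][f] += rate * (target - probs[label]) * v
    return labels, weights


def detect(labels, weights, data):
    """Return the most probable charset label for the given bytes."""
    probs = predict(weights, labels, data)
    return max(probs, key=probs.get)


# Toy data (hypothetical): one Russian sentence encoded three ways.
TRAIN_TEXT = "привет мир это пример текста"
ENCODINGS = ["utf-8", "utf-16-le", "cp1251"]
labels, weights = train([(enc, TRAIN_TEXT.encode(enc)) for enc in ENCODINGS])

print(detect(labels, weights, "это другой текст".encode("cp1251")))
```

The same text produces very different byte bigrams under each encoding, which
is what gives the classifier something to learn from. A real detector would
need far more training text per language and charset, and would have to cope
with encodings whose byte patterns overlap much more than these three do.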

I have something working that appears to be quite good. We can replace both 
the universal and icu4j detectors with it. There's a chance that the results 
are hallucinated or that there's something surprising going on, but I think we 
should merge this and see what happens on our regression set.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
