Tim Allison created TIKA-4720:
---------------------------------

             Summary: Improve charset detection in 4.x, take 2
                 Key: TIKA-4720
                 URL: https://issues.apache.org/jira/browse/TIKA-4720
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


I had some really good luck with a simple naive Bayes classifier and careful scaling.

This ticket covers the move to naive Bayes as the main charset detector. It 
also covers work to improve our default HTML charset detector so that it gets 
some of the benefits of our StandardHtml charset detector without its rigidity.
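
For readers unfamiliar with the approach, a minimal sketch of naive Bayes
charset scoring might look like the following. This is not Tika's
implementation: the class name, the byte-bigram features, the add-one
smoothing, and the per-bigram length normalization (standing in here for
"careful scaling", which the ticket does not define) are all assumptions made
for illustration.

```python
import math
from collections import Counter


def byte_bigrams(data: bytes):
    """Return the overlapping 2-byte windows of the input."""
    return [data[i:i + 2] for i in range(len(data) - 1)]


class NaiveBayesCharsetSketch:
    """Hypothetical multinomial naive Bayes over byte bigrams (illustration only)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha   # additive smoothing constant (assumed, not from the ticket)
        self.counts = {}     # charset -> Counter of bigram frequencies
        self.totals = {}     # charset -> total number of bigrams seen
        self.vocab = set()   # every bigram seen under any charset

    def train(self, charset: str, text: str) -> None:
        """Encode the text with the given charset and count its byte bigrams."""
        grams = byte_bigrams(text.encode(charset))
        self.counts.setdefault(charset, Counter()).update(grams)
        self.totals[charset] = self.totals.get(charset, 0) + len(grams)
        self.vocab.update(grams)

    def score(self, charset: str, data: bytes) -> float:
        """Average smoothed log-likelihood per bigram.

        Dividing by the bigram count is the assumed "scaling" step: it keeps
        scores comparable across inputs of different lengths.
        """
        grams = byte_bigrams(data)
        if not grams:
            return float("-inf")
        vocab_size = len(self.vocab) + 1  # +1 slot for unseen bigrams
        denom = self.totals[charset] + self.alpha * vocab_size
        model = self.counts[charset]
        total = sum(math.log((model[g] + self.alpha) / denom) for g in grams)
        return total / len(grams)

    def detect(self, data: bytes) -> str:
        """Return the trained charset with the highest score."""
        return max(self.counts, key=lambda cs: self.score(cs, data))


# Toy usage: one training sentence per charset, far too little for real use.
model = NaiveBayesCharsetSketch()
sample = "schöne Grüße für die Königin aus Köln"
for cs in ("utf-8", "cp1252"):
    model.train(cs, sample)

print(model.detect("Grüße für Köln".encode("cp1252")))
```

A real detector would need far more training data per charset and a much
larger set of candidate charsets. The sketch only shows the shape of the
scoring step.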



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
