Tim Allison created TIKA-4720:
---------------------------------
Summary: Improve charset detection in 4.x, take 2
Key: TIKA-4720
URL: https://issues.apache.org/jira/browse/TIKA-4720
Project: Tika
Issue Type: Task
Reporter: Tim Allison
I had good results with a simple naive Bayes classifier with careful scaling.
This ticket covers the move to that as the main charset detector. It also
covers work to improve our default HTML charset detector so that it gets some
of the benefits of our StandardHtml charset detector without its rigidity.
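
For readers unfamiliar with the approach, here is a minimal sketch of naive Bayes charset detection. It is not the Tika implementation: the class name, the byte-bigram features, the add-one smoothing, and the reading of "scaling" as averaging the log-likelihood over the number of features are all assumptions made for illustration. The ticket does not say what the actual features or scaling are.

```python
import math
from collections import Counter


def bigrams(data: bytes):
    """Return the overlapping byte bigrams of `data`."""
    return [data[i:i + 2] for i in range(len(data) - 1)]


class NaiveBayesCharsetSketch:
    """Hypothetical illustration only; not Tika's detector."""

    VOCAB = 256 * 256  # number of possible byte bigrams

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha   # additive (Laplace) smoothing constant
        self.counts = {}     # charset -> Counter of bigram counts
        self.totals = {}     # charset -> total number of bigrams seen

    def train(self, charset: str, samples):
        """Encode each text sample in `charset` and count its bigrams."""
        counter = self.counts.setdefault(charset, Counter())
        for text in samples:
            counter.update(bigrams(text.encode(charset)))
        self.totals[charset] = sum(counter.values())

    def score(self, charset: str, data: bytes) -> float:
        """Average smoothed log-likelihood per bigram under `charset`.

        Dividing by the feature count is the assumed 'scaling': it keeps
        scores comparable across inputs of different lengths.
        """
        feats = bigrams(data)
        if not feats:
            return float("-inf")
        counter = self.counts[charset]
        denom = self.totals[charset] + self.alpha * self.VOCAB
        total = sum(math.log((counter[f] + self.alpha) / denom) for f in feats)
        return total / len(feats)

    def detect(self, data: bytes) -> str:
        """Return the trained charset with the highest score (uniform prior)."""
        return max(self.counts, key=lambda cs: self.score(cs, data))


# Toy usage: train both models on the same text, encoded two ways.
samples = ["café déjà vu", "naïve résumé", "señor niño año"]
model = NaiveBayesCharsetSketch()
model.train("utf-8", samples)
model.train("cp1252", samples)
guess = model.detect("résumé déjà".encode("cp1252"))
```

A real detector would need far more training data per charset and a way to handle inputs that match no trained model well; the toy corpus above only shows the mechanics.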