[
https://issues.apache.org/jira/browse/TIKA-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18076263#comment-18076263
]
ASF GitHub Bot commented on TIKA-4720:
--------------------------------------
tballison opened a new pull request, #2787:
URL: https://github.com/apache/tika/pull/2787
Wire in recent encoding detector and junk detector work. Remove glm
> Improve charset detection in 4.x, take 2
> ----------------------------------------
>
> Key: TIKA-4720
> URL: https://issues.apache.org/jira/browse/TIKA-4720
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I had some really good luck with simple naive Bayes with careful scaling.
>
> This ticket covers the move to naive Bayes as the main charset detector. It
> also covers work to improve our default HTML charset detector to get some of
> the benefits of our StandardHtml charset detector without its rigidity.