[
https://issues.apache.org/jira/browse/TIKA-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18075841#comment-18075841
]
ASF GitHub Bot commented on TIKA-4720:
--------------------------------------
tballison opened a new pull request, #2785:
URL: https://github.com/apache/tika/pull/2785
The GLM was part of the previous attempt at a second-pass junk detector to
improve charset detection. It did not perform as hoped, though it may work
with some tweaks.
> Improve charset detection in 4.x, take 2
> ----------------------------------------
>
> Key: TIKA-4720
> URL: https://issues.apache.org/jira/browse/TIKA-4720
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> I had some really good luck with simple naive Bayes with careful scaling.
>
> This ticket covers the move to naive Bayes as the main charset detector. It
> also covers work to improve our default HTML charset detector to get some of
> the benefits of our StandardHtml charset detector without its rigidity.