[ 
https://issues.apache.org/jira/browse/TIKA-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18075841#comment-18075841
 ] 

ASF GitHub Bot commented on TIKA-4720:
--------------------------------------

tballison opened a new pull request, #2785:
URL: https://github.com/apache/tika/pull/2785

   The GLM was part of the previous attempt at a second-pass junk detector to 
improve charset detection. It did not perform as hoped, though it may work with 
some tweaks.




> Improve charset detection in 4.x, take 2
> ----------------------------------------
>
>                 Key: TIKA-4720
>                 URL: https://issues.apache.org/jira/browse/TIKA-4720
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> I had some really good luck with simple Naive Bayes with careful scaling.
>  
> This ticket covers the move to that approach as the main charset detector. 
> It also covers work to improve our default HTML charset detector to get 
> some of the benefits of our StandardHtml charset detector without its 
> rigidity.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
