[ 
https://issues.apache.org/jira/browse/TIKA-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083615#comment-18083615
 ] 

ASF GitHub Bot commented on TIKA-4731:
--------------------------------------

tballison commented on PR #2839:
URL: https://github.com/apache/tika/pull/2839#issuecomment-4548138717

   :popcorn: @copilot 
   
   Thank you @THausherr !




> Ongoing improvements to the junk detector
> -----------------------------------------
>
>                 Key: TIKA-4731
>                 URL: https://issues.apache.org/jira/browse/TIKA-4731
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> With https://github.com/apache/tika/pull/2818, I think we have a decent 
> shape for the junk detector. 
> There are several areas for improvement, but I think it is ready to go.
> This ticket tracks follow-on work, including:
>  * Smaller model
>  * Handling pathological code block changes
>  * Handling candidates with different character counts
>  * Other items to be discovered in our commoncrawl/govdocs1 corpus?
> We have some coverage for the middle two items, but need to address those more 
> directly.
> This work is not a blocker on the 4.0.0-beta-1 release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
