[ 
https://issues.apache.org/jira/browse/TIKA-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18063660#comment-18063660
 ] 

ASF GitHub Bot commented on TIKA-4659:
--------------------------------------

tballison closed pull request #2604: TIKA-4659 

> Add tika-eval-lite for embedded junk detection
> ----------------------------------------------
>
>                 Key: TIKA-4659
>                 URL: https://issues.apache.org/jira/browse/TIKA-4659
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Trivial
>
> We have the out-of-vocabulary (OOV) statistic in tika-eval, which requires 
> lists of 20k words for each of 120+ languages. It would be useful to have 
> something lighter weight for use in charset detectors and/or parsers. 
> If we use a simple bigram model, we'd be able to run comparative stats: does 
> this text run score better as rtl or ltr in a PDF (at parse time), or which 
> candidate decoding scores better in encoding detection? We couldn't easily 
> get a "this is junk" score by itself, but the comparison part would be 
> really useful.
> We can generate bigram stats from the original tika-eval word lists 
> trivially. 
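
The comparative idea in the description can be illustrated with a minimal sketch. This is not the tika-eval-lite implementation: the class name BigramScorer, its methods, the tiny word list, and the choice of add-one smoothing are all assumptions made for illustration. It builds character bigram counts from a word list, then compares two candidate readings of the same text by mean smoothed log probability. For brevity it works on UTF-16 chars rather than code points.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: a character bigram model trained from a word list.
final class BigramScorer {
    private final Map<String, Integer> bigrams = new HashMap<>();
    private final Map<Character, Integer> firsts = new HashMap<>();
    private final Set<Character> alphabet = new HashSet<>();

    BigramScorer(List<String> words) {
        for (String w : words) {
            // Pad with spaces so word-initial and word-final bigrams are counted.
            String padded = " " + w.toLowerCase(Locale.ROOT) + " ";
            for (int i = 0; i + 1 < padded.length(); i++) {
                bigrams.merge(padded.substring(i, i + 2), 1, Integer::sum);
                firsts.merge(padded.charAt(i), 1, Integer::sum);
            }
            for (char c : padded.toCharArray()) {
                alphabet.add(c);
            }
        }
    }

    // Mean add-one-smoothed log probability per bigram; higher is more word-like.
    double score(String text) {
        String padded = " " + text.toLowerCase(Locale.ROOT).trim() + " ";
        int v = alphabet.size() + 1; // +1 reserves mass for unseen characters
        double sum = 0.0;
        int n = 0;
        for (int i = 0; i + 1 < padded.length(); i++) {
            int pair = bigrams.getOrDefault(padded.substring(i, i + 2), 0);
            int first = firsts.getOrDefault(padded.charAt(i), 0);
            sum += Math.log((pair + 1.0) / (first + v));
            n++;
        }
        return sum / n;
    }

    // Comparative use only: returns whichever candidate scores higher.
    String better(String a, String b) {
        return score(a) >= score(b) ? a : b;
    }

    public static void main(String[] args) {
        BigramScorer scorer = new BigramScorer(
                List.of("the", "then", "there", "other", "that", "hat", "heat"));
        // The same text run read in two directions.
        System.out.println(scorer.better("the other", "eht rehto"));
    }
}
```

As the description notes, the absolute score is not a reliable junk indicator on its own; the sketch only ranks two candidates against each other.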



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
