[ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754238#comment-16754238 ]

Hudson commented on TIKA-2822:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #377 (See [https://builds.apache.org/job/tika-2.x-windows/377/])
TIKA-2822 -- update common tokens lists with 7.x Lucene. (tallison: rev cb3b8fde0b2c7134efdfdce5596bccf2cbc5489c)
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/BatchTopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/es
* (edit) CHANGES.txt
* (edit) tika-eval/src/main/resources/common_tokens/vi
* (edit) tika-eval/src/main/resources/common_tokens/el
* (edit) tika-eval/src/main/resources/common_tokens/ja
* (edit) tika-eval/src/main/resources/common_tokens/zh-tw
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/TopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/en
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/SlowCompositeReaderWrapper.java
* (edit) tika-eval/src/main/resources/common_tokens/it
* (edit) tika-eval/src/main/resources/common_tokens/fa
* (edit) tika-eval/src/main/resources/common_tokens/pt
* (edit) tika-eval/src/main/resources/common_tokens/ur
* (edit) tika-eval/src/main/resources/common_tokens/ko
* (add) tika-eval/src/main/resources/common_tokens/bn
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (add) tika-eval/src/test/java/org/apache/tika/tools/TopCommonTokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/ar
* (edit) tika-eval/src/main/resources/common_tokens/nl
* (edit) tika-eval/src/main/resources/common_tokens/id
* (edit) tika-eval/src/main/resources/common_tokens/ru
* (edit) tika-eval/src/main/resources/common_tokens/zh-cn
* (edit) tika-eval/src/main/resources/common_tokens/de
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/he
* (edit) tika-eval/src/main/resources/common_tokens/fr
* (edit) tika-eval/src/main/resources/common_tokens/hi
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java


> Update common tokens files for tika-eval
> ----------------------------------------
>
>                 Key: TIKA-2822
>                 URL: https://issues.apache.org/jira/browse/TIKA-2822
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 1.21
>
>
> We initially created the common tokens files (the top 20k tokens by document
> frequency in Wikipedia) with Lucene 6.x. We should rerun that code with an
> updated Lucene in case tokenization has changed slightly.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
