[ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754238#comment-16754238 ]
Hudson commented on TIKA-2822:
------------------------------

UNSTABLE: Integrated in Jenkins build tika-2.x-windows #377 (See [https://builds.apache.org/job/tika-2.x-windows/377/])
TIKA-2822 -- update common tokens lists with 7.x Lucene. (tallison: rev cb3b8fde0b2c7134efdfdce5596bccf2cbc5489c)
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/BatchTopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/es
* (edit) CHANGES.txt
* (edit) tika-eval/src/main/resources/common_tokens/vi
* (edit) tika-eval/src/main/resources/common_tokens/el
* (edit) tika-eval/src/main/resources/common_tokens/ja
* (edit) tika-eval/src/main/resources/common_tokens/zh-tw
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/TopCommonTokenCounter.java
* (edit) tika-eval/src/main/resources/common_tokens/en
* (add) tika-eval/src/main/java/org/apache/tika/eval/tools/SlowCompositeReaderWrapper.java
* (edit) tika-eval/src/main/resources/common_tokens/it
* (edit) tika-eval/src/main/resources/common_tokens/fa
* (edit) tika-eval/src/main/resources/common_tokens/pt
* (edit) tika-eval/src/main/resources/common_tokens/ur
* (edit) tika-eval/src/main/resources/common_tokens/ko
* (add) tika-eval/src/main/resources/common_tokens/bn
* (edit) tika-eval/src/main/resources/lucene-analyzers.json
* (add) tika-eval/src/test/java/org/apache/tika/tools/TopCommonTokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/ar
* (edit) tika-eval/src/main/resources/common_tokens/nl
* (edit) tika-eval/src/main/resources/common_tokens/id
* (edit) tika-eval/src/main/resources/common_tokens/ru
* (edit) tika-eval/src/main/resources/common_tokens/zh-cn
* (edit) tika-eval/src/main/resources/common_tokens/de
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/AlphaIdeographFilterFactory.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/AnalyzerManagerTest.java
* (edit) tika-eval/src/test/java/org/apache/tika/eval/tokens/TokenCounterTest.java
* (edit) tika-eval/src/main/resources/common_tokens/he
* (edit) tika-eval/src/main/resources/common_tokens/fr
* (edit) tika-eval/src/main/resources/common_tokens/hi
* (edit) tika-eval/src/main/java/org/apache/tika/eval/tokens/CommonTokenCountManager.java

> Update common tokens files for tika-eval
> ----------------------------------------
>
>                 Key: TIKA-2822
>                 URL: https://issues.apache.org/jira/browse/TIKA-2822
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 1.21
>
>
> We initially created the common tokens files (top 20k tokens by document
> frequency) in Wikipedia with Lucene 6.x. We should rerun that code with an
> updated Lucene on the off chance that there are slight changes in
> tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)