[ https://issues.apache.org/jira/browse/TIKA-2822?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison resolved TIKA-2822.
-------------------------------
    Resolution: Fixed

Removed common html entities and fixed code to do that automatically next time.

> Update common tokens files for tika-eval
> ----------------------------------------
>
>                 Key: TIKA-2822
>                 URL: https://issues.apache.org/jira/browse/TIKA-2822
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-eval
>            Reporter: Tim Allison
>            Assignee: Tim Allison
>            Priority: Trivial
>             Fix For: 1.21
>
>
> We initially created the common tokens files (top 20k tokens by document
> frequency) in Wikipedia with Lucene 6.x. We should rerun that code with an
> updated Lucene on the off chance that there are slight changes in
> tokenization.
> While doing this work, I found a trivial bug in filtering common tokens that
> we should fix as well.
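
For illustration only, the selection described in the issue (top tokens by document frequency, with HTML entity tokens filtered out) could be sketched roughly as below. This is not the tika-eval code: the regex tokenizer is a simplified stand-in for a Lucene analyzer, the entity list is an assumed sample, and the names `tokenize`, `HTML_ENTITY_TOKENS`, and `top_tokens_by_document_frequency` are hypothetical.

```python
import re
from collections import Counter

# Simplified stand-in for a Lucene analyzer (assumption): lowercase the text
# and split on non-word characters.
TOKEN_RE = re.compile(r"\w+")

# Assumed sample of entity names that leak into the token stream when markup
# such as "&nbsp;" or "&amp;" is tokenized. The list used in the actual fix
# is not given in the issue.
HTML_ENTITY_TOKENS = frozenset({"nbsp", "amp", "lt", "gt", "quot"})


def tokenize(text):
    """Return the lowercase word tokens of a text."""
    return TOKEN_RE.findall(text.lower())


def top_tokens_by_document_frequency(documents, limit=20000,
                                     stop_tokens=HTML_ENTITY_TOKENS):
    """Rank tokens by the number of documents they occur in."""
    doc_freq = Counter()
    for text in documents:
        # set() counts each token at most once per document, which is what
        # distinguishes document frequency from raw term frequency.
        doc_freq.update(set(tokenize(text)))
    # Drop the entity tokens before ranking so they cannot take up slots.
    for token in stop_tokens:
        doc_freq.pop(token, None)
    # Highest document frequency first; ties broken alphabetically so the
    # output is deterministic.
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


docs = ["The cat &amp; the dog", "The dog barked", "A cat&nbsp;slept"]
print(top_tokens_by_document_frequency(docs, limit=3))
# [('cat', 2), ('dog', 2), ('the', 2)]
```

Filtering before truncating to the limit matters here: if the entity tokens were removed after taking the top 20k, the resulting file would end up slightly short.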