[ https://issues.apache.org/jira/browse/LUCENE-7004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15137367#comment-15137367 ]
Jean-Baptiste Lespiau commented on LUCENE-7004:
-----------------------------------------------

I don't know the process for getting a patch committed to the code base. I imagine it needs to be reviewed, and I am well aware that reviewers likely have a lot of work. Do I have to do something? I'm just following up on this :)

> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-7004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7004
>             Project: Lucene - Core
>          Issue Type: Bug
>            Reporter: Jean-Baptiste Lespiau
>            Priority: Minor
>         Attachments: FIX-LUCENE-7004.PATCH, TEST-LUCENE-7004.PATCH, wdf-analysis.png
>
>
> When using the options PRESERVE_ORIGINAL|SPLIT_ON_CASE_CHANGE|CONCATENATE_ALL together with the WordDelimiterFilter, we get duplicate tokens on strings containing only case changes.
> When using the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc", "Def".
> With PRESERVE_ORIGINAL, we also keep "abcDef".
> However, when one uses CONCATENATE_ALL (or CATENATE_WORDS?), it also adds another token built from the concatenation of the split words, giving "abcDef" again.
> I'm not 100% certain that token filters should not produce duplicate tokens (same word, same start and end positions). Can someone confirm this is a bug?
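For reference, below is a minimal standalone sketch of the configuration described in the issue. It is not taken from the attached patches; it assumes the Lucene 5.x/6.x API of org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter, uses the flag constant CATENATE_ALL (the filter's actual name for "concatenate all"), adds GENERATE_WORD_PARTS so the case-change splits are emitted, and the class name Lucene7004Repro is purely illustrative:

    import java.io.StringReader;

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceTokenizer;
    import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;

    public class Lucene7004Repro {
      public static void main(String[] args) throws Exception {
        // Flags from the issue description; GENERATE_WORD_PARTS is added here
        // so that the case-change splits ("abc", "Def") are actually emitted.
        int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
            | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
            | WordDelimiterFilter.PRESERVE_ORIGINAL
            | WordDelimiterFilter.CATENATE_ALL;

        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        tokenizer.setReader(new StringReader("abcDef"));

        // Third constructor argument is the set of protected words; none here.
        TokenStream ts = new WordDelimiterFilter(tokenizer, flags, null);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

        ts.reset();
        while (ts.incrementToken()) {
          // Per the report, "abcDef" appears twice with identical offsets:
          // once from PRESERVE_ORIGINAL and once from CATENATE_ALL.
          System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset() + "]");
        }
        ts.end();
        ts.close();
      }
    }

Printing the terms with their offsets should reproduce the duplication described above: the original token and the catenated token are the same word with the same start and end offsets.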