[ https://issues.apache.org/jira/browse/SOLR-8606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lespiau updated SOLR-8606:
--------------------------
    Attachment:     (was: LUCENE-8686-TEST.patch)

> Duplicate tokens using WordDelimiterFilter for a specific configuration
> -----------------------------------------------------------------------
>
>                 Key: SOLR-8606
>                 URL: https://issues.apache.org/jira/browse/SOLR-8606
>             Project: Solr
>          Issue Type: Bug
>            Reporter: Lespiau
>            Priority: Minor
>         Attachments: SOLR-8686-TEST.patch
>
>
> When the WordDelimiterFilter is configured with both
> PRESERVE_ORIGINAL | SPLIT_ON_CASE_CHANGE and CATENATE_ALL | CATENATE_WORDS,
> it produces duplicate tokens for strings whose only delimiters are case
> changes.
> With the SPLIT_ON_CASE_CHANGE option, "abcDef" is split into "abc" and
> "Def".
> With PRESERVE_ORIGINAL, the original token "abcDef" is kept as well.
> However, when CATENATE_ALL or CATENATE_WORDS is also set, the filter adds
> another token built by concatenating the split words, which gives
> "abcDef" again.
> I'm not 100% certain that token filters are required to avoid duplicate
> tokens (same word, same start and end positions). Can someone confirm this
> is a bug?
> I have attached a patch with a test that exposes the incorrect behavior.
> I'm willing to work on a fix in the coming days.
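
The duplicate described in the issue can be illustrated with a small toy model. This is only a sketch, not Lucene's WordDelimiterFilter: the names `Token`, `split_on_case_change`, `toy_word_delimiter` and `drop_duplicates` are hypothetical, the model tracks character offsets only, and it handles ASCII lower-to-upper case changes and nothing else.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    start: int  # start offset in the input
    end: int    # end offset in the input (exclusive)


def split_on_case_change(text: str) -> List[Token]:
    """Split at each ASCII lower-to-upper transition: "abcDef" -> "abc", "Def"."""
    cuts = [m.start() for m in re.finditer(r"(?<=[a-z])(?=[A-Z])", text)]
    bounds = [0] + cuts + [len(text)]
    return [Token(text[a:b], a, b) for a, b in zip(bounds, bounds[1:])]


def toy_word_delimiter(text: str, preserve_original: bool,
                       catenate_all: bool) -> List[Token]:
    """Toy model of the behavior described in the issue (an assumption,
    not the real filter): emit the original, the parts, and the catenation."""
    parts = split_on_case_change(text)
    if len(parts) == 1:
        # Nothing to split, so there is a single token and no catenation.
        return parts
    tokens: List[Token] = []
    if preserve_original:
        tokens.append(Token(text, 0, len(text)))
    tokens.extend(parts)
    if catenate_all:
        joined = "".join(p.text for p in parts)
        # When case changes are the only delimiters, joined == text.
        tokens.append(Token(joined, parts[0].start, parts[-1].end))
    return tokens


def drop_duplicates(tokens: List[Token]) -> List[Token]:
    """Keep the first occurrence of each (text, start, end) triple."""
    seen = set()
    unique = []
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            unique.append(tok)
    return unique


if __name__ == "__main__":
    for tok in toy_word_delimiter("abcDef", preserve_original=True,
                                  catenate_all=True):
        print(tok)
```

In this model, "abcDef" with both options enabled yields the token ("abcDef", 0, 6) twice, once as the preserved original and once as the catenation. Passing the result through `drop_duplicates` is one possible way to collapse the two; whether the real filter should do something similar is the question the issue raises.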