On 7/14/2015 10:46 AM, Alessandro Benedetti wrote:
> Furthermore I was checking with Solr 5.1 to find the WDFilter factory
> actually to work in a proper way.
> Is it possible to know what was the conclusion for this issue ?
> Is there an issue in the WordDelimiter token filter in the current Solr
> version? Has it been fixed ?
> Any update ?

It appears that the problem is not with WDF alone. Something about the combination of filters that I have chosen is causing this, but only with certain kinds of input.

If I set up a minimal fieldType with the keyword tokenizer, then I cannot get the problem to reproduce:

<fieldType name="testType" class="solr.TextField"
    sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="1"
      splitOnNumerics="1"
      stemEnglishPossessive="1"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      preserveOriginal="1"
    />
  </analyzer>
</fieldType>

I tried with inputs of "aaa-bbb ccc" and "aaa-bbb: ccc" and everything worked as expected.

I then tried some other analysis combinations, looking for the minimal fieldType that shows the problem, and I finally hit on one. It's a combination of the ICUTokenizer with a custom rulefile, a pattern replace filter (PRF) that eats leading and trailing punctuation, and the WDF. That must be combined with input text that includes trailing punctuation: "aaa-bbb: ccc"

<fieldType name="testType" class="solr.TextField"
    sortMissingLast="true" positionIncrementGap="100">
  <analyzer>
    <tokenizer
      class="org.apache.lucene.analysis.icu.segmentation.ICUTokenizerFactory"
      rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>
    <filter class="solr.PatternReplaceFilterFactory"
      pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$"
      replacement="$2"
    />
    <filter class="solr.WordDelimiterFilterFactory"
      splitOnCaseChange="1"
      splitOnNumerics="1"
      stemEnglishPossessive="1"
      generateWordParts="1"
      generateNumberParts="1"
      catenateWords="1"
      catenateNumbers="1"
      catenateAll="0"
      preserveOriginal="1"
    />
  </analyzer>
</fieldType>

If the rulefile is not specified, then the problem doesn't occur, because the trailing punctuation is already gone by the time the token reaches the PRF. If the PRF isn't there, then the problem doesn't occur either.
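
As a rough illustration of what the tokenizer and PRF stages hand to the WDF for these inputs, here is a minimal Python sketch. It is not Solr code and does not reproduce the WDF or the problem itself. It rests on two assumptions: that the custom rulefile breaks only on whitespace (as its file name suggests), so `str.split()` stands in for the ICUTokenizer, and that Python's `string.punctuation` stands in for Java's `\p{Punct}`, which Python's `re` module does not support. The function names are made up for this example.

```python
import re
import string

# Assumption: string.punctuation (ASCII punctuation) stands in for Java's \p{Punct}.
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# Same shape as the PRF pattern: ^(\p{Punct}*)(.*?)(\p{Punct}*)$ with replacement $2
EDGE_PUNCT = re.compile(r"^(%s*)(.*?)(%s*)$" % (PUNCT_CLASS, PUNCT_CLASS))


def strip_edge_punct(token):
    """Keep only group 2, dropping leading and trailing punctuation."""
    return EDGE_PUNCT.sub(r"\2", token)


def tokens_before_wdf(text):
    """Whitespace split (stand-in for the rulefile), then the PRF step."""
    return [strip_edge_punct(tok) for tok in text.split()]


if __name__ == "__main__":
    print(tokens_before_wdf("aaa-bbb ccc"))   # ['aaa-bbb', 'ccc']
    print(tokens_before_wdf("aaa-bbb: ccc"))  # ['aaa-bbb', 'ccc']
```

Under these assumptions both inputs produce the same token text going into the WDF. The internal hyphen survives, and only the trailing colon is removed.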

So the problem might be with the rulefile, or with some strange combination of these analysis components. I did not build this rulefile myself. It was built by someone else, either Robert Muir or Steve Rowe if I remember right, when SOLR-4123 was underway. The normal settings for ICUTokenizer eliminate most of the things that WDF uses for making tokens, which is why I'm using this custom rulefile.

https://issues.apache.org/jira/browse/SOLR-4123

Any advice would be appreciated. I can make the .rbbi file available.

Thanks,
Shawn