On 8/21/2013 7:54 PM, Floyd Wu wrote:
> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>
> ST
> text         raw_bytes                           start  end  type        position
> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]  0      11   <ALPHANUM>  1
>
> How to make this string to be tokenized to these two tokens "Pacific",
> "Rim"?
> Set _ as stopword?
> Please kindly help on this.
> Many thanks.
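
Interesting. I thought that the StandardTokenizer always split on punctuation, but apparently that's not the case for the underscore character, so "Pacific_Rim" comes through as a single token.

You can always use the WordDelimiterFilter after the StandardTokenizer. Setting _ as a stopword won't help here, because stopword filtering removes whole tokens and does not split them.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Thanks,
Shawn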
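
P.S. For reference, here is a minimal sketch of what such a field type might look like in schema.xml. The underscore behavior is consistent with the Unicode word-break rules (UAX #29) that StandardTokenizer follows, which treat _ as a connector between letters. The field type name text_split_underscore is made up for illustration, and the attribute values are only one plausible starting point, so adjust them to your schema and test with the Analysis page before relying on it.

```xml
<!-- Hypothetical field type name; rename to fit your schema. -->
<fieldType name="text_split_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Keeps "Pacific_Rim" as one token, as seen in the analysis output above. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Splits on intra-word delimiters such as "_", producing "Pacific" and "Rim".
         splitOnCaseChange="0" is an assumption here, to avoid also splitting camelCase. -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            splitOnCaseChange="0"/>
    <!-- Lowercases after splitting, matching what StandardAnalyzer did to the original token. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a chain like this, "Pacific_Rim" should come out as the two tokens "pacific" and "rim". If you also want the joined form "pacificrim" to match, catenateWords="1" is the setting to look at.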