Dan,

StandardTokenizer implements the word boundary rules from the Unicode Text Segmentation standard annex, UAX#29:
http://www.unicode.org/reports/tr29/#Word_Boundaries

Every character sequence within UAX#29 boundaries that contains a numeric or an alphabetic character is emitted as a term, and nothing else is emitted. Punctuation can be included within a term, e.g. "1,248.99" or "192.168.1.1".

To split on underscores, you can convert underscores to e.g. spaces by adding PatternReplaceCharFilterFactory to your analyzer:

   <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>

This replacement will be performed prior to StandardTokenizer, which will then see token-splitting spaces instead of underscores.

Steve

On Aug 22, 2013, at 10:23 PM, Dan Davis <dansm...@gmail.com> wrote:

> Ah, but what is the definition of punctuation in Solr?
>
>
> On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky
> <j...@basetechnology.com> wrote:
>
>> "I thought that the StandardTokenizer always split on punctuation, "
>>
>> Proving that you haven't read my book! The section on the standard
>> tokenizer details the rules that the tokenizer uses (in addition to
>> extensive examples.) That's what I mean by "deep dive."
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Shawn Heisey
>> Sent: Wednesday, August 21, 2013 10:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to avoid underscore sign indexing problem?
>>
>>
>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>>
>>> When using StandardAnalyzer to tokenize the string "Pacific_Rim", I get:
>>>
>>> text         raw_bytes                           start  end  type        position
>>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]  0      11   <ALPHANUM>  1
>>>
>>> How can this string be tokenized into the two tokens "Pacific" and "Rim"?
>>> Set _ as a stopword?
>>> Please kindly help with this.
>>> Many thanks.
>>>
>>
>> Interesting. I thought that the StandardTokenizer always split on
>> punctuation, but apparently that's not the case for the underscore
>> character.
>>
>> You can always use the WordDelimiterFilter after the StandardTokenizer.
>>
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>>
>> Thanks,
>> Shawn
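
[Editor's note: For reference, a minimal sketch of a field type wiring the char filter in front of StandardTokenizer, as Steve describes. The field type name "text_underscore_split" and the trailing LowerCaseFilterFactory are illustrative additions, not from the thread:

   <fieldType name="text_underscore_split" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <!-- Turn underscores into spaces before tokenization -->
       <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>
       <tokenizer class="solr.StandardTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
   </fieldType>

With this analyzer, "Pacific_Rim" is indexed as the two tokens "pacific" and "rim".

Shawn's alternative, WordDelimiterFilterFactory after the tokenizer, might look like the following sketch (the filter attributes shown are assumed defaults for word/number splitting, not taken from the thread):

   <tokenizer class="solr.StandardTokenizerFactory"/>
   <!-- Splits "pacific_rim" on the underscore delimiter -->
   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" splitOnCaseChange="0"/>
]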