Dan,

StandardTokenizer implements the word boundary rules from the Unicode Text 
Segmentation standard annex UAX#29:

   http://www.unicode.org/reports/tr29/#Word_Boundaries

Every character sequence within UAX#29 boundaries that contains a numeric or an 
alphabetic character is emitted as a term, and nothing else is emitted.

Punctuation can be included within a term, e.g. "1,248.99" or "192.168.1.1".
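As a rough illustration only (this is NOT the real UAX#29 algorithm, just a simplified regex sketch), the behavior can be approximated in Python: punctuation is kept when it sits between word characters, and underscore, being a word character, never splits the token:

```python
import re

def approx_standard_tokenize(text):
    """Rough approximation of the tokenizing behavior described above.

    Not the real UAX#29 word-boundary rules -- just an illustration that
    punctuation between alphanumerics is retained ("1,248.99"), while
    underscore (a word character in \\w) does not split the token.
    """
    return re.findall(r"\w+(?:[.,:]\w+)*", text)

print(approx_standard_tokenize("Pacific_Rim"))            # ['Pacific_Rim']
print(approx_standard_tokenize("1,248.99 192.168.1.1"))   # ['1,248.99', '192.168.1.1']
```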

To split on underscores, you can convert underscores to e.g. spaces by adding 
PatternReplaceCharFilterFactory to your analyzer:

    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>

This replacement will be performed prior to StandardTokenizer, which will then 
see token-splitting spaces instead of underscores.
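Put together, a field type using this approach might look like the following sketch (the field type name and the trailing LowerCaseFilterFactory are illustrative additions, not requirements):

```xml
<fieldType name="text_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Runs before tokenization: underscores become spaces -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="_" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this analyzer, "Pacific_Rim" is rewritten to "Pacific Rim" before StandardTokenizer runs, so it produces the two tokens "pacific" and "rim".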

Steve

On Aug 22, 2013, at 10:23 PM, Dan Davis <dansm...@gmail.com> wrote:

> Ah, but what is the definition of punctuation in Solr?
> 
> 
> On Wed, Aug 21, 2013 at 11:15 PM, Jack Krupansky 
> <j...@basetechnology.com> wrote:
> 
>> "I thought that the StandardTokenizer always split on punctuation, "
>> 
>> Proving that you haven't read my book! The section on the standard
>> tokenizer details the rules that the tokenizer uses (in addition to
>> extensive examples.) That's what I mean by "deep dive."
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Shawn Heisey
>> Sent: Wednesday, August 21, 2013 10:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: How to avoid underscore sign indexing problem?
>> 
>> 
>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>> 
>>> When using StandardAnalyzer to tokenize string "Pacific_Rim" will get
>>> 
>>> ST
>>> text         raw_bytes                           start  end  type        position
>>> pacific_rim  [70 61 63 69 66 69 63 5f 72 69 6d]  0      11   <ALPHANUM>  1
>>> 
>>> How to make this string to be tokenized to these two tokens "Pacific",
>>> "Rim"?
>>> Set _ as stopword?
>>> Please kindly help on this.
>>> Many thanks.
>>> 
>> 
>> Interesting.  I thought that the StandardTokenizer always split on
>> punctuation, but apparently that's not the case for the underscore
>> character.
>> 
>> You can always use the WordDelimiterFilter after the StandardTokenizer.
>> 
>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>> 
>> Thanks,
>> Shawn
>> 
