Alright, thanks for all your help. I finally fixed this problem using
PatternReplaceFilterFactory + WordDelimiterFilterFactory.

I first replace the _ (underscore) using PatternReplaceFilterFactory, and
then use WordDelimiterFilterFactory to generate the word and number parts to
increase user search hits. Although this decreases precision a little, our
users need a higher recall rate more than precision.
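For reference, here is a sketch of the kind of field type I described (the
field type name, pattern, and exact filter parameters are illustrative, not
my actual schema):

```xml
<!-- Sketch only: names and parameters are illustrative guesses. -->
<fieldType name="text_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Strip underscores inside tokens, e.g. "2DA012_ISO" -> "2DA012ISO" -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="_" replacement="" replace="all"/>
    <!-- Generate word and number parts so a query for "2DA012" can match -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            splitOnCaseChange="0" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The same analyzer applies at both index and query time here, so the
underscore is handled consistently on both sides.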

Thank you all.

Floyd





2013/8/22 Floyd Wu <floyd...@gmail.com>

> After trying some search cases and different parameter combinations of
> WordDelimiterFilter, I wonder: what is the best strategy to index the
> string "2DA012_ISO MARK 2" so that it can be found by the term "2DA012"?
>
> What if I just want the _ to be removed at both query and index time? What
> should I configure, and how?
>
> Floyd
>
>
>
> 2013/8/22 Floyd Wu <floyd...@gmail.com>
>
>> Thank you all.
>> By the way, Jack, I'm going to buy your book. Where can I buy it?
>> Floyd
>>
>>
>> 2013/8/22 Jack Krupansky <j...@basetechnology.com>
>>
>>> "I thought that the StandardTokenizer always split on punctuation, "
>>>
>>> Proving that you haven't read my book! The section on the standard
>>> tokenizer details the rules that the tokenizer uses (in addition to
>>> extensive examples.) That's what I mean by "deep dive."
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message----- From: Shawn Heisey
>>> Sent: Wednesday, August 21, 2013 10:41 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: How to avoid underscore sign indexing problem?
>>>
>>>
>>> On 8/21/2013 7:54 PM, Floyd Wu wrote:
>>>
>>>> When using StandardAnalyzer to tokenize the string "Pacific_Rim" I get
>>>>
>>>> ST (StandardTokenizer output):
>>>> text: pacific_rim
>>>> raw_bytes: [70 61 63 69 66 69 63 5f 72 69 6d]
>>>> start: 0, end: 11, type: <ALPHANUM>, position: 1
>>>>
>>>> How can I make this string tokenize into the two tokens "Pacific" and
>>>> "Rim"? Should I set _ as a stopword?
>>>> Please kindly help with this.
>>>> Many thanks.
>>>>
>>>
>>> Interesting.  I thought that the StandardTokenizer always split on
>>> punctuation, but apparently that's not the case for the underscore
>>> character.
>>>
>>> You can always use the WordDelimiterFilter after the StandardTokenizer.
>>>
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
>>>
>>> Thanks,
>>> Shawn
>>>
>>
>>
>
