Re: Tokenization and wild card search

Erick Erickson Tue, 19 Jan 2010 07:31:09 -0800

I'm pretty sure you're going to be disappointed about
the re-indexing part.


I'm pretty sure that WordDelimiterFilterFactory is tokenizing
your input in ways you don't expect, making your use-case
hard to accomplish.

It's basically splitting your input on non-alphanumeric characters
such as the underscore, so the terms you're indexing are not the
literal underscore-joined words; see
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

I'd *strongly* suggest you examine the results of your indexing
in order to understand what's possible.

Get a copy of luke and examine your index or use the
SOLR admin Analysis page...

I suspect what you're really looking for is WhitespaceAnalyzer
or Keyword
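
Something along these lines is what I have in mind. This is an
untested sketch, not a drop-in config: the type name "text_exact" is
made up for illustration, and you would have to re-index after
changing the field's type. It only splits on whitespace, so
"SDD_Expedition_PCB" should stay a single term and a prefix query like
MyField:SDD_Expedition* has something to match against.

```xml
<!-- Hypothetical field type: split on whitespace only, no
     WordDelimiterFilter and no stemming, so underscore-joined
     words are indexed as single terms. -->
<fieldType name="text_exact" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- The Keywords field from your schema, pointed at the type above. -->
<field name="Keywords" type="text_exact" indexed="true" stored="false"
       required="false" multiValued="true"/>
```

Matching is case-sensitive with this setup. As far as I know, wildcard
terms are not run through the analyzer, so if you add a
LowerCaseFilterFactory you would need to lowercase your wildcard
queries yourself.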

On Tue, Jan 19, 2010 at 9:50 AM, <johnmu...@aol.com> wrote:

>
>
> I want the following searches to work:
>
>  MyField:SDD_Expedition_PCB
>
> This should match the word "SDD_Expedition_PCB" only, and not matching
> individual words such as "SDD" or "Expedition", or "PCB".
>
> And the following search:
>
>  MyField:SDD_Expedition*
>
> Should match any word starting with "SDD_Expedition" and ending with
> anything else such as "SDD_Expedition_PBC", "SDD_Expedition_One",
> "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc,
> but not matching individual words such as "SDD" or "Expedition".
>
>
> The field type for "MyField" is (the field name is keywords):
>
>    <field name="Keywords" type="text" indexed="true" stored="false"
> required="false" multiValued="true"></field>
>
> And here is the analyzer I'm using:
>
>    <fieldType name="text" class="solr.TextField"
> positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
> synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- <filter class="solr.SynonymFilterFactory"
> synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
> protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
> Any help on how I can achieve the above is greatly appreciated.
>
> Btw, if at all possible, I would like to be able to achieve this search
> without having to change how I'm indexing / tokenizing the data.  I'm
> looking for search syntax to make this work.
>
> -- JM
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Tuesday, January 19, 2010 7:57 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Tokenization and wild card search
>
> > I have an issue and I'm not sure how to address it, so I
> > hope someone can help me.
> >
> > I have the following text in one of my fields:
> > "ABC_Expedition_ERROR". When I search on it
> > like: "MyField:SDD_Expedition_PCB" (without quotes) it will
> > fail to find me only this word "ABC_Expedition_ERROR"
> > which I think is due to tokenization because of the
> > underscore.
>
> Do you want or do not want your query MyField:SDD_Expedition_PCB to return
> documents containing ABC_Expedition_ERROR?
>
> > My solution is: "MyField:"SDD_Expedition_PCB"" (without the
> > outer quotes, but quotes around the word
> > "ABC_Expedition_ERROR"). This works fine.
> > But then, how do I search on "SDD_Expedition_PCB" with wild
> > card? For example: "MyField:SDD_Expedition*" will not
> > work.
>
> Can you paste your field type of MyField? And give some examples what
> queries should return what documents.
>
>
