Re: Tokenization and wild card search

Erick Erickson Tue, 19 Jan 2010 08:54:05 -0800

What I suspect would work is phrase queries with no slop.
Unfortunately, to get this to work right you need wildcards
inside phrases, which is NOT supported out of the box.


However, see SOLR-1604 for patches that address this...

http://issues.apache.org/jira/browse/SOLR-1604

HTH
Erick

P.S. Are you absolutely sure you can't re-index <G>.....
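
For anyone who can re-index: a minimal sketch of the whitespace-only analysis suggested in the quoted message below might look like the following. The type name "text_ws_exact" is hypothetical (it is not from the original schema), and this is an illustration under those assumptions rather than a tested configuration.

```xml
<!-- Hypothetical field type: split on whitespace only, so a value such as
     SDD_Expedition_PCB is indexed as one token, underscores included. -->
<fieldType name="text_ws_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- Same field as in the original post, pointed at the hypothetical type. -->
<field name="Keywords" type="text_ws_exact" indexed="true" stored="false"
       required="false" multiValued="true"/>
```

With tokens left intact, MyField:SDD_Expedition_PCB should match only the whole word, and MyField:SDD_Expedition* should behave as a plain prefix query. Matching would be case-sensitive, since no LowerCaseFilterFactory is applied in this sketch.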


On Tue, Jan 19, 2010 at 11:11 AM, <johnmu...@aol.com> wrote:

>
>
> You are correct, the way I'm using tokenization is my issue.  It's too late
> to re-index now, which is why I'm looking for a search syntax that will
> make the search work.
>
> I have tried various search syntaxes with no luck.  Is there no search syntax
> to make this work without re-indexing?!
>
> -- JM
>
>
> -----Original Message-----
> From: Erick Erickson <erickerick...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, Jan 19, 2010 10:30 am
> Subject: Re: Tokenization and wild card search
>
>
> I'm pretty sure you're going to be disappointed about
> the re-indexing part.
> I'm pretty sure that WordDelimiterFilterFactory is tokenizing
> your input in ways you don't expect, making your use-case
> hard to accomplish.
> It's basically splitting your input on all non-alpha characters,
> so you're indexing see
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
> I'd *strongly* suggest you examine the results of your indexing
> in order to understand what's possible.
> Get a copy of Luke and examine your index or use the
> SOLR admin Analysis page...
> I suspect what you're really looking for is WhitespaceAnalyzer
> or Keyword
>
> On Tue, Jan 19, 2010 at 9:50 AM, <johnmu...@aol.com> wrote:
> >
>
>  I want the following searches to work:
>
>  MyField:SDD_Expedition_PCB
>
>  This should match the word "SDD_Expedition_PCB" only, and not matching
>  individual words such as "SDD" or "Expedition", or "PCB".
>
>  And the following search:
>
>  MyField:SDD_Expedition*
>
>  Should match any word starting with "SDD_Expedition" and ending with
>  anything else such as "SDD_Expedition_PBC", "SDD_Expedition_One",
>  "SDD_Expedition_Two", "SDD_ExpeditionSolr", "SDD_ExpeditionSolr1.4", etc,
>  but not matching individual words such as "SDD" or "Expedition".
>
>
>  The field definition for "MyField" is (the actual field name is Keywords):
>
>    <field name="Keywords" type="text" indexed="true" stored="false"
>  required="false" multiValued="true"></field>
>
>  And here is the analyzer I'm using:
>
>    <fieldType name="text" class="solr.TextField"
>  positionIncrementGap="100">
>      <analyzer type="index">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- in this example, we will only use synonyms at query time
>        <filter class="solr.SynonymFilterFactory"
>  synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>        -->
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>  words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>  generateWordParts="0" generateNumberParts="1" catenateWords="1"
>  catenateNumbers="1" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
>  protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>      <analyzer type="query">
>        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>        <!-- <filter class="solr.SynonymFilterFactory"
>  synonyms="synonyms.txt" ignoreCase="true" expand="true"/> -->
>        <filter class="solr.StopFilterFactory" ignoreCase="true"
>  words="stopwords.txt"/>
>        <filter class="solr.WordDelimiterFilterFactory"
>  generateWordParts="0" generateNumberParts="1" catenateWords="1"
>  catenateNumbers="1" catenateAll="0"/>
>        <filter class="solr.LowerCaseFilterFactory"/>
>        <filter class="solr.EnglishPorterFilterFactory"
>  protected="protwords.txt"/>
>        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
>      </analyzer>
>    </fieldType>
>
>  Any help on how I can achieve the above is greatly appreciated.
>
>  Btw, if at all possible, I would like to be able to achieve this search
>  without having to change how I'm indexing / tokenizing the data.  I'm
>  looking for search syntax to make this work.
>
>  -- JM
>
>  -----Original Message-----
>  From: Ahmet Arslan [mailto:iori...@yahoo.com]
>  Sent: Tuesday, January 19, 2010 7:57 AM
>  To: solr-user@lucene.apache.org
>  Subject: Re: Tokenization and wild card search
>
>  > I have an issue and I'm not sure how to address it, so I
>  > hope someone can help me.
>  >
>  > I have the following text in one of my fields:
>  > "ABC_Expedition_ERROR".  When I search on it
>  > like: "MyField:SDD_Expedition_PCB" (without quotes) it will
>  > fail to find me only this word "ABC_Expedition_ERROR"
>  > which I think is due to tokenization because of the
>  > underscore.
>
>  Do you want or do not want your query MyField:SDD_Expedition_PCB to return
>  documents containing ABC_Expedition_ERROR?
>
>  > My solution is: "MyField:"SDD_Expedition_PCB"" (without the
>  > outer quotes, but quotes around the word
>  > "ABC_Expedition_ERROR").  This works fine.
>  > But then, how do I search on "SDD_Expedition_PCB" with wild
>  > card?  For example: "MyField:SDD_Expedition*" will not
>  > work.
>
>  Can you paste your field type of MyField? And give some examples what
>  queries should return what documents.
>
>
>
>
