Re: Tokenization at query time

Tanguy Moal Mon, 12 Aug 2013 03:37:40 -0700

Hello Andrea,
I think you face a rather common issue involving keyword tokenization and query 
parsing in Lucene:
The query parser splits the input query on whitespace before any analysis 
happens, and then each resulting token is analysed according to your configuration.
So those queries with a whitespace won't behave as expected because each token 
is analysed separately. Consequently, the catenated version of the reference 
cannot be generated.
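
To make the difference concrete, here is a small Python sketch. It is not Solr code: analyze() is a hypothetical stand-in that only approximates your chain (keyword tokenizer, lowercase, catenateAll), and parse_then_analyze() imitates a parser that splits on whitespace first.

```python
import re

def analyze(text: str) -> str:
    """Rough stand-in for the field's chain: the whole input is one token,
    it is lowercased, then delimiters are dropped and the parts joined."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())

def parse_then_analyze(query: str) -> list:
    """Imitates the query parser: split on whitespace, analyse each piece."""
    return [analyze(piece) for piece in query.split()]

indexed_term = analyze("Mag. 778 G 69")           # what ends up in the index
per_token = parse_then_analyze("Mag. 778 G 69")   # unquoted, unescaped query
whole = analyze("Mag. 778 G 69")                  # quoted or escaped query

print(indexed_term)  # mag778g69
print(per_token)     # ['mag', '778', 'g', '69']
print(indexed_term in per_token, indexed_term == whole)  # False True
```

None of the separately analysed pieces equals the indexed term, which is consistent with the four separate clauses in your debug output.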
I think you could try surrounding your query with double quotes or escaping the 
space characters in your query using a backslash so that the whole sequence 
reaches the analyser as a single unit and the catenation occurs.
You should be aware that this approach has a drawback: you will probably not be 
able to combine the search for Mag. 778 G 69 with other words in other fields 
unless you are able to identify which spaces are to be escaped:
For example, if the input query is:
Awesome Mag. 778 G 69
you would want to transform it to:
Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
or
Awesome "Mag. 778 G 69" // only the reference is turned into a phrase query
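
A minimal Python sketch of that rewriting, assuming the application already knows which part of the input is the reference (that is the hard part, and it is not solved here). protect_reference is a hypothetical helper name, not a Solr API.

```python
def protect_reference(query: str, reference: str, as_phrase: bool = False) -> str:
    """Rewrite only the known reference inside the query; other words are untouched."""
    if as_phrase:
        # Turn the reference into a phrase query, escaping embedded double quotes.
        protected = '"' + reference.replace('"', '\\"') + '"'
    else:
        # Backslash-escape each space so the parser does not split on it.
        protected = reference.replace(" ", "\\ ")
    return query.replace(reference, protected)

print(protect_reference("Awesome Mag. 778 G 69", "Mag. 778 G 69"))
# Awesome Mag.\ 778\ G\ 69
print(protect_reference("Awesome Mag. 778 G 69", "Mag. 778 G 69", as_phrase=True))
# Awesome "Mag. 778 G 69"
```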


Do you get the point?

Look at the differences between what you tried and the following examples which 
should all do what you want:
http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
OR
http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
OR
http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
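
If you build these URLs from code rather than by hand, something like the following Python sketch may help. It uses only the standard library; select_url is a hypothetical helper, and the base URL is simply the one from the examples above. Note that urlencode also percent-encodes the backslash (as %5C) and the colon (as %3A); the server should decode these back to the same query string.

```python
from urllib.parse import urlencode, quote

BASE = "http://localhost:8983/solr/collection1/select"  # taken from the examples above

def select_url(q, **params):
    """Build a select URL; quote_via=quote encodes spaces as %20 rather than '+'."""
    return BASE + "?" + urlencode({"q": q, **params}, quote_via=quote)

# Phrase query through dismax
phrase = select_url('"Mag. 778 G 69"', debugQuery="on", qf="text myfield", defType="dismax")

# Escaped spaces against a single field with the default parser
escaped = select_url("myfield:Mag.\\ 778\\ G\\ 69", debugQuery="on")

print(phrase)
print(escaped)
```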

I hope this helps

Tanguy

On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <andrea.gazzar...@gmail.com> 
wrote:

> Hi all,
> I have a field (among others) in my schema defined like this:
> 
> <fieldtype name="mytype" class="solr.TextField" positionIncrementGap="100">
>    <analyzer>
>        <tokenizer class="solr.KeywordTokenizerFactory" />
>        <filter class="solr.LowerCaseFilterFactory" />
>        <filter class="solr.WordDelimiterFilterFactory"
>            generateWordParts="0"
>            generateNumberParts="0"
>            catenateWords="0"
>            catenateNumbers="0"
>            catenateAll="1"
>            splitOnCaseChange="0" />
>    </analyzer>
> </fieldtype>
> 
> <field name="myfield" type="mytype" indexed="true"/>
> 
> Basically, both at index and query time the field value is normalized like 
> this.
> 
> Mag. 778 G 69 => mag778g69
> 
> Now, in my solrconfig I'm using a search handler like this:
> 
> <requestHandler ....>
>    ...
>    <str name="defType">dismax</str>
>    ...
>    <str name="mm">100%</str>
>    <str name="qf">myfield^3000</str>
>    <str name="pf">myfield^30000</str>
> 
> </requestHandler>
> 
> What I'm expecting is that if I index a document with a value for my field 
> "Mag. 778 G 69", I will be able to get this document by querying
> 
> 1. Mag. 778 G 69
> 2. mag 778 g69
> 3. mag778g69
> 
> But that doesn't work: I'm able to get the document if and only if I use 
> the "normalized" form: mag778g69
> 
> After doing a little bit of debugging, I see that, even though I used a 
> KeywordTokenizer in my field type declaration, Solr is doing something like this:
> 
> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1) 
> DisjunctionMaxQuery((myfield:778^3000.0)~0.1) 
> DisjunctionMaxQuery((myfield:g^3000.0)~0.1) 
> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4) 
> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
> 
> That is, it is tokenizing the original query string (mag + 778 + g + 69) and 
> obviously querying the field for separate tokens doesn't match anything (at 
> least this is what I think)
> 
> Could anybody please explain that to me?
> 
> Thanks in advance
> Andrea
