Re: Tokenization at query time

Andrea Gazzarini Tue, 13 Aug 2013 06:12:29 -0700

Hi Erick,

Sorry if that wasn't clear: this is what I'm actually observing in my application.

I wrote the first post after looking at the explain output (debugQuery=true). The query


q=mag 778 G 69

is translated as follows:

+((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
   DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)

It seems that although I declare myfield with this type

<fieldtype name="type1" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="0" generateNumberParts="0"
            catenateWords="0" catenateNumbers="0"
            catenateAll="1" splitOnCaseChange="0" />
    </analyzer>
</fieldtype>

SOLR is tokenizing the query anyway, producing several tokens (mag, 778, g, 69).

And I can't put double quotes around the query (q="mag 778 G 69") because the request handler also searches in other fields (with different analysis chains).

As I understand it, the query parser (i.e. at query time) does a whitespace tokenization on its own before invoking my (query-time) chain. The same doesn't happen at index time. This is my problem: at index time the field is analyzed exactly as I want, but unfortunately I cannot say the same at query time.
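
To make the difference concrete, here is a toy model in Python. It is not Solr code: the `analyze` function is only an assumed stand-in for the type1 chain (keyword tokenizer, lower-casing, catenateAll), and all function names are hypothetical.

```python
import re


def analyze(text: str) -> str:
    """Toy stand-in for the type1 chain: keep the input as one token,
    lower-case it, then concatenate its alphanumeric parts."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def analyze_whole(value: str) -> list[str]:
    """Index-time behaviour: the whole value reaches the chain at once."""
    return [analyze(value)]


def analyze_per_chunk(query: str) -> list[str]:
    """Query-time behaviour described above: the parser splits on
    whitespace first, then each chunk goes through the chain alone."""
    return [analyze(chunk) for chunk in query.split()]


indexed_terms = analyze_whole("Mag. 778 G 69")
query_terms = analyze_per_chunk("mag 778 G 69")

print(indexed_terms)  # ['mag778g69']
print(query_terms)    # ['mag', '778', 'g', '69']

# With mm=100% every query term must match, but none of them is indexed.
print(all(term in indexed_terms for term in query_terms))  # False
```

In this model the catenated term can only be produced when the whole string reaches the chain in one piece, which is what happens at index time.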


Sorry for my wonderful English, did you get the point?

On 08/13/2013 02:18 PM, Erick Erickson wrote:

On a quick scan I don't see a problem here. Attach
&debug=query to your url and that'll show you the
parsed query, which will in turn show you what's been
pushed through the analysis chain you've defined.

You haven't stated whether you've tried this and it's
not working or you're looking for guidance as to how
to accomplish this so it's a little unclear how to
respond.

BTW, the admin/analysis page is your friend here....

Best
Erick


On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
andrea.gazzar...@gmail.com> wrote:

Clear, thanks for the response.

So, if I have two field types

<fieldtype name="type1" class="solr.TextField">
     <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory" />
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="0" generateNumberParts="0"
             catenateWords="0" catenateNumbers="0"
             catenateAll="1" splitOnCaseChange="0" />
     </analyzer>
</fieldtype>
<fieldtype name="type2" class="solr.TextField">
     <analyzer>
         <charFilter class="solr.MappingCharFilterFactory"
             mapping="mapping-FoldToASCII.txt" />
         <tokenizer class="solr.WhitespaceTokenizerFactory" />
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.WordDelimiterFilterFactory" .../>
     </analyzer>
</fieldtype>

(with the first field type, *Mag. 78 D 99* becomes *mag78d99*, while the
second field type produces several tokens)

And I want to use the same request handler to query against both of them.
I mean I want the user to search for something like

http://..../search?q=Mag 78 D 99

and this search should look in both the first field (with type1) and the
second field (with type2), matching

- a document whose field_with_type1 equals *mag78d99*, or
- a document whose field_with_type2 contains a text like "go to
  *mag 78*, class *d* and subclass *99*"


<requestHandler ....>
     ...
     <str name="defType">dismax</str>
     ...
     <str name="mm">100%</str>
     <str name="qf">
         field_with_type1
         field_with_type2
     </str>
     ...
</requestHandler>

Is this not possible? If so, is it possible to do that in some other way?

Sorry for the long email and thanks again
Andrea


On 08/12/2013 04:01 PM, Jack Krupansky wrote:

Quoted phrases will be passed to the analyzer as one string, so there a
white space tokenizer is needed.

-- Jack Krupansky

-----Original Message----- From: Andrea Gazzarini
Sent: Monday, August 12, 2013 6:52 AM
To: solr-user@lucene.apache.org
Subject: Re: Tokenization at query time

Hi Tanguy,
thanks for the fast response. What you are saying corresponds perfectly with
the behaviour I'm observing.
Now, other than having a big problem (I have several other fields both
in the pf and qf where spaces don't matter, with field types like the
"text_en" field type in the example schema), what I'm wondering is:

"The query parser splits the input query on white spaces, and then each
token is analysed according to your configuration"

Is there a valid reason to declare a WhiteSpaceTokenizer in a query
analyzer? If the input query is already parsed (i.e. whitespace
tokenized) what is its effect?

Thank you very much for the help
Andrea

On 08/12/2013 12:37 PM, Tanguy Moal wrote:

Hello Andrea,
I think you face a rather common issue involving keyword tokenization
and query parsing in Lucene:
The query parser splits the input query on white spaces, and then each
token is analysed according to your configuration.
So those queries with a whitespace won't behave as expected because each
token is analysed separately. Consequently, the catenated version of the
reference cannot be generated.
I think you could try surrounding your query with double quotes or
escaping the space characters in your query using a backslash so that the
whole sequence is analysed in the same analyser and the catenation occurs.
You should be aware that this approach has a drawback: you will probably
not be able to combine the search for Mag. 778 G 69 with other words in
other fields unless you are able to identify which spaces are to be escaped:
For example, if the input query is:
Awesome Mag. 778 G 69
you would want to transform it to:
Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
or
Awesome "Mag. 778 G 69" // only the reference is turned into a phrase query

Do you get the point?
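
The two rewrites above can be sketched in Python as follows. This is only an illustration under a strong assumption: the caller already knows which substring of the input is the reference, which is exactly the hard part mentioned above. The function names are hypothetical and not part of Solr.

```python
def escape_spaces(reference: str) -> str:
    """Put a backslash before every space so the query parser keeps the
    reference together as a single chunk."""
    return reference.replace(" ", "\\ ")


def protect_reference(query: str, reference: str, quote: bool = False) -> str:
    """Rewrite `query` so that only `reference` is protected, either by
    escaping its spaces or by turning it into a quoted phrase."""
    protected = f'"{reference}"' if quote else escape_spaces(reference)
    return query.replace(reference, protected)


user_input = "Awesome Mag. 778 G 69"
reference = "Mag. 778 G 69"  # assumed to be identified beforehand

print(protect_reference(user_input, reference))
# Awesome Mag.\ 778\ G\ 69
print(protect_reference(user_input, reference, quote=True))
# Awesome "Mag. 778 G 69"
```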

Look at the differences between what you tried and the following
examples which should all do what you want:
http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
OR
http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
OR
http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
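
If you build such URLs from code rather than by hand, the percent-encoding can be left to a library. Below is a minimal Python sketch for the first (quoted phrase) variant. The host, core name and parameters are taken from the example URLs above; the helper name `build_select_url` is hypothetical.

```python
from urllib.parse import quote, urlencode

# Base URL taken from the example URLs above.
SOLR_SELECT = "http://localhost:8983/solr/collection1/select"


def build_select_url(q: str, qf: str, def_type: str = "dismax") -> str:
    """Build a select URL, encoding spaces as %20 and quotes as %22."""
    params = {"q": q, "debugQuery": "on", "qf": qf, "defType": def_type}
    # quote_via=quote gives %20 for spaces instead of the default "+".
    return SOLR_SELECT + "?" + urlencode(params, quote_via=quote)


# The double quotes are part of the query string itself: they turn the
# reference into a phrase, so it reaches the analyzer as one string.
print(build_select_url('"Mag. 778 G 69"', "text myfield"))
```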


I hope this helps

Tanguy

On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
andrea.gazzar...@gmail.com> wrote:

  Hi all,

I have a field (among others) in my schema defined like this:

<fieldtype name="mytype" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory" />
         <filter class="solr.LowerCaseFilterFactory" />
         <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="0"
             generateNumberParts="0"
             catenateWords="0"
             catenateNumbers="0"
             catenateAll="1"
             splitOnCaseChange="0" />
     </analyzer>
</fieldtype>

<field name="myfield" type="mytype" indexed="true"/>

Basically, both at index and query time the field value is normalized
like this.

Mag. 778 G 69 => mag778g69

Now, in my solrconfig I'm using a search handler like this:

<requestHandler ....>
     ...
     <str name="defType">dismax</str>
     ...
     <str name="mm">100%</str>
     <str name="qf">myfield^3000</str>
     <str name="pf">myfield^30000</str>

</requestHandler>

What I'm expecting is that if I index a document with a value for my
field "Mag. 778 G 69", I will be able to get this document by querying

1. Mag. 778 G 69
2. mag 778 g69
3. mag778g69

But that doesn't work: I'm able to get the document if and only if I
use the "normalized" form: mag778g69

After a little bit of debugging, I see that, even though I used a
KeywordTokenizer in my field type declaration, SOLR is doing something like
this:

+((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
   DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
   DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)

That is, it is tokenizing the original query string (mag + 778 + g +
69) and obviously querying the field for separate tokens doesn't match
anything (at least this is what I think)

Could anybody please explain that to me?

Thanks in advance
Andrea
