Re: Tokenization at query time

Erick Erickson Mon, 26 Aug 2013 05:49:15 -0700

Andrea:

Works for me, admittedly through the browser....


I suspect the problem is here: ClientUtils.escapeQueryChars

That doesn't do anything about escaping the spaces, it just handles
characters that have special meaning to the query syntax, things like +-
etc.
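
If the client does have to escape spaces itself, a minimal sketch could look like the one below. Note that escapeWhitespace is a hypothetical helper written for this illustration, not a SolrJ method, and the class name is made up too. It only puts a backslash in front of whitespace that is not already escaped.

```java
public class EscapeSpaces {
    // Hypothetical helper (NOT part of SolrJ): prefix every whitespace
    // character with a backslash unless it is already escaped.
    static String escapeWhitespace(String q) {
        StringBuilder sb = new StringBuilder();
        boolean escaped = false; // true when the previous char was an unescaped backslash
        for (char c : q.toCharArray()) {
            if (Character.isWhitespace(c) && !escaped) {
                sb.append('\\');
            }
            sb.append(c);
            escaped = (c == '\\') && !escaped;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: 978\ 90\ 04\ 23560\ 1
        System.out.println(escapeWhitespace("978 90 04 23560 1"));
    }
}
```

The result still has to be URL-encoded before it goes on the wire, and whether the parser then honours the escaped space is worth checking with debug=query.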

Using your field definition, this:
http://localhost:8983/solr/select?wt=json&q=ab\ cd\ ef&debug=query&defType=edismax&qf=name eoe
produced this output:

   - parsedquery_toString: "+(eoe:abcdef | (name:ab name:cd name:ef))",


where the field "eoe" is your isbn_issn type.
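
To see why it matters whether the value reaches the analysis chain whole or word by word, here is a rough model in plain Java. It is only a sketch under assumptions: analyze() stands in for KeywordTokenizer + LowerCaseFilter + WordDelimiterFilter with catenateAll="1" by lowercasing and dropping everything that is not an ASCII letter or digit, and analyzePerWord() stands in for a parser that splits on whitespace first. Neither is Solr code, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class CatenateModel {
    // Simplified stand-in for the isbn_issn chain (ASCII only): lowercase
    // the whole input, then drop every non-alphanumeric character.
    static String analyze(String input) {
        return input.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Models a parser that splits on whitespace BEFORE analysis, so each
    // word goes through analyze() on its own.
    static List<String> analyzePerWord(String query) {
        List<String> tokens = new ArrayList<>();
        for (String word : query.trim().split("\\s+")) {
            String token = analyze(word);
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("978 90 04 23560 1"));        // 9789004235601
        System.out.println(analyzePerWord("978 90 04 23560 1")); // [978, 90, 04, 23560, 1]
    }
}
```

In this model only the whole-string path yields the single catenated token that is in the index; the per-word path gives five separate tokens, none of which equals it.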

Best,
Erick


On Mon, Aug 26, 2013 at 4:55 AM, Andrea Gazzarini <
andrea.gazzar...@gmail.com> wrote:

> Hi Erick,
> escaping spaces doesn't work...
>
> Briefly,
>
> - In a document I have an ISBN field whose stored value is
> 978-90-04-23560-1
> - In the index I have this value: 9789004235601
>
> Now, I want to be able to search for the document by using:
>
> 1) q=978-90-04-23560-1
> 2) q=978 90 04 23560 1
> 3) q=9789004235601
>
> 1 and 3 work perfectly, 2 doesn't work.
>
> My code is:
>
> SolrQuery query = new SolrQuery(ClientUtils.escapeQueryChars(req.getParameter("q")));
>
> The ISBN field is declared this way:
>
> <fieldtype name="isbn_issn" class="solr.TextField"
> positionIncrementGap="100">
>     <analyzer>
>         <tokenizer class="solr.KeywordTokenizerFactory"/>
>         <filter class="solr.LowerCaseFilterFactory"/>
>         <filter class="solr.WordDelimiterFilterFactory"
> generateWordParts="0" generateNumberParts="0" catenateWords="0"
> catenateNumbers="0" catenateAll="1" splitOnCaseChange="0"/>
>     </analyzer>
> </fieldtype>
> <field name="isbn_issn_search" type="issn_isbn" indexed="true"/>
>
> The search handler is:
>
>     <requestHandler name="any_bc" class="solr.SearchHandler"
> default="true">
>         <lst name="defaults">
>             <str name="defType">dismax</str>
>             <str name="mm">100%</str>
>             <str name="qf">
> isbn_issn_search^10000
>             </str>
>             <str name="pf">
> isbn_issn_search^100000
>             </str>
>             <int name="ps">0</int>
>             <float name="tie">0.1</float>
>             ...
>     </requestHandler>
>
> This is what I get:
>
> 1) 978-90-04-23560-1
> path=/select params={start=0&q=978\-90\-04\-23560\-1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>
> 2) 9789004235601
> webapp=/solr path=/select params={start=0&q=9789004235601&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=1 status=0 QTime=5
>
> 3) 978 90 04 23560 1
> path=/select params={start=0&q=978\+90\+04\+23560\+1&sfield=&qt=any_bc&wt=javabin&rows=10&version=2} hits=0 status=0 QTime=2
>
> Extract from queryDebug=true:
>
> <str name="q">978\ 90\ 04\ 23560\ 1</str>
> ...
> <str name="rawquerystring">978\ 90\ 04\ 23560\ 1</str>
> <str name="querystring">978\ 90\ 04\ 23560\ 1</str>
> ...
> <str name="parsedquery">
>     +((DisjunctionMaxQuery((isbn_issn_search:978^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:90^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:04^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:23560^10000.0)~0.1)
>     DisjunctionMaxQuery((isbn_issn_search:1^10000.0)~0.1))~5)
>     DisjunctionMaxQuery((isbn_issn_search:9789004235601^100000.0)~0.1)
> </str>
>
> ------------------------------------------------
> Probably this is a very stupid question but I'm going crazy. On this page
>
> http://wiki.apache.org/solr/DisMaxQParserPlugin
>
> under "Query Structure" it says:
>
> "For each "word" in the query string, dismax builds a DisjunctionMaxQuery
> object for that word across all of the fields in the qf param..."
>
> And that seems to be exactly what it is doing... but what is a "word"? How
> can I force spaces (without using double quotes) to be
> considered part of the word?
>
> Many many many thanks
> Andrea
>
>
> On 08/13/2013 04:18 PM, Erick Erickson wrote:
>
>> I think you can get what you want by escaping the space with a
>> backslash....
>>
>> YMMV of course.
>> Erick
>>
>>
>> On Tue, Aug 13, 2013 at 9:11 AM, Andrea Gazzarini <
>> andrea.gazzar...@gmail.com> wrote:
>>
>>  Hi Erick,
>>> sorry if that wasn't clear: this is what I'm actually observing in my
>>> application.
>>>
>>> I wrote the first post after looking at the explain (debugQuery=true):
>>> the
>>> query
>>>
>>> q=mag 778 G 69
>>>
>>> is translated as follows:
>>>
>>>
>>>   +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>        DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>        DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>        DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>        DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>
>>> It seems that although I declare myfield with this type
>>>
>>> <fieldtype name="type1" class="solr.TextField" >
>>>      <analyzer>
>>>          <tokenizer class="solr.KeywordTokenizerFactory" />
>>>          <filter class="solr.LowerCaseFilterFactory" />
>>>          <filter class="solr.WordDelimiterFilterFactory"
>>> generateWordParts="0" generateNumberParts="0"
>>>              catenateWords="0" catenateNumbers="0" catenateAll="1"
>>> splitOnCaseChange="0" />
>>>      </analyzer>
>>> </fieldtype>
>>>
>>> Solr is tokenizing it anyway, producing several tokens
>>> (mag, 778, g, 69).
>>>
>>> And I can't put double quotes on the query (q="mag 778 G 69") because the
>>> request handler searches also in other fields (with different
>>> configuration
>>> chains)
>>>
>>> As I understand it, the query parser (i.e. at query time) does a whitespace
>>> tokenization on its own before invoking my (query-time) chain. The same
>>> doesn't happen at index time... this is my problem... because at index time
>>> the field is analyzed exactly as I want... but unfortunately I cannot say
>>> the same at query time.
>>>
>>> Sorry for my wonderful english, did you get the point?
>>>
>>>
>>> On 08/13/2013 02:18 PM, Erick Erickson wrote:
>>>
>>>  On a quick scan I don't see a problem here. Attach
>>>> &debug=query to your url and that'll show you the
>>>> parsed query, which will in turn show you what's been
>>>> pushed through the analysis chain you've defined.
>>>>
>>>> You haven't stated whether you've tried this and it's
>>>> not working or you're looking for guidance as to how
>>>> to accomplish this so it's a little unclear how to
>>>> respond.
>>>>
>>>> BTW, the admin/analysis page is your friend here....
>>>>
>>>> Best
>>>> Erick
>>>>
>>>>
>>>> On Mon, Aug 12, 2013 at 12:52 PM, Andrea Gazzarini <
>>>> andrea.gazzar...@gmail.com> wrote:
>>>>
>>>>   Clear, thanks for response.
>>>>
>>>>> So, if I have two fields
>>>>>
>>>>> <fieldtype name="type1" class="solr.TextField" >
>>>>>       <analyzer>
>>>>>           <tokenizer class="solr.KeywordTokenizerFactory" />
>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>> generateWordParts="0" generateNumberParts="0"
>>>>>               catenateWords="0" catenateNumbers="0" catenateAll="1"
>>>>> splitOnCaseChange="0" />
>>>>>       </analyzer>
>>>>> </fieldtype>
>>>>> <fieldtype name="type2" class="solr.TextField" >
>>>>>       <analyzer>
>>>>>           <charFilter class="solr.MappingCharFilterFactory"
>>>>> mapping="mapping-FoldToASCII.txt"/>
>>>>>           <tokenizer class="solr.WhitespaceTokenizerFactory" />
>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>           <filter class="solr.WordDelimiterFilterFactory" .../>
>>>>>
>>>>>
>>>>>       </analyzer>
>>>>> </fieldtype>
>>>>>
>>>>> (with the first field type, "Mag. 78 D 99" becomes "mag78d99", while the
>>>>> second field type ends up with several tokens)
>>>>>
>>>>> And I want to use the same request handler to query against both of
>>>>> them.
>>>>> I mean I want the user to search for something like
>>>>>
>>>>> http//..../search?q=Mag 78 D 99
>>>>>
>>>>> and this search should look in both the first field (with type1) and the
>>>>> second (with type2), matching
>>>>>
>>>>> - a document whose field_with_type1 equals "mag78d99", or
>>>>> - a document whose field_with_type2 contains text like "go
>>>>> to mag 78, class d and subclass 99"
>>>>>
>>>>>
>>>>> <requestHandler ....>
>>>>>       ...
>>>>>       <str name="defType">dismax</str>
>>>>>       ...
>>>>>       <str name="mm">100%</str>
>>>>>       <str name="qf">
>>>>>           field_with_type1
>>>>>           field_with_type_2
>>>>>       </str>
>>>>>       ...
>>>>> </requestHandler>
>>>>>
>>>>> Is this not possible? If so, is it possible to do that in some other way?
>>>>>
>>>>> Sorry for the long email and thanks again
>>>>> Andrea
>>>>>
>>>>>
>>>>> On 08/12/2013 04:01 PM, Jack Krupansky wrote:
>>>>>
>>>>>> Quoted phrases will be passed to the analyzer as one string, so
>>>>>> there a white space tokenizer is needed.
>>>>>>
>>>>>> -- Jack Krupansky
>>>>>>
>>>>>> -----Original Message----- From: Andrea Gazzarini
>>>>>> Sent: Monday, August 12, 2013 6:52 AM
>>>>>> To: solr-user@lucene.apache.org
>>>>>> Subject: Re: Tokenization at query time
>>>>>>
>>>>>> Hi Tanguy,
>>>>>> thanks for the fast response. What you are saying corresponds perfectly
>>>>>> with
>>>>>> the behaviour I'm observing.
>>>>>> Now, other than having a big problem (I have several other fields, both
>>>>>> in the pf and qf, where spaces don't matter, with field types like the
>>>>>> "text_en" field type in the example schema), what I'm wondering is:
>>>>>>
>>>>>> "The query parser splits the input query on white spaces, and then
>>>>>> each token is analysed according to your configuration"
>>>>>>
>>>>>> Is there a valid reason to declare a WhiteSpaceTokenizer in a query
>>>>>> analyzer? If the input query is already parsed (i.e. whitespace
>>>>>> tokenized) what is its effect?
>>>>>>
>>>>>> Thank you very much for the help
>>>>>> Andrea
>>>>>>
>>>>>> On 08/12/2013 12:37 PM, Tanguy Moal wrote:
>>>>>>
>>>>>>   Hello Andrea,
>>>>>>
>>>>>>> I think you face a rather common issue involving keyword tokenization
>>>>>>> and query parsing in Lucene:
>>>>>>> The query parser splits the input query on white spaces, and then
>>>>>>> each
>>>>>>> token is analysed according to your configuration.
>>>>>>> So those queries with a whitespace won't behave as expected because
>>>>>>> each
>>>>>>> token is analysed separately. Consequently, the catenated version of
>>>>>>> the
>>>>>>> reference cannot be generated.
>>>>>>> I think you could try surrounding your query with double quotes or
>>>>>>> escaping the space characters in your query using a backslash so that
>>>>>>> the
>>>>>>> whole sequence is analysed in the same analyser and the catenation
>>>>>>> occurs.
>>>>>>> You should be aware that this approach has a drawback: you will
>>>>>>> probably
>>>>>>> not be able to combine the search for Mag. 778 G 69 with other words
>>>>>>> in
>>>>>>> other fields unless you are able to identify which spaces are to be
>>>>>>> escaped:
>>>>>>> For example, if the input query is:
>>>>>>> Awesome Mag. 778 G 69
>>>>>>> you would want to transform it to:
>>>>>>> Awesome Mag.\ 778\ G\ 69 // spaces are escaped in the reference only
>>>>>>> or
>>>>>>> Awesome "Mag. 778 G 69" // only the reference is turned into a phrase
>>>>>>> query
>>>>>>>
>>>>>>> Do you get the point?
>>>>>>>
>>>>>>> Look at the differences between what you tried and the following
>>>>>>> examples which should all do what you want:
>>>>>>> http://localhost:8983/solr/collection1/select?q=%22Mag.%20778%20G%2069%22&debugQuery=on&qf=text%20myfield&defType=dismax
>>>>>>> OR
>>>>>>> http://localhost:8983/solr/collection1/select?q=myfield:Mag.\%20778\%20G\%2069&debugQuery=on
>>>>>>> OR
>>>>>>> http://localhost:8983/solr/collection1/select?q=Mag.\%20778\%20G\%2069&debugQuery=on&qf=text%20myfield&defType=edismax
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I hope this helps
>>>>>>>
>>>>>>> Tanguy
>>>>>>>
>>>>>>> On Aug 12, 2013, at 11:13 AM, Andrea Gazzarini <
>>>>>>> andrea.gazzar...@gmail.com> wrote:
>>>>>>>
>>>>>>>    Hi all,
>>>>>>>
>>>>>>>  I have a field (among others) in my schema defined like this:
>>>>>>>>
>>>>>>>> <fieldtype name="mytype" class="solr.TextField"
>>>>>>>> positionIncrementGap="100">
>>>>>>>>       <analyzer>
>>>>>>>>           <tokenizer class="solr.KeywordTokenizerFactory"
>>>>>>>> />
>>>>>>>>           <filter class="solr.LowerCaseFilterFactory" />
>>>>>>>>           <filter class="solr.WordDelimiterFilterFactory"
>>>>>>>>               generateWordParts="0"
>>>>>>>>               generateNumberParts="0"
>>>>>>>>               catenateWords="0"
>>>>>>>>               catenateNumbers="0"
>>>>>>>>               catenateAll="1"
>>>>>>>>               splitOnCaseChange="0" />
>>>>>>>>       </analyzer>
>>>>>>>> </fieldtype>
>>>>>>>>
>>>>>>>> <field name="myfield" type="mytype" indexed="true"/>
>>>>>>>>
>>>>>>>> Basically, both at index and query time the field value is
>>>>>>>> normalized
>>>>>>>> like this.
>>>>>>>>
>>>>>>>> Mag. 778 G 69 => mag778g69
>>>>>>>>
>>>>>>>> Now, in my solrconfig I'm using a search handler like this:
>>>>>>>>
>>>>>>>> <requestHandler ....>
>>>>>>>>       ...
>>>>>>>>       <str name="defType">dismax</str>
>>>>>>>>       ...
>>>>>>>>       <str name="mm">100%</str>
>>>>>>>>       <str name="qf">myfield^3000</str>
>>>>>>>>       <str name="pf">myfield^30000</str>
>>>>>>>>
>>>>>>>> </requestHandler>
>>>>>>>>
>>>>>>>> What I'm expecting is that if I index a document with a value for my
>>>>>>>> field "Mag. 778 G 69", I will be able to get this document by
>>>>>>>> querying
>>>>>>>>
>>>>>>>> 1. Mag. 778 G 69
>>>>>>>> 2. mag 778 g69
>>>>>>>> 3. mag778g69
>>>>>>>>
>>>>>>>> But that doesn't work: I'm able to get the document if and only if I
>>>>>>>> use the "normalized" form: mag778g69
>>>>>>>>
>>>>>>>> After doing a little bit of debugging, I see that, even though I used a
>>>>>>>> KeywordTokenizer in my field type declaration, Solr is doing
>>>>>>>> something like
>>>>>>>> this:
>>>>>>>>
>>>>>>>> +((DisjunctionMaxQuery((myfield:mag^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:778^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:g^3000.0)~0.1)
>>>>>>>> DisjunctionMaxQuery((myfield:69^3000.0)~0.1))~4)
>>>>>>>> DisjunctionMaxQuery((myfield:mag778g69^30000.0)~0.1)
>>>>>>>>
>>>>>>>>
>>>>>>>> That is, it is tokenizing the original query string (mag + 778 + g +
>>>>>>>> 69) and obviously querying the field for separate tokens doesn't
>>>>>>>> match
>>>>>>>> anything (at least this is what I think)
>>>>>>>>
>>>>>>>> Could anybody please explain that to me?
>>>>>>>>
>>>>>>>> Thanks in advance
>>>>>>>> Andrea
>>>>>>>>
>>>>>>>>
>>>>>>>>
>
