Re: Multi-words synonyms matching

Bernd Fehling Tue, 15 May 2012 01:30:05 -0700

Without having read the whole thread, let me say that you should not trust
the Solr admin analysis page. It takes the whole multiword query and runs
it as one piece through each analyzer step (factory).
But this is not how the real system works. First pitfall: the query parser
also splits on white space (unless it is a phrase query). Because of this,
a multiword query is sent chunk by chunk through the analyzer and,
second pitfall, each chunk runs through the whole analyzer on its own.
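
The difference between the two paths can be sketched with a toy model in
Python. This is not Solr code: SYNONYMS, analyze(), admin_analysis() and
real_query_path() are hypothetical names made up for illustration, and the
"analyzer" is reduced to lowercasing, whitespace tokenizing and one
multiword synonym lookup.

```python
# Toy model only, not Solr code. All names here are hypothetical.
SYNONYMS = {("hotel", "de", "ville"): ["mairie"]}


def analyze(text):
    """Lowercase, split on whitespace, then add synonyms for multiword matches."""
    tokens = text.lower().split()
    out = []
    i = 0
    while i < len(tokens):
        for key, synonyms in SYNONYMS.items():
            if tuple(tokens[i:i + len(key)]) == key:
                out.extend(key)        # keep the original tokens
                out.extend(synonyms)   # and add the synonym(s)
                i += len(key)
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


def admin_analysis(query):
    """What the analysis page shows: the whole string analyzed in one pass."""
    return analyze(query)


def real_query_path(query):
    """What happens to a non-phrase query: split first, analyze each chunk alone."""
    out = []
    for chunk in query.split():
        out.extend(analyze(chunk))
    return out


print(admin_analysis("hotel de ville"))   # ['hotel', 'de', 'ville', 'mairie']
print(real_query_path("hotel de ville"))  # ['hotel', 'de', 'ville']
```

In the second call each chunk reaches the synonym lookup alone, so the
three-word key can never match.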


So if you are dealing with multiword synonyms you have the following
problem. Either you turn your query into a phrase so that the whole
phrase is analyzed at once and therefore looked up as a multiword synonym,
but phrase queries are not analyzed !!! Or you send your query chunk
by chunk through the analyzer, but then the chunks are no longer multiword
terms and are not found in your synonyms.txt.

From my experience I can say that it requires some deep work to get it done,
but it is possible. I have connected a thesaurus to Solr which does
query-time expansion (no need to reindex if the thesaurus changes).
The thesaurus holds synonyms and "used for" terms in 24 languages, so
it also acts as a kind of language translation. And naturally the thesaurus
translates from single-term to multi-term synonyms and vice versa.
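
As a rough idea of what query-time expansion outside the analyzer could
look like, here is a minimal Python sketch. It is an assumption for
illustration, not the thesaurus integration described above: THESAURUS,
quote() and expand_query() are hypothetical names, and the expansion is
done on the raw query string before it is sent to Solr, so that multiword
terms are matched before any whitespace splitting happens.

```python
# Hypothetical sketch: expand a raw query string against a small thesaurus
# before sending it to the search server. Not an actual Solr component.
THESAURUS = {
    "hotel de ville": ["mairie"],
    "mairie": ["hotel de ville"],
}


def quote(term):
    """Wrap multiword terms in quotes so they are sent as phrases."""
    return '"%s"' % term if " " in term else term


def expand_query(query, thesaurus=THESAURUS):
    """Replace each thesaurus term (longest match first) by an OR group."""
    words = query.lower().split()
    max_len = max(len(key.split()) for key in thesaurus)
    parts = []
    i = 0
    while i < len(words):
        # Try the longest candidate starting at position i first.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            if candidate in thesaurus:
                alternatives = [candidate] + thesaurus[candidate]
                parts.append("(" + " OR ".join(quote(t) for t in alternatives) + ")")
                i += n
                break
        else:
            parts.append(words[i])
            i += 1
    return " ".join(parts)


print(expand_query("hotel de ville paris"))  # ("hotel de ville" OR mairie) paris
print(expand_query("mairie"))                # (mairie OR "hotel de ville")
```

Because the thesaurus is consulted at query time only, changing it does
not require reindexing, which is the property mentioned above.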

Regards,
Bernd


On 14.05.2012 13:54, elisabeth benoit wrote:
> Just for the record, I'd like to conclude this thread
> 
> First, you were right, there was no behaviour difference between fq and q
> parameters.
> 
> I realized that:
> 
> 1) my synonym (hotel de ville) has a stopword in it (de) and since I used
> tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms declaration,
> there was no stopword removal in the indexed expression, so when requesting
> "hotel de ville", after stopwords removal in query, Solr was comparing
> "hotel de ville"
> with "hotel ville"
> 
> but my queries never even got to that point since
> 
> 2) I made a mistake using "mairie" alone in the admin interface when
> testing my schema. The real field was something like "collectivités
> territoriales mairie",
> so the synonym "hotel de ville" was not even applied, because of the
> tokenizerFactory="solr.KeywordTokenizerFactory" in my synonym definition
> not splitting field into words when parsing
> 
> So my problem is not solved, and I'm considering solving it outside of Solr
> scope, unless someone else has a clue
> 
> Thanks again,
> Elisabeth
> 
> 
> 
> 2012/4/25 Erick Erickson <erickerick...@gmail.com>
> 
>> A little farther down the debug info output you'll find something
>> like this (I specified fq=name:features)
>>
>> <arr name="parsed_filter_queries">
>> <str>name:features</str>
>> </arr>
>>
>>
>> so it may well give you some clue. But unless I'm reading things wrong, your
>> q is going against a field that has much more information than the
>> CATEGORY_ANALYZED field, is it possible that the data from your
>> test cases simply isn't _in_ CATEGORY_ANALYZED?
>>
>> Best
>> Erick
>>
>> On Wed, Apr 25, 2012 at 9:39 AM, elisabeth benoit
>> <elisaelisael...@gmail.com> wrote:
>>> I'm not at the office until next Wednesday, and I don't have my Solr at
>>> hand, but isn't debugQuery=on giving information only about q parameter
>>> matching and nothing about fq parameter? Or do you mean
>>> "parsed_filter_queries" gives information about fq?
>>>
>>> CATEGORY_ANALYZED is being populated by a copyField instruction in
>>> schema.xml, and has the same field type as my catchall field, the search
>>> field for my searchHandler (the one being used by q parameter).
>>>
>>> CATEGORY (a string) is copied in CATEGORY_ANALYZED (field type is text)
>>>
>>> CATEGORY (a string) is copied in catchall field (field type is text), and a
>>> lot of other fields are copied too in that catchall field.
>>>
>>> So as far as I can see, the same analysis should be done in both cases, but
>>> obviously I'm missing something, and the only thing I can think of is a
>>> different behavior between q and fq parameter.
>>>
>>> I'll check that parsed_filter_queries first thing in the morning next
>>> Wednesday.
>>>
>>> Thanks a lot for your help.
>>>
>>> Elisabeth
>>>
>>>
>>> 2012/4/24 Erick Erickson <erickerick...@gmail.com>
>>>
>>>> Elisabeth:
>>>>
>>>> What shows up in the debug section of the response when you add
>>>> &debugQuery=on? There should be some bit of that section like:
>>>> "parsed_filter_queries"
>>>>
>>>> My other question is "are you absolutely sure that your
>>>> CATEGORY_ANALYZED field has the correct content?". How does it
>>>> get populated?
>>>>
>>>> Nothing jumps out at me here....
>>>>
>>>> Best
>>>> Erick
>>>>
>>>> On Tue, Apr 24, 2012 at 9:55 AM, elisabeth benoit
>>>> <elisaelisael...@gmail.com> wrote:
>>>>> yes, thanks, but this is NOT my question.
>>>>>
>>>>> I was wondering why I have multiple matches with q="hotel de ville" and no
>>>>> match with fq=CATEGORY_ANALYZED:"hotel de ville", since in both cases I'm
>>>>> searching in the same solr fieldType.
>>>>>
>>>>> Why is q parameter behaving differently in that case? Why do the quotes
>>>>> work in one case and not in the other?
>>>>>
>>>>> Does anyone know?
>>>>>
>>>>> Thanks,
>>>>> Elisabeth
>>>>>
>>>>> 2012/4/24 Jeevanandam <je...@myjeeva.com>
>>>>>
>>>>>>
>>>>>> usage of q and fq
>>>>>>
>>>>>> q => is typically the main query for the search request
>>>>>>
>>>>>> fq => is Filter Query; generally used to restrict the super set of
>>>>>> documents without influencing score (more info:
>>>>>> http://wiki.apache.org/solr/CommonQueryParameters#q)
>>>>>>
>>>>>> For example:
>>>>>> ------------
>>>>>> q="hotel de ville" ===> returns 100 documents
>>>>>>
>>>>>> q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===>
>>>>>> returns 40 documents from super set of 100 documents
>>>>>>
>>>>>>
>>>>>> hope this helps!
>>>>>>
>>>>>> - Jeevanandam
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 24-04-2012 3:08 pm, elisabeth benoit wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'd like to resume this post.
>>>>>>>
>>>>>>> The only way I found to avoid splitting synonyms into words in synonyms.txt
>>>>>>> is to use the line
>>>>>>>
>>>>>>>  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>> ignoreCase="true" expand="true"
>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/>
>>>>>>>
>>>>>>> in schema.xml
>>>>>>>
>>>>>>> where tokenizerFactory="solr.KeywordTokenizerFactory"
>>>>>>>
>>>>>>> instructs SynonymFilterFactory not to break synonyms into words on white
>>>>>>> spaces when parsing the synonyms file.
>>>>>>>
>>>>>>> So now it works fine, "mairie" is mapped into "hotel de ville" and when I
>>>>>>> send request q="hotel de ville" (quotes are mandatory to prevent the analyzer
>>>>>>> from splitting hotel de ville on white spaces), I get answers with word
>>>>>>> "mairie".
>>>>>>>
>>>>>>> But when I use fq parameter (fq=CATEGORY_ANALYZED:"hotel de ville"), it
>>>>>>> doesn't work!!!
>>>>>>>
>>>>>>> CATEGORY_ANALYZED is same field type as default search field. This means
>>>>>>> that when I send q="hotel de ville" and fq=CATEGORY_ANALYZED:"hotel de
>>>>>>> ville", solr uses the same analyzer, the one with the line
>>>>>>>
>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>> ignoreCase="true" expand="true"
>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/>.
>>>>>>>
>>>>>>> Does anyone have a clue what is different between q analysis behaviour and fq
>>>>>>> analysis behaviour?
>>>>>>>
>>>>>>> Thanks a lot
>>>>>>> Elisabeth
>>>>>>>
>>>>>>> 2012/4/12 elisabeth benoit <elisaelisael...@gmail.com>
>>>>>>>
>>>>>>>  oh, that's right.
>>>>>>>>
>>>>>>>> thanks a lot,
>>>>>>>> Elisabeth
>>>>>>>>
>>>>>>>>
>>>>>>>> 2012/4/11 Jeevanandam Madanagopal <je...@myjeeva.com>
>>>>>>>>
>>>>>>>>  Elisabeth -
>>>>>>>>>
>>>>>>>>> As you described, the mapping below might suit your need.
>>>>>>>>> mairie => hotel de ville, mairie
>>>>>>>>>
>>>>>>>>> mairie gets expanded to "hotel de ville" and "mairie" at index time. So
>>>>>>>>> "mairie" and "hotel de ville" are both searchable on the document.
>>>>>>>>>
>>>>>>>>> However, the white space tokenizer splitting at query time will still be a
>>>>>>>>> problem, as described by Markus.
>>>>>>>>>
>>>>>>>>> --Jeevanandam
>>>>>>>>>
>>>>>>>>> On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
>>>>>>>>>
>>>>>>>>>> <<Have you tried the "=>" mapping instead? Something
>>>>>>>>>> <<like
>>>>>>>>>> <<hotel de ville => mairie
>>>>>>>>>> <<might work for you.
>>>>>>>>>>
>>>>>>>>>> Yes, thanks, I've tried it but from what I understand it doesn't solve my
>>>>>>>>>> problem, since this means hotel de ville will be replaced by mairie at
>>>>>>>>>> index time (I use synonyms only at index time). So when a user asks for
>>>>>>>>>> "hôtel de ville", it won't match.
>>>>>>>>>>
>>>>>>>>>> In fact, at index time I have mairie in my data, but I want user to be able
>>>>>>>>>> to request "mairie" or "hôtel de ville" and have mairie as answer, and not
>>>>>>>>>> have mairie as an answer when requesting "hôtel".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> <<To map `mairie` to `hotel de ville` as single token you must escape your
>>>>>>>>>> <<white space.
>>>>>>>>>>
>>>>>>>>>> <<mairie, hotel\ de\ ville
>>>>>>>>>>
>>>>>>>>>> <<This results in a problem if your tokenizer splits on white space at
>>>>>>>>>> <<query time.
>>>>>>>>>>
>>>>>>>>>> Ok, I guess this means I have a problem. No simple solution since at query
>>>>>>>>>> time my tokenizer does split on white spaces.
>>>>>>>>>>
>>>>>>>>>> I guess my problem is more or less one of the problems discussed in
>>>>>>>>>>
>>>>>>>>>> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Thanks a lot for your answers,
>>>>>>>>>> Elisabeth
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2012/4/10 Erick Erickson <erickerick...@gmail.com>
>>>>>>>>>>
>>>>>>>>>>> Have you tried the "=>" mapping instead? Something
>>>>>>>>>>> like
>>>>>>>>>>> hotel de ville => mairie
>>>>>>>>>>> might work for you.
>>>>>>>>>>>
>>>>>>>>>>> Best
>>>>>>>>>>> Erick
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
>>>>>>>>>>> <elisaelisael...@gmail.com> wrote:
>>>>>>>>>>>> Hello,
>>>>>>>>>>>>
>>>>>>>>>>>> I've read several posts on this issue, but can't find a real solution to my
>>>>>>>>>>>> multi-words synonyms matching problem.
>>>>>>>>>>>>
>>>>>>>>>>>> I have in my synonyms.txt an entry like
>>>>>>>>>>>>
>>>>>>>>>>>> mairie, hotel de ville
>>>>>>>>>>>>
>>>>>>>>>>>> and my index time analyzer is configured as follows for synonyms.
>>>>>>>>>>>>
>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>>>>>> ignoreCase="true" expand="true"/>
>>>>>>>>>>>>
>>>>>>>>>>>> The problem I have is that now "mairie" matches with "hotel" and I would
>>>>>>>>>>>> only want "mairie" to match with "hotel de ville" and "mairie".
>>>>>>>>>>>>
>>>>>>>>>>>> When I look into the analyzer, I see that "mairie" is mapped into "hotel",
>>>>>>>>>>>> and words "de ville" are added in second and third position. To change
>>>>>>>>>>>> that, I tried to do
>>>>>>>>>>>>
>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>>>>>>>>>>>> ignoreCase="true" expand="true"
>>>>>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in one post)
>>>>>>>>>>>>
>>>>>>>>>>>> and I can see now in the analyzer that "mairie" is mapped to "hotel de
>>>>>>>>>>>> ville", but now when I have query "hotel de ville", it doesn't match at all
>>>>>>>>>>>> with "mairie".
>>>>>>>>>>>>
>>>>>>>>>>>> Does anyone have a clue what I'm doing wrong?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using Solr 3.4.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Elisabeth
>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
> 

-- 
*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de                33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
