Re: Multi-words synonyms matching

elisabeth benoit Tue, 29 May 2012 13:27:31 -0700

Hello Bernd,

Thanks a lot for your answer. I'll work on this.


Best regards,
Elisabeth

2012/5/29 Bernd Fehling <bernd.fehl...@uni-bielefeld.de>

> Hello Elisabeth,
>
> my synonyms.txt is like your 2nd example:
>
> naturwald, φυσικό\ δάσος, естествена\ гора, prírodný\ les, naravni\ gozd,
> foresta\ naturale, natuurbos, natural\ forest, bosque\ natural,
> természetes\ erdő,
> natūralus\ miškas, prirodna\ šuma, dabiskais\ mežs, floresta\ natural,
> naturskov,
> forêt\ naturelle, naturskog, přírodní\ les, luonnonmetsä, pădure\ naturală,
> las\ naturalny, natürlicher\ wald
>
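To illustrate why the backslashes in the synonyms file matter, here is a minimal sketch in Python. It is a simplified model and not Solr's actual synonyms parser: it assumes entries are separated by commas and that whitespace inside an entry splits it into tokens unless the space is escaped with a backslash. The names entry_tokens and parse_line are hypothetical.

```python
import re


def entry_tokens(entry: str) -> list[str]:
    """Split one synonym entry on whitespace that is NOT backslash-escaped."""
    # (?<!\\) means "not preceded by a backslash".
    raw = re.split(r"(?<!\\)\s+", entry.strip())
    # Drop the escaping backslash so "hotel\ de\ ville" becomes "hotel de ville".
    return [tok.replace("\\ ", " ") for tok in raw if tok]


def parse_line(line: str) -> list[list[str]]:
    """Parse a comma-separated synonyms line into a token list per entry."""
    return [entry_tokens(e) for e in line.split(",") if e.strip()]


# Unescaped: the multiword synonym falls apart into three tokens.
print(parse_line("mairie, hotel de ville"))
# Escaped: the multiword synonym stays together as one token.
print(parse_line(r"mairie, hotel\ de\ ville"))
```

In this toy model the unescaped form yields three separate tokens for the second entry, which is consistent with "mairie" ending up matching "hotel" alone, while the escaped form keeps "hotel de ville" as one unit.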
>
> An example from my system with debugging turned on and searching for
> "naturwald":
>
> <lst name="debug">
>  <str name="rawquerystring">naturwald</str>
>  <str name="querystring">naturwald</str>
>  <str name="parsedquery">textth:naturwald textth:"φυσικό δάσος"
> textth:"естествена гора"
> textth:"prírodný les" textth:"naravni gozd" textth:"foresta naturale"
> textth:natuurbos
> textth:"natural forest" textth:"bosque natural" textth:"természetes erdő"
> textth:"natūralus miškas" textth:"prirodna šuma" textth:"dabiskais mežs"
> textth:"floresta natural" textth:naturskov textth:"forêt naturelle"
> textth:naturskog
> textth:"přírodní les" textth:luonnonmetsä textth:"pădure naturală"
> textth:"las naturalny"
> textth:"natürlicher wald"</str>
> ...
>
> As you can see my search for "naturwald" extends to single and multiword
> synonyms e.g. "forêt naturelle"
>
>
> My SynonymFilterFactory has the following settings:
>
> org.apache.solr.analysis.SynonymFilterFactory
> {tokenizerFactory=solr.KeywordTokenizerFactory,
> synonyms=synonyms_eurovoc_desc_desc_ufall.txt, expand=true, format=solr,
> ignoreCase=true,
> luceneMatchVersion=LUCENE_36}
>
> But as I already mentioned, there is much more work to be done to get it
> running than just using SynonymFilterFactory.
>
> Regards
> Bernd
>
>
>
> On 23.05.2012 08:49, elisabeth benoit wrote:
> > Hello Bernd,
> >
> > Thanks for your advice.
> >
> > I have one question: how did you manage to map one word to a multiwords
> > synonym???
> >
> > I've tried (in synonyms.txt)
> >
> > mairie, hotel de ville
> >
> > mairie, hotel\ de\ ville
> >
> > mairie => mairie, hotel de ville
> >
> > mairie => mairie, hotel\ de\ ville
> >
> > but nothing prevents mairie from matching with "hotel"...
> >
> > The only way I found is to use
> > tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms declaration
> > in schema.xml, but then, since "mairie" is not alone in my index field, it
> > doesn't match.
> >
> >
> > best regards,
> > Elisabeth
> >
> >
> >
> >
> > the only way I found, in schema.xml, is to use
> >
> >
> >
> > 2012/5/15 Bernd Fehling <bernd.fehl...@uni-bielefeld.de>
> >
> >> Without reading the whole thread let me say that you should not trust
> >> the Solr admin analysis page. It takes the whole multiword search and
> >> runs it all together at once through each analyzer step (factory).
> >> But this is not how the real system works. First pitfall: the query
> >> parser also splits at white space (unless it is a phrase query). Due to
> >> this, a multiword query is sent chunk after chunk through the analyzer
> >> and, second pitfall, each chunk runs through the whole analyzer on its own.
> >>
> >> So if you are dealing with multiword synonyms you have the following
> >> problems. Either you turn your query into a phrase so that the whole
> >> phrase is analyzed at once and therefore looked up as a multiword synonym,
> >> but phrase queries are not analyzed !!! OR you send your query chunk
> >> by chunk through the analyzer, but then the chunks are not multiwords
> >> anymore and are not found in your synonyms.txt.
> >>
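The chunking pitfall described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the real Solr query parser: the synonym table, analyze and parse_query are hypothetical, and the model only shows what is looked up when a query is split on whitespace versus handed over as one quoted phrase. It does not settle how a real parser treats phrases, which the thread itself reports differently in different places.

```python
# Hypothetical synonym table keyed by the whole (lowercased) expression.
SYNONYMS = {
    "mairie": ["hotel de ville"],
    "hotel de ville": ["mairie"],
}


def analyze(chunk: str) -> list[str]:
    """Look the whole chunk up as a single synonym key."""
    key = chunk.lower()
    return [key] + SYNONYMS.get(key, [])


def parse_query(q: str) -> list[list[str]]:
    """Split on whitespace unless q is a quoted phrase, then analyze each chunk."""
    if len(q) >= 2 and q.startswith('"') and q.endswith('"'):
        chunks = [q[1:-1]]   # phrase: one chunk, looked up as a whole
    else:
        chunks = q.split()   # no quotes: each word is analyzed on its own
    return [analyze(c) for c in chunks]


print(parse_query("hotel de ville"))    # three chunks, no multiword lookup
print(parse_query('"hotel de ville"'))  # one chunk, multiword lookup succeeds
```

In the unquoted case no chunk is ever equal to "hotel de ville", so the multiword entry cannot be found, whatever the synonyms file contains.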
> >> From my experience I can say that it requires some deep work to get it
> >> done, but it is possible. I have connected a thesaurus to Solr which is
> >> doing query time expansion (no need to reindex if the thesaurus changes).
> >> The thesaurus holds synonyms and "used for terms" in 24 languages. So
> >> it is also some kind of language translation. And naturally the thesaurus
> >> translates from single term to multi term synonyms and vice versa.
> >>
> >> Regards,
> >> Bernd
> >>
> >>
> >> On 14.05.2012 13:54, elisabeth benoit wrote:
> >>> Just for the record, I'd like to conclude this thread
> >>>
> >>> First, you were right, there was no behaviour difference between fq and q
> >>> parameters.
> >>>
> >>> I realized that:
> >>>
> >>> 1) my synonym (hotel de ville) has a stopword in it (de) and since I used
> >>> tokenizerFactory="solr.KeywordTokenizerFactory" in my synonyms declaration,
> >>> there was no stopword removal in the indexed expression, so when requesting
> >>> "hotel de ville", after stopword removal in the query, Solr was comparing
> >>> "hotel de ville"
> >>> with "hotel ville"
> >>>
> >>> but my queries never even got to that point since
> >>>
> >>> 2) I made a mistake using "mairie" alone in the admin interface when
> >>> testing my schema. The real field was something like "collectivités
> >>> territoriales mairie",
> >>> so the synonym "hotel de ville" was not even applied, because of the
> >>> tokenizerFactory="solr.KeywordTokenizerFactory" in my synonym definition
> >>> not splitting field into words when parsing
> >>>
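Point 1 above (the stopword mismatch) can be sketched as follows. This is only a toy model in Python, not Solr's StopFilter: the stopword set and the helper names remove_stopwords, index_side and query_side are assumptions made for illustration. It shows that a stopword filter which compares whole tokens never sees "de" when the synonym is injected as one single token.

```python
STOPWORDS = {"de"}  # assumed stopword list, for illustration only


def remove_stopwords(tokens: list[str]) -> list[str]:
    """Drop tokens that are stopwords; each token is compared as a whole."""
    return [t for t in tokens if t not in STOPWORDS]


def index_side(synonym: str) -> list[str]:
    """Synonym injected as ONE token, so 'de' is never seen on its own."""
    return remove_stopwords([synonym])


def query_side(text: str) -> list[str]:
    """Query text split on whitespace before stopwords are removed."""
    return remove_stopwords(text.split())


indexed = index_side("hotel de ville")
queried = query_side("hotel de ville")
print(indexed)   # the single token still contains "de"
print(queried)   # "de" was dropped as a separate token
print(" ".join(queried) == indexed[0])
```

The two sides end up with different text ("hotel de ville" versus "hotel ville"), which is the comparison described in point 1.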
> >>> So my problem is not solved, and I'm considering solving it outside of
> >>> Solr scope, unless someone else has a clue
> >>>
> >>> Thanks again,
> >>> Elisabeth
> >>>
> >>>
> >>>
> >>> 2012/4/25 Erick Erickson <erickerick...@gmail.com>
> >>>
> >>>> A little farther down the debug info output you'll find something
> >>>> like this (I specified fq=name:features)
> >>>>
> >>>> <arr name="parsed_filter_queries">
> >>>> <str>name:features</str>
> >>>> </arr>
> >>>>
> >>>>
> >>>> so it may well give you some clue. But unless I'm reading things wrong,
> >>>> your q is going against a field that has much more information than the
> >>>> CATEGORY_ANALYZED field. Is it possible that the data from your
> >>>> test cases simply isn't _in_ CATEGORY_ANALYZED?
> >>>>
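For anyone who wants to reproduce this check, a request carrying q, fq and debugQuery=on can be assembled as in the Python sketch below. The host, port and /solr/select path are assumptions about a local test installation, and build_debug_url is a hypothetical helper; only the q, fq and debugQuery parameters come from the thread.

```python
from urllib.parse import urlencode


def build_debug_url(base_url: str, q: str, fq: str) -> str:
    """Build a select URL with a main query, one filter query and debug output."""
    params = [
        ("q", q),
        ("fq", fq),
        ("debugQuery", "on"),  # asks Solr to include the debug section
    ]
    # urlencode percent-encodes the quotes and encodes spaces as '+'.
    return f"{base_url}?{urlencode(params)}"


# Assumed local endpoint; adjust to the actual installation.
url = build_debug_url(
    "http://localhost:8983/solr/select",
    '"hotel de ville"',
    'CATEGORY_ANALYZED:"hotel de ville"',
)
print(url)
```

In the response, the parsed form of q and the parsed_filter_queries entry can then be compared side by side.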
> >>>> Best
> >>>> Erick
> >>>>
> >>>> On Wed, Apr 25, 2012 at 9:39 AM, elisabeth benoit
> >>>> <elisaelisael...@gmail.com> wrote:
> >>>>> I'm not at the office until next Wednesday, and I don't have my Solr
> >>>>> at hand, but isn't debugQuery=on giving information only about q
> >>>>> parameter matching and nothing about the fq parameter? Or do you mean
> >>>>> "parsed_filter_queries" gives information about fq?
> >>>>>
> >>>>> CATEGORY_ANALYZED is being populated by a copyField instruction in
> >>>>> schema.xml, and has the same field type as my catchall field, the
> >>>>> search field for my searchHandler (the one being used by the q parameter).
> >>>>>
> >>>>> CATEGORY (a string) is copied into CATEGORY_ANALYZED (field type is text)
> >>>>>
> >>>>> CATEGORY (a string) is copied into the catchall field (field type is text),
> >>>>> and a lot of other fields are copied into that catchall field too.
> >>>>>
> >>>>> So as far as I can see, the same analysis should be done in both cases,
> >>>>> but obviously I'm missing something, and the only thing I can think of
> >>>>> is a different behavior between the q and fq parameters.
> >>>>>
> >>>>> I'll check parsed_filter_queries first thing in the morning next
> >>>>> Wednesday.
> >>>>>
> >>>>> Thanks a lot for your help.
> >>>>>
> >>>>> Elisabeth
> >>>>>
> >>>>>
> >>>>> 2012/4/24 Erick Erickson <erickerick...@gmail.com>
> >>>>>
> >>>>>> Elisabeth:
> >>>>>>
> >>>>>> What shows up in the debug section of the response when you add
> >>>>>> &debugQuery=on? There should be some bit of that section like:
> >>>>>> "parsed_filter_queries"
> >>>>>>
> >>>>>> My other question is "are you absolutely sure that your
> >>>>>> CATEGORY_ANALYZED field has the correct content?". How does it
> >>>>>> get populated?
> >>>>>>
> >>>>>> Nothing jumps out at me here....
> >>>>>>
> >>>>>> Best
> >>>>>> Erick
> >>>>>>
> >>>>>> On Tue, Apr 24, 2012 at 9:55 AM, elisabeth benoit
> >>>>>> <elisaelisael...@gmail.com> wrote:
> >>>>>>> yes, thanks, but this is NOT my question.
> >>>>>>>
> >>>>>>> I was wondering why I have multiple matches with q="hotel de ville"
> >>>>>>> and no match with fq=CATEGORY_ANALYZED:"hotel de ville", since in both
> >>>>>>> cases I'm searching in the same Solr fieldType.
> >>>>>>>
> >>>>>>> Why is the q parameter behaving differently in that case? Why do the
> >>>>>>> quotes work in one case and not in the other?
> >>>>>>>
> >>>>>>> Does anyone know?
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Elisabeth
> >>>>>>>
> >>>>>>> 2012/4/24 Jeevanandam <je...@myjeeva.com>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> usage of q and fq
> >>>>>>>>
> >>>>>>>> q => is typically the main query for the search request
> >>>>>>>>
> >>>>>>>> fq => is Filter Query; generally used to restrict the super set of
> >>>>>>>> documents without influencing score (more info:
> >>>>>>>> http://wiki.apache.org/solr/CommonQueryParameters#q)
> >>>>>>>>
> >>>>>>>> For example:
> >>>>>>>> ------------
> >>>>>>>> q="hotel de ville" ===> returns 100 documents
> >>>>>>>>
> >>>>>>>> q="hotel de ville"&fq=price:[100 TO *]&fq=roomType:"King size Bed" ===>
> >>>>>>>> returns 40 documents from super set of 100 documents
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> hope this helps!
> >>>>>>>>
> >>>>>>>> - Jeevanandam
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 24-04-2012 3:08 pm, elisabeth benoit wrote:
> >>>>>>>>
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> I'd like to resume this post.
> >>>>>>>>>
> >>>>>>>>> The only way I found not to split synonyms into words in synonyms.txt
> >>>>>>>>> is to use the line
> >>>>>>>>>
> >>>>>>>>>  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>>>>>> ignoreCase="true" expand="true"
> >>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/>
> >>>>>>>>>
> >>>>>>>>> in schema.xml
> >>>>>>>>>
> >>>>>>>>> where tokenizerFactory="solr.KeywordTokenizerFactory"
> >>>>>>>>>
> >>>>>>>>> instructs SynonymFilterFactory not to break synonyms into words on
> >>>>>>>>> white spaces when parsing the synonyms file.
> >>>>>>>>>
> >>>>>>>>> So now it works fine, "mairie" is mapped into "hotel de ville" and
> >>>>>>>>> when I send the request q="hotel de ville" (quotes are mandatory to
> >>>>>>>>> prevent the analyzer from splitting hotel de ville on white spaces),
> >>>>>>>>> I get answers with the word "mairie".
> >>>>>>>>>
> >>>>>>>>> But when I use the fq parameter (fq=CATEGORY_ANALYZED:"hotel de
> >>>>>>>>> ville"), it doesn't work!!!
> >>>>>>>>>
> >>>>>>>>> CATEGORY_ANALYZED is the same field type as the default search field.
> >>>>>>>>> This means that when I send q="hotel de ville" and
> >>>>>>>>> fq=CATEGORY_ANALYZED:"hotel de ville", Solr uses the same analyzer,
> >>>>>>>>> the one with the line
> >>>>>>>>>
> >>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>>>>>> ignoreCase="true" expand="true"
> >>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/>.
> >>>>>>>>>
> >>>>>>>>> Does anyone have a clue what is different between q analysis behaviour
> >>>>>>>>> and fq analysis behaviour?
> >>>>>>>>>
> >>>>>>>>> Thanks a lot
> >>>>>>>>> Elisabeth
> >>>>>>>>>
> >>>>>>>>> 2012/4/12 elisabeth benoit <elisaelisael...@gmail.com>
> >>>>>>>>>
> >>>>>>>>>> oh, that's right.
> >>>>>>>>>>
> >>>>>>>>>> thanks a lot,
> >>>>>>>>>> Elisabeth
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 2012/4/11 Jeevanandam Madanagopal <je...@myjeeva.com>
> >>>>>>>>>>
> >>>>>>>>>>> Elisabeth -
> >>>>>>>>>>>
> >>>>>>>>>>> As you described, the mapping below might suit your need.
> >>>>>>>>>>> mairie => hotel de ville, mairie
> >>>>>>>>>>>
> >>>>>>>>>>> mairie gets expanded to "hotel de ville" and "mairie" at index time,
> >>>>>>>>>>> so both "mairie" and "hotel de ville" are searchable on the document.
> >>>>>>>>>>>
> >>>>>>>>>>> However, the white space tokenizer splitting at query time will still
> >>>>>>>>>>> be a problem, as described by Markus.
> >>>>>>>>>>>
> >>>>>>>>>>> --Jeevanandam
> >>>>>>>>>>>
> >>>>>>>>>>> On Apr 11, 2012, at 12:30 PM, elisabeth benoit wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> <<Have you tried the "=>' mapping instead? Something
> >>>>>>>>>>>> <<like
> >>>>>>>>>>>> <<hotel de ville => mairie
> >>>>>>>>>>>> <<might work for you.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Yes, thanks, I've tried it but from what I understand it doesn't
> >>>>>>>>>>>> solve my problem, since this means hotel de ville will be replaced
> >>>>>>>>>>>> by mairie at index time (I use synonyms only at index time). So when
> >>>>>>>>>>>> a user asks for "hôtel de ville", it won't match.
> >>>>>>>>>>>>
> >>>>>>>>>>>> In fact, at index time I have mairie in my data, but I want the user
> >>>>>>>>>>>> to be able to request "mairie" or "hôtel de ville" and have mairie as
> >>>>>>>>>>>> the answer, and not have mairie as an answer when requesting "hôtel".
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> <<To map `mairie` to `hotel de ville` as single token you must
> >>>>>>>>>>>> <<escape your white space.
> >>>>>>>>>>>>
> >>>>>>>>>>>> <<mairie, hotel\ de\ ville
> >>>>>>>>>>>>
> >>>>>>>>>>>> <<This results in a problem if your tokenizer splits on white space
> >>>>>>>>>>>> <<at query time.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Ok, I guess this means I have a problem. No simple solution since at
> >>>>>>>>>>>> query time my tokenizer does split on white spaces.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I guess my problem is more or less one of the problems discussed in
> >>>>>>>>>>>>
> >>>>>>>>>>>> http://lucene.472066.n3.nabble.com/Multi-word-synonyms-td3716292.html#a3717215
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks a lot for your answers,
> >>>>>>>>>>>> Elisabeth
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2012/4/10 Erick Erickson <erickerick...@gmail.com>
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Have you tried the "=>' mapping instead? Something
> >>>>>>>>>>>>> like
> >>>>>>>>>>>>> hotel de ville => mairie
> >>>>>>>>>>>>> might work for you.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Best
> >>>>>>>>>>>>> Erick
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Tue, Apr 10, 2012 at 1:41 AM, elisabeth benoit
> >>>>>>>>>>>>> <elisaelisael...@gmail.com> wrote:
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I've read several posts on this issue, but can't find a real
> >>>>>>>>>>>>>> solution to my multi-word synonyms matching problem.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I have in my synonyms.txt an entry like
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> mairie, hotel de ville
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> and my index time analyzer is configured as follows for synonyms.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>>>>>>>>>>> ignoreCase="true" expand="true"/>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The problem I have is that now "mairie" matches with "hotel" and I
> >>>>>>>>>>>>>> would only want "mairie" to match with "hotel de ville" and "mairie".
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> When I look into the analyzer, I see that "mairie" is mapped into
> >>>>>>>>>>>>>> "hotel", and words "de ville" are added in second and third position.
> >>>>>>>>>>>>>> To change that, I tried to do
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >>>>>>>>>>>>>> ignoreCase="true" expand="true"
> >>>>>>>>>>>>>> tokenizerFactory="solr.KeywordTokenizerFactory"/> (as I read in one
> >>>>>>>>>>>>>> post)
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> and I can see now in the analyzer that "mairie" is mapped to "hotel de
> >>>>>>>>>>>>>> ville", but now when I have query "hotel de ville", it doesn't match
> >>>>>>>>>>>>>> at all with "mairie".
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Does anyone have a clue of what I'm doing wrong?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm using Solr 3.4.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Elisabeth
> >>>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >>>
> >>
> >> --
> >> *************************************************************
> >> Bernd Fehling                Universitätsbibliothek Bielefeld
> >> Dipl.-Inform. (FH)                        Universitätsstr. 25
> >> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
> >> bernd.fehl...@uni-bielefeld.de                33615 Bielefeld
> >>
> >> BASE - Bielefeld Academic Search Engine - www.base-search.net
> >> *************************************************************
> >>
> >
>
> --
> *************************************************************
> Bernd Fehling                Universitätsbibliothek Bielefeld
> Dipl.-Inform. (FH)                        Universitätsstr. 25
> Tel. +49 521 106-4060                   Fax. +49 521 106-4052
> bernd.fehl...@uni-bielefeld.de                33615 Bielefeld
>
> BASE - Bielefeld Academic Search Engine - www.base-search.net
> *************************************************************
>
