Re: Multi-word synonyms not working

elisabeth benoit Thu, 14 Mar 2024 00:47:06 -0700

Thanks a lot Annika for your explanation with links. We'll check that out.
And thanks to Charlie too.


Best regards,
Elisabeth

On Mon, Mar 11, 2024 at 10:32, Charlie Hull <[email protected]>
wrote:

> Hi Annika,
>
> Glad you like Bertrand's Haystack presentation! My colleague Daniel
> Wrigley recently wrote an overview blog on synonyms here
>
> https://opensourceconnections.com/blog/2023/03/29/applying-synonyms-types-strategies-tools-and-a-glimpse-into-the-future/
> which links to several other synonym blogs on our site.
>
> Best
>
> Charlie
>
> On 08/03/2024 15:09, Annika Gable wrote:
> > Hi Mikhail, Elisabeth, and Atin,
> >
> > Thank you for your input. After working on this issue for days, I
> > finally found the main culprits, and I've taken the following steps:
> >
> > 1. Doing synonym expansion only at query time, not at index time, in
> > order to get correct multi-word synonyms.
> > 2. Using WhitespaceTokenizer instead of StandardTokenizer or
> > KeywordTokenizer; otherwise, hyphenated words like immuno-oncology will
> > always be split into immuno and oncology, which will not be found in the
> > synonym definitions.
> > 3. Using the SnowballPorterFilter for stemming only after
> > SynonymGraphFilter; otherwise, immuno-oncology will be stemmed into
> > immuno-oncolog, which does not match the immuno-oncology in the
> > synonyms.txt file.
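> >
> > The three steps above could be sketched roughly as follows. This is an
> > illustrative field type, not the actual schema: the name text_synonyms
> > is hypothetical, and the synonyms file name is taken from the original
> > post further down.
> >
> > ```xml
> > <fieldType name="text_synonyms" class="solr.TextField"
> >            positionIncrementGap="100">
> >   <!-- Index time: no synonym expansion (step 1) -->
> >   <analyzer type="index">
> >     <!-- Splits on whitespace only, so immuno-oncology stays whole (step 2) -->
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
> >   </analyzer>
> >   <!-- Query time: expand synonyms first, then stem (step 3) -->
> >   <analyzer type="query">
> >     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >     <filter class="solr.LowerCaseFilterFactory"/>
> >     <filter class="solr.SynonymGraphFilterFactory"
> >             synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.SnowballPorterFilterFactory" language="English"/>
> >   </analyzer>
> > </fieldType>
> > ```
> >
> > Because the stemmer runs after the synonym filter at query time, the
> > expanded synonyms are stemmed the same way as the indexed tokens.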
> >
> > I found this presentation
> > https://www.slideshare.net/BertrandRigaldies/the-solr-multiterms-synonyms-maze-graphs
> > incredibly helpful, as well as setting up a minimal example of an index
> > containing only 5 documents.
> > It may turn out at a later point that I need to use synonyms at index
> > time for speed, in which case I would only index the single-word synonyms
> > there, as suggested by Bertrand Rigaldies in the above presentation.
> >
> > @Mikhail "It's usually tough." I've noticed :)
> > @Elisabeth: Thank you for your suggestion. From the description, it seems
> > like this fixes query-time expansion of synonyms, which the
> > SynonymGraphFilter and the query parser handle correctly in newer Solr
> > versions.
> >
> > Best regards,
> > Annika
> >
> >
> >
> > On Thu, Mar 7, 2024 at 12:56 PM atin janki <[email protected]> wrote:
> >
> >> Hi Annika,
> >>
> >> Can you please share a sample query and how it is being expanded?
> >> Also, share how you expect it to be expanded.
> >> It would help to replicate your scenario and understand the problem
> >> better.
> >>
> >> Best Regards,
> >> Atin Janki
> >>
> >>
> >> On Tue, Mar 5, 2024 at 4:21 PM elisabeth benoit <[email protected]>
> >> wrote:
> >>
> >>> Hello Annika,
> >>>
> >>> For multi-word synonyms, we have been using the
> >>> https://github.com/healthonnet/hon-lucene-synonyms
> >>> jar, which we just rebuilt with Solr 9.2.1 (a modification is needed;
> >>> I can share details if you ever need them).
> >>>
> >>> It overrides the edismax query parser and expands multi-word synonyms
> >>> at query time.
> >>>
> >>> We didn't want to expand synonyms at index time because we had this
> >>> problem:
> >>>
> >>> in the index: mairie
> >>> synonym: hotel de ville
> >>>
> >>> and then at query time, with query 'hotel', mairie would match.
> >>>
> >>> With hon-lucene, when a user asks for "hotel de ville", we match
> >>> mairie, but "hotel" alone doesn't match mairie.
> >>>
> >>> You might have performance issues with hon-lucene if you have hundreds
> >>> of synonyms. But it's worth testing.
> >>>
> >>> Best regards,
> >>> Elisabeth
> >>>
> >>> On Mon, Mar 4, 2024 at 17:16, Mikhail Khludnev <[email protected]> wrote:
> >>>> Hello Annika,
> >>>> You may use the Solr Admin Analysis page and the debugQuery and
> >>>> explainOther params to dig into a particular case. It's usually tough.
> >>>>   I've found one clue in the ref guide:
> >>>>   "To get fully correct positional queries when your synonym
> >>>>   replacements are multiple tokens, you should instead apply synonyms
> >>>>   using this filter at query time."
> >>>> You could probably start from something simple.
> >>>>
> >>>> On Mon, Mar 4, 2024 at 5:23 PM Annika Gable
> >>>> <[email protected]> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I'm using Solr 9.1, and I'm trying to set up synonyms. I managed to
> >>>>> get synonyms to work for single-word synonyms, but not for multi-word
> >>>>> and hyphenated synonyms.
> >>>>>
> >>>>> In the final state, I am planning on having a very extensive synonym
> >>>>> file (hundreds, if not thousands of lines) because I want to always
> >>>>> find results for all child terms and other synonyms of a given search
> >>>>> term. This is why I thought it may make sense to list all synonyms in
> >>>>> the index. But getting it to work with query-time synonym expansion
> >>>>> would also be great already.
> >>>>> For now, I am testing with equivalent synonyms. I am always querying
> >>>>> using quotation marks around the multi-word query.
> >>>>>
> >>>>> What I have tried:
> >>>>> 1. I included sow=false in the query as recommended here:
> >>>>> https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> >>>>> 2. I used the SynonymGraphFilter either only at query time, or at
> >>>>> index time, or both -> I got the same number of results when querying
> >>>>> single-word synonyms, as expected (e.g. TIGIT, domvanalimab), but
> >>>>> querying multi-word synonyms did not find the other synonyms correctly.
> >>>>> 3. I made all text fields into a text_field (which uses the
> >>>>> KeywordTokenizer) instead of text_general (which uses the
> >>>>> StandardTokenizer), in order to prevent splitting up multi-word
> >>>>> queries. -> This still did not make multi-word synonyms work.
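> >>>>>
> >>>>> For item 1, sow=false can be passed per request (&sow=false) or set
> >>>>> as a handler default. A minimal sketch of the latter in
> >>>>> solrconfig.xml, assuming the edismax parser and a hypothetical query
> >>>>> field named catchall:
> >>>>>
> >>>>> ```xml
> >>>>> <requestHandler name="/select" class="solr.SearchHandler">
> >>>>>   <lst name="defaults">
> >>>>>     <str name="defType">edismax</str>
> >>>>>     <!-- Do not split the query on whitespace before analysis, so the
> >>>>>          query analyzer can see multi-word synonyms as a whole -->
> >>>>>     <str name="sow">false</str>
> >>>>>     <!-- Hypothetical field name; replace with the real query field -->
> >>>>>     <str name="qf">catchall</str>
> >>>>>   </lst>
> >>>>> </requestHandler>
> >>>>> ```
> >>>>>
> >>>>> In recent Solr versions sow already defaults to false, so this mainly
> >>>>> makes the setting explicit.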
> >>>>>
> >>>>>
> >>>>> My country-synonyms.txt file looks like this:
> >>>>>
> >>>>> TIGIT, domvanalimab, COM902, BMS-986207, Anti-TIGIT Antibody
> >>>>> immuno-oncology, immunooncology
> >>>>> Afghanistan, AF, AFG
> >>>>> Albania, AL, ALB
> >>>>>
> >>>>>
> >>>>> And the relevant query fields from my schema.xml look like this, with
> >>>>> text_general being the fieldtype of the catchall field
> >>>>>
> >>>>> <fieldType name="text_field" class="solr.TextField"
> >>>>> positionIncrementGap="100">
> >>>>>      <analyzer type="index">
> >>>>>         <tokenizer class="solr.KeywordTokenizerFactory" />
> >>>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>>         <filter class="solr.SynonymGraphFilterFactory"
> >>>>> synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> >>>>>         <filter class="solr.FlattenGraphFilterFactory"/>
> >>>>>      </analyzer>
> >>>>>      <analyzer type="query">
> >>>>>         <tokenizer class="solr.KeywordTokenizerFactory" />
> >>>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>>         <filter class="solr.SynonymGraphFilterFactory"
> >>>>> synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> >>>>>      </analyzer>
> >>>>> </fieldType>
> >>>>> <fieldType name="text_general" class="solr.TextField"
> >>>>> positionIncrementGap="100">
> >>>>>      <analyzer type="index">
> >>>>>         <tokenizer class="solr.StandardTokenizerFactory" />
> >>>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>>         <filter class="solr.SnowballPorterFilterFactory"
> >>>>>                 language="English" />
> >>>>>         <filter class="solr.SynonymGraphFilterFactory"
> >>>>> synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> >>>>>         <filter class="solr.FlattenGraphFilterFactory"/>
> >>>>>      </analyzer>
> >>>>>      <analyzer type="query">
> >>>>>         <tokenizer class="solr.StandardTokenizerFactory" />
> >>>>>         <filter class="solr.LowerCaseFilterFactory" />
> >>>>>         <filter class="solr.SnowballPorterFilterFactory"
> >>>>>                 language="English" />
> >>>>>         <filter class="solr.SynonymGraphFilterFactory"
> >>>>> synonyms="country-synonyms.txt" ignoreCase="true" expand="true"/>
> >>>>>      </analyzer>
> >>>>> </fieldType>
> >>>>>
> >>>>>
> >>>>> Any hints would be appreciated!
> >>>>>
> >>>> --
> >>>> Sincerely yours
> >>>> Mikhail Khludnev
> >>>>
> --
> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> Founding member of The Search Network and co-author of Searching the
> Enterprise
