Thanks a lot Annika for your explanation with links. We'll check that out. And thanks to Charlie too.
Best regards,
Elisabeth

On Mon, 11 Mar 2024 at 10:32, Charlie Hull <[email protected]> wrote:

> Hi Annika,
>
> Glad you like Bertrand's Haystack presentation! My colleague Daniel
> Wrigley recently wrote an overview blog on synonyms here
> https://opensourceconnections.com/blog/2023/03/29/applying-synonyms-types-strategies-tools-and-a-glimpse-into-the-future/
> which links to several other synonym blogs on our site.
>
> Best
>
> Charlie
>
> On 08/03/2024 15:09, Annika Gable wrote:
> > Hi Mikhail, Elisabeth, and Atin,
> >
> > Thank you for your inputs. After working on this issue for days, I
> > finally found the main culprits, and I've taken the following steps:
> >
> > 1. Doing synonym expansion only at query time, not at index time, in
> >    order to get correct multi-word synonyms.
> > 2. Using WhitespaceTokenizer instead of StandardTokenizer or
> >    KeywordTokenizer; otherwise, hyphenated words like immuno-oncology
> >    will always be split into immuno and oncology, which will not be
> >    found in the synonym definitions!
> > 3. Using the SnowballPorterFilter for stemming only after
> >    SynonymGraphFilter; otherwise, immuno-oncology will be stemmed into
> >    immuno-oncolog, which does not match the immuno-oncology in the
> >    synonyms.txt file.
> >
> > I found this presentation
> > https://www.slideshare.net/BertrandRigaldies/the-solr-multiterms-synonyms-maze-graphs
> > incredibly helpful, as well as setting up a minimal example of an
> > index containing only 5 documents.
> > It may turn out at a later point that I need to use synonyms at
> > index time for speed, in which case I would only index the
> > single-word synonyms there, as suggested by Bertrand Rigaldies in the
> > above presentation.
> >
> > @Mikhail "It's usually tough." I've noticed :)
> > @Elisabeth: Thank you for your suggestion. From the description, it
> > seems like this fixes query-time expansion of synonyms, which the
> > SynonymGraphFilter and the query parser handle correctly in newer
> > Solr versions.
> >
> > Best regards,
> > Annika
> >
> > On Thu, Mar 7, 2024 at 12:56 PM atin janki <[email protected]> wrote:
> >
> >> Hi Annika,
> >>
> >> Can you please share a sample query and how it is being expanded?
> >> Also, share how you expect it to be expanded.
> >> It would help to replicate your scenario and understand the problem
> >> better.
> >>
> >> Best Regards,
> >> Atin Janki
> >>
> >> On Tue, Mar 5, 2024 at 4:21 PM elisabeth benoit <[email protected]>
> >> wrote:
> >>
> >>> Hello Annika,
> >>>
> >>> For multi-word synonyms, we have been using the
> >>> https://checkpoint.url-protection.com/v1/url?o=https%3A//github.com/healthonnet/hon-lucene-synonyms&g=ZWU1ZmU1OWFjYWFmNTdhYw==&h=ZGJiZjQzY2Q3MTYwZDU3MmQ5OGViZDAzMTQ2YzRiZWRmMjUyODNmM2YzZjViMTA2ZjJlZWE2OTQ2NjRiMTdhZQ==&p=YzJlOmltbXVuYWk6YzpnOjhhNTQzYzk1Y2IyYTVmMWRmMjk0NTJmMWQxMDk0NTg4OnYxOnA6VA==
> >>> jar, which we just rebuilt with Solr 9.2.1 (a modification is
> >>> needed, if you ever need details).
> >>>
> >>> It overrides the edismax query parser and expands multi-word
> >>> synonyms at query time.
> >>>
> >>> We didn't want to expand synonyms at index time because we had this
> >>> problem:
> >>>
> >>> in the index: mairie
> >>> synonym: hotel de ville
> >>>
> >>> and then at query time, with query 'hotel', mairie would match.
> >>>
> >>> With hon-lucene, when the user asks for "hotel de ville", we match
> >>> with mairie, but "hotel" doesn't match with mairie.
> >>>
> >>> You might have performance issues with hon-lucene if you have
> >>> hundreds of synonyms. But it's worth testing.
> >>>
> >>> Best regards,
> >>> Elisabeth
> >>>
> >>> On Mon, 4 Mar 2024 at 17:16, Mikhail Khludnev <[email protected]>
> >>> wrote:
> >>>
> >>>> Hello Annika,
> >>>> You may use the SolrAdmin/Analysis page, debugQuery and
> >>>> explainOther params to dig into a particular case. It's usually
> >>>> tough.
> >>>> I've found one clue in the ref guide:
> >>>> To get fully correct positional queries when your synonym
> >>>> replacements are multiple tokens, you should instead apply
> >>>> synonyms using this filter at query time.
> >>>> Probably you may start from something simple.
> >>>>
> >>>> On Mon, Mar 4, 2024 at 5:23 PM Annika Gable <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I'm using Solr 9.1, and I'm trying to set up synonyms. I managed
> >>>>> to get synonyms to work for single-word synonyms, but not for
> >>>>> multi-word and hyphenated synonyms.
> >>>>>
> >>>>> In the final state, I am planning on having a very extensive
> >>>>> synonym file (hundreds, if not thousands of lines) because I want
> >>>>> to always find results for all child terms and other synonyms of
> >>>>> a given search term. This is why I thought it may make sense to
> >>>>> list all synonyms in the index. But getting it to work with
> >>>>> query-time synonym expansion would also be great already.
> >>>>> For now, I am testing with equivalent synonyms. I am always
> >>>>> querying using quotation marks around the multi-word query.
> >>>>>
> >>>>> What I have tried:
> >>>>> 1. I included sow=false in the query as recommended here
> >>>>>    https://checkpoint.url-protection.com/v1/url?o=https%3A//lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/&g=OTQzMzE0MjVhNzNmYTcwMQ==&h=MmNjMmFhOWY4ZDE0ODUwMDA0NWE1NTQzZGI3NzYyOGJkODQ3MDBiZmUxZTYxMzg2OWE0ZTZlOTMxZmE2MDgzOA==&p=YzJlOmltbXVuYWk6YzpnOjhhNTQzYzk1Y2IyYTVmMWRmMjk0NTJmMWQxMDk0NTg4OnYxOnA6VA==
> >>>>> 2. I used the SynonymGraphFilter either only at query time, or at
> >>>>>    index time, or both -> I got the same number of results when
> >>>>>    querying single-word synonyms, as expected (e.g. TIGIT,
> >>>>>    domvanalimab), but querying multi-word synonyms did not find
> >>>>>    the other synonyms correctly.
> >>>>> 3. I made all text fields into a text_field (which uses the
> >>>>>    KeywordTokenizer) instead of text_general (which uses the
> >>>>>    StandardTokenizer), in order to prevent splitting up
> >>>>>    multi-word queries. -> This still did not make multi-word
> >>>>>    synonyms work.
> >>>>>
> >>>>> My country-synonyms.txt file looks like this:
> >>>>>
> >>>>> TIGIT, domvanalimab, COM902, BMS-986207, Anti-TIGIT Antibody
> >>>>> immuno-oncology, immunooncology
> >>>>> Afghanistan, AF, AFG
> >>>>> Albania, AL, ALB
> >>>>>
> >>>>> And the relevant query fields from my schema.xml look like this,
> >>>>> with text_general being the fieldtype of the catchall field:
> >>>>>
> >>>>> <fieldType name="text_field" class="solr.TextField"
> >>>>>            positionIncrementGap="100">
> >>>>>   <analyzer type="index">
> >>>>>     <tokenizer class="solr.KeywordTokenizerFactory" />
> >>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>     <filter class="solr.SynonymGraphFilterFactory"
> >>>>>             synonyms="country-synonyms.txt" ignoreCase="true"
> >>>>>             expand="true"/>
> >>>>>     <filter class="solr.FlattenGraphFilterFactory"/>
> >>>>>   </analyzer>
> >>>>>   <analyzer type="query">
> >>>>>     <tokenizer class="solr.KeywordTokenizerFactory" />
> >>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>     <filter class="solr.SynonymGraphFilterFactory"
> >>>>>             synonyms="country-synonyms.txt" ignoreCase="true"
> >>>>>             expand="true"/>
> >>>>>   </analyzer>
> >>>>> </fieldType>
> >>>>> <fieldType name="text_general" class="solr.TextField"
> >>>>>            positionIncrementGap="100">
> >>>>>   <analyzer type="index">
> >>>>>     <tokenizer class="solr.StandardTokenizerFactory" />
> >>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>     <filter class="solr.SnowballPorterFilterFactory"
> >>>>>             language="English" />
> >>>>>     <filter class="solr.SynonymGraphFilterFactory"
> >>>>>             synonyms="country-synonyms.txt" ignoreCase="true"
> >>>>>             expand="true"/>
> >>>>>     <filter class="solr.FlattenGraphFilterFactory"/>
> >>>>>   </analyzer>
> >>>>>   <analyzer type="query">
> >>>>>     <tokenizer class="solr.StandardTokenizerFactory" />
> >>>>>     <filter class="solr.LowerCaseFilterFactory" />
> >>>>>     <filter class="solr.SnowballPorterFilterFactory"
> >>>>>             language="English" />
> >>>>>     <filter class="solr.SynonymGraphFilterFactory"
> >>>>>             synonyms="country-synonyms.txt" ignoreCase="true"
> >>>>>             expand="true"/>
> >>>>>   </analyzer>
> >>>>> </fieldType>
> >>>>>
> >>>>> Any hints would be appreciated!
> >>>>>
> >>>>
> >>>> --
> >>>> Sincerely yours
> >>>> Mikhail Khludnev
> >>>>
>
> --
> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> Founding member of The Search Network and co-author of Searching the
> Enterprise
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
>
> OpenSource Connections Europe GmbH | Pappelallee 78/79 | 10437 Berlin
> Amtsgericht Charlottenburg | HRB 230712 B
> Geschäftsführer: John M. Woodell | David E. Pugh
> Finanzamt: Berlin Finanzamt für Körperschaften II
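
For anyone landing on this thread later: the three steps Annika lists in her 08/03/2024 message (query-time-only synonyms, whitespace tokenization, stemming after the synonym filter) could be sketched as a field type roughly like the one below. This is only a sketch under assumptions, not her actual schema, which was not posted. The field type name text_synonyms is made up for illustration, and country-synonyms.txt is the file name from her original message.

```xml
<!-- Hypothetical field type name; not taken from the thread. -->
<fieldType name="text_synonyms" class="solr.TextField"
           positionIncrementGap="100">
  <!-- Index time: no synonym filter, so FlattenGraphFilterFactory is not needed. -->
  <analyzer type="index">
    <!-- Whitespace tokenization keeps "immuno-oncology" as a single token. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <!-- Query time: synonyms are expanded before stemming, so the unstemmed
       entries in the synonyms file can still match. -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymGraphFilterFactory"
            synonyms="country-synonyms.txt" ignoreCase="true"
            expand="true"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

The index-time chain stems tokens the same way the query-time chain does after synonym expansion, so both sides should end up with comparable terms. Checking the output on the Analysis page against a small test index, as Annika did with her 5 documents, is probably the safest way to confirm the behaviour on a given Solr version.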
