The equivalent terms on the right-hand side of the `=>` operator in the
example you sent should be separated by commas. You mention you already
tried the comma-only form (e.g. one line: `SRN,Stroke Research Network`)
and that it yielded unexpected results as well. I would recommend
normalizing the case of all the terms in synonyms.txt up front (i.e.,
lower-casing them), and applying the synonym filter _after_ case
normalization in the analysis chain. There are other ways to do this, but
the key point is that you need to pay attention to case and how it
interacts with the order in which filters are applied.
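
To make the pre-normalization step concrete, here is a minimal Python
sketch. It assumes every `=>` rule repeats the acronym at the start of the
right-hand side, as in the lines you posted; the function name
`normalize_rule` is made up for illustration and is not part of Solr.

```python
def normalize_rule(line):
    """Lower-case one synonyms.txt rule and comma-separate its right-hand side.

    Assumption: in an explicit mapping, the right-hand side starts with the
    acronym itself, followed by a space and the expansion.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return line  # leave blank lines and comments untouched
    if "=>" not in line:
        return line.lower()  # comma-only rule: just lower-case it
    lhs, rhs = (part.strip() for part in line.split("=>", 1))
    if rhs.startswith(lhs + " "):
        expansion = rhs[len(lhs):].strip()
        return f"{lhs.lower()} => {lhs.lower()}, {expansion.lower()}"
    return f"{lhs.lower()} => {rhs.lower()}"


print(normalize_rule("SRN=>SRN Stroke Research Network"))
print(normalize_rule("SRN,Stroke Research Network"))
```

Lower-casing the file only helps if the synonym filter sees lower-cased
tokens, which is why it needs to come after case normalization in the
chain.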

Re: Charlie's recommendation to apply these at index-time, a word of
caution (and it's possible that this is in fact the underlying cause of
some of the unexpected behavior you're observing?): be careful if you're
using term _expansion_ at index-time (i.e., mapping single terms to
multiple terms, which I note appears to be what you're trying to do in the
example lines you provided). Multi-term index-time synonyms can lead to
unexpected results for positional queries (either explicit phrase queries,
or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
least two good overviews of this topic, one by Mike McCandless focusing on
Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
issue is related to LUCENE-4312 [3], so both posts (ES- & Solr-related) are
relevant.
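
As a rough illustration of the positional problem, here is a toy Python
model. It is not Lucene's implementation: it only assumes that the index
stores one position per token and no position length, and the exact
positions Lucene assigns depend on the filter and version. The rule and
the function names are hypothetical.

```python
from itertools import product

# Hypothetical single-term -> multi-term expansion rule (already lower-cased).
SYNONYMS = {"srn": ["stroke", "research", "network"]}


def index_positions(tokens, synonyms):
    """Build {term: [positions]} for one document.

    Toy assumption: an expanded term stays at the first position of its
    expansion, and the next token resumes after the last expanded word.
    """
    postings = {}
    pos = 0
    for tok in tokens:
        expansion = synonyms.get(tok, [tok])
        if tok in synonyms:
            postings.setdefault(tok, []).append(pos)  # keep the original term
        for offset, word in enumerate(expansion):
            postings.setdefault(word, []).append(pos + offset)
        pos += len(expansion)
    return postings


def phrase_matches(postings, phrase):
    """Exact phrase match: term i must sit at (start position + i)."""
    lists = [postings.get(term, []) for term in phrase]
    if not all(lists):
        return False
    return any(
        all(p == combo[0] + i for i, p in enumerate(combo))
        for combo in product(*lists)
    )


postings = index_positions("the srn trial".split(), SYNONYMS)
print(postings)
print(phrase_matches(postings, ["stroke", "research", "network", "trial"]))
print(phrase_matches(postings, ["srn", "trial"]))
```

In this model the document "the srn trial" matches the expanded phrase but
no longer matches the exact phrase "srn trial", because "trial" has been
pushed three positions past "srn".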

One way to work around this is to "collapse" (rather than expand) synonyms,
at both index and query time. Another option would be to apply synonym
expansion only at query-time. It's also worth noting that increasing phrase
slop (`ps` param, etc.) can mask the issues with index-time synonym
expansion: the most blatant "false negative" manifestations for phrase
queries "fly under the radar", but the underlying position problem is
still there.
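
The "collapse" approach can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not Solr code: the `CANONICAL`
table and both function names are hypothetical, and in Solr the same idea
would be expressed as synonym rules that map every variant to a single
token in both the index and query analyzers.

```python
# Hypothetical table: every variant (as a tuple of tokens) -> one canonical token.
CANONICAL = {
    ("srn",): "srn",
    ("stroke", "research", "network"): "srn",
}


def collapse(tokens, canonical):
    """Replace each known variant with its canonical token, longest match first."""
    max_len = max((len(key) for key in canonical), default=1)
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in canonical:
                out.append(canonical[key])
                i += n
                break
        else:
            out.append(tokens[i])  # no variant starts here; keep the token
            i += 1
    return out


def contains_phrase(doc_tokens, phrase_tokens):
    """Exact phrase match over a flat token list."""
    n = len(phrase_tokens)
    return any(doc_tokens[i:i + n] == phrase_tokens
               for i in range(len(doc_tokens) - n + 1))


# Collapse at "index time" and at "query time" with the same table.
doc = collapse("the stroke research network trial".split(), CANONICAL)
print(doc)
print(contains_phrase(doc, collapse("srn trial".split(), CANONICAL)))
print(contains_phrase(doc, collapse("stroke research network trial".split(), CANONICAL)))
```

Because both sides reduce to the same single token, positions stay
contiguous and exact phrase queries match without any slop.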

[1] https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
[2] https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
[3] https://issues.apache.org/jira/browse/LUCENE-4312

On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
ch...@opensourceconnections.com> wrote:

> I'm wondering if you should be using these acronyms at index time, not
> search time. It will make your index bigger and you'll have to re-index
> to add new synonyms (as they may apply to old documents) but this could
> be an occasional task, and in the meantime you could use query-time
> synonyms for the new ones.
>
> Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to me.
>
> Cheers
>
> Charlie
>
> On 15/01/2021 09:48, Shaun Campbell wrote:
> > I have a medical journals search application and I've a list of some
> > 9,000 acronyms like this:
> >
> > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > SRN=>SRN Stroke Research Network
> > IGBP=>IGBP isolated gastric bypass
> > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive sleep apnoea–hypopnoea
> > SRM=>SRM standardised response mean
> > SRT=>SRT substrate reduction therapy
> > SRS=>SRS Sexual Rating Scale
> > SRU=>SRU stroke rehabilitation unit
> > T2w=>T2w T2-weighted
> > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > MSOA=>MSOA middle-layer super output area
> > SSA=>SSA site-specific assessment
> > SSC=>SSC Study Steering Committee
> > SSB=>SSB short-stretch bandage
> > SSE=>SSE sum squared error
> > SSD=>SSD social services department
> > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> >
> > I tried to put them in a synonyms file, either just with a comma between,
> > or with an arrow in between and the acronym repeated on the right like
> > above, and no matter what I try I'm getting really strange search
> > results. It's like words in one acronym are matching with the same word
> > in another acronym and then searching with that acronym which is
> > completely unrelated.
> >
> > I don't think Solr can handle this, but does anyone know of any crafty
> > tricks in Solr to handle this situation where I can either search by the
> > acronym or by the text?
> >
> > Shaun
> >
>
> --
> Charlie Hull - Managing Consultant at OpenSource Connections Limited
> <www.o19s.com>
> Founding member of The Search Network <https://thesearchnetwork.com/>
> and co-author of Searching the Enterprise
> <https://opensourceconnections.com/about-us/books-resources/>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
>
