Re: Handling acronyms

Shaun Campbell Fri, 15 Jan 2021 06:52:33 -0800

Hi Michael

Thanks for that, I'll have a study later. It's just reminded me of the
expand option which I meant to have a look at.


Thanks
Shaun

On Fri, 15 Jan 2021 at 14:33, Michael Gibney <mich...@michaelgibney.net>
wrote:

> The equivalent terms on the right-hand side of the `=>` operator in the
> example you sent should be separated by a comma. You mention you already
> tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
> and that that yielded unexpected results as well. I would recommend
> pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
> applying the synonym filter _after_ case normalization in the analysis
> chain (there are other ways you could do this, but the key point is that you
> need to pay attention to case and how it interacts with the order in which
> filters are applied).
>
> Re: Charlie's recommendation to apply these at index-time, a word of
> caution (and it's possible that this is in fact the underlying cause of
> some of the unexpected behavior you're observing?): be careful if you're
> using term _expansion_ at index-time (i.e., mapping single terms to
> multiple terms, which I note appears to be what you're trying to do in the
> example lines you provided). Multi-term index-time synonyms can lead to
> unexpected results for positional queries (either explicit phrase queries,
> or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
> least two good overviews of this topic, one by Mike McCandless focusing on
> Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
> issue is related to LUCENE-4312 [3], so both posts (ES- & Solr-related) are
> relevant.
>
> One way to work around this is to "collapse" (rather than expand) synonyms,
> at both index and query time. Another option would be to apply synonym
> expansion only at query-time. It's also worth noting that increasing phrase
> slop (`ps` param, etc.) can cause the issues with index-time synonym
> expansion to "fly under the radar" a little, wrt the most blatant "false
> negative" manifestations of index-time synonym issues for phrase queries.
>
> [1] https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
> [2] https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> [3] https://issues.apache.org/jira/browse/LUCENE-4312
>
> On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <ch...@opensourceconnections.com> wrote:
>
> > I'm wondering if you should be using these acronyms at index time, not
> > search time. It will make your index bigger and you'll have to re-index
> > to add new synonyms (as they may apply to old documents) but this could
> > be an occasional task, and in the meantime you could use query-time
> > synonyms for the new ones.
> >
> > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to
> > me.
> >
> > Cheers
> >
> > Charlie
> >
> > On 15/01/2021 09:48, Shaun Campbell wrote:
> > > I have a medical journals search application and I've a list of some 9,000
> > > acronyms like this:
> > >
> > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > > SRN=>SRN Stroke Research Network
> > > IGBP=>IGBP isolated gastric bypass
> > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive sleep apnoea–hypopnoea
> > > SRM=>SRM standardised response mean
> > > SRT=>SRT substrate reduction therapy
> > > SRS=>SRS Sexual Rating Scale
> > > SRU=>SRU stroke rehabilitation unit
> > > T2w=>T2w T2-weighted
> > > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > > MSOA=>MSOA middle-layer super output area
> > > SSA=>SSA site-specific assessment
> > > SSC=>SSC Study Steering Committee
> > > SSB=>SSB short-stretch bandage
> > > SSE=>SSE sum squared error
> > > SSD=>SSD social services department
> > > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> > >
> > > I tried to put them in a synonyms file, either just with a comma between,
> > > or with an arrow in between and the acronym repeated on the right like
> > > above, and no matter what I try I'm getting really strange search results.
> > > It's like words in one acronym are matching with the same word in another
> > > acronym and then searching with that acronym which is completely unrelated.
> > >
> > > I don't think Solr can handle this, but does anyone know of any crafty
> > > tricks in Solr to handle this situation where I can either search by the
> > > acronym or by the text?
> > >
> > > Shaun
> > >
> >
> > --
> > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > <www.o19s.com>
> > Founding member of The Search Network <https://thesearchnetwork.com/>
> > and co-author of Searching the Enterprise
> > <https://opensourceconnections.com/about-us/books-resources/>
> > tel/fax: +44 (0)8700 118334
> > mobile: +44 (0)7767 825828
> >
>
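
A minimal sketch of the normalisation suggested above, assuming the acronym
list is in the `ACRONYM=>ACRONYM expansion` form quoted in the thread and that
the analysis chain lower-cases tokens before the synonym filter runs. The
function names (`parse_acronym_line`, `convert`) are made up for illustration,
and the sketch assumes no expansion contains a comma. With `collapse=False` it
emits comma-separated equivalents; with `collapse=True` it emits explicit
mappings that collapse the long form onto the acronym.

```python
from typing import Iterable, List, Tuple


def parse_acronym_line(line: str) -> Tuple[str, str]:
    """Split 'SRN=>SRN Stroke Research Network' into a lower-cased pair."""
    acronym, _, rhs = line.partition("=>")
    acronym = acronym.strip().lower()
    expansion = rhs.strip().lower()
    # The original right-hand side repeats the acronym; drop that repeat.
    if expansion.startswith(acronym + " "):
        expansion = expansion[len(acronym):].strip()
    return acronym, expansion


def convert(lines: Iterable[str], collapse: bool = False) -> List[str]:
    """Build synonyms.txt rules from the acronym list (hypothetical helper).

    Assumption: expansions contain no commas, since a comma separates
    terms in the synonyms file.
    """
    rules = []
    for raw in lines:
        if "=>" not in raw:
            continue  # skip blank or malformed lines
        acronym, expansion = parse_acronym_line(raw)
        if not acronym or not expansion:
            continue
        if collapse:
            # Collapse: the multi-word form maps onto the single acronym token.
            rules.append(f"{expansion} => {acronym}")
        else:
            # Equivalent terms: one comma-separated line.
            rules.append(f"{acronym},{expansion}")
    return rules


if __name__ == "__main__":
    sample = [
        "SRN=>SRN Stroke Research Network",
        "T2w=>T2w T2-weighted",
    ]
    for rule in convert(sample):
        print(rule)
    for rule in convert(sample, collapse=True):
        print(rule)
```

The collapsed form is only useful if the same rules are applied at both index
and query time, as described above; the comma-separated form is the one to try
for query-time-only expansion.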
