Re: Handling acronyms

Michael Gibney Fri, 15 Jan 2021 07:02:42 -0800

Shaun,

I'm not 100% sure, but don't give up on this just yet:


> For example if I enter diabetes it finds the acronym DM for diabetes
> mellitus

I think the behavior you're observing may simply be a side-effect of a
misconfiguration of synonyms.txt. In the example you posted, the equivalent
terms are not separated by commas (as they should be), which would lead to
treating the line `DM diabetes mellitus` as effectively "DM == diabetes ==
mellitus", which as you point out is clearly not what you want. Do you see
similar results for `DM, diabetes mellitus` (which should be parsed as
meaning "DM == 'diabetes mellitus'", which iiuc _is_ what you want)?

(see the note about ensuring proper comma-separation in my earlier response)
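
If it helps, here is a minimal sketch (Python) of rewriting lines from the
format you posted into comma-separated form. `to_equivalence_line` is a name
made up for illustration. It assumes every input line has exactly the
`ACRONYM=>ACRONYM expansion` shape from your example and that no expansion
contains a comma of its own. It also lower-cases both sides, which only makes
sense if the synonym filter runs after lower-casing in your analysis chain.

```python
def to_equivalence_line(line: str) -> str:
    """Rewrite 'ACR=>ACR long form' as 'acr, long form'.

    Hypothetical helper: assumes the 'ACRONYM=>ACRONYM expansion' shape and
    that the expansion itself contains no commas.
    """
    acronym, sep, rhs = line.partition("=>")
    if not sep:
        raise ValueError(f"no '=>' in line: {line!r}")
    acronym = acronym.strip()
    rhs = rhs.strip()
    # Drop the acronym repeated at the start of the right-hand side.
    if rhs.startswith(acronym + " "):
        rhs = rhs[len(acronym):].strip()
    # Lower-case both sides and join them with a comma, so the line declares
    # two equivalent terms: the acronym and the whole multi-word expansion.
    return f"{acronym.lower()}, {rhs.lower()}"


if __name__ == "__main__":
    for raw in ("SRN=>SRN Stroke Research Network", "DM=>DM diabetes mellitus"):
        print(to_equivalence_line(raw))
```

Running the output through the Analysis screen in the Solr admin UI should
show whether the acronym and the full phrase now end up as the only two
equivalents for each line.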

Michael


On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell <campbell.sh...@gmail.com>
wrote:

> Hi Michael
>
> Thanks for that I'll have a study later.  It's just reminded me of the
> expand option which I meant to have a look at.
>
> Thanks
> Shaun
>
> On Fri, 15 Jan 2021 at 14:33, Michael Gibney <mich...@michaelgibney.net>
> wrote:
>
> > The equivalent terms on the right-hand side of the `=>` operator in the
> > example you sent should be separated by a comma. You mention you already
> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
> > and that that yielded unexpected results as well. I would recommend
> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
> > applying the synonym filter _after_ case normalization in the analysis
> > chain (there are other ways you could do this, but the key point is that
> > you need to pay attention to case and how it interacts with the order in
> > which filters are applied).
> >
> > Re: Charlie's recommendation to apply these at index-time, a word of
> > caution (and it's possible that this is in fact the underlying cause of
> > some of the unexpected behavior you're observing?): be careful if you're
> > using term _expansion_ at index-time (i.e., mapping single terms to
> > multiple terms, which I note appears to be what you're trying to do in the
> > example lines you provided). Multi-term index-time synonyms can lead to
> > unexpected results for positional queries (either explicit phrase queries,
> > or implicit, e.g. as configured by `pf` param in edismax). I'm aware of at
> > least two good overviews of this topic, one by Mike McCandless focusing on
> > Elasticsearch [1], one by Steve Rowe focusing on Solr [2]. The underlying
> > issue is related to LUCENE-4312 [3], so both posts (ES- & Solr-related) are
> > relevant.
> >
> > One way to work around this is to "collapse" (rather than expand) synonyms,
> > at both index and query time. Another option would be to apply synonym
> > expansion only at query-time. It's also worth noting that increasing phrase
> > slop (`ps` param, etc.) can cause the issues with index-time synonym
> > expansion to "fly under the radar" a little, wrt the most blatant "false
> > negative" manifestations of index-time synonym issues for phrase queries.
> >
> > [1] https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
> > [2] https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
> >
> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull <
> > ch...@opensourceconnections.com> wrote:
> >
> > > I'm wondering if you should be using these acronyms at index time, not
> > > search time. It will make your index bigger and you'll have to re-index
> > > to add new synonyms (as they may apply to old documents) but this could
> > > be an occasional task, and in the meantime you could use query-time
> > > synonyms for the new ones.
> > >
> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to
> > > me.
> > >
> > > Cheers
> > >
> > > Charlie
> > >
> > > On 15/01/2021 09:48, Shaun Campbell wrote:
> > > > I have a medical journals search application and I've a list of some
> > > > 9,000 acronyms like this:
> > > >
> > > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > > > SRN=>SRN Stroke Research Network
> > > > IGBP=>IGBP isolated gastric bypass
> > > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive sleep apnoea–hypopnoea
> > > > SRM=>SRM standardised response mean
> > > > SRT=>SRT substrate reduction therapy
> > > > SRS=>SRS Sexual Rating Scale
> > > > SRU=>SRU stroke rehabilitation unit
> > > > T2w=>T2w T2-weighted
> > > > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > > > MSOA=>MSOA middle-layer super output area
> > > > SSA=>SSA site-specific assessment
> > > > SSC=>SSC Study Steering Committee
> > > > SSB=>SSB short-stretch bandage
> > > > SSE=>SSE sum squared error
> > > > SSD=>SSD social services department
> > > > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> > > >
> > > > I tried to put them in a synonyms file, either just with a comma between,
> > > > or with an arrow in between and the acronym repeated on the right like
> > > > above, and no matter what I try I'm getting really strange search results.
> > > > It's like words in one acronym are matching with the same word in another
> > > > acronym and then searching with that acronym which is completely unrelated.
> > > >
> > > > I don't think Solr can handle this, but does anyone know of any crafty
> > > > tricks in Solr to handle this situation where I can either search by the
> > > > acronym or by the text?
> > > >
> > > > Shaun
> > > >
> > >
> > > --
> > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > <www.o19s.com>
> > > Founding member of The Search Network <https://thesearchnetwork.com/>
> > > and co-author of Searching the Enterprise
> > > <https://opensourceconnections.com/about-us/books-resources/>
> > > tel/fax: +44 (0)8700 118334
> > > mobile: +44 (0)7767 825828
> > >
> >
>
