Shaun, I'm not 100% sure, but don't give up on this just yet:
> For example if I enter diabetes it finds the acronym DM for diabetes mellitus

I think the behavior you're observing may simply be a side-effect of a
misconfiguration of synonyms.txt. In the example you posted, the
equivalent terms are not separated by commas (as they should be), which
would lead to treating the line `DM diabetes mellitus` as effectively
"DM == diabetes == mellitus", which as you point out is clearly not what
you want. Do you see similar results for `DM, diabetes mellitus` (which
should be parsed as meaning "DM == 'diabetes mellitus'", which iiuc _is_
what you want)? (See the note about ensuring proper comma-separation in
my earlier response.)

Michael

On Fri, Jan 15, 2021 at 9:52 AM Shaun Campbell <campbell.sh...@gmail.com> wrote:

> Hi Michael
>
> Thanks for that I'll have a study later. It's just reminded me of the
> expand option which I meant to have a look at.
>
> Thanks
> Shaun
>
> On Fri, 15 Jan 2021 at 14:33, Michael Gibney <mich...@michaelgibney.net>
> wrote:
>
> > The equivalent terms on the right-hand side of the `=>` operator in the
> > example you sent should be separated by a comma. You mention you already
> > tried only-comma-separated (e.g. one line: `SRN,Stroke Research Network`)
> > and that that yielded unexpected results as well. I would recommend
> > pre-case-normalizing all the terms in synonyms.txt (i.e., lower-case), and
> > applying the synonym filter _after_ case normalization in the analysis
> > chain (there are other ways you could do this, but the key point is that
> > you need to pay attention to case and how it interacts with the order in
> > which filters are applied).
> >
> > Re: Charlie's recommendation to apply these at index-time, a word of
> > caution (and it's possible that this is in fact the underlying cause of
> > some of the unexpected behavior you're observing?): be careful if you're
> > using term _expansion_ at index-time (i.e., mapping single terms to
> > multiple terms, which I note appears to be what you're trying to do in
> > the example lines you provided). Multi-term index-time synonyms can lead
> > to unexpected results for positional queries (either explicit phrase
> > queries, or implicit, e.g. as configured by the `pf` param in edismax).
> > I'm aware of at least two good overviews of this topic, one by Mike
> > McCandless focusing on Elasticsearch [1], one by Steve Rowe focusing on
> > Solr [2]. The underlying issue is related to LUCENE-4312 [3], so both
> > posts (ES- & Solr-related) are relevant.
> >
> > One way to work around this is to "collapse" (rather than expand)
> > synonyms, at both index and query time. Another option would be to apply
> > synonym expansion only at query-time. It's also worth noting that
> > increasing phrase slop (`ps` param, etc.) can cause the issues with
> > index-time synonym expansion to "fly under the radar" a little, wrt the
> > most blatant "false negative" manifestations of index-time synonym issues
> > for phrase queries.
> >
> > [1] https://www.elastic.co/blog/multitoken-synonyms-and-graph-queries-in-elasticsearch
> > [2] https://lucidworks.com/post/multi-word-synonyms-solr-adds-query-time-support/
> > [3] https://issues.apache.org/jira/browse/LUCENE-4312
> >
> > On Fri, Jan 15, 2021 at 6:18 AM Charlie Hull
> > <ch...@opensourceconnections.com> wrote:
> >
> > > I'm wondering if you should be using these acronyms at index time, not
> > > search time.
> > > It will make your index bigger and you'll have to re-index
> > > to add new synonyms (as they may apply to old documents) but this could
> > > be an occasional task, and in the meantime you could use query-time
> > > synonyms for the new ones.
> > >
> > > Maintaining 9000 synonyms in Solr's synonyms.txt file seems unwieldy to
> > > me.
> > >
> > > Cheers
> > >
> > > Charlie
> > >
> > > On 15/01/2021 09:48, Shaun Campbell wrote:
> > > > I have a medical journals search application and I've a list of some
> > > > 9,000 acronyms like this:
> > > >
> > > > MSNQ=>MSNQ Multiple Sclerosis Neuropsychological Screening Questionnaire
> > > > SRN=>SRN Stroke Research Network
> > > > IGBP=>IGBP isolated gastric bypass
> > > > TOMADO=>TOMADO Trial of Oral Mandibular Advancement Devices for Obstructive sleep apnoea–hypopnoea
> > > > SRM=>SRM standardised response mean
> > > > SRT=>SRT substrate reduction therapy
> > > > SRS=>SRS Sexual Rating Scale
> > > > SRU=>SRU stroke rehabilitation unit
> > > > T2w=>T2w T2-weighted
> > > > Ab-P=>Ab-P Aberdeen participation restriction subscale
> > > > MSOA=>MSOA middle-layer super output area
> > > > SSA=>SSA site-specific assessment
> > > > SSC=>SSC Study Steering Committee
> > > > SSB=>SSB short-stretch bandage
> > > > SSE=>SSE sum squared error
> > > > SSD=>SSD social services department
> > > > NVPI=>NVPI Nausea and Vomiting of Pregnancy Instrument
> > > >
> > > > I tried to put them in a synonyms file, either just with a comma
> > > > between, or with an arrow in between and the acronym repeated on the
> > > > right like above, and no matter what I try I'm getting really strange
> > > > search results. It's like words in one acronym are matching with the
> > > > same word in another acronym and then searching with that acronym
> > > > which is completely unrelated.
> > > >
> > > > I don't think Solr can handle this, but does anyone know of any
> > > > crafty tricks in Solr to handle this situation where I can either
> > > > search by the acronym or by the text?
> > > >
> > > > Shaun
> > > >
> > >
> > > --
> > > Charlie Hull - Managing Consultant at OpenSource Connections Limited
> > > <www.o19s.com>
> > > Founding member of The Search Network <https://thesearchnetwork.com/>
> > > and co-author of Searching the Enterprise
> > > <https://opensourceconnections.com/about-us/books-resources/>
> > > tel/fax: +44 (0)8700 118334
> > > mobile: +44 (0)7767 825828
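
A footnote on the comma-separation point raised in this thread: the rough Python sketch below shows how I understand the synonyms.txt line format, assuming that commas separate equivalent terms, that `=>` maps the left-hand terms to the right-hand terms, and that `expand` controls whether plain comma lists map to every term or only to the first. It is not Solr's parser; `parse_synonym_line` is a hypothetical helper written for illustration, it ignores backslash escaping, and the lower-casing stands in for the pre-case-normalization recommended earlier.

```python
def parse_synonym_line(line, expand=True):
    """Illustrative only: map each input phrase to its output phrases.

    Hypothetical helper, not part of Solr. Assumes the synonyms.txt
    conventions discussed in the thread (commas, `=>`, `expand`).
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return {}  # blank lines and comments carry no rules

    def split_terms(part):
        # Commas separate equivalent terms. Whitespace inside a term is
        # kept, so "Stroke Research Network" stays one multi-word term.
        return [" ".join(t.lower().split()) for t in part.split(",") if t.strip()]

    if "=>" in line:
        # Explicit mapping: every left-hand term maps to all right-hand terms.
        left, right = line.split("=>", 1)
        outputs = split_terms(right)
        return {src: outputs for src in split_terms(left)}

    terms = split_terms(line)
    if expand:
        # Equivalent terms: each one maps to the full list.
        return {t: terms for t in terms}
    # expand=False: every term is reduced to the first one in the list.
    return {t: terms[:1] for t in terms}


# Without a comma, the right-hand side is one long multi-word term.
no_comma = parse_synonym_line("SRN=>SRN Stroke Research Network")
# With a comma, the acronym and its expansion are two separate terms.
with_comma = parse_synonym_line("SRN=>SRN, Stroke Research Network")
```

Under these assumptions `no_comma` holds a single four-word output for `srn`, while `with_comma` holds two outputs, `srn` and `stroke research network`, which is the shape the original acronym list appears to be aiming for.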