Re: Multi term synonyms
20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20 [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20 [WHO-DD],POLYSORBATE 20 [VANDF]

*Autophrase.txt...*

Has all the above phrases in one column

*Indexed document*

31
Polysorbate 20

So when I query Solr at /autophrase for tween 20 or FEMA NO. 2915, I expect to see the record containing Polysorbate 20, i.e.

http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true

should have retrieved it, but it doesn't.

What could I be doing wrong?

On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla wrote:
> can you post the relevant part of your schema.xml?
Re: Multi term synonyms
I'm not sure I understand. The autophrasing filter will allow the parser to see all the tokens, so that they can be parsed (and multi-token synonyms identified). So if you are using the same analyzer at query and index time, they should be able to see the same stuff.

Are you using multi-token synonyms, or just entries that look like multi-token synonyms? In the first case, the tokens are separated by a null byte. In the second case, they are just strings, even with whitespace, and your synonym file must contain exactly the same entries as your analyzer sees them (and in the same order; or you have to use the same analyzer to load the synonym files).

Can you post the relevant part of your schema.xml?

Note: I can confirm that multi-token synonym expansion can be made to work, even in complex cases (we do it), but likely, if you need multi-token synonyms, you will also need a smarter query parser. Sometimes your users will use query strings that contain overlapping synonym entries. To handle that, you will have to know how to generate all possible 'reads'. Example:

synonym:

foo bar, foobar
hey foo, heyfoo

user input:

hey foo bar

possible readings:

((hey foo) +bar) OR (hey +(foo bar))

I'm simplifying it here; the fun starts when you are seeing a phrase query :)

On Tue, Apr 28, 2015 at 10:31 AM, Kaushik wrote:
> Is there a way to implement synonyms even in such a case?
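The overlapping-entries example above can be sketched in a few lines of Python. This is only an illustration of enumerating the possible 'reads' of a token sequence; it is not how Solr or any particular query parser does it. The function name `readings` and the representation of phrases as token tuples are assumptions made for this sketch.

```python
def readings(tokens, phrases):
    """Return every way to split `tokens` into single tokens and known phrases.

    `phrases` is a set of token tuples (hypothetical representation),
    e.g. {("foo", "bar"), ("hey", "foo")}.
    Each reading is a list of tuples; a 1-tuple is a plain token.
    """
    if not tokens:
        return [[]]
    results = []
    # Option 1: take the first token on its own.
    for rest in readings(tokens[1:], phrases):
        results.append([(tokens[0],)] + rest)
    # Option 2: take any known multi-token phrase that starts here.
    for phrase in phrases:
        n = len(phrase)
        if n > 1 and tuple(tokens[:n]) == phrase:
            for rest in readings(tokens[n:], phrases):
                results.append([phrase] + rest)
    return results


phrases = {("foo", "bar"), ("hey", "foo")}
all_readings = readings(["hey", "foo", "bar"], phrases)

# Keep only the readings that use at least one multi-token phrase.
phrase_readings = [r for r in all_readings if any(len(p) > 1 for p in r)]
```

For the input "hey foo bar", `phrase_readings` holds the two groupings from the example, (hey foo) bar and hey (foo bar); a query parser would then expand each grouped phrase with its synonym and OR the alternatives together.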
Re: Multi term synonyms
Hi there,

I tried the solution provided in
https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
The solution works when the indexed data does not contain alphanumerics or special characters. But in my case the synonyms look like the ones below.

T-MAZ 20
POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
SORBITAN MONODODECANOATE
POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
POLYOXYETHYLENE SORBITAN MONOLAURATE
POLYSORBATE 20 [MART.]
SORBIMACROGOL LAURATE 300
POLYSORBATE 20 [FHFI]
FEMA NO. 2915

They contain alphanumerics, special characters, spaces, etc. Is there a way to implement synonyms even in such a case?

Thanks,
Kaushik

On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote:
> Handling MeSH descriptor preferred terms and such is similar.
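One way to see why entries with punctuation are fragile is to run both the phrase list and the query through the same analysis step and compare the resulting tokens. The sketch below uses a toy analyzer (lowercase, split on anything that is not a letter or digit); that analyzer is an assumption for illustration and is not what any particular Solr tokenizer does. The names `analyze` and `lookup` are hypothetical.

```python
import re


def analyze(text):
    """Toy analyzer (an assumption, not Solr's): lowercase, split on non-alphanumerics."""
    return tuple(t for t in re.split(r"[^a-z0-9]+", text.lower()) if t)


entries = [
    "T-MAZ 20",
    "POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE",
    "POLYSORBATE 20 [MART.]",
    "FEMA NO. 2915",
]

# Index each entry by the token sequence the analyzer actually produces.
by_tokens = {analyze(entry): entry for entry in entries}


def lookup(query):
    """Return the entry whose analyzed form equals the analyzed query, or None."""
    return by_tokens.get(analyze(query))
```

With this setup `lookup("fema no 2915")` finds "FEMA NO. 2915" because both sides reduce to the same tokens, while a phrase list loaded without that analysis step would still contain the dot and would not match.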
RE: Multi term synonyms
Handling MeSH descriptor preferred terms and such is similar. I encountered this during evaluation of Solr for a project here at NLM. We decided to use Solr for different projects instead. I considered the following approaches:

- Use a custom tokenizer at index time that indexes all of the multiple-term alternatives.
- Index the data, and then have an enrichment process that queries on each source synonym and generates an update to add the target synonyms. Follow this with an optimize.
- During the indexing process, but before sending the data to Solr, process the data to tokenize it and add synonyms to another field.

Both the custom tokenizer and the enrichment process share the feature that they use Solr's own tokenizer rather than duplicating it. The enrichment process seems to me workable only in environments where you can re-index all data periodically, that is, where there is no continuous stream of data that needs to be indexed relatively quickly once it is generated. The last method, pre-processing the data, seems the least desirable to me from a blue-sky perspective, but is probably the easiest to implement and the most independent of Solr.

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH

-----Original Message-----
From: Kaushik [mailto:kaushika...@gmail.com]
Sent: Monday, April 20, 2015 10:47 AM
To: solr-user@lucene.apache.org
Subject: Mutli term synonyms

Hello,

Reading up on synonyms, it looks like there is no real solution for multi-term synonyms. Is that right? I have a use case where I need to map one multi-term phrase to another, e.g. Tween 20 needs to be translated to Polysorbate 40.

Any thoughts as to how this can be achieved?

Thanks,
Kaushik
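The last approach, adding synonyms to a separate field before the document is sent to Solr, might look roughly like the following. This is a minimal sketch under stated assumptions: the field names `name` and `name_synonyms`, the `enrich` function, and the small synonym group are all hypothetical, and the phrase match is a naive substring test rather than real tokenization.

```python
# Hypothetical synonym groups; every phrase in a group is treated as equivalent.
SYNONYM_GROUPS = [
    {"tween 20", "polysorbate 20", "peg-20 sorbitan"},
]


def enrich(doc, text_field="name", synonym_field="name_synonyms"):
    """Return a copy of `doc` with an extra field holding synonyms of phrases found in the text.

    Field names are assumptions for this sketch, not part of any Solr schema.
    """
    # Lowercase and collapse whitespace so phrases compare consistently.
    text = " ".join(doc.get(text_field, "").lower().split())
    found = set()
    for group in SYNONYM_GROUPS:
        # Naive substring match; a real pipeline would compare tokens.
        present = {phrase for phrase in group if phrase in text}
        if present:
            found |= group - present
    enriched = dict(doc)
    enriched[synonym_field] = sorted(found)
    return enriched


doc = enrich({"id": "31", "name": "Polysorbate 20"})
```

The enriched document can then be indexed as usual, with the extra field searched alongside the original one. The trade-off is the one described above: the synonym logic lives outside Solr, so it is independent of the analyzer but has to do its own phrase matching.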
Multi term synonyms
Hello,

Reading up on synonyms, it looks like there is no real solution for multi-term synonyms. Is that right? I have a use case where I need to map one multi-term phrase to another, e.g. Tween 20 needs to be translated to Polysorbate 40.

Any thoughts as to how this can be achieved?

Thanks,
Kaushik