Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla


Re: Mutli term synonyms

2015-04-29 Thread Kaushik
> > > > 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
> > > > [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
> > > > [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
> > > > 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
> > > > MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
> > > > SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
> > > > 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20
> > > > [WHO-DD],POLYSORBATE 20 [VANDF]
> > > >
> > > > *Autophrase.txt...*
> > > >
> > > > Has all the above phrases in one column
> > > >
> > > > *Indexed document*
> > > >
> > > > 
> > > >   31
> > > >   Polysorbate 20
> > > >   
> > > >
> > > > So when I query Solr /autophrase for tween 20 or FEMA NO. 2915, I
> > > > expect to see the record containing Polysorbate 20, i.e.
> > > >
> > > > http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
> > > > should have retrieved it, but it doesn't.
> > > >
> > > > What could I be doing wrong?
> > > >

Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla


Re: Mutli term synonyms

2015-04-29 Thread Kaushik


Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla


Re: Mutli term synonyms

2015-04-29 Thread Kaushik


Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
I'm not sure I understand. The autophrasing filter lets the parser see
all the tokens, so that they can be parsed and multi-token synonyms
identified. So if you are using the same analyzer at query and index
time, both should see the same tokens.

Are you using multi-token synonyms, or just entries that look like
multi-token synonyms? In the first case, the tokens are separated by a
null byte. In the second case, the entries are just strings, even if
they contain whitespace, and your synonym file must contain exactly the
same entries as your analyzer sees them (and in the same order), or you
have to use the same analyzer to load the synonym files.
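
To illustrate the point about the synonym file matching what the analyzer sees, here is a minimal Python sketch. The `analyze` function is a hypothetical stand-in for the field's analyzer (lowercase, split on anything that is not a letter or digit); it is not Solr's actual tokenizer, and the real behaviour depends on the analysis chain in schema.xml. The sample entries are taken from the list posted in this thread.

```python
import re


def analyze(text):
    """Hypothetical stand-in for the field analyzer (an assumption, not Solr's):
    lowercase the text and split on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def normalize_entries(line):
    """Run every comma-separated synonym entry through analyze() so the
    entry has the same shape as the tokens produced for indexed text."""
    return [" ".join(analyze(entry)) for entry in line.split(",")]


raw = "TWEEN-20,POLYSORBATE 20 [MART.],FEMA NO. 2915"
print(normalize_entries(raw))
```

Note that this naive comma split would cut an entry such as POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE in two, so entries that contain commas need separate handling.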

Can you post the relevant part of your schema.xml?


Note: I can confirm that multi-token synonym expansion can be made to
work, even in complex cases (we do it), but if you need multi-token
synonyms, you will likely also need a smarter query parser. Sometimes
your users will enter query strings that contain overlapping synonym
entries. To handle that, you have to generate all possible 'reads'.
For example:

synonym:

foo bar, foobar
hey foo, heyfoo

user input:

hey foo bar

possible readings:

((hey foo) +bar) OR (hey +(foo bar))

I'm simplifying here; the fun starts when you see a phrase query :)
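
The idea of generating all possible readings can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the input is already tokenized, the synonym entries are given as tuples of tokens, and the function name `readings` is made up for this example. It enumerates segmentations only; turning them into an actual query (as in the OR expression above) is left out.

```python
def readings(tokens, entries):
    """Return every way to split tokens into single tokens and known
    multi-token entries. Each reading is a list of segment strings."""
    if not tokens:
        return [[]]
    results = []
    # Option 1: keep the first token as a segment of its own.
    for rest in readings(tokens[1:], entries):
        results.append([tokens[0]] + rest)
    # Option 2: consume a known multi-token entry that starts here.
    for entry in entries:
        n = len(entry)
        if n > 1 and tuple(tokens[:n]) == entry:
            for rest in readings(tokens[n:], entries):
                results.append([" ".join(entry)] + rest)
    return results


entries = [("foo", "bar"), ("hey", "foo")]
for reading in readings(["hey", "foo", "bar"], entries):
    print(reading)
```

For the input "hey foo bar" this yields three readings: the plain token sequence, one that groups "foo bar", and one that groups "hey foo". The last two correspond to the two branches of the OR expression above.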



Re: Mutli term synonyms

2015-04-28 Thread Kaushik
Hi there,

I tried the solution provided in
https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
The solution works when the indexed data does not contain alphanumerics
or special characters. But in my case the synonyms look like the
following:

 T-MAZ 20
 POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
 SORBITAN MONODODECANOATE
 POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
 POLYOXYETHYLENE SORBITAN MONOLAURATE
 POLYSORBATE 20 [MART.]
 SORBIMACROGOL LAURATE 300
 POLYSORBATE 20 [FHFI]
 FEMA NO. 2915

They contain alphanumerics, special characters, spaces, etc. Is there a way
to implement synonyms even in such a case?

Thanks,
Kaushik



RE: Mutli term synonyms

2015-04-20 Thread Davis, Daniel (NIH/NLM) [C]
Handling MESH descriptor preferred terms and such is similar. I encountered
this during an evaluation of Solr for a project here at NLM. We decided to use
Solr for different projects instead. I considered the following approaches:
 - Use a custom tokenizer at index time that indexes all of the multiple-term
   alternatives.
 - Index the data, and then have an enrichment process that queries on each
   source synonym and generates an update to add the target synonyms.
   Follow this with an optimize.
 - During the indexing process, but before sending the data to Solr, process
   the data to tokenize it and add synonyms to another field.

Both the custom tokenizer and the enrichment process share the feature that
they use Solr's own tokenizer rather than duplicating it. The enrichment
process seems to me workable only in environments where you can re-index all
data periodically, that is, where there is no continuous stream of data that
needs to be indexed relatively quickly once it is generated. The last method,
pre-processing the data, seems the least desirable to me from a blue-sky
perspective, but it is probably the easiest to implement and the most
independent of Solr.
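
As a rough illustration of the last approach (pre-processing before the data reaches Solr), here is a minimal Python sketch. Everything in it is an assumption made for the example: the field names `name` and `name_synonyms`, the synonym group, and the simple case-insensitive phrase match. It does not use any Solr API; the enriched dictionary would still have to be sent to Solr by whatever indexing client is in use.

```python
import re

# Hypothetical synonym groups; a real list would be loaded from a file.
SYNONYM_GROUPS = [
    ["tween 20", "polysorbate 20", "t-maz 20"],
]


def occurs(phrase, lowered_text):
    """True if phrase appears in the text and is not glued to a letter or
    digit on either side (so 'polysorbate 20' does not match '... 200')."""
    pattern = r"(?<![0-9a-z])" + re.escape(phrase) + r"(?![0-9a-z])"
    return re.search(pattern, lowered_text) is not None


def find_synonyms(text, groups):
    """For each group with a member present in text, return the members
    that are not already present."""
    lowered = text.lower()
    found = []
    for group in groups:
        if any(occurs(p, lowered) for p in group):
            found.extend(p for p in group if not occurs(p, lowered))
    return found


def enrich(doc, source_field="name", target_field="name_synonyms"):
    """Return a copy of doc with the synonyms stored in a separate field."""
    enriched = dict(doc)
    enriched[target_field] = find_synonyms(doc[source_field], SYNONYM_GROUPS)
    return enriched


print(enrich({"id": "31", "name": "Polysorbate 20"}))
```

Because the matching happens outside Solr, it has to approximate whatever analysis the target field applies, which is the duplication the other two approaches avoid.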

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH



Mutli term synonyms

2015-04-20 Thread Kaushik
Hello,

Reading up on synonyms, it looks like there is no real solution for
multi-term synonyms. Is that right? I have a use case where I need to map
one multi-term phrase to another, e.g. Tween 20 needs to be translated to
Polysorbate 40.

Any thoughts as to how this can be achieved?

Thanks,
Kaushik