Re: Multi term synonyms

Roman Chyla Wed, 29 Apr 2015 14:48:51 -0700

Hi Kaushik, I meant comparing the unquoted query tween 20 against the phrase
query "tween 20".

Your autophrase filter replaces whitespace with x, but your synonym filter
expects whitespace, so an entry such as "tween 20" never matches the token
tweenx20. Try making the two consistent.
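
To make the mismatch concrete, here is a minimal Python sketch (not Solr or
plugin code). It rewrites synonym entries into the single-token form that the
debug output shows for the query (tweenx20). The function names are made up
for illustration, and I am assuming that lowercasing plus replacing each run
of whitespace with "x" is a close enough model of what the analysis chain does.

```python
def autophrase_form(entry: str, replacement: str = "x") -> str:
    """Lowercase an entry and join its whitespace-separated words with `replacement`.

    Assumption: this only approximates LowerCaseFilter + replaceWhitespaceWith;
    it is not the plugin's actual implementation.
    """
    return replacement.join(entry.lower().split())


def rewrite_synonym_line(line: str, replacement: str = "x") -> str:
    """Rewrite one comma-separated synonyms.txt line into single-token entries."""
    entries = [e.strip() for e in line.split(",") if e.strip()]
    return ",".join(autophrase_form(e, replacement) for e in entries)


if __name__ == "__main__":
    # prints: tweenx20,polysorbatex20,femaxno.x2915
    print(rewrite_synonym_line("TWEEN 20,POLYSORBATE 20,FEMA NO. 2915"))
```

A synonyms file rewritten this way would contain the same strings the synonym
filter actually sees after autophrasing. Note that the sketch splits on every
comma, so an entry like POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE would be broken in
two and needs separate handling.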


Roman
On Apr 29, 2015 2:27 PM, "Kaushik" <kaushika...@gmail.com> wrote:

> Hi Roman,
>
> When I run the query with debugQuery=true,
>
> http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
>
> I see the following in the response. The autophrase plugin seems to be
> doing its part, but the synonym expansion is not happening. When you say
> "use phrase queries", what do you mean? Please clarify.
>
> "response": {
>     "numFound": 0,
>     "start": 0,
>     "docs": []
>   },
>   "debug": {
>     "rawquerystring": "tween 20",
>     "querystring": "tween 20",
>     "parsedquery": "name:tweenx20",
>     "parsedquery_toString": "name:tweenx20",
>     "explain": {},
>
> Thank you,
>
> Kaushik
>
>
> On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla <roman.ch...@gmail.com>
> wrote:
>
> > Please post the output of the request with debugQuery=true.
> >
> > Do you see the synonyms being expanded? Probably not.
> >
> > You can go to the admin interface and, in the analysis section, play with
> > the input until you see the synonyms. Use phrase queries too. That will
> > help rule out the autophrase filter.
> > On Apr 29, 2015 6:18 AM, "Kaushik" <kaushika...@gmail.com> wrote:
> >
> > > Hi Roman,
> > >
> > > Following is my use case:
> > >
> > > *Schema.xml*...
> > >
> > >    <field name="name" type="text_autophrase" indexed="true" stored="true"/>
> > >
> > > <fieldType name="text_autophrase" class="solr.TextField"
> > >            positionIncrementGap="100">
> > >       <analyzer type="index">
> > >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >         <filter class="solr.LowerCaseFilterFactory" />
> > >         <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
> > >                 phrases="autophrases.txt" includeTokens="false"
> > >                 replaceWhitespaceWith="X" />
> > >         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >                 ignoreCase="true" expand="true" />
> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                 words="stopwords.txt" enablePositionIncrements="true" />
> > >       </analyzer>
> > >       <analyzer type="query">
> > >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >         <filter class="solr.LowerCaseFilterFactory" />
> > >         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >                 ignoreCase="true" expand="true" />
> > >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >                 words="stopwords.txt" enablePositionIncrements="true" />
> > >       </analyzer>
> > >     </fieldType>
> > >
> > > *SolrConfig.xml...*
> > >
> > >   <requestHandler name="/autophrase" class="solr.SearchHandler">
> > >    <lst name="defaults">
> > >      <str name="echoParams">explicit</str>
> > >      <int name="rows">10</int>
> > >      <str name="df">name</str>
> > >      <str name="defType">autophrasingParser</str>
> > >    </lst>
> > >   </requestHandler>
> > >
> > >   <queryParser name="autophrasingParser"
> > >                class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
> > >     <str name="phrases">autophrases.txt</str>
> > >     <str name="replaceWhitespaceWith">X</str>
> > >   </queryParser>
> > >
> > >
> > > *Synonyms.txt....*
> > > PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN
> > > 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
> > > [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
> > > [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
> > > 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
> > > MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
> > > SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
> > > 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20
> > > [WHO-DD],POLYSORBATE 20 [VANDF]
> > >
> > > *Autophrase.txt...*
> > >
> > > Has all the above phrases in one column
> > >
> > > *Indexed document....*
> > >
> > > <doc>
> > >   <field name="id">31</field>
> > >   <field name="name">Polysorbate 20</field>
> > >   </doc>
> > >
> > > So when I query Solr's /autophrase handler for tween 20 or FEMA NO. 2915,
> > > I expect to see the record containing Polysorbate 20, i.e.
> > >
> > > http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
> > >
> > > should have retrieved it, but it doesn't.
> > >
> > > What could I be doing wrong?
> > >
> > > On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla <roman.ch...@gmail.com>
> > > wrote:
> > >
> > > > I'm not sure I understand - the autophrasing filter will allow the
> > > > parser to see all the tokens, so that they can be parsed (and
> > > > multi-token synonyms identified). So if you are using the same
> > > > analyzer at query and index time, they should be able to see the same
> > > > stuff.
> > > >
> > > > Are you using multi-token synonyms, or just entries that look like
> > > > multi-token synonyms? In the first case, the tokens are separated by a
> > > > null byte. In the second case, they are just strings, even with
> > > > whitespace, and your synonym file must contain exactly the same entries
> > > > as your analyzer sees them (and in the same order; or you have to use
> > > > the same analyzer to load the synonym files).
> > > >
> > > > can you post the relevant part of your schema.xml?
> > > >
> > > >
> > > > note: I can confirm that multi-token synonym expansion can be made to
> > > > work, even in complex cases - we do it - but likely, if you need
> > > > multi-token synonyms, you will also need a smarter query parser.
> > > > Sometimes your users will use query strings that contain overlapping
> > > > synonym entries. To handle that, you will have to know how to generate
> > > > all possible 'readings', for example:
> > > >
> > > > synonym:
> > > >
> > > > foo bar, foobar
> > > > hey foo, heyfoo
> > > >
> > > > user input:
> > > >
> > > > hey foo bar
> > > >
> > > > possible readings:
> > > >
> > > > ((hey foo) +bar) OR (hey +(foo bar))
> > > >
> > > > I'm simplifying it here; the fun starts when you are dealing with a
> > > > phrase query :)
> > > >
> > > > On Tue, Apr 28, 2015 at 10:31 AM, Kaushik <kaushika...@gmail.com> wrote:
> > > > > Hi there,
> > > > >
> > > > > I tried the solution provided in
> > > > >
> > > > > https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
> > > > >
> > > > > The mentioned solution works when the indexed data does not have
> > > > > alphanumerics or special characters. But in my case the synonyms are
> > > > > something like the below.
> > > > >
> > > > >
> > > > >  T-MAZ 20  POLYOXYETHYLENE (20) SORBITAN MONOLAURATE  SORBITAN
> > > > > MONODODECANOATE  POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE  POLYOXYETHYLENE
> > > > > SORBITAN MONOLAURATE  POLYSORBATE 20 [MART.]  SORBIMACROGOL LAURATE
> > > > > 300  POLYSORBATE 20 [FHFI]  FEMA NO. 2915
> > > > >
> > > > > They have alphanumerics, special characters, spaces, etc. Is there a
> > > > > way to implement synonyms even in such a case?
> > > > >
> > > > > Thanks,
> > > > > Kaushik
> > > > >
> > > > > On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] <
> > > > > daniel.da...@nih.gov> wrote:
> > > > >
> > > > > >> Handling MESH descriptor preferred terms and such is similar. I
> > > > > >> encountered this during evaluation of Solr for a project here at
> > > > > >> NLM. We decided to use Solr for different projects instead. I
> > > > > >> considered the following approaches:
> > > > > >>  - Use a custom tokenizer at index time that indexes all of the
> > > > > >>    multiple-term alternatives.
> > > > > >>  - Index the data, and then have an enrichment process that queries
> > > > > >>    on each source synonym and generates an update to add the target
> > > > > >>    synonyms. Follow this with an optimize.
> > > > > >>  - During the indexing process, but before sending the data to Solr,
> > > > > >>    process the data to tokenize and add synonyms to another field.
> > > > >>
> > > > > >> Both the custom tokenizer and the enrichment process share the
> > > > > >> feature that they use Solr's own tokenizer rather than duplicating
> > > > > >> it. The enrichment process seems to me workable only in environments
> > > > > >> where you can re-index all data periodically, i.e. where there is no
> > > > > >> continuous stream of data that needs to be indexed relatively
> > > > > >> quickly once it is generated. The last method, pre-processing the
> > > > > >> data, seems the least desirable to me from a blue-sky perspective,
> > > > > >> but is probably the easiest to implement and the most independent of
> > > > > >> Solr.
> > > > >>
> > > > >> Hope this helps,
> > > > >>
> > > > >> Dan Davis, Systems/Applications Architect (Contractor),
> > > > >> Office of Computer and Communications Systems,
> > > > >> National Library of Medicine, NIH
> > > > >>
> > > > >> -----Original Message-----
> > > > >> From: Kaushik [mailto:kaushika...@gmail.com]
> > > > >> Sent: Monday, April 20, 2015 10:47 AM
> > > > >> To: solr-user@lucene.apache.org
> > > > > >> Subject: Multi term synonyms
> > > > >>
> > > > >> Hello,
> > > > >>
> > > > > >> Reading up on synonyms, it looks like there is no real solution for
> > > > > >> multi-term synonyms. Is that right? I have a use case where I need
> > > > > >> to map one multi-term phrase to another, e.g. Tween 20 needs to be
> > > > > >> translated to Polysorbate 20.
> > > > >>
> > > > >> Any thoughts as to how this can be achieved?
> > > > >>
> > > > >> Thanks,
> > > > >> Kaushik
> > > > >>
> > > >
> > >
> >
>
