Re: Multi term synonyms

Kaushik Wed, 29 Apr 2015 14:28:34 -0700

Hi Roman,

When I run the query with debugQuery=true, i.e.
http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
I see the following in the response. The autophrase plugin seems to be
doing its part; only the synonym expansion is missing. When you say "use
phrase queries", what do you mean? Please clarify.


  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  },
  "debug": {
    "rawquerystring": "tween 20",
    "querystring": "tween 20",
    "parsedquery": "name:tweenx20",
    "parsedquery_toString": "name:tweenx20",
    "explain": {},

Thank you,

Kaushik
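
One possible reading of the debug output above, sketched in plain Python rather than Solr code: the query parser has already rewritten "tween 20" into the single token tweenx20, while the entries in synonyms.txt still contain spaces, so an exact lookup finds nothing to expand. The helper names (autophrase, synonym_lookup) and the lowercase "x" replacement are assumptions made purely for illustration; they are not part of Solr or the auto-phrasing plugin.

```python
# Illustrative model only: NOT Solr code, just string handling that mimics
# the behaviour described in this thread.

def autophrase(text, phrases, replacement="x"):
    """Lowercase the text and join each known phrase into a single token."""
    out = text.lower()
    # Longest phrases first so shorter phrases do not pre-empt longer ones.
    for phrase in sorted(phrases, key=len, reverse=True):
        lowered = phrase.lower()
        out = out.replace(lowered, lowered.replace(" ", replacement))
    return out


def synonym_lookup(token, groups):
    """Return the synonym group containing the token, or just the token."""
    for group in groups:
        if token in group:
            return group
    return [token]


# A shortened, lowercased version of the synonym line from the thread.
raw_group = ["tween 20", "polysorbate 20", "fema no. 2915"]

# What the query side produces (matches parsedquery name:tweenx20).
query_token = autophrase("tween 20", raw_group)

# Entries written with spaces, as in synonyms.txt: no group is found.
print(synonym_lookup(query_token, [raw_group]))

# Entries rewritten the same way the query was: the group is found.
joined_group = [autophrase(p, raw_group) for p in raw_group]
print(synonym_lookup(query_token, [joined_group]))
```

If this reading is right, the synonym entries would need to use the same joined form that the analyzer produces, which is in line with Roman's remark below that the synonym file must contain entries exactly as the analyzer sees them.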


On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla <[email protected]> wrote:

> Pls post output of the request with debugQuery=true
>
> Do you see the synonyms being expanded? Probably not.
>
> You can go to the admin interface and, in the analysis section, play with the
> input until you see the synonyms. Use phrase queries too. That will
> help to eliminate the autophrase filter as the cause.
> On Apr 29, 2015 6:18 AM, "Kaushik" <[email protected]> wrote:
>
> > Hi Roman,
> >
> > Following is my use case:
> >
> > *Schema.xml*...
> >
> >    <field name="name" type="text_autophrase" indexed="true" stored="true"/>
> >
> > <fieldType name="text_autophrase" class="solr.TextField"
> >            positionIncrementGap="100">
> >       <analyzer type="index">
> >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >         <filter class="solr.LowerCaseFilterFactory" />
> >         <filter
> > class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
> >                 phrases="autophrases.txt" includeTokens="false"
> >                 replaceWhitespaceWith="X" />
> >         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >                 ignoreCase="true" expand="true" />
> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                 words="stopwords.txt" enablePositionIncrements="true" />
> >       </analyzer>
> >       <analyzer type="query">
> >         <tokenizer class="solr.KeywordTokenizerFactory"/>
> >         <filter class="solr.LowerCaseFilterFactory" />
> >         <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> >                 ignoreCase="true" expand="true" />
> >         <filter class="solr.StopFilterFactory" ignoreCase="true"
> >                 words="stopwords.txt" enablePositionIncrements="true" />
> >       </analyzer>
> >     </fieldType>
> >
> > *SolrConfig.xml...*
> >
> >   <requestHandler name="/autophrase" class="solr.SearchHandler">
> >    <lst name="defaults">
> >      <str name="echoParams">explicit</str>
> >      <int name="rows">10</int>
> >      <str name="df">name</str>
> >      <str name="defType">autophrasingParser</str>
> >    </lst>
> >   </requestHandler>
> >
> >   <queryParser name="autophrasingParser"
> >                class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
> >
> >     <str name="phrases">autophrases.txt</str>
> >     <str name="replaceWhitespaceWith">X</str>
> >   </queryParser>
> >
> >
> > *Synonyms.txt....*
> > PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN
> > 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
> > [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
> > [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
> > 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
> > MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
> > SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
> > 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20
> > [WHO-DD],POLYSORBATE 20 [VANDF]
> >
> > *Autophrase.txt...*
> >
> > Has all the above phrases in one column
> >
> > *Indexed document....*
> >
> > <doc>
> >   <field name="id">31</field>
> >   <field name="name">Polysorbate 20</field>
> >   </doc>
> >
> > So when I query Solr's /autophrase handler for tween 20 or FEMA NO. 2915, I
> > expect to see the record containing Polysorbate 20, i.e.
> >
> > http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
> >
> > should have retrieved it, but it doesn't.
> >
> > What could I be doing wrong?
> >
> > On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla <[email protected]>
> > wrote:
> >
> > > I'm not sure I understand - the autophrasing filter will allow the
> > > parser to see all the tokens, so that they can be parsed (and
> > > multi-token synonyms identified). So if you are using the same
> > > analyzer at query and index time, they should be able to see the same
> > > stuff.
> > >
> > > Are you using multi-token synonyms, or just entries that look like
> > > multi-token synonyms? (In the first case, the tokens are separated by
> > > a null byte.) In the second case, they are just strings, even with
> > > whitespace; your synonym file must contain exactly the same entries
> > > as your analyzer sees them (and in the same order; or you have to use
> > > the same analyzer to load the synonym files).
> > >
> > > can you post the relevant part of your schema.xml?
> > >
> > >
> > > Note: I can confirm that multi-token synonym expansion can be made to
> > > work, even in complex cases - we do it - but likely, if you need
> > > multi-token synonyms, you will also need a smarter query parser.
> > > Sometimes your users will use query strings that contain overlapping
> > > synonym entries. To handle that, you will have to know how to generate
> > > all possible 'readings'. For example:
> > >
> > > synonym:
> > >
> > > foo bar, foobar
> > > hey foo, heyfoo
> > >
> > > user input:
> > >
> > > hey foo bar
> > >
> > > possible readings:
> > >
> > > ((hey foo) +bar) OR (hey +(foo bar))
> > >
> > > I'm simplifying it here; the fun starts when you are seeing a phrase
> > > query :)
> > >
> > > On Tue, Apr 28, 2015 at 10:31 AM, Kaushik <[email protected]> wrote:
> > > > Hi there,
> > > >
> > > > I tried the solution provided in
> > > > https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
> > > > The solution works when the indexed data does not contain
> > > > alphanumerics or special characters. But in my case the synonyms look
> > > > something like the below.
> > > >
> > > >
> > > >  T-MAZ 20  POLYOXYETHYLENE (20) SORBITAN MONOLAURATE  SORBITAN
> > > > MONODODECANOATE  POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE  POLYOXYETHYLENE
> > > > SORBITAN MONOLAURATE  POLYSORBATE 20 [MART.]  SORBIMACROGOL LAURATE
> > > > 300  POLYSORBATE 20 [FHFI]  FEMA NO. 2915
> > > >
> > > > They contain alphanumerics, special characters, spaces, etc. Is there a
> > > > way to implement synonyms even in such a case?
> > > >
> > > > Thanks,
> > > > Kaushik
> > > >
> > > > On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] <
> > > > [email protected]> wrote:
> > > >
> > > >> Handling MESH descriptor preferred terms and such is similar. I
> > > >> encountered this during evaluation of Solr for a project here at NLM.
> > > >> We decided to use Solr for different projects instead. I considered
> > > >> the following approaches:
> > > >>  - use a custom tokenizer at index time that indexed all of the
> > > >>    multiple term alternatives.
> > > >>  - index the data, and then have an enrichment process that queries on
> > > >>    each source synonym, and generates an update to add the target
> > > >>    synonyms. Follow this with an optimize.
> > > >>  - During the indexing process, but before sending the data to Solr,
> > > >>    process the data to tokenize and add synonyms to another field.
> > > >>
> > > >> Both the custom tokenizer and enrichment process share the feature that
> > > >> they use Solr's own tokenizer rather than duplicate it. The enrichment
> > > >> process seems to me only workable in environments where you can re-index
> > > >> all data periodically, so no continuous stream of data to index that
> > > >> needs to be handled relatively quickly once it is generated. The last
> > > >> method of pre-processing the data seems the least desirable to me from a
> > > >> blue-sky perspective, but is probably the easiest to implement and the
> > > >> most independent of Solr.
> > > >>
> > > >> Hope this helps,
> > > >>
> > > >> Dan Davis, Systems/Applications Architect (Contractor),
> > > >> Office of Computer and Communications Systems,
> > > >> National Library of Medicine, NIH
> > > >>
> > > >> -----Original Message-----
> > > >> From: Kaushik [mailto:[email protected]]
> > > >> Sent: Monday, April 20, 2015 10:47 AM
> > > >> To: [email protected]
> > > >> Subject: Multi term synonyms
> > > >>
> > > >> Hello,
> > > >>
> > > >> Reading up on synonyms, it looks like there is no real solution for
> > > >> multi-term synonyms. Is that right? I have a use case where I need to
> > > >> map one multi-term phrase to another, e.g. Tween 20 needs to be
> > > >> translated to Polysorbate 20.
> > > >>
> > > >> Any thoughts as to how this can be achieved?
> > > >>
> > > >> Thanks,
> > > >> Kaushik
> > > >>
> > >
> >
>
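
Roman's "possible readings" example from earlier in the thread can be sketched as a small recursive enumeration. This is plain Python, not a Solr query parser; the function name readings and the representation of synonym phrases as token tuples are assumptions made for illustration only.

```python
def readings(tokens, phrases):
    """Yield every way to split tokens into single tokens and known phrases.

    tokens  -- list of query tokens, e.g. ["hey", "foo", "bar"]
    phrases -- set of token tuples that are known multi-token synonym entries
    """
    if not tokens:
        yield []
        return
    # Longest known phrase bounds how many tokens we try to group at once.
    max_len = max((len(p) for p in phrases), default=1)
    for n in range(1, min(max_len, len(tokens)) + 1):
        head = tuple(tokens[:n])
        # A single token is always allowed; longer groups must be known phrases.
        if n == 1 or head in phrases:
            for rest in readings(tokens[n:], phrases):
                yield [" ".join(head)] + rest


# Multi-token sides of the two synonym entries from the example.
phrases = {("foo", "bar"), ("hey", "foo")}

for reading in readings("hey foo bar".split(), phrases):
    print(reading)
```

For the input hey foo bar this yields the two readings Roman lists plus the plain token-by-token reading; a real parser would still have to combine them into a boolean query and deal with phrase queries.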
