Hi Kaushik,

I meant that you should compare the unquoted query tween 20 against the phrase query "tween 20". Your autophrase filter replaces whitespace with x, but your synonym filter expects whitespace. Try that.
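
A rough sketch of that mismatch, in case it helps. This is a toy model in plain Python, not Solr's analysis chain: the function names (autophrase, build_synonyms, expand) are made up for illustration, and the phrase list is a three-entry subset of the synonyms posted in this thread. It only shows why a synonym table keyed on "tween 20" never matches the token tweenx20 that appears in the parsed query.

```python
# Toy model of the mismatch; plain string handling, not Solr's analysis chain.
# All names here are hypothetical and exist only for this illustration.
PHRASES = ["tween 20", "polysorbate 20", "fema no. 2915"]  # subset of autophrases.txt


def autophrase(text, replacement="x"):
    """Mimic the whitespace replacement: a known phrase becomes one token."""
    lowered = text.lower()
    for phrase in PHRASES:
        lowered = lowered.replace(phrase, phrase.replace(" ", replacement))
    return lowered


def build_synonyms(group, normalise=lambda s: s.lower()):
    """Map every entry of a synonym group to the whole (normalised) group."""
    keys = [normalise(entry) for entry in group]
    return {key: set(keys) for key in keys}


def expand(token, synonyms):
    """Return the synonym group for a token, or just the token if unknown."""
    return synonyms.get(token, {token})


group = ["TWEEN 20", "POLYSORBATE 20", "FEMA NO. 2915"]
query_token = autophrase("tween 20")  # the single token the parser produces

with_spaces = build_synonyms(group)                   # keys still contain whitespace
with_x = build_synonyms(group, normalise=autophrase)  # keys use the x form

print(expand(query_token, with_spaces))  # no expansion: the key is not found
print(expand(query_token, with_x))       # expands to the whole group
```

In this model the expansion only happens once the synonym keys are written the way the token looks after autophrasing. Whether the equivalent fix in Solr is to change synonyms.txt or to reorder the filters is something to check on the analysis screen.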
Roman

On Apr 29, 2015 2:27 PM, "Kaushik" <kaushika...@gmail.com> wrote:

> Hi Roman,
>
> When I ran the query with debugQuery enabled, using
>
> http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
>
> I see the following in the response. The autophrase plugin seems to be
> doing its part, just not the synonym expansion. When you say use phrase
> queries, what do you mean? Please clarify.
>
> "response": {
>   "numFound": 0,
>   "start": 0,
>   "docs": []
> },
> "debug": {
>   "rawquerystring": "tween 20",
>   "querystring": "tween 20",
>   "parsedquery": "name:tweenx20",
>   "parsedquery_toString": "name:tweenx20",
>   "explain": {},
>
> Thank you,
>
> Kaushik
>
> On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla <roman.ch...@gmail.com> wrote:
>
> > Pls post the output of the request with debugQuery=true.
> >
> > Do you see the synonyms being expanded? Probably not.
> >
> > You can go to the admin interface and, in the analysis section, play
> > with the input until you see the synonyms. Use phrase queries too. That
> > will help to rule out the autophrase filter.
> >
> > On Apr 29, 2015 6:18 AM, "Kaushik" <kaushika...@gmail.com> wrote:
> >
> > > Hi Roman,
> > >
> > > Following is my use case:
> > >
> > > *Schema.xml*...
> > >
> > > <field name="name" type="text_autophrase" indexed="true" stored="true"/>
> > >
> > > <fieldType name="text_autophrase" class="solr.TextField"
> > >            positionIncrementGap="100">
> > >   <analyzer type="index">
> > >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >     <filter class="solr.LowerCaseFilterFactory" />
> > >     <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
> > >             phrases="autophrases.txt" includeTokens="false"
> > >             replaceWhitespaceWith="X" />
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >             ignoreCase="true" expand="true" />
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="stopwords.txt" enablePositionIncrements="true" />
> > >   </analyzer>
> > >   <analyzer type="query">
> > >     <tokenizer class="solr.KeywordTokenizerFactory"/>
> > >     <filter class="solr.LowerCaseFilterFactory" />
> > >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
> > >             ignoreCase="true" expand="true" />
> > >     <filter class="solr.StopFilterFactory" ignoreCase="true"
> > >             words="stopwords.txt" enablePositionIncrements="true" />
> > >   </analyzer>
> > > </fieldType>
> > >
> > > *SolrConfig.xml...*
> > >
> > > <requestHandler name="/autophrase" class="solr.SearchHandler">
> > >   <lst name="defaults">
> > >     <str name="echoParams">explicit</str>
> > >     <int name="rows">10</int>
> > >     <str name="df">name</str>
> > >     <str name="defType">autophrasingParser</str>
> > >   </lst>
> > > </requestHandler>
> > >
> > > <queryParser name="autophrasingParser"
> > >              class="com.lucidworks.analysis.AutoPhrasingQParserPlugin" >
> > >   <str name="phrases">autophrases.txt</str>
> > >   <str name="replaceWhitespaceWith">X</str>
> > > </queryParser>
> > >
> > > *Synonyms.txt....*
> > >
> > > PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,
> > > TWEEN 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],
> > > POLYSORBATE 20 [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,
> > > PEG-20 SORBITAN [VANDF],POLYSORBATE-20,POLYSORBATE 20,
> > > SORETHYTAN MONOLAURATE,T-MAZ 20,
> > > POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN MONODODECANOATE,
> > > POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,
> > > POLYOXYETHYLENE SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],
> > > SORBIMACROGOL LAURATE 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,
> > > POLYSORBATE 20 [FCC],POLYSORBATE 20 [WHO-DD],POLYSORBATE 20 [VANDF]
> > >
> > > *Autophrase.txt...*
> > >
> > > Has all the above phrases in one column
> > >
> > > *Indexed document....*
> > >
> > > <doc>
> > >   <field name="id">31</field>
> > >   <field name="name">Polysorbate 20</field>
> > > </doc>
> > >
> > > So when I query Solr's /autophrase handler for tween 20 or FEMA NO. 2915,
> > > I expect to see the record containing Polysorbate 20, i.e.
> > >
> > > http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
> > >
> > > should have retrieved it, but it doesn't.
> > >
> > > What could I be doing wrong?
> > >
> > > On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla <roman.ch...@gmail.com> wrote:
> > >
> > > > I'm not sure I understand - the autophrasing filter will allow the
> > > > parser to see all the tokens, so that they can be parsed (and
> > > > multi-token synonyms identified). So if you are using the same
> > > > analyzer at query and index time, they should be able to see the same
> > > > stuff.
> > > >
> > > > Are you using multi-token synonyms, or just entries that look like
> > > > multi-token synonyms? (In the first case, the tokens are separated by
> > > > a null byte.) In the second case, they are just strings, even with
> > > > whitespace, and your synonym file must contain exactly the same
> > > > entries as your analyzer sees them (and in the same order; or you
> > > > have to use the same analyzer to load the synonym files).
> > > >
> > > > Can you post the relevant part of your schema.xml?
> > > >
> > > > Note: I can confirm that multi-token synonym expansion can be made to
> > > > work, even in complex cases - we do it - but likely, if you need
> > > > multi-token synonyms, you will also need a smarter query parser.
> > > > Sometimes your users will use query strings that contain overlapping
> > > > synonym entries. To handle that, you will have to know how to
> > > > generate all possible 'reads'. Example:
> > > >
> > > > synonym:
> > > >
> > > > foo bar, foobar
> > > > hey foo, heyfoo
> > > >
> > > > user input:
> > > >
> > > > hey foo bar
> > > >
> > > > possible readings:
> > > >
> > > > ((hey foo) +bar) OR (hey +(foo bar))
> > > >
> > > > I'm simplifying it here; the fun starts when you are seeing a phrase
> > > > query :)
> > > >
> > > > On Tue, Apr 28, 2015 at 10:31 AM, Kaushik <kaushika...@gmail.com> wrote:
> > > >
> > > > > Hi there,
> > > > >
> > > > > I tried the solution provided in
> > > > >
> > > > > https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
> > > > >
> > > > > The mentioned solution works when the indexed data does not have
> > > > > alphanumerics or special characters. But in my case the synonyms
> > > > > are something like the below.
> > > > >
> > > > > T-MAZ 20
> > > > > POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
> > > > > SORBITAN MONODODECANOATE
> > > > > POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
> > > > > POLYOXYETHYLENE SORBITAN MONOLAURATE
> > > > > POLYSORBATE 20 [MART.]
> > > > > SORBIMACROGOL LAURATE 300
> > > > > POLYSORBATE 20 [FHFI]
> > > > > FEMA NO. 2915
> > > > >
> > > > > They have alphanumerics, special characters, spaces, etc. Is there
> > > > > a way to implement synonyms even in such a case?
> > > > >
> > > > > Thanks,
> > > > > Kaushik
> > > > >
> > > > > On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] <
> > > > > daniel.da...@nih.gov> wrote:
> > > > >
> > > > >> Handling MESH descriptor preferred terms and such is similar. I
> > > > >> encountered this during evaluation of Solr for a project here at
> > > > >> NLM. We decided to use Solr for different projects instead. I
> > > > >> considered the following approaches:
> > > > >> - use a custom tokenizer at index time that indexes all of the
> > > > >>   multiple term alternatives.
> > > > >> - index the data, and then have an enrichment process that queries
> > > > >>   on each source synonym and generates an update to add the target
> > > > >>   synonyms. Follow this with an optimize.
> > > > >> - during the indexing process, but before sending the data to
> > > > >>   Solr, process the data to tokenize and add synonyms to another
> > > > >>   field.
> > > > >>
> > > > >> Both the custom tokenizer and the enrichment process share the
> > > > >> feature that they use Solr's own tokenizer rather than duplicate
> > > > >> it. The enrichment process seems to me only workable in
> > > > >> environments where you can re-index all data periodically, so no
> > > > >> continuous stream of data to index that needs to be handled
> > > > >> relatively quickly once it is generated. The last method, of
> > > > >> pre-processing the data, seems the least desirable to me from a
> > > > >> blue-sky perspective, but is probably the easiest to implement and
> > > > >> the most independent of Solr.
> > > > >>
> > > > >> Hope this helps,
> > > > >>
> > > > >> Dan Davis, Systems/Applications Architect (Contractor),
> > > > >> Office of Computer and Communications Systems,
> > > > >> National Library of Medicine, NIH
> > > > >>
> > > > >> -----Original Message-----
> > > > >> From: Kaushik [mailto:kaushika...@gmail.com]
> > > > >> Sent: Monday, April 20, 2015 10:47 AM
> > > > >> To: solr-user@lucene.apache.org
> > > > >> Subject: Mutli term synonyms
> > > > >>
> > > > >> Hello,
> > > > >>
> > > > >> Reading up on synonyms, it looks like there is no real solution
> > > > >> for multi-term synonyms. Is that right? I have a use case where I
> > > > >> need to map one multi-term phrase to another, i.e. Tween 20 needs
> > > > >> to be translated to Polysorbate 20.
> > > > >>
> > > > >> Any thoughts as to how this can be achieved?
> > > > >>
> > > > >> Thanks,
> > > > >> Kaushik