Re: Multi term synonyms
On Apr 29, 2015 6:18 AM, Kaushik <kaushika...@gmail.com> wrote:

Hi Roman,

Following is my use case:

*Schema.xml*...

    <field name="name" type="text_autophrase" indexed="true" stored="true"/>

    <fieldType name="text_autophrase" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory" phrases="autophrases.txt" includeTokens="false" replaceWhitespaceWith="X"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
      </analyzer>
    </fieldType>

*SolrConfig.xml...*

    <requestHandler name="/autophrase" class="solr.SearchHandler">
      <lst name="defaults">
        <str name="echoParams">explicit</str>
        <int name="rows">10</int>
        <str name="df">name</str>
        <str name="defType">autophrasingParser</str>
      </lst>
    </requestHandler>

    <queryParser name="autophrasingParser" class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
      <str name="phrases">autophrases.txt</str>
      <str name="replaceWhitespaceWith">X</str>
    </queryParser>

*Synonyms.txt*

    PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN 20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20 [II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN [VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ 20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE 300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20 [WHO-DD],POLYSORBATE 20 [VANDF]

*Autophrase.txt...*

Has all the above phrases in one column.

*Indexed document*

    <doc>
      <field name="id">31</field>
      <field name="name">Polysorbate 20</field>
    </doc>

So when I query Solr's /autophrase handler for tween 20 or FEMA NO. 2915, I expect to see the record containing Polysorbate 20, i.e.

    http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true

should have retrieved it, but it doesn't. What could I be doing wrong?
Re: Multi term synonyms
On Wed, Apr 29, 2015 at 2:10 AM, Roman Chyla <roman.ch...@gmail.com> wrote:

I'm not sure I understand. The autophrasing filter will allow the parser to see all the tokens, so that they can be parsed and multi-token synonyms identified. So if you are using the same analyzer at query and index time, both should see the same stuff.

Are you using multi-token synonyms, or just entries that look like multi-token synonyms? In the first case, the tokens are separated by a null byte. In the second case, they are just strings, even with whitespace, and your synonym file must contain exactly the same entries as your analyzer sees them (and in the same order; or you have to use the same analyzer to load the synonym files).

Can you post the relevant part of your schema.xml?

Note: I can confirm that multi-token synonym expansion can be made to work, even in complex cases (we do it), but if you need multi-token synonyms, you will likely also need a smarter query parser. Sometimes your users will use query strings that contain overlapping synonym entries. To handle that, you will have to know how to generate all possible 'reads'. Example:

    synonyms:
        foo bar, foobar
        hey foo, heyfoo
    user input: hey foo bar
    possible readings: ((hey foo) +bar) OR (hey +(foo bar))

I'm simplifying it here; the fun starts when you are seeing a phrase query :)
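The overlapping-entry problem described above can be illustrated with a small sketch. It is not how any Solr query parser is implemented; it only enumerates the possible readings of a token sequence, assuming a hypothetical in-memory table `SYNONYMS` that maps each multi-token phrase to its single-token form (the names `SYNONYMS` and `readings` are made up for this illustration).

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table taken from the foo/bar example in the thread:
# each multi-token phrase maps to its single-token equivalent.
SYNONYMS: Dict[Tuple[str, ...], str] = {
    ("foo", "bar"): "foobar",
    ("hey", "foo"): "heyfoo",
}


def readings(tokens: List[str]) -> List[List[str]]:
    """Return every way to read `tokens` as plain tokens and known phrases."""
    if not tokens:
        return [[]]
    results: List[List[str]] = []
    # Reading 1: treat the first token as an ordinary token.
    for rest in readings(tokens[1:]):
        results.append([tokens[0]] + rest)
    # Reading 2: consume any known phrase that starts at this position.
    for phrase, replacement in SYNONYMS.items():
        n = len(phrase)
        if tuple(tokens[:n]) == phrase:
            for rest in readings(tokens[n:]):
                results.append([replacement] + rest)
    return results


if __name__ == "__main__":
    for reading in readings("hey foo bar".split()):
        print(reading)
    # ['hey', 'foo', 'bar']
    # ['hey', 'foobar']
    # ['heyfoo', 'bar']
```

A query parser that handles overlaps would then have to OR these readings together, which is the part that gets complicated once phrase queries are involved.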
Re: Multi term synonyms
On Wed, Apr 29, 2015 at 4:00 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

Please post the output of the request with debugQuery=true. Do you see the synonyms being expanded? Probably not. You can go to the admin interface and, in the analyzer section, play with the input until you see the synonyms. Use phrase queries too. That will be helpful to eliminate the autophrase filter.
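As a small aid for the debugging step suggested above, the following sketch builds the request URL with properly encoded parameters, once as a plain query and once as a phrase query. The host, core and handler path are the ones quoted in the thread; the helper name `debug_url` is an assumption made for this example.

```python
from urllib.parse import urlencode

# Handler URL as quoted in the thread.
BASE = "http://localhost:8983/solr/collection1/autophrase"


def debug_url(query: str, phrase: bool = False) -> str:
    """Build a request URL with debugQuery=true; wrap the query in quotes if phrase=True."""
    q = f'"{query}"' if phrase else query
    params = {"q": q, "wt": "json", "indent": "true", "debugQuery": "true"}
    # urlencode joins the parameters with '&' and percent-encodes the values.
    return f"{BASE}?{urlencode(params)}"


if __name__ == "__main__":
    print(debug_url("tween 20"))
    print(debug_url("tween 20", phrase=True))
```

Comparing the `parsedquery` entries of the two responses shows whether the phrase and non-phrase forms are analyzed differently.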
Re: Multi term synonyms
On Wed, Apr 29, 2015 at 5:48 PM, Roman Chyla <roman.ch...@gmail.com> wrote:

Hi Kaushik,

I meant to compare "tween 20" against tween 20. Your autophrase filter replaces whitespace with x, but your synonym filter expects whitespace. Try that.

Roman
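The whitespace mismatch described above can be reproduced outside Solr. The sketch below only mimics the effect (lowercasing, then replacing whitespace with `x`); it is not the AutoPhrasingTokenFilter's actual code, and the helper names `autophrase` and `rewrite_synonym_line` are assumptions made for this example.

```python
import re


def autophrase(text: str, replacement: str = "x") -> str:
    """Mimic the observed effect: lowercase, then replace whitespace runs."""
    return re.sub(r"\s+", replacement, text.strip().lower())


def rewrite_synonym_line(line: str, replacement: str = "x") -> str:
    """Normalise every comma-separated entry of one synonyms.txt line the same way.

    Caveat: entries that themselves contain a comma, such as
    POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE, are split by this naive parser.
    """
    return ",".join(autophrase(entry, replacement) for entry in line.split(","))


if __name__ == "__main__":
    line = "TWEEN 20,POLYSORBATE 20,T-MAZ 20"
    query_token = autophrase("tween 20")              # 'tweenx20'
    original = [entry.lower() for entry in line.split(",")]
    rewritten = rewrite_synonym_line(line).split(",")
    print(query_token in original)    # False: the synonym file still has whitespace
    print(query_token in rewritten)   # True: entries now match what the analyzer emits
```

The point is only that the synonym entries have to look exactly like the tokens that reach the synonym filter, whichever side is changed to achieve that.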
Re: Multi term synonyms
On Apr 29, 2015 2:27 PM, Kaushik <kaushika...@gmail.com> wrote:

Hi Roman,

When I used debugQuery with

    http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true

I see the following in the response. The autophrase plugin seems to be doing its part, just not the synonym expansion. When you say use phrase queries, what do you mean? Please clarify.

    "response": { "numFound": 0, "start": 0, "docs": [] },
    "debug": {
      "rawquerystring": "tween 20",
      "querystring": "tween 20",
      "parsedquery": "name:tweenx20",
      "parsedquery_toString": "name:tweenx20",
      "explain": {},

Thank you,
Kaushik
Re: Multi term synonyms
On Apr 29, 2015 10:27 PM, Kaushik <kaushika...@gmail.com> wrote:

Hi Roman,

"Tween 20" also did not return any results. So I replaced the whitespace in synonyms.txt with 'x', and now when I search, I get the results back.

One problem still exists, however: when I search for POLYSORBATE 20[MART.], which is a synonym for POLYSORBATE 20, I get the error below:

    "msg": "org.apache.solr.search.SyntaxError: Cannot parse 'polysORbate 20[mart.] ': Encountered \" \"]\" \"] \"\" at line 1, column 20.\r\nWas expecting one of:\r\n\"TO\" ...\r\n<RANGE_QUOTED> ...\r\n<RANGE_GOOP> ...\r\n",
    "code": 400

If I am able to solve this, I think I am pretty close to the solution. Any thoughts there?

I appreciate your help on this matter.

Thank you,
Kaushik
Re: Multi term synonyms
Brackets are range operators for the parser; you need to escape them (\[) or enclose the query in quotes.
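The two options mentioned above (escaping or quoting) can be applied on the client before the query is sent. The following is a minimal sketch, assuming the classic Lucene query syntax; the character set and the helper names `escape_query` and `quote_phrase` are this example's own, and whether a quoted query passes through the autophrasing parser as intended is not verified here.

```python
# Characters treated as syntax by the classic Lucene query parser
# (assumed set for this sketch; '&' and '|' are escaped individually).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_query(text: str) -> str:
    """Prefix every query-syntax character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def quote_phrase(text: str) -> str:
    """Wrap the text in double quotes, escaping backslashes and embedded quotes."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


if __name__ == "__main__":
    term = "POLYSORBATE 20 [MART.]"
    print(escape_query(term))   # POLYSORBATE 20 \[MART.\]
    print(quote_phrase(term))   # "POLYSORBATE 20 [MART.]"
```

Note that escaping also touches the hyphens and parentheses that occur in entries such as T-MAZ 20 or POLYOXYETHYLENE (20) SORBITAN MONOLAURATE, so the same helper covers the other special characters in the synonym list.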
Re: Multi term synonyms
On Tue, Apr 28, 2015 at 10:31 AM, Kaushik <kaushika...@gmail.com> wrote:

Hi there,

I tried the solution provided in https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/ . The mentioned solution works when the indexed data does not have alphanumerics or special characters. But in my case the synonyms are something like the below.

    T-MAZ 20
    POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
    SORBITAN MONODODECANOATE
    POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
    POLYOXYETHYLENE SORBITAN MONOLAURATE
    POLYSORBATE 20 [MART.]
    SORBIMACROGOL LAURATE 300
    POLYSORBATE 20 [FHFI]
    FEMA NO. 2915

They have alphanumerics, special characters, spaces, etc. Is there a way to implement synonyms even in such a case?

Thanks,
Kaushik
RE: Multi term synonyms
On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] <daniel.da...@nih.gov> wrote:

Handling MESH descriptor preferred terms and such is similar. I encountered this during evaluation of Solr for a project here at NLM. We decided to use Solr for different projects instead. I considered the following approaches:

- Use a custom tokenizer at index time that indexes all of the multiple-term alternatives.
- Index the data, and then have an enrichment process that queries on each source synonym and generates an update to add the target synonyms. Follow this with an optimize.
- During the indexing process, but before sending the data to Solr, process the data to tokenize and add synonyms to another field.

Both the custom tokenizer and the enrichment process share the feature that they use Solr's own tokenizer rather than duplicate it. The enrichment process seems to me only workable in environments where you can re-index all data periodically, so no continuous stream of data to index that needs to be handled relatively quickly once it is generated. The last method of pre-processing the data seems the least desirable to me from a blue-sky perspective, but is probably the easiest to implement and the most independent of Solr.

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH

-----Original Message-----
From: Kaushik [mailto:kaushika...@gmail.com]
Sent: Monday, April 20, 2015 10:47 AM
To: solr-user@lucene.apache.org
Subject: Multi term synonyms

Hello,

Reading up on synonyms, it looks like there is no real solution for multi-term synonyms. Is that right? I have a use case where I need to map one multi-term phrase to another, i.e. Tween 20 needs to be translated to Polysorbate 40. Any thoughts as to how this can be achieved?

Thanks,
Kaushik
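The third approach listed above (pre-processing documents before they reach Solr and writing synonyms to another field) might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the synonym groups, the target field name `name_synonyms`, and the naive lowercase substring matching, which a real pipeline would replace with proper tokenization.

```python
from typing import Dict, List

# Hypothetical synonym groups; in practice these would be loaded from a file.
SYNONYM_GROUPS: List[List[str]] = [
    ["polysorbate 20", "tween 20", "fema no. 2915"],
]


def build_lookup(groups: List[List[str]]) -> Dict[str, List[str]]:
    """Map every term to the other members of its group (all lowercased)."""
    lookup: Dict[str, List[str]] = {}
    for group in groups:
        terms = [term.lower() for term in group]
        for term in terms:
            lookup[term] = [other for other in terms if other != term]
    return lookup


def enrich(doc: dict, lookup: Dict[str, List[str]],
           source: str = "name", target: str = "name_synonyms") -> dict:
    """Return a copy of doc with synonyms of any matched term added to `target`."""
    value = str(doc.get(source, "")).lower()
    found: List[str] = []
    for term, synonyms in lookup.items():
        if term in value:  # naive substring match, a simplification
            for synonym in synonyms:
                if synonym not in found:
                    found.append(synonym)
    enriched = dict(doc)
    enriched[target] = found
    return enriched


if __name__ == "__main__":
    lookup = build_lookup(SYNONYM_GROUPS)
    print(enrich({"id": "31", "name": "Polysorbate 20"}, lookup))
    # {'id': '31', 'name': 'Polysorbate 20', 'name_synonyms': ['tween 20', 'fema no. 2915']}
```

Because the expansion happens before indexing, the extra field can use a simple analyzer, at the cost of having to re-process documents whenever the synonym list changes.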