Re: Mutli term synonyms

2015-04-29 Thread Kaushik
Hi Roman,

Following is my use case:

*Schema.xml*...

<field name="name" type="text_autophrase" indexed="true" stored="true"/>

<fieldType name="text_autophrase" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="com.lucidworks.analysis.AutoPhrasingTokenFilterFactory"
            phrases="autophrases.txt" includeTokens="false"
            replaceWhitespaceWith="X"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt" enablePositionIncrements="true"/>
  </analyzer>
</fieldType>

*SolrConfig.xml...*

<requestHandler name="/autophrase" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">name</str>
    <str name="defType">autophrasingParser</str>
  </lst>
</requestHandler>

<queryParser name="autophrasingParser"
             class="com.lucidworks.analysis.AutoPhrasingQParserPlugin">
  <str name="phrases">autophrases.txt</str>
  <str name="replaceWhitespaceWith">X</str>
</queryParser>


*Synonyms.txt*
PEG-20 SORBITAN LAURATE,POLYOXYETHYLENE 20 SORBITAN MONOLAURATE,TWEEN
20,POLYSORBATE 20 [USAN],POLYSORBATE 20 [INCI],POLYSORBATE 20
[II],POLYSORBATE 20 [HSDB],TWEEN-20,PEG-20 SORBITAN,PEG-20 SORBITAN
[VANDF],POLYSORBATE-20,POLYSORBATE 20,SORETHYTAN MONOLAURATE,T-MAZ
20,POLYOXYETHYLENE (20) SORBITAN MONOLAURATE,SORBITAN
MONODODECANOATE,POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE,POLYOXYETHYLENE
SORBITAN MONOLAURATE,POLYSORBATE 20 [MART.],SORBIMACROGOL LAURATE
300,POLYSORBATE 20 [FHFI],FEMA NO. 2915,POLYSORBATE 20 [FCC],POLYSORBATE 20
[WHO-DD],POLYSORBATE 20 [VANDF]

*Autophrase.txt...*

Has all the above phrases in one column

*Indexed document*

<doc>
  <field name="id">31</field>
  <field name="name">Polysorbate 20</field>
</doc>

So when I query the Solr /autophrase handler for "tween 20" or "FEMA NO. 2915",
I expect to see the record containing Polysorbate 20, i.e.
http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true
should have retrieved it, but it doesn't.

What could I be doing wrong?


Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
I'm not sure I understand - the autophrasing filter will allow the
parser to see all the tokens, so that they can be parsed (and
multi-token synonyms identified). So if you are using the same
analyzer at query and index time, they should be able to see the same
stuff.

Are you using multi-token synonyms, or just entries that look like
multi-token synonyms? (In the first case, the tokens are separated by a
null byte.) In the second case, they are just strings, even with
whitespace, and your synonym file must contain exactly the same entries
as your analyzer sees them (and in the same order; or you have to use
the same analyzer to load the synonym files).

Can you post the relevant part of your schema.xml?


Note: I can confirm that multi-token synonym expansion can be made to
work, even in complex cases - we do it - but likely, if you need
multi-token synonyms, you will also need a smarter query parser.
Sometimes your users will use query strings that contain overlapping
synonym entries; to handle that, you will have to know how to generate
all possible 'reads'. Example:

synonym:

foo bar, foobar
hey foo, heyfoo

user input:

hey foo bar

possible readings:

((hey foo) +bar) OR (hey +(foo bar))

i'm simplifying it here, the fun starts when you are seeing a phrase query :)
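
A minimal sketch of how the possible readings above could be enumerated, assuming a small in-memory synonym table. The `SYNONYMS` table and the `readings` function are hypothetical names for illustration only; this is not the parser Roman describes, and it ignores phrase queries and boolean operators entirely.

```python
from typing import Dict, List, Tuple

# Hypothetical synonym table: multi-token phrase -> single-token form.
SYNONYMS: Dict[Tuple[str, ...], str] = {
    ("foo", "bar"): "foobar",
    ("hey", "foo"): "heyfoo",
}


def readings(tokens: List[str]) -> List[List[str]]:
    """Return every way to split tokens into plain tokens or known phrases."""
    if not tokens:
        return [[]]
    results: List[List[str]] = []
    # Option 1: treat the first token as a plain token.
    for rest in readings(tokens[1:]):
        results.append([tokens[0]] + rest)
    # Option 2: consume any known phrase that starts at this position.
    for phrase in SYNONYMS:
        n = len(phrase)
        if tuple(tokens[:n]) == phrase:
            for rest in readings(tokens[n:]):
                results.append(["(" + " ".join(phrase) + ")"] + rest)
    return results


for reading in readings(["hey", "foo", "bar"]):
    print(" ".join(reading))
```

Besides the two phrase readings from the example, this also yields the plain reading with no phrase grouped; a real query parser would then OR the readings together.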

 On Mon, Apr 20, 2015 at 11:03 AM, Davis, Daniel (NIH/NLM) [C] 
 daniel.da...@nih.gov wrote:

 Handling MESH descriptor preferred terms and such is similar.   I
 encountered this during evaluation of Solr for a project here at NLM.   We
 decided to use Solr for different projects instead. I considered the
 following approaches:
  - use a custom tokenizer at index time that indexed all of the multiple
 term alternatives.
  - index the data, and then have an enrichment process that queries on
 each source synonym, and generates an update to add the target synonyms.
Follow this with an optimize.
  - During the indexing process, but before sending the data to Solr,
 process the data to tokenize and add synonyms to another field.

 Both the custom tokenizer and enrichment process share the feature that
 they use Solr's own tokenizer rather than duplicate it.   The enrichment
 process seems to me only workable in environments where you can re-index
 all data periodically, so no continuous stream of data to index that needs
 to be handled relatively quickly once it is generated. The last method
 of pre-processing the data seems the least desirable to me from a blue-sky
 perspective, but is probably the easiest to implement and the most
 independent of Solr.
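
 The last approach (pre-processing the data before it is sent to Solr) could be sketched roughly as follows. This is only an illustration under stated assumptions: the `name_synonyms` target field, the `build_lookup` and `enrich` helpers, and whole-field matching are all hypothetical and not something described in the thread.

```python
from typing import Dict, List


def build_lookup(groups: List[List[str]]) -> Dict[str, List[str]]:
    """Map each lowercased term to the full synonym group it belongs to."""
    lookup: Dict[str, List[str]] = {}
    for group in groups:
        for term in group:
            lookup[term.lower()] = group
    return lookup


def enrich(doc: dict, lookup: Dict[str, List[str]],
           source: str = "name", target: str = "name_synonyms") -> dict:
    """Return a copy of doc with the other members of its synonym group
    stored in a separate (hypothetical) field. Matches the whole field value."""
    value = doc.get(source, "")
    group = lookup.get(value.lower(), [])
    enriched = dict(doc)
    enriched[target] = [t for t in group if t.lower() != value.lower()]
    return enriched


groups = [["Polysorbate 20", "Tween 20", "FEMA NO. 2915"]]
doc = enrich({"id": "31", "name": "Polysorbate 20"}, build_lookup(groups))
print(doc["name_synonyms"])
```

 The enriched document would then be indexed as usual, with queries searching both fields; that keeps the synonym logic outside Solr, which is the trade-off described above.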

 Hope this helps,

 Dan Davis, Systems/Applications Architect (Contractor),
 Office of Computer and Communications Systems,
 National Library of Medicine, NIH

 -Original Message-
 From: Kaushik [mailto:kaushika...@gmail.com]
 Sent: Monday, April 20, 2015 10:47 AM
 To: solr-user@lucene.apache.org
 Subject: Mutli term synonyms

 Hello,

 Reading up on synonyms it looks like there is no real solution for multi
 term synonyms. Is that right? I have a use case where I need to map one
 multi term phrase to another. i.e. Tween 20 needs to be translated to
 Polysorbate 40.

 Any thoughts as to how this can be achieved?

 Thanks,
 Kaushik



Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
Please post the output of the request with debugQuery=true.

Do you see the synonyms being expanded? Probably not.

You can go to the admin interface and, in the analysis section, play with
the input until you see the synonyms. Use phrase queries too. That will
help to rule out the autophrase filter.
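
For reference, a small sketch of building such a debug request URL so that the parameters are separated and encoded correctly. The host, core, and handler are the ones used elsewhere in this thread; the `debug_url` helper is a hypothetical name, and the sketch only builds the URL without sending the request.

```python
from urllib.parse import urlencode


def debug_url(base: str, query: str) -> str:
    """Build a Solr request URL with debugQuery enabled."""
    params = {
        "q": query,
        "wt": "json",
        "indent": "true",
        "debugQuery": "true",
    }
    # urlencode joins parameters with '&' and encodes spaces as '+'.
    return base + "?" + urlencode(params)


print(debug_url("http://localhost:8983/solr/collection1/autophrase", "tween 20"))
```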

Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
Hi Kaushik, I meant to compare "tween 20" against tween 20.

Your autophrase filter replaces whitespace with x, but your synonym filter
expects whitespace. Try that.

Roman
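
One way to line the two up is to rewrite the synonym entries so they match what the autophrase step emits. The sketch below assumes a comma-separated synonyms line in which literal commas inside an entry are backslash-escaped; `autophrase_synonyms` is a hypothetical helper, not part of Solr or the autophrasing plugin.

```python
import re


def autophrase_synonyms(line: str, replacement: str = "x") -> str:
    """Replace whitespace inside each synonym entry so that multi-word
    entries become single tokens, e.g. 'TWEEN 20' -> 'TWEENx20'."""
    # Split on commas that are not escaped with a backslash.
    entries = [e.strip() for e in re.split(r"(?<!\\),", line)]
    return ",".join(re.sub(r"\s+", replacement, e) for e in entries if e)


print(autophrase_synonyms("TWEEN 20,POLYSORBATE 20,FEMA NO. 2915"))
```

Case is left untouched here on the assumption that ignoreCase=true on the synonym filter takes care of it.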

Re: Mutli term synonyms

2015-04-29 Thread Kaushik
Hi Roman,

When I ran the query with debugQuery enabled, using
http://localhost:8983/solr/collection1/autophrase?q=tween+20&wt=json&indent=true&debugQuery=true
I see the following in the response. The autophrase plugin seems to be
doing its part, just not the synonym expansion. When you say to use phrase
queries, what do you mean? Please clarify.

  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  },
  "debug": {
    "rawquerystring": "tween 20",
    "querystring": "tween 20",
    "parsedquery": "name:tweenx20",
    "parsedquery_toString": "name:tweenx20",
    "explain": {},

Thank you,

Kaushik



Re: Mutli term synonyms

2015-04-29 Thread Kaushik
Hi Roman,

"Tween 20" also did not retrieve any results. So I replaced the whitespace
in synonyms.txt with 'x', and now when I search, I get the results back.
One problem still exists, however: when I search for POLYSORBATE
20[MART.], which is a synonym for POLYSORBATE 20, I get the error below:

msg: org.apache.solr.search.SyntaxError: Cannot parse 'polysORbate
20[mart.] ': Encountered \ \]\ \] \\ at line 1, column
20.\r\nWas expecting one of:\r\n\TO\ ...\r\nRANGE_QUOTED
...\r\nRANGE_GOOP ...\r\n,
code: 400

If I am able to solve this, I think I am pretty close to the solution.
Any thoughts there?

I appreciate your help on this matter.

Thank you,

Kaushik




Re: Mutli term synonyms

2015-04-29 Thread Roman Chyla
Brackets are range operators for the parser; you need to escape them (\[ and
\]) or enclose the term in quotes.
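
A rough sketch of both options on the client side, assuming the classic Lucene query syntax and its usual set of special characters. `escape_query` and `quote_query` are hypothetical helper names, and whether the autophrasing parser treats escaped or quoted input the same way is not confirmed in this thread.

```python
import re

# Single characters with special meaning in the classic Lucene query syntax.
_SPECIAL = re.compile(r'([\\+\-!():^\[\]"{}~*?|&/])')


def escape_query(text: str) -> str:
    """Backslash-escape every special character in a user-supplied term."""
    return _SPECIAL.sub(r"\\\1", text)


def quote_query(text: str) -> str:
    """Wrap the term in double quotes, escaping backslashes and quotes inside."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


print(escape_query("POLYSORBATE 20 [MART.]"))
print(quote_query("POLYSORBATE 20 [MART.]"))
```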

Re: Mutli term synonyms

2015-04-28 Thread Kaushik
Hi there,

I tried the solution provided at:
https://lucidworks.com/blog/solution-for-multi-term-synonyms-in-lucenesolr-using-the-auto-phrasing-tokenfilter/
That solution works when the indexed data does not contain alphanumerics
or special characters, but in my case the synonyms look like the ones
below.


 T-MAZ 20
 POLYOXYETHYLENE (20) SORBITAN MONOLAURATE
 SORBITAN MONODODECANOATE
 POLY(OXY-1,2-ETHANEDIYL) DERIVATIVE
 POLYOXYETHYLENE SORBITAN MONOLAURATE
 POLYSORBATE 20 [MART.]
 SORBIMACROGOL LAURATE 300
 POLYSORBATE 20 [FHFI]
 FEMA NO. 2915

They contain alphanumerics, special characters, spaces, etc. Is there a way
to implement synonyms even in such cases?

Thanks,
Kaushik


RE: Mutli term synonyms

2015-04-20 Thread Davis, Daniel (NIH/NLM) [C]
Handling MeSH descriptor preferred terms and the like is a similar problem. I
encountered it while evaluating Solr for a project here at NLM. We decided to
use Solr for different projects instead. I considered the following approaches:
 - Use a custom tokenizer at index time that indexes all of the multi-term
   alternatives.
 - Index the data, then run an enrichment process that queries on each source
   synonym and generates an update to add the target synonyms. Follow this
   with an optimize.
 - During the indexing process, but before sending the data to Solr, process
   the data to tokenize it and add synonyms to another field.
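
For the enrichment approach, the update-generation step could look roughly
like the Python sketch below. Everything named here is an assumption for
illustration: build_updates is a hypothetical helper, name_synonyms is a
made-up multi-valued field, and the hits dictionary stands in for the
results of the per-synonym Solr queries, which are not shown. The {"add":
...} shape is meant to follow Solr's atomic update syntax; check it against
your Solr version before relying on it.

```python
import json


def build_updates(hits, groups, field="name_synonyms"):
    """Build one update document per matched id.

    hits maps a source synonym to the ids of the documents that matched it.
    """
    per_doc = {}
    for group in groups:
        for source in group:
            for doc_id in hits.get(source, []):
                targets = per_doc.setdefault(doc_id, [])
                # Add every other member of the group as a target synonym.
                for term in group:
                    if term != source and term not in targets:
                        targets.append(term)
    return [{"id": doc_id, field: {"add": terms}} for doc_id, terms in per_doc.items()]


groups = [["TWEEN 20", "POLYSORBATE 20", "PEG-20 SORBITAN LAURATE"]]
hits = {"TWEEN 20": ["31"]}  # pretend the query for "TWEEN 20" matched document 31
print(json.dumps(build_updates(hits, groups), indent=2))
```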

Both the custom tokenizer and the enrichment process share the feature that
they use Solr's own tokenizer rather than duplicating it. The enrichment
process seems to me workable only in environments where you can re-index all
data periodically, that is, where there is no continuous stream of data that
must be indexed relatively quickly once it is generated. The last method,
pre-processing the data, seems the least desirable to me from a blue-sky
perspective, but it is probably the easiest to implement and the most
independent of Solr.
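
As a minimal sketch of that last, pre-processing method, assuming Python on
the indexing side: the helper below copies every member of a synonym group
into a separate field whenever one member appears in the source field. The
names add_synonym_field, name and name_synonyms are hypothetical, and the
matching is a plain case-insensitive regex search rather than a real
tokenizer, so it only approximates the "tokenize" step.

```python
import re


def _contains(text, term):
    """Case-insensitive literal match that does not start or end inside a word."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def add_synonym_field(doc, groups, source_field="name", target_field="name_synonyms"):
    """Return a copy of doc with all synonyms of any matched phrase in target_field."""
    text = doc.get(source_field, "")
    found = []
    for group in groups:
        if any(_contains(text, term) for term in group):
            for term in group:
                if term not in found:
                    found.append(term)
    enriched = dict(doc)
    if found:
        enriched[target_field] = found
    return enriched


groups = [["TWEEN 20", "POLYSORBATE 20"]]
print(add_synonym_field({"id": "31", "name": "Tween 20 solution"}, groups))
```

The enriched document would then be sent to Solr as usual, with the extra
field searched alongside the original one.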

Hope this helps,

Dan Davis, Systems/Applications Architect (Contractor),
Office of Computer and Communications Systems,
National Library of Medicine, NIH

-Original Message-
From: Kaushik [mailto:kaushika...@gmail.com] 
Sent: Monday, April 20, 2015 10:47 AM
To: solr-user@lucene.apache.org
Subject: Mutli term synonyms

Hello,

Reading up on synonyms, it looks like there is no real solution for multi-term
synonyms. Is that right? I have a use case where I need to map one multi-term
phrase to another, e.g. Tween 20 needs to be translated to Polysorbate 40.

Any thoughts as to how this can be achieved?

Thanks,
Kaushik