Re: Is there Downside to a huge synonyms file?

2009-06-04 Thread Yonik Seeley
On Tue, Jun 2, 2009 at 11:28 PM, anuvenk anuvenkat...@hotmail.com wrote:
 I'm using query time synonyms.

These don't currently work if the synonyms expand to more than one
option, and those options have a different number of words.

-Yonik
http://www.lucidimagination.com
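
One way to picture the limitation described above is a toy model in which every synonym option is stacked at the token position of the original word. This is only an illustration under that assumption, not Solr's implementation; `stack_expansions` is a hypothetical helper.

```python
# Simplified model, NOT Solr's code: stack each synonym option at the
# position of the original query token, one token per position offset.
def stack_expansions(options):
    """Return {position offset: set of tokens} for a list of synonym options."""
    positions = {}
    for phrase in options:
        for offset, token in enumerate(phrase.split()):
            positions.setdefault(offset, set()).add(token)
    return positions


# Every option is one word long: a single position holds all alternatives.
same_length = stack_expansions(["ca", "california"])

# Options of different lengths: a second position appears that only the
# two-word options contribute to.
mixed_length = stack_expansions(["ca", "california", "los angeles", "san diego"])

print(same_length)
print(mixed_length)
```

In the mixed case the second position holds only "angeles" and "diego", so a position-based query can no longer tell which second word belongs to which option, and the one-word options have nothing to offer at that position.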


Re: Is there Downside to a huge synonyms file?

2009-06-03 Thread anuvenk

I tried adding some city-to-state mappings in the synonyms file. I'm using
the dismax handler for phrase matching. As I add more and more city-to-state
mappings, I end up with zero results for state-based searches.
Eg: ca,california,los angeles
    ca,california,san diego
    ca,california,san francisco
    ca,california,burbank
and so on.
Now a city-based search returns a few other california results, but a
state-based search like "dui california" returns zero results.
I checked parsedquery_toString and I see no 'OR', although the default
operator is 'OR' in the schema. It looks like it's trying to find matches for
all those cities, as they are mapped to 'california', and hence returns zero
results. How do I force dismax to use 'OR' and not 'AND', even though the
schema has 'OR'?
Or is this how dismax works? Can someone explain how to overcome this
problem?
Here is my custom request handler that extends dismax
<requestHandler name="qfacet" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="qf">name^2.0 text^0.8</str>
    <!-- until 3 all should match; 4 - 3 should match; 5 - 4 should match;
         6 - 5 should match; above 6 - 90% match -->
    <str name="mm">3&lt;-1 4&lt;-1 5&lt;-1 6&lt;90%</str>
    <str name="pf">
      text^0.8 name^2.0
    </str>
    <int name="qs">4</int>
    <int name="ps">4</int>
    <str name="fl">
      *,score
    </str>
  </lst>
  <lst name="invariants">
    <!--<str name="facet.field">resourceType</str>
    <str name="facet.field">category</str>
    <str name="facet.field">stateName</str>-->
    <str name="facet.sort">false</str>
    <int name="facet.mincount">1</int>
  </lst>
</requestHandler>

Thanks.
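
For illustration, here is a toy model of how the four synonym lines in this message might combine. It assumes that each comma-separated line is an equivalence group and that an entry appearing on several lines has its groups merged; `load_synonyms` is a hypothetical helper, not a Solr API.

```python
from collections import defaultdict


def load_synonyms(lines):
    """Map each entry to the union of every group it appears in.

    Assumption: each line is an equivalence group, and repeated entries
    are merged across lines.
    """
    table = defaultdict(set)
    for line in lines:
        group = [entry.strip() for entry in line.split(",") if entry.strip()]
        for entry in group:
            table[entry].update(group)
    return table


lines = [
    "ca,california,los angeles",
    "ca,california,san diego",
    "ca,california,san francisco",
    "ca,california,burbank",
]
table = load_synonyms(lines)

print(sorted(table["california"]))  # the state picks up every city
print(sorted(table["burbank"]))     # a city only picks up the state names
```

Under these assumptions the expansion is lopsided: each city maps to a short list, while "california" grows with every line added, which would fit the observation that city searches still work while state searches degrade.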




-- 
View this message in context: 
http://www.nabble.com/Is-there-Downside-to-a-huge-synonyms-file--tp23842527p23861631.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is there Downside to a huge synonyms file?

2009-06-03 Thread anuvenk

A small addition to my earlier post: I wonder if it's because of the 'mm'
param, which requires that for search phrases of up to 3 words, all the words
must match. If I relax this now, I'd get irrelevant results for a lot of
popular 1-, 2-, and 3-word search terms. How can I solve this?
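
To make the 'mm' behaviour concrete, here is a rough sketch of how a spec like the one in the handler config could map a clause count to the number of clauses that must match. It assumes that "N<value" applies only when there are more than N optional clauses, that negative values mean "all but that many", and that percentages truncate; `min_should_match` is a hypothetical function, not Solr's code.

```python
def min_should_match(clause_count, spec):
    """Number of optional clauses that must match under an mm-style spec."""
    required = clause_count
    spec = spec.strip()
    if "<" in spec:
        # Conditional form: "N<value" applies only above N clauses.
        for part in spec.split():
            bound, value = part.split("<", 1)
            if clause_count <= int(bound):
                return required
            required = min_should_match(clause_count, value)
        return required
    if spec.endswith("%"):
        # Percentage of the clause count, truncated toward zero.
        calc = int(clause_count * int(spec[:-1]) / 100)
    else:
        calc = int(spec)
    # Negative values count down from the total number of clauses.
    required = clause_count + calc if calc < 0 else calc
    return max(0, min(clause_count, required))


MM = "3<-1 4<-1 5<-1 6<90%"  # the spec from the handler config
for n in range(1, 8):
    print(n, "clauses ->", min_should_match(n, MM), "must match")
```

With this reading, queries of one to three clauses require every clause to match, so a single clause that no document can satisfy is enough to return nothing.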


Is there Downside to a huge synonyms file?

2009-06-02 Thread anuvenk

In my index I have legal FAQs, forms, legal videos, etc., with a state field
for each resource.
Now if I search for "real estate san diego", I want to be able to return other
'california' results, i.e. results from san francisco.
I have the following fields in the index:

title                            state       description
real estate san diego example 1  california  some description
real estate carlsbad example 2   california  some desc

So when I search for "real estate san francisco", since there is no match, I
want to be able to return the other real estate results in california
instead of returning none, because sometimes they might be searching for a
real estate form and the city probably doesn't matter.

I have two things in mind. One is adding a synonym mapping

san diego, california
carlsbad, california
san francisco, california

(which probably isn't the best way),
hoping that a search for "san francisco real estate" would map san francisco
to california and hence return the other two california results,

OR

adding the mapping of city to state in the index itself, like:

title                       state       city                                description
real estate san diego eg 1  california  carlsbad, san francisco, san diego  some description
real estate carlsbad eg 2   california  carlsbad, san francisco, san diego  some description

Which of the above two is better? Does a huge synonym file affect
performance? Or is there an even better way? I'm sure there is, but I can't
put my finger on it yet. I'm not familiar with Java either.

-- 
View this message in context: 
http://www.nabble.com/Is-there-Downside-to-a-huge-synonyms-file--tp23842527p23842527.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is there Downside to a huge synonyms file?

2009-06-02 Thread Otis Gospodnetic

Hi,

If index-time synonym expansion is used, then a large synonym file
means your index is going to be bigger.
If query-time synonym expansion is used, then your queries are going to be
larger (i.e. more ORs, thus a bit slower).

How much really depends on your specific synonyms, so I can't generalize.
I have a feeling you are not dealing with millions of documents, in which case
you can most likely ignore the increase in index or query size.
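
As a back-of-the-envelope illustration of that trade-off, the sketch below expands the same hypothetical synonym groups once over a document (index time) and once over a query (query time) and counts the resulting terms. The synonym table, document, and query are made up for the example.

```python
# Hypothetical single-word synonym groups, for illustration only.
SYNONYMS = {
    "ca": {"ca", "california"},
    "california": {"ca", "california"},
    "dui": {"dui", "dwi"},
    "dwi": {"dui", "dwi"},
}


def expand(tokens, synonyms):
    """Replace each token with its sorted synonym group (or just itself)."""
    return [sorted(synonyms.get(token, {token})) for token in tokens]


def count_terms(expanded):
    """Total number of terms after expansion."""
    return sum(len(group) for group in expanded)


doc = "dui lawyer in california".split()
query = "dui california".split()

index_time = expand(doc, SYNONYMS)    # cost is paid once, in index size
query_time = expand(query, SYNONYMS)  # cost is paid on every query, as extra ORs

print(len(doc), "doc tokens ->", count_terms(index_time), "indexed terms")
print(len(query), "query tokens ->", count_terms(query_time), "query terms")
```

Either way the growth is proportional to the size of the groups that actually get hit, not to the overall length of the synonyms file.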

 
Adding synonyms sounds like the easiest approach.  I'd try that and worry about 
improvement only IF I see that doesn't give adequate results.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch






Re: Is there Downside to a huge synonyms file?

2009-06-02 Thread anuvenk

I'm using query-time synonyms. I have more fields in my index, though; this is
just a sample of the data from my index. Yes, we don't have millions of
documents. It could be around 300,000 and might increase in future. The
reason I'm using query-time synonyms is the nature of my data: I can't
re-index the data every time I add or remove a synonym. But for this
particular requirement, is it best to have index-time synonyms because of the
multi-word nature of the synonyms? Again, if I add more cities to the synonym
file, I can't be re-indexing all the data over and over again.




-- 
View this message in context: 
http://www.nabble.com/Is-there-Downside-to-a-huge-synonyms-file--tp23842527p23844761.html
Sent from the Solr - User mailing list archive at Nabble.com.



Re: Is there Downside to a huge synonyms file?

2009-06-02 Thread Otis Gospodnetic

Hello,

300K is a pretty small index.  I wouldn't worry about the number of synonyms 
unless you are turning a single term into dozens of ORed terms.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


