Getting term offsets from Solr

2013-09-20 Thread Nalini Kartha
Hi,

We're looking at implementing highlighting for some fields which may be too
large to store in the index.

As an alternative to using the Solr Highlighter (which needs fields to be
stored), I was wondering if a) the offsets of terms are stored BY DEFAULT
in the index (even if we're not using the TermVectorComponent) and if so,
b) is there a way to get the offset information from Solr.

Thanks,
Nalini


Re: Getting term offsets from Solr

2013-09-20 Thread Nalini Kartha
Thanks for the reply.

We tried enabling these options but that's also causing too much index
bloat so I was wondering if there's a way to get at the offset information
more cheaply?

Thanks,
Nalini


On Fri, Sep 20, 2013 at 4:41 PM, Jack Krupansky j...@basetechnology.com wrote:

 Set:

 termVectors=true
 termPositions=true
 termOffsets=true

 And use the fast vector highlighter.

 -- Jack Krupansky
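 [Editor's note: a hypothetical schema.xml fragment showing the settings Jack
 lists; the field name "content" and type "text_general" are illustrative, not
 from the thread. Note that the FastVectorHighlighter reads the text it
 highlights from the stored value, so stored="true" is still required:]

```xml
<!-- Illustrative field definition (names are assumptions). FVH needs term
     vectors with positions and offsets, plus the stored text itself. -->
<field name="content" type="text_general"
       indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```

 At query time the fast vector highlighter is selected per request, e.g. with
 hl=true&hl.useFastVectorHighlighter=true&hl.fl=content.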

 -Original Message-
 From: Nalini Kartha
 Sent: Friday, September 20, 2013 7:34 PM
 To: solr-user@lucene.apache.org
 Subject: Getting term offsets from Solr
  Hi,

 We're looking at implementing highlighting for some fields which may be too
 large to store in the index.

 As an alternative to using the Solr Highlighter (which needs fields to be
 stored), I was wondering if a) the offsets of terms are stored BY DEFAULT
 in the index (even if we're not using the TermVectorComponent) and if so,
 b) is there a way to get the offset information from Solr.

 Thanks,
 Nalini



Re: Getting term offsets from Solr

2013-09-20 Thread Nalini Kartha
I'm wondering if storing just the offset as a payload would be cheaper from
storage perspective than enabling termOffsets, termVectors and
termPositions? Maybe we could get the offset info to return with results
from there then?

Thanks,
Nalini
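[Editor's note: the payload idea above can be sketched in pure Python — encode
each term's start/end character offsets as a compact per-term payload at
analysis time, and decode them at highlight time. This is NOT the Lucene
payload API (there one would set a PayloadAttribute inside a TokenFilter);
every name here is illustrative.]

```python
# Sketch only: pack (start, end) character offsets into an 8-byte payload
# per posting, instead of enabling termVectors/termPositions/termOffsets.
import re
import struct

def tokenize_with_offset_payloads(text):
    """Yield (term, payload) pairs; the payload packs (start, end)
    as two big-endian 32-bit ints."""
    for m in re.finditer(r"\w+", text):
        payload = struct.pack(">ii", m.start(), m.end())
        yield m.group(0).lower(), payload

# Decode the payloads back into highlightable offsets.
for term, payload in tokenize_with_offset_payloads("Christmas wrapping paper"):
    start, end = struct.unpack(">ii", payload)
    print(term, start, end)
# christmas 0 9
# wrapping 10 18
# paper 19 24
```

Whether 8 bytes per posting actually beats the term-vector options depends on
the index: payloads are stored per position, so the saving may be modest.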


On Fri, Sep 20, 2013 at 5:02 PM, Nalini Kartha nalinikar...@gmail.com wrote:

 Thanks for the reply.

 We tried enabling these options but that's also causing too much index
 bloat so I was wondering if there's a way to get at the offset information
 more cheaply?

 Thanks,
 Nalini


 On Fri, Sep 20, 2013 at 4:41 PM, Jack Krupansky 
  j...@basetechnology.com wrote:

 Set:

 termVectors=true
 termPositions=true
 termOffsets=true

 And use the fast vector highlighter.

 -- Jack Krupansky

  -Original Message-
  From: Nalini Kartha
  Sent: Friday, September 20, 2013 7:34 PM
  To: solr-user@lucene.apache.org
  Subject: Getting term offsets from Solr
  Hi,

 We're looking at implementing highlighting for some fields which may be
 too
 large to store in the index.

 As an alternative to using the Solr Highlighter (which needs fields to be
 stored), I was wondering if a) the offsets of terms are stored BY DEFAULT
 in the index (even if we're not using the TermVectorComponent) and if so,
 b) is there a way to get the offset information from Solr.

 Thanks,
 Nalini





Re: Converting fq params to Filter object

2012-12-28 Thread Nalini Kartha
Hi James,

We're using Solr but reason I wanted to issue the queries from
DirectSpellChecker was so that we don't end up returning a bunch of
corrections from suggestSimilar() which then later get weeded out when we
run the extra correction queries because they would return no hits taking
fqs into account. If we were able to issue the queries at the time of
building up the list of corrections then we know that they are all valid.

Thanks for the pointer to the EarlyTerminatingCollector, that seems like it
would improve perf a lot.

I'm still not sure if there's an easy way to build the Filter object from
the fq params though, will keep digging around. If someone could point me
to any code that does this conversion (I'm guessing that conversion needs
to be done at some point for regular queries when Solr calls into Lucene
but I could be wrong) that would be much appreciated.

Thanks,
Nalini


On Thu, Dec 27, 2012 at 4:28 PM, Dyer, James
james.d...@ingramcontent.com wrote:

 Nalini,

 Assuming that you're using Solr, the hook into the collate functionality
  is in SpellCheckComponent#addCollationsToResponse.  To do what you want,
 you would have to modify the call to SpellCheckCollator to issue test
 queries against the individual words instead of the collations.

 See
 http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java

 Of course if you're using Lucene directly and not Solr, then you would
 want to build a series of queries that each query one word with the filters
 applied.  DirectSpellChecker#suggestSimilar returns an array of SuggestWord
 instances that contain the individual words you would want to try.  To
 optimize this, you can use the same approach as in SOLR-3240, implementing
 a Collector that only looks for 1 document then quits.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311
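 [Editor's note: the early-terminating idea James describes (SOLR-3240) can be
 sketched outside Lucene. Here doc ids stand in for a term's postings and for
 the fq result; the names are illustrative, not Lucene API:]

```python
def correction_has_hits(postings, filter_docs):
    """Return True as soon as one doc matches both the corrected term
    and the filter -- mimicking a Collector that collects a single
    document and then quits, instead of scoring the whole result set."""
    for doc in postings:        # postings for the corrected term, in doc-id order
        if doc in filter_docs:  # doc also satisfies the fq params
            return True         # first hit found: terminate early
    return False

# Hypothetical data: docs containing "paper" vs docs matching fq=item:in_stock.
print(correction_has_hits([3, 7, 12, 40], {7, 12, 99}))  # True: doc 7 matches
print(correction_has_hits([1, 2], {7, 12, 99}))          # False: no overlap
```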


 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Thursday, December 27, 2012 2:31 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Converting fq params to Filter object

 Hi James,

 Yup, that was what I tried to do initially but it seems like calling
 through to those Solr methods from DirectSpellChecker was not a good idea -
 am I wrong? And like you mentioned, this seemed like it wasn't low-level
 enough.

 Erik: Unfortunately the collate functionality does not work for our use
 case since the queries we're correcting are default OR. Here's the original
 thread about this -


 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

 Thanks,
 Nalini

 On Thu, Dec 27, 2012 at 2:46 PM, Dyer, James
  james.d...@ingramcontent.com wrote:

  https://issues.apache.org/jira/browse/SOLR-3240




Re: Converting fq params to Filter object

2012-12-27 Thread Nalini Kartha
Hi Lance,

Thanks for the response.

I didn't quite understand how to issue the queries from DirectSpellChecker
with the fq params applied like you were suggesting - could you point me to
the API that can be used for this?

Also, we haven't benchmarked the DirectSpellChecker against the
IndexBasedSpellChecker.

I considered issuing one large OR query with all corrections but that
doesn't ensure that *every* correction would return some hits with the fq
params applied, it only tells us that some correction returned hits so this
isn't restrictive enough for us. And ANDing the corrections together
becomes too restrictive since it requires that *all* corrections existed in
the same documents instead of checking that they individually exist in some
docs (which satisfy the filter queries of course).

Thanks,
Nalini
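[Editor's note: the distinction drawn above can be illustrated with a toy
model — a dict of term -> doc-id set standing in for the index; none of this
is Solr/Lucene API. A single OR query only proves that *some* correction
survives the filter, while a per-correction check identifies exactly which
ones do:]

```python
# Mock index: term -> set of doc ids; filter_docs mocks the fq result.
index = {
    "paper":    {1, 2, 5},
    "wrapping": {2, 5, 9},
    "xmas":     set(),        # a correction with no filtered hits
}
filter_docs = {2, 5}

def or_query_has_hits(terms):
    # One big OR: union of postings, then intersect with the filter.
    union = set().union(*(index[t] for t in terms))
    return bool(union & filter_docs)

def valid_corrections(terms):
    # Per-correction check: keep only terms that individually have hits.
    return [t for t in terms if index[t] & filter_docs]

terms = ["paper", "wrapping", "xmas"]
print(or_query_has_hits(terms))   # True, even though "xmas" is useless
print(valid_corrections(terms))   # ['paper', 'wrapping']
```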


On Wed, Dec 26, 2012 at 9:32 PM, Lance Norskog goks...@gmail.com wrote:

 A Solr facet query does a boolean query, caches the Lucene facet data
 structure, and uses it as a Lucene filter. After that until you do a full
 commit, using the same fq=string (you must match the string exactly)
 fetches the cached data structure and uses it again as a Lucene filter.

 Have you benchmarked the DirectSpellChecker against
 IndexBasedSpellChecker? If you use the fq= filter query as the
 spellcheck.q= query it should use the cached filter.

 Also, since you are checking all words against the same filter query, can
 you just do one large OR query with all of the words?


 On 12/26/2012 03:10 PM, Nalini Kartha wrote:

 Hi Otis,

 Sorry, let me be more specific.

 The end goal is for the DirectSpellChecker to make sure that the
 corrections it is returning will return some results taking into account
 the fq params included in the original query. This is a follow up question
 to another question I had posted earlier -

 http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

 Initially, the way I was thinking of implementing this was to call one of
 the SolrIndexSearcher.getDocSet() methods for every correction, passing in
 the correction as the Query and a DocSet created from the fq queries. But I
 didn't think that calling a SolrIndexSearcher method in Lucene code
 (DirectSpellChecker) was a good idea. So I started looking at which method
 on IndexSearcher would accomplish this. That's where I'm stuck trying to
 figure out how to convert the fq params into a Filter object.

 Does this approach make sense? Also I realize that this implementation is
 probably non-performant but wanted to give it a try and measure how it
 does. Any advice about what the perf overhead from issuing such queries for
 say 50 corrections would be? Note that the filter from the fq params is the
 same for every query - would that be cached and help speed things up?

 Thanks,
 Nalini


 On Wed, Dec 26, 2012 at 3:34 PM, Otis Gospodnetic 
 otis.gospodne...@gmail.com wrote:

  Hi,

 The fq *is* for filtering.

 What is your end goal, what are you trying to achieve?

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Dec 26, 2012 11:22 AM, Nalini Kartha nalinikar...@gmail.com
 wrote:

  Hi,

 I'm trying to figure out how to convert the fq params that are being passed
 to Solr into something that can be used to filter the results of a query
 that's being issued against the Lucene IndexSearcher (I'm modifying some
 Lucene code to issue the query so calling through to one of the
 SolrIndexSearcher methods would be ugly).

 Looks like one of the IndexSearcher.search(Query query, Filter filter, ...)
 methods would do what I want but I'm wondering if there's any easy way of
 converting the fq params into a Filter? Or is there a better way of doing
 all of this?

 Thanks,
 Nalini





Re: Converting fq params to Filter object

2012-12-27 Thread Nalini Kartha
Hi Eric,

Sorry, I think I wasn't very clear in explaining what we need to do.

We don't really need to do any complicated overriding, just want to change
the DirectSpellChecker to issue a query for every correction it finds *with
fq params from the original query taken into account* so that we can check
if the correction would actually result in some hits.

I was thinking of implementing this using the IndexSearcher.search(Query
query, Filter filter, int n) method where 'query' is a regular TermQuery
(the term is the correction) and 'filter' would represent the fq params.
What I'm not sure about is how to convert the fq params from Solr into a
Filter object and whether this is something we need to build ourselves or
if there's an existing API for this.

Also, I'm new to this code so not sure if I'm approaching this the wrong
way. Any advice/pointers are much appreciated.

Thanks,
Nalini



On Thu, Dec 27, 2012 at 12:53 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

 I think the answer is yes, that there's a better way to doing all of this.
  But I'm not yet sure what this all entails in your situation.  What are
 you overriding with the Lucene searches?   I imagine Solr has the
 flexibility to handle what you're trying to do without overriding anything
 core in SolrIndexSearcher.

 Generally, the way to get a custom filter in place is to create a custom
 query parser and use that for your fq parameter, like fq={!myparser
 param1='some value'}possible+expression+if+needed, so maybe that helps?
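
 [Editor's note: registering such a parser would be a solrconfig.xml entry
 along these lines; the name and class are hypothetical examples, not real
 Solr classes:]

```xml
<!-- Hypothetical plugin registration; once registered, it can be invoked
     per request as fq={!myparser param1='some value'}... -->
<queryParser name="myparser" class="com.example.MyQParserPlugin"/>
```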

 Tell us more about what you're doing specifically, and maybe we can guide
 you to a more elegant way to plug in any custom logic you want.

 Erik

 On Dec 26, 2012, at 11:21 , Nalini Kartha wrote:

  Hi,
 
  I'm trying to figure out how to convert the fq params that are being passed
  to Solr into something that can be used to filter the results of a query
  that's being issued against the Lucene IndexSearcher (I'm modifying some
  Lucene code to issue the query so calling through to one of the
  SolrIndexSearcher methods would be ugly).
 
  Looks like one of the IndexSearcher.search(Query query, Filter filter, ...)
  methods would do what I want but I'm wondering if there's any easy way of
  converting the fq params into a Filter? Or is there a better way of doing
  all of this?
 
  Thanks,
  Nalini




Re: Converting fq params to Filter object

2012-12-27 Thread Nalini Kartha
Hi James,

Yup, that was what I tried to do initially but it seems like calling
through to those Solr methods from DirectSpellChecker was not a good idea -
am I wrong? And like you mentioned, this seemed like it wasn't low-level
enough.

Erik: Unfortunately the collate functionality does not work for our use
case since the queries we're correcting are default OR. Here's the original
thread about this -

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

Thanks,
Nalini

On Thu, Dec 27, 2012 at 2:46 PM, Dyer, James
james.d...@ingramcontent.com wrote:

 https://issues.apache.org/jira/browse/SOLR-3240


Re: Converting fq params to Filter object

2012-12-26 Thread Nalini Kartha
Hi Otis,

Sorry, let me be more specific.

The end goal is for the DirectSpellChecker to make sure that the
corrections it is returning will return some results taking into account
the fq params included in the original query. This is a follow up question
to another question I had posted earlier -

http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

Initially, the way I was thinking of implementing this was to call one of
the SolrIndexSearcher.getDocSet() methods for every correction, passing in
the correction as the Query and a DocSet created from the fq queries. But I
didn't think that calling a SolrIndexSearcher method in Lucene code
(DirectSpellChecker) was a good idea. So I started looking at which method
on IndexSearcher would accomplish this. That's where I'm stuck trying to
figure out how to convert the fq params into a Filter object.

Does this approach make sense? Also I realize that this implementation is
probably non-performant but wanted to give it a try and measure how it
does. Any advice about what the perf overhead from issuing such queries for
say 50 corrections would be? Note that the filter from the fq params is the
same for every query - would that be cached and help speed things up?

Thanks,
Nalini


On Wed, Dec 26, 2012 at 3:34 PM, Otis Gospodnetic 
otis.gospodne...@gmail.com wrote:

 Hi,

 The fq *is* for filtering.

 What is your end goal, what are you trying to achieve?

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Dec 26, 2012 11:22 AM, Nalini Kartha nalinikar...@gmail.com wrote:

  Hi,
 
  I'm trying to figure out how to convert the fq params that are being passed
  to Solr into something that can be used to filter the results of a query
  that's being issued against the Lucene IndexSearcher (I'm modifying some
  Lucene code to issue the query so calling through to one of the
  SolrIndexSearcher methods would be ugly).
 
  Looks like one of the IndexSearcher.search(Query query, Filter filter, ...)
  methods would do what I want but I'm wondering if there's any easy way of
  converting the fq params into a Filter? Or is there a better way of doing
  all of this?
 
  Thanks,
  Nalini
 



Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

2012-12-21 Thread Nalini Kartha
Thanks for the advice.

Unfortunately what we really need is for the corrections to satisfy fq
params.

Was wondering how bad the perf would be if we're using the same DocSet (or
should it be an OpenBitSet? sorry, I'm still trying to figure all that code
out) for each 'correction query'? Seems like this is similar to how facet
counts are calculated?

Thanks,
Nalini


On Thu, Dec 20, 2012 at 12:12 PM, Dyer, James
james.d...@ingramcontent.com wrote:

 The spellchecker doesn't support checking the individual words against the
 index with fq applied.  This is only done for collations (and only if
 maxCollationTries is greater than 0).  Checking every suggested word
 individually against the index after applying filter queries is probably
 going to be very expensive no matter how you implement it.  However,
 someone with more lucene-core knowledge than I have might be able to give
 you better advice.

 If your root problem, though, is getting good did-you-mean-style
 suggestions with dismax queries and mm=0, and if you want to consider the
 case where some words in the query are misspelled and others are entirely
 irrelevant (and can't be corrected), then setting maxResultsForSuggest to
 a high value might give you the end result you want.  Unlike if you use 
 spellcheck.collateParam.mm=100%, it won't insist that the irrelevant
 terms (or a corrected irrelevant term) match anything.  On the other
 hand, it won't assume the query is Correctly
 Spelled just because you got some hits from it (because mm=0 will just
 cause the misspelled terms to be thrown out).

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Thursday, December 20, 2012 8:53 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Ensuring SpellChecker returns corrections which satisfy fq
 params for default OR query

 Hi James,

 I don't get how the spellcheck.maxResultsForSuggest param helps with making
 sure that the suggestions returned satisfy the fq params?

 That's the main problem we're trying to solve, how often suggestions are
 being returned is not really an issue for us at the moment.

 Thanks,
 Nalini


 On Wed, Dec 19, 2012 at 4:35 PM, Dyer, James
  james.d...@ingramcontent.com wrote:

  Instead of using spellcheck.collateParam.mm, try just setting
  spellcheck.maxResultsForSuggest to a very high value (you can use up to
  Integer.MAX_VALUE here).  So long as the user gets fewer results than
  whatever this is set for, you will get suggestions (and collations if
  desired).  I was just playing with this and if I am understanding you
  correctly think this combination of parameters will give you what you
 want:
 
  spellcheck=true
 
  spellcheck.dictionary=whatever
 
  spellcheck.maxResultsForSuggest=1000 (or whatever the cut off is
  before you don't want suggestions)
 
  spellcheck.count=20 (+/- depending on performance vs # suggestions
  required)
 
  spellcheck.maxCollationTries=10 (+/- depending on performance vs #
  suggestions required)
 
  spellcheck.maxCollations=10 (+/- depending on performance vs # suggestions
  required)
 
  spellcheck.collate=true
 
  spellcheck.collateExtendedResults=true
 
  James Dyer
  E-Commerce Systems
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Nalini Kartha [mailto:nalinikar...@gmail.com]
  Sent: Wednesday, December 19, 2012 2:06 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Ensuring SpellChecker returns corrections which satisfy fq
  params for default OR query
 
  Hi James,
 
   Yup the example you gave about sums it up. Reason we use an OR query is
   that we want the flexibility of every term not having to match but when it
   comes to corrections we want to be sure that the ones we pick will actually
   return results (we message the user with the corrected query so it would be
   bad/confusing if there were no matches for the corrections).
 
   *- by default the spellchecker doesn't see this as a problem because it has
   hits (mm=0 and wrapping matches something).  So you get neither
  individual words back nor collations from the spellchecker.*
  *
  *
  I think we would still get back 'papr - paper' as a correction and
  'christmas wrapping paper' as a collation in this case - I've seen that
  corrections are returned even for OR queries. Problem is these would be
  returned even if 'paper' doesn't exist in any docs that have
 item:in_stock.
 
   *- with spellcheck.collateParam.mm=100
  it tries to fix both papr and christmas but can't fix christmas
  because spelling isn't the problem here (it is an irrelevant term not in
  the index).  So while you get words suggested there are no collations.
  The
  individual words would be helpful, but you're not sure because they might
  all apply to items that do not match fq=item:in_stock.*
 
  Yup, exactly.
 
  Do you think the workaround I suggested would work (and not have terrible
  perf)? Or any other ideas?

Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

2012-12-20 Thread Nalini Kartha
Hi James,

I don't get how the spellcheck.maxResultsForSuggest param helps with making
sure that the suggestions returned satisfy the fq params?

That's the main problem we're trying to solve, how often suggestions are
being returned is not really an issue for us at the moment.

Thanks,
Nalini


On Wed, Dec 19, 2012 at 4:35 PM, Dyer, James
james.d...@ingramcontent.com wrote:

 Instead of using spellcheck.collateParam.mm, try just setting
 spellcheck.maxResultsForSuggest to a very high value (you can use up to
 Integer.MAX_VALUE here).  So long as the user gets fewer results than
 whatever this is set for, you will get suggestions (and collations if
 desired).  I was just playing with this and if I am understanding you
 correctly think this combination of parameters will give you what you want:

 spellcheck=true

 spellcheck.dictionary=whatever

 spellcheck.maxResultsForSuggest=1000 (or whatever the cut off is
 before you don't want suggestions)

 spellcheck.count=20 (+/- depending on performance vs # suggestions
 required)

 spellcheck.maxCollationTries=10 (+/- depending on performance vs #
 suggestions required)

 spellcheck.maxCollations=10 (+/- depending on performance vs # suggestions
 required)

 spellcheck.collate=true

 spellcheck.collateExtendedResults=true

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Wednesday, December 19, 2012 2:06 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Ensuring SpellChecker returns corrections which satisfy fq
 params for default OR query

 Hi James,

 Yup the example you gave about sums it up. Reason we use an OR query is
 that we want the flexibility of every term not having to match but when it
 comes to corrections we want to be sure that the ones we pick will actually
 return results (we message the user with the corrected query so it would be
 bad/confusing if there were no matches for the corrections).

 *- by default the spellchecker doesn't see this as a problem because it has
 hits (mm=0 and wrapping matches something).  So you get neither
 individual words back nor collations from the spellchecker.*
 *
 *
 I think we would still get back 'papr - paper' as a correction and
 'christmas wrapping paper' as a collation in this case - I've seen that
 corrections are returned even for OR queries. Problem is these would be
 returned even if 'paper' doesn't exist in any docs that have item:in_stock.

 *- with spellcheck.collateParam.mm=100
 it tries to fix both papr and christmas but can't fix christmas
 because spelling isn't the problem here (it is an irrelevant term not in
 the index).  So while you get words suggested there are no collations.  The
 individual words would be helpful, but you're not sure because they might
 all apply to items that do not match fq=item:in_stock.*

 Yup, exactly.

 Do you think the workaround I suggested would work (and not have terrible
 perf)? Or any other ideas?

 Thanks,
 Nalini


 On Wed, Dec 19, 2012 at 1:09 PM, Dyer, James
  james.d...@ingramcontent.com wrote:

  Let me try and get a better idea of what you're after.  Is it that your
  users might query a combination of irrelevant terms and misspelled terms,
  so you want the ability to ignore the irrelevant terms but still get
  suggestions for the misspelled terms?
 
  For instance if someone wanted q=christmas wrapping
  papr&mm=0&fq=item:in_stock, but christmas was not in the index and you
  wanted to return results for just wrapping paper, the problem is...
 
   - by default the spellchecker doesn't see this as a problem because it has
   hits (mm=0 and wrapping matches something).  So you get neither
  individual words back nor collations from the spellchecker.
 
  - with spellcheck.collateParam.mm=100 it tries to fix both papr and
  christmas but can't fix christmas because spelling isn't the problem
  here (it is an irrelevant term not in the index).  So while you get words
   suggested there are no collations.  The individual words would be helpful,
   but you're not sure because they might all apply to items that do not match
   fq=item:in_stock.
 
  Is this the problem?
 
  James Dyer
  E-Commerce Systems
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Nalini Kartha [mailto:nalinikar...@gmail.com]
  Sent: Wednesday, December 19, 2012 11:20 AM
  To: solr-user@lucene.apache.org
  Subject: Ensuring SpellChecker returns corrections which satisfy fq
 params
  for default OR query
 
  Hi,
 
  With the DirectSolrSpellChecker, we want to be able to make sure that the
  corrections that are being returned satisfy the fq params of the original
  query.
 
  The collate functionality helps with this but seems to only work with
  default AND queries - our use case is for default OR queries. I also saw
  that there is now a spellcheck.collateParam.XX param which allows you to
  override params

Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

2012-12-19 Thread Nalini Kartha
Hi James,

Yup the example you gave about sums it up. Reason we use an OR query is
that we want the flexibility of every term not having to match but when it
comes to corrections we want to be sure that the ones we pick will actually
return results (we message the user with the corrected query so it would be
bad/confusing if there were no matches for the corrections).

*- by default the spellchecker doesn't see this as a problem because it has
hits (mm=0 and wrapping matches something).  So you get neither
individual words back nor collations from the spellchecker.*
I think we would still get back 'papr - paper' as a correction and
'christmas wrapping paper' as a collation in this case - I've seen that
corrections are returned even for OR queries. Problem is these would be
returned even if 'paper' doesn't exist in any docs that have item:in_stock.

*- with spellcheck.collateParam.mm=100
it tries to fix both papr and christmas but can't fix christmas
because spelling isn't the problem here (it is an irrelevant term not in
the index).  So while you get words suggested there are no collations.  The
individual words would be helpful, but you're not sure because they might
all apply to items that do not match fq=item:in_stock.*

Yup, exactly.

Do you think the workaround I suggested would work (and not have terrible
perf)? Or any other ideas?

Thanks,
Nalini


On Wed, Dec 19, 2012 at 1:09 PM, Dyer, James
james.d...@ingramcontent.com wrote:

 Let me try and get a better idea of what you're after.  Is it that your
 users might query a combination of irrelevant terms and misspelled terms,
 so you want the ability to ignore the irrelevant terms but still get
 suggestions for the misspelled terms?

 For instance if someone wanted q=christmas wrapping
 papr&mm=0&fq=item:in_stock, but christmas was not in the index and you
 wanted to return results for just wrapping paper, the problem is...

 - by default the spellchecker doesn't see this as a problem because it has
 hits (mm=0 and wrapping matches something).  So you get neither
 individual words back nor collations from the spellchecker.

 - with spellcheck.collateParam.mm=100 it tries to fix both papr and
 christmas but can't fix christmas because spelling isn't the problem
 here (it is an irrelevant term not in the index).  So while you get words
 suggested there are no collations.  The individual words would be helpful,
 but you're not sure because they might all apply to items that do not match
 fq=item:in_stock.

 Is this the problem?

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Wednesday, December 19, 2012 11:20 AM
 To: solr-user@lucene.apache.org
 Subject: Ensuring SpellChecker returns corrections which satisfy fq params
 for default OR query

 Hi,

 With the DirectSolrSpellChecker, we want to be able to make sure that the
 corrections that are being returned satisfy the fq params of the original
 query.

 The collate functionality helps with this but seems to only work with
 default AND queries - our use case is for default OR queries. I also saw
 that there is now a spellcheck.collateParam.XX param which allows you to
 override params from the original query - the example mentioned was to
 override the mm param to be 100% which would make the collated query
 default AND. This doesn't quite do what we want either though because it
 seems like all collations would be thrown out if one of the correctly
 spelled terms in the query did not satisfy the fq params. We don't want it
 to check that the correctly spelled terms MUST be in results, just that
 each correction (individually) would result in some hits taking into
 account the fqs.

 I was wondering whether it is possible (and what the perf overhead would
 be) to use the SolrIndexSearcher.getDocSet(Query, DocSet) method to check
 that each correction being considered (the Query) matches some docs taking
 into account the fqs (the DocSet)?

 Would appreciate other suggestions/ideas if this isn't feasible.

 Thanks!

 - Nalini




Re: Differentiate between correctly spelled term and mis-spelled term with no corrections

2012-12-18 Thread Nalini Kartha
Got it. Thanks again for all the info! Will open a JIRA and follow up about
this sometime soon.

Thanks,
Nalini


On Fri, Dec 14, 2012 at 1:32 PM, Dyer, James
james.d...@ingramcontent.com wrote:

 Nalini,

 I don't think you can change the *default* response format until a new
 major release (so it's ok for Trunk/5.0 but not for the 4.x branch).  What
 you can do, however, is create a new spellcheck.xxx parameter to let
 users opt-in to the new functionality in 4.x as desired.  We'd also want to
 update solrj so java clients could easily use the new feature (see
 http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/solrj/src/java/org/apache/solr/client/solrj/response/SpellCheckResponse.java
 ).

 I'm not sure I ever heard someone wanting to combine suggestions from
 multiple cores before.  I'd be interested in hearing more about what you're
 trying to do.  But this does seem similar to the problem of combining
 suggestions between multiple SpellCheckers.  See
 https://issues.apache.org/jira/browse/SOLR-2993 , which adds a new
 spellchecker that corrects word break problems.  This added a new class,
 ConjunctionSolrSpellChecker that interleaves the results from the main
 String-Distance-based checker with results from the word break checker.
  You might be able to generalize this class to also be able to combine
 results from multiple DirectSolrSpellCheckers together.  While you want to
 get suggestions from multiple cores, others might want this feature to be
 able to have separate dictionaries per-field from the same core.

 I think it's ok to rank combined results by String Distance so long as you
 knew the same metric was applied to all.  This is in contrast to how it is
 with the Word Break spellchecker which uses an incompatible distance
 metric.  So for this case, ConjunctionSolrSpellChecker just interleaves the
 results round-robin.

 So expanding on ConjunctionSolrSpellChecker might be one possible way to
 accomplish what you want to do.  You might find something else that works
 better. For whatever you come up with, by all means open a JIRA issue and
 attach your work as a patch and see where it goes from there.  (subscribe
 to the dev list if you haven't already as that's where these type of
 discussions usually happen).

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311

 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Friday, December 14, 2012 11:06 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Differentiate between correctly spelled term and mis-spelled
 term with no corrections

 Hi James,

 Couple more follow up questions -

 1. Do changes to the response format have to be backwards compatible at
 this point? Seems like if we changed it to always return the origFreq even
 if there are no suggestions then that could break things right?
 2. For our purposes, we need to be able to order suggestions from multiple
 Solr cores so we were thinking of changing the format to also include the
 score that is calculated for each suggestion (which isn't exposed right
 now). Are these scores from different dictionary fields comparable
 (assuming we use the default INTERNAL_LEVENSHTEIN_DISTANCE metric)? And do
 you think this would be of general use i.e. could it be contributed back to
 Solr?

 Thanks,
 Nalini


 On Fri, Dec 7, 2012 at 2:20 PM, Nalini Kartha nalinikar...@gmail.com
 wrote:

  Ah I see what you mean. Will probably try to change the response to look
  like the internal shard one then.
 
  Thanks for the detailed explanation!
 
  - Nalini
 
 
  On Fri, Dec 7, 2012 at 1:38 PM, Dyer, James 
  james.d...@ingramcontent.com wrote:
 
  The response from the shards is different from the final spellcheck
  response in that it does include the term even if there are no
 suggestions
  for it.  So to get the behavior you want, we'd probably just have to
 make
  it so you could get the shard-to-shard-internal version.
 
  See
 
 http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java
 
  ...and method toNamedList(...)
 
  ...and this line:
 
  if (theSuggestions != null && (theSuggestions.size() > 0 ||
  shardRequest)) {
  ...
  }
 
  ...the shardRequest boolean is passed with true here if it's the 1st
  stage of a distributed request (from #process).  The various shards send
  their responses to the main shard which then integrates them together (in
  #finishStage).  Note that #finishStage always passes
 shardRequest=false to
  #toNamedList so that the end user gets a normal response back,
 omitting
  terms for which there are no suggestions.
 
  James Dyer
  E-Commerce Systems
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Nalini Kartha [mailto:nalinikar...@gmail.com]
  Sent: Friday, December 07, 2012 9:54 AM
  To: solr-user@lucene.apache.org
  Subject: Re: Differentiate between correctly spelled term and
 mis-spelled
  term

Re: Differentiate between correctly spelled term and mis-spelled term with no corrections

2012-12-14 Thread Nalini Kartha
Hi James,

Couple more follow up questions -

1. Do changes to the response format have to be backwards compatible at
this point? Seems like if we changed it to always return the origFreq even
if there are no suggestions then that could break things right?
2. For our purposes, we need to be able to order suggestions from multiple
Solr cores so we were thinking of changing the format to also include the
score that is calculated for each suggestion (which isn't exposed right
now). Are these scores from different dictionary fields comparable
(assuming we use the default INTERNAL_LEVENSHTEIN_DISTANCE metric)? And do
you think this would be of general use i.e. could it be contributed back to
Solr?

Thanks,
Nalini


On Fri, Dec 7, 2012 at 2:20 PM, Nalini Kartha nalinikar...@gmail.com wrote:

 Ah I see what you mean. Will probably try to change the response to look
 like the internal shard one then.

 Thanks for the detailed explanation!

 - Nalini


 On Fri, Dec 7, 2012 at 1:38 PM, Dyer, James 
 james.d...@ingramcontent.com wrote:

 The response from the shards is different from the final spellcheck
 response in that it does include the term even if there are no suggestions
 for it.  So to get the behavior you want, we'd probably just have to make
 it so you could get the shard-to-shard-internal version.

 See
 http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java

 ...and method toNamedList(...)

 ...and this line:

 if (theSuggestions != null && (theSuggestions.size() > 0 ||
 shardRequest)) {
 ...
 }

 ...the shardRequest boolean is passed with true here if it's the 1st
 stage of a distributed request (from #process).  The various shards send
 their responses to the main shard which then integrates them together (in
 #finishStage).  Note that #finishStage always passes shardRequest=false to
 #toNamedList so that the end user gets a normal response back, omitting
 terms for which there are no suggestions.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Friday, December 07, 2012 9:54 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Differentiate between correctly spelled term and mis-spelled
 term with no corrections

 Hi James,

 Thanks for the response, will open a JIRA for this.

 Had one follow-up question - how does the Distributed SpellCheckComponent
 handle this? I tried looking at the code but it's not obvious to me how it
 is able to differentiate between these 2 cases. I see that it only
 considers a term to be wrongly spelt if all shards return a suggestion for
 it but isn't it possible that a suggestion is not returned because nothing
 close enough could be found in some shard? Or is the response from shards
 different than the final spellcheck response we get from Solr in some way?

 Thanks,
 Nalini


 On Fri, Dec 7, 2012 at 10:26 AM, Dyer, James
 james.d...@ingramcontent.com wrote:

  You might want to open a jira issue for this to request that the feature
  be added.  If you haven't used it before, you need to create an account.
 
  https://issues.apache.org/jira/browse/SOLR
 
  In the mean time, If you need to get the document frequency of the query
  terms, see http://wiki.apache.org/solr/TermsComponent , which maybe
 would
  provide you a viable workaround.
 
  James Dyer
  E-Commerce Systems
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Nalini Kartha [mailto:nalinikar...@gmail.com]
  Sent: Thursday, December 06, 2012 2:44 PM
  To: solr-user@lucene.apache.org
  Subject: Differentiate between correctly spelled term and mis-spelled
 term
  with no corrections
 
  Hi,
 
  When using the SolrSpellChecker, is there currently any way to
  differentiate between a term that exists in the dictionary and a
  mis-spelled term for which no corrections were found when looking at the
  spellcheck response?
 
  From reading the doc and trying out some simple test cases it seems like
  there isn't - in both cases it looks like the response doesn't include
 the
  term.
 
  Could the extended results format be changed to include the original
 term
  frequency even if there are no suggestions? This would allow us to make
  this differentiation.
 
  Thanks,
  Nalini
 
 





Re: Differentiate between correctly spelled term and mis-spelled term with no corrections

2012-12-07 Thread Nalini Kartha
Hi James,

Thanks for the response, will open a JIRA for this.

Had one follow-up question - how does the Distributed SpellCheckComponent
handle this? I tried looking at the code but it's not obvious to me how it
is able to differentiate between these 2 cases. I see that it only
considers a term to be wrongly spelt if all shards return a suggestion for
it but isn't it possible that a suggestion is not returned because nothing
close enough could be found in some shard? Or is the response from shards
different than the final spellcheck response we get from Solr in some way?

Thanks,
Nalini


On Fri, Dec 7, 2012 at 10:26 AM, Dyer, James
james.d...@ingramcontent.com wrote:

 You might want to open a jira issue for this to request that the feature
 be added.  If you haven't used it before, you need to create an account.

 https://issues.apache.org/jira/browse/SOLR

 In the mean time, If you need to get the document frequency of the query
 terms, see http://wiki.apache.org/solr/TermsComponent , which maybe would
 provide you a viable workaround.
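 For illustration, a TermsComponent request that checks whether a single
 term exists in the index (and with what document frequency) might look
 like this; the core URL and the "spellcheck" field name are assumptions:

 ```
 http://localhost:8983/solr/terms?terms.fl=spellcheck
     &terms.lower=running&terms.lower.incl=true
     &terms.upper=running&terms.upper.incl=true
 ```

 If the term exists, the response lists it with its document frequency;
 if it is absent, the terms list comes back empty.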

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Thursday, December 06, 2012 2:44 PM
 To: solr-user@lucene.apache.org
 Subject: Differentiate between correctly spelled term and mis-spelled term
 with no corrections

 Hi,

 When using the SolrSpellChecker, is there currently any way to
 differentiate between a term that exists in the dictionary and a
 mis-spelled term for which no corrections were found when looking at the
 spellcheck response?

 From reading the doc and trying out some simple test cases it seems like
 there isn't - in both cases it looks like the response doesn't include the
 term.

 Could the extended results format be changed to include the original term
 frequency even if there are no suggestions? This would allow us to make
 this differentiation.

 Thanks,
 Nalini




Re: Differentiate between correctly spelled term and mis-spelled term with no corrections

2012-12-07 Thread Nalini Kartha
Ah I see what you mean. Will probably try to change the response to look
like the internal shard one then.

Thanks for the detailed explanation!

- Nalini


On Fri, Dec 7, 2012 at 1:38 PM, Dyer, James james.d...@ingramcontent.com wrote:

 The response from the shards is different from the final spellcheck
 response in that it does include the term even if there are no suggestions
 for it.  So to get the behavior you want, we'd probably just have to make
 it so you could get the shard-to-shard-internal version.

 See
 http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java

 ...and method toNamedList(...)

 ...and this line:

 if (theSuggestions != null && (theSuggestions.size() > 0 || shardRequest))
 {
 ...
 }

 ...the shardRequest boolean is passed with true here if it's the 1st
 stage of a distributed request (from #process).  The various shards send
 their responses to the main shard which then integrates them together (in
 #finishStage).  Note that #finishStage always passes shardRequest=false to
 #toNamedList so that the end user gets a normal response back, omitting
 terms for which there are no suggestions.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311


 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Friday, December 07, 2012 9:54 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Differentiate between correctly spelled term and mis-spelled
 term with no corrections

 Hi James,

 Thanks for the response, will open a JIRA for this.

 Had one follow-up question - how does the Distributed SpellCheckComponent
 handle this? I tried looking at the code but it's not obvious to me how it
 is able to differentiate between these 2 cases. I see that it only
 considers a term to be wrongly spelt if all shards return a suggestion for
 it but isn't it possible that a suggestion is not returned because nothing
 close enough could be found in some shard? Or is the response from shards
 different than the final spellcheck response we get from Solr in some way?

 Thanks,
 Nalini


 On Fri, Dec 7, 2012 at 10:26 AM, Dyer, James
 james.d...@ingramcontent.com wrote:

  You might want to open a jira issue for this to request that the feature
  be added.  If you haven't used it before, you need to create an account.
 
  https://issues.apache.org/jira/browse/SOLR
 
  In the mean time, If you need to get the document frequency of the query
  terms, see http://wiki.apache.org/solr/TermsComponent , which maybe
 would
  provide you a viable workaround.
 
  James Dyer
  E-Commerce Systems
  Ingram Content Group
  (615) 213-4311
 
 
  -Original Message-
  From: Nalini Kartha [mailto:nalinikar...@gmail.com]
  Sent: Thursday, December 06, 2012 2:44 PM
  To: solr-user@lucene.apache.org
  Subject: Differentiate between correctly spelled term and mis-spelled
 term
  with no corrections
 
  Hi,
 
  When using the SolrSpellChecker, is there currently any way to
  differentiate between a term that exists in the dictionary and a
  mis-spelled term for which no corrections were found when looking at the
  spellcheck response?
 
  From reading the doc and trying out some simple test cases it seems like
  there isn't - in both cases it looks like the response doesn't include
 the
  term.
 
  Could the extended results format be changed to include the original term
  frequency even if there are no suggestions? This would allow us to make
  this differentiation.
 
  Thanks,
  Nalini
 
 




minPrefix attribute of DirectSolrSpellChecker

2012-12-06 Thread Nalini Kartha
Hi,

In most of the examples I have seen for configuring the
DirectSolrSpellChecker the minPrefix attribute is set to 1 (and this is the
default value as well).

Is there any specific reason for this - would performance take a hit if it
was set to 0? We'd like to support returning corrections which don't start
with the same letter so just wanted to confirm that there aren't any issues
with changing this.
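For reference, minPrefix is configured on the spellchecker roughly like this (the field name is an assumption). Setting it to 0 is legal; the trade-off is that candidate corrections no longer have to share the query term's first character, so more of the term dictionary is examined per check:

```xml
<lst name="spellchecker">
  <str name="name">direct</str>
  <str name="classname">solr.DirectSolrSpellChecker</str>
  <str name="field">spellcheck</str>
  <!-- 0 = allow corrections that start with a different letter; -->
  <!-- the default of 1 prunes candidates to those sharing the first char -->
  <int name="minPrefix">0</int>
</lst>
```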

Thanks,
Nalini


Re: Using multiple DirectSolrSpellcheckers for a query

2012-03-12 Thread Nalini Kartha
Hi James/Robert,

Thanks for the responses.

Robert: What is it about the current APIs that makes this hard? How
much/what kind of refactoring would open this up?

James: I didn't quite understand the usage you suggested. I thought that
the spellcheck.q param shouldn't include field names, etc and that the
purpose of specifying this param is to avoid the extra parsing out of the
field names, etc. from the q param to get the query terms for spell
checking. This is based on this bit in the SpellCheckComponent wiki -

 The spellcheck.q parameter is intended to be the original query, minus
any extra markup like field names, boosts, etc.

Did I misunderstand something?

I agree that it's impossible to know if the query "run" should be corrected
to "sun" or "running" in the example I gave but I guess I'm asking more
from the angle of how to avoid correcting terms that will be matched
because they exist in other more processed fields that are being searched.
Since the recommendation is to build spellcheck fields from minimally
processed source fields, seems like this would be a common problem?

And another kind of unrelated question - all the examples of spellcheck
dictionaries I've seen in sample solrconfig.xmls have minPrefix set to 1.
Is this for performance reasons? And with this setting, we wouldn't get
"run" as a correction for "eon" right?

Thanks,
Nalini

On Wed, Mar 7, 2012 at 11:04 AM, Robert Muir rcm...@gmail.com wrote:

 On Wed, Jan 25, 2012 at 12:55 PM, Nalini Kartha nalinikar...@gmail.com
 wrote:
 
  Is there any reason why Solr doesn't support using multiple spellcheckers
  for a query? Is it because of performance overhead?
 

 That's not the case really, see
 https://issues.apache.org/jira/browse/SOLR-2926

 I think the issue is that the spellchecker APIs need to be extended to
 allow this to happen easier, there is no real hard
 performance/technical/algorithmic issue, it's just a matter of
 refactoring spellchecker APIs to allow this!

 --
 lucidimagination.com



Re: Using multiple DirectSolrSpellcheckers for a query

2012-03-06 Thread Nalini Kartha
Hi James,

Thanks for the detailed reply and sorry for the delay getting back.

One issue for us with using the collate functionality is that some of our
query types  are default OR (implemented using the mm param value). Since
the collate functionality reruns the query using all param values specified
in the original query, it'll effectively be issuing an OR query again
right? Which means that again we could end up with corrections which aren't
the best for the current query?

Another issue we're running into is that we're using unstemmed fields as
the source for our spell correction field and so we could end up
unnecessarily correcting queries containing stemmed versions of words.

So for eg. if I have a document containing running my fields look like
this -

docUnstemmed: running
docStemmed: run, ...
spellcheck: running

If a user searches for "run OR jump", there are matching results (since we
search against both the stemmed and unstemmed fields) but the spellcheck
results will contain corrections for "run", let's say "sun". We don't want
to overcorrect queries which are returning valid results like this one. Any
suggestions for how to deal with this?

I was thinking that there might be value in having another dictionary which
is used for vetting words but not for finding corrections - the stemmed
fields could be used as a source for this dictionary. So before finding
corrections for a term if it doesn't exist in the primary dictionary, check
the secondary dictionary and make sure the term does not exist in it as
well. But then, this would require an extra copyfield (we could have
multiple unstemmed fields as a source for this secondary dictionary) and
bloat the index even more so I'm not sure if it's feasible.
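A sketch of what that secondary "vetting" dictionary would look like in schema.xml, reusing the field names from the example above ("spellcheckVetting" is a hypothetical name):

```xml
<!-- primary dictionary source: unstemmed terms, used to find corrections -->
<copyField source="docUnstemmed" dest="spellcheck"/>
<!-- hypothetical secondary dictionary: stemmed terms, consulted only to -->
<!-- decide whether a term needs correcting at all, never for suggestions -->
<copyField source="docStemmed" dest="spellcheckVetting"/>
```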

Thanks,
Nalini

On Thu, Jan 26, 2012 at 10:23 AM, Dyer, James james.d...@ingrambook.com wrote:

 Nalini,

 Right now the best you can do is to use copyField to combine everything
 into a catch-all for spellchecking purposes.  While this seems wasteful,
 this often has to be done anyhow because typically you'll need
 less/different analysis for spellchecking than for searching.  But rather
 than having separate copyFields to create multiple dictionaries, put
 everything into one field to create a single master dictionary.

 From there, you need to set spellcheck.collate to true and also
 spellcheck.maxCollationTries greater than zero (5-10 usually works).  The
 first parameter tells it to generate re-written queries with spelling
 suggestions (collations).  The second parameter tells it to weed out any
 collations that won't generate hits if you re-query them.  This is
 important because having unrelated keywords in your master dictionary will
 increase the chances the spellchecker will pick the wrong words as
 corrections.
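 In solrconfig.xml those two parameters would sit in the search handler's
 defaults, something like the following sketch (the handler name and the
 value 5 are just placeholders):

 ```xml
 <requestHandler name="/select" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="spellcheck">true</str>
     <!-- generate re-written queries using the suggestions -->
     <str name="spellcheck.collate">true</str>
     <!-- re-try up to 5 collations and drop any that return no hits -->
     <int name="spellcheck.maxCollationTries">5</int>
   </lst>
   <arr name="last-components">
     <str>spellcheck</str>
   </arr>
 </requestHandler>
 ```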

 There is a significant caveat to this:  The spellchecker typically only
 suggests for words in the dictionary.  So by creating a huge, master
 dictionary you might find that many misspelled words won't generate
 suggestions.  See this thread for some workarounds:
 http://lucene.472066.n3.nabble.com/Improving-Solr-Spell-Checker-Results-td3658411.html

 I think having multiple, per-field dictionaries as you suggest might be a
 good way to go.  While this is not supported, I don't think it's because of
 performance concerns.  (There would be an overhead cost to this but I think
 it would still be practical).  It just hasn't been implemented yet.  But we
 might be getting to a possible start to this type of functionality.  In
 https://issues.apache.org/jira/browse/SOLR-2585 a separate spellchecker
 is added that just corrects wordbreak (or is it word break?) problems,
 then a ConjunctionSolrSpellChecker combines the results from the main
 spellchecker and the wordbreak spellchecker.  I could see a next step beyond
 this being to support per-field dictionaries, checking them separately,
 then combining the results.

 James Dyer
 E-Commerce Systems
 Ingram Content Group
 (615) 213-4311

 -Original Message-
 From: Nalini Kartha [mailto:nalinikar...@gmail.com]
 Sent: Wednesday, January 25, 2012 11:56 AM
 To: solr-user@lucene.apache.org
 Subject: Using multiple DirectSolrSpellcheckers for a query

 Hi,

 We are trying to use the DirectSolrSpellChecker to get corrections for
 mis-spelled query terms directly from fields in the Solr index.

 However, we need to use multiple fields for spellchecking a query. It looks
 like you can only use one spellchecker for a request and so the
 workaround for this is to create a copy field from the fields required for
 spell correction?

 We'd like to avoid this because we allow users to perform different kinds
 of queries on different sets of fields and so to provide meaningful
 corrections we'd have to create multiple copy fields - one for each query
 type.

 Is there any reason why Solr doesn't support using multiple spellcheckers
 for a query? Is it because of performance overhead?

 Thanks,
 Nalini



Using multiple DirectSolrSpellcheckers for a query

2012-01-25 Thread Nalini Kartha
Hi,

We are trying to use the DirectSolrSpellChecker to get corrections for
mis-spelled query terms directly from fields in the Solr index.

However, we need to use multiple fields for spellchecking a query. It looks
like you can only use one spellchecker for a request and so the
workaround for this is to create a copy field from the fields required for
spell correction?

We'd like to avoid this because we allow users to perform different kinds
of queries on different sets of fields and so to provide meaningful
corrections we'd have to create multiple copy fields - one for each query
type.

Is there any reason why Solr doesn't support using multiple spellcheckers
for a query? Is it because of performance overhead?

Thanks,
Nalini