Getting term offsets from Solr
Hi,

We're looking at implementing highlighting for some fields which may be too large to store in the index. As an alternative to using the Solr Highlighter (which needs fields to be stored), I was wondering if a) the offsets of terms are stored BY DEFAULT in the index (even if we're not using the TermVectorComponent) and if so, b) is there a way to get the offset information from Solr.

Thanks,
Nalini
Re: Getting term offsets from Solr
Thanks for the reply. We tried enabling these options but that's also causing too much index bloat, so I was wondering if there's a way to get at the offset information more cheaply?

Thanks,
Nalini

On Fri, Sep 20, 2013 at 4:41 PM, Jack Krupansky j...@basetechnology.com wrote:

Set: termVectors=true, termPositions=true, termOffsets=true. And use the fast vector highlighter.

--
Jack Krupansky

-----Original Message-----
From: Nalini Kartha
Sent: Friday, September 20, 2013 7:34 PM
To: solr-user@lucene.apache.org
Subject: Getting term offsets from Solr

Hi,

We're looking at implementing highlighting for some fields which may be too large to store in the index. As an alternative to using the Solr Highlighter (which needs fields to be stored), I was wondering if a) the offsets of terms are stored BY DEFAULT in the index (even if we're not using the TermVectorComponent) and if so, b) is there a way to get the offset information from Solr.

Thanks,
Nalini
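For reference, the flags Jack mentions correspond to per-field attributes in schema.xml. A minimal sketch (the field and type names here are examples, not taken from the thread):

```xml
<!-- Hypothetical field definition enabling term vectors with positions and
     offsets, as required by the FastVectorHighlighter. Each of these
     attributes adds to index size, which is the bloat discussed below. -->
<field name="body" type="text_general" indexed="true" stored="true"
       termVectors="true" termPositions="true" termOffsets="true"/>
```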
Re: Getting term offsets from Solr
I'm wondering if storing just the offset as a payload would be cheaper from a storage perspective than enabling termOffsets, termVectors and termPositions? Maybe we could then get the offset info to return with results from there?

Thanks,
Nalini

On Fri, Sep 20, 2013 at 5:02 PM, Nalini Kartha nalinikar...@gmail.com wrote:

Thanks for the reply. We tried enabling these options but that's also causing too much index bloat, so I was wondering if there's a way to get at the offset information more cheaply?

Thanks,
Nalini

On Fri, Sep 20, 2013 at 4:41 PM, Jack Krupansky j...@basetechnology.com wrote:

Set: termVectors=true, termPositions=true, termOffsets=true. And use the fast vector highlighter.

--
Jack Krupansky

-----Original Message-----
From: Nalini Kartha
Sent: Friday, September 20, 2013 7:34 PM
To: solr-user@lucene.apache.org
Subject: Getting term offsets from Solr

Hi,

We're looking at implementing highlighting for some fields which may be too large to store in the index. As an alternative to using the Solr Highlighter (which needs fields to be stored), I was wondering if a) the offsets of terms are stored BY DEFAULT in the index (even if we're not using the TermVectorComponent) and if so, b) is there a way to get the offset information from Solr.

Thanks,
Nalini
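One way to sketch the payload idea with the Lucene 4.x analysis API is a TokenFilter that packs each token's start offset into a 4-byte payload. This is an untested illustration, not an endorsed approach; the class name is hypothetical, and whether it is actually cheaper than term vectors would need measuring:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

/**
 * Hypothetical filter that stores each token's start offset as a 4-byte
 * big-endian payload. Payloads live in the postings lists, so this avoids
 * term vectors entirely, at the cost of roughly 4 bytes per posting plus
 * payload bookkeeping in the index.
 */
public final class OffsetPayloadFilter extends TokenFilter {
  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public OffsetPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    int start = offsetAtt.startOffset();
    byte[] bytes = new byte[] {
        (byte) (start >>> 24), (byte) (start >>> 16),
        (byte) (start >>> 8), (byte) start
    };
    payloadAtt.setPayload(new BytesRef(bytes));
    return true;
  }
}
```

At query time the payloads could be read back per hit term via the postings API, which is more work than the highlighter but skips storing vectors.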
Re: Converting fq params to Filter object
Hi James,

We're using Solr, but the reason I wanted to issue the queries from DirectSpellChecker was so that we don't end up returning a bunch of corrections from suggestSimilar() which later get weeded out when we run the extra correction queries, because they would return no hits taking fqs into account. If we were able to issue the queries at the time of building up the list of corrections, then we know that they are all valid.

Thanks for the pointer to the EarlyTerminatingCollector, that seems like it would improve perf a lot. I'm still not sure if there's an easy way to build the Filter object from the fq params though, will keep digging around. If someone could point me to any code that does this conversion (I'm guessing that conversion needs to be done at some point for regular queries when Solr calls into Lucene, but I could be wrong) that would be much appreciated.

Thanks,
Nalini

On Thu, Dec 27, 2012 at 4:28 PM, Dyer, James james.d...@ingramcontent.com wrote:

Nalini,

Assuming that you're using Solr, the hook into the collate functionality is in SpellCheckComponent#addCollationsToResponse. To do what you want, you would have to modify the call to SpellCheckCollator to issue test queries against the individual words instead of the collations. See http://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java

Of course, if you're using Lucene directly and not Solr, then you would want to build a series of queries that each query one word with the filters applied. DirectSpellChecker#suggestSimilar returns an array of SuggestWord instances that contain the individual words you would want to try.

To optimize this, you can use the same approach as in SOLR-3240, implementing a Collector that only looks for 1 document then quits.
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Nalini Kartha [mailto:nalinikar...@gmail.com]
Sent: Thursday, December 27, 2012 2:31 PM
To: solr-user@lucene.apache.org
Subject: Re: Converting fq params to Filter object

Hi James,

Yup, that was what I tried to do initially but it seems like calling through to those Solr methods from DirectSpellChecker was not a good idea - am I wrong? And like you mentioned, this seemed like it wasn't low-level enough.

Erik: Unfortunately the collate functionality does not work for our use case since the queries we're correcting are default OR. Here's the original thread about this - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

Thanks,
Nalini

On Thu, Dec 27, 2012 at 2:46 PM, Dyer, James james.d...@ingramcontent.com wrote:

https://issues.apache.org/jira/browse/SOLR-3240
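The "Collector that only looks for 1 document then quits" that James describes (the SOLR-3240 idea) can be sketched against the Lucene 4.x Collector API roughly like this. The class name is hypothetical and this is an illustrative sketch, not the actual SOLR-3240 patch:

```java
import java.io.IOException;
import org.apache.lucene.index.AtomicReaderContext;
import org.apache.lucene.search.CollectionTerminatedException;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.Scorer;

/**
 * Stops collecting as soon as a single hit is found. Throwing
 * CollectionTerminatedException is the sanctioned way to abort collection
 * for the current segment; the matched flag lets later segments be skipped
 * too. Useful for "does this corrected query match anything?" checks.
 */
public class FirstHitCollector extends Collector {
  private boolean matched = false;

  @Override
  public void setScorer(Scorer scorer) {
    // scores are irrelevant; we only care whether any doc matches
  }

  @Override
  public void collect(int doc) throws IOException {
    matched = true;
    throw new CollectionTerminatedException();
  }

  @Override
  public void setNextReader(AtomicReaderContext context) throws IOException {
    if (matched) {
      // already found a hit; skip this segment entirely
      throw new CollectionTerminatedException();
    }
  }

  @Override
  public boolean acceptsDocsOutOfOrder() {
    return true; // any doc order is fine when only existence matters
  }

  public boolean matched() {
    return matched;
  }
}
```

Usage would be along the lines of `searcher.search(query, filter, collector)` followed by `collector.matched()`, one collector per candidate correction.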
Re: Converting fq params to Filter object
Hi Lance,

Thanks for the response. I didn't quite understand how to issue the queries from DirectSpellChecker with the fq params applied like you were suggesting - could you point me to the API that can be used for this? Also, we haven't benchmarked the DirectSpellChecker against the IndexBasedSpellChecker.

I considered issuing one large OR query with all corrections, but that doesn't ensure that *every* correction would return some hits with the fq params applied; it only tells us that some correction returned hits, so this isn't restrictive enough for us. And ANDing the corrections together becomes too restrictive, since it requires that *all* corrections exist in the same documents instead of checking that they individually exist in some docs (which satisfy the filter queries, of course).

Thanks,
Nalini

On Wed, Dec 26, 2012 at 9:32 PM, Lance Norskog goks...@gmail.com wrote:

A Solr facet query does a boolean query, caches the Lucene facet data structure, and uses it as a Lucene filter. After that, until you do a full commit, using the same fq= string (you must match the string exactly) fetches the cached data structure and uses it again as a Lucene filter.

Have you benchmarked the DirectSpellChecker against IndexBasedSpellChecker? If you use the fq= filter query as the spellcheck.q= query it should use the cached filter. Also, since you are checking all words against the same filter query, can you just do one large OR query with all of the words?

On 12/26/2012 03:10 PM, Nalini Kartha wrote:

Hi Otis,

Sorry, let me be more specific. The end goal is for the DirectSpellChecker to make sure that the corrections it is returning will return some results taking into account the fq params included in the original query.
This is a follow-up question to another question I had posted earlier - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3CCAMqOzYFTgiWyRbvwSdF0hFZ1SZNkQ9gnBJfDb_OBNeLsMvR0XA@mail.gmail.com%3E

Initially, the way I was thinking of implementing this was to call one of the SolrIndexSearcher.getDocSet() methods for every correction, passing in the correction as the Query and a DocSet created from the fq queries. But I didn't think that calling a SolrIndexSearcher method in Lucene code (DirectSpellChecker) was a good idea. So I started looking at which method on IndexSearcher would accomplish this. That's where I'm stuck, trying to figure out how to convert the fq params into a Filter object. Does this approach make sense?

Also, I realize that this implementation is probably non-performant but wanted to give it a try and measure how it does. Any advice about what the perf overhead from issuing such queries for, say, 50 corrections would be? Note that the filter from the fq params is the same for every query - would that be cached and help speed things up?

Thanks,
Nalini

On Wed, Dec 26, 2012 at 3:34 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi,

The fq *is* for filtering. What is your end goal, what are you trying to achieve?

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Dec 26, 2012 11:22 AM, Nalini Kartha nalinikar...@gmail.com wrote:

Hi,

I'm trying to figure out how to convert the fq params that are being passed to Solr into something that can be used to filter the results of a query that's being issued against the Lucene IndexSearcher (I'm modifying some Lucene code to issue the query, so calling through to one of the SolrIndexSearcher methods would be ugly). Looks like one of the IndexSearcher.search(Query query, Filter filter, ...) methods would do what I want, but I'm wondering if there's an easy way of converting the fq params into a Filter? Or is there a better way of doing all of this?

Thanks,
Nalini
Re: Converting fq params to Filter object
Hi Erik,

Sorry, I think I wasn't very clear in explaining what we need to do. We don't really need to do any complicated overriding; we just want to change the DirectSpellChecker to issue a query for every correction it finds *with fq params from the original query taken into account*, so that we can check if the correction would actually result in some hits.

I was thinking of implementing this using the IndexSearcher.search(Query query, Filter filter, int n) method, where 'query' is a regular TermQuery (the term is the correction) and 'filter' would represent the fq params. What I'm not sure about is how to convert the fq params from Solr into a Filter object, and whether this is something we need to build ourselves or if there's an existing API for this.

Also, I'm new to this code so not sure if I'm approaching this the wrong way. Any advice/pointers are much appreciated.

Thanks,
Nalini

On Thu, Dec 27, 2012 at 12:53 PM, Erik Hatcher erik.hatc...@gmail.com wrote:

I think the answer is yes, that there's a better way to do all of this. But I'm not yet sure what this all entails in your situation. What are you overriding with the Lucene searches? I imagine Solr has the flexibility to handle what you're trying to do without overriding anything core in SolrIndexSearcher.

Generally, the way to get a custom filter in place is to create a custom query parser and use that for your fq parameter, like fq={!myparser param1='some value'}possible+expression+if+needed, so maybe that helps?

Tell us more about what you're doing specifically, and maybe we can guide you to a more elegant way to plug in any custom logic you want.

Erik

On Dec 26, 2012, at 11:21, Nalini Kartha wrote:

Hi,

I'm trying to figure out how to convert the fq params that are being passed to Solr into something that can be used to filter the results of a query that's being issued against the Lucene IndexSearcher (I'm modifying some Lucene code to issue the query, so calling through to one of the SolrIndexSearcher methods would be ugly). Looks like one of the IndexSearcher.search(Query query, Filter filter, ...) methods would do what I want, but I'm wondering if there's an easy way of converting the fq params into a Filter? Or is there a better way of doing all of this?

Thanks,
Nalini
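For what it's worth, the pure-Lucene side of the conversion Nalini asks about can be sketched with the 4.x API: parse each fq string into a Query, AND them together, and wrap the result in a QueryWrapperFilter. This is an illustrative sketch only; the field name "text", the variables fqParams/searcher/correction, and the use of the classic QueryParser (inside Solr the fq strings would normally be parsed with QParser.getParser instead) are all assumptions:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.QueryWrapperFilter;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.Version;

// Build one Filter representing all fq params ANDed together.
BooleanQuery filterQuery = new BooleanQuery();
QueryParser parser = new QueryParser(Version.LUCENE_40, "text",
    new StandardAnalyzer(Version.LUCENE_40));
for (String fq : fqParams) {          // fqParams: the raw fq strings
  filterQuery.add(parser.parse(fq), Occur.MUST);
}
Filter filter = new QueryWrapperFilter(filterQuery);

// One cheap test query per suggested correction: does it match anything
// once the filter is applied? Only the top-1 hit is requested.
TopDocs hits = searcher.search(
    new TermQuery(new Term("text", correction)), filter, 1);
boolean correctionIsValid = hits.totalHits > 0;
```

Note that a raw QueryWrapperFilter is not cached across calls; Lucene's CachingWrapperFilter (or staying inside Solr, where the filterCache handles this) would avoid recomputing the filter for each of the 50-odd correction queries.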
Re: Converting fq params to Filter object
Hi James,

Yup, that was what I tried to do initially but it seems like calling through to those Solr methods from DirectSpellChecker was not a good idea - am I wrong? And like you mentioned, this seemed like it wasn't low-level enough.

Erik: Unfortunately the collate functionality does not work for our use case since the queries we're correcting are default OR. Here's the original thread about this - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

Thanks,
Nalini

On Thu, Dec 27, 2012 at 2:46 PM, Dyer, James james.d...@ingramcontent.com wrote:

https://issues.apache.org/jira/browse/SOLR-3240
Re: Converting fq params to Filter object
Hi Otis,

Sorry, let me be more specific. The end goal is for the DirectSpellChecker to make sure that the corrections it is returning will return some results taking into account the fq params included in the original query. This is a follow-up question to another question I had posted earlier - http://mail-archives.apache.org/mod_mbox/lucene-solr-user/201212.mbox/%3ccamqozyftgiwyrbvwsdf0hfz1sznkq9gnbjfdb_obnelsmvr...@mail.gmail.com%3E

Initially, the way I was thinking of implementing this was to call one of the SolrIndexSearcher.getDocSet() methods for every correction, passing in the correction as the Query and a DocSet created from the fq queries. But I didn't think that calling a SolrIndexSearcher method in Lucene code (DirectSpellChecker) was a good idea. So I started looking at which method on IndexSearcher would accomplish this. That's where I'm stuck, trying to figure out how to convert the fq params into a Filter object. Does this approach make sense?

Also, I realize that this implementation is probably non-performant but wanted to give it a try and measure how it does. Any advice about what the perf overhead from issuing such queries for, say, 50 corrections would be? Note that the filter from the fq params is the same for every query - would that be cached and help speed things up?

Thanks,
Nalini

On Wed, Dec 26, 2012 at 3:34 PM, Otis Gospodnetic otis.gospodne...@gmail.com wrote:

Hi,

The fq *is* for filtering. What is your end goal, what are you trying to achieve?

Otis
Solr & ElasticSearch Support
http://sematext.com/

On Dec 26, 2012 11:22 AM, Nalini Kartha nalinikar...@gmail.com wrote:

Hi,

I'm trying to figure out how to convert the fq params that are being passed to Solr into something that can be used to filter the results of a query that's being issued against the Lucene IndexSearcher (I'm modifying some Lucene code to issue the query, so calling through to one of the SolrIndexSearcher methods would be ugly). Looks like one of the IndexSearcher.search(Query query, Filter filter, ...) methods would do what I want, but I'm wondering if there's an easy way of converting the fq params into a Filter? Or is there a better way of doing all of this?

Thanks,
Nalini
Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query
Thanks for the advice. Unfortunately, what we really need is for the corrections to satisfy the fq params. Was wondering how bad the perf would be if we're using the same DocSet (or should it be an OpenBitSet? sorry, I'm still trying to figure all that code out) for each 'correction query'? Seems like this is similar to how facet counts are calculated?

Thanks,
Nalini

On Thu, Dec 20, 2012 at 12:12 PM, Dyer, James james.d...@ingramcontent.com wrote:

The spellchecker doesn't support checking the individual words against the index with fq applied. This is only done for collations (and only if maxCollationTries is greater than 0). Checking every suggested word individually against the index after applying filter queries is probably going to be very expensive no matter how you implement it. However, someone with more lucene-core knowledge than I have might be able to give you better advice.

If your root problem, though, is getting good did-you-mean-style suggestions with dismax queries and mm=0, and if you want to consider the case where some words in the query are misspelled and others are entirely irrelevant (and can't be corrected), then setting maxResultsForSuggest to a high value might give you the end result you want. Unlike if you use spellcheck.collateParam.mm=100%, it won't insist that the irrelevant terms (or a corrected irrelevant term) match anything. On the other hand, it won't assume the query is Correctly Spelled just because you got some hits from it (because mm=0 will just cause the misspelled terms to be thrown out).
James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Nalini Kartha [mailto:nalinikar...@gmail.com]
Sent: Thursday, December 20, 2012 8:53 AM
To: solr-user@lucene.apache.org
Subject: Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

Hi James,

I don't get how the spellcheck.maxResultsForSuggest param helps with making sure that the suggestions returned satisfy the fq params? That's the main problem we're trying to solve; how often suggestions are being returned is not really an issue for us at the moment.

Thanks,
Nalini

On Wed, Dec 19, 2012 at 4:35 PM, Dyer, James james.d...@ingramcontent.com wrote:

Instead of using spellcheck.collateParam.mm, try just setting spellcheck.maxResultsForSuggest to a very high value (you can use up to Integer.MAX_VALUE here). So long as the user gets fewer results than whatever this is set for, you will get suggestions (and collations if desired). I was just playing with this, and if I am understanding you correctly, I think this combination of parameters will give you what you want:

spellcheck=true
spellcheck.dictionary=whatever
spellcheck.maxResultsForSuggest=1000 (or whatever the cut-off is before you don't want suggestions)
spellcheck.count=20 (+/- depending on performance vs # suggestions required)
spellcheck.maxCollationTries=10 (+/- depending on performance vs # suggestions required)
spellcheck.maxCollations=10 (+/- depending on performance vs # suggestions required)
spellcheck.collate=true
spellcheck.collateExtendedResults=true

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Nalini Kartha [mailto:nalinikar...@gmail.com]
Sent: Wednesday, December 19, 2012 2:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

Hi James,

Yup, the example you gave about sums it up. The reason we use an OR query is that we want the flexibility of every term not having to match, but when it comes to corrections we want to be sure that the ones we pick will actually return results (we message the user with the corrected query so it would be bad/confusing if there were no matches for the corrections).

*- by default the spellchecker doesn't see this as a problem because it has hits (mm=0 and wrapping matches something). So you get neither individual words back nor collations from the spellchecker.*

I think we would still get back 'papr' -> 'paper' as a correction and 'christmas wrapping paper' as a collation in this case - I've seen that corrections are returned even for OR queries. The problem is these would be returned even if 'paper' doesn't exist in any docs that have item:in_stock.

*- with spellcheck.collateParam.mm=100 it tries to fix both papr and christmas but can't fix christmas because spelling isn't the problem here (it is an irrelevant term not in the index). So while you get words suggested there are no collations. The individual words would be helpful, but you're not sure because they might all apply to items that do not match fq=item:in_stock.*

Yup, exactly. Do you think the workaround I suggested would work (and not have terrible perf)? Or any other ideas?

Thanks,
Nalini
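Putting James's suggested parameters together, a request might look like the following. This is a hypothetical example: the host, core, field name, and query values are made up for illustration:

```
http://localhost:8983/solr/select?q=christmas+wrapping+papr
    &mm=0&fq=item:in_stock
    &spellcheck=true
    &spellcheck.dictionary=default
    &spellcheck.maxResultsForSuggest=1000
    &spellcheck.count=20
    &spellcheck.maxCollationTries=10
    &spellcheck.maxCollations=10
    &spellcheck.collate=true
    &spellcheck.collateExtendedResults=true
```

(Line breaks added for readability; a real request would be one line.)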
Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query
Hi James,

I don't get how the spellcheck.maxResultsForSuggest param helps with making sure that the suggestions returned satisfy the fq params? That's the main problem we're trying to solve; how often suggestions are being returned is not really an issue for us at the moment.

Thanks,
Nalini

On Wed, Dec 19, 2012 at 4:35 PM, Dyer, James james.d...@ingramcontent.com wrote:

Instead of using spellcheck.collateParam.mm, try just setting spellcheck.maxResultsForSuggest to a very high value (you can use up to Integer.MAX_VALUE here). So long as the user gets fewer results than whatever this is set for, you will get suggestions (and collations if desired). I was just playing with this, and if I am understanding you correctly, I think this combination of parameters will give you what you want:

spellcheck=true
spellcheck.dictionary=whatever
spellcheck.maxResultsForSuggest=1000 (or whatever the cut-off is before you don't want suggestions)
spellcheck.count=20 (+/- depending on performance vs # suggestions required)
spellcheck.maxCollationTries=10 (+/- depending on performance vs # suggestions required)
spellcheck.maxCollations=10 (+/- depending on performance vs # suggestions required)
spellcheck.collate=true
spellcheck.collateExtendedResults=true

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Nalini Kartha [mailto:nalinikar...@gmail.com]
Sent: Wednesday, December 19, 2012 2:06 PM
To: solr-user@lucene.apache.org
Subject: Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

Hi James,

Yup, the example you gave about sums it up.

*- by default the spellchecker doesn't see this as a problem because it has hits (mm=0 and wrapping matches something). So you get neither individual words back nor collations from the spellchecker.*

I think we would still get back 'papr' -> 'paper' as a correction and 'christmas wrapping paper' as a collation in this case - I've seen that corrections are returned even for OR queries. The problem is these would be returned even if 'paper' doesn't exist in any docs that have item:in_stock.

*- with spellcheck.collateParam.mm=100 it tries to fix both papr and christmas but can't fix christmas because spelling isn't the problem here (it is an irrelevant term not in the index). So while you get words suggested there are no collations. The individual words would be helpful, but you're not sure because they might all apply to items that do not match fq=item:in_stock.*

Yup, exactly. Do you think the workaround I suggested would work (and not have terrible perf)? Or any other ideas?

Thanks,
Nalini

On Wed, Dec 19, 2012 at 1:09 PM, Dyer, James james.d...@ingramcontent.com wrote:

Let me try and get a better idea of what you're after. Is it that your users might query a combination of irrelevant terms and misspelled terms, so you want the ability to ignore the irrelevant terms but still get suggestions for the misspelled terms?

For instance, if someone wanted q=christmas wrapping papr&mm=0&fq=item:in_stock, but christmas was not in the index and you wanted to return results for just wrapping paper, the problem is...

- by default the spellchecker doesn't see this as a problem because it has hits (mm=0 and wrapping matches something). So you get neither individual words back nor collations from the spellchecker.

- with spellcheck.collateParam.mm=100 it tries to fix both papr and christmas but can't fix christmas because spelling isn't the problem here (it is an irrelevant term not in the index). So while you get words suggested there are no collations. The individual words would be helpful, but you're not sure because they might all apply to items that do not match fq=item:in_stock.

Is this the problem?

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Nalini Kartha [mailto:nalinikar...@gmail.com]
Sent: Wednesday, December 19, 2012 11:20 AM
To: solr-user@lucene.apache.org
Subject: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

Hi,

With the DirectSolrSpellChecker, we want to be able to make sure that the corrections that are being returned satisfy the fq params of the original query. The collate functionality helps with this but seems to only work with default AND queries - our use case is for default OR queries. I also saw that there is now a spellcheck.collateParam.XX param which allows you to override params
Re: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query
Hi James,

Yup, the example you gave about sums it up. The reason we use an OR query is that we want the flexibility of every term not having to match, but when it comes to corrections we want to be sure that the ones we pick will actually return results (we message the user with the corrected query so it would be bad/confusing if there were no matches for the corrections).

*- by default the spellchecker doesn't see this as a problem because it has hits (mm=0 and wrapping matches something). So you get neither individual words back nor collations from the spellchecker.*

I think we would still get back 'papr' -> 'paper' as a correction and 'christmas wrapping paper' as a collation in this case - I've seen that corrections are returned even for OR queries. The problem is these would be returned even if 'paper' doesn't exist in any docs that have item:in_stock.

*- with spellcheck.collateParam.mm=100 it tries to fix both papr and christmas but can't fix christmas because spelling isn't the problem here (it is an irrelevant term not in the index). So while you get words suggested there are no collations. The individual words would be helpful, but you're not sure because they might all apply to items that do not match fq=item:in_stock.*

Yup, exactly. Do you think the workaround I suggested would work (and not have terrible perf)? Or any other ideas?

Thanks,
Nalini

On Wed, Dec 19, 2012 at 1:09 PM, Dyer, James james.d...@ingramcontent.com wrote:

Let me try and get a better idea of what you're after. Is it that your users might query a combination of irrelevant terms and misspelled terms, so you want the ability to ignore the irrelevant terms but still get suggestions for the misspelled terms?

For instance, if someone wanted q=christmas wrapping papr&mm=0&fq=item:in_stock, but christmas was not in the index and you wanted to return results for just wrapping paper, the problem is...

- by default the spellchecker doesn't see this as a problem because it has hits (mm=0 and wrapping matches something). So you get neither individual words back nor collations from the spellchecker.

- with spellcheck.collateParam.mm=100 it tries to fix both papr and christmas but can't fix christmas because spelling isn't the problem here (it is an irrelevant term not in the index). So while you get words suggested there are no collations. The individual words would be helpful, but you're not sure because they might all apply to items that do not match fq=item:in_stock.

Is this the problem?

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Nalini Kartha [mailto:nalinikar...@gmail.com]
Sent: Wednesday, December 19, 2012 11:20 AM
To: solr-user@lucene.apache.org
Subject: Ensuring SpellChecker returns corrections which satisfy fq params for default OR query

Hi,

With the DirectSolrSpellChecker, we want to be able to make sure that the corrections that are being returned satisfy the fq params of the original query. The collate functionality helps with this but seems to only work with default AND queries - our use case is for default OR queries.

I also saw that there is now a spellcheck.collateParam.XX param which allows you to override params from the original query - the example mentioned was to override the mm param to be 100%, which would make the collated query default AND. This doesn't quite do what we want either, though, because it seems like all collations would be thrown out if one of the correctly spelled terms in the query did not satisfy the fq params. We don't want it to check that the correctly spelled terms MUST be in results, just that each correction (individually) would result in some hits taking into account the fqs.

I was wondering whether it is possible (and what the perf overhead would be) to use the SolrIndexSearcher.getDocSet(Query, DocSet) method to check that each correction being considered (the Query) matches some docs taking into account the fqs (the DocSet)? Would appreciate other suggestions/ideas if this isn't feasible.

Thanks!
- Nalini
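The getDocSet approach described above can be sketched roughly as follows against the Solr 4.x SolrIndexSearcher API. This is an illustrative sketch, not a tested patch; the field name "field" and the variables fqQueries/solrSearcher/suggestions are assumptions:

```java
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spell.SuggestWord;
import org.apache.solr.search.DocSet;
import org.apache.solr.search.SolrIndexSearcher;

// Build the fq DocSet once; it hits the filterCache, so repeated
// correction checks against the same fqs reuse the cached set.
DocSet filterSet = solrSearcher.getDocSet(fqQueries); // List<Query> parsed from the fq params

// Check each suggested correction individually: does a doc containing the
// corrected term exist inside the filtered set?
for (SuggestWord suggestion : suggestions) {
  Query q = new TermQuery(new Term("field", suggestion.string));
  DocSet matches = solrSearcher.getDocSet(q, filterSet); // intersection
  if (matches.size() > 0) {
    // keep this correction; it would return hits with the fqs applied
  }
}
```

Since filterSet is computed once and each per-correction query is a single TermQuery, the dominant cost is one intersection per suggestion, which is indeed similar in spirit to how facet counts intersect term DocSets with the base set.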
Re: Differentiate between correctly spelled term and mis-spelled term with no corrections
Got it. Thanks again for all the info! Will open a JIRA and follow up about this sometime soon.

Thanks,
Nalini

On Fri, Dec 14, 2012 at 1:32 PM, Dyer, James james.d...@ingramcontent.com wrote:

Nalini,

I don't think you can change the *default* response format until a new major release (so it's ok for Trunk/5.0 but not for the 4.x branch). What you can do, however, is create a new spellcheck.xxx parameter to let users opt in to the new functionality in 4.x as desired. We'd also want to update SolrJ so Java clients could easily use the new feature (see http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/solrj/src/java/org/apache/solr/client/solrj/response/SpellCheckResponse.java).

I'm not sure I ever heard of someone wanting to combine suggestions from multiple cores before. I'd be interested in hearing more about what you're trying to do. But this does seem similar to the problem of combining suggestions between multiple spellcheckers. See https://issues.apache.org/jira/browse/SOLR-2993, which adds a new spellchecker that corrects word-break problems. This added a new class, ConjunctionSolrSpellChecker, that interleaves the results from the main String-Distance-based checker with results from the word break checker. You might be able to generalize this class to also be able to combine results from multiple DirectSolrSpellCheckers together. While you want to get suggestions from multiple cores, others might want this feature to be able to have separate dictionaries per-field from the same core.

I think it's ok to rank combined results by String Distance so long as you know the same metric was applied to all. This is in contrast to how it is with the Word Break spellchecker, which uses an incompatible distance metric. So for this case, ConjunctionSolrSpellChecker just interleaves the results round-robin. So expanding on ConjunctionSolrSpellChecker might be one possible way to accomplish what you want to do. You might find something else that works better.
For whatever you come up with, by all means open a JIRA issue, attach your work as a patch, and see where it goes from there. (Subscribe to the dev list if you haven't already, as that's where these types of discussions usually happen.)

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

-----Original Message-----
From: Nalini Kartha [mailto:nalinikar...@gmail.com]
Sent: Friday, December 14, 2012 11:06 AM
To: solr-user@lucene.apache.org
Subject: Re: Differentiate between correctly spelled term and mis-spelled term with no corrections

Hi James,

Couple more follow-up questions -

1. Do changes to the response format have to be backwards compatible at this point? Seems like if we changed it to always return the origFreq even if there are no suggestions then that could break things, right?

2. For our purposes, we need to be able to order suggestions from multiple Solr cores, so we were thinking of changing the format to also include the score that is calculated for each suggestion (which isn't exposed right now). Are these scores from different dictionary fields comparable (assuming we use the default INTERNAL_LEVENSHTEIN_DISTANCE metric)? And do you think this would be of general use, i.e. could it be contributed back to Solr?

Thanks,
Nalini

On Fri, Dec 7, 2012 at 2:20 PM, Nalini Kartha nalinikar...@gmail.com wrote:

Ah, I see what you mean. Will probably try to change the response to look like the internal shard one then. Thanks for the detailed explanation!

- Nalini

On Fri, Dec 7, 2012 at 1:38 PM, Dyer, James james.d...@ingramcontent.com wrote:

The response from the shards is different from the final spellcheck response in that it does include the term even if there are no suggestions for it. So to get the behavior you want, we'd probably just have to make it so you could get the shard-to-shard-internal version.
See http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java ...and method toNamedList(...) ...and this line: if (theSuggestions != null && (theSuggestions.size() > 0 || shardRequest)) { ... } ...the shardRequest boolean is passed as true here if it's the first stage of a distributed request (from #process). The various shards send their responses to the main shard, which then integrates them together (in #finishStage). Note that #finishStage always passes shardRequest=false to #toNamedList so that the end user gets a normal response back, omitting terms for which there are no suggestions. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nalini Kartha [mailto:nalinikar...@gmail.com] Sent: Friday, December 07, 2012 9:54 AM To: solr-user@lucene.apache.org Subject: Re: Differentiate between correctly spelled term and mis-spelled term
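James's description of ConjunctionSolrSpellChecker interleaving results round-robin could be sketched roughly as below. This is a simplified, self-contained sketch, not the actual Lucene class: the class and method names are made up for illustration, and it also de-duplicates suggestions, which the real implementation may handle differently.

```java
import java.util.*;

public class InterleaveDemo {
    // Round-robin interleave of suggestion lists from several checkers,
    // in the spirit of ConjunctionSolrSpellChecker: take the top suggestion
    // from each source, then the second from each, and so on, skipping
    // duplicates already emitted.
    static List<String> interleave(List<List<String>> sources) {
        List<String> out = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        int max = 0;
        for (List<String> s : sources) max = Math.max(max, s.size());
        for (int i = 0; i < max; i++) {
            for (List<String> s : sources) {
                if (i < s.size() && seen.add(s.get(i))) {
                    out.add(s.get(i));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // e.g. suggestions for "runn" from two hypothetical dictionaries
        List<String> a = Arrays.asList("run", "rung");
        List<String> b = Arrays.asList("sun", "run");
        System.out.println(interleave(Arrays.asList(a, b)));
        // prints [run, sun, rung]
    }
}
```

As the thread notes, plain interleaving sidesteps the problem of comparing scores across checkers whose distance metrics are incompatible.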
Re: Differentiate between correctly spelled term and mis-spelled term with no corrections
Hi James, A couple more follow-up questions - 1. Do changes to the response format have to be backwards compatible at this point? Seems like if we changed it to always return the origFreq even if there are no suggestions then that could break things, right? 2. For our purposes, we need to be able to order suggestions from multiple Solr cores so we were thinking of changing the format to also include the score that is calculated for each suggestion (which isn't exposed right now). Are these scores from different dictionary fields comparable (assuming we use the default INTERNAL_LEVENSHTEIN_DISTANCE metric)? And do you think this would be of general use, i.e. could it be contributed back to Solr? Thanks, Nalini On Fri, Dec 7, 2012 at 2:20 PM, Nalini Kartha nalinikar...@gmail.com wrote: Ah I see what you mean. Will probably try to change the response to look like the internal shard one then. Thanks for the detailed explanation! - Nalini On Fri, Dec 7, 2012 at 1:38 PM, Dyer, James james.d...@ingramcontent.com wrote: The response from the shards is different from the final spellcheck response in that it does include the term even if there are no suggestions for it. So to get the behavior you want, we'd probably just have to make it so you could get the shard-to-shard-internal version. See http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java ...and method toNamedList(...) ...and this line: if (theSuggestions != null && (theSuggestions.size() > 0 || shardRequest)) { ... } ...the shardRequest boolean is passed as true here if it's the first stage of a distributed request (from #process). The various shards send their responses to the main shard, which then integrates them together (in #finishStage). Note that #finishStage always passes shardRequest=false to #toNamedList so that the end user gets a normal response back, omitting terms for which there are no suggestions. 
James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nalini Kartha [mailto:nalinikar...@gmail.com] Sent: Friday, December 07, 2012 9:54 AM To: solr-user@lucene.apache.org Subject: Re: Differentiate between correctly spelled term and mis-spelled term with no corrections Hi James, Thanks for the response, will open a JIRA for this. Had one follow-up question - how does the Distributed SpellCheckComponent handle this? I tried looking at the code but it's not obvious to me how it is able to differentiate between these 2 cases. I see that it only considers a term to be wrongly spelt if all shards return a suggestion for it but isn't it possible that a suggestion is not returned because nothing close enough could be found in some shard? Or is the response from shards different than the final spellcheck response we get from Solr in some way? Thanks, Nalini On Fri, Dec 7, 2012 at 10:26 AM, Dyer, James james.d...@ingramcontent.comwrote: You might want to open a jira issue for this to request that the feature be added. If you haven't used it before, you need to create an account. https://issues.apache.org/jira/browse/SOLR In the mean time, If you need to get the document frequency of the query terms, see http://wiki.apache.org/solr/TermsComponent , which maybe would provide you a viable workaround. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nalini Kartha [mailto:nalinikar...@gmail.com] Sent: Thursday, December 06, 2012 2:44 PM To: solr-user@lucene.apache.org Subject: Differentiate between correctly spelled term and mis-spelled term with no corrections Hi, When using the SolrSpellChecker, is there currently any way to differentiate between a term that exists in the dictionary and a mis-spelled term for which no corrections were found when looking at the spellcheck response? 
From reading the doc and trying out some simple test cases it seems like there isn't - in both cases it looks like the response doesn't include the term. Could the extended results format be changed to include the original term frequency even if there are no suggestions? This would allow us to make this differentiation. Thanks, Nalini
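The TermsComponent workaround James mentions above can be exercised with a plain request against the terms handler. A sketch, where the host, core path, and the field name spellcheck are hypothetical:

```text
http://localhost:8983/solr/terms?terms=true&terms.fl=spellcheck&terms.prefix=runn&terms.limit=10
```

The response lists each matching indexed term with its document frequency, so a term with df greater than zero is known to exist in the dictionary field even when the spellcheck response stays silent about it.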
Re: Differentiate between correctly spelled term and mis-spelled term with no corrections
Hi James, Thanks for the response, will open a JIRA for this. Had one follow-up question - how does the Distributed SpellCheckComponent handle this? I tried looking at the code but it's not obvious to me how it is able to differentiate between these 2 cases. I see that it only considers a term to be wrongly spelt if all shards return a suggestion for it but isn't it possible that a suggestion is not returned because nothing close enough could be found in some shard? Or is the response from shards different than the final spellcheck response we get from Solr in some way? Thanks, Nalini On Fri, Dec 7, 2012 at 10:26 AM, Dyer, James james.d...@ingramcontent.comwrote: You might want to open a jira issue for this to request that the feature be added. If you haven't used it before, you need to create an account. https://issues.apache.org/jira/browse/SOLR In the mean time, If you need to get the document frequency of the query terms, see http://wiki.apache.org/solr/TermsComponent , which maybe would provide you a viable workaround. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nalini Kartha [mailto:nalinikar...@gmail.com] Sent: Thursday, December 06, 2012 2:44 PM To: solr-user@lucene.apache.org Subject: Differentiate between correctly spelled term and mis-spelled term with no corrections Hi, When using the SolrSpellChecker, is there currently any way to differentiate between a term that exists in the dictionary and a mis-spelled term for which no corrections were found when looking at the spellcheck response? From reading the doc and trying out some simple test cases it seems like there isn't - in both cases it looks like the response doesn't include the term. Could the extended results format be changed to include the original term frequency even if there are no suggestions? This would allow us to make this differentiation. Thanks, Nalini
Re: Differentiate between correctly spelled term and mis-spelled term with no corrections
Ah I see what you mean. Will probably try to change the response to look like the internal shard one then. Thanks for the detailed explanation! - Nalini On Fri, Dec 7, 2012 at 1:38 PM, Dyer, James james.d...@ingramcontent.com wrote: The response from the shards is different from the final spellcheck response in that it does include the term even if there are no suggestions for it. So to get the behavior you want, we'd probably just have to make it so you could get the shard-to-shard-internal version. See http://svn.apache.org/repos/asf/lucene/dev/trunk/solr/core/src/java/org/apache/solr/handler/component/SpellCheckComponent.java ...and method toNamedList(...) ...and this line: if (theSuggestions != null && (theSuggestions.size() > 0 || shardRequest)) { ... } ...the shardRequest boolean is passed as true here if it's the first stage of a distributed request (from #process). The various shards send their responses to the main shard, which then integrates them together (in #finishStage). Note that #finishStage always passes shardRequest=false to #toNamedList so that the end user gets a normal response back, omitting terms for which there are no suggestions. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nalini Kartha [mailto:nalinikar...@gmail.com] Sent: Friday, December 07, 2012 9:54 AM To: solr-user@lucene.apache.org Subject: Re: Differentiate between correctly spelled term and mis-spelled term with no corrections Hi James, Thanks for the response, will open a JIRA for this. Had one follow-up question - how does the Distributed SpellCheckComponent handle this? I tried looking at the code but it's not obvious to me how it is able to differentiate between these 2 cases. I see that it only considers a term to be wrongly spelt if all shards return a suggestion for it but isn't it possible that a suggestion is not returned because nothing close enough could be found in some shard? 
Or is the response from shards different than the final spellcheck response we get from Solr in some way? Thanks, Nalini On Fri, Dec 7, 2012 at 10:26 AM, Dyer, James james.d...@ingramcontent.comwrote: You might want to open a jira issue for this to request that the feature be added. If you haven't used it before, you need to create an account. https://issues.apache.org/jira/browse/SOLR In the mean time, If you need to get the document frequency of the query terms, see http://wiki.apache.org/solr/TermsComponent , which maybe would provide you a viable workaround. James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nalini Kartha [mailto:nalinikar...@gmail.com] Sent: Thursday, December 06, 2012 2:44 PM To: solr-user@lucene.apache.org Subject: Differentiate between correctly spelled term and mis-spelled term with no corrections Hi, When using the SolrSpellChecker, is there currently any way to differentiate between a term that exists in the dictionary and a mis-spelled term for which no corrections were found when looking at the spellcheck response? From reading the doc and trying out some simple test cases it seems like there isn't - in both cases it looks like the response doesn't include the term. Could the extended results format be changed to include the original term frequency even if there are no suggestions? This would allow us to make this differentiation. Thanks, Nalini
minPrefix attribute of DirectSolrSpellChecker
Hi, In most of the examples I have seen for configuring the DirectSolrSpellChecker the minPrefix attribute is set to 1 (and this is the default value as well). Is there any specific reason for this - would performance take a hit if it was set to 0? We'd like to support returning corrections which don't start with the same letter so just wanted to confirm that there aren't any issues with changing this. Thanks, Nalini
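For reference, the minPrefix setting in question lives in the spellchecker's solrconfig.xml entry. A hedged sketch, where the component name and dictionary field name are hypothetical: with minPrefix=0, suggestions may differ in the first letter, at the likely cost of examining more candidate terms per lookup.

```xml
<!-- Sketch of a solrconfig.xml spellchecker entry (field name is hypothetical).
     minPrefix=0 allows corrections whose first letter differs from the query term. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spellcheck</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <int name="minPrefix">0</int>
  </lst>
</searchComponent>
```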
Re: Using multiple DirectSolrSpellcheckers for a query
Hi James/Robert, Thanks for the responses. Robert: What is it about the current APIs that makes this hard? How much/what kind of refactoring would open this up? James: I didn't quite understand the usage you suggested. I thought that the spellcheck.q param shouldn't include field names, etc., and that the purpose of specifying this param is to avoid the extra parsing out of the field names, etc. from the q param to get the query terms for spell checking. This is based on this bit in the SpellCheckComponent wiki - The spellcheck.q parameter is intended to be the original query, minus any extra markup like field names, boosts, etc. Did I misunderstand something? I agree that it's impossible to know if the query run should be corrected to sun or running in the example I gave, but I guess I'm asking more from the angle of how to avoid correcting terms that will be matched because they exist in other more processed fields that are being searched. Since the recommendation is to build spellcheck fields from minimally processed source fields, it seems like this would be a common problem? And another kind of unrelated question - all the examples of spellcheck dictionaries I've seen in sample solrconfig.xmls have minPrefix set to 1. Is this for performance reasons? And with this setting, we wouldn't get run as a correction for eon, right? Thanks, Nalini On Wed, Mar 7, 2012 at 11:04 AM, Robert Muir rcm...@gmail.com wrote: On Wed, Jan 25, 2012 at 12:55 PM, Nalini Kartha nalinikar...@gmail.com wrote: Is there any reason why Solr doesn't support using multiple spellcheckers for a query? Is it because of performance overhead? That's not the case really, see https://issues.apache.org/jira/browse/SOLR-2926 I think the issue is that the spellchecker APIs need to be extended to allow this to happen easier; there is no real hard performance/technical/algorithmic issue, it's just a matter of refactoring spellchecker APIs to allow this! -- lucidimagination.com
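To make the q versus spellcheck.q distinction above concrete, a hedged request sketch (handler path, field names, and boosts are hypothetical): q carries the full query markup, while spellcheck.q carries only the raw user terms for the spellchecker to analyze.

```text
/select?q=title:(runing^2 OR body:runing)
        &spellcheck=true
        &spellcheck.q=runing
```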
Re: Using multiple DirectSolrSpellcheckers for a query
Hi James, Thanks for the detailed reply and sorry for the delay getting back. One issue for us with using the collate functionality is that some of our query types are default OR (implemented using the mm param value). Since the collate functionality reruns the query using all param values specified in the original query, it'll effectively be issuing an OR query again right? Which means that again we could end up with corrections which aren't the best for the current query? Another issue we're running into is that we're using unstemmed fields as the source for our spell correction field and so we could end up unnecessarily correcting queries containing stemmed versions of words. So for eg. if I have a document containing running my fields look like this - docUnstemmed: running docStemmed: run, ... spellcheck: running If a user searches for run OR jump, there are matching results (since we search against both the stemmed and unstemmed fields) but the spellcheck results will contain corrections for run, let's say sun. We don't want to overcorrect queries which are returning valid results like this one. Any suggestions for how to deal with this? I was thinking that there might be value in having another dictionary which is used for vetting words but not for finding corrections - the stemmed fields could be used as a source for this dictionary. So before finding corrections for a term if it doesn't exist in the primary dictionary, check the secondary dictionary and make sure the term does not exist in it as well. But then, this would require an extra copyfield (we could have multiple unstemmed fields as a source for this secondary dictionary) and bloat the index even more so I'm not sure if it's feasible. Thanks, Nalini On Thu, Jan 26, 2012 at 10:23 AM, Dyer, James james.d...@ingrambook.comwrote: Nalini, Right now the best you can do is to use copyField to combine everything into a catch-all for spellchecking purposes. 
While this seems wasteful, this often has to be done anyhow because typically you'll need less/different analysis for spellchecking than for searching. But rather than having separate copyFields to create multiple dictionaries, put everything into one field to create a single master dictionary. From there, you need to set spellcheck.collate to true and also spellcheck.maxCollationTries greater than zero (5-10 usually works). The first parameter tells it to generate re-written queries with spelling suggestions (collations). The second parameter tells it to weed out any collations that won't generate hits if you re-query them. This is important because having unrelated keywords in your master dictionary will increase the chances the spellchecker will pick the wrong words as corrections. There is a significant caveat to this: the spellchecker typically only suggests for words in the dictionary. So by creating a huge master dictionary you might find that many misspelled words won't generate suggestions. See this thread for some workarounds: http://lucene.472066.n3.nabble.com/Improving-Solr-Spell-Checker-Results-td3658411.html I think having multiple per-field dictionaries as you suggest might be a good way to go. While this is not supported, I don't think it's because of performance concerns. (There would be an overhead cost to this but I think it would still be practical.) It just hasn't been implemented yet. But we might be getting to a possible start to this type of functionality. In https://issues.apache.org/jira/browse/SOLR-2585 a separate spellchecker is added that just corrects wordbreak (or is it word break?) problems, then a ConjunctionSolrSpellChecker combines the results from the main spellchecker and the wordbreak spellchecker. I could see a next step beyond this being to support per-field dictionaries, checking them separately, then combining the results. 
James Dyer E-Commerce Systems Ingram Content Group (615) 213-4311 -Original Message- From: Nalini Kartha [mailto:nalinikar...@gmail.com] Sent: Wednesday, January 25, 2012 11:56 AM To: solr-user@lucene.apache.org Subject: Using multiple DirectSolrSpellcheckers for a query Hi, We are trying to use the DirectSolrSpellChecker to get corrections for mis-spelled query terms directly from fields in the Solr index. However, we need to use multiple fields for spellchecking a query. It looks like you can only use one spellchecker for a request, and so the workaround for this is to create a copy field from the fields required for spell correction? We'd like to avoid this because we allow users to perform different kinds of queries on different sets of fields, and so to provide meaningful corrections we'd have to create multiple copy fields - one for each query type. Is there any reason why Solr doesn't support using multiple spellcheckers for a query? Is it because of performance overhead? Thanks, Nalini
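The catch-all copyField approach James recommends above might look like the following in schema.xml. A hedged sketch: the field names and the textSpell type are hypothetical, and the source fields would be whichever fields users actually search.

```xml
<!-- Single master dictionary field fed from all searched fields
     (field names and type are hypothetical) -->
<field name="spellcheck" type="textSpell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="spellcheck"/>
<copyField source="body" dest="spellcheck"/>
```

At query time the thread's advice then corresponds to adding spellcheck.collate=true and a spellcheck.maxCollationTries value such as 5, so collations built from the oversized dictionary are re-queried and pruned when they would return no hits.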
Using multiple DirectSolrSpellcheckers for a query
Hi, We are trying to use the DirectSolrSpellChecker to get corrections for mis-spelled query terms directly from fields in the Solr index. However, we need to use multiple fields for spellchecking a query. It looks like you can only use one spellchecker for a request, and so the workaround for this is to create a copy field from the fields required for spell correction? We'd like to avoid this because we allow users to perform different kinds of queries on different sets of fields, and so to provide meaningful corrections we'd have to create multiple copy fields - one for each query type. Is there any reason why Solr doesn't support using multiple spellcheckers for a query? Is it because of performance overhead? Thanks, Nalini