Re: docFreq coming to be more than 1 for unique id field
Hello Markus, Ahmet,

I forgot to update the thread: optimizing works, i.e. after optimizing, every unique key has a docFreq of 1.

On Wed, Jun 18, 2014 at 1:58 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
> yes, because term stats are over the entire index including deleted documents still in segments -- information about deletions isn't purged from the index until a segment is merged and the stats are recomputed over the docs/terms in the new segment.

--
Thanks & Regards,
Apoorva
RE: docFreq coming to be more than 1 for unique id field
Hi - did you perhaps update one of those documents?

-Original message-
From: Apoorva Gaurav apoorva.gau...@myntra.com
Sent: Tuesday 17th June 2014 16:58
To: solr-user@lucene.apache.org
Subject: docFreq coming to be more than 1 for unique id field

Hello All,

We are using Solr 4.4.0. We have a uniqueKey of type solr.StrField. We need to extract docs in a pre-defined order if they match a certain condition. Our query is of the format

uniqueField:(id1 ^ weight1 OR id2 ^ weight2 ... OR idN ^ weightN)

where weight1 > weight2 > ... > weightN, but the result is not in the desired order. On debugging the query we found that for some of the documents docFreq is higher than 1, and hence their tf-idf based score is lower than that of the others. What can be the reason behind a unique id field having a docFreq greater than 1? How can we prevent it?

--
Thanks & Regards,
Apoorva
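To make the scoring effect in the question concrete, here is a minimal sketch. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), which is what the default similarity in the Solr 4.x line is generally understood to use. The index size is made up, and the real score includes other factors (boosts, norms, query normalization) that are not modelled here.

```python
import math


def classic_idf(doc_freq: int, num_docs: int) -> float:
    """idf under the classic Lucene TF-IDF formula (an assumption of this sketch)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1_000_000  # hypothetical index size

# An id that was indexed once and never updated.
idf_clean = classic_idf(1, NUM_DOCS)
# An id that was updated once: the stale copy still counts toward docFreq.
idf_stale = classic_idf(2, NUM_DOCS)

print(f"docFreq=1 -> idf={idf_clean:.3f}")
print(f"docFreq=2 -> idf={idf_stale:.3f}")
```

With equal boosts, the id whose term has docFreq 2 gets the lower idf, so a small difference between weights can be outweighed by it.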
Re: docFreq coming to be more than 1 for unique id field
Hi,

Just a guess: do you have deletions? What happens when you optimize and re-try?

On Tuesday, June 17, 2014 5:58 PM, Apoorva Gaurav apoorva.gau...@myntra.com wrote:
> [...] What can be the reason behind a unique id field having docFreq greater than 1? How can we prevent it?
Re: docFreq coming to be more than 1 for unique id field
Yes, we have updates on these. I haven't tried optimizing yet; will do. But isn't the unique field supposed to be unique?

On Tue, Jun 17, 2014 at 8:37 PM, Ahmet Arslan iori...@yahoo.com.invalid wrote:
> Hi, Just a guess, do you have deletions? What happens when you optimize and re-try?

--
Thanks & Regards,
Apoorva
RE: docFreq coming to be more than 1 for unique id field
Yes, it is unique, but the old versions of updated documents are not purged immediately; that happens only on optimize (forceMerge) or during regular segment merges. The problem is that until then they keep messing with the statistics.

-Original message-
From: Apoorva Gaurav apoorva.gau...@myntra.com
Sent: Tuesday 17th June 2014 17:16
To: solr-user solr-user@lucene.apache.org; Ahmet Arslan iori...@yahoo.com
Subject: Re: docFreq coming to be more than 1 for unique id field

> Yes we have updates on these. Didn't try optimizing will do. But isn't the unique field supposed to be unique?
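For reference, a minimal sketch of how an optimize could be triggered over plain HTTP, which matches the JSON-over-HTTP setup described in this thread. The host, port, and core name (`collection1`) are assumptions; `optimize=true` and `commit=true&expungeDeletes=true` are request parameters accepted by Solr's update handler. Note that `expungeDeletes` is the lighter option and, depending on the merge policy settings, may not merge away every segment that holds deleted documents.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def update_url(base_url: str, **params: str) -> str:
    """Build a URL for Solr's /update handler with the given query parameters."""
    return f"{base_url.rstrip('/')}/update?{urlencode(params)}"


BASE = "http://localhost:8983/solr/collection1"  # hypothetical host and core

# Full optimize: merges segments and drops deleted documents.
optimize_url = update_url(BASE, optimize="true")
# Lighter alternative: commit and ask Solr to merge away segments with deletions.
expunge_url = update_url(BASE, commit="true", expungeDeletes="true")


def send(url: str) -> int:
    """Issue the request and return the HTTP status code."""
    with urlopen(url) as response:
        return response.status


print(optimize_url)
print(expunge_url)
```

Calling `send(optimize_url)` against a live instance would start the optimize. On a large index this is an expensive operation, so it is usually scheduled rather than run after every update.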
Re: docFreq coming to be more than 1 for unique id field
I will try optimizing and then respond to the thread.

On Tue, Jun 17, 2014 at 8:47 PM, Markus Jelsma markus.jel...@openindex.io wrote:
> Yes, it is unique but they are not immediately purged, only when optimized (forceMerge) or during regular segment merges. The problem is that they keep messing with the statistics.

--
Thanks & Regards,
Apoorva
Re: docFreq coming to be more than 1 for unique id field
Personally, although I understand the rationale and performance ramifications of the current approach of including deleted documents, I would agree that DF and IDF should definitely be accurate, despite deletions. So, if they aren't, I'd suggest filing a bug in Jira. Granted, it might be rejected as "by design", "won't fix", or "improvement", but it's worth having the discussion.

Maybe one theory from the old days is that the model of batch update would by definition include an optimize step. But now, with Solr considered by some to be a NoSQL database and with (near) real-time updates, that model is clearly obsolete.

-- Jack Krupansky

-Original Message-
From: Apoorva Gaurav
Sent: Tuesday, June 17, 2014 11:15 AM
To: solr-user ; Ahmet Arslan
Subject: Re: docFreq coming to be more than 1 for unique id field

> Yes we have updates on these. Didn't try optimizing will do. But isn't the unique field supposed to be unique?
Re: docFreq coming to be more than 1 for unique id field
All index-wide statistics (like the docFreq of each term) are over the entire index, which includes deleted docs -- because it's an *inverted* index, it's not feasible to update those statistics to account for deleted docs (that would basically kill all the performance advantages that come from having an inverted index).

: uniqueField:(id1 ^ weight1 OR id2 ^ weight2 ... OR idN ^ weightN)
: where weight1 > weight2 > ... > weightN
:
: But the result is not in the desired order. On debugging the query we've

if you are requesting a small number of docs, and all the docs you are requesting are returned in a single request, why do you care what order they are in? why not just put them in the order you want on the client? That would not only make your solr request simpler, but would almost certainly be a bit *faster*, since you could sort exactly as you wanted w/o needing to compute a complex score that you don't actually care about.

-Hoss
http://www.lucidworks.com/
Re: docFreq coming to be more than 1 for unique id field
Currently we are not using SolrJ; we simply interact with Solr using JSON over HTTP. This will change in a couple of months, but we are not there yet. As of now we put all the logic into query building, use the query against Solr, and then pass the JSON it returns on to the front end. I know this is not the ideal approach, but that's what we have at the moment. Hence we need a way to deterministically order the result set, provided the documents match the other search criteria.

On Tue, Jun 17, 2014 at 10:28 PM, Chris Hostetter hossman_luc...@fucit.org wrote:
> if you are requesting a small number of docs, and all the docs you are requesting are returned in a single request, why do you care what order they are in? why not just put them in the order you want on the client?

--
Thanks & Regards,
Apoorva
Re: docFreq coming to be more than 1 for unique id field
: Currently we are not using SolrJ but are simply interacting with solr with
: json over http, this will change in a couple of months but currently not
: there. As of now we are putting all the logic in query building, using it
: to query solr and then passing on the json returned by it to front end. I
: know this is not the ideal approach, but that's what we have at the moment.
: Hence need a way of deterministically order the result set provided they
: match other search criteria.

whether you are using SolrJ or not doesn't really change my point at all -- you are jumping through all sorts of hoops, and asking solr to jump through all sorts of hoops, for a score you don't actually care about, and that isn't going to work perfectly for what you want anyway because of the fundamental nature of the inverted index stats, leading you to look for even smaller, higher hoops to try and jump through.

it would be far simpler to just ask for the exact set of N documents you want from Solr in default order, re-order the resulting documents in the magic order you already know and care about, and then give that modified response to your front end.

-Hoss
http://www.lucidworks.com/
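The re-ordering step suggested above needs very little code wherever the JSON response is handled. A minimal sketch follows, assuming the documents arrive as a list of dicts (the shape of `response["response"]["docs"]` in a Solr JSON response) and that the unique key field is called `id`. The function name and the sample data are hypothetical.

```python
def reorder_by_id(docs, wanted_ids, id_field="id"):
    """Return docs sorted by their position in wanted_ids.

    Documents whose id is not in wanted_ids are dropped; ids with no
    matching document are simply absent from the result.
    """
    rank = {doc_id: position for position, doc_id in enumerate(wanted_ids)}
    found = [doc for doc in docs if doc[id_field] in rank]
    return sorted(found, key=lambda doc: rank[doc[id_field]])


# Docs as Solr might return them for uniqueField:(id1 OR id2 OR id3 OR id4),
# in whatever order the scores happened to produce.
solr_docs = [{"id": "id3"}, {"id": "id1"}, {"id": "id2"}]
wanted = ["id1", "id2", "id3", "id4"]  # the pre-defined order; id4 did not match

ordered = reorder_by_id(solr_docs, wanted)
print([doc["id"] for doc in ordered])
```

Because the order comes from a lookup table rather than from scores, stale docFreq values can no longer affect it, and the boosts can be dropped from the query.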
Re: docFreq coming to be more than 1 for unique id field
OK, let's for a moment forget about this specific use case and consider a more general one. Let's say the field name is keywords and we are storing text in it; the query is of the type keywords:(word1 OR word2 ... OR wordN). The client relies on the default relevancy-based sort returned by Solr. Some documents can get penalised because of other documents which were deleted. Is this behaviour correct?

On Wed, Jun 18, 2014 at 12:52 AM, Chris Hostetter hossman_luc...@fucit.org wrote:
> it would be far simpler to just ask for the exact set of N documents you want from Solr in default order, re-order the resulting documents in the magic order you already know and care about, and then give that modified response to your front end.

--
Thanks & Regards,
Apoorva
Re: docFreq coming to be more than 1 for unique id field
: text in it, query is of the type keywords:(word1 OR word2 ... OR wordN).
: The client is relying on default relevancy based sort returned by solr.
: Some documents can get penalised because of some other documents which were
: deleted. Is this functionality correct?

yes, because term stats are over the entire index including deleted documents still in segments -- information about deletions isn't purged from the index until a segment is merged and the stats are recomputed over the docs/terms in the new segment.

the only way to get those types of statistics at request time such that they were *not* affected by deleted documents would involve scanning every doc to compute them -- which would defeat the point of having the inverted index.

-Hoss
http://www.lucidworks.com/
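The behaviour described here can be imitated with a toy model. The sketch below is not Lucene code: `ToySegment` and `merge` are made-up names, and the model only captures the idea that postings are written once per segment, deletions are recorded separately, and statistics are recomputed only when segments are merged.

```python
class ToySegment:
    """Write-once postings plus a separate set of deleted doc ids."""

    def __init__(self, docs):
        self.docs = dict(docs)   # internal doc id -> list of terms
        self.deleted = set()     # deletions are only *marked* here
        self.postings = {}       # term -> list of internal doc ids
        for doc_id, terms in self.docs.items():
            for term in set(terms):
                self.postings.setdefault(term, []).append(doc_id)

    def delete(self, doc_id):
        self.deleted.add(doc_id)  # postings stay untouched

    def doc_freq(self, term):
        # Read straight from the postings, so deleted docs are still counted.
        return len(self.postings.get(term, []))


def merge(segments):
    """Build a new segment from live docs only, which recomputes the stats."""
    live = {}
    for segment in segments:
        for doc_id, terms in segment.docs.items():
            if doc_id not in segment.deleted:
                live[doc_id] = terms
    return ToySegment(live)


# An update of unique key 42 = mark the old doc deleted + add a new doc.
seg_a = ToySegment({0: ["id:42", "shirt"]})
seg_b = ToySegment({1: ["id:42", "shirt", "blue"]})
seg_a.delete(0)

before = seg_a.doc_freq("id:42") + seg_b.doc_freq("id:42")
after = merge([seg_a, seg_b]).doc_freq("id:42")
print(before, after)  # prints: 2 1
```

Getting the accurate count without a merge would mean checking every posting against the deleted set at query time, which is the per-document scan the message above warns against.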