Re: docFreq coming to be more than 1 for unique id field

2014-06-23 Thread Apoorva Gaurav
Hello Markus, Ahmet,
Forgot to update the thread: optimizing works, i.e. after an optimize all
unique keys have a docFreq of 1.





-- 
Thanks & Regards,
Apoorva


RE: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Markus Jelsma
Hi - did you perhaps update one of those documents?

 
 
-Original message-
 From:Apoorva Gaurav apoorva.gau...@myntra.com
 Sent: Tuesday 17th June 2014 16:58
 To: solr-user@lucene.apache.org
 Subject: docFreq coming to be more than 1 for unique id field
 
 Hello All,
 
 We are using solr 4.4.0. We have a uniqueKey of type solr.StrField. We need
 to extract docs in a pre-defined order if they match a certain condition.
 Our query is of the format
 
 uniqueField:(id1 ^ weight1 OR id2 ^ weight2 ... OR idN ^ weightN)
 where weight1 > weight2 > ... > weightN
 
 But the result is not in the desired order. On debugging the query we've
 found out that for some of the documents docFreq is higher than 1 and hence
 their tf-idf based score is less than others. What can be the reason behind
 a unique id field having docFreq greater than 1?  How can we prevent it?
 
 -- 
 Thanks & Regards,
 Apoorva
 


Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Ahmet Arslan
Hi,

Just a guess, do you have deletions? What happens when you optimize and re-try?






Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Apoorva Gaurav
Yes, we have updates on these. Haven't tried optimizing yet; will do. But isn't the
unique field supposed to be unique?






-- 
Thanks & Regards,
Apoorva


RE: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Markus Jelsma
Yes, it is unique, but the old (deleted) versions of updated documents are not 
purged immediately; that happens only on an optimize (forceMerge) or during 
regular segment merges. Until then they keep skewing the statistics.
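
As a rough sketch of how such a purge can be requested over plain HTTP, the
snippet below only builds the request URLs; it does not send anything. It
assumes Solr's standard update handler and its `optimize`, `commit` and
`expungeDeletes` request parameters. The host, port and core name are
placeholders, and `solr_update_url` is a helper made up for this example.

```python
from urllib.parse import urlencode


def solr_update_url(base_url, core, **params):
    """Build a URL for the update handler of one core (hypothetical helper)."""
    # Sort the parameters so the generated URL is stable and easy to compare.
    query = urlencode(sorted(params.items()))
    return f"{base_url.rstrip('/')}/{core}/update?{query}"


# Placeholder host and core name; adjust to the actual installation.
BASE = "http://localhost:8983/solr"
CORE = "collection1"

# Full optimize: merges segments and drops deleted docs, but can be expensive.
optimize_url = solr_update_url(BASE, CORE, optimize="true")

# Commit that asks Solr to merge away segments containing deletions.
expunge_url = solr_update_url(BASE, CORE, commit="true", expungeDeletes="true")

print(optimize_url)
print(expunge_url)
```

Either URL could then be requested with any HTTP client; whether the cost of
an optimize is acceptable depends on index size and update rate.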
 
 


Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Apoorva Gaurav
Will try optimizing and then respond to the thread.






-- 
Thanks & Regards,
Apoorva


Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Jack Krupansky
Personally, although I understand the rationale and performance 
ramifications of the current approach of including deleted documents, I 
would agree that DF and IDF should definitely be accurate, despite 
deletions. So, if they aren't, I'd suggest filing a Jira issue. Granted, it 
might be rejected as "by design", "won't fix", or "improvement", but it's 
worth having the discussion.


Maybe one theory from the old days is that the model of batch update would 
by definition include an optimize step. But now with Solr considered by some 
to be a NoSQL database and with (near) real-time updates, that model is 
clearly obsolete.


-- Jack Krupansky




Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Chris Hostetter

All index-wide statistics (like the docFreq of each term) are over the 
entire index, which includes deleted docs -- because it's an *inverted* 
index, it's not feasible to update those statistics to account for deleted 
docs (that would basically kill all the performance advantages that come 
from having an inverted index).
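
To see why a docFreq of 2 instead of 1 lowers the score, here is a minimal
sketch of the idf factor. It assumes the classic Lucene TF-IDF similarity,
where idf = 1 + ln(numDocs / (docFreq + 1)). The function name and the
collection size of 1000 are made up for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf as in the classic TF-IDF similarity: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


NUM_DOCS = 1000  # hypothetical collection size

# A unique key that was never updated: exactly one posting for the term.
idf_clean = classic_idf(1, NUM_DOCS)

# A unique key whose document was updated once: the old, deleted version
# is still counted until its segment is merged, so docFreq is 2.
idf_stale = classic_idf(2, NUM_DOCS)

print(f"docFreq=1 -> idf={idf_clean:.4f}")
print(f"docFreq=2 -> idf={idf_stale:.4f}")
```

The term with the stale posting gets the smaller idf, so with boosts that are
close together the updated document can drop below one that was never updated.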


: uniqueField:(id1 ^ weight1 OR id2 ^ weight2 . OR idN ^ weightN)
: where weight1  weight2    weightN
: 
: But the result is not in the desired order. On debugging the query we've

If you are requesting a small number of docs, and all the docs you are 
requesting are returned in a single request, why do you care what order 
they are in?  Why not just put them in the order you want on the client?

That would not only make your Solr request simpler, but would almost 
certainly be a bit *faster*, since you could sort exactly as you wanted w/o 
needing to compute a complex score that you don't actually care about.



-Hoss
http://www.lucidworks.com/


Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Apoorva Gaurav
Currently we are not using SolrJ but are simply interacting with Solr via
JSON over HTTP; this will change in a couple of months, but we are not there
yet. As of now we put all the logic in query building, use the query
to search Solr, and then pass the JSON it returns on to the front end. I
know this is not the ideal approach, but that's what we have at the moment.
Hence we need a way to deterministically order the result set, provided the
documents match the other search criteria.






-- 
Thanks & Regards,
Apoorva


Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Chris Hostetter

: Currently we are not using SolrJ but are simply interacting with solr with
: json over http, this will change in a couple of months but currently not
: there. As of now we are putting all the logic in query building, using it
: to query solr and then passing on the json returned by it to front end. I
: know this is not the ideal approach, but that's what we have at the moment.
: Hence need a way of deterministically order the result set provided they
: match other search criteria.

Whether you are using SolrJ or not doesn't really change my point at all -- 
you are jumping through all sorts of hoops, and asking Solr to jump through 
all sorts of hoops, for a score you don't actually care about, and which isn't 
going to work perfectly for what you want anyway because of the 
fundamental nature of the inverted index stats, leading you to look for 
even smaller, higher hoops to try and jump through.

It would be far simpler to just ask for the exact set of N documents you 
want from Solr in default order, re-order the resulting documents in the 
magic order you already know and care about, and then give that modified 
response to your front end.
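
A minimal sketch of that re-ordering step, assuming the documents have
already been parsed out of Solr's JSON response into a list of dicts. The
field name `id`, the sample documents and the helper `reorder_by_ids` are
all hypothetical.

```python
def reorder_by_ids(docs, wanted_ids, id_field="id"):
    """Return the docs sorted by the position of their id in wanted_ids."""
    # Map each id to its desired position once, so the sort key is a dict lookup.
    rank = {doc_id: pos for pos, doc_id in enumerate(wanted_ids)}
    # Keep only docs whose id was actually requested, then sort by that rank.
    found = [d for d in docs if d.get(id_field) in rank]
    return sorted(found, key=lambda d: rank[d[id_field]])


# Hypothetical docs, in whatever order Solr happened to return them.
solr_docs = [
    {"id": "c", "title": "third"},
    {"id": "a", "title": "first"},
    {"id": "b", "title": "second"},
]

# The order the application already knows; "d" did not match in Solr.
wanted = ["a", "b", "c", "d"]

ordered = reorder_by_ids(solr_docs, wanted)
print([d["id"] for d in ordered])  # ['a', 'b', 'c']
```

Ids that Solr did not return are simply skipped, so the other search criteria
still decide which documents appear; only their order is fixed on the client.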


-Hoss
http://www.lucidworks.com/


Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Apoorva Gaurav
OK, let's for a moment forget about this specific use case and consider a
more general case. Let's say the field name is keywords and we are storing
text in it; the query is of the type keywords:(word1 OR word2 ... OR wordN).
The client is relying on the default relevancy-based sort returned by Solr.
Some documents can get penalised because of other documents which were
deleted. Is this functionality correct?






-- 
Thanks & Regards,
Apoorva


Re: docFreq coming to be more than 1 for unique id field

2014-06-17 Thread Chris Hostetter

: text in it, query is of the type keywords:(word1 OR word2 ... OR wordN).
: The client is relying on default relevancy based sort returned by solr.
: Some documents can get penalised because of some other documents which were
: deleted. Is this functionality correct?

Yes, because term stats are over the entire index, including deleted 
documents still in segments -- information about deletions isn't purged 
from the index until a segment is merged and the stats are recomputed over 
the docs/terms in the new segment.

The only way to get those types of statistics at request time such that 
they were *not* affected by deleted documents would involve scanning every 
doc to compute them -- which would defeat the point of having the inverted 
index.
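
The behaviour described above can be imitated with a toy model. This is not
Lucene code, only a sketch under simplifying assumptions: each segment keeps
per-term postings, a delete merely marks a doc number, and a merge rebuilds
the postings from live docs only. All class and function names are invented.

```python
class Segment:
    """Toy segment: per-term postings plus a set of deleted doc numbers."""

    def __init__(self, docs):
        # docs is a list of term lists; the list position is the doc number.
        self.docs = docs
        self.deleted = set()
        self.postings = {}
        for doc_num, terms in enumerate(docs):
            for term in set(terms):
                self.postings.setdefault(term, []).append(doc_num)

    def delete(self, doc_num):
        # A delete only marks the doc; the postings are left untouched.
        self.deleted.add(doc_num)

    def doc_freq(self, term):
        # Read straight off the postings, so deleted docs are still counted.
        return len(self.postings.get(term, []))

    def live_docs(self):
        return [t for n, t in enumerate(self.docs) if n not in self.deleted]


def index_doc_freq(segments, term):
    """docFreq over the whole index is the sum over its segments."""
    return sum(s.doc_freq(term) for s in segments)


def merge(segments):
    """Build one new segment from live docs only, recomputing the stats."""
    return Segment([terms for s in segments for terms in s.live_docs()])


# Updating doc "id1": the old version in seg1 is marked deleted and the
# new version is written to a fresh segment.
seg1 = Segment([["id1"], ["id2"]])
seg2 = Segment([["id1"]])
seg1.delete(0)

before = index_doc_freq([seg1, seg2], "id1")
after = index_doc_freq([merge([seg1, seg2])], "id1")
print(before, after)  # 2 1
```

Computing the "true" docFreq without a merge would mean checking every
posting against the deleted set at query time, which is the scan the
paragraph above warns about.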


-Hoss
http://www.lucidworks.com/