Re: Consecutive calls to a query give different results

Webster Homer Fri, 08 Sep 2017 10:37:23 -0700

Thank you, Erick Erickson and Shawn Heisey for your excellent answers.
For some of our collections, it would seem that an occasional optimize
would be a good thing. However we have some collections that are updated
constantly

Would using the commit expungeDeletes help mitigate the issue?

I also came across a discussion of Lucene merge policies. and the
TieredMergePolicy.
Is there documentation about this? I notice that a couple of our replicas
in some of our collections have ~30% deleted documents which I would think
would contribute to the problem.
I have at least 3 collections that are updated constantly, and would not
lend themselves to being optimized what is the best approach for these?

Thanks

On Fri, Sep 8, 2017 at 9:47 AM, Shawn Heisey <apa...@elyograg.org> wrote:

> On 9/7/2017 8:54 AM, Webster Homer wrote:
> > I am not concerned about deleted documents. I am concerned that the same
> > search gives different results after each search. The top document seems
> to
> > cycle between 3 different documents
> >
> > I have an enhanced collections info api call that calls the core admin
> api
> > to get the index information for the replica.
> > When I said the numdocs were the same I meant exactly that. maxdocs and
> > deleted documents are not the same for the replicas, but the number of
> > numdocs is.
> >
> > Or are you saying that the search is looking at deleted documents
> wouldn't
> > that be a very significant bug?
>
> Lucene score calculations take a lot of information in the index into
> account when calculating the score.  That includes deleted documents,
> because they are part of the index.  When you delete a document, Lucene
> just makes a note saying "internal document ID number NNNN is deleted."
> The actual information for that document is not removed from the index,
> because doing so could take a very long time.
>
> When you make queries against a replicated SolrCloud, the queries are
> load balanced across the entire cloud, so different queries will hit
> different replicas.  With different numbers of deleted documents in
> different replicas (which is not unusual), the scores are going to come
> out a little bit different on each query.  If you're sorting by score
> (which is the default sort), that *can* affect the order.  Your replicas
> have a fairly high percentage of deleted documents, so there is a lot of
> extra information affecting the scores.  The relative difference in the
> deleted document count between the replicas is high as well, so multiple
> queries could be substantially different.
>
> It is not a bug that Lucene and Solr look at deleted documents.
> Removing deleted document information from things like the score
> calculation would be VERY computationally intense, bordering on the
> impossible.  To assure good performance, Lucene doesn't even try.
> Because the way Lucene tracks deleted documents is with a list of
> internal Lucene document IDs, those documents are easily removed from
> *results*, but their contents are an integral part of the index and that
> information can only be truly removed by completely rewriting (merging)
> the index segments.
>
> You can get rid of all deleted documents with an optimize operation,
> which is a forced merge of the entire index down to one segment -- but
> just like it sounds, that is a complete rewrite of the index.  It
> involves a huge amount of CPU resources and disk I/O, and can severely
> impact normal indexing and query operations while it's happening.  If
> the collection is extremely large, an optimize could take hours.  For
> indexes that change rapidly, optimize is strongly discouraged, except as
> an occasional "clean things up" operation, run during non-peak times.
>
> Thanks,
> Shawn
>
>

-- 

This message and any attachment are confidential and may be privileged or 
otherwise protected from disclosure. If you are not the intended recipient, 
you must not copy this message or attachment or disclose the contents to 
any other person. If you have received this transmission in error, please 
notify the sender immediately and delete the message and any attachment 
from your system. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not accept liability for any omissions or errors in this 
message which may arise as a result of E-Mail-transmission or for damages 
resulting from any unauthorized changes of the content of this message and 
any attachment thereto. Merck KGaA, Darmstadt, Germany and any of its 
subsidiaries do not guarantee that this message is free of viruses and does 
not accept liability for any damages caused by any virus transmitted 
therewith.

Click http://www.emdgroup.com/disclaimer to access the German, French, 
Spanish and Portuguese versions of this disclaimer.

Re: Consecutive calls to a query give different results

Reply via email to