On 9/7/2017 8:54 AM, Webster Homer wrote:
> I am not concerned about deleted documents. I am concerned that the same
> search gives different results after each search. The top document seems to
> cycle between 3 different documents
>
> I have an enhanced collections info api call that calls the core admin api
> to get the index information for the replica.
> When I said the numdocs were the same I meant exactly that. maxdocs and
> deleted documents are not the same for the replicas, but the number of
> numdocs is.
>
> Or are you saying that the search is looking at deleted documents wouldn't
> that be a very significant bug?

Lucene takes a lot of information in the index into account when calculating the score. That includes deleted documents, because they are part of the index. When you delete a document, Lucene just makes a note saying "internal document ID number NNNN is deleted." The actual information for that document is not removed from the index, because doing so could take a very long time.

When you make queries against a replicated SolrCloud, the queries are load balanced across the entire cloud, so different queries will hit different replicas. With different numbers of deleted documents in different replicas (which is not unusual), the scores are going to come out a little bit different on each query. If you're sorting by score (which is the default sort), that *can* affect the order.

Your replicas have a fairly high percentage of deleted documents, so there is a lot of extra information affecting the scores. The relative difference in the deleted document count between the replicas is high as well, so the results of repeated queries could differ substantially.

It is not a bug that Lucene and Solr look at deleted documents. Removing deleted document information from things like the score calculation would be VERY computationally intense, bordering on the impossible. To ensure good performance, Lucene doesn't even try. Because Lucene tracks deleted documents with a list of internal Lucene document IDs, those documents are easily removed from *results*, but their contents are an integral part of the index, and that information can only be truly removed by completely rewriting (merging) the index segments.

You can get rid of all deleted documents with an optimize operation, which is a forced merge of the entire index down to one segment -- but just like it sounds, that is a complete rewrite of the index. It involves a huge amount of CPU resources and disk I/O, and can severely impact normal indexing and query operations while it's happening.

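To make the earlier point about scores concrete, here is a rough Python sketch of how deleted documents can shift a score. It assumes a BM25-style inverse document frequency, which may not match the similarity configured on your collection, and the replica names and counts are made up for illustration. The point is only that the index statistics still count deleted documents until their segments are merged, so two replicas with the same numDocs can produce different values.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """BM25-style idf. Both inputs are index statistics, so they still
    include deleted documents that have not been merged away."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def idf_per_replica(replicas: dict) -> dict:
    """Compute the idf of one term for each replica's statistics."""
    return {
        name: bm25_idf(stats["doc_freq"], stats["doc_count"])
        for name, stats in replicas.items()
    }


# Hypothetical numbers: every replica has the same live documents, but a
# different amount of deleted data still sitting in its segments.
replicas = {
    "replica1": {"doc_count": 1_000_000, "doc_freq": 1_000},
    "replica2": {"doc_count": 1_300_000, "doc_freq": 1_400},
    "replica3": {"doc_count": 1_150_000, "doc_freq": 1_050},
}

if __name__ == "__main__":
    for name, idf in idf_per_replica(replicas).items():
        print(f"{name}: idf = {idf:.4f}")
```

Small differences like these feed into every matching term, which is enough to reorder documents whose scores are close together.
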
If the collection is extremely large, an optimize could take hours. For indexes that change rapidly, optimize is strongly discouraged, except as an occasional "clean things up" operation, run during non-peak times.

Thanks,
Shawn