I would rather ask whether such small differences matter enough to justify this. Is this something users will _ever_ notice? Optimization is quite a heavyweight operation and is generally not recommended on indexes that change often, and an index that receives documents every 5 minutes is certainly changing far too often for optimizing to be the recommended fix.
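If you want to confirm that per-replica stats drift really is the cause, you can query each replica core directly with distrib=false and compare the scores. A minimal SolrJ sketch; the replica URLs and the query are placeholders for your setup, and I'm assuming a 4.x-era client:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CompareReplicaScores {
        public static void main(String[] args) throws SolrServerException {
            // Placeholder core URLs for the replicas of one shard -- adjust to your cluster.
            String[] replicas = {
                "http://host1:8983/solr/collection1_shard1_replica1",
                "http://host2:8983/solr/collection1_shard1_replica2"
            };
            for (String url : replicas) {
                HttpSolrServer server = new HttpSolrServer(url);
                SolrQuery q = new SolrQuery("name:whatever"); // placeholder query
                q.set("distrib", "false"); // ask this core only, no fan-out to other replicas
                q.setFields("id", "score");
                QueryResponse rsp = server.query(q);
                // If maxScore differs between replicas for the same query,
                // the per-replica tf/idf stats have drifted.
                System.out.println(url + " maxScore=" + rsp.getResults().getMaxScore());
                server.shutdown();
            }
        }
    }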
There is/has been work done on "distributed IDF" that should address this (I think), but I don't know its current status. But other than in a test setup, is it worth the effort?

Best,
Erick

On Wed, Oct 22, 2014 at 3:54 AM, Giovanni Bricconi
<giovanni.bricc...@banzai.it> wrote:
> I have made some small patches to the application to make this problem
> less visible, and I'm trying to perform the optimize once per hour;
> yesterday it took 5 minutes, this morning 15 minutes. Today I will
> collect some statistics, but the publication process sends documents
> every 5 minutes, and I think the optimize is taking too much time.
>
> I have no default mergeFactor configured for this collection; do you
> think that setting it to a small value could improve the situation? If
> I have understood correctly, merging segments more aggressively will
> keep similar stats on all nodes. It's ok to have the indexing process a
> little bit slower.
>
> 2014-10-21 18:44 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:
>
>> Giovanni:
>>
>> To see how this happens, consider a shard with a leader and two
>> followers. Assume your autocommit interval is 60 seconds on each.
>>
>> This interval can expire at slightly different "wall clock" times.
>> Even if the servers started perfectly in sync, they can drift slightly
>> out of sync. So you index a bunch of docs, and these replicas close
>> the current segment and open a new segment with slightly different
>> contents.
>>
>> Now docs come in that replace older docs. The tf/idf statistics
>> _include_ deleted-document data (which is purged on optimize). Given
>> that doc X can be in different segments (or, more accurately, segments
>> that get merged at different times on different machines), replica 1
>> may have slightly different stats than replica 2, thus computing
>> slightly different scores.
>>
>> Optimizing purges all data related to deleted documents, so it all
>> regularizes itself on optimize.
>>
>> Best,
>> Erick
>>
>> On Tue, Oct 21, 2014 at 11:08 AM, Giovanni Bricconi
>> <giovanni.bricc...@banzai.it> wrote:
>> > I noticed the problem again, and this time I was able to collect
>> > some data. In my paste http://pastebin.com/nVwf327c you can see the
>> > result of the same query issued twice; the 2nd and 3rd groups are
>> > swapped.
>> >
>> > I also pasted the clusterstate and the core state for each core.
>> >
>> > The logs didn't show any problems related to indexing, only some
>> > malformed queries.
>> >
>> > After doing an optimize the problem disappeared.
>> >
>> > So, is the problem related to documents that were deleted from the
>> > index?
>> >
>> > The optimization took 5 minutes to complete.
>> >
>> > 2014-10-21 11:41 GMT+02:00 Giovanni Bricconi
>> > <giovanni.bricc...@banzai.it>:
>> >
>> >> Nice!
>> >> I will monitor the index and try this if the problem comes back.
>> >> Actually the problem was due to small differences in score, so I
>> >> think it has the same origin.
>> >>
>> >> 2014-10-21 8:10 GMT+02:00 lboutros <boutr...@gmail.com>:
>> >>
>> >>> Hi Giovanni,
>> >>>
>> >>> we had this problem as well.
>> >>> The cause was that the different nodes had slightly different idf
>> >>> values.
>> >>>
>> >>> We solved this problem by doing an optimize operation, which
>> >>> really removes the deleted data.
>> >>>
>> >>> Ludovic.
>> >>>
>> >>> -----
>> >>> Jouve
>> >>> France.
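For reference, the optimize that Ludovic and Giovanni describe can also be triggered from SolrJ. A minimal sketch under the same assumptions as above (placeholder URL, 4.x-era client):

    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class OptimizeNow {
        public static void main(String[] args) throws Exception {
            // Placeholder URL -- any node hosting the collection will do.
            HttpSolrServer server = new HttpSolrServer("http://host1:8983/solr/collection1");
            // Merges the index down to one segment and purges deleted-document
            // data, which is what regularizes the scores across replicas.
            server.optimize();
            server.shutdown();
        }
    }

Keep in mind this does exactly what was discussed above: it rewrites the index into a single segment, so expect it to be I/O-heavy on a large index.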