I would rather ask whether such small differences matter enough to
do this. Is this something users will _ever_ notice? Optimization
is quite a heavyweight operation and is generally not recommended
on indexes that change often; an index that receives new documents
every 5 minutes certainly changes far too often for optimizing to
be recommended.
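To get a feel for how small these differences typically are, here is a rough sketch using the classic Lucene TF-IDF idf formula, 1 + ln(numDocs / (docFreq + 1)), where numDocs includes not-yet-purged deletes. The replica numbers below are made up purely for illustration:

```python
import math

def classic_idf(num_docs: int, doc_freq: int) -> float:
    """Classic Lucene idf: 1 + ln(numDocs / (docFreq + 1)).

    num_docs counts *all* docs in the segment data, including
    not-yet-purged deletes, which is why replicas whose segments
    merged at different times can disagree until an optimize.
    """
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical: replica 2 still carries 500 deleted docs and a few
# deleted postings for the term that replica 1 has already merged away.
idf_replica1 = classic_idf(num_docs=100_000, doc_freq=1_000)
idf_replica2 = classic_idf(num_docs=100_500, doc_freq=1_003)

print(idf_replica1)
print(idf_replica2)
print(abs(idf_replica1 - idf_replica2))  # a very small difference
```

The score gap is tiny, which is exactly why it only shows up as occasional reordering of documents with near-identical scores.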

There is (or has been) work on "distributed IDF" that should address
this, but I don't know its current status.
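For reference, I believe this work later landed as a pluggable stats cache (Solr 5.0 and up, if I remember right), configured in solrconfig.xml roughly like the fragment below; treat the exact class name as something to verify against your Solr version:

```xml
<!-- solrconfig.xml: fetch per-shard term statistics before scoring so
     all nodes score with the same global idf. ExactStatsCache adds an
     extra round trip per distributed request. -->
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
```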

But other than in a test setup, is it worth the effort?

Best,
Erick

On Wed, Oct 22, 2014 at 3:54 AM, Giovanni Bricconi
<giovanni.bricc...@banzai.it> wrote:
> I have made some small patches to the application to make this problem
> less visible, and I'm trying to perform the optimize once per hour;
> yesterday it took 5 minutes, this morning 15 minutes. Today I will
> collect some statistics, but the publication process sends documents
> every 5 minutes, and I think the optimize is taking too much time.
>
> I have no mergeFactor configured for this collection (so it's using the
> default); do you think that setting it to a small value could improve
> the situation? If I have understood correctly, merging segments more
> aggressively would keep similar stats on all nodes. It's OK if the
> indexing process becomes a little slower.
>
>
> 2014-10-21 18:44 GMT+02:00 Erick Erickson <erickerick...@gmail.com>:
>
>> Giovanni:
>>
>> To see how this happens, consider a shard with a leader and two
>> followers. Assume your autocommit interval is 60 seconds on each.
>>
>> This interval can expire at slightly different "wall clock" times.
>> Even if the servers started perfectly in sync, they can drift slightly
>> out of sync. So you index a bunch of docs, and these replicas close
>> the current segment and open a new segment with slightly different
>> contents.
>>
>> Now docs come in that replace older docs. The tf/idf statistics
>> _include_ deleted document data (which is purged on optimize). Given
>> that doc X can be in different segments (or, more accurately, segments
>> that get merged at different times on different machines), replica 1
>> may have slightly different stats than replica 2, thus computing
>> slightly different scores.
>>
>> Optimizing purges all data related to deleted documents, so it all
>> regularizes itself on optimize.
>>
>> Best,
>> Erick
>>
>> On Tue, Oct 21, 2014 at 11:08 AM, Giovanni Bricconi
>> <giovanni.bricc...@banzai.it> wrote:
>> > I noticed the problem again, and this time I was able to collect some
>> > data. In my paste http://pastebin.com/nVwf327c you can see the result
>> > of the same query issued twice; the 2nd and 3rd groups are swapped.
>> >
>> > I pasted also the clusterstate and the core state for each core.
>> >
>> > The logs didn't show any problems related to indexing, only some
>> > malformed queries.
>> >
>> > After doing an optimize the problem disappeared.
>> >
>> > So, is the problem related to documents that were deleted from the
>> > index?
>> >
>> > The optimization took 5 minutes to complete.
>> >
>> > 2014-10-21 11:41 GMT+02:00 Giovanni Bricconi
>> > <giovanni.bricc...@banzai.it>:
>> >
>> >> Nice!
>> >> I will monitor the index and try this if the problem comes back.
>> >> Actually the problem was due to small differences in score, so I
>> >> think the problem has the same origin.
>> >>
>> >> 2014-10-21 8:10 GMT+02:00 lboutros <boutr...@gmail.com>:
>> >>
>> >>> Hi Giovanni,
>> >>>
>> >>> We had this problem as well.
>> >>> The cause was that the different nodes had slightly different idf
>> >>> values.
>> >>>
>> >>> We solved it by doing an optimize operation, which really removes
>> >>> the suppressed (deleted) data.
>> >>>
>> >>> Ludovic.
>> >>>
>> >>>
>> >>>
>> >>> -----
>> >>> Jouve
>> >>> France.
>> >>>
>> >>
>> >>
>>
