No sanity checks before replicating files?

2009-05-21 Thread Damien Tournoud
Hi list,

We have deployed an experimental Solr 1.4 cluster (a master/slave
setup, with automatic promotion of the slave to master in case of
failure) on drupal.org, to manage our medium-sized index (3GB, about
400K documents).

One of the problems we are facing is that there seem to be no sanity
checks before downloading files. Take the following scenario:

 - initial situation: s1 is master, s2 is slave
 - s1 fails, the virtual IP falls back to s2
 - some updates happen on s2
 - suppose now that s1 comes back online: s2 tries to replicate from
s1, but after downloading all the files (3GB) the commit fails
because the local index has been updated in the meantime; the
replication fails, and the whole process restarts at the next poll
(all the index files are downloaded again, the commit fails again,
and so on)

We are considering configuring each server to replicate from the
virtual IP, which should solve that issue for us, but couldn't the
slave do some sanity checks before trying to download all the files
from the master?
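
To make the idea concrete, here is a very rough sketch (in Python,
with hostnames and port as placeholders for our setup) of the kind of
pre-flight check we have in mind, based on the replication handler's
indexversion and fetchindex commands:

# Hypothetical pre-flight check (hostnames/port are placeholders for
# our setup): only pull the index when the master is actually ahead.
import json
from urllib.request import urlopen

MASTER = "http://s1:8983/solr/replication"
SLAVE = "http://s2:8983/solr/replication"

def index_version(handler_url):
    # Ask the node's ReplicationHandler for its index version and generation.
    with urlopen(handler_url + "?command=indexversion&wt=json") as resp:
        data = json.load(resp)
    return data["indexversion"], data["generation"]

master_version, _ = index_version(MASTER)
slave_version, _ = index_version(SLAVE)

if master_version > slave_version:
    # The master really is ahead, so downloading its files makes sense.
    urlopen(SLAVE + "?command=fetchindex")
else:
    # The slave is as fresh or fresher (our failover scenario): skip the
    # 3GB download that could never be committed anyway.
    print("local index is newer or equal; not replicating")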

Thanks in advance for any help you could provide,

Damien Tournoud


Re: No sanity checks before replicating files?

2009-05-21 Thread Damien Tournoud
Hi Otis,

Thanks for your answer.

On Thu, May 21, 2009 at 7:14 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 Interesting, this is similar to my suggestion to another person I just 
 replied to here on solr-user.
 Have you actually run into this problem?  I haven't tried it, but I'd think 
 the first next replication (copying index from s1 to s2) would not 
 necessarily fail, but would simply overwrite any changes that were made on s2 
 while it was serving as the master.  Is that not what happens?

No, it doesn't. For some reason, Solr downloads all the files of the
index but fails to commit the changes locally. At the next poll, the
process restarts. Not only does this clog the network, it also
needlessly uses resources on the newly promoted server, until we
change its configuration.
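
As a stopgap on our side, we will probably make our failover script
switch off polling on the promoted node, roughly along these lines
(again only a sketch, with a placeholder hostname/port; disablepoll
and enablepoll are the replication handler's polling switches):

# Sketch of a failover hook (hostname/port are placeholders): once s2
# is promoted to master, stop it from polling the old master so it
# does not keep re-downloading an index it can never commit.
from urllib.request import urlopen

PROMOTED = "http://s2:8983/solr/replication"

def promote(handler_url):
    # Disable slave-side polling on the node we just promoted.
    urlopen(handler_url + "?command=disablepoll")

def demote(handler_url):
    # Re-enable polling when the node goes back to being a slave.
    urlopen(handler_url + "?command=enablepoll")

promote(PROMOTED)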

 If that's what happens, then I think what you'd simply have to do is to:

 1) bring s1 back up, but don't make it a master immediately
 2) take away the master role from s2
 3) make s1 copy the index from s2, since s2 might have a more up to date 
 index now
 4) make s1 the master

Once s2 is the master, we want it to stay that way. We will reassign
s1 as the slave at a later stage, when resources allow. What worries
me is the strange behavior of Solr 1.4 replication when the slave
index is fresher than the master's.

Damien