Re: ReplicationHandler reports incorrect replication failures

Jason Rutherglen Mon, 29 Mar 2010 10:11:06 -0700

Shawn,

I was working on something very similar... Let's perhaps also create a
Jira issue for this monitoring?


Thanks,

Jason

On Fri, Mar 26, 2010 at 6:59 AM, Shawn Smith <ssmit...@gmail.com> wrote:
> We're using Solr 1.4 Java replication, which seems to be working
> nicely.  While writing production monitors to check that replication
> is healthy, I think we've run into a bug in the status reporting of
> the "../solr/replication?command=details" command.  (I know it's
> experimental...)
>
> Our monitor parses the replication?command=details XML and checks that
> replication lag is reasonable by diffing the indexVersion of the
> master and slave indices to make sure it's within a reasonable time
> range.
>
> Our monitor also compares the first elements of
> "indexReplicatedAtList" and "replicationFailedAtList" lists to see if
> the last replication attempt failed.  This is where we're having a
> problem with the monitor throwing false errors.  It looks like there's
> a bug that causes successful replications to be considered failures.
> The bug is triggered immediately after a slave restarts when the slave
> is already in sync with the master.  Each no-op replication attempt
> after restart is considered a failure until something on the master
> changes and replication has to actually do work.
>
> From the code, it looks like "SnapPuller.successfulInstall" starts out
> false on restart.  If the slave starts out in sync with the master,
> then each no-op replication poll leaves "successfulInstall" set to
> false, which makes SnapPuller.logReplicationTimeAndConfFiles log the
> poll as a failure.  SnapPuller.successfulInstall stays false until the
> first time replication actually has to do something, at which point it
> gets set to true, and then everything is OK.
>
> Thanks,
> Shawn
>
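
For reference, the "did the last attempt fail" check that Shawn describes might look roughly like the sketch below. It assumes the two lists have already been pulled out of the details XML as strings, newest first, and that a failed attempt shows up with the same timestamp at the head of both lists. That equality rule is my reading of the description, not something confirmed from the Solr source, and the class and method names are made up for illustration.

```java
import java.util.List;

// Hypothetical helper, not part of Solr.
class ReplicationCheck {
    /**
     * @param replicatedAt entries of "indexReplicatedAtList", newest first
     * @param failedAt     entries of "replicationFailedAtList", newest first
     * @return true if the most recent attempt appears to have failed
     */
    static boolean lastAttemptFailed(List<String> replicatedAt, List<String> failedAt) {
        // With no recorded attempts or no recorded failures there is nothing to flag.
        if (replicatedAt.isEmpty() || failedAt.isEmpty()) {
            return false;
        }
        // Assumption: a failed attempt is recorded in both lists with the same
        // timestamp, so matching heads mean the latest attempt failed.
        return replicatedAt.get(0).equals(failedAt.get(0));
    }
}
```

With the behaviour reported above, a check like this would flag every no-op poll after a slave restart, since each one is logged as a failure.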

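The flag behaviour described in the quoted report can be modelled in a few lines. This is a simplified sketch that only mirrors the description above; it is not the actual SnapPuller code, and the class name and the `poll` method are made up for illustration.

```java
// Simplified model of the reported behaviour; NOT the real SnapPuller source.
class PollModel {
    // Mirrors the flag named in the report; it starts out false after a restart.
    private boolean successfulInstall = false;

    /**
     * Simulates one replication poll.
     *
     * @param indexChanged true if the master has a newer index than the slave
     * @return true if the poll would be logged as a success
     */
    boolean poll(boolean indexChanged) {
        if (indexChanged) {
            // Replication actually did work, so the flag is set and stays set.
            successfulInstall = true;
        }
        // A no-op poll leaves the flag untouched. Before the first real
        // install it is still false, so the poll is logged as a failure.
        return successfulInstall;
    }
}
```

In this model, calling `poll(false)` on a fresh instance reports a failure every time, and only after the first `poll(true)` do later no-op polls report success, which matches the symptom in the report.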