Shawn, I was working on something very similar... Perhaps we should also create a Jira issue for this monitoring problem?

Thanks,
Jason

On Fri, Mar 26, 2010 at 6:59 AM, Shawn Smith <ssmit...@gmail.com> wrote:
> We're using Solr 1.4 Java replication, which seems to be working
> nicely. While writing production monitors to check that replication
> is healthy, I think we've run into a bug in the status reporting of
> the "../solr/replication?command=details" command. (I know it's
> experimental...)
>
> Our monitor parses the replication?command=details XML and checks that
> replication lag is reasonable by diffing the indexVersion of the
> master and slave indices to make sure it's within a reasonable time
> range.
>
> Our monitor also compares the first elements of
> "indexReplicatedAtList" and "replicationFailedAtList" lists to see if
> the last replication attempt failed. This is where we're having a
> problem with the monitor throwing false errors. It looks like there's
> a bug that causes successful replications to be considered failures.
> The bug is triggered immediately after a slave restarts when the slave
> is already in sync with the master. Each no-op replication attempt
> after restart is considered a failure until something on the master
> changes and replication has to actually do work.
>
> From the code, it looks like "SnapPuller.successfulInstall" starts out
> false on restart. If the slave starts out in sync with the master,
> then each no-op replication poll leaves "successfulInstall" set to
> false which makes SnapPuller.logReplicationTimeAndConfFiles log the
> poll as a failure. SnapPuller.successfulInstall stays false until the
> first time replication actually has to do something, at which point it
> gets set to true, and then everything is OK.
>
> Thanks,
> Shawn
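
For the Jira issue, the behaviour Shawn describes could be illustrated with a toy model along these lines. This is only a sketch in Python, not the actual SnapPuller code: the class, its fields, and the helper function are hypothetical names. It also assumes that every poll is recorded in the "replicated at" list and that a failed poll is additionally recorded in the "failed at" list with the same timestamp, which is what the monitor's comparison of the two first elements seems to rely on.

```python
class SlaveModel:
    """Toy model of the flag behaviour described in the thread (not Solr code)."""

    def __init__(self, index_version):
        self.index_version = index_version
        # Assumption from the thread: the flag starts out False after a restart.
        self.successful_install = False
        # Newest entry first, mimicking indexReplicatedAtList / replicationFailedAtList.
        self.replicated_at = []
        self.failed_at = []

    def poll(self, master_version, now):
        """Simulate one replication poll against the master at time `now`."""
        if master_version != self.index_version:
            # Real work: pull the new index and mark the install as successful.
            self.index_version = master_version
            self.successful_install = True
        # A no-op poll (versions already equal) leaves the flag untouched,
        # so after a restart it is still False here.
        self.replicated_at.insert(0, now)
        if not self.successful_install:
            # The in-sync, no-op poll gets logged as a failure.
            self.failed_at.insert(0, now)


def last_attempt_failed(slave):
    """Monitor-style check: compare the first elements of the two lists."""
    return bool(slave.failed_at) and slave.failed_at[0] == slave.replicated_at[0]
```

With a slave created as `SlaveModel(5)`, calling `poll(5, now)` models a no-op poll right after restart, and `last_attempt_failed` then reports a failure even though the slave is in sync. Once a call such as `poll(6, now)` forces a real install, the flag flips to true and later no-op polls are no longer reported as failures, matching the description above.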