I think it was possibly due to a bug in the Perl client combined with a retry bug in a user's code DoS'ing our system. Unfortunately, I don't have good enough logs to debug it any further. All I have are server-side logs showing a bad buffer in an exists call, repeated over and over.

On Sun, Aug 7, 2011 at 3:01 PM, Vishal Kher <[email protected]> wrote:
> Hi Camille,
>
> Can you share the kind of problems you were facing on the servers that
> forced you to rollback the cluster?
>
> Thanks.
> -Vishal
>
> On Thu, Aug 4, 2011 at 1:29 PM, Fournier, Camille F. <
> [email protected]> wrote:
>
>> We had an issue here the other day where the ZK servers were running
>> poorly, and in an effort to get them healthy again we ended up rolling back
>> the cluster state. While this was, in retrospect, not the right solution to
>> the problem we were facing, it brought up another problem. Namely, that many
>> of our clients couldn't reconnect with their sessions because their zxid was
>> too high (expected), but that the error they got when trying to do that
>> reconnection was just a vanilla disconnected error. The result was that most
>> of our clients had to be bounced.
>>
>> Aside from trying hard to avoid ever rolling back the cluster state, does
>> anyone have a way they deal with this situation if it occurs? Should we
>> consider enhancing the error message to the client so we could track the
>> fact that we were ahead of the quorum zxid and react sensibly? Alternately,
>> since we were sending a sessionId along with the zxid, perhaps it would be
>> nice to check to see if the sessionId exists before checking the zxid, which
>> would send an expired state signal which my client code could handle
>> cleanly.
>>
>> Any ideas or suggestions would be welcome.
>>
>> C
