We had an issue the other day where our ZK servers were running poorly, and in
an effort to get them healthy again we ended up rolling back the cluster state.
In retrospect that was not the right fix for the problem we were facing, and it
exposed a second problem: many of our clients could not reconnect with their
existing sessions because their last-seen zxid was higher than the servers'
(expected), but the only error they got on the reconnect attempt was a plain
disconnected error. As a result, most of our clients had to be bounced.

Aside from trying hard never to roll back the cluster state, does anyone have a
way of dealing with this situation when it occurs? Should we consider enhancing
the error returned to the client, so that it can detect that it is ahead of the
quorum zxid and react sensibly? Alternatively, since the client sends its
sessionId along with the zxid, perhaps the server could check whether the
sessionId exists before checking the zxid. If the session is unknown, that
would produce an expired-session signal, which my client code could handle
cleanly.

Any ideas or suggestions would be welcome.

C
