I suspect it was due to a bug in the Perl client combined with
a retry bug in a user's code DoS'ing our system. Unfortunately, I don't
have good enough logs to debug it any further. All I have is
server-side logs showing a bad buffer in an exists call, repeated over
and over.

On Sun, Aug 7, 2011 at 3:01 PM, Vishal Kher <[email protected]> wrote:
> Hi Camille,
>
> Can you share the kind of problems you were facing on the servers that
> forced you to roll back the cluster?
>
> Thanks.
> -Vishal
>
> On Thu, Aug 4, 2011 at 1:29 PM, Fournier, Camille F. <
> [email protected]> wrote:
>
>> We had an issue here the other day where the ZK servers were running
>> poorly, and in an effort to get them healthy again we ended up rolling back
>> the cluster state. While this was, in retrospect, not the right solution to
>> the problem we were facing, it brought up another problem. Namely, that many
>> of our clients couldn't reconnect with their sessions because their zxid was
>> too high (expected), but that the error they got when trying to do that
>> reconnection was just a vanilla disconnected error. The result was that most
>> of our clients had to be bounced.
>>
>> Aside from trying hard to avoid ever rolling back the cluster state, does
>> anyone have a way they deal with this situation if it occurs? Should we
>> consider enhancing the error message to the client so we could track the
>> fact that we were ahead of the quorum zxid and react sensibly? Alternately,
>> since we were sending a sessionId along with the zxid, perhaps it would be
>> nice to check to see if the sessionId exists before checking the zxid, which
>> would send an expired state signal which my client code could handle
>> cleanly.
>>
>> Any ideas or suggestions would be welcome.
>>
>> C
>>
>>
>
