expected behavior if a node undergoes unclean shutdown

2015-04-08 Thread Jason Rosenberg
Hello,

I'm still trying to get to the bottom of an issue we had previously, with
an unclean shutdown during an upgrade to 0.8.2.1 (from 0.8.1.1).  In that
case, the controlled shutdown was interrupted and the node was shut down
abruptly.  This resulted in about 5 minutes of unavailability for most
partitions.  (I think that issue is related to the one Thunder reported in
the thread titled "Problem with node after restart no partitions?".)

Anyway, while investigating that, I've gotten side-tracked trying to
understand what the expected behavior should be if the controller node
dies abruptly.

To test this, I set up a small test cluster (2 nodes, 100 partitions, each
with replication factor 2, running 0.8.2.1).  There are also a few test
producer clients, some of them high volume.

I intentionally killed the controller node hard.  For about 10 seconds,
the second node spammed its logs trying to fetch data for the partitions
it was following on the killed node.  Finally, after about 10 seconds,
the second node elected itself the new controller, and things slowly
recovered.
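For what it's worth, that ~10-second window is consistent with failure detection happening via ZooKeeper session expiry rather than anything Kafka-specific: a hard-killed broker never closes its session, so the survivor only notices once the session times out (governed by zookeeper.session.timeout.ms on the brokers, and expired in tick-granularity buckets on the ZooKeeper side). A rough sketch of the worst-case detection latency under that model; the 6 s timeout and 2 s tick below are illustrative assumptions, not measured defaults:

```python
# Rough model of worst-case failure-detection latency when a broker is
# killed hard (kill -9) and its ZooKeeper ephemeral nodes must expire.
# The numbers passed in below are assumptions for illustration only.

def worst_case_detection_ms(session_timeout_ms, tick_time_ms):
    """ZooKeeper expires sessions in tick-granularity buckets, so the
    observed latency can exceed the nominal timeout by up to one tick."""
    return session_timeout_ms + tick_time_ms

# Hypothetical settings: 6 s session timeout, 2 s tick time.
latency = worst_case_detection_ms(6000, 2000)
print(latency)  # upper bound, in ms, before the session is declared dead
```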

Clients could not successfully produce to the affected partitions until the
new controller was elected (their metadata requests failed while trying to
discover the new partition leader).
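From the client's point of view, that unavailability window looks like repeated metadata fetches until the new leader shows up. A minimal sketch of that retry loop; the callable and names here are placeholders for illustration, not the 0.8.2 producer API:

```python
import itertools
import time

def wait_for_leader(fetch_leader, backoff_s=0.05, timeout_s=5.0):
    """Poll a metadata-fetch callable until it reports a leader,
    mirroring the retry loop a producer runs while a partition is
    offline.  fetch_leader() returns a broker id, or None while
    leadership is still unassigned."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        leader = fetch_leader()
        if leader is not None:
            return leader
        time.sleep(backoff_s)
    raise TimeoutError("no leader elected within %.1fs" % timeout_s)

# Simulated metadata source: no leader for the first 3 polls, then broker 2.
responses = itertools.chain([None, None, None], itertools.repeat(2))
print(wait_for_leader(lambda: next(responses)))  # -> 2
```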

I would have expected the cluster to recover more quickly when a node
fails, given that replicas were available to become leader and start
receiving data.  With just 100 partitions, I would have expected this
recovery to happen very quickly.  (In our previous issue, where recovery
seemed to take 5 minutes, the longer duration was probably related to a
much larger number of partitions.)
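The 10-second vs 5-minute difference is at least consistent with leader-election work that scales with partition count: once the new controller takes over, it has to move every affected partition through leader election and state changes, so total unavailability looks roughly like a fixed detection delay plus a per-partition cost. A back-of-the-envelope model; the 30 ms per-partition cost is a made-up illustrative number, not a measurement:

```python
def estimated_recovery_ms(detection_ms, n_partitions, per_partition_ms):
    """Crude model: fixed failure-detection delay plus serialized
    per-partition leader-election/state-change work on the controller."""
    return detection_ms + n_partitions * per_partition_ms

# Assumed 10 s detection delay and 30 ms of work per partition:
small = estimated_recovery_ms(10_000, 100, 30)     # 100 partitions
large = estimated_recovery_ms(10_000, 10_000, 30)  # 10,000 partitions
print(small, large)  # -> 13000 310000
```

Under those (made-up) numbers, 100 partitions recover in seconds while 10,000 take minutes, which matches the shape of what we observed.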

Anyway, before I start filing Jiras and attaching log snippets, I'd like
to understand what the expected behavior should be.

If a controller (or really any node in the cluster) undergoes an unclean
shutdown, how should the cluster respond to keep replicas available
(assuming all replicas were in the ISR before the shutdown)?  How fast
should controller and partition leader election happen in this case?

Thanks,

Jason


Re: expected behavior if a node undergoes unclean shutdown

2015-04-08 Thread Jason Rosenberg
I've confirmed that the same thing happens even if the node that's killed
hard is not the controller.  Also, across several trials, recovery took
between 10 and 30 seconds.

Jason

On Wed, Apr 8, 2015 at 1:31 PM, Jason Rosenberg j...@squareup.com wrote:
