I've confirmed that the same thing happens even if the node that's killed
hard is not the controller.  Also, across several trials, recovery took
between 10 and 30 seconds.

Jason

On Wed, Apr 8, 2015 at 1:31 PM, Jason Rosenberg <j...@squareup.com> wrote:

> Hello,
>
> I'm still trying to get to the bottom of an issue we had previously, with
> an unclean shutdown during an upgrade to 0.8.2.1 (from 0.8.1.1).  In that
> case, the controlled shutdown was interrupted, and the node was shut down
> abruptly.  This resulted in about 5 minutes of unavailability for most
> partitions.  (I think that issue is related to the one reported by Thunder
> in the thread titled "Problem with node after restart no partitions?".)
>
> Anyway, while investigating that, I've gotten side-tracked trying to
> understand what the expected behavior should be if the controller node
> dies abruptly.
>
> To test this, I have a small test cluster (2 nodes, 100 partitions, each
> with replication factor 2, using 0.8.2.1).  There are also a few test
> producer clients, some of them high volume....
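>
> (For reference, a topic of this shape can be created with the stock 0.8.2
> tooling; the names below are placeholders:
>
>   bin/kafka-topics.sh --create --zookeeper localhost:2181 \
>     --topic test-topic --partitions 100 --replication-factor 2
>
> In my test the partitions may be spread over several such topics.)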
>
> I intentionally killed the controller node hard.  For about 10 seconds,
> the second node spammed its logs trying to fetch data for the partitions
> it was following on the node that was killed.  Finally, after about 10
> seconds, the second node elected itself the new controller, and things
> slowly recovered.
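>
> (If it helps frame the timing: my understanding is that after a hard kill,
> the surviving broker can't act until ZooKeeper expires the dead broker's
> session and drops its ephemeral znodes (including /controller), so failure
> detection is bounded by the session timeout.  A sketch of the relevant
> broker setting, assuming the 0.8.2 default:
>
>   # server.properties
>   # how long ZooKeeper waits before expiring a hard-killed broker's session
>   zookeeper.session.timeout.ms=6000
>
> I may be off on this, which is partly why I'm asking below.)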
>
> Clients could not successfully produce to the affected partitions until
> the new controller was elected (and their metadata requests to discover
> the new partition leaders failed in the meantime).
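>
> (On the client side, the election window is papered over only as far as
> the retry settings allow; a sketch, assuming the old producer and its
> 0.8.2 defaults:
>
>   # producer config
>   message.send.max.retries=3   # resend attempts after a failed send
>   retry.backoff.ms=100         # pause before refreshing metadata and retrying
>
> With defaults like these, a 10-30 second window exhausts the retries,
> which is why the unavailability is visible to clients at all.)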
>
> I would have expected the cluster to recover more quickly when a node
> fails, given available replicas that can become leader and start receiving
> data.  With just 100 partitions, I would have expected this recovery to
> happen very quickly.  (In our previous issue, where recovery seemed to
> take 5 minutes, the longer duration was probably related to a much larger
> number of partitions.)
>
> Anyway, before I start filing JIRAs and attaching log snippets, I'd like
> to understand what the expected behavior should be.
>
> If a controller (or really any node in the cluster) undergoes an unclean
> shutdown, how should the cluster respond in keeping replicas available
> (assuming all replicas were in the ISR before the shutdown)?  How fast
> should controller and partition leader election happen in this case?
>
> Thanks,
>
> Jason
>
