After further investigation, I've figured out that the issue is caused by
the follower not processing messages from the controller until its
ReplicaFetcherThread has shut down completely (which only happens when the
socket times out).

If the test waits for the socket to time out, the logs show that the
ReplicaFetcherThread shuts down completely, and immediately thereafter, the
UpdateMetadata requests get processed.
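
Waiting like this in a test usually comes down to polling a condition with
an upper bound on the wait. Here is a minimal sketch in Java of such a
helper. The names Await and until are hypothetical and not part of Kafka,
and the condition passed in (for example, a check that a broker reports
itself as leader) is assumed to be supplied by the test.

```java
import java.util.function.BooleanSupplier;

// Hypothetical test helper; not part of Kafka or any Kafka test utility.
final class Await {
    private Await() {}

    /**
     * Polls the condition every pollMs milliseconds until it returns true
     * or timeoutMs milliseconds have elapsed.
     * Returns true if the condition held before the deadline, false otherwise.
     */
    static boolean until(BooleanSupplier condition, long timeoutMs, long pollMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            if (condition.getAsBoolean()) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                // Preserve the interrupt for the caller and give up waiting.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }
}
```

With a timeout longer than the replica socket timeout, a call such as
Await.until(check, timeoutMs, 100) would return true only once the
remaining broker has processed the leadership change.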

Strangely, this happens even when controlled shutdown is enabled.

This sounds related to KAFKA-612 [1], which seems to have been fixed in
0.8.0. Are there other edge cases not covered by the fix? Is this a known
problem in 0.8.1.1?

Thanks,
Philippe
[1] https://issues.apache.org/jira/browse/KAFKA-612

On Wed, Feb 18, 2015 at 12:21 AM, Philippe Laflamme <plafla...@hopper.com>
wrote:

> Hi,
>
> I'm trying to replicate a broker shutdown in unit tests. I've got a simple
> cluster running with 2 brokers (and one ZK). I'm successfully able to
> create a topic with a single partition and replication factor of 2.
>
> I'd like to test shutting down the current leader for the partition and
> make sure my code handles the exceptions thrown such as
> NotLeaderForPartitionException.
>
> I can't seem to shut down a broker and have the remaining one report that
> it is now the leader for the partition. It looks as though the controller
> successfully changes leadership, but the broker itself is unaware of the
> change.
>
> Here's a gist of the (convoluted) logs[1].
>
> The sequence is as follows:
> 1- start 1 ZK and 2 brokers
> 2- create a topic (test-bogus) with 1 partition and a replication factor of 2
> 3- wait for leadership
> 4- ask the controller who is the leader
> 5- ask all brokers who is the leader
> 6- shut down the leader
> 7- wait for leadership
> 8- ask the controller who is the leader
> 9- ask the remaining broker who is the leader
>
> Steps 4-6 appear here in the logs[2]
> Steps 8-9 appear here[3]
>
> As you can see, the controller is aware of the leadership change, but not
> the broker. I've activated controlled shutdown and this is still happening.
> Any idea what may be causing this?
>
> I'm using Kafka 0.8.1.1 and ZK 3.4.5-cdh4.6
>
> I'm using a TopicMetadataRequest for asking the brokers and inspecting
> ControllerContext.partitionLeadershipInfo to fetch leadership from the
> Controller.
>
> Thanks
> Philippe
> [1] https://gist.github.com/plaflamme/60805bfe15ae0106304a
> [2]
> https://gist.github.com/plaflamme/60805bfe15ae0106304a#file-gistfile1-txt-L153-L158
> [3]
> https://gist.github.com/plaflamme/60805bfe15ae0106304a#file-gistfile1-txt-L227-L228
>
