I'm in the process of reassigning partitions away from failing machines, and the reassignment appears to be stuck. My guess is that our machines are failing at such a high rate that some partitions no longer have any live replicas at all. At this point I don't care about the data; I just want to get all partitions onto the set of machines that I know work. Is there some way I can do this? I am happy to manipulate ZooKeeper and bounce nodes if need be.
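
For illustration, here is a minimal sketch of how a reassignment plan that only names known-good brokers could be generated. It assumes the JSON layout accepted by kafka-reassign-partitions.sh (a "version" field plus a list of topic/partition/replicas entries). The function name, the partition count, the replication factor, and broker ids 30 and 31 are hypothetical placeholders; only the topic name "pings" and broker id 29 appear in this message.

```python
import json


def build_reassignment(topic, num_partitions, healthy_brokers, replication_factor):
    """Assign every partition of `topic` round-robin across healthy brokers.

    Returns a dict in the layout expected by kafka-reassign-partitions.sh.
    All arguments are illustrative inputs, not values read from a cluster.
    """
    n = len(healthy_brokers)
    if replication_factor > n:
        raise ValueError("replication factor exceeds number of healthy brokers")

    partitions = []
    for p in range(num_partitions):
        # Start each partition on a different broker so leaders spread out,
        # then take the next brokers in the ring as followers.
        replicas = [healthy_brokers[(p + i) % n] for i in range(replication_factor)]
        partitions.append({"topic": topic, "partition": p, "replicas": replicas})

    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical values: 4 partitions, brokers 29/30/31, replication factor 2.
    plan = build_reassignment("pings", 4, [29, 30, 31], 2)
    print(json.dumps(plan, indent=2))
```

The printed JSON could be saved to a file and passed to kafka-reassign-partitions.sh with --reassignment-json-file and --execute. Whether such a plan can complete for a partition whose replicas are all dead is exactly the open question here.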

A warning: this is due to failures of the Amazon EC2 d2 instance type. We spun up 9 d2.xlarge instances, and within a few hours 6 had failed under a Kafka workload. So yeah, bleeding edge.

One thing I've done is rebuild one of these nodes with the same broker id and name, but on a known working instance type. It came up and is now spewing this in the logs:

[2015-04-03 13:05:30,275] 805497 [kafka-request-handler-0] WARN kafka.server.KafkaApis - [KafkaApi-29] Produce request with correlation id 5849 from client ping_partitioner on partition [pings,245] failed due to Topic pings either doesn't exist or is in the process of being deleted

The topic most certainly should exist; however, I'm guessing the broker is complaining because there are no live replicas for that partition. Is there some way to get this broker to just become leader?
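
One way to check that guess is to compare each partition's replica list against the brokers that are currently registered. The sketch below assumes the topic znode (/brokers/topics/pings) holds JSON of the form {"version":1,"partitions":{"<id>":[broker ids]}} and that the live broker ids are the children of /brokers/ids. The function name and every broker id and assignment in the example are made up for illustration; fetching the data from ZooKeeper is left out.

```python
import json


def partitions_without_live_replicas(topic_znode_json, live_broker_ids):
    """Return the sorted partition ids whose replicas are all offline.

    `topic_znode_json` is the raw JSON string stored in the topic's znode;
    `live_broker_ids` is an iterable of broker ids currently registered.
    """
    assignment = json.loads(topic_znode_json)["partitions"]
    live = set(live_broker_ids)
    return sorted(
        int(partition)
        for partition, replicas in assignment.items()
        if not live.intersection(replicas)
    )


if __name__ == "__main__":
    # Hypothetical assignment: only brokers 29 and 30 are alive.
    znode = '{"version":1,"partitions":{"0":[29,30],"1":[31,32],"245":[33,34]}}'
    print(partitions_without_live_replicas(znode, [29, 30]))
```

Any partition this reports has no surviving replica to elect as leader, which would be consistent with the warning above.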

Wes
