[jira] [Commented] (KAFKA-5200) If a replicated topic is deleted with one broker down, it can't be recreated

2018-10-05 Thread Adam Elliott (JIRA)


[ 
https://issues.apache.org/jira/browse/KAFKA-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16640384#comment-16640384
 ] 

Adam Elliott commented on KAFKA-5200:
-------------------------------------

My team runs a multi-tenant Kafka cluster with a lot of diverse uses, and one 
of the services we provide is an API for managed topic creation/deletion. The 
cluster is large (> 100 nodes) and so it's pretty likely that, for whatever 
reason, at least one node will be down at any given point--and sometimes for 
extended periods.
 
We're currently struggling with the behaviour described in this issue. From what 
I can see in the source, this is intentional behaviour. We don't have control over 
when clients choose to delete topics, so we can't reasonably block deletions 
for reasons that they would see as arbitrary ("some backend server is down, try 
again later").
 
The open-source partition reassignment tool _does_ work, at least as of the 
version we're using, to move replicas off dead brokers, but only if the topic 
hasn't already been deleted. If it has, the only remedy is manual surgery on 
the ZooKeeper state followed by bouncing the controller.
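
For the case where the topic has not been deleted yet, a minimal sketch of 
building the reassignment plan might look like the following. It assumes the 
JSON layout that kafka-reassign-partitions.sh accepts through 
--reassignment-json-file; the function name, broker IDs and topic name are 
hypothetical, and the choice of replacement broker is deliberately naive.

```python
import json


def plan_without_dead_broker(assignments, dead_broker, live_brokers):
    """Build a reassignment plan that swaps dead_broker out of every replica list.

    assignments: list of {"topic": str, "partition": int, "replicas": [int, ...]}
    live_brokers: iterable of broker IDs that are currently up
    Partitions that have no replica on the dead broker are left out of the plan.
    """
    partitions = []
    for entry in assignments:
        replicas = list(entry["replicas"])
        if dead_broker not in replicas:
            continue
        # Live brokers that do not already hold a replica of this partition.
        candidates = [
            b for b in sorted(live_brokers)
            if b != dead_broker and b not in replicas
        ]
        if not candidates:
            raise ValueError(
                f"no spare live broker for {entry['topic']}-{entry['partition']}"
            )
        # Naive spreading: rotate through the candidates by partition number.
        replacement = candidates[entry["partition"] % len(candidates)]
        replicas[replicas.index(dead_broker)] = replacement
        partitions.append(
            {
                "topic": entry["topic"],
                "partition": entry["partition"],
                "replicas": replicas,
            }
        )
    return {"version": 1, "partitions": partitions}


if __name__ == "__main__":
    # Hypothetical current assignment; broker 3 is the dead one.
    current = [
        {"topic": "orders", "partition": 0, "replicas": [1, 2, 3]},
        {"topic": "orders", "partition": 1, "replicas": [2, 4, 5]},
    ]
    plan = plan_without_dead_broker(current, dead_broker=3, live_brokers={1, 2, 4, 5})
    print(json.dumps(plan, indent=2))
```

The printed JSON would be saved to a file and passed to the reassignment tool. 
This does not help once the topic is already pending deletion.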
 
There's one additional factor which makes this bug worse: if too many topics 
are "half-deleted" at once, the controller crashes or becomes unresponsive, at 
which point a minor annoyance for one of our customers becomes something much 
more serious.
 
I've had a look at the various deletion-related state machines and I don't see 
an easy fix. I also haven't seen much mention or discussion of this problem 
apart from this issue.

> If a replicated topic is deleted with one broker down, it can't be recreated
> 
>
> Key: KAFKA-5200
> URL: https://issues.apache.org/jira/browse/KAFKA-5200
> Project: Kafka
> Issue Type: Improvement
> Components: core
> Reporter: Edoardo Comar
> Priority: Major
>
> In a cluster with 5 brokers, replication factor=3 and min in-sync replicas=2,
> one broker went down.
> A user's app, of course unaware of that, deleted a topic that 
> (unknowingly) had a replica on the dead broker.
> The topic went into 'pending delete' mode.
> The user then tried to recreate the topic, which failed, so his app was left 
> stuck: no working topic and no ability to create one.
> The reassignment tool fails to move the replica off the dead broker, 
> specifically because the broker with the partition replica to move is dead :-)
> Incidentally, the confluent-rebalancer docs say
> http://docs.confluent.io/current/kafka/post-deployment.html#scaling-the-cluster
> > Supports moving partitions away from dead brokers
> It'd be nice to similarly improve the open-source reassignment tool.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (KAFKA-5200) If a replicated topic is deleted with one broker down, it can't be recreated

2017-12-08 Thread Matthias Rampke (JIRA)

[ 
https://issues.apache.org/jira/browse/KAFKA-5200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283713#comment-16283713
 ] 

Matthias Rampke commented on KAFKA-5200:


To expand on the workaround [~huxi_2b] proposed:

If you cannot resurrect the dead broker itself, you can make Kafka act as if 
you did:

# Start a new broker, but then shut it down quickly (before any newly created 
partitions are assigned to it).
# In meta.properties, change the broker ID to that of the dead broker.
# Start it.
# Watch its logs: it will pick up the pending deletions and complete them, or 
you can reassign partitions at this point.
# Stop it again.

This may be problematic if you have a lot of partition creation going on, 
because you need to avoid getting any partitions assigned to this broker while 
it's running, but otherwise this works without downtime.
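
The meta.properties edit in the second step can be sketched as below. This is 
only an illustration, assuming a ZooKeeper-based broker whose meta.properties 
file (one per log directory) contains a broker.id=<n> line; the function names 
and the path in the usage line are hypothetical, and the broker must be 
stopped while the file is changed.

```python
from pathlib import Path


def with_broker_id(meta_text, dead_broker_id):
    """Return meta.properties content with broker.id set to dead_broker_id.

    All other lines (comments, version, ...) are kept unchanged.
    """
    out = []
    replaced = False
    for line in meta_text.splitlines():
        key, sep, _ = line.partition("=")
        if sep and key.strip() == "broker.id":
            out.append(f"broker.id={dead_broker_id}")
            replaced = True
        else:
            out.append(line)
    if not replaced:
        raise ValueError("no broker.id entry found in meta.properties")
    return "\n".join(out) + "\n"


def rewrite_meta_properties(path, dead_broker_id):
    """Rewrite the file at path in place, keeping a .bak copy of the original."""
    path = Path(path)
    original = path.read_text()
    path.with_name(path.name + ".bak").write_text(original)
    path.write_text(with_broker_id(original, dead_broker_id))
```

A call such as rewrite_meta_properties("/var/kafka-logs/meta.properties", 3) 
would have to be repeated for every log directory the broker uses.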



