Re: Kafka topic deletion still not working well

2016-03-13 Thread Stevo Slavić
Created https://issues.apache.org/jira/browse/KAFKA-3389 and
https://issues.apache.org/jira/browse/KAFKA-3390

On Sun, Mar 13, 2016 at 10:11 AM, Stevo Slavić  wrote:

> Is it normal/expected behavior during topic deletion that the to-be-deleted
> topic's partition ISRs will get updated/shrunk as replicas remove the fetcher
> and delete the partition log?
>
> On Fri, Mar 11, 2016 at 5:32 PM, Stevo Slavić  wrote:
>
>> Didn't finish the sentence; wanted to change the font but managed to hit the
>> Send button instead...
>>
>>
>> As you can see from the [1] replica server.log fragment, the replica had one
>> state in cache, while the actual state in ZK was different (the node didn't
>> exist anymore). I'm not sure why the /brokers/topics/foo/partitions/0/state
>> node did not exist in ZK. I mentioned that foo was likely created with Kafka
>> 0.8.2.x - maybe different Kafka versions use different metadata (ZK directory
>> layout), and 0.9.0.1, when deleting, expected something to exist in ZK that
>> didn't, since the topic was created with 0.8.2.x.
>>
>> But I guess this Kafka version difference (at topic creation vs. at deletion)
>> is unlikely to be the cause, since the replica had the state cached, so the
>> state node is likely to have existed at some point before the deletion
>> request. Not sure what could have deleted it. Maybe some other part of the
>> distributed topic deletion process deleted the state node. I will check more
>> logs, Kafka and ZK, to see if I can find what deleted the state node. Any
>> ideas?
>>
>> On Fri, Mar 11, 2016 at 5:22 PM, Stevo Slavić  wrote:
>>
>>> Sorry, I do not know if this is related to KAFKA-2937; I'll let you judge
>>> and decide whether that ticket needs to be reopened or a new one created.
>>>
>>> I'm still missing the full picture of what was happening, but here are more
>>> pieces.
>>>
>>> 1) Topic foo was created (likely with Kafka 0.8.2.x). Topic foo was never
>>> used (no messages published to it ever).
>>> 2) Topic bar was created with Kafka 0.9.0.1.
>>> 3) Deletion of topic foo was requested, and replica deletion got stuck
>>> (see the [1] server.log fragment).
>>> 4) Deletion of topic bar was requested; one bar replica was on the same
>>> broker as the foo replica whose deletion got stuck, and deletion of the bar
>>> replica was never even attempted.
>>> 5) The topics got into a non-deletable state (see the [2] controller.log
>>> fragment).
>>>
>>> As you can see from [1], the replica had one state in cache, while the
>>> actual state in ZK was different. I'm not sure why
>>> /brokers/topics/foo/partitions/0/state did not
>>>
>>> [1] replica server.log fragment
>>> ...
>>> [2016-03-04 09:42:03,495] INFO [ReplicaFetcherManager on broker 1]
>>> Removed fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
>>> [2016-03-04 09:42:03,533] INFO [ReplicaFetcherManager on broker 1]
>>> Removed fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
>>> [2016-03-04 09:42:13,894] INFO Partition [foo,0] on broker 1: Shrinking
>>> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
>>> [2016-03-04 09:42:13,897] WARN Conditional update of path
>>> /brokers/topics/foo/partitions/0/state with data
>>> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
>>> and expected version 68 failed due to
>>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>>> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
>>> [2016-03-04 09:42:13,898] INFO Partition [foo,0] on broker 1: Cached
>>> zkVersion [68] not equal to that in zookeeper, skip updating ISR
>>> (kafka.cluster.Partition)
>>> [2016-03-04 09:42:23,894] INFO Partition [foo,0] on broker 1: Shrinking
>>> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
>>> [2016-03-04 09:42:23,897] WARN Conditional update of path
>>> /brokers/topics/foo/partitions/0/state with data
>>> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
>>> and expected version 68 failed due to
>>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>>> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
>>> [2016-03-04 09:42:23,897] INFO Partition [foo,0] on broker 1: Cached
>>> zkVersion [68] not equal to that in zookeeper, skip updating ISR
>>> (kafka.cluster.Partition)
>>> [2016-03-04 09:42:33,894] INFO Partition [foo,0] on broker 1: Shrinking
>>> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
>>> [2016-03-04 09:42:33,897] WARN Conditional update of path
>>> /brokers/topics/foo/partitions/0/state with data
>>> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
>>> and expected version 68 failed due to
>>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>>> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
>>> [2016-03-04 09:42:33,897] INFO Partition [foo,0] on broker 1: Cached
>>> zkVersion [68] not equal to that in zookeeper, skip updating ISR
>>> (kafka.cluster.Partition)
>>> ...
>>>
>>> [2] controller.log fragment
>>> ...
>>> [201

Re: Kafka topic deletion still not working well

2016-03-13 Thread Stevo Slavić
Is it normal/expected behavior during topic deletion that the to-be-deleted
topic's partition ISRs will get updated/shrunk as replicas remove the fetcher
and delete the partition log?

On Fri, Mar 11, 2016 at 5:32 PM, Stevo Slavić  wrote:

> Didn't finish the sentence; wanted to change the font but managed to hit the
> Send button instead...
>
>
> As you can see from the [1] replica server.log fragment, the replica had one
> state in cache, while the actual state in ZK was different (the node didn't
> exist anymore). I'm not sure why the /brokers/topics/foo/partitions/0/state
> node did not exist in ZK. I mentioned that foo was likely created with Kafka
> 0.8.2.x - maybe different Kafka versions use different metadata (ZK directory
> layout), and 0.9.0.1, when deleting, expected something to exist in ZK that
> didn't, since the topic was created with 0.8.2.x.
>
> But I guess this Kafka version difference (at topic creation vs. at deletion)
> is unlikely to be the cause, since the replica had the state cached, so the
> state node is likely to have existed at some point before the deletion
> request. Not sure what could have deleted it. Maybe some other part of the
> distributed topic deletion process deleted the state node. I will check more
> logs, Kafka and ZK, to see if I can find what deleted the state node. Any
> ideas?
>
> On Fri, Mar 11, 2016 at 5:22 PM, Stevo Slavić  wrote:
>
>> Sorry, I do not know if this is related to KAFKA-2937; I'll let you judge
>> and decide whether that ticket needs to be reopened or a new one created.
>>
>> I'm still missing the full picture of what was happening, but here are more
>> pieces.
>>
>> 1) Topic foo was created (likely with Kafka 0.8.2.x). Topic foo was never
>> used (no messages published to it ever).
>> 2) Topic bar was created with Kafka 0.9.0.1.
>> 3) Deletion of topic foo was requested, and replica deletion got stuck
>> (see the [1] server.log fragment).
>> 4) Deletion of topic bar was requested; one bar replica was on the same
>> broker as the foo replica whose deletion got stuck, and deletion of the bar
>> replica was never even attempted.
>> 5) The topics got into a non-deletable state (see the [2] controller.log
>> fragment).
>>
>> As you can see from [1], the replica had one state in cache, while the
>> actual state in ZK was different. I'm not sure why
>> /brokers/topics/foo/partitions/0/state did not
>>
>> [1] replica server.log fragment
>> ...
>> [2016-03-04 09:42:03,495] INFO [ReplicaFetcherManager on broker 1]
>> Removed fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
>> [2016-03-04 09:42:03,533] INFO [ReplicaFetcherManager on broker 1]
>> Removed fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
>> [2016-03-04 09:42:13,894] INFO Partition [foo,0] on broker 1: Shrinking
>> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
>> [2016-03-04 09:42:13,897] WARN Conditional update of path
>> /brokers/topics/foo/partitions/0/state with data
>> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
>> and expected version 68 failed due to
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
>> [2016-03-04 09:42:13,898] INFO Partition [foo,0] on broker 1: Cached
>> zkVersion [68] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partition)
>> [2016-03-04 09:42:23,894] INFO Partition [foo,0] on broker 1: Shrinking
>> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
>> [2016-03-04 09:42:23,897] WARN Conditional update of path
>> /brokers/topics/foo/partitions/0/state with data
>> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
>> and expected version 68 failed due to
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
>> [2016-03-04 09:42:23,897] INFO Partition [foo,0] on broker 1: Cached
>> zkVersion [68] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partition)
>> [2016-03-04 09:42:33,894] INFO Partition [foo,0] on broker 1: Shrinking
>> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
>> [2016-03-04 09:42:33,897] WARN Conditional update of path
>> /brokers/topics/foo/partitions/0/state with data
>> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
>> and expected version 68 failed due to
>> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
>> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
>> [2016-03-04 09:42:33,897] INFO Partition [foo,0] on broker 1: Cached
>> zkVersion [68] not equal to that in zookeeper, skip updating ISR
>> (kafka.cluster.Partition)
>> ...
>>
>> [2] controller.log fragment
>> ...
>> [2016-03-04 09:59:06,753] DEBUG [Topic Deletion Manager 1], Deletion
>> failed for replicas [Topic=bar,Partition=0,Replica=0]. Halting deletion for
>> topics Set(bar) (kafka.controller.TopicDeletionManager)
>> [2016-03-04 09:59:06,753] INFO [Replica state machine on 

Re: Kafka topic deletion still not working well

2016-03-11 Thread Stevo Slavić
Didn't finish the sentence; wanted to change the font but managed to hit the
Send button instead...


As you can see from the [1] replica server.log fragment, the replica had one
state in cache, while the actual state in ZK was different (the node didn't
exist anymore). I'm not sure why the /brokers/topics/foo/partitions/0/state
node did not exist in ZK. I mentioned that foo was likely created with Kafka
0.8.2.x - maybe different Kafka versions use different metadata (ZK directory
layout), and 0.9.0.1, when deleting, expected something to exist in ZK that
didn't, since the topic was created with 0.8.2.x.

But I guess this Kafka version difference (at topic creation vs. at deletion)
is unlikely to be the cause, since the replica had the state cached, so the
state node is likely to have existed at some point before the deletion
request. Not sure what could have deleted it. Maybe some other part of the
distributed topic deletion process deleted the state node. I will check more
logs, Kafka and ZK, to see if I can find what deleted the state node. Any
ideas?
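
To make the failure mode in [1] concrete, here is a minimal standalone sketch
(not Kafka code; the object and variable names are mine, and it assumes a
locally reachable ZooKeeper and the plain org.apache.zookeeper client) showing
that a conditional setData against a znode that no longer exists fails with
NoNodeException before any version comparison can even happen:

import org.apache.zookeeper.{KeeperException, ZooKeeper}

// Standalone sketch, not Kafka code: a conditional update against a znode
// that no longer exists fails with NoNodeException, like the WARN in [1].
object ConditionalUpdateSketch extends App {
  val zk = new ZooKeeper("localhost:2181", 30000, null) // assumes a local ZK
  val path = "/brokers/topics/foo/partitions/0/state"
  val cachedZkVersion = 68 // the zkVersion the replica still has cached
  try {
    zk.setData(path, """{"isr":[1]}""".getBytes("UTF-8"), cachedZkVersion)
  } catch {
    case e: KeeperException.NoNodeException =>
      // The state node is already gone (e.g. removed by topic deletion), so
      // the conditional update cannot even compare versions.
      println(s"NoNode for $path: ${e.getMessage}")
  } finally {
    zk.close()
  }
}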

On Fri, Mar 11, 2016 at 5:22 PM, Stevo Slavić  wrote:

> Sorry, I do not know if this is related to KAFKA-2937, I let you judge and
> decide whether ticket needs to be reopened or new one created.
>
> I'm still missing full picture what was happening but here are more pieces.
>
> 1) Topic foo was created (likely with Kafka 0.8.2.x). Topic foo was never
> used (no messages published to it ever).
> 2) Topic bar was created with Kafka 0.9.0.1
> 3) Deletion for topic foo was requested, and replica deletion got stuck
> (see [1] server.log fragment)
> 4) Deletion for topic bar was requested, one bar replica was on same
> broker as replica for foo whose deletion got stuck, and deletion for bar
> replica was never even tried
> 5) Topics got in non-deletable state (see [2] controller.log fragment)
>
> As you can see from [1] replica had one state in cache, while actual in ZK
> was different. I'm not sure why /brokers/topics/foo/partitions/0/state
> did not
>
> [1] replica server.log fragment
> ...
> [2016-03-04 09:42:03,495] INFO [ReplicaFetcherManager on broker 1] Removed
> fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
> [2016-03-04 09:42:03,533] INFO [ReplicaFetcherManager on broker 1] Removed
> fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
> [2016-03-04 09:42:13,894] INFO Partition [foo,0] on broker 1: Shrinking
> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
> [2016-03-04 09:42:13,897] WARN Conditional update of path
> /brokers/topics/foo/partitions/0/state with data
> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
> and expected version 68 failed due to
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
> [2016-03-04 09:42:13,898] INFO Partition [foo,0] on broker 1: Cached
> zkVersion [68] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-03-04 09:42:23,894] INFO Partition [foo,0] on broker 1: Shrinking
> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
> [2016-03-04 09:42:23,897] WARN Conditional update of path
> /brokers/topics/foo/partitions/0/state with data
> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
> and expected version 68 failed due to
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
> [2016-03-04 09:42:23,897] INFO Partition [foo,0] on broker 1: Cached
> zkVersion [68] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> [2016-03-04 09:42:33,894] INFO Partition [foo,0] on broker 1: Shrinking
> ISR for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
> [2016-03-04 09:42:33,897] WARN Conditional update of path
> /brokers/topics/foo/partitions/0/state with data
> {"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
> and expected version 68 failed due to
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
> NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
> [2016-03-04 09:42:33,897] INFO Partition [foo,0] on broker 1: Cached
> zkVersion [68] not equal to that in zookeeper, skip updating ISR
> (kafka.cluster.Partition)
> ...
>
> [2] controller.log fragment
> ...
> [2016-03-04 09:59:06,753] DEBUG [Topic Deletion Manager 1], Deletion
> failed for replicas [Topic=bar,Partition=0,Replica=0]. Halting deletion for
> topics Set(bar) (kafka.controller.TopicDeletionManager)
> [2016-03-04 09:59:06,753] INFO [Replica state machine on controller 1]:
> Invoking state change to ReplicaDeletionIneligible for replicas
> [Topic=bar,Partition=0,Replica=0] (kafka.controller.ReplicaStateMachine)
> [2016-03-04 09:59:06,754] INFO [Topic Deletion Manager 1], Halted deletion
> of topics bar (kafka.controller.TopicDeletionManager)
> ...
>
> On Fri, Mar 11, 2016 at 5:00 PM, Ma

Re: Kafka topic deletion still not working well

2016-03-11 Thread Stevo Slavić
Sorry, I do not know if this is related to KAFKA-2937; I'll let you judge and
decide whether that ticket needs to be reopened or a new one created.

I'm still missing the full picture of what was happening, but here are more
pieces.

1) Topic foo was created (likely with Kafka 0.8.2.x). Topic foo was never
used (no messages published to it ever).
2) Topic bar was created with Kafka 0.9.0.1.
3) Deletion of topic foo was requested, and replica deletion got stuck
(see the [1] server.log fragment).
4) Deletion of topic bar was requested; one bar replica was on the same broker
as the foo replica whose deletion got stuck, and deletion of the bar replica
was never even attempted.
5) The topics got into a non-deletable state (see the [2] controller.log
fragment).

As you can see from [1], the replica had one state in cache, while the actual
state in ZK was different. I'm not sure why
/brokers/topics/foo/partitions/0/state did not

[1] replica server.log fragment
...
[2016-03-04 09:42:03,495] INFO [ReplicaFetcherManager on broker 1] Removed
fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
[2016-03-04 09:42:03,533] INFO [ReplicaFetcherManager on broker 1] Removed
fetcher for partitions [foo,0] (kafka.server.ReplicaFetcherManager)
[2016-03-04 09:42:13,894] INFO Partition [foo,0] on broker 1: Shrinking ISR
for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
[2016-03-04 09:42:13,897] WARN Conditional update of path
/brokers/topics/foo/partitions/0/state with data
{"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
and expected version 68 failed due to
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
[2016-03-04 09:42:13,898] INFO Partition [foo,0] on broker 1: Cached
zkVersion [68] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
[2016-03-04 09:42:23,894] INFO Partition [foo,0] on broker 1: Shrinking ISR
for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
[2016-03-04 09:42:23,897] WARN Conditional update of path
/brokers/topics/foo/partitions/0/state with data
{"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
and expected version 68 failed due to
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
[2016-03-04 09:42:23,897] INFO Partition [foo,0] on broker 1: Cached
zkVersion [68] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
[2016-03-04 09:42:33,894] INFO Partition [foo,0] on broker 1: Shrinking ISR
for partition [foo,0] from 1,3,2 to 1 (kafka.cluster.Partition)
[2016-03-04 09:42:33,897] WARN Conditional update of path
/brokers/topics/foo/partitions/0/state with data
{"controller_epoch":53,"leader":1,"version":1,"leader_epoch":34,"isr":[1]}
and expected version 68 failed due to
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode =
NoNode for /brokers/topics/foo/partitions/0/state (kafka.utils.ZkUtils)
[2016-03-04 09:42:33,897] INFO Partition [foo,0] on broker 1: Cached
zkVersion [68] not equal to that in zookeeper, skip updating ISR
(kafka.cluster.Partition)
...

[2] controller.log fragment
...
[2016-03-04 09:59:06,753] DEBUG [Topic Deletion Manager 1], Deletion failed
for replicas [Topic=bar,Partition=0,Replica=0]. Halting deletion for topics
Set(bar) (kafka.controller.TopicDeletionManager)
[2016-03-04 09:59:06,753] INFO [Replica state machine on controller 1]:
Invoking state change to ReplicaDeletionIneligible for replicas
[Topic=bar,Partition=0,Replica=0] (kafka.controller.ReplicaStateMachine)
[2016-03-04 09:59:06,754] INFO [Topic Deletion Manager 1], Halted deletion
of topics bar (kafka.controller.TopicDeletionManager)
...

On Fri, Mar 11, 2016 at 5:00 PM, Mayuresh Gharat  wrote:

> Hi Stevo,
>
> Let me know if we want to reopen KAFKA-2937. I can include the above finding
> in the patch, or you can create a separate JIRA for this.
>
> Thanks,
>
> Mayuresh
>
> On Fri, Mar 11, 2016 at 7:53 AM, Mayuresh Gharat <gharatmayures...@gmail.com> wrote:
>
> > KAFKA-2937 is different from this, I think. KAFKA-2937 deals with the topic
> > deletion getting stuck because the LeaderAndISR in ZK was updated by a
> > controller, then the controller dies, and the new controller runs into the
> > exception and never completes deleting the topic. The topic existed in the
> > cluster and was also marked for delete.
> > The case reported here is that the topic does not exist in the cluster but
> > is marked for delete.
> > Am I right in my understanding?
> >
> > Thanks,
> >
> > Mayuresh
> >
> > On Fri, Mar 11, 2016 at 5:30 AM, Stevo Slavić  wrote:
> >
> >> It seems the topic would get deleted, but the request in ZK to delete the
> >> topic would not get cleared even after restarting the Kafka cluster.
> >>
> >> I'm still investigating why the deletion did not complete in the first
> >> place without restarting any nodes. It seems something smelly happens when

Re: Kafka topic deletion still not working well

2016-03-11 Thread Mayuresh Gharat
Hi Stevo,

Let me know if we want to reopen KAFKA-2937. I can include the above finding
in the patch, or you can create a separate JIRA for this.

Thanks,

Mayuresh

On Fri, Mar 11, 2016 at 7:53 AM, Mayuresh Gharat  wrote:

> KAFKA-2937 is different from this, I think. KAFKA-2937 deals with the topic
> deletion getting stuck because the LeaderAndISR in ZK was updated by a
> controller, then the controller dies, and the new controller runs into the
> exception and never completes deleting the topic. The topic existed in the
> cluster and was also marked for delete.
> The case reported here is that the topic does not exist in the cluster but
> is marked for delete.
> Am I right in my understanding?
>
> Thanks,
>
> Mayuresh
>
> On Fri, Mar 11, 2016 at 5:30 AM, Stevo Slavić  wrote:
>
>> It seems the topic would get deleted, but the request in ZK to delete the
>> topic would not get cleared even after restarting the Kafka cluster.
>>
>> I'm still investigating why the deletion did not complete in the first place
>> without restarting any nodes. It seems something smelly happens when there
>> is a request to delete more than one topic.
>>
>> Anyway, I think I found one potential bug in the
>> ReplicaStateMachine.areAllReplicasForTopicDeleted check which could be the
>> cause of the deletion request not being cleared from ZK even after a restart
>> of the whole cluster. Line ReplicaStateMachine.scala#L285
>> <
>> https://github.com/sslavic/kafka/blob/trunk/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L285
>> >
>>
>> replicaStatesForTopic.forall(_._2 == ReplicaDeletionSuccessful)
>>
>> which is the return value of that function/check, and which should probably
>> instead be
>>
>> replicaStatesForTopic.isEmpty || replicaStatesForTopic.forall(_._2 ==
>> ReplicaDeletionSuccessful)
>>
>> I noticed it because in controller logs I found entries like:
>>
>> [2016-03-04 13:27:29,115] DEBUG [Replica state machine on controller 1]:
>> Are all replicas for topic foo deleted Map()
>> (kafka.controller.ReplicaStateMachine)
>>
>> even though normally they look like:
>>
>> [2016-03-04 09:33:41,036] DEBUG [Replica state machine on controller 1]:
>> Are all replicas for topic foo deleted
>> Map([Topic=foo,Partition=0,Replica=0] -> ReplicaDeletionStarted,
>> [Topic=foo,Partition=0,Replica=3] -> ReplicaDeletionStarted,
>> [Topic=foo,Partition=0,Replica=1] -> ReplicaDeletionSuccessful)
>> (kafka.controller.ReplicaStateMachine)
>>
>> Kind regards,
>> Stevo Slavic.
>>
>> On Sun, Mar 6, 2016 at 12:31 AM, Guozhang Wang 
>> wrote:
>>
>> > Thanks Stevo,
>> >
>> > Feel free to paste your findings in KAFKA-2937; we can re-open that
>> > ticket if necessary.
>> >
>> > Guozhang
>> >
>> > On Fri, Mar 4, 2016 at 4:38 AM, Stevo Slavić  wrote:
>> >
>> > > Hello Apache Kafka community,
>> > >
>> > > I'm still investigating an incident; from initial findings, topic
>> > > deletion still doesn't seem to work well with Kafka 0.9.0.1, likely
>> > > some edge case is not covered.
>> > >
>> > > Before, with 0.8.2.x, it used to happen that a non-lead replica would
>> > > get stuck in the topic deletion process, and the workaround was just to
>> > > restart that node.
>> > >
>> > > If I'm not mistaken, that edge case got (or at least is expected to be)
>> > > fixed in 0.9.0.1 via KAFKA-2937
>> > > 
>> > >
>> > > The request to delete the topic continued to be there in ZK even after
>> > > a whole-cluster restart - the topic seemed not to exist, seemed to
>> > > actually be deleted, but the request to delete it would remain. I had
>> > > to manually delete the request node in ZK.
>> > >
>> > > When I have more details and a reproducible use case, I will report back.
>> > >
>> > > Kind regards,
>> > > Stevo Slavic.
>> > >
>> >
>> >
>> >
>> > --
>> > -- Guozhang
>> >
>>
>
>
>
> --
> -Regards,
> Mayuresh R. Gharat
> (862) 250-7125
>



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: Kafka topic deletion still not working well

2016-03-11 Thread Mayuresh Gharat
KAFKA-2937 is different from this, I think. KAFKA-2937 deals with the topic
deletion getting stuck because the LeaderAndISR in ZK was updated by a
controller, then the controller dies, and the new controller runs into the
exception and never completes deleting the topic. The topic existed in the
cluster and was also marked for delete.
The case reported here is that the topic does not exist in the cluster but is
marked for delete.
Am I right in my understanding?

Thanks,

Mayuresh

On Fri, Mar 11, 2016 at 5:30 AM, Stevo Slavić  wrote:

> It seems the topic would get deleted, but the request in ZK to delete the
> topic would not get cleared even after restarting the Kafka cluster.
>
> I'm still investigating why the deletion did not complete in the first place
> without restarting any nodes. It seems something smelly happens when there
> is a request to delete more than one topic.
>
> Anyway, I think I found one potential bug in the
> ReplicaStateMachine.areAllReplicasForTopicDeleted check which could be the
> cause of the deletion request not being cleared from ZK even after a restart
> of the whole cluster. Line ReplicaStateMachine.scala#L285
> <
> https://github.com/sslavic/kafka/blob/trunk/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L285
> >
>
> replicaStatesForTopic.forall(_._2 == ReplicaDeletionSuccessful)
>
> which is the return value of that function/check, and which should probably
> instead be
>
> replicaStatesForTopic.isEmpty || replicaStatesForTopic.forall(_._2 ==
> ReplicaDeletionSuccessful)
>
> I noticed it because in controller logs I found entries like:
>
> [2016-03-04 13:27:29,115] DEBUG [Replica state machine on controller 1]:
> Are all replicas for topic foo deleted Map()
> (kafka.controller.ReplicaStateMachine)
>
> even though normally they look like:
>
> [2016-03-04 09:33:41,036] DEBUG [Replica state machine on controller 1]:
> Are all replicas for topic foo deleted
> Map([Topic=foo,Partition=0,Replica=0] -> ReplicaDeletionStarted,
> [Topic=foo,Partition=0,Replica=3] -> ReplicaDeletionStarted,
> [Topic=foo,Partition=0,Replica=1] -> ReplicaDeletionSuccessful)
> (kafka.controller.ReplicaStateMachine)
>
> Kind regards,
> Stevo Slavic.
>
> On Sun, Mar 6, 2016 at 12:31 AM, Guozhang Wang  wrote:
>
> > Thanks Stevo,
> >
> > Feel free to paste your findings in KAFKA-2937; we can re-open that ticket
> > if necessary.
> >
> > Guozhang
> >
> > On Fri, Mar 4, 2016 at 4:38 AM, Stevo Slavić  wrote:
> >
> > > Hello Apache Kafka community,
> > >
> > > I'm still investigating an incident; from initial findings, topic
> > > deletion still doesn't seem to work well with Kafka 0.9.0.1, likely some
> > > edge case is not covered.
> > >
> > > Before, with 0.8.2.x, it used to happen that a non-lead replica would
> > > get stuck in the topic deletion process, and the workaround was just to
> > > restart that node.
> > >
> > > If I'm not mistaken, that edge case got (or at least is expected to be)
> > > fixed in 0.9.0.1 via KAFKA-2937
> > > 
> > >
> > > The request to delete the topic continued to be there in ZK even after a
> > > whole-cluster restart - the topic seemed not to exist, seemed to actually
> > > be deleted, but the request to delete it would remain. I had to manually
> > > delete the request node in ZK.
> > >
> > > When I have more details and a reproducible use case, I will report back.
> > >
> > > Kind regards,
> > > Stevo Slavic.
> > >
> >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-Regards,
Mayuresh R. Gharat
(862) 250-7125


Re: Kafka topic deletion still not working well

2016-03-11 Thread Stevo Slavić
It seems the topic would get deleted, but the request in ZK to delete the
topic would not get cleared even after restarting the Kafka cluster.
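
For reference, here is a minimal standalone sketch (not Kafka tooling; the
object name is mine, and it assumes the standard /admin/delete_topics ZK
layout, a locally reachable ZooKeeper, and the plain org.apache.zookeeper
client) of the manual cleanup I mentioned earlier - listing pending delete
requests and, only as a last resort, removing a leftover request znode by
hand:

import org.apache.zookeeper.{KeeperException, ZooKeeper}

// Standalone maintenance sketch, not Kafka code: list pending topic deletion
// requests and, only as a last resort, remove a leftover request znode.
object DeleteRequestCleanupSketch extends App {
  val zk = new ZooKeeper("localhost:2181", 30000, null) // assumes a local ZK
  try {
    val pending = zk.getChildren("/admin/delete_topics", false)
    println(s"Pending delete requests: $pending")
    if (pending.contains("foo"))
      zk.delete("/admin/delete_topics/foo", -1) // -1 = ignore znode version
  } catch {
    case _: KeeperException.NoNodeException =>
      println("No /admin/delete_topics node, nothing pending")
  } finally {
    zk.close()
  }
}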

I'm still investigating why the deletion did not complete in the first place
without restarting any nodes. It seems something smelly happens when there is
a request to delete more than one topic.

Anyway, I think I found one potential bug in the
ReplicaStateMachine.areAllReplicasForTopicDeleted check which could be the
cause of the deletion request not being cleared from ZK even after a restart
of the whole cluster. Line ReplicaStateMachine.scala#L285
<https://github.com/sslavic/kafka/blob/trunk/core/src/main/scala/kafka/controller/ReplicaStateMachine.scala#L285>

replicaStatesForTopic.forall(_._2 == ReplicaDeletionSuccessful)

which is the return value of that function/check, and which should probably
instead be

replicaStatesForTopic.isEmpty || replicaStatesForTopic.forall(_._2 ==
ReplicaDeletionSuccessful)
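
To make the proposed check concrete, here is a minimal self-contained sketch
(the object name is mine, and replica states are modeled as plain strings
rather than the actual kafka.controller types):

// Self-contained sketch of the proposed guard; states are plain strings,
// not the actual kafka.controller.ReplicaState values.
object AreAllReplicasDeletedSketch extends App {
  def areAllReplicasForTopicDeleted(replicaStates: Map[String, String]): Boolean =
    replicaStates.isEmpty ||
      replicaStates.forall { case (_, state) => state == "ReplicaDeletionSuccessful" }

  // The empty Map() case from the controller log entries below:
  println(areAllReplicasForTopicDeleted(Map.empty)) // true
  // An in-progress deletion, as in the "normal" log entry below:
  println(areAllReplicasForTopicDeleted(Map(
    "[Topic=foo,Partition=0,Replica=0]" -> "ReplicaDeletionStarted",
    "[Topic=foo,Partition=0,Replica=1]" -> "ReplicaDeletionSuccessful"))) // false
}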

I noticed it because in controller logs I found entries like:

[2016-03-04 13:27:29,115] DEBUG [Replica state machine on controller 1]:
Are all replicas for topic foo deleted Map()
(kafka.controller.ReplicaStateMachine)

even though normally they look like:

[2016-03-04 09:33:41,036] DEBUG [Replica state machine on controller 1]:
Are all replicas for topic foo deleted
Map([Topic=foo,Partition=0,Replica=0] -> ReplicaDeletionStarted,
[Topic=foo,Partition=0,Replica=3] -> ReplicaDeletionStarted,
[Topic=foo,Partition=0,Replica=1] -> ReplicaDeletionSuccessful)
(kafka.controller.ReplicaStateMachine)

Kind regards,
Stevo Slavic.

On Sun, Mar 6, 2016 at 12:31 AM, Guozhang Wang  wrote:

> Thanks Stevo,
>
> Feel free to paste your findings in KAFKA-2937; we can re-open that ticket
> if necessary.
>
> Guozhang
>
> On Fri, Mar 4, 2016 at 4:38 AM, Stevo Slavić  wrote:
>
> > Hello Apache Kafka community,
> >
> > I'm still investigating an incident; from initial findings, topic deletion
> > still doesn't seem to work well with Kafka 0.9.0.1, likely some edge case
> > is not covered.
> >
> > Before, with 0.8.2.x, it used to happen that a non-lead replica would get
> > stuck in the topic deletion process, and the workaround was just to
> > restart that node.
> >
> > If I'm not mistaken, that edge case got (or at least is expected to be)
> > fixed in 0.9.0.1 via KAFKA-2937
> > 
> >
> > The request to delete the topic continued to be there in ZK even after a
> > whole-cluster restart - the topic seemed not to exist, seemed to actually
> > be deleted, but the request to delete it would remain. I had to manually
> > delete the request node in ZK.
> >
> > When I have more details and a reproducible use case, I will report back.
> >
> > Kind regards,
> > Stevo Slavic.
> >
>
>
>
> --
> -- Guozhang
>


Re: Kafka topic deletion still not working well

2016-03-05 Thread Guozhang Wang
Thanks Stevo,

Feel free to paste your findings in KAFKA-2937; we can re-open that ticket
if necessary.

Guozhang

On Fri, Mar 4, 2016 at 4:38 AM, Stevo Slavić  wrote:

> Hello Apache Kafka community,
>
> I'm still investigating an incident; from initial findings, topic deletion
> still doesn't seem to work well with Kafka 0.9.0.1, likely some edge case is
> not covered.
>
> Before, with 0.8.2.x, it used to happen that a non-lead replica would get
> stuck in the topic deletion process, and the workaround was just to restart
> that node.
>
> If I'm not mistaken, that edge case got (or at least is expected to be)
> fixed in 0.9.0.1 via KAFKA-2937
> 
>
> The request to delete the topic continued to be there in ZK even after a
> whole-cluster restart - the topic seemed not to exist, seemed to actually be
> deleted, but the request to delete it would remain. I had to manually delete
> the request node in ZK.
>
> When I have more details and a reproducible use case, I will report back.
>
> Kind regards,
> Stevo Slavic.
>



-- 
-- Guozhang