[ https://issues.apache.org/jira/browse/KAFKA-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16528385#comment-16528385 ]

Anna Povzner commented on KAFKA-7122:
-------------------------------------

Have you seen https://issues.apache.org/jira/browse/KAFKA-6361? The description 
sounds very similar, except that two leader changes happen: broker 1 -> broker 2 -> 
broker 1, where the second leader change happens while both brokers are in the ISR 
(which can happen due to preferred leader election), and broker 1 does not get a 
chance to truncate before the leadership changes the second time. Is it possible 
that broker A did not actually have a chance to truncate and fetch from the new 
leader in step 7? 
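
To make the ordering concrete, here is a toy model (not Kafka code; broker names, 
offsets, and the truncation rule are simplified, and the truncation target below 
ignores leader-epoch handling): on becoming a follower, a replica truncates and 
then re-fetches from the leader, so if leadership flips back to broker A before 
that step runs, the writes A accepted as a stale leader are never reconciled.

{code:java}
import java.util.ArrayList;
import java.util.List;

class Replica {
    final String name;
    final List<String> log = new ArrayList<>();
    long highWatermark = 0;              // last offset known replicated to the ISR

    Replica(String name) { this.name = name; }

    long logEndOffset() { return log.size(); }

    // Simplified rule: on becoming a follower, truncate to the local high
    // watermark, then copy the leader's log from that point onward.
    void becomeFollowerOf(Replica leader) {
        while (log.size() > highWatermark) {
            log.remove(log.size() - 1);  // drop records that may not be replicated
        }
        for (long o = logEndOffset(); o < leader.logEndOffset(); o++) {
            log.add(leader.log.get((int) o));
        }
        highWatermark = leader.highWatermark;
    }
}

public class StaleLeaderSketch {
    public static void main(String[] args) {
        Replica a = new Replica("A");
        Replica c = new Replica("C");
        // Both replicas agree on offsets 0..2, so X = 3.
        for (int i = 0; i < 3; i++) { a.log.add("m" + i); c.log.add("m" + i); }
        a.highWatermark = 3;
        c.highWatermark = 3;

        // A's ZK session is failing; the controller has already moved leadership
        // to C, but A has not noticed yet and keeps appending producer writes.
        a.log.add("stale-1");
        a.log.add("stale-2");
        // Meanwhile C, the actual leader, accepts different records.
        c.log.add("m3");
        c.log.add("m4");
        c.highWatermark = 5;

        // If A properly becomes a follower before any further election, it
        // truncates back to offset 3 and re-fetches, ending up identical to C.
        // In the KAFKA-6361 scenario, leadership flips back to A before this
        // call runs, so "stale-1"/"stale-2" survive and the logs diverge.
        a.becomeFollowerOf(c);
        System.out.println("A after truncate + fetch: " + a.log);
    }
}
{code}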

> Data is lost when ZooKeeper times out
> -------------------------------------
>
>                 Key: KAFKA-7122
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7122
>             Project: Kafka
>          Issue Type: Bug
>          Components: core, replication
>    Affects Versions: 0.11.0.2
>            Reporter: Nick Lipple
>            Priority: Blocker
>
> Noticed that a Kafka cluster will lose data when the leader for a partition has 
> its ZooKeeper connection time out.
> Sequence of events:
>  # Say broker A leads a partition, with brokers B and C as followers
>  # A ZK node has a network issue; it happens to be the node used by broker A. 
> Let's say this happens at offset X
>  # The Kafka controller immediately selects broker C as the new partition leader
>  # Broker A does not time out from ZooKeeper for another 4 seconds. Broker A 
> still thinks it is the leader, presumably accepting producer writes.
>  # Broker A detects the ZK timeout and leaves the ISR.
>  # Broker A reconnects to ZK and rejoins the cluster as a follower for the partition
>  # Broker A truncates its log to some offset Y such that Y > X. Broker A proceeds 
> to catch up normally and rejoins the ISR
>  # The in-sync replicas for the partition are now in an inconsistent state:
>  ## Broker C has all offsets X through Y plus everything after
>  ## Broker B has all offsets X through Y plus everything after
>  ## Broker A has offsets up to X and after Y. Everything between X and Y *IS 
> MISSING*
>  # Within 5 minutes, the controller triggers a preferred replica election, making 
> broker A the new leader for the partition (this is default behavior)
> After step 9, consumers will not receive any messages for offsets between 
> X and Y.
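> A hypothetical way to observe the symptom from the consumer side (bootstrap 
> servers, topic name, and group id below are placeholders; note that offsets can 
> also be legitimately non-contiguous, e.g. on compacted topics or with 
> transactional markers, so a gap alone is not proof of loss):
> {code:java}
> import java.util.Collections;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.ConsumerRecord;
> import org.apache.kafka.clients.consumer.ConsumerRecords;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.serialization.StringDeserializer;
> 
> public class OffsetGapChecker {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "localhost:9092");   // placeholder
>         props.put("group.id", "offset-gap-checker");        // placeholder
>         props.put("auto.offset.reset", "earliest");
>         props.put("key.deserializer", StringDeserializer.class.getName());
>         props.put("value.deserializer", StringDeserializer.class.getName());
> 
>         try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
>             // Assumes a single-partition topic so offsets are comparable across polls.
>             consumer.subscribe(Collections.singletonList("my-topic"));  // placeholder
>             long expected = -1;
>             while (true) {
>                 ConsumerRecords<String, String> records = consumer.poll(1000);
>                 for (ConsumerRecord<String, String> r : records) {
>                     if (expected >= 0 && r.offset() > expected) {
>                         System.out.printf("offset gap: expected %d, got %d%n",
>                                           expected, r.offset());
>                     }
>                     expected = r.offset() + 1;
>                 }
>             }
>         }
>     }
> }
> {code}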
>  
> The root problem here seems to be that broker A truncates to offset Y when 
> rejoining the cluster. It should truncate further back to offset X to 
> prevent data loss
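> As a hedged mitigation sketch (an assumption, and not a fix for the truncation 
> itself): the step-9 re-election comes from automatic preferred replica election, 
> which is enabled by default and runs on a periodic check, so disabling it at 
> least keeps the divergent replica from being promoted back automatically:
> {code}
> # server.properties -- workaround only, not a fix for the data loss
> auto.leader.rebalance.enable=false
> # the "within 5 minutes" cadence in step 9 corresponds to the default
> # leader.imbalance.check.interval.seconds=300
> {code}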
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
