[ 
https://issues.apache.org/jira/browse/KAFKA-9118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin McCabe updated KAFKA-9118:
--------------------------------
        Parent:     (was: KAFKA-9119)
    Issue Type: Improvement  (was: Sub-task)

> LogDirFailureHandler shouldn't use Zookeeper
> --------------------------------------------
>
>                 Key: KAFKA-9118
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9118
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Viktor Somogyi-Vass
>            Assignee: Viktor Somogyi-Vass
>            Priority: Major
>
> As described in 
> [KIP-112|https://cwiki.apache.org/confluence/display/KAFKA/KIP-112%3A+Handle+disk+failure+for+JBOD#KIP-112:HandlediskfailureforJBOD-Zookeeper]:
> {noformat}
> 2. A log directory stops working on a broker during runtime
> - The controller watches the path /log_dir_event_notification for new znode.
> - The broker detects offline log directories during runtime.
> - The broker takes actions as if it has received StopReplicaRequest for this 
> replica. More specifically, the replica is no longer considered leader and is 
> removed from any replica fetcher thread. (The clients will receive a 
> UnknownTopicOrPartitionException at this point)
> - The broker notifies the controller by creating a sequential znode under 
> path /log_dir_event_notification with data of the format {"version" : 1, 
> "broker" : brokerId, "event" : LogDirFailure}.
> - The controller reads the znode to get the brokerId and finds that the event 
> type is LogDirFailure.
> - The controller deletes the notification znode
> - The controller sends LeaderAndIsrRequest to that broker to query the state 
> of all topic partitions on the broker. The LeaderAndIsrResponse from this 
> broker will specify KafkaStorageException for those partitions that are on 
> the bad log directories.
> - The controller updates the information of offline replicas in memory and 
> trigger leader election as appropriate.
> - The controller removes offline replicas from ISR in the ZK and sends 
> LeaderAndIsrRequest with updated ISR to be used by partition leaders.
> - The controller propagates the information of offline replicas to brokers by 
> sending UpdateMetadataRequest.
> {noformat}
> Instead of the notification znode, the broker should notify the controller 
> through the Kafka protocol, sending a message that lists the offline 
> partitions. The controller then updates its in-memory information about 
> offline replicas and triggers leader election, removes the replicas from the 
> ISR in ZK, and sends a LeaderAndIsrRequest and an UpdateMetadataRequest.
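
The proposed flow could look roughly like the toy model below. This is only a sketch under stated assumptions: LogDirFailureNotification, PartitionState and ControllerSketch are hypothetical names made up for illustration, not existing Kafka classes or RPCs, and the step that persists the shrunken ISR to ZK is left out. "Sending" a request is modelled by appending its name to a list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogDirFailureNotification:
    """Hypothetical broker-to-controller message (not a real Kafka RPC)."""
    broker_id: int
    offline_partitions: tuple  # (topic, partition) pairs on the failed log dir


@dataclass
class PartitionState:
    leader: int
    isr: list


@dataclass
class ControllerSketch:
    """Toy in-memory controller; writing the ISR to ZK is omitted."""
    partitions: dict  # (topic, partition) -> PartitionState
    offline_replicas: set = field(default_factory=set)
    sent: list = field(default_factory=list)  # requests "sent", by name

    def handle(self, note):
        changed = []
        for tp in note.offline_partitions:
            state = self.partitions.get(tp)
            if state is None:
                continue  # unknown partition: ignored in this sketch
            # 1. Record the offline replica in memory.
            self.offline_replicas.add((tp, note.broker_id))
            # 2. Remove the replica from the ISR.
            state.isr = [b for b in state.isr if b != note.broker_id]
            # 3. Pick a new leader if the failed replica was leading
            #    (assumption: first remaining ISR member, -1 if none).
            if state.leader == note.broker_id:
                state.leader = state.isr[0] if state.isr else -1
            changed.append(tp)
        if changed:
            # 4. Propagate the new state to the brokers.
            self.sent.append(("LeaderAndIsrRequest", tuple(changed)))
            self.sent.append(("UpdateMetadataRequest", tuple(changed)))
        return changed


# Broker 1 reports that its replica of orders-0 sits on a failed log dir.
controller = ControllerSketch(
    {("orders", 0): PartitionState(leader=1, isr=[1, 2, 3])}
)
controller.handle(LogDirFailureNotification(1, (("orders", 0),)))
```

Compared with the znode-based flow quoted above, the notification already names the affected partitions, so the controller in this sketch does not need an extra LeaderAndIsrRequest round trip just to discover which replicas went offline.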



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
