[jira] [Updated] (KAFKA-8036) Log dir reassignment on followers fails with FileNotFoundException for the leader epoch cache on leader election

Stanislav Kozlovski (JIRA) Mon, 04 Mar 2019 08:06:25 -0800


     [ 
https://issues.apache.org/jira/browse/KAFKA-8036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Stanislav Kozlovski updated KAFKA-8036:
---------------------------------------
    Description: 
When changing a partition's log directories for a follower broker, we move all 
the data related to that partition to the other log dir (as per 
[KIP-113|https://cwiki.apache.org/confluence/display/KAFKA/KIP-113:+Support+replicas+movement+between+log+directories]).
 On a successful move, we rename the original directory by adding a suffix 
consisting of a UUID and `-delete` (e.g. `test_log_dir` would be renamed to 
`test_log_dir-0.32e77c96939140f9a56a49b75ad8ec8d-delete`).
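
The renaming scheme can be sketched roughly as follows. This is only an 
illustration, not Kafka's actual code: the class and method names 
(`DeleteDirName`, `deleteDirName`) are hypothetical, and it assumes the 
directory being renamed is the partition directory `test_log_dir-0` and that 
the suffix is a dot, the UUID with its dashes stripped, and `-delete`.

```java
import java.util.UUID;

public class DeleteDirName {
    // Hypothetical helper: builds the name a directory gets once it is
    // scheduled for deletion. Assumed layout: <dir>.<uuid without dashes>-delete
    static String deleteDirName(String dirName, UUID id) {
        String compactId = id.toString().replace("-", "");
        return dirName + "." + compactId + "-delete";
    }

    public static void main(String[] args) {
        // Same UUID as in the example above, written in canonical dashed form.
        UUID id = UUID.fromString("32e77c96-9391-40f9-a56a-49b75ad8ec8d");
        // prints test_log_dir-0.32e77c96939140f9a56a49b75ad8ec8d-delete
        System.out.println(deleteDirName("test_log_dir-0", id));
    }
}
```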

We copy every log file and [initialize a new leader epoch file 
cache|https://github.com/apache/kafka/blob/0d56f1413557adabc736cae2dffcdc56a620403e/core/src/main/scala/kafka/log/Log.scala#L768].
 The problem is that we do not update the associated `Replica` object's leader 
epoch cache; it still points to the old `LeaderEpochFileCache` instance, whose 
backing file lived in the directory that has just been renamed.
This results in a `FileNotFoundException` when the broker is [elected as a 
leader for the 
partition|https://github.com/apache/kafka/blob/255f4a6effdc71c273691859cd26c4138acad778/core/src/main/scala/kafka/cluster/Partition.scala#L312].
 This has the unintended side effect of marking the log directory as offline, 
resulting in all partitions in that log directory becoming unavailable on 
that broker.
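
The stale-reference failure can be reproduced in miniature with the sketch 
below. It does not use Kafka's real `Log`, `Replica`, or `LeaderEpochFileCache` 
classes; `EpochCache`, `Replica`, `electionFails`, and the directory names are 
hypothetical stand-ins, and the "election" is reduced to a single write to the 
checkpoint file. It assumes only that the cache is backed by a file inside the 
partition directory.

```java
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class StaleEpochCacheDemo {
    // Stand-in for a file-backed leader epoch cache (not Kafka's class).
    static final class EpochCache {
        private final Path checkpoint;

        EpochCache(Path partitionDir) {
            this.checkpoint = partitionDir.resolve("leader-epoch-checkpoint");
        }

        // Persists the epoch; fails if the partition directory is gone.
        void assign(int epoch) throws IOException {
            try (FileOutputStream out = new FileOutputStream(checkpoint.toFile())) {
                out.write((epoch + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    // Stand-in for the replica, which keeps its own reference to the cache.
    static final class Replica {
        EpochCache epochCache;

        Replica(EpochCache epochCache) {
            this.epochCache = epochCache;
        }
    }

    // Returns true if the write on "leader election" hits a missing file.
    static boolean electionFails(boolean refreshReplicaCache) throws IOException {
        Path root = Files.createTempDirectory("logdirs");
        Path oldDir = Files.createDirectory(root.resolve("topic-0"));
        Replica replica = new Replica(new EpochCache(oldDir));
        replica.epochCache.assign(0);

        // Move: the new directory gets a fresh cache, the old one is renamed.
        Path newDir = Files.createDirectory(root.resolve("moved-topic-0"));
        EpochCache newCache = new EpochCache(newDir);
        Files.move(oldDir, root.resolve("topic-0.0123abcd-delete"));

        if (refreshReplicaCache) {
            replica.epochCache = newCache; // the step that is missing in the bug
        }

        try {
            replica.epochCache.assign(1); // what happens on leader election
            return false;
        } catch (FileNotFoundException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("stale cache fails:     " + electionFails(false));
        System.out.println("refreshed cache fails: " + electionFails(true));
    }
}
```

With the stale reference the write targets a path under the renamed directory 
and fails; once the replica is pointed at the new cache, the same write 
succeeds.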

  was:
When changing a partition's log directories for a follower broker, we move all 
the data related to that partition to the other log dir (as per 
[KIP-113|https://cwiki.apache.org/confluence/display/KAFKA/KIP-113:+Support+replicas+movement+between+log+directories]).
 On a successful move, we rename the original directory by adding a suffix 
consisting of an UUID and `-delete`. (e.g `test_log_dir` would be renamed to 
`test_log_dir-0.32e77c96939140f9a56a49b75ad8ec8d-delete`)

We copy every log file and [initialize a new leader epoch file 
cache|https://github.com/apache/kafka/blob/0d56f1413557adabc736cae2dffcdc56a620403e/core/src/main/scala/kafka/log/Log.scala#L768].
 The problem is that we do not update the associated `Replica` class' leader 
epoch cache - it still points to the old `LeaderEpochFileCache` instance.
This results in a FileNotFound exception when the broker is [elected as a 
leader for the 
partition|[https://github.com/apache/kafka/blob/255f4a6effdc71c273691859cd26c4138acad778/core/src/main/scala/kafka/cluster/Partition.scala#L312]].
 This has the unintended side effect of marking the log directory as offline, 
resulting in all partitions from that log directory becoming unavailable for 
the specific broker.


> Log dir reassignment on followers fails with FileNotFoundException for the 
> leader epoch cache on leader election
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-8036
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8036
>             Project: Kafka
>          Issue Type: Improvement
>    Affects Versions: 1.0.2, 1.1.0, 2.0.1
>            Reporter: Stanislav Kozlovski
>            Assignee: Stanislav Kozlovski
>            Priority: Major
>
> When changing a partition's log directories for a follower broker, we move 
> all the data related to that partition to the other log dir (as per 
> [KIP-113|https://cwiki.apache.org/confluence/display/KAFKA/KIP-113:+Support+replicas+movement+between+log+directories]).
>  On a successful move, we rename the original directory by adding a suffix 
> consisting of a UUID and `-delete` (e.g. `test_log_dir` would be renamed to 
> `test_log_dir-0.32e77c96939140f9a56a49b75ad8ec8d-delete`).
> We copy every log file and [initialize a new leader epoch file 
> cache|https://github.com/apache/kafka/blob/0d56f1413557adabc736cae2dffcdc56a620403e/core/src/main/scala/kafka/log/Log.scala#L768].
>  The problem is that we do not update the associated `Replica` object's 
> leader epoch cache; it still points to the old `LeaderEpochFileCache` 
> instance, whose backing file lived in the directory that has just been 
> renamed.
> This results in a `FileNotFoundException` when the broker is [elected as a 
> leader for the 
> partition|https://github.com/apache/kafka/blob/255f4a6effdc71c273691859cd26c4138acad778/core/src/main/scala/kafka/cluster/Partition.scala#L312].
>  This has the unintended side effect of marking the log directory as offline, 
> resulting in all partitions in that log directory becoming unavailable on 
> that broker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
