[ https://issues.apache.org/jira/browse/KAFKA-9840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jason Gustafson updated KAFKA-9840:
-----------------------------------
    Description:
We have observed a case where the consumer attempted to detect truncation with the OffsetsForLeaderEpoch API against a broker which had become a zombie. In this case, the last epoch known to the consumer was higher than the last epoch known to the zombie broker, so the broker returned -1 as both the end offset and epoch in the response. The consumer did not check for this in the response, which resulted in the following message:

{code}
Truncation detected for partition topic-1 at offset FetchPosition{offset=11859, offsetEpoch=Optional[46], currentLeader=LeaderAndEpoch{leader=broker-host (id: 3 rack: null), epoch=-1}}, resetting offset to the first offset known to diverge FetchPosition{offset=-1, offsetEpoch=Optional[-1], currentLeader=LeaderAndEpoch{broker-host (id: 3 rack: null), epoch=-1}} (org.apache.kafka.clients.consumer.internals.SubscriptionState:414)
{code}

There are a couple of ways the consumer can handle this situation better.

First, the reason we did not detect the zombie broker is that we did not include the current leader epoch in the OffsetForLeaderEpoch request. This was likely because of KAFKA-9212. Following that patch, we do not initialize the current leader epoch from metadata responses because there are cases in which we cannot rely on it. But if the client cannot rely on being able to detect zombies, then the epoch validation is less useful anyway. So the simple solution is to skip the validation unless we have a reliable current leader epoch.

Second, the consumer needs to check for the case when the returned offset and epoch are not defined. In this case, we have to treat this as a normal OffsetOutOfRange case and invoke the reset policy.

  was:
We have observed a case where the consumer attempted to detect truncation with the OffsetsForLeaderEpoch API against a broker which had become a zombie. In this case, the last epoch known to the consumer was higher than the last epoch known to the zombie broker, so the broker returned -1 as the offset and epoch in the response. The consumer did not check for this in the response, which resulted in the following message:

{code}
Truncation detected for partition topic-1 at offset FetchPosition{offset=11859, offsetEpoch=Optional[46], currentLeader=LeaderAndEpoch{leader=broker-host (id: 3 rack: null), epoch=-1}}, resetting offset to the first offset known to diverge FetchPosition{offset=-1, offsetEpoch=Optional[-1], currentLeader=LeaderAndEpoch{broker-host (id: 3 rack: null), epoch=-1}} (org.apache.kafka.clients.consumer.internals.SubscriptionState:414)
{code}

There are a couple of ways the consumer can handle this situation better.

First, the reason we did not detect the zombie broker is that we did not include the current leader epoch in the OffsetForLeaderEpoch request. This was likely because of KAFKA-9212. Following that patch, we do not initialize the current leader epoch from metadata responses because there are cases in which we cannot rely on it. But if the client cannot rely on being able to detect zombies, then the epoch validation is less useful anyway. So the simple solution is to skip the validation unless we have a reliable current leader epoch.

Second, the consumer needs to check for the case when the returned offset and epoch are not defined. In this case, we have to treat this as a normal OffsetOutOfRange case and invoke the reset policy.
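The two fixes described above can be sketched roughly as follows. This is a minimal illustration, not the actual consumer code: the class `EpochValidationSketch`, the `Action` enum, and both method names are hypothetical. It assumes that truncation is inferred when the returned end offset is below the current fetch position, and that a response counts as undefined when either field is -1 (the report only describes the case where both are -1).

```java
import java.util.OptionalInt;

/** Hypothetical sketch only; names here are not taken from the Kafka code base. */
final class EpochValidationSketch {
    // The report says the zombie broker returned -1 for both fields.
    static final long UNDEFINED_OFFSET = -1L;
    static final int UNDEFINED_EPOCH = -1;

    enum Action { NO_TRUNCATION, TRUNCATE_TO_END_OFFSET, RESET_BY_POLICY }

    /** First fix: only attempt validation when a current leader epoch is known. */
    static boolean shouldValidate(OptionalInt currentLeaderEpoch) {
        return currentLeaderEpoch.isPresent();
    }

    /** Second fix: check for an undefined result before comparing offsets. */
    static Action onResponse(long fetchOffset, long endOffset, int endEpoch) {
        // Assumption: either field being undefined is enough to fall back
        // to the reset policy, as in a normal OffsetOutOfRange case.
        if (endOffset == UNDEFINED_OFFSET || endEpoch == UNDEFINED_EPOCH) {
            return Action.RESET_BY_POLICY;
        }
        // Assumption: an end offset below the fetch position means the log diverged.
        if (endOffset < fetchOffset) {
            return Action.TRUNCATE_TO_END_OFFSET;
        }
        return Action.NO_TRUNCATION;
    }

    public static void main(String[] args) {
        // No reliable leader epoch: the sketch would skip validation entirely.
        System.out.println(shouldValidate(OptionalInt.empty()));
        // Values from the logged case: position 11859, response -1 / -1.
        System.out.println(onResponse(11859L, UNDEFINED_OFFSET, UNDEFINED_EPOCH));
    }
}
```

Without the undefined check, the -1 end offset compares as lower than 11859 and is mistaken for a divergence point, which matches the "resetting offset to ... offset=-1" message in the log.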

> Consumer should not use OffsetForLeaderEpoch without current epoch validation
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-9840
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9840
>             Project: Kafka
>          Issue Type: Bug
>          Components: consumer
>    Affects Versions: 2.4.1
>            Reporter: Jason Gustafson
>            Priority: Major
>
> We have observed a case where the consumer attempted to detect truncation
> with the OffsetsForLeaderEpoch API against a broker which had become a
> zombie. In this case, the last epoch known to the consumer was higher than
> the last epoch known to the zombie broker, so the broker returned -1 as both
> the end offset and epoch in the response. The consumer did not check for this
> in the response, which resulted in the following message:
> {code}
> Truncation detected for partition topic-1 at offset
> FetchPosition{offset=11859, offsetEpoch=Optional[46],
> currentLeader=LeaderAndEpoch{leader=broker-host (id: 3 rack: null),
> epoch=-1}}, resetting offset to the first offset known to diverge
> FetchPosition{offset=-1, offsetEpoch=Optional[-1],
> currentLeader=LeaderAndEpoch{broker-host (id: 3 rack: null), epoch=-1}}
> (org.apache.kafka.clients.consumer.internals.SubscriptionState:414)
> {code}
> There are a couple of ways the consumer can handle this situation better.
> First, the reason we did not detect the zombie broker is that we did not
> include the current leader epoch in the OffsetForLeaderEpoch request. This
> was likely because of KAFKA-9212. Following that patch, we do not
> initialize the current leader epoch from metadata responses because there are
> cases in which we cannot rely on it. But if the client cannot rely on being
> able to detect zombies, then the epoch validation is less useful anyway. So
> the simple solution is to skip the validation unless we have a reliable
> current leader epoch.
> Second, the consumer needs to check for the case when the returned offset and
> epoch are not defined. In this case, we have to treat this as a normal
> OffsetOutOfRange case and invoke the reset policy.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)