[jira] [Created] (KAFKA-16218) Partition reassignment can't complete if any target replica is out-of-sync

2024-02-01 Thread Drawxy (Jira)
Drawxy created KAFKA-16218:
--

 Summary: Partition reassignment can't complete if any target 
replica is out-of-sync
 Key: KAFKA-16218
 URL: https://issues.apache.org/jira/browse/KAFKA-16218
 Project: Kafka
  Issue Type: Bug
Reporter: Drawxy


Assume there are 4 brokers (1001, 2001, 3001, 4001) and a topic partition 
_foo-0_ (replicas [1001,2001,3001], isr [1001,3001]). Replica 2001 cannot catch 
up and has become out-of-sync due to some issue.

If we launch a partition reassignment for _foo-0_ with the target replica list 
[1001,2001,4001], the reassignment can't complete even after the adding replica 
4001 has caught up. At that point, the partition state is 
replicas [1001,2001,4001,3001], isr [1001,3001,4001].

 

The out-of-sync replica 2001 shouldn't block the partition reassignment from completing.
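For illustration, here is a minimal Java sketch (class and method names are hypothetical, not the controller's actual code) contrasting a completion rule that requires every target replica to be in the ISR, which reproduces the stuck state above, with one that only requires the adding replicas to be in sync:
{code:java}
import java.util.List;
import java.util.Set;

// Illustrative sketch only; this is not Kafka source code.
public class ReassignmentCompletionSketch {

    // Completion rule that reproduces the reported behavior:
    // every replica in the target list must be in the ISR.
    static boolean doneIfAllTargetsInIsr(List<Integer> targetReplicas, Set<Integer> isr) {
        return isr.containsAll(targetReplicas);
    }

    // The behavior argued for above: only the adding replicas
    // (targets not in the original assignment) must catch up.
    static boolean doneIfAddingReplicasInIsr(List<Integer> originalReplicas,
                                             List<Integer> targetReplicas,
                                             Set<Integer> isr) {
        return targetReplicas.stream()
                .filter(r -> !originalReplicas.contains(r)) // adding replicas only
                .allMatch(isr::contains);
    }

    public static void main(String[] args) {
        List<Integer> original = List.of(1001, 2001, 3001);
        List<Integer> target = List.of(1001, 2001, 4001);
        Set<Integer> isr = Set.of(1001, 3001, 4001); // 2001 is still out of sync

        System.out.println(doneIfAllTargetsInIsr(target, isr));               // false -> reassignment stays stuck
        System.out.println(doneIfAddingReplicasInIsr(original, target, isr)); // true  -> 4001 has caught up
    }
}
{code}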





[jira] [Created] (KAFKA-15608) Uncopied leader epoch cache causes repeated OffsetOutOfRangeException

2023-10-16 Thread Drawxy (Jira)
Drawxy created KAFKA-15608:
--

 Summary: Uncopied leader epoch cache causes repeated 
OffsetOutOfRangeException
 Key: KAFKA-15608
 URL: https://issues.apache.org/jira/browse/KAFKA-15608
 Project: Kafka
  Issue Type: Bug
  Components: core
Affects Versions: 3.1.0
Reporter: Drawxy


Recently, I encountered an issue where one partition always had only 1 replica 
in its ISR (there was no produce traffic on the topic). The bug is related to 
altering the log dir. When replacing the current log with the future log, the 
broker doesn't copy the leader epoch checkpoint cache, which records the 
current leader epoch and log start offset. The cache for each partition is 
updated only when new messages are appended or when the replica becomes the 
leader. If there is no traffic and the replica is already the leader, the cache 
is never updated again. However, when handling a fetch request, the partition 
leader reads its leader epoch from the cache and compares it with the leader 
epoch sent by the follower. If the cached epoch is missing or less than the 
follower's epoch, the leader interrupts the processing and returns an 
OffsetOutOfRangeException to the follower. The follower might fall out of sync 
over time.
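The check described above can be sketched roughly as follows (a simplified illustration with hypothetical names, not the broker's actual fetch path): if the cached epoch is missing or smaller than the epoch the follower sends, the fetch is rejected.
{code:java}
import java.util.Optional;

// Simplified illustration of the fetch-time leader epoch validation; not Kafka source code.
public class EpochCheckSketch {

    /**
     * @param cachedLeaderEpoch latest epoch recorded in the leader's epoch cache
     *                          (empty if the cache was never populated, e.g. after
     *                          the future log replaced the current log)
     * @param followerEpoch     current leader epoch sent by the follower
     * @return true if the fetch is rejected with OffsetOutOfRangeException
     */
    static boolean rejectFetch(Optional<Integer> cachedLeaderEpoch, int followerEpoch) {
        // A missing or stale cached epoch means the leader cannot validate the fetch,
        // so it answers with OffsetOutOfRangeException and the follower keeps
        // resetting its fetch offset without ever advancing its fetch state.
        return cachedLeaderEpoch.isEmpty() || cachedLeaderEpoch.get() < followerEpoch;
    }

    public static void main(String[] args) {
        System.out.println(rejectFetch(Optional.empty(), 5)); // true: cache lost after the log dir move
        System.out.println(rejectFetch(Optional.of(5), 5));   // false: normal case
    }
}
{code}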

Take the following case as an example; the key points are listed in 
chronological order:
 # Reassigner submitted a partition reassignment for partition foo-1
{quote}{"topic": "foo","partition": 1,"replicas": [5002,3003,4001],"logDirs": 
["\data\kafka-logs-0","any","any"]}{quote}
 # Reassignment completed immediately because there was no traffic on this topic.
 # Controller sent LeaderAndISR requests to all the replicas.
 # Newly added replica 5002 became the new leader and the current log updated 
its leader epoch cache. Replica 5002 successfully handled the LeaderAndISR 
request.
 # The log dir move completed, and the newly promoted current log didn't have 
any leader epoch information.
 # Replica 5002 handled fetch requests (which include the fetch offset and 
current leader epoch) from the followers and returned OffsetOutOfRangeException 
because the leader epoch cache hadn't been updated. As a result, replica 5002 
couldn't update the fetch state for each follower and later reported an ISR 
shrink. The followers 3003 and 4001 would repeatedly print the following log:

{quote}WARN [ReplicaFetcher replicaId=4001, leaderId=5002, fetcherId=2] Reset 
fetch offset for partition foo-1 from 231196 to current leader's start offset 
231196 (kafka.server.ReplicaFetcherThread)
INFO [ReplicaFetcher replicaId=4001, leaderId=5002, fetcherId=2] Current offset 
231196 for partition foo-1 is out of range, which typically implies a leader 
change. Reset fetch offset to 231196 (kafka.server.ReplicaFetcherThread)
{quote}
This issue arises only when all three of the following conditions are met:
 # There is no produce traffic on the partition.
 # A newly added replica becomes the new leader.
 # The LeaderAndISR request is handled successfully before the log dir move 
completes.
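Since the report traces the root cause to the leader epoch checkpoint not being carried over when the future log replaces the current log, the missing step amounts to something like the sketch below (a conceptual illustration with assumed paths, not a patch against Kafka's log manager):
{code:java}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Conceptual sketch of the step the report describes as missing; not Kafka source code.
public class CopyEpochCheckpointSketch {

    static void carryOverEpochCheckpoint(Path oldPartitionDir, Path newPartitionDir) throws IOException {
        Path source = oldPartitionDir.resolve("leader-epoch-checkpoint");
        Path target = newPartitionDir.resolve("leader-epoch-checkpoint");
        if (Files.exists(source)) {
            // Without this (or an equivalent copy of the in-memory cache), the promoted
            // current log starts with an empty epoch cache, and with no produce traffic
            // nothing ever repopulates it.
            Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws IOException {
        // Example paths are hypothetical.
        carryOverEpochCheckpoint(Path.of("/data/kafka-logs-1/foo-1"),
                                 Path.of("/data/kafka-logs-0/foo-1"));
    }
}
{code}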





[jira] [Created] (KAFKA-15082) The log retention policy doesn't take effect after altering log dir

2023-06-12 Thread Drawxy (Jira)
Drawxy created KAFKA-15082:
--

 Summary: The log retention policy doesn't take effect after 
altering log dir
 Key: KAFKA-15082
 URL: https://issues.apache.org/jira/browse/KAFKA-15082
 Project: Kafka
  Issue Type: Bug
Affects Versions: 3.3.1
Reporter: Drawxy


There are two scenarios where the log retention policy doesn't take effect:
 # While a log dir move is in progress, if a LeaderAndISR request for the 
partition being moved is sent to the broker, the broker never resumes log 
segment cleaning.
 # The log dir move is cancelled.

In both scenarios, the stale log segment files are never deleted, which can 
eventually fill up the disk.
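A rough illustration of the failure mode (hypothetical bookkeeping, not the actual LogManager logic): cleaning for the partition is paused while the log dir move is in flight, and in the two scenarios above the matching resume call never happens, so retention never deletes the old segments.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical pause/resume bookkeeping to illustrate the symptom; not Kafka source code.
public class RetentionPauseSketch {

    private final Map<String, Boolean> cleaningPaused = new ConcurrentHashMap<>();

    void startLogDirMove(String topicPartition) {
        // Cleaning is paused so segments are not deleted while being copied.
        cleaningPaused.put(topicPartition, true);
    }

    void finishLogDirMove(String topicPartition) {
        // The matching resume. In the reported scenarios (a LeaderAndISR request
        // arriving mid-move, or the move being cancelled) this point is never
        // reached, so retention stays disabled and stale segments pile up on disk.
        cleaningPaused.put(topicPartition, false);
    }

    boolean retentionRuns(String topicPartition) {
        return !cleaningPaused.getOrDefault(topicPartition, false);
    }

    public static void main(String[] args) {
        RetentionPauseSketch sketch = new RetentionPauseSketch();
        sketch.startLogDirMove("foo-0");
        // finishLogDirMove("foo-0") is never called, mirroring the bug.
        System.out.println(sketch.retentionRuns("foo-0")); // false -> segments never deleted
    }
}
{code}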


