[ https://issues.apache.org/jira/browse/KAFKA-5758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
David van Geest updated KAFKA-5758:
-----------------------------------
Description:

We've noticed that reassigning a topic's partitions seems to adversely impact other topics. Specifically, followers for other topics fall out of the ISR. While I'm not 100% sure why this happens, the scenario seems to be as follows:

1. Reassignment is manually triggered on topic-partition X-Y, and broker A (which used to be a follower for X-Y) is no longer a follower.
2. Broker A makes a `FetchRequest` including topic-partition X-Y to broker B, just after the reassignment.
3. Broker B can fulfill the `FetchRequest`, but while trying to do so it tries to record the position of "follower" A. This fails, because broker A is no longer a follower for X-Y (see exception below).
4. The entire `FetchRequest` fails, and broker A's other followed topics start falling behind.
5. Depending on the length of the reassignment, this sequence repeats.

In step 3, we see exceptions like:

{noformat}
Error when handling request Name: FetchRequest; Version: 3; CorrelationId: 46781859; ClientId: ReplicaFetcherThread-0-1001; ReplicaId: 1006; MaxWait: 500 ms; MinBytes: 1 bytes; MaxBytes: 10485760 bytes; RequestInfo: <LOTS OF PARTITIONS>
kafka.common.NotAssignedReplicaException: Leader 1001 failed to record follower 1006's position -1 since the replica is not recognized to be one of the assigned replicas 1001,1004,1005 for partition [topic_being_reassigned,5].
    at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:249)
    at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:923)
    at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:920)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:920)
    at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:481)
    at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:534)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:79)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
    at java.lang.Thread.run(Thread.java:745)
{noformat}

Does my assessment make sense? If so, this behaviour seems problematic. A few changes that might improve matters (assuming I'm on the right track):

1. `FetchRequest` should be able to return partial results.
2. The broker fulfilling the `FetchRequest` could ignore the `NotAssignedReplicaException` and return results without recording the no-longer-a-follower's position.

This behaviour was experienced with 0.10.1.1, although looking at the changelogs and the code in question, I don't see any reason why it would have changed in later versions. I'm very interested to have some discussion on this. Thanks!

  was: (the same description, without the full stack trace)


> Reassigning a topic's partitions can adversely impact other topics
> ------------------------------------------------------------------
>
>                 Key: KAFKA-5758
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5758
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.10.1.1
>            Reporter: David van Geest
>
> We've noticed that reassigning a topic's partitions seems to adversely impact
> other topics. Specifically, followers for other topics fall out of the ISR.
> While I'm not 100% sure about why this happens, the scenario seems to be as
> follows:
> 1. Reassignment is manually triggered on topic-partition X-Y, and broker A
> (which used to be a follower for X-Y) is no longer a follower.
> 2. Broker A makes `FetchRequest` including topic-partition X-Y to broker B,
> just after the reassignment.
> 3. Broker B can fulfill the `FetchRequest`, but while trying to do so it
> tries to record the position of "follower" A. This fails, because broker A is
> no longer a follower for X-Y (see exception below).
> 4. The entire `FetchRequest` request fails, and broker A's other followed
> topics start falling behind.
> 5. Depending on the length of the reassignment, this sequence repeats.
> In step 3, we see exceptions like:
> {noformat}
> Error when handling request Name: FetchRequest; Version: 3; CorrelationId:
> 46781859; ClientId: ReplicaFetcherThread-0-1001; ReplicaId: 1006; MaxWait:
> 500 ms; MinBytes: 1 bytes; MaxBytes: 10485760 bytes; RequestInfo:
> <LOTS OF PARTITIONS>
> kafka.common.NotAssignedReplicaException: Leader 1001 failed to record
> follower 1006's position -1 since the replica is not recognized to be one of
> the assigned replicas 1001,1004,1005 for partition [topic_being_reassigned,5].
> at kafka.cluster.Partition.updateReplicaLogReadResult(Partition.scala:249)
> at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:923)
> at kafka.server.ReplicaManager$$anonfun$updateFollowerLogReadResults$2.apply(ReplicaManager.scala:920)
> at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at kafka.server.ReplicaManager.updateFollowerLogReadResults(ReplicaManager.scala:920)
> at kafka.server.ReplicaManager.fetchMessages(ReplicaManager.scala:481)
> at kafka.server.KafkaApis.handleFetchRequest(KafkaApis.scala:534)
> at kafka.server.KafkaApis.handle(KafkaApis.scala:79)
> at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:60)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Does my assessment make sense? If so, this behaviour seems problematic. A few
> changes that might improve matters (assuming I'm on the right track):
> 1. `FetchRequest` should be able to return partial results
> 2. The broker fulfilling the `FetchRequest` could ignore the
> `NotAssignedReplicaException`, and return results without recording the
> not-any-longer-follower position.
> This behaviour was experienced with 0.10.1.1, although looking at the
> changelogs and the code in question, I don't see any reason why it would have
> changed in later versions.
> Am very interested to have some discussion on this. Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
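The report's second suggestion (ignore the `NotAssignedReplicaException` for the offending partition and still return results for the rest) can be sketched as follows. This is a hypothetical, self-contained Java illustration of the idea, not Kafka's actual `ReplicaManager` code; every class and method name here is invented for the example:

```java
import java.util.*;

// Hypothetical sketch: isolate a per-partition "replica not assigned"
// failure so the rest of a multi-partition fetch still succeeds,
// instead of failing the whole request (the behaviour described above).
public class PartialFetchSketch {

    // Stand-in for kafka.common.NotAssignedReplicaException.
    static class NotAssignedReplicaException extends RuntimeException {
        NotAssignedReplicaException(String msg) { super(msg); }
    }

    // Leader-side view: which replica ids are assigned to each partition.
    // Broker 1006 has just been removed from the reassigned partition.
    static final Map<String, Set<Integer>> assignedReplicas = Map.of(
        "other-topic-0", Set.of(1001, 1006),
        "topic_being_reassigned-5", Set.of(1001, 1004, 1005));

    // Records a follower's fetch position; throws if the fetching
    // replica is no longer one of the partition's assigned replicas.
    static void updateFollowerPosition(String partition, int replicaId) {
        if (!assignedReplicas.get(partition).contains(replicaId))
            throw new NotAssignedReplicaException(
                "replica " + replicaId + " not assigned for " + partition);
    }

    // Proposed behaviour: catch the exception per partition and keep
    // going, returning data for the partitions that are still valid.
    static List<String> handleFetch(int replicaId, List<String> partitions) {
        List<String> served = new ArrayList<>();
        for (String p : partitions) {
            try {
                updateFollowerPosition(p, replicaId);
                served.add(p);
            } catch (NotAssignedReplicaException e) {
                // Skip only this partition; do not fail the whole request.
            }
        }
        return served;
    }

    public static void main(String[] args) {
        // Broker 1006 still includes the reassigned partition in its
        // fetch just after reassignment; only the partition it is still
        // assigned to is served, and the request as a whole succeeds.
        System.out.println(handleFetch(1006,
            List.of("other-topic-0", "topic_being_reassigned-5")));
    }
}
```

With today's behaviour the exception would propagate out of the loop and fail every partition in the request; the try/catch inside the loop is what turns a full-request failure into a partial result.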