[jira] [Commented] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-27 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17624947#comment-17624947
 ] 

shizhenzhen commented on KAFKA-14328:
-

PR:  https://github.com/apache/kafka/pull/12793

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png, image-2022-10-21-16-58-19-353.png, 
> image-2022-10-24-14-28-10-365.png, image-2022-10-24-14-47-30-641.png, 
> image-2022-10-24-14-48-27-907.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14329) KafkaAdminClient#listOffsets should query by partition dimension

2022-10-24 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17623054#comment-17623054
 ] 

shizhenzhen commented on KAFKA-14329:
-

For the exception here, I don't think it should be thrown whenever any partition of the topic has an error; it should only be thrown when one of the partitions that was actually queried has an error.

 

See the code here (of course, I'm not saying the change has to be made at this exact spot); a sketch of the proposed check follows the screenshot below.

 

!image-2022-10-24-16-47-03-180.png!
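A minimal sketch of the idea, with hypothetical names (partitionErrors standing in for the per-partition errors collected from the Metadata response, requested for the partitions the caller actually asked about); this is not the real KafkaAdminClient code, just an illustration of failing only when a *queried* partition is broken:

```java
import java.util.Map;
import java.util.Set;

import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.ApiException;

final class RequestedPartitionErrorCheck {

    /**
     * Throw only if one of the partitions the caller actually requested has a metadata error.
     * Errors on other partitions of the same topic (e.g. a leaderless Topic1-2) are ignored here.
     */
    static void failOnlyForRequestedPartitions(Map<TopicPartition, ApiException> partitionErrors,
                                               Set<TopicPartition> requested) {
        for (TopicPartition tp : requested) {
            ApiException error = partitionErrors.get(tp);
            if (error != null) {
                throw error;   // e.g. LeaderNotAvailableException for a requested partition
            }
        }
        // Partitions that were not requested never cause the whole call to fail.
    }
}
```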

 

 

Another related issue: https://issues.apache.org/jira/browse/KAFKA-14328

 

 

[~showuon] [~dengziming]   [~guozhang] 

 

 

 

> KafkaAdminClient#listOffsets should query by partition dimension
> 
>
> Key: KAFKA-14329
> URL: https://issues.apache.org/jira/browse/KAFKA-14329
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 2.7.2
>Reporter: shizhenzhen
>Priority: Major
>  Labels: bug
> Attachments: image-2022-10-24-16-18-01-418.png, 
> image-2022-10-24-16-20-31-778.png, image-2022-10-24-16-47-03-180.png
>
>
>  
> When the listOffsets API queries the offsets of TopicPartitions, it should normally query by 
> partition dimension.
>  
> For example, if I query Topic1-0 and Topic1-1, it should return the corresponding data.
>  
> Normally that is exactly what happens, but before this API is called, the Metadata API is called 
> first;
>  
> and when the Metadata response is processed, an exception is thrown if any partition of the 
> corresponding topic has no leader. See below:
>  
> !image-2022-10-24-16-20-31-778.png!
>  
>  
> !image-2022-10-24-16-18-01-418.png!
>  
>  
> Consider this situation:
>  
> Topic1 has 3 partitions: Topic1-0, Topic1-1, Topic1-2.
>  
> But it just so happens that partition Topic1-2 has leader = -1 for some reason.
>  
> Now, when I try to query the offsets of just Topic1-0 and Topic1-1, it throws an exception right 
> away.
>  
> Logically, I did not touch the problematic Topic1-2 at all, so I should be able to get the data.
>  
> But instead I get an exception, which is very unfriendly.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14329) KafkaAdminClient#listOffsets should query by partition dimension

2022-10-24 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-14329:

Attachment: image-2022-10-24-16-47-03-180.png

> KafkaAdminClient#listOffsets should query by partition dimension
> 
>
> Key: KAFKA-14329
> URL: https://issues.apache.org/jira/browse/KAFKA-14329
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 2.7.2
>Reporter: shizhenzhen
>Priority: Major
>  Labels: bug
> Attachments: image-2022-10-24-16-18-01-418.png, 
> image-2022-10-24-16-20-31-778.png, image-2022-10-24-16-47-03-180.png
>
>
>  
> When the listOffsets API queries the offsets of TopicPartitions, it should normally query by 
> partition dimension.
>  
> For example, if I query Topic1-0 and Topic1-1, it should return the corresponding data.
>  
> Normally that is exactly what happens, but before this API is called, the Metadata API is called 
> first;
>  
> and when the Metadata response is processed, an exception is thrown if any partition of the 
> corresponding topic has no leader. See below:
>  
> !image-2022-10-24-16-20-31-778.png!
>  
>  
> !image-2022-10-24-16-18-01-418.png!
>  
>  
> Consider this situation:
>  
> Topic1 has 3 partitions: Topic1-0, Topic1-1, Topic1-2.
>  
> But it just so happens that partition Topic1-2 has leader = -1 for some reason.
>  
> Now, when I try to query the offsets of just Topic1-0 and Topic1-1, it throws an exception right 
> away.
>  
> Logically, I did not touch the problematic Topic1-2 at all, so I should be able to get the data.
>  
> But instead I get an exception, which is very unfriendly.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14329) KafkaAdminClient#listOffsets should query by partition dimension

2022-10-24 Thread shizhenzhen (Jira)
shizhenzhen created KAFKA-14329:
---

 Summary: KafkaAdminClient#listOffsets should query by partition 
dimension
 Key: KAFKA-14329
 URL: https://issues.apache.org/jira/browse/KAFKA-14329
 Project: Kafka
  Issue Type: Improvement
  Components: admin
Affects Versions: 2.7.2
Reporter: shizhenzhen
 Attachments: image-2022-10-24-16-18-01-418.png, 
image-2022-10-24-16-20-31-778.png

 

When the listOffsets API queries the offsets of TopicPartitions, it should normally query by 
partition dimension.

For example, if I query Topic1-0 and Topic1-1, it should return the corresponding data.

Normally that is exactly what happens, but before this API is called, the Metadata API is called 
first;

and when the Metadata response is processed, an exception is thrown if any partition of the 
corresponding topic has no leader. See below:

 

!image-2022-10-24-16-20-31-778.png!

 

 

!image-2022-10-24-16-18-01-418.png!

 

 

Consider this situation:

Topic1 has 3 partitions: Topic1-0, Topic1-1, Topic1-2.

But it just so happens that partition Topic1-2 has leader = -1 for some reason.

Now, when I try to query the offsets of just Topic1-0 and Topic1-1, it throws an exception right 
away.

Logically, I did not touch the problematic Topic1-2 at all, so I should be able to get the data.

But instead I get an exception, which is very unfriendly.
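To make the scenario concrete, here is a minimal caller sketch (assuming a hypothetical cluster where Topic1 has three partitions and Topic1-2 currently has no leader; the bootstrap address is a placeholder). The request only names Topic1-0 and Topic1-1, yet it can still fail because of the error raised while processing the Metadata response:

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.common.TopicPartition;

public class ListOffsetsByPartitionExample {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Only the two healthy partitions are requested; Topic1-2 is not part of the query.
            Map<TopicPartition, OffsetSpec> request = Map.of(
                    new TopicPartition("Topic1", 0), OffsetSpec.latest(),
                    new TopicPartition("Topic1", 1), OffsetSpec.latest());

            admin.listOffsets(request).all().get()
                    .forEach((tp, info) -> System.out.println(tp + " -> " + info.offset()));
        } catch (ExecutionException e) {
            // With Topic1-2 leaderless, this can still surface a leader-related failure
            // (or time out), even though that partition was never requested.
            System.err.println("listOffsets failed: " + e.getCause());
        }
    }
}
```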

 

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-24 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-14328:

Attachment: image-2022-10-24-14-48-27-907.png

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png, image-2022-10-21-16-58-19-353.png, 
> image-2022-10-24-14-28-10-365.png, image-2022-10-24-14-47-30-641.png, 
> image-2022-10-24-14-48-27-907.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-24 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622984#comment-17622984
 ] 

shizhenzhen commented on KAFKA-14328:
-

[~dengziming] [~showuon]   

 

When a call times out, the exception from its last request attempt could be saved temporarily and then printed as part of the timeout.
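A self-contained sketch of the idea (a simplified model, not the actual KafkaAdminClient internals; Call, lastAttemptFailure and the timeout path here are illustrative names only): keep the most recent per-attempt failure on the pending call and fold it into the timeout exception's message.

```java
import org.apache.kafka.common.errors.TimeoutException;

// Simplified model of a retried admin call: remember the last per-attempt failure
// so the eventual timeout can explain *why* the call never succeeded.
final class Call {
    private final String name;
    private final long deadlineMs;
    private Throwable lastAttemptFailure;   // e.g. LeaderNotAvailableException from the last try

    Call(String name, long deadlineMs) {
        this.name = name;
        this.deadlineMs = deadlineMs;
    }

    void onAttemptFailed(Throwable cause) {
        this.lastAttemptFailure = cause;    // stash it instead of only logging at trace
    }

    void maybeExpire(long nowMs) {
        if (nowMs >= deadlineMs) {
            String reason = lastAttemptFailure == null
                    ? "no attempt was ever sent"
                    : "last attempt failed with: " + lastAttemptFailure;
            throw new TimeoutException(
                    "Timed out waiting for a node assignment. Call: " + name + " (" + reason + ")");
        }
    }
}
```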

 

!image-2022-10-24-14-47-30-641.png!

 

 

 

!image-2022-10-24-14-28-10-365.png!

 

 

 

Effect:

 

 

!image-2022-10-24-14-48-27-907.png!

 

 

 

 

If this approach looks reasonable, feel free to assign this issue to me.

 

 

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png, image-2022-10-21-16-58-19-353.png, 
> image-2022-10-24-14-28-10-365.png, image-2022-10-24-14-47-30-641.png, 
> image-2022-10-24-14-48-27-907.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-24 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-14328:

Attachment: image-2022-10-24-14-47-30-641.png

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png, image-2022-10-21-16-58-19-353.png, 
> image-2022-10-24-14-28-10-365.png, image-2022-10-24-14-47-30-641.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-24 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-14328:

Attachment: image-2022-10-24-14-28-10-365.png

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png, image-2022-10-21-16-58-19-353.png, 
> image-2022-10-24-14-28-10-365.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-21 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622140#comment-17622140
 ] 

shizhenzhen commented on KAFKA-14328:
-

If the timeout happens while a request is actually in flight, the code may well print the exception that caused the timeout.

But a pending Call has no concrete exception attached, because it is essentially just waiting for its turn to send a request,

and since it keeps being retried it sits in the pending state without ever getting a chance to send, so naturally it times out.

So my suggestion is that logging the concrete exception at log.warn as a hint might be better.
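A minimal sketch of the kind of change being proposed (illustrative only; the handler and field names below are not the real KafkaAdminClient code): in the catch path that re-queues a metadata call for retry, log the swallowed exception at warn instead of trace, so the eventual "Timed out waiting for a node assignment" has a visible cause nearby in the logs.

```java
import java.util.ArrayDeque;
import java.util.Queue;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Illustrative retry handler, not the real KafkaAdminClient internals.
final class MetadataRetryHandler {
    private static final Logger log = LoggerFactory.getLogger(MetadataRetryHandler.class);

    private final Queue<Runnable> pendingCalls = new ArrayDeque<>();

    void handleFailure(String callName, Runnable retryableCall, Throwable cause) {
        // Previously something like: log.trace("{} failed, requeueing", callName, cause);
        // Logging at warn makes the real reason for the retries (e.g. LeaderNotAvailableException)
        // visible without enabling trace logging for the whole admin client.
        log.warn("Call {} failed and will be retried: {}", callName, cause.toString());
        pendingCalls.add(retryableCall);
    }
}
```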

 

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png, image-2022-10-21-16-58-19-353.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-21 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-14328:

Attachment: image-2022-10-21-16-58-19-353.png

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png, image-2022-10-21-16-58-19-353.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-21 Thread shizhenzhen (Jira)

[ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622136#comment-17622136
 ] 

shizhenzhen edited comment on KAFKA-14328 at 10/21/22 8:56 AM:
---

[~showuon]   Thanks for the reply. As you said, we expect retries to resolve this exception, but 
what if it cannot be resolved within the timeout?

In that case the exception that gets thrown is just: Timed out waiting for a node assignment. 
Call: metadata

But from such a bare exception, who could possibly know the concrete cause?

So it is very unfriendly for troubleshooting.

If this were changed to log.warn, it might help us find some clues.

Of course, it would be more elegant if the concrete exception could be returned when the timeout 
occurs,

but that does not seem as simple to do as just changing it to log.warn,

because at timeout there is no concrete exception information: once a call is judged to have 
timed out, only the following information is available:

 

!image-2022-10-21-16-54-40-588.png!

 

!image-2022-10-21-16-56-45-448.png!

 


was (Author: shizhenzhen):
[~showuon]   Thanks for the reply. As you said, we expect retries to resolve this exception, but 
what if it cannot be resolved within the timeout?

In that case the exception that gets thrown is just: Timed out waiting for a node assignment. 
Call: metadata

But from such a bare exception, who could possibly know the concrete cause?

So it is very unfriendly for troubleshooting.

If this were changed to log.warn, it might help us find some clues.

Of course, it would be more elegant if the concrete exception could be returned when the timeout 
occurs,

but that does not seem as simple to do as just changing it to log.warn,

because at timeout there is no concrete exception information: once a call is judged to have 
timed out, only the following information is available:

 

!image-2022-10-21-16-54-40-588.png!

 

 

 

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png, 
> image-2022-10-21-16-56-45-448.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-21 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17622136#comment-17622136
 ] 

shizhenzhen commented on KAFKA-14328:
-

[~showuon]   Thanks for the reply. As you said, we expect retries to resolve this exception, but 
what if it cannot be resolved within the timeout?

In that case the exception that gets thrown is just: Timed out waiting for a node assignment. 
Call: metadata

But from such a bare exception, who could possibly know the concrete cause?

So it is very unfriendly for troubleshooting.

If this were changed to log.warn, it might help us find some clues.

Of course, it would be more elegant if the concrete exception could be returned when the timeout 
occurs,

but that does not seem as simple to do as just changing it to log.warn,

because at timeout there is no concrete exception information: once a call is judged to have 
timed out, only the following information is available:

 

!image-2022-10-21-16-54-40-588.png!

 

 

 

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-21 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-14328:

Attachment: image-2022-10-21-16-54-40-588.png

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png, image-2022-10-21-16-54-40-588.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-21 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-14328:

 Attachment: image-2022-10-21-14-56-31-753.png
Description: 
 

 

Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
idea what the cause is, which makes troubleshooting very difficult.

For example, in the code below: when a Metadata request is sent and one of the queried topics has 
a partition with leader = -1, an exception is thrown;

but at that point the exception is actually swallowed. After being thrown upward it reaches the 
catch block in the second screenshot below, which puts the request back into the request queue. 
The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
Timed out waiting for a node assignment. Call: metadata

No node could be assigned to the Metadata request. Under normal circumstances, who would know 
that the real exception is actually:

 

```

org.apache.kafka.common.errors.LeaderNotAvailableException: There is no leader 
for this topic-partition as we are in the middle of a leadership election.

 

```

 

 

 

 

!https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!

 

 

 

The screenshot below shows the log after I changed it to warn level.

!image-2022-10-21-11-19-21-064.png!

So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
be caused by some exception.

 

 

 

 

!image-2022-10-21-14-56-31-753.png!

 

  was:
 

 

Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
idea what the cause is, which makes troubleshooting very difficult.

For example, in the code below: when a Metadata request is sent and one of the queried topics has 
a partition with leader = -1, an exception is thrown;

but at that point the exception is actually swallowed. After being thrown upward it reaches the 
catch block in the second screenshot below, which puts the request back into the request queue. 
The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
Timed out waiting for a node assignment. Call: metadata

No node could be assigned to the Metadata request. Under normal circumstances, who would know 
that the real exception is actually:

 

```

org.apache.kafka.common.errors.LeaderNotAvailableException: There is no leader 
for this topic-partition as we are in the middle of a leadership election.

 

```

 

 

 

 

!https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!

 

 

 

The screenshot below shows the log after I changed it to warn level.

!image-2022-10-21-11-19-21-064.png!

So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
be caused by some exception.

 

 


> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png, 
> image-2022-10-21-14-56-31-753.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  
> 
>  
>  
> !image-2022-10-21-14-56-31-753.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-20 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-14328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17621468#comment-17621468
 ] 

shizhenzhen commented on KAFKA-14328:
-

[~guozhang]   

> KafkaAdminClient should be Changing the exception level When an exception 
> occurs
> 
>
> Key: KAFKA-14328
> URL: https://issues.apache.org/jira/browse/KAFKA-14328
> Project: Kafka
>  Issue Type: Improvement
>  Components: admin
>Affects Versions: 3.3
>Reporter: shizhenzhen
>Priority: Major
> Attachments: image-2022-10-21-11-19-21-064.png
>
>
>  
>  
> Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
> idea what the cause is, which makes troubleshooting very difficult.
>  
> For example, in the code below: when a Metadata request is sent and one of the queried topics has 
> a partition with leader = -1, an exception is thrown;
>  
> but at that point the exception is actually swallowed. After being thrown upward it reaches the 
> catch block in the second screenshot below, which puts the request back into the request queue. 
> The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
> Timed out waiting for a node assignment. Call: metadata
>  
> No node could be assigned to the Metadata request. Under normal circumstances, who would know 
> that the real exception is actually:
>  
> ```
> org.apache.kafka.common.errors.LeaderNotAvailableException: There is no 
> leader for this topic-partition as we are in the middle of a leadership 
> election.
>  
> ```
>  
>  
>  
>  
> !https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!
>  
>  
>  
> The screenshot below shows the log after I changed it to warn level.
> !image-2022-10-21-11-19-21-064.png!
>  
> So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
> be caused by some exception.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (KAFKA-14328) KafkaAdminClient should be Changing the exception level When an exception occurs

2022-10-20 Thread shizhenzhen (Jira)
shizhenzhen created KAFKA-14328:
---

 Summary: KafkaAdminClient should be Changing the exception level 
When an exception occurs
 Key: KAFKA-14328
 URL: https://issues.apache.org/jira/browse/KAFKA-14328
 Project: Kafka
  Issue Type: Improvement
  Components: admin
Affects Versions: 3.3
Reporter: shizhenzhen
 Attachments: image-2022-10-21-11-19-21-064.png

 

 

Some of KafkaAdminClient's logs are all at log.trace level. When an exception occurs you have no 
idea what the cause is, which makes troubleshooting very difficult.

For example, in the code below: when a Metadata request is sent and one of the queried topics has 
a partition with leader = -1, an exception is thrown;

but at that point the exception is actually swallowed. After being thrown upward it reaches the 
catch block in the second screenshot below, which puts the request back into the request queue. 
The call then keeps retrying endlessly until the timeout is reached and this exception is thrown: 
Timed out waiting for a node assignment. Call: metadata

No node could be assigned to the Metadata request. Under normal circumstances, who would know 
that the real exception is actually:

 

```

org.apache.kafka.common.errors.LeaderNotAvailableException: There is no leader 
for this topic-partition as we are in the middle of a leadership election.

 

```

 

 

 

 

!https://user-images.githubusercontent.com/10442648/196944422-e11b732f-6f7f-4f77-8d9c-1f0544257461.png!

 

 

 

The screenshot below shows the log after I changed it to warn level.

!image-2022-10-21-11-19-21-064.png!

So I hope this log.trace can be changed to log.warn, as a reminder that the current retries may 
be caused by some exception.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (KAFKA-13834) batch drain for nodes might have starving issue

2022-04-19 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524687#comment-17524687
 ] 

shizhenzhen commented on KAFKA-13834:
-

[~guozhang] [~showuon]   
Done!
Thanks!

> batch drain for nodes might have starving issue
> ---
>
> Key: KAFKA-13834
> URL: https://issues.apache.org/jira/browse/KAFKA-13834
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 2.5.0, 2.4.1, 2.6.0, 2.5.1, 2.7.0, 2.6.1, 2.8.0, 2.7.1, 
> 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1
>Reporter: shizhenzhen
>Priority: Trivial
>  Labels: producer
> Attachments: image-2022-04-18-17-36-47-393.png
>
>
> h3. 问题代码 problem code
> RecordAccumulator#drainBatchesForOneNode
> !https://img-blog.csdnimg.cn/a4e309723c364586a46df8d94e49291f.png|width=786,height=266!
>   
> 问题出在这个, private int drainIndex;
> The problem is this,private int drainIndex;
> h3. 代码预期 code expectations
> 这端代码的逻辑, 是计算出发往每个Node的ProducerBatchs,是批量发送。
> 因为发送一次的请求量是有限的(max.request.size), 所以一次可能只能发几个ProducerBatch. 那么这次发送了之后, 
> 需要记录一下这里是遍历到了哪个Batch, 下次再次遍历的时候能够接着上一次遍历发送。
> 简单来说呢就是下图这样
>  
> The logic of the code at this end is to calculate the ProducerBatchs sent to 
> each Node, which is sent in batches.
> Because the amount of requests sent at one time is limited 
> (max.request.size), only a few ProducerBatch may be sent at a time. Then 
> after sending this time, you need to record which Batch is traversed here, 
> and the next time you traverse it again Can continue the last traversal send.
> Simply put, it is as follows
>  
> !image-2022-04-18-17-36-47-393.png|width=798,height=526!
>  
>  
>  
> h3. 实际情况 The actual situation
> 但是呢, 因为上面的索引drainIndex 是一个全局变量, 是RecordAccumulator共享的。
> 那么通常会有很多个Node需要进行遍历, 
> 上一个Node的索引会接着被第二个第三个Node接着使用,那么也就无法比较均衡合理的让每个TopicPartition都遍历到.
> 正常情况下其实这样也没有事情, 如果不出现极端情况的下,基本上都能遍历到。
> 怕就怕极端情况, 导致有很多TopicPartition不能够遍历到,也就会造成一部分消息一直发送不出去。
> However, because the index drainIndex above is a global variable shared by 
> RecordAccumulator.
> Then there are usually many Nodes that need to be traversed, and the index of 
> the previous Node will be used by the second and third Nodes, so it is 
> impossible to traverse each TopicPartition in a balanced and reasonable 
> manner.
> Under normal circumstances, there is nothing wrong with this. If there is no 
> extreme situation, it can basically be traversed.
> I'm afraid of extreme situations, which will result in many TopicPartitions 
> that cannot be traversed, and some messages will not be sent out all the time.
> h3. 造成的影响 impact
> 导致部分消息一直发送不出去、或者很久才能够发送出去。
> As a result, some messages cannot be sent out, or can take a long time to be 
> sent out.
> h3. 触发异常情况的一个Case /  A Case that triggers an exception
> 该Case场景如下:
>  # 生产者向3个Node发送消息
>  # 每个Node都是3个TopicPartition
>  # 每个TopicPartition队列都一直源源不断的写入消息、
>  # max.request.size 刚好只能存放一个ProdcuerBatch的大小。
> 就是这么几个条件,会造成每个Node只能收到一个TopicPartition队列里面的PrdoucerBatch消息。
> 开始的时候 drainIndex=0. 开始遍历第一个Node-0。 Node-0 准备开始遍历它下面的几个队列中的ProducerBatch,遍历一次 
> 则drainIndex+1,发现遍历了一个队列之后,就装满了这一批次的请求。
> 那么开始遍历Node-1,这个时候则drainIndex=1,首先遍历到的是 第二个TopicPartition。然后发现一个Batch之后也满了。
> 那么开始遍历Node-1,这个时候则drainIndex=2,首先遍历到的是 第三个TopicPartition。然后发现一个Batch之后也满了。
> 这一次的Node遍历结束之后把消息发送之后
> 又接着上面的请求流程,那么这个时候的drainIndex=3了。
> 遍历Node-0,这个时候取模计算得到的是第几个TopicPartition呢?那不还是第1个吗。相当于后面的流程跟上面一模一样。
> 也就导致了每个Node的第2、3个TopicPartition队列中的ProducerBatch永远遍历不到。
> 也就发送不出去了。
>  
> The case scenario is as follows:
> Producer sends message to 3 Nodes
> Each Node is 3 TopicPartitions
> Each TopicPartition queue has been continuously writing messages,
> max.request.size can only store the size of one ProdcuerBatch.
> It is these conditions that cause each Node to receive only one PrdoucerBatch 
> message in the TopicPartition queue.
> At the beginning drainIndex=0. Start traversing the first Node-0. Node-0 is 
> ready to start traversing the ProducerBatch in several queues below it. After 
> traversing once, drainIndex + 1. After traversing a queue, it is full of 
> requests for this batch.
> Then start traversing Node-1. At this time, drainIndex=1, and the second 
> TopicPartition is traversed first. Then I found that a Batch was also full.
> Then start traversing Node-1. At this time, drainIndex=2, and the third 
> TopicPartition is traversed first. Then I found that a Batch was also full.
> After this Node traversal is over, the message is sent
> Then the above request process is followed, then drainIndex=3 at this time.
> Traversing Node-0, which TopicPartition is obtained by taking the modulo 
> calculation at this time? Isn't that the first one? Equivalent to the 
> following process is exactly the same as above.
> As a result, the ProducerBatch in the second and third TopicPartition queues 
> of each Node can never 

[jira] [Updated] (KAFKA-13834) batch drain for nodes might have starving issue

2022-04-19 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-13834:

Description: 
h3. 问题代码 problem code

RecordAccumulator#drainBatchesForOneNode

!https://img-blog.csdnimg.cn/a4e309723c364586a46df8d94e49291f.png|width=786,height=266!
  

问题出在这个, private int drainIndex;

The problem is this,private int drainIndex;
h3. 代码预期 code expectations

这端代码的逻辑, 是计算出发往每个Node的ProducerBatchs,是批量发送。

因为发送一次的请求量是有限的(max.request.size), 所以一次可能只能发几个ProducerBatch. 那么这次发送了之后, 
需要记录一下这里是遍历到了哪个Batch, 下次再次遍历的时候能够接着上一次遍历发送。

简单来说呢就是下图这样

 

The logic of the code at this end is to calculate the ProducerBatchs sent to 
each Node, which is sent in batches.

Because the amount of requests sent at one time is limited (max.request.size), 
only a few ProducerBatch may be sent at a time. Then after sending this time, 
you need to record which Batch is traversed here, and the next time you 
traverse it again Can continue the last traversal send.

Simply put, it is as follows

 

!image-2022-04-18-17-36-47-393.png|width=798,height=526!

 

 

 
h3. 实际情况 The actual situation

但是呢, 因为上面的索引drainIndex 是一个全局变量, 是RecordAccumulator共享的。

那么通常会有很多个Node需要进行遍历, 
上一个Node的索引会接着被第二个第三个Node接着使用,那么也就无法比较均衡合理的让每个TopicPartition都遍历到.

正常情况下其实这样也没有事情, 如果不出现极端情况的下,基本上都能遍历到。

怕就怕极端情况, 导致有很多TopicPartition不能够遍历到,也就会造成一部分消息一直发送不出去。

However, because the index drainIndex above is a global variable shared by 
RecordAccumulator.

Then there are usually many Nodes that need to be traversed, and the index of 
the previous Node will be used by the second and third Nodes, so it is 
impossible to traverse each TopicPartition in a balanced and reasonable manner.

Under normal circumstances, there is nothing wrong with this. If there is no 
extreme situation, it can basically be traversed.

I'm afraid of extreme situations, which will result in many TopicPartitions 
that cannot be traversed, and some messages will not be sent out all the time.
h3. 造成的影响 impact

导致部分消息一直发送不出去、或者很久才能够发送出去。

As a result, some messages cannot be sent out, or can take a long time to be 
sent out.
h3. 触发异常情况的一个Case /  A Case that triggers an exception

该Case场景如下:
 # 生产者向3个Node发送消息
 # 每个Node都是3个TopicPartition
 # 每个TopicPartition队列都一直源源不断的写入消息、
 # max.request.size 刚好只能存放一个ProducerBatch的大小。

就是这么几个条件,会造成每个Node只能收到一个TopicPartition队列里面的ProducerBatch消息。

开始的时候 drainIndex=0. 开始遍历第一个Node-0。 Node-0 准备开始遍历它下面的几个队列中的ProducerBatch,遍历一次 
则drainIndex+1,发现遍历了一个队列之后,就装满了这一批次的请求。

那么开始遍历Node-1,这个时候则drainIndex=1,首先遍历到的是 第二个TopicPartition。然后发现一个Batch之后也满了。

那么开始遍历Node-2,这个时候则drainIndex=2,首先遍历到的是 第三个TopicPartition。然后发现一个Batch之后也满了。

这一次的Node遍历结束之后把消息发送之后

又接着上面的请求流程,那么这个时候的drainIndex=3了。

遍历Node-0,这个时候取模计算得到的是第几个TopicPartition呢?那不还是第1个吗。相当于后面的流程跟上面一模一样。

也就导致了每个Node的第2、3个TopicPartition队列中的ProducerBatch永远遍历不到。

也就发送不出去了。

 

The case scenario is as follows:

The producer sends messages to 3 Nodes.
Each Node has 3 TopicPartitions.
Messages are continuously written to every TopicPartition queue.
max.request.size can hold exactly one ProducerBatch.

Under these conditions, each Node only ever receives the ProducerBatch from one of its 
TopicPartition queues.

At the beginning drainIndex=0. Start traversing the first Node-0. Node-0 is 
ready to start traversing the ProducerBatch in several queues below it. After 
traversing once, drainIndex + 1. After traversing a queue, it is full of 
requests for this batch.

Then start traversing Node-1. At this time, drainIndex=1, and the second 
TopicPartition is traversed first. Then I found that a Batch was also full.

Then start traversing Node-2. At this time, drainIndex=2, and the third 
TopicPartition is traversed first. Then I found that a Batch was also full.

After this Node traversal is over, the message is sent

Then the above request process is followed, then drainIndex=3 at this time.

Traversing Node-0, which TopicPartition is obtained by taking the modulo 
calculation at this time? Isn't that the first one? Equivalent to the following 
process is exactly the same as above.

As a result, the ProducerBatch in the second and third TopicPartition queues of 
each Node can never be traversed.

It can't be sent.

!https://img-blog.csdnimg.cn/aa2cc2e7a9ff4536a1800d9117e02555.png#pic_center|width=660,height=394!
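A tiny standalone simulation of the case above (3 nodes, 3 partition queues each, a request that is full after a single batch); the shared drainIndex variable below is a simplified stand-in for the field discussed above, and the output shows each node draining the same one queue on every round while its other two queues starve:

```java
public class SharedDrainIndexSimulation {
    public static void main(String[] args) {
        int nodes = 3;
        int queuesPerNode = 3;
        int sharedDrainIndex = 0;   // simplified stand-in for the shared RecordAccumulator field

        for (int round = 1; round <= 3; round++) {
            for (int node = 0; node < nodes; node++) {
                // Each drain starts at the shared index and, because max.request.size fits
                // exactly one batch, stops after draining a single queue.
                int start = sharedDrainIndex % queuesPerNode;
                sharedDrainIndex++;
                System.out.printf("round %d: node-%d drains its queue %d%n", round, node, start);
            }
        }
        // Output: node-0 always drains queue 0, node-1 always queue 1, node-2 always queue 2;
        // the other two queues of every node are never drained.
    }
}
```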

 
h3. 解决方案  solution

只需要每个Node,维护一个自己的索引就行了。

 

 

Only each Node needs to maintain its own index.
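A minimal sketch of the proposed fix, under the same simplified model (names are illustrative, not the actual RecordAccumulator code): keep one drain index per node so each node's queues are rotated independently.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a per-node drain index instead of one shared index.
final class PerNodeDrainIndex {
    private final Map<Integer, Integer> drainIndexPerNode = new HashMap<>();

    /** Returns the queue to start draining from for this node, then advances that node's index. */
    int nextStartQueue(int nodeId, int queueCount) {
        int index = drainIndexPerNode.getOrDefault(nodeId, 0);
        drainIndexPerNode.put(nodeId, index + 1);
        return index % queueCount;
    }
}
```

With this, node-0 would drain queue 0, 1, 2, 0, ... across successive rounds instead of being stuck on queue 0.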

 

  was:
h3. 问题代码 problem code

RecordAccumulator#drainBatchesForOneNode

!https://img-blog.csdnimg.cn/a4e309723c364586a46df8d94e49291f.png!

问题出在这个, private int drainIndex;

The problem is this,private int drainIndex;
h3. 代码预期 code expectations

这端代码的逻辑, 是计算出发往每个Node的ProducerBatchs,是批量发送。

因为发送一次的请求量是有限的(max.request.size), 所以一次可能只能发几个ProducerBatch. 那么这次发送了之后, 
需要记录一下这里是遍历到了哪个Batch, 下次再次遍历的时候能够接着上一次遍历发送。

简单来说呢就是下图这样

 

The logic of the code at this end is to calculate the ProducerBatchs sent to 
each Node, which is 

[jira] [Updated] (KAFKA-13834) batch drain for nodes might have starving issue

2022-04-19 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-13834:

Summary: batch drain for nodes might have starving issue  (was: Some 
problems with producers choosing batches of messages to send)

> batch drain for nodes might have starving issue
> ---
>
> Key: KAFKA-13834
> URL: https://issues.apache.org/jira/browse/KAFKA-13834
> Project: Kafka
>  Issue Type: Bug
>  Components: producer 
>Affects Versions: 2.5.0, 2.4.1, 2.6.0, 2.5.1, 2.7.0, 2.6.1, 2.8.0, 2.7.1, 
> 2.6.2, 3.1.0, 2.6.3, 2.7.2, 2.8.1, 3.0.0, 3.0.1
>Reporter: shizhenzhen
>Priority: Trivial
>  Labels: producer
> Attachments: image-2022-04-18-17-36-47-393.png
>
>
> h3. 问题代码 problem code
> RecordAccumulator#drainBatchesForOneNode
> !https://img-blog.csdnimg.cn/a4e309723c364586a46df8d94e49291f.png!
> 问题出在这个, private int drainIndex;
> The problem is this,private int drainIndex;
> h3. 代码预期 code expectations
> 这端代码的逻辑, 是计算出发往每个Node的ProducerBatchs,是批量发送。
> 因为发送一次的请求量是有限的(max.request.size), 所以一次可能只能发几个ProducerBatch. 那么这次发送了之后, 
> 需要记录一下这里是遍历到了哪个Batch, 下次再次遍历的时候能够接着上一次遍历发送。
> 简单来说呢就是下图这样
>  
> The logic of the code at this end is to calculate the ProducerBatchs sent to 
> each Node, which is sent in batches.
> Because the amount of requests sent at one time is limited 
> (max.request.size), only a few ProducerBatch may be sent at a time. Then 
> after sending this time, you need to record which Batch is traversed here, 
> and the next time you traverse it again Can continue the last traversal send.
> Simply put, it is as follows
>  
> !image-2022-04-18-17-36-47-393.png!
>  
>  
>  
> h3. 实际情况 The actual situation
> 但是呢, 因为上面的索引drainIndex 是一个全局变量, 是RecordAccumulator共享的。
> 那么通常会有很多个Node需要进行遍历, 
> 上一个Node的索引会接着被第二个第三个Node接着使用,那么也就无法比较均衡合理的让每个TopicPartition都遍历到.
> 正常情况下其实这样也没有事情, 如果不出现极端情况的下,基本上都能遍历到。
> 怕就怕极端情况, 导致有很多TopicPartition不能够遍历到,也就会造成一部分消息一直发送不出去。
> However, because the index drainIndex above is a global variable shared by 
> RecordAccumulator.
> Then there are usually many Nodes that need to be traversed, and the index of 
> the previous Node will be used by the second and third Nodes, so it is 
> impossible to traverse each TopicPartition in a balanced and reasonable 
> manner.
> Under normal circumstances, there is nothing wrong with this. If there is no 
> extreme situation, it can basically be traversed.
> I'm afraid of extreme situations, which will result in many TopicPartitions 
> that cannot be traversed, and some messages will not be sent out all the time.
> h3. 造成的影响 impact
> 导致部分消息一直发送不出去、或者很久才能够发送出去。
> As a result, some messages cannot be sent out, or can take a long time to be 
> sent out.
> h3. 触发异常情况的一个Case /  A Case that triggers an exception
> 该Case场景如下:
>  # 生产者向3个Node发送消息
>  # 每个Node都是3个TopicPartition
>  # 每个TopicPartition队列都一直源源不断的写入消息、
>  # max.request.size 刚好只能存放一个ProdcuerBatch的大小。
> 就是这么几个条件,会造成每个Node只能收到一个TopicPartition队列里面的PrdoucerBatch消息。
> 开始的时候 drainIndex=0. 开始遍历第一个Node-0。 Node-0 准备开始遍历它下面的几个队列中的ProducerBatch,遍历一次 
> 则drainIndex+1,发现遍历了一个队列之后,就装满了这一批次的请求。
> 那么开始遍历Node-1,这个时候则drainIndex=1,首先遍历到的是 第二个TopicPartition。然后发现一个Batch之后也满了。
> 那么开始遍历Node-1,这个时候则drainIndex=2,首先遍历到的是 第三个TopicPartition。然后发现一个Batch之后也满了。
> 这一次的Node遍历结束之后把消息发送之后
> 又接着上面的请求流程,那么这个时候的drainIndex=3了。
> 遍历Node-0,这个时候取模计算得到的是第几个TopicPartition呢?那不还是第1个吗。相当于后面的流程跟上面一模一样。
> 也就导致了每个Node的第2、3个TopicPartition队列中的ProducerBatch永远遍历不到。
> 也就发送不出去了。
>  
> The case scenario is as follows:
> Producer sends message to 3 Nodes
> Each Node is 3 TopicPartitions
> Each TopicPartition queue has been continuously writing messages,
> max.request.size can only store the size of one ProdcuerBatch.
> It is these conditions that cause each Node to receive only one PrdoucerBatch 
> message in the TopicPartition queue.
> At the beginning drainIndex=0. Start traversing the first Node-0. Node-0 is 
> ready to start traversing the ProducerBatch in several queues below it. After 
> traversing once, drainIndex + 1. After traversing a queue, it is full of 
> requests for this batch.
> Then start traversing Node-1. At this time, drainIndex=1, and the second 
> TopicPartition is traversed first. Then I found that a Batch was also full.
> Then start traversing Node-1. At this time, drainIndex=2, and the third 
> TopicPartition is traversed first. Then I found that a Batch was also full.
> After this Node traversal is over, the message is sent
> Then the above request process is followed, then drainIndex=3 at this time.
> Traversing Node-0, which TopicPartition is obtained by taking the modulo 
> calculation at this time? Isn't that the first one? Equivalent to the 
> following process is exactly the same as above.
> As a result, the ProducerBatch in the second and third TopicPartition queues 
> of 

[jira] [Created] (KAFKA-13834) Some problems with producers choosing batches of messages to send

2022-04-18 Thread shizhenzhen (Jira)
shizhenzhen created KAFKA-13834:
---

 Summary: Some problems with producers choosing batches of messages 
to send
 Key: KAFKA-13834
 URL: https://issues.apache.org/jira/browse/KAFKA-13834
 Project: Kafka
  Issue Type: Bug
  Components: producer 
Affects Versions: 3.0.1, 3.0.0, 2.8.1, 2.7.2, 2.6.3, 3.1.0, 2.6.2, 2.7.1, 
2.8.0, 2.6.1, 2.7.0, 2.5.1, 2.6.0, 2.4.1, 2.5.0
Reporter: shizhenzhen
 Attachments: image-2022-04-18-17-36-47-393.png

h3. 问题代码 problem code

RecordAccumulator#drainBatchesForOneNode

!https://img-blog.csdnimg.cn/a4e309723c364586a46df8d94e49291f.png!

问题出在这个, private int drainIndex;

The problem is this,private int drainIndex;
h3. 代码预期 code expectations

这端代码的逻辑, 是计算出发往每个Node的ProducerBatchs,是批量发送。

因为发送一次的请求量是有限的(max.request.size), 所以一次可能只能发几个ProducerBatch. 那么这次发送了之后, 
需要记录一下这里是遍历到了哪个Batch, 下次再次遍历的时候能够接着上一次遍历发送。

简单来说呢就是下图这样

 

The logic of the code at this end is to calculate the ProducerBatchs sent to 
each Node, which is sent in batches.

Because the amount of requests sent at one time is limited (max.request.size), 
only a few ProducerBatch may be sent at a time. Then after sending this time, 
you need to record which Batch is traversed here, and the next time you 
traverse it again Can continue the last traversal send.

Simply put, it is as follows

 

!image-2022-04-18-17-36-47-393.png!

 

 

 
h3. 实际情况 The actual situation

但是呢, 因为上面的索引drainIndex 是一个全局变量, 是RecordAccumulator共享的。

那么通常会有很多个Node需要进行遍历, 
上一个Node的索引会接着被第二个第三个Node接着使用,那么也就无法比较均衡合理的让每个TopicPartition都遍历到.

正常情况下其实这样也没有事情, 如果不出现极端情况的下,基本上都能遍历到。

怕就怕极端情况, 导致有很多TopicPartition不能够遍历到,也就会造成一部分消息一直发送不出去。

However, because the index drainIndex above is a global variable shared by 
RecordAccumulator.

Then there are usually many Nodes that need to be traversed, and the index of 
the previous Node will be used by the second and third Nodes, so it is 
impossible to traverse each TopicPartition in a balanced and reasonable manner.

Under normal circumstances, there is nothing wrong with this. If there is no 
extreme situation, it can basically be traversed.

I'm afraid of extreme situations, which will result in many TopicPartitions 
that cannot be traversed, and some messages will not be sent out all the time.
h3. 造成的影响 impact

导致部分消息一直发送不出去、或者很久才能够发送出去。

As a result, some messages cannot be sent out, or can take a long time to be 
sent out.
h3. 触发异常情况的一个Case /  A Case that triggers an exception

该Case场景如下:
 # 生产者向3个Node发送消息
 # 每个Node都是3个TopicPartition
 # 每个TopicPartition队列都一直源源不断的写入消息、
 # max.request.size 刚好只能存放一个ProducerBatch的大小。

就是这么几个条件,会造成每个Node只能收到一个TopicPartition队列里面的ProducerBatch消息。

开始的时候 drainIndex=0. 开始遍历第一个Node-0。 Node-0 准备开始遍历它下面的几个队列中的ProducerBatch,遍历一次 
则drainIndex+1,发现遍历了一个队列之后,就装满了这一批次的请求。

那么开始遍历Node-1,这个时候则drainIndex=1,首先遍历到的是 第二个TopicPartition。然后发现一个Batch之后也满了。

那么开始遍历Node-2,这个时候则drainIndex=2,首先遍历到的是 第三个TopicPartition。然后发现一个Batch之后也满了。

这一次的Node遍历结束之后把消息发送之后

又接着上面的请求流程,那么这个时候的drainIndex=3了。

遍历Node-0,这个时候取模计算得到的是第几个TopicPartition呢?那不还是第1个吗。相当于后面的流程跟上面一模一样。

也就导致了每个Node的第2、3个TopicPartition队列中的ProducerBatch永远遍历不到。

也就发送不出去了。

 

The case scenario is as follows:

The producer sends messages to 3 Nodes.
Each Node has 3 TopicPartitions.
Messages are continuously written to every TopicPartition queue.
max.request.size can hold exactly one ProducerBatch.

Under these conditions, each Node only ever receives the ProducerBatch from one of its 
TopicPartition queues.

At the beginning drainIndex=0. Start traversing the first Node-0. Node-0 is 
ready to start traversing the ProducerBatch in several queues below it. After 
traversing once, drainIndex + 1. After traversing a queue, it is full of 
requests for this batch.

Then start traversing Node-1. At this time, drainIndex=1, and the second 
TopicPartition is traversed first. Then I found that a Batch was also full.

Then start traversing Node-2. At this time, drainIndex=2, and the third 
TopicPartition is traversed first. Then I found that a Batch was also full.

After this Node traversal is over, the message is sent

Then the above request process is followed, then drainIndex=3 at this time.

Traversing Node-0, which TopicPartition is obtained by taking the modulo 
calculation at this time? Isn't that the first one? Equivalent to the following 
process is exactly the same as above.

As a result, the ProducerBatch in the second and third TopicPartition queues of 
each Node can never be traversed.

It can't be sent.

!https://img-blog.csdnimg.cn/aa2cc2e7a9ff4536a1800d9117e02555.png#pic_center!

 
h3. 解决方案  solution

只需要每个Node,维护一个自己的索引就行了。

 

 

Only each Node needs to maintain its own index.

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Comment Edited] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-31 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435380#comment-17435380
 ] 

shizhenzhen edited comment on KAFKA-13226 at 11/1/21, 2:40 AM:
---

I have pushed a PR, please review it:

https://github.com/apache/kafka/pull/11453


was (Author: shizhenzhen):
I have pushed a PR, please review it:

https://github.com/apache/kafka/pull/11445

> Partition expansion may cause uneven distribution
> -
>
> Key: KAFKA-13226
> URL: https://issues.apache.org/jira/browse/KAFKA-13226
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 2.5.0, 2.8.0, 2.7.1, 2.6.2, 2.8.1, 3.0.0
> Environment: mac  
> kafka-2.5.0
>Reporter: shizhenzhen
>Priority: Major
>
>  
>  {color:#ff}*Partition expansion may cause uneven distribution*{color}
>  
> 1. Create a topic with 3 partitions and 1 replica
> !https://img-blog.csdnimg.cn/561112064b114acfb03882aa09100e0e.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_55,color_FF,t_70,g_se,x_16!
>  
> 2. Expand the partitions to 5
> !https://img-blog.csdnimg.cn/f7c3c33b6662457080d9bb5bb190c0c2.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_49,color_FF,t_70,g_se,x_16!
>  
> 3. Does this meet expectations ?
>  
> !https://img-blog.csdnimg.cn/20cc1007c4214c4ebfcb1b2c2eeb98e4.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_18,color_FF,t_70,g_se,x_16!
>  
> {color:#ff}*So is this a bug?*{color}
>  
> The problem may arise here:
> when we create a new topic, the broker list we get is an object Map;
> *its ordering is not preserved.*
> If you read the code, it first sorts by brokerId, but in the end it converts the result
> into an *unordered Map*;
>  
>  
> !https://img-blog.csdnimg.cn/131b9bf0c19e4753a73512af4c9c5854.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_66,color_FF,t_70,g_se,x_16!
>  
>  
>  
> The important thing is that the broker list is sorted when expanding partitions
> and during partition reassignment;
> {color:#ff}*So why not sort when creating topics?*{color}
>  
> If the broker list were sorted when creating a new topic, this problem would not
> occur;
>  
> so maybe it is a tiny bug?
>  
>  
> If you can read Chinese, you can look at this article, where I describe it in
> detail:
>  [This may be a Kafka
> bug?|https://shirenchuang.blog.csdn.net/article/details/119912418]
>  
> We look forward to receiving your reply.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-28 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435384#comment-17435384
 ] 

shizhenzhen edited comment on KAFKA-13226 at 10/28/21, 1:03 PM:


When I tried to fix it, I found that even after fixing the unsorted toMap in 
KafkaZkClient#getAllBrokerAndEpochsInCluster, the problem still was not solved, because the 
brokerList is overwritten when an UpdateMetadata request is made and is still not sorted, so the 
UpdateMetadata request would also need to be handled. However, an unsorted list causes no problem 
anywhere else in the cluster except for createTopic, so to fix this bug I just sort it in 
createTopic; that keeps the change as small as possible.
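As a generic Java illustration of the ordering issue described above (the real code is on the Scala side; the broker ids and structures here are made up): a list that was carefully sorted by broker id silently loses that order once it is collected into a plain hash map, whereas an order-preserving map keeps it, which is essentially what sorting again at createTopic time restores.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BrokerOrderDemo {
    public static void main(String[] args) {
        // Broker ids already sorted, as the code intends.
        List<Integer> sortedBrokerIds = List.of(5, 17, 256, 1001);

        // Collecting into a plain HashMap does not preserve that ordering guarantee.
        Map<Integer, String> unordered = new HashMap<>();
        // An order-preserving map keeps the sorted iteration order.
        Map<Integer, String> ordered = new LinkedHashMap<>();
        for (int id : sortedBrokerIds) {
            unordered.put(id, "rack-" + id);
            ordered.put(id, "rack-" + id);
        }

        // HashMap iteration order is not guaranteed to follow the sorted insertion order
        // (for these ids it typically will not); LinkedHashMap, or re-sorting before use, does.
        System.out.println("HashMap iteration order:       " + unordered.keySet());
        System.out.println("LinkedHashMap iteration order: " + ordered.keySet());
    }
}
```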


was (Author: shizhenzhen):
When I tried to fix it, I found that even after fixing the unsorted toMap in 
KafkaZkClient#getAllBrokerAndEpochsInCluster, the problem still was not solved, because the 
brokerList is overwritten when an UpdateMetadata request is made and is still not ordered, so the 
UpdateMetadata request would also need to be handled. However, an unsorted list causes no problem 
anywhere else in the cluster except for createTopic, so to fix this bug I just sort it in 
createTopic; that keeps the change as small as possible.

> Partition expansion may cause uneven distribution
> -
>
> Key: KAFKA-13226
> URL: https://issues.apache.org/jira/browse/KAFKA-13226
> Project: Kafka
>  Issue Type: Bug
>  Components: controller
>Affects Versions: 2.5.0, 2.8.0, 2.7.1, 2.6.2, 2.8.1, 3.0.0
> Environment: mac  
> kafka-2.5.0
>Reporter: shizhenzhen
>Priority: Major
>
>  
>  {color:#ff}*Partition expansion may cause uneven distribution*{color}
>  
> 1. Create a topic with 3 partitions and 1 replica
> !https://img-blog.csdnimg.cn/561112064b114acfb03882aa09100e0e.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_55,color_FF,t_70,g_se,x_16!
>  
> 2. Expand the partitions to 5
> !https://img-blog.csdnimg.cn/f7c3c33b6662457080d9bb5bb190c0c2.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_49,color_FF,t_70,g_se,x_16!
>  
> 3. Does this meet expectations ?
>  
> !https://img-blog.csdnimg.cn/20cc1007c4214c4ebfcb1b2c2eeb98e4.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_18,color_FF,t_70,g_se,x_16!
>  
> {color:#ff}*So is this a bug?*{color}
>  
> The problem may arise here:
> when we create a new topic, the broker list we get is an object Map;
> *its ordering is not preserved.*
> If you read the code, it first sorts by brokerId, but in the end it converts the result
> into an *unordered Map*;
>  
>  
> !https://img-blog.csdnimg.cn/131b9bf0c19e4753a73512af4c9c5854.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_66,color_FF,t_70,g_se,x_16!
>  
>  
>  
> The important thing is that the broker list is sorted when expanding partitions
> and during partition reassignment;
> {color:#ff}*So why not sort when creating topics?*{color}
>  
> If the broker list were sorted when creating a new topic, this problem would not
> occur;
>  
> so maybe it is a tiny bug?
>  
>  
> If you can read Chinese, you can look at this article, where I describe it in
> detail:
>  [This may be a Kafka
> bug?|https://shirenchuang.blog.csdn.net/article/details/119912418]
>  
> We look forward to receiving your reply.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-28 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435384#comment-17435384
 ] 

shizhenzhen edited comment on KAFKA-13226 at 10/28/21, 1:02 PM:


When I tried to fix it, I found that even after fixing the unsorted toMap in 
KafkaZkClient#getAllBrokerAndEpochsInCluster, the problem still was not solved, because the 
brokerList is overwritten when an UpdateMetadata request is made and is still not ordered, so the 
UpdateMetadata request would also need to be handled. However, an unsorted list causes no problem 
anywhere else in the cluster except for createTopic, so to fix this bug I just sort it in 
createTopic; that keeps the change as small as possible.


was (Author: shizhenzhen):
When I tried to fix it, I found that even after fixing the unsorted toMap in 
KafkaZkClient#getAllBrokerAndEpochsInCluster, the problem was still not solved, 
because the map is also updated when an UpdateMetadata request is handled, which 
overwrites the previously sorted entries, so UpdateMetadata also needs to be 
handled. However, even if the broker list for the whole cluster is not sorted, 
nothing goes wrong except in createTopic, so when I fixed this bug I only added a 
sort in createTopic, which keeps the change as small as possible.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-28 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435384#comment-17435384
 ] 

shizhenzhen commented on KAFKA-13226:
-

When I tried to fix it, I found that even after fixing the unsorted toMap in 
KafkaZkClient#getAllBrokerAndEpochsInCluster, the problem was still not solved, 
because the map is also updated when an UpdateMetadata request is handled, which 
overwrites the previously sorted entries, so UpdateMetadata also needs to be 
handled. However, even if the broker list for the whole cluster is not sorted, 
nothing goes wrong except in createTopic, so when I fixed this bug I only added a 
sort in createTopic, which keeps the change as small as possible.
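
For illustration, here is a small Scala sketch (the Broker case class and the 
epoch value are illustrative, not the Kafka source) of the toMap behaviour 
referred to above: a broker list that was sorted by id loses that ordering once 
it is converted into a default, hash-based immutable Map.

```scala
object ToMapOrderingDemo extends App {
  case class Broker(id: Int)

  // A broker list already sorted by id, paired with an epoch value
  // (names are only stand-ins for what getAllBrokerAndEpochsInCluster returns).
  val sortedBrokers: Seq[(Broker, Long)] = (0 to 9).map(i => Broker(i) -> 100L)

  // .toMap builds a hash-based immutable Map: the sort order is not preserved.
  val brokerMap: Map[Broker, Long] = sortedBrokers.toMap

  println(sortedBrokers.map(_._1.id).mkString(","))  // 0,1,2,3,4,5,6,7,8,9
  println(brokerMap.keys.map(_.id).mkString(","))    // typically some other order
}
```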




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-28 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435385#comment-17435385
 ] 

shizhenzhen commented on KAFKA-13226:
-

When I tried to fix it, I found that even after fixing the unsorted toMap in 
KafkaZkClient#getAllBrokerAndEpochsInCluster, the problem was still not solved, 
because the list is also updated when an UPDATE_METADATA request is handled, 
which overwrites the previously sorted entries, so UPDATE_METADATA also needs to 
be handled. However, even if the whole cluster's broker list is not sorted, 
nothing goes wrong except in createTopic, so when fixing this bug I only added a 
sort in createTopic; this is the smallest change and solves the problem completely.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-28 Thread shizhenzhen (Jira)


[ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17435380#comment-17435380
 ] 

shizhenzhen commented on KAFKA-13226:
-

I have pushed a PR, please review it:

https://github.com/apache/kafka/pull/11445




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-28 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-13226:

Affects Version/s: 2.8.1
   3.0.0




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-10-28 Thread shizhenzhen (Jira)


 [ 
https://issues.apache.org/jira/browse/KAFKA-13226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shizhenzhen updated KAFKA-13226:

Reviewer: Guozhang Wang  (was: Zhanxiang (Patrick) Huang)




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (KAFKA-13226) Partition expansion may cause uneven distribution

2021-08-25 Thread shizhenzhen (Jira)
shizhenzhen created KAFKA-13226:
---

 Summary: Partition expansion may cause uneven distribution
 Key: KAFKA-13226
 URL: https://issues.apache.org/jira/browse/KAFKA-13226
 Project: Kafka
  Issue Type: Bug
  Components: controller
Affects Versions: 2.6.2, 2.7.1, 2.8.0, 2.5.0
 Environment: mac  
kafka-2.5.0


Reporter: shizhenzhen


 

 *Partition expansion may cause uneven distribution*

 

1. Create a topic with 3 partitions and 1 replica

!https://img-blog.csdnimg.cn/561112064b114acfb03882aa09100e0e.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_55,color_FF,t_70,g_se,x_16!

 

2. Expand the topic to 5 partitions

!https://img-blog.csdnimg.cn/f7c3c33b6662457080d9bb5bb190c0c2.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_49,color_FF,t_70,g_se,x_16!

 

3. Does this meet expectations?

 

!https://img-blog.csdnimg.cn/20cc1007c4214c4ebfcb1b2c2eeb98e4.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_18,color_FF,t_70,g_se,x_16!

 

*So is this a bug?*

 

The problem may arise here: 

When we create a new topic, the broker list we get is an object Map.

*This Map is unordered.*

If you read the code, it first sorts by brokerId, but the result is finally 
converted into an *unordered Map*.

 

 

!https://img-blog.csdnimg.cn/131b9bf0c19e4753a73512af4c9c5854.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_Q1NETiBA55-z6Ie76Ie755qE5p2C6LSn6ZO6,size_66,color_FF,t_70,g_se,x_16!

 

 

 

The important thing is that the broker list is sorted when expanding partitions 
and during partition reassignment.

*So why not sort it when creating topics?*

 

If the broker list is sorted when creating a new topic, this problem will not occur.

 

So it may just be a small bug?
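
For illustration, here is a simplified Scala model of the effect. It is not 
Kafka's real assignment code: the broker ids and the rule for where expansion 
starts (the position of partition 0's leader in the sorted list) are assumptions 
made only for this example. The point is that if creation lays leaders out 
round-robin over an unordered broker list while expansion continues round-robin 
over the sorted list, the final distribution can be skewed even though each step 
looks round-robin on its own.

```scala
object UnevenDistributionDemo extends App {
  val sortedBrokers = Vector(0, 1, 2, 3, 4)   // order assumed by partition expansion
  val creationOrder = Vector(0, 3, 1, 2, 4)   // arbitrary order from an unordered Map

  // Creation: partitions 0..2, round-robin over the unordered list.
  val created = (0 until 3).map(p => p -> creationOrder(p % creationOrder.size))

  // Expansion to 5 partitions: continue round-robin over the sorted list,
  // starting from the position of partition 0's leader in that sorted list.
  val startIndex = sortedBrokers.indexOf(created.head._2)
  val added = (3 until 5).map(p => p -> sortedBrokers((p + startIndex) % sortedBrokers.size))

  val leaders = (created ++ added).toMap
  val counts  = leaders.values.groupBy(identity).map { case (b, ps) => b -> ps.size }

  println(leaders.toSeq.sorted)  // partition -> leader: (0,0), (1,3), (2,1), (3,3), (4,4)
  println(counts)                // broker 3 leads 2 partitions, broker 2 leads none
}
```

With a sorted list at creation time the two passes agree and the five partitions 
would land on five different brokers.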

 

 

If you can read Chinese, you can read this article, where I describe the problem 
in detail:

[This may be a Kafka bug?|https://shirenchuang.blog.csdn.net/article/details/119912418]

I look forward to your reply.

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)