waitingF commented on code in PR #8376:
URL: https://github.com/apache/hudi/pull/8376#discussion_r1166212077
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java:
##########
@@ -148,9 +166,58 @@ public static OffsetRange[] computeOffsetRanges(Map<TopicPartition, Long> fromOf
   }

Review Comment:
   @bvaradar I don't think this algorithm works well when the data is skewed, because it does not split a skewed partition evenly. For example, given topic partitions "0:0->100, 1:0->500" and minPartitions=3, the algorithm generates the ranges "0:0->100, 1:0->200, 1:200->300", and the two ranges for partition 1 are not equal in size. With more skewed partitions it gets worse.

   With resplit, each TopicPartition gets evenly sized ranges even under skew, because it first allocates ranges to the topic partitions and then resplits those allocated ranges into roughly minPartitions ranges. Given that, and since the complexity of the resplit should be very small, I think resplit is the better option. What do you think?
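
To make the resplit idea concrete, here is a rough sketch of the second step only: splitting already allocated ranges into evenly sized pieces. The names `ResplitSketch`, `Range`, and `resplit`, and the proportional rounding rule, are assumptions made up for illustration; this is not the code in this PR or in `KafkaOffsetGen`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "resplit" step; not the actual KafkaOffsetGen code.
class ResplitSketch {

  /** Offset range [from, until) for one Kafka partition (illustrative type). */
  static final class Range {
    final int partition;
    final long from;
    final long until;

    Range(int partition, long from, long until) {
      this.partition = partition;
      this.from = from;
      this.until = until;
    }

    long size() {
      return until - from;
    }

    @Override
    public String toString() {
      return partition + ":" + from + "->" + until;
    }
  }

  /**
   * Splits each allocated range into evenly sized pieces. The number of pieces
   * per range is proportional to its share of all messages (assumed rule),
   * so the total comes out at roughly minPartitions.
   */
  static List<Range> resplit(List<Range> allocated, int minPartitions) {
    long total = 0;
    for (Range r : allocated) {
      total += r.size();
    }
    List<Range> result = new ArrayList<>();
    for (Range r : allocated) {
      long size = r.size();
      if (size <= 0) {
        result.add(r); // nothing to split
        continue;
      }
      // round(size * minPartitions / total) in integer arithmetic, half up.
      // Overflow for very large offsets is ignored in this sketch.
      long parts = (2 * size * minPartitions + total) / (2 * total);
      parts = Math.max(1, Math.min(parts, size)); // at least 1, at most one per message
      for (long i = 0; i < parts; i++) {
        long start = r.from + size * i / parts;
        long end = r.from + size * (i + 1) / parts;
        result.add(new Range(r.partition, start, end));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<Range> allocated = new ArrayList<>();
    allocated.add(new Range(0, 0, 100));
    allocated.add(new Range(1, 0, 500));
    System.out.println(resplit(allocated, 3));
  }
}
```

Feeding in "0:0->100, 1:0->500" as the allocated ranges with minPartitions=3, the sketch prints `[0:0->100, 1:0->166, 1:166->333, 1:333->500]`: four ranges rather than exactly three, but the pieces of partition 1 differ by at most one offset.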