waitingF commented on code in PR #8376:
URL: https://github.com/apache/hudi/pull/8376#discussion_r1166212077


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/KafkaOffsetGen.java:
##########
@@ -148,9 +166,58 @@ public static OffsetRange[] computeOffsetRanges(Map<TopicPartition, Long> fromOf
         }

Review Comment:
   @bvaradar I don't think the algorithm works well when the data is skewed, 
because it does not divide a partition evenly. For example, given topic 
partitions "0:0->100, 1:0->500" and minPartitions=3, the algorithm generates 
3 ranges: "0:0->100, 1:0->200, 1:200->300". The 2 ranges of partition 1 are 
not split evenly, and the more skewed partitions there are, the worse it gets.
   With resplit, the ranges for a single TopicPartition come out even, 
because it first allocates ranges to the topic partitions and then resplits 
those allocated ranges into roughly minPartitions ranges.
   Given this, and since the complexity of the resplit should be very small, 
I think resplit is the better option.
   What do you think?
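   To make the resplit idea concrete, here is a minimal sketch of the second 
step only (splitting already-allocated ranges). It is not the code in this PR: 
`ResplitSketch`, `Range`, and `resplit` are hypothetical names, and the 
proportional rounding rule is an assumption chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not the KafkaOffsetGen implementation.
class ResplitSketch {

  // Hypothetical stand-in for an offset range: [from, until) on one partition.
  static final class Range {
    final int partition;
    final long from;
    final long until;

    Range(int partition, long from, long until) {
      this.partition = partition;
      this.from = from;
      this.until = until;
    }

    long size() {
      return until - from;
    }
  }

  // Splits each allocated range into equal-sized pieces. The number of pieces
  // per range is proportional to its share of the total (rounded half up,
  // at least 1), so the result has roughly minPartitions ranges.
  static List<Range> resplit(List<Range> allocated, int minPartitions) {
    long total = 0;
    for (Range r : allocated) {
      total += r.size();
    }
    // Nothing to split, or there are already enough ranges.
    if (total == 0 || minPartitions <= allocated.size()) {
      return new ArrayList<>(allocated);
    }
    List<Range> out = new ArrayList<>();
    for (Range r : allocated) {
      long rounded = (r.size() * minPartitions + total / 2) / total;
      // Never create more pieces than there are offsets in the range.
      int parts = (int) Math.max(1, Math.min(r.size(), rounded));
      for (int i = 0; i < parts; i++) {
        long start = r.from + r.size() * i / parts;
        long end = r.from + r.size() * (i + 1) / parts;
        out.add(new Range(r.partition, start, end));
      }
    }
    return out;
  }
}
```

   Under these assumptions, the input "0:0->100, 1:0->500" with minPartitions=3 
yields "0:0->100, 1:0->166, 1:166->333, 1:333->500": one range more than 
requested, but the pieces of partition 1 differ in size by at most one offset.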
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
