lokesh-lingarajan-0310 commented on code in PR #9473: URL: https://github.com/apache/hudi/pull/9473#discussion_r1302262807
########## hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java: ########## @@ -130,7 +130,7 @@ public static QueryInfo generateQueryInfo(JavaSparkContext jssc, String srcBaseP } }); - String previousInstantTime = beginInstantTime; + String previousInstantTime = DEFAULT_BEGIN_TIMESTAMP; if (!beginInstantTime.equals(DEFAULT_BEGIN_TIMESTAMP)) { Review Comment: The issue here is partial read of initial commit of new flows. Lets say if the initial commit is C1 containing k1, k2, k3 objects and if after the first commit if the sourcelimit allows it to read only k1, we will have the first checkpoint as C1#k1, following this on the second round of sync, when we come this api to calculate previous, begin and end instances, if we initialize previous to begin, we will have previous = C1, begin = C1, end = C5 (lets say we have had 4 commits after the first one). The code following - https://github.com/apache/hudi/blob/21e462cca3551eaf84e10442cf7abd25003b40a8/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java#L81 Will end up skipping C1 altogether, hence reading C1 partially. The fix is to default previous to start that way this case is handled and full read of C1 will happen. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org