lokesh-lingarajan-0310 commented on code in PR #9473:
URL: https://github.com/apache/hudi/pull/9473#discussion_r1302262807


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##########
@@ -130,7 +130,7 @@ public static QueryInfo generateQueryInfo(JavaSparkContext 
jssc, String srcBaseP
       }
     });
 
-    String previousInstantTime = beginInstantTime;
+    String previousInstantTime = DEFAULT_BEGIN_TIMESTAMP;
     if (!beginInstantTime.equals(DEFAULT_BEGIN_TIMESTAMP)) {

Review Comment:
   The issue here is partial read of initial commit of new flows. Lets say if 
the initial commit is C1 containing k1, k2, k3 objects and if after the first 
commit if the sourcelimit allows it to read only k1, we will have the first 
checkpoint as C1#k1, following this on the second round of sync, when we come 
this api to calculate previous, begin and end instances, if we initialize 
previous to begin, we will have previous = C1, begin = C1, end = C5 (lets say 
we have had 4 commits after the first one). The code following - 
https://github.com/apache/hudi/blob/21e462cca3551eaf84e10442cf7abd25003b40a8/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java#L81
   
   Will end up skipping C1 altogether, hence reading C1 partially. The fix is 
to default previous to start that way this case is handled and full read of C1 
will happen.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to