singhpk234 commented on issue #10156: URL: https://github.com/apache/iceberg/issues/10156#issuecomment-2189687924
Haven't been looking into this actively. A couple of questions:

> That's because it applies the stream-from-timestamp when in fact it should not look at it at all but instead rely on the checkpoint information (which I'm sure is good)

I see. So let's say there were snapshots S1, S2, S3, S4 committed at timestamps t1, t2, t3, t4 where t1 < t2 < t3 < t4. We consumed up to t2 and (hopefully) checkpointed up to t2. Now, when the stream is restarted, it starts from, say, S5 (which happens after S4) instead of resuming from S3? Is this understanding correct?

- How was the stream restarted? Was it killed?
- There is logic to overwrite `initialOffset` via the Spark interface (a sketch of this call pattern is below):
  https://github.com/apache/iceberg/blob/fc5b2b336c774b0b8b032f7d87a1fb21e76b3f20/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L184
  https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/streaming/SparkDataStream.java#L40

One thing that comes to mind: should we let the `initialOffset` be processed even when it's not passing our latest-offset check? But that's something the `planFiles` API would already be doing; maybe we need more log lines to see the call pattern.
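To make the expected restart behavior concrete, here is a minimal, self-contained sketch of the call pattern being assumed. All types below are hypothetical stand-ins, not Spark's or Iceberg's actual classes: the engine consults `initialOffset()` only on a brand-new query with no checkpoint, while on a restart it deserializes the checkpointed offset, so the stream-from-timestamp option should never override a checkpointed position.

```java
import java.util.Optional;

/**
 * Hypothetical sketch of micro-batch restart semantics: checkpoint wins over
 * the stream-from-timestamp option. Not Iceberg's actual implementation.
 */
public class MicroBatchRestartSketch {

  // Stand-in for a serialized stream position (e.g. a snapshot id).
  record Offset(long snapshotId) {
    String json() { return "{\"snapshotId\":" + snapshotId + "}"; }
    static Offset fromJson(String json) {
      return new Offset(Long.parseLong(json.replaceAll("\\D", "")));
    }
  }

  // Hypothetical source that knows the first snapshot at/after the
  // configured stream-from-timestamp.
  static class Source {
    private final long snapshotAfterStartTimestamp;

    Source(long snapshotAfterStartTimestamp) {
      this.snapshotAfterStartTimestamp = snapshotAfterStartTimestamp;
    }

    // Only consulted on a brand-new query (no checkpoint yet).
    Offset initialOffset() {
      return new Offset(snapshotAfterStartTimestamp);
    }

    Offset deserializeOffset(String json) {
      return Offset.fromJson(json);
    }
  }

  // Engine-side resolution of the start position.
  static Offset resolveStartOffset(Source source, Optional<String> checkpointedJson) {
    return checkpointedJson
        .map(source::deserializeOffset)    // restart: resume from the checkpoint
        .orElseGet(source::initialOffset); // fresh start: apply the timestamp option
  }

  public static void main(String[] args) {
    // Suppose the stream-from-timestamp option resolves to S5.
    Source source = new Source(5);

    // Fresh start: no checkpoint, so the timestamp option picks S5.
    System.out.println(resolveStartOffset(source, Optional.empty()));

    // Restart: the checkpoint recorded S2, so the next batch continues
    // after S2 (i.e. from S3), regardless of the timestamp option.
    System.out.println(resolveStartOffset(source, Optional.of(new Offset(2).json())));
  }
}
```

If the reported behavior is that a restarted query jumps to S5, that would suggest the timestamp option is being applied on the `orElseGet` path even though a checkpointed offset exists, which is what the proposed extra log lines around `initialOffset`/`deserializeOffset` would confirm.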