singhpk234 commented on issue #10156: URL: https://github.com/apache/iceberg/issues/10156#issuecomment-2189687924
Haven't been looking into this actively. A couple of questions:

> That's because it applies the stream-from-timestamp when in fact it should not look at it at all but instead rely on the checkpoint information (which I'm sure is good)

I see. So let's say there were snapshots S1, S2, S3, S4 committed at timestamps t1, t2, t3, t4 where t1 < t2 < t3 < t4. We consumed up to t2 and (hopefully) checkpointed up to t2. Now, when the stream is restarted, it starts from, say, S5 (which happens after S4) instead of resuming from S3? Is this understanding correct?

- How was the stream restarted? Was it killed?
- There is logic to overwrite `initialOffset` via the Spark interface (a sketch of this call pattern is below):
  https://github.com/apache/iceberg/blob/fc5b2b336c774b0b8b032f7d87a1fb21e76b3f20/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkMicroBatchStream.java#L184
  https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector/read/streaming/SparkDataStream.java#L40

One thing that comes to mind: should we let the `initialOffset` be processed even when it's not passing our latest-offset check? But that's something the `planFiles` API would already be doing; maybe we need more log lines to see the call pattern.
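To make the expected restart behavior concrete, here is a minimal, self-contained sketch of the call pattern being assumed. All types below are hypothetical stand-ins, not Spark's or Iceberg's actual classes: the engine consults `initialOffset()` only on a brand-new query with no checkpoint, while on a restart it deserializes the checkpointed offset, so the stream-from-timestamp option should never override a checkpointed position.

```java
import java.util.Optional;

/**
 * Hypothetical sketch of micro-batch restart semantics: checkpoint wins over
 * the stream-from-timestamp option. Not Iceberg's actual implementation.
 */
public class MicroBatchRestartSketch {

  // Stand-in for a serialized stream position (e.g. a snapshot id).
  record Offset(long snapshotId) {
    String json() { return "{\"snapshotId\":" + snapshotId + "}"; }
    static Offset fromJson(String json) {
      return new Offset(Long.parseLong(json.replaceAll("\\D", "")));
    }
  }

  // Hypothetical source that knows the first snapshot at/after the
  // configured stream-from-timestamp.
  static class Source {
    private final long snapshotAfterStartTimestamp;

    Source(long snapshotAfterStartTimestamp) {
      this.snapshotAfterStartTimestamp = snapshotAfterStartTimestamp;
    }

    // Only consulted on a brand-new query (no checkpoint yet).
    Offset initialOffset() {
      return new Offset(snapshotAfterStartTimestamp);
    }

    Offset deserializeOffset(String json) {
      return Offset.fromJson(json);
    }
  }

  // Engine-side resolution of the start position.
  static Offset resolveStartOffset(Source source, Optional<String> checkpointedJson) {
    return checkpointedJson
        .map(source::deserializeOffset)    // restart: resume from the checkpoint
        .orElseGet(source::initialOffset); // fresh start: apply the timestamp option
  }

  public static void main(String[] args) {
    // Suppose the stream-from-timestamp option resolves to S5.
    Source source = new Source(5);

    // Fresh start: no checkpoint, so the timestamp option picks S5.
    System.out.println(resolveStartOffset(source, Optional.empty()));

    // Restart: the checkpoint recorded S2, so the next batch continues
    // after S2 (i.e. from S3), regardless of the timestamp option.
    System.out.println(resolveStartOffset(source, Optional.of(new Offset(2).json())));
  }
}
```

If the reported behavior is that a restarted query jumps to S5, that would suggest the timestamp option is being applied on the `orElseGet` path even though a checkpointed offset exists, which is what the proposed extra log lines around `initialOffset`/`deserializeOffset` would confirm.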