bobby-richard commented on code in PR #10045:
URL: https://github.com/apache/pinot/pull/10045#discussion_r1062499804
##########
pinot-connectors/pinot-flink-connector/src/main/java/org/apache/pinot/connector/flink/sink/PinotSinkFunction.java:
##########
@@ -151,4 +148,18 @@ private void flush()
LOG.info("Pinot segment uploaded to {}", segmentURI);
});
}
+
+ @Override
+ public List<GenericRow> snapshotState(long checkpointId, long timestamp) {
Review Comment:
@snleee Let's say we are using a Kafka source to feed the Pinot sink, with
checkpointing enabled at a one-minute interval. Every minute when the checkpoint
is taken, the Kafka source persists its current offsets in the checkpoint.
If the job fails, it is restarted from the last checkpoint: the Kafka
source resumes from the last committed offsets, but the Pinot sink will
have lost all pending rows for the current unflushed segment up to that
point.
You can rewind data from the source if you are running in batch mode without
checkpointing enabled. But in a streaming job with a Kafka source, it's
impossible to match the Kafka offsets of the source with the last row of the
most recently flushed segment.
I'm not sure how the sink can support checkpointing without storing the
pending rows in state.
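
To make the failure mode concrete, here is a minimal, Flink-free sketch of the idea. All names (`Row`, `BufferingSink`, `snapshotState`, `restoreState`) are hypothetical stand-ins for illustration, not the actual `PinotSinkFunction` API: the sink buffers unflushed rows, hands a copy of them to the framework at checkpoint time, and re-populates the buffer on recovery, so rows consumed from Kafka before the failure are not silently dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Pinot's GenericRow, for illustration only.
class Row {
    final String value;
    Row(String value) { this.value = value; }
}

// Sketch of a sink that includes its unflushed rows in checkpoint state,
// mirroring what a snapshotState(checkpointId, timestamp) could return.
class BufferingSink {
    private final List<Row> pending = new ArrayList<>();

    void invoke(Row row) {
        pending.add(row); // buffered until the next segment flush
    }

    // On checkpoint: return a copy of the pending rows so they are
    // persisted alongside the source's committed offsets.
    List<Row> snapshotState() {
        return new ArrayList<>(pending);
    }

    // On recovery: rebuild the buffer from the last checkpoint, so rows
    // consumed before the failure survive the restart.
    void restoreState(List<Row> state) {
        pending.clear();
        pending.addAll(state);
    }

    int pendingCount() { return pending.size(); }
}

public class CheckpointDemo {
    public static void main(String[] args) {
        BufferingSink sink = new BufferingSink();
        sink.invoke(new Row("a"));
        sink.invoke(new Row("b"));

        // Checkpoint taken: source offsets AND pending rows are persisted.
        List<Row> checkpoint = sink.snapshotState();

        sink.invoke(new Row("c")); // arrives after the checkpoint; then the job fails

        // Recovery: a fresh sink restores the checkpointed rows; the source
        // replays from its committed offsets and re-delivers "c".
        BufferingSink recovered = new BufferingSink();
        recovered.restoreState(checkpoint);
        System.out.println(recovered.pendingCount()); // prints 2
    }
}
```

Without `snapshotState` returning the buffer, the recovered sink would start empty while the source still resumed from the checkpointed offsets, which is exactly the gap described above.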
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]