Cody Koeninger created SPARK-18258:
--------------------------------------

             Summary: Sinks need access to offset representation
                 Key: SPARK-18258
                 URL: https://issues.apache.org/jira/browse/SPARK-18258
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
            Reporter: Cody Koeninger

Transactional "exactly-once" semantics for output require storing an offset identifier in the same transaction as the results.

The Sink.addBatch method currently has access only to batchId and data, not to the actual offset representation.

I want to store the actual offsets, so that they are recoverable for as long as the results are, and so that I'm not locked into a particular streaming engine.

I could see this being accomplished by adding parameters to Sink.addBatch for the starting and ending offsets (either the offsets themselves, or the SPARK-17829 string/JSON representation). That would be an API change, but if there's another way to map batch ids to offset representations without changing the Sink API, that would work as well.

I'm assuming we don't need the same level of access to offsets throughout a job as, for example, the Kafka DStream gives, because Sinks are the main place that should need them.
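To make the request concrete, here is a minimal sketch of what a sink could do if it were handed the offsets. It is written in Python against an in-memory SQLite database rather than against Spark, and everything in it is an assumption for illustration: the `add_batch` signature with `start_offsets` and `end_offsets`, the table names, and the `"topic-0"` offset key are hypothetical and are not part of the current Sink API. It only shows the pattern of committing results and their offset range in one transaction, and skipping a batch that was already committed.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory store with a results table and an offsets table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (batch_id INTEGER, value TEXT)")
    conn.execute(
        "CREATE TABLE batch_offsets ("
        "batch_id INTEGER PRIMARY KEY, start_offsets TEXT, end_offsets TEXT)"
    )
    return conn


def add_batch(conn, batch_id, start_offsets, end_offsets, rows):
    """Hypothetical sink method: write rows and the batch's offset range atomically.

    Returns True if the batch was written, False if it was already committed.
    """
    # "with conn" commits on success and rolls back if an exception is raised,
    # so results and offsets are stored together or not at all.
    with conn:
        already = conn.execute(
            "SELECT 1 FROM batch_offsets WHERE batch_id = ?", (batch_id,)
        ).fetchone()
        if already is not None:
            return False  # replayed batch: skip so results are not duplicated
        conn.executemany(
            "INSERT INTO results (batch_id, value) VALUES (?, ?)",
            [(batch_id, value) for value in rows],
        )
        # Offsets are stored as JSON strings, so they stay readable
        # without the streaming engine that produced them.
        conn.execute(
            "INSERT INTO batch_offsets (batch_id, start_offsets, end_offsets) "
            "VALUES (?, ?, ?)",
            (batch_id, json.dumps(start_offsets), json.dumps(end_offsets)),
        )
    return True


def last_committed_offsets(conn):
    """Return the end offsets of the highest committed batch, or None."""
    row = conn.execute(
        "SELECT end_offsets FROM batch_offsets ORDER BY batch_id DESC LIMIT 1"
    ).fetchone()
    return None if row is None else json.loads(row[0])


if __name__ == "__main__":
    store = open_store()
    add_batch(store, 0, {"topic-0": 0}, {"topic-0": 3}, ["a", "b", "c"])
    add_batch(store, 0, {"topic-0": 0}, {"topic-0": 3}, ["a", "b", "c"])  # replay
    print(last_committed_offsets(store))
```

Because the offsets live next to the results, a different consumer could in principle resume from `last_committed_offsets` without knowing anything about batch ids, which is the portability I'm after.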