Cody Koeninger created SPARK-18258:
--------------------------------------

             Summary: Sinks need access to offset representation
                 Key: SPARK-18258
                 URL: https://issues.apache.org/jira/browse/SPARK-18258
             Project: Spark
          Issue Type: Improvement
          Components: Structured Streaming
            Reporter: Cody Koeninger


Transactional "exactly-once" semantics for output require storing an offset 
identifier in the same transaction as results.
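
As a rough illustration of that requirement (not Spark code), the sketch below
uses Python's sqlite3 module as a stand-in for any transactional store. The
table names, the write_batch helper, and the convention that a partition with
no stored row starts at offset 0 are all assumptions made for the example.
Results and offsets are written in one transaction, so a replayed batch is
detected and skipped, and a failed write leaves neither results nor offsets
behind.

```python
import sqlite3


def open_store():
    # In-memory database standing in for any transactional result store.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (word TEXT, count INTEGER)")
    conn.execute(
        "CREATE TABLE committed_offsets ("
        "part INTEGER PRIMARY KEY, until_offset INTEGER)"
    )
    conn.commit()
    return conn


def write_batch(conn, rows, offset_ranges):
    """Write rows and offsets atomically.

    offset_ranges maps partition -> (from_offset, until_offset).
    Returns False, writing nothing, if the batch does not start where the
    stored offsets end (for example, because it was already applied).
    """
    with conn:  # commits on success, rolls back if an exception escapes
        stored = dict(
            conn.execute("SELECT part, until_offset FROM committed_offsets")
        )
        # Assumption for this sketch: an unseen partition starts at offset 0.
        for part, (from_offset, _) in offset_ranges.items():
            if stored.get(part, 0) != from_offset:
                return False
        conn.executemany("INSERT INTO results VALUES (?, ?)", rows)
        conn.executemany(
            "INSERT OR REPLACE INTO committed_offsets VALUES (?, ?)",
            [(part, until) for part, (_, until) in offset_ranges.items()],
        )
    return True
```

Because the offsets live next to the results, recovery only needs to read
committed_offsets to know where to resume, independent of whichever engine
produced the batch.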

The Sink.addBatch method currently receives only batchId and data, not the 
actual offset representation.

I want to store the actual offsets, so that they are recoverable for as long as
the results are, and so that I'm not locked into a particular streaming engine.

I could see this being accomplished by adding parameters to Sink.addBatch for 
the starting and ending offsets (either the offsets themselves, or the 
SPARK-17829 string/JSON representation). That would be an API change, but if 
there's another way to map batch ids to offset representations without changing 
the Sink API, that would work as well.
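
The two options can be sketched side by side. This is a language-neutral
illustration in Python, not Spark's Sink trait: the class names, the add_batch
signatures, and the JSON shape used for offsets are all hypothetical.
RecordingSink shows the "extra parameters" variant, while BatchOffsetLookup
keeps the two-argument call and resolves offsets from a batch id mapping.

```python
import json


class RecordingSink:
    """Hypothetical sink whose add_batch also receives offset strings."""

    def __init__(self):
        self.batches = []

    def add_batch(self, batch_id, start_offset, end_offset, data):
        # Offsets are opaque strings here; start_offset is None when no
        # earlier batch is known.
        self.batches.append((batch_id, start_offset, end_offset, list(data)))


class BatchOffsetLookup:
    """Hypothetical adapter: keeps add_batch(batch_id, data) unchanged and
    maps batch ids to offset strings on the side."""

    def __init__(self, sink):
        self._sink = sink
        self._end_offsets = {}

    def record(self, batch_id, end_offset):
        # Remember the ending offset string for a batch id.
        self._end_offsets[batch_id] = end_offset

    def add_batch(self, batch_id, data):
        # The previous batch's end is treated as this batch's start.
        start = self._end_offsets.get(batch_id - 1)
        end = self._end_offsets[batch_id]
        self._sink.add_batch(batch_id, start, end, data)


sink = RecordingSink()
lookup = BatchOffsetLookup(sink)
# Illustrative JSON only; the real representation is whatever the source emits.
lookup.record(0, json.dumps({"topic-a": {"0": 10}}))
lookup.record(1, json.dumps({"topic-a": {"0": 25}}))
lookup.add_batch(0, ["x"])
lookup.add_batch(1, ["y", "z"])
```

Either way the sink ends up with a batch id plus a start and end offset string
that it can persist alongside its results.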

I'm assuming we don't need the same level of access to offsets throughout a job 
as, e.g., the Kafka DStream gives, because Sinks are the main place that should 
need them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
