Hemant Bhanawat created SPARK-24144:
---------------------------------------

             Summary: monotonically_increasing_id on streaming dataFrames
                 Key: SPARK-24144
                 URL: https://issues.apache.org/jira/browse/SPARK-24144
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Hemant Bhanawat


For our use case, we want to assign snapshot ids (incrementing counters) to the 
incoming records. In case of a failure, the same record should get the same id 
after recovery so that the downstream DB can handle the records correctly. 

We tried to do this by zipping the streaming RDDs with that counter using a 
modified version of ZippedWithIndexRDD. There are other ways to do this, but 
all of them turn out to be cumbersome and error-prone in failure scenarios.

As suggested on the Spark user/dev list, one way to do this would be to support 
monotonically_increasing_id on streaming DataFrames in the Spark code base. This 
would ensure that the counter keeps increasing across the records of the stream. 
Also, since the counter can be checkpointed, it would work well in failure 
scenarios. Last but not least, doing this inside Spark would be the most 
performant approach.
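A minimal sketch of what using the requested feature could look like on a 
streaming DataFrame, using the built-in rate source and console sink purely for 
demonstration. The checkpoint-backed, ever-increasing id semantics described in 
the comments are the proposal, not current behavior; depending on the Spark 
version this may run with weaker guarantees.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

object StreamingSnapshotIdSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-snapshot-id")
      .master("local[2]")
      .getOrCreate()

    // Rate source is only a convenient built-in test source.
    val input = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Requested semantics: ids keep increasing across micro-batches and are
    // reproduced for the same records after recovery, because the counter
    // state would live in the query checkpoint.
    val withIds = input.withColumn("snapshot_id", monotonically_increasing_id())

    val query = withIds.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/snapshot-id-checkpoint")
      .start()

    query.awaitTermination()
  }
}
{code}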

 


