[ https://issues.apache.org/jira/browse/SPARK-24144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hemant Bhanawat updated SPARK-24144:
------------------------------------
    Priority: Major  (was: Minor)

> monotonically_increasing_id on streaming dataFrames
> ---------------------------------------------------
>
>                 Key: SPARK-24144
>                 URL: https://issues.apache.org/jira/browse/SPARK-24144
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.3.0
>            Reporter: Hemant Bhanawat
>            Priority: Major
>
> For our use case, we want to assign snapshot ids (incrementing counters) to
> the incoming records. If a failure occurs, a replayed record should get the
> same id it had before the failure, so that the downstream DB can handle the
> records correctly.
> We tried to do this by zipping the streaming RDDs with that counter using a
> modified version of ZippedWithIndexRDD. There are other ways to do it, but
> all of them turn out to be cumbersome and error-prone in failure scenarios.
> As suggested on the Spark user dev list, one way to do this would be to
> support monotonically_increasing_id on streaming DataFrames in the Spark code
> base. This would ensure that the counters keep incrementing across the
> records of the stream. Also, since the counter can be checkpointed, it would
> work well in failure scenarios. Last but not least, doing this in Spark
> would be the most efficient approach.
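
The requirement described above (ids that keep incrementing, and that come out the same when a batch is replayed after a failure) could be sketched roughly as follows. This is a plain-Python illustration, not Spark code and not the proposed implementation: `CheckpointedCounter`, `assign`, and the integer batch ids are hypothetical names, the in-memory dict only stands in for durable checkpoint state, and the sketch assumes a replayed batch contains the same records in the same order.

```python
from typing import Dict, List, Tuple


class CheckpointedCounter:
    """Hypothetical stand-in for a checkpointed id counter (not a Spark API)."""

    def __init__(self) -> None:
        # batch_id -> first id assigned to that batch.
        # In a real system this mapping would live in durable checkpoint storage.
        self._start_by_batch: Dict[int, int] = {}
        self._next_id = 0

    def assign(self, batch_id: int, records: List[str]) -> List[Tuple[int, str]]:
        # A new batch reserves the next free range of ids and records its start.
        # A replayed batch finds its recorded start and reuses it.
        if batch_id not in self._start_by_batch:
            self._start_by_batch[batch_id] = self._next_id
            self._next_id += len(records)
        start = self._start_by_batch[batch_id]
        # Pair each record with its position-based id, like zipWithIndex plus an offset.
        return [(start + i, rec) for i, rec in enumerate(records)]


counter = CheckpointedCounter()
first = counter.assign(0, ["a", "b", "c"])
second = counter.assign(1, ["d", "e"])
# Simulate a failure: batch 0 is processed again and must get the same ids.
replayed = counter.assign(0, ["a", "b", "c"])
```

Here `replayed` equals `first`, while `second` continues from where batch 0 left off. Recording the start offset per batch, rather than a single running counter, is what makes replay deterministic in this sketch.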