Hemant Bhanawat created SPARK-24144:
---------------------------------------
             Summary: monotonically_increasing_id on streaming DataFrames
                 Key: SPARK-24144
                 URL: https://issues.apache.org/jira/browse/SPARK-24144
             Project: Spark
          Issue Type: New Feature
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: Hemant Bhanawat

For our use case, we want to assign snapshot ids (incrementing counters) to the incoming records. In case of failure, the same record should get the same id after recovery so that the downstream DB can handle the records correctly.

We have been trying to do this by zipping the streaming RDDs with that counter using a modified version of ZippedWithIndexRDD. There are other ways to do it, but all of them turn out to be cumbersome and error prone in failure scenarios.

As suggested on the Spark user/dev list, one way to do this would be to support monotonically_increasing_id on streaming DataFrames in the Spark code base. This would ensure that the counter keeps incrementing across the records of the stream. Also, since the counter state could be checkpointed, it would behave correctly in failure scenarios. Last but not least, doing this inside Spark would be the most performant option.
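
A minimal sketch, in Scala, of how the requested feature might be used from a streaming query, assuming monotonically_increasing_id (or a checkpoint-aware equivalent) were supported for this purpose on streaming DataFrames. The rate source, console sink, and checkpoint path below are illustrative placeholders, not part of the proposal.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

object StreamingSnapshotIds {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-snapshot-ids")
      .getOrCreate()

    // Illustrative streaming source; any streaming DataFrame would do.
    val records = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()

    // Desired usage: tag every incoming record with an incrementing counter.
    // The ask in this issue is for the counter state to be checkpointed so
    // that the same record gets the same id after a failure and restart.
    val withSnapshotId = records
      .withColumn("snapshot_id", monotonically_increasing_id())

    val query = withSnapshotId.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/snapshot-id-checkpoint") // placeholder path
      .start()

    query.awaitTermination()
  }
}
{code}

The key difference from today's behavior would be that replayed batches reproduce the same ids, since the generated values would be derived from checkpointed state rather than recomputed per task on each attempt.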