We would like to maintain arbitrary state between iterations of Structured Streaming in Python.
How can we maintain a static DataFrame that acts as state between iterations? Several options that may be relevant:
1. Spark memory (distributed across the workers)
2. an external store (e.g. Elasticsearch / Redis)
3. any other state-maintenance mechanism that works with PySpark

Specifically: in iteration N we get a streaming DataFrame from Kafka and apply a computation that produces a label column over the window of samples from the last hour. We want to keep the labels and sample ids around for the next iteration (N+1), where we join them with the new sample window so that samples which already existed in iteration N inherit their labels.
--
Regards,
Ofer Eliassaf
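P.S. To make the intended hand-off concrete, here is a minimal pure-Python sketch (no Spark, stdlib only) of the per-iteration logic we want: the state from iteration N is a mapping of sample id to label; in iteration N+1 the new window is joined against it so existing samples inherit their labels and only unseen samples are labeled fresh. The names `process_window` and `compute_labels` are illustrative, not real APIs.

```python
def process_window(window_ids, prev_state, compute_labels):
    """Join the new window against the previous state.

    window_ids:     sample ids seen in the current (last-hour) window
    prev_state:     dict {sample_id: label} carried over from iteration N
    compute_labels: function that labels only the ids not seen before
    Returns the state to carry into iteration N+1.
    """
    # Samples that existed in iteration N inherit their old labels.
    inherited = {sid: prev_state[sid] for sid in window_ids if sid in prev_state}
    # Only genuinely new samples go through the labeling computation.
    new_ids = [sid for sid in window_ids if sid not in prev_state]
    fresh = compute_labels(new_ids)
    # Ids that dropped out of the window age out of the state automatically.
    return {**inherited, **fresh}

# Iteration N: empty state, both samples are labeled fresh.
state = process_window(["a", "b"], {}, lambda ids: {i: f"label_{i}" for i in ids})
# Iteration N+1: "b" inherits its label, "c" is new, "a" ages out.
state = process_window(["b", "c"], state, lambda ids: {i: "fresh" for i in ids})
```

In PySpark terms, `prev_state` would be the static DataFrame (or external store) and the dict comprehension would be a join on the sample-id column; the sketch only pins down the inherit-vs-fresh semantics.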