Hi, I would like to get your advice on my use case. I have a few Spark Streaming applications where I need to keep updating a DataFrame after each batch. Each batch typically affects only a small fraction of the DataFrame (around 5k out of 200k records).
The options I have been considering so far:
1) keep the DataFrame on the driver, and update it after each batch;
2) keep the DataFrame distributed, and use checkpointing to mitigate lineage growth.

I solved previous use cases with option 2, but I am not sure it is optimal, since checkpointing is relatively expensive. I also wondered about HBase or some other fast-access in-memory store, however it is currently not in my stack.

Curious to hear your thoughts,
Andras
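For concreteness, the per-batch update I have in mind is essentially an upsert by key: the new batch overwrites the small set of matching records and the rest of the state carries over. Below is a minimal sketch of that merge logic using plain Python dicts as a stand-in for the DataFrame (all names here are illustrative, not from the Spark API; in Spark this would be a keyed join/coalesce, with a periodic checkpoint to truncate the lineage):

```python
def apply_batch(state, batch):
    """Merge one micro-batch of changed records into the full state.

    state: dict mapping record key -> value (stand-in for the ~200k-row DataFrame)
    batch: dict of the ~5k records changed in this micro-batch
    Returns a new dict, mirroring the immutability of a DataFrame.
    """
    merged = dict(state)   # copy the previous state
    merged.update(batch)   # batch values win for overlapping keys
    return merged

# Example: the batch updates keys 2 and 3 and introduces key 6
state = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
batch = {2: "B", 3: "C", 6: "F"}
new_state = apply_batch(state, batch)
```

With option 2 this merge runs as a distributed join each batch, and the checkpoint (say, every N batches) is what keeps the query plan from growing without bound.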