Hi, I would like to get your advice on my use case. I have a few Spark Streaming applications where I need to keep updating a DataFrame after each batch. Each batch typically affects only a small fraction of the DataFrame (around 5k out of 200k records).
The options I have been considering so far:
1) keep the DataFrame on the driver, and update it after each batch;
2) keep the DataFrame distributed, and use checkpointing to mitigate lineage growth.

I solved previous use cases with option 2, but I am not sure it is optimal, since checkpointing is relatively expensive. I also wondered about HBase or some other fast-access in-memory store, however it is currently not in my stack.

Curious to hear your thoughts,
Andras
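For concreteness, the per-batch update I have in mind is essentially an upsert by key: the new batch overwrites the small set of matching records and the rest of the state carries over. Below is a minimal sketch of that merge logic using plain Python dicts as a stand-in for the DataFrame (all names here are illustrative, not from the Spark API; in Spark this would be a keyed join/coalesce, with a periodic checkpoint to truncate the lineage):

```python
def apply_batch(state, batch):
    """Merge one micro-batch of changed records into the full state.

    state: dict mapping record key -> value (stand-in for the ~200k-row DataFrame)
    batch: dict of the ~5k records changed in this micro-batch
    Returns a new dict, mirroring the immutability of a DataFrame.
    """
    merged = dict(state)   # copy the previous state
    merged.update(batch)   # batch values win for overlapping keys
    return merged

# Example: the batch updates keys 2 and 3 and introduces key 6
state = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
batch = {2: "B", 3: "C", 6: "F"}
new_state = apply_batch(state, batch)
```

With option 2 this merge runs as a distributed join each batch, and the checkpoint (say, every N batches) is what keeps the query plan from growing without bound.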