Thanks all for the feedback so far.
I haven't decided which external storage to use yet.
HBase is cool, but it requires Hadoop in production. I only have 3-4 servers
for the whole thing (I am thinking of a relational database for this, such as
MariaDB, MemSQL, or MySQL), but those are hard to scale.
I will try various approaches before making any decision.

In addition, when using Spark Streaming, is there any way to write only new
data to external storage after calling updateStateByKey?
The foreachRDD function seems to loop over all RDDs (including ones that
haven't changed). I believe Spark Streaming must have a way to do this, but I
still couldn't find an example that does something similar.
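
One idea I am considering, as a rough sketch only: keep a "changed" flag inside the state itself, set it in the update function whenever a key receives new values in the current batch, and filter on that flag before writing. The code below is plain Python so it can run without a cluster; the names update_with_flag, changed_only, simulate_batch, and save_partition are hypothetical, and the running-sum state is just an assumption for illustration. The PySpark wiring is shown only in comments.

```python
from typing import Dict, List, Optional, Tuple

# State per key: (running_total, changed_in_this_batch)
State = Tuple[int, bool]


def update_with_flag(new_values: List[int], state: Optional[State]) -> State:
    """Update function in the (new_values, old_state) shape that
    updateStateByKey expects. The flag records whether the key
    received any values in the current batch."""
    previous_total = state[0] if state is not None else 0
    if new_values:
        return (previous_total + sum(new_values), True)
    # No new data for this key: keep the total, clear the flag.
    return (previous_total, False)


def changed_only(record: Tuple[str, State]) -> bool:
    """Filter predicate: keep only keys updated in the current batch."""
    _key, (_total, changed) = record
    return changed


def simulate_batch(state: Dict[str, State],
                   batch: Dict[str, List[int]]) -> Dict[str, State]:
    """Local stand-in for one micro-batch: every known key is visited,
    as updateStateByKey does, whether or not it has new values."""
    keys = set(state) | set(batch)
    return {k: update_with_flag(batch.get(k, []), state.get(k)) for k in keys}


# Two micro-batches: "b" gets no new data in the second one.
state: Dict[str, State] = {}
state = simulate_batch(state, {"a": [1, 2], "b": [5]})
state = simulate_batch(state, {"a": [10]})

# Only the records flagged as changed would be written out.
to_write = sorted(kv for kv in state.items() if changed_only(kv))
print(to_write)

# Assumed PySpark wiring (pairs is a DStream of (key, int) and
# save_partition is a hypothetical writer for the chosen database):
#   state_stream = pairs.updateStateByKey(update_with_flag)
#   state_stream.filter(changed_only).foreachRDD(
#       lambda rdd: rdd.foreachPartition(save_partition))
```

With this approach the full state would still be carried from batch to batch, but only the flagged keys would reach the database. Note that updateStateByKey needs checkpointing enabled on the streaming context.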
