Re: DB: I strongly encourage you to look at Cassandra. It's almost as powerful as HBase and a lot easier to set up and manage. It is well suited to this type of use case, which combines a K/V store with time series data.

For the second question, I've used this pattern all the time for "flash messages", i.e. passing info downstream as a one-time message:

* In your updateStateByKey function, emit a tuple of (actualNewState, changedData)
* Then filter this on !changedData.isEmpty or something similar
* And only do foreachRDD on the filtered stream.

Makes sense?

-adrian

From: Thúy Hằng Lê
Date: Friday, September 25, 2015 at 10:31 AM
To: ALEX K
Cc: "user@spark.apache.org"
Subject: Re: Using Spark for portfolio manager app

Thanks all for the feedback so far. I haven't decided which external storage will be used yet. HBase is cool but it requires Hadoop in production. I only have 3-4 servers for the whole thing (I am thinking of a relational database for this; it could be MariaDB, MemSQL or MySQL), but they are hard to scale. I will try various approaches before making any decision.

In addition, using Spark Streaming, is there any way to write only the new data to external storage after using updateStateByKey? The foreachRDD function seems to loop over all RDDs (including ones that haven't changed). I believe Spark Streaming must have a way to do it, but I still couldn't find an example doing a similar job.
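
Since the question asks for an example, here is a minimal sketch of the tuple-and-filter pattern from the reply at the top of this message. It is written as plain Python functions in the shape PySpark's updateStateByKey expects (new values for the batch, previous state or None). The running-total state, the names update_position and has_changes, the trades stream and the save_partition writer are all assumptions made for illustration; none of them come from the thread.

```python
from typing import List, Optional, Tuple

# Per-key state: (running_total, values_that_arrived_in_this_batch).
# The second element plays the role of "changedData" in the reply.
State = Tuple[float, List[float]]


def update_position(new_values: List[float], last: Optional[State]) -> State:
    """Update function in the shape updateStateByKey expects.

    It is called for every key that has state, even in batches where no new
    values arrived, so the changed part is rebuilt (and possibly empty) on
    every call.
    """
    total = last[0] if last is not None else 0.0
    return (total + sum(new_values), list(new_values))


def has_changes(kv) -> bool:
    """Keep only keys whose state actually changed in this batch."""
    _key, (_state, changed) = kv
    return len(changed) > 0


# Hypothetical wiring, assuming `trades` is a DStream of (symbol, amount)
# pairs, checkpointing is enabled, and `save_partition` writes to your store:
#   states = trades.updateStateByKey(update_position)
#   changed = states.filter(has_changes)
#   changed.foreachRDD(lambda rdd: rdd.foreachPartition(save_partition))
```

The full state stream is still available for anything that needs every key; only the filtered stream is sent to external storage, so unchanged keys are not rewritten on each batch.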