Hi. Suppose I have a stream of logs and I want to count them by minute. The desired result looks like:
2014-10-26 18:38:00  100
2014-10-26 18:39:00  150
2014-10-26 18:40:00  200

One way to do this is to set the batch interval to 1 minute, but then each batch would be quite large. Alternatively, I can use updateStateByKey, where the key is a minute like '2014-10-26 18:38:00', but then I have two questions:

1. How do I persist the result to MySQL? Do I need to flush the state every batch?

2. How do I delete old state? For example, it is now 18:50, but the state for 18:40 is still in Spark. One solution is to set a key's state to None when a batch contains no data for that key. But what if traffic is light and some batches receive zero logs? For instance:

  18:40:00~18:40:10 has 10 logs -> key 18:40's value is set to 10
  18:40:10~18:40:20 has no logs -> key 18:40 is deleted
  18:40:20~18:40:30 has 5 logs  -> key 18:40's value is set to 5

You can see the result is wrong. Maybe I could take an 'update' approach when flushing, i.e. check whether MySQL already has a row for 18:40 and add the new count to it. But what about a unique count? I can't store every unique value in MySQL.

So I'm looking for a better way to store a count-by-minute result in an RDBMS (or NoSQL?). Any ideas would be appreciated. Thanks.

-- Jerry

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
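To make the updateStateByKey idea concrete, here is a minimal plain-Python sketch of the per-key counting logic. All names (update_counts, grace) are hypothetical, and the dict stands in for Spark's per-key state. It accumulates counts instead of overwriting them, and it expires a key only once its minute is older than a grace period, never just because one micro-batch happened to contain no data for it — which is the premature-deletion problem described above.

```python
from datetime import datetime, timedelta

def update_counts(state, batch, now, grace=timedelta(minutes=2)):
    """Sketch of per-minute counting with state expiry.

    state: dict mapping minute string -> running count (stands in for
           Spark's keyed state)
    batch: dict mapping minute string -> count of logs seen in this
           micro-batch
    now:   current time; keys older than now - grace are expired
    """
    # Accumulate rather than overwrite, so a batch with no data for a
    # key does not lose the earlier counts for that minute.
    for minute, n in batch.items():
        state[minute] = state.get(minute, 0) + n

    # Expire only sufficiently old minutes (in Spark, this corresponds
    # to returning None from the update function for those keys).
    cutoff = now - grace
    expired = [m for m in state
               if datetime.strptime(m, "%Y-%m-%d %H:%M:%S") < cutoff]
    for m in expired:
        del state[m]
    return state
```

Replaying the 18:40 scenario above (batches of 10, then 0, then 5 logs) leaves the 18:40 key at 15 rather than 5, because an empty batch no longer deletes the key.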
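The 'update' flush approach mentioned above maps naturally onto MySQL's INSERT ... ON DUPLICATE KEY UPDATE, assuming a table whose minute column is a UNIQUE key. In this sketch a plain dict stands in for that table (flush_batch and the table name counts are hypothetical), so the additive behaviour is visible without a database:

```python
def flush_batch(table, batch_counts):
    """Flush one batch of per-minute counts, adding to any existing row.

    `table` is a dict standing in for a MySQL table with the minute as a
    UNIQUE key; in real code each item would be executed as roughly:
        INSERT INTO counts (minute, cnt) VALUES (%s, %s)
        ON DUPLICATE KEY UPDATE cnt = cnt + VALUES(cnt)
    """
    for minute, n in batch_counts.items():
        table[minute] = table.get(minute, 0) + n
    return table
```

Because the flush adds to the existing row instead of replacing it, flushing 10 logs and later 5 more for the same minute yields 15, even if the in-memory state for that minute was dropped in between. (This additive trick does not carry over to unique counts, as the original question notes, since a distinct count cannot be summed across flushes.)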