Hi,

I'm new to Spark and need some help framing the problem I have. A problem well stated is half solved, as the saying goes :)

Let's say that I have a DStream[String] basically containing JSON with measurements from IoT devices. To keep it simple, say that after unmarshalling I have data like:

case class Measurement(deviceId: Long, timestamp: Date, measurement: Double)
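
In case it helps, this is more or less how I unmarshal the stream today (just a sketch, assuming json4s and ISO-style timestamps; "jsonStream" is a placeholder name for the raw DStream[String] coming from the receiver, and malformed records are silently dropped):

import java.util.Date
import scala.util.Try
import org.apache.spark.streaming.dstream.DStream
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

def toMeasurements(jsonStream: DStream[String]): DStream[Measurement] =
  jsonStream.flatMap { json =>
    implicit val formats: DefaultFormats.type = DefaultFormats
    // Drop records that fail to parse instead of failing the whole batch
    Try(parse(json).extract[Measurement]).toOption
  }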

I need to use DStreams because there is some interest in monitoring the device measurements in real time. So let's say I have a dashboard with hourly granularity: past hours are consolidated, but the current hour must be kept updated on every input.

The problem is that the time that matters is the timestamp inside the JSON, not the time at which Spark receives the event, so I think I have to keep a stateful DStream like the one described in http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/ . I have two questions here:

1. Once a given hour is gone, I'd like to flush the consolidated stream
   into a DB. I think the strategy is to have a stream of key-value pairs
   where the key is (deviceId, truncateByHour(timestamp)) and to reduce
   over the values (there is a sketch of what I mean right after this
   list). But it seems to me that with this approach Spark loses any
   sense of time: how would you flush all the RDDs with timestamps
   between 00:00:00 and 00:59:59, for instance? Maybe I'm missing some
   function in the API.
2. How do you deal with events that arrive with timestamps in the past?
   Is it a matter of ignoring them, or of finding a trade-off between
   memory and how long the stateful DStream keeps its state? But then,
   what is responsible for popping the mature time slices out of the
   stateful stream?
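
To make question 1 more concrete, this is roughly what I have in mind, continuing from the sketch above (only a sketch, not working code: "measurements" is the unmarshalled DStream[Measurement], the aggregate is a plain sum, the DB write is left out, and updateStateByKey would also need a checkpoint directory):

// Truncate an event timestamp to its hour bucket (epoch millis)
def truncateByHour(ts: Date): Long =
  ts.getTime - (ts.getTime % (60 * 60 * 1000L))

// Key by (deviceId, hour bucket of the *event* timestamp), value = measurement
val keyed: DStream[((Long, Long), Double)] =
  measurements.map(m => ((m.deviceId, truncateByHour(m.timestamp)), m.measurement))

// Running aggregate per device and hour, carried across batches
val hourly: DStream[((Long, Long), Double)] =
  keyed.updateStateByKey[Double] { (values: Seq[Double], state: Option[Double]) =>
    Some(state.getOrElse(0.0) + values.sum)
  }

// "Flushing": on every batch, pick the buckets whose hour is already over
// (by wall clock) and write them to the DB
hourly.foreachRDD { rdd =>
  val currentHour = truncateByHour(new Date())
  val mature = rdd.filter { case ((_, hour), _) => hour < currentHour }
  // mature.foreachPartition(...)   // write to the DB here
}

// Note: with this basic update function the state never shrinks; to actually
// evict old hour buckets I'd need to return None for them (e.g. via the
// key-aware overload of updateStateByKey), which is exactly what question 2
// is about.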

To me Spark Streaming seems the most natural way to approach this problem, but maybe a simple Spark batch job run every minute could more easily keep the events ordered by their own timestamps.
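
The batch variant would be something like this (again only a sketch, assuming the raw events are persisted somewhere and re-read as an RDD[Measurement] that I'm calling measurementsRdd):

val hourlyBatch = measurementsRdd
  .map(m => ((m.deviceId, truncateByHour(m.timestamp)), m.measurement))
  .reduceByKey(_ + _)
// Run every minute by an external scheduler; grouping is by event time, so
// late events simply fall into their original hour bucket.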

I'd like to hear your thoughts.
Toni
