Hi,
I'm new to Spark and in need of some help framing the problem I have. A
problem well stated is half solved, as the saying goes :)
Let's say that I have a DStream[String] containing JSON measurements from
IoT devices. To keep it simple, say that after unmarshalling I have data
like:
case class Measurement(deviceId: Long, timestamp: Date, measurement: Double)
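To make the unmarshalling step concrete, here is a minimal sketch of turning the raw strings into Measurement values. The regex-based `parse` is only a placeholder (a real job would use a JSON library such as json4s or circe inside a `DStream.flatMap`, so malformed records are dropped rather than crashing a batch); the field names match the case class above, everything else is an assumption.

```scala
import java.util.Date

case class Measurement(deviceId: Long, timestamp: Date, measurement: Double)

// Naive extraction just to illustrate the shape of the data. Returning
// Option lets a flatMap silently skip records that don't parse.
def parse(json: String): Option[Measurement] = {
  val re = """"deviceId"\s*:\s*(\d+).*?"timestamp"\s*:\s*(\d+).*?"measurement"\s*:\s*([\d.]+)""".r
  re.findFirstMatchIn(json).map { m =>
    Measurement(m.group(1).toLong, new Date(m.group(2).toLong), m.group(3).toDouble)
  }
}
```

In a streaming job this would be something like `lines.flatMap(parse)` to get a `DStream[Measurement]`.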
I need to use DStreams because there is some interest in monitoring the
devices' measurements in real time. So let's say that I have a dashboard
with hourly granularity: past hours are consolidated, but the current hour
must be kept updated on every input.
The problem is that the time that matters is the timestamp inside the JSON
(event time), not the time at which Spark receives the record, so I think I
have to keep a stateful DStream like the one described in
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
. I have two questions here:
1. Once a given hour has passed, I'd like to flush the consolidated stream
   into a DB. I think the strategy is to have a stream of key-value pairs
   where the key is (deviceId, truncateByHour(timestamp)) and to reduce
   over the values. But it seems to me that with this approach Spark
   loses any sense of time: how would you flush all the RDDs with
   timestamps between 00:00:00 and 00:59:59, for instance? Maybe I'm
   missing some function in the API.
2. How do you deal with events that arrive with timestamps in the past?
   Is it a matter of ignoring them, or of finding a trade-off between
   memory and how long the stateful DStream is kept? But then, who is
   responsible for popping the mature time slices from the stateful stream?
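To make the keying idea in question 1 concrete, here is the same aggregation expressed on a plain Scala collection (no Spark, just for illustration). `truncateByHour` is a hypothetical helper; in Spark Streaming the equivalent would be `measurements.map(...).reduceByKey(_ + _)` on a pair DStream, with `updateStateByKey` carrying partial sums across batches.

```scala
import java.util.Date

case class Measurement(deviceId: Long, timestamp: Date, measurement: Double)

// Truncate an event timestamp to its hour bucket (epoch millis at :00:00).
def truncateByHour(ts: Date): Long = ts.getTime / 3600000L * 3600000L

// Sum measurements per (deviceId, hourBucket) key, mirroring what a
// reduceByKey over a pair DStream would compute within one batch.
def consolidate(ms: Seq[Measurement]): Map[(Long, Long), Double] =
  ms.groupBy(m => (m.deviceId, truncateByHour(m.timestamp)))
    .map { case (k, vs) => k -> vs.map(_.measurement).sum }
```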
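One way to answer both "who flushes" and "how late is too late" at once is a watermark-style policy: track the maximum event time seen, treat any hour bucket that ends more than some allowed lateness before it as mature, emit it (e.g. write to the DB), and drop it from state; events for already-closed buckets are ignored. The sketch below is pure Scala with all names and the lateness policy being assumptions, not Spark API; in Spark Streaming the same update logic could live inside `updateStateByKey`.

```scala
// Per-(deviceId, hourBucket) running sums plus the max event time seen.
case class State(sums: Map[(Long, Long), Double], maxEventTime: Long)

val hourMs = 3600000L

// Feed one event through the state; returns the new state and the map of
// mature buckets that were flushed on this step (empty most of the time).
def step(st: State, ev: (Long, Long, Double), allowedLateness: Long)
    : (State, Map[(Long, Long), Double]) = {
  val (deviceId, eventTime, value) = ev
  val maxT = math.max(st.maxEventTime, eventTime)
  val watermark = maxT - allowedLateness
  val bucket = eventTime / hourMs * hourMs
  // Ignore events whose hour is already closed (too late to keep).
  val sums =
    if (bucket + hourMs <= watermark) st.sums
    else {
      val key = (deviceId, bucket)
      st.sums.updated(key, st.sums.getOrElse(key, 0.0) + value)
    }
  // Flush every bucket that is now safely in the past.
  val (mature, live) = sums.partition { case ((_, b), _) => b + hourMs <= watermark }
  (State(live, maxT), mature)
}
```

The memory/lateness trade-off from question 2 then becomes the single `allowedLateness` knob: larger values tolerate more out-of-order data but keep more buckets in state.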
To me, Spark Streaming seems like the most natural way to approach this
problem, but maybe a simple Spark batch job run every minute could more
easily keep external events ordered by time.
I'd like to hear your thoughts.
Toni