Hi,
I'm new to Spark and in need of some help framing the problem I have. A
problem well stated is half solved, as the saying goes :)
Let's say that I have a DStream[String] containing JSON measurements from
IoT devices. To keep it simple, say that after unmarshalling I have data
like:
case class Measurement(deviceId: Long, timestamp: Date, measurement: Double)
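To make the unmarshalling step concrete, here is a minimal sketch of turning the raw strings into Measurement values. The regex-based `parse` is only a placeholder (a real job would use a JSON library such as json4s or circe inside a `DStream.flatMap`, so malformed records are dropped rather than crashing a batch); the field names match the case class above, everything else is an assumption.

```scala
import java.util.Date

case class Measurement(deviceId: Long, timestamp: Date, measurement: Double)

// Naive extraction just to illustrate the shape of the data. Returning
// Option lets a flatMap silently skip records that don't parse.
def parse(json: String): Option[Measurement] = {
  val re = """"deviceId"\s*:\s*(\d+).*?"timestamp"\s*:\s*(\d+).*?"measurement"\s*:\s*([\d.]+)""".r
  re.findFirstMatchIn(json).map { m =>
    Measurement(m.group(1).toLong, new Date(m.group(2).toLong), m.group(3).toDouble)
  }
}
```

In a streaming job this would be something like `lines.flatMap(parse)` to get a `DStream[Measurement]`.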
I need to use DStreams because there is some interest in monitoring the
devices' measurements in real time. So let's say that I have a dashboard
with hourly granularity: past hours are consolidated, but the current hour
must be kept updated on every input.
The problem is that the time that matters is the timestamp inside the JSON
(event time), not the time at which Spark receives the record, so I think I
have to keep a stateful DStream like the one described in
http://blog.cloudera.com/blog/2014/11/how-to-do-near-real-time-sessionization-with-spark-streaming-and-apache-hadoop/
. I have two questions here:
1. Once a given hour has passed, I'd like to flush the consolidated stream
   into a DB. I think the strategy is to have a stream of key-value pairs
   where the key is (deviceId, truncateByHour(timestamp)) and to reduce
   over the values. But it seems to me that with this approach Spark
   loses any sense of time: how would you flush all the RDDs with
   timestamps between 00:00:00 and 00:59:59, for instance? Maybe I'm
   missing some function in the API.
2. How do you deal with events that arrive with timestamps in the past?
   Is it a matter of ignoring them, or of finding a trade-off between
   memory and how long the stateful DStream is kept? But then, who is
   responsible for popping the mature time slices from the stateful stream?
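To make the keying idea in question 1 concrete, here is the same aggregation expressed on a plain Scala collection (no Spark, just for illustration). `truncateByHour` is a hypothetical helper; in Spark Streaming the equivalent would be `measurements.map(...).reduceByKey(_ + _)` on a pair DStream, with `updateStateByKey` carrying partial sums across batches.

```scala
import java.util.Date

case class Measurement(deviceId: Long, timestamp: Date, measurement: Double)

// Truncate an event timestamp to its hour bucket (epoch millis at :00:00).
def truncateByHour(ts: Date): Long = ts.getTime / 3600000L * 3600000L

// Sum measurements per (deviceId, hourBucket) key, mirroring what a
// reduceByKey over a pair DStream would compute within one batch.
def consolidate(ms: Seq[Measurement]): Map[(Long, Long), Double] =
  ms.groupBy(m => (m.deviceId, truncateByHour(m.timestamp)))
    .map { case (k, vs) => k -> vs.map(_.measurement).sum }
```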
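One way to answer both "who flushes" and "how late is too late" at once is a watermark-style policy: track the maximum event time seen, treat any hour bucket that ends more than some allowed lateness before it as mature, emit it (e.g. write to the DB), and drop it from state; events for already-closed buckets are ignored. The sketch below is pure Scala with all names and the lateness policy being assumptions, not Spark API; in Spark Streaming the same update logic could live inside `updateStateByKey`.

```scala
// Per-(deviceId, hourBucket) running sums plus the max event time seen.
case class State(sums: Map[(Long, Long), Double], maxEventTime: Long)

val hourMs = 3600000L

// Feed one event through the state; returns the new state and the map of
// mature buckets that were flushed on this step (empty most of the time).
def step(st: State, ev: (Long, Long, Double), allowedLateness: Long)
    : (State, Map[(Long, Long), Double]) = {
  val (deviceId, eventTime, value) = ev
  val maxT = math.max(st.maxEventTime, eventTime)
  val watermark = maxT - allowedLateness
  val bucket = eventTime / hourMs * hourMs
  // Ignore events whose hour is already closed (too late to keep).
  val sums =
    if (bucket + hourMs <= watermark) st.sums
    else {
      val key = (deviceId, bucket)
      st.sums.updated(key, st.sums.getOrElse(key, 0.0) + value)
    }
  // Flush every bucket that is now safely in the past.
  val (mature, live) = sums.partition { case ((_, b), _) => b + hourMs <= watermark }
  (State(live, maxT), mature)
}
```

The memory/lateness trade-off from question 2 then becomes the single `allowedLateness` knob: larger values tolerate more out-of-order data but keep more buckets in state.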
To me, Spark Streaming seems like the most natural way to approach this
problem, but maybe a simple Spark batch job run every minute could more
easily keep external events ordered by time.
I'd like to hear your thoughts.
Toni