* We have an inbound stream of sensor data for millions of devices (which have unique identifiers). - *Spark Streaming can handle events in the ballpark of 100-500K records/sec/node, so you need to size your cluster accordingly. And it's scalable.*
* We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches. - *You need to do stateful stream processing; Spark Streaming allows you to do that. Check out **updateStateByKey** - http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html*

* Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock. - *You can make the source device the key and then use updateStateByKey in Spark on that key.*

* In addition the event device data needs to be processed in the order that the events occurred. - *You would need to implement this in your own code, carrying a timestamp as a data item; Spark Streaming doesn't ensure in-order delivery of your events.*

On Thu, Feb 12, 2015 at 4:51 PM, Legg John <john.l...@axonvibe.com> wrote:

> Hi
>
> After doing lots of reading and building a POC for our use case we are
> still unsure as to whether Spark Streaming can handle our use case:
>
> * We have an inbound stream of sensor data for millions of devices (which
> have unique identifiers).
> * We need to perform aggregation of this stream on a per device level.
> The aggregation will read data that has already been processed (and
> persisted) in previous batches.
> * Key point: When we process data for a particular device we need to
> ensure that no other processes are processing data for that particular
> device. This is because the outcome of our processing will affect the
> downstream processing for that device. Effectively we need a distributed
> lock.
> * In addition the event device data needs to be processed in the order
> that the events occurred.
>
> Essentially we can't have two batches for the same device being processed
> at the same time.
>
> Can Spark handle our use case?
>
> Any advice appreciated.
>
> Regards
> John
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

--
*Arush Kharbanda* || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
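To make the updateStateByKey suggestion concrete: the function you hand to it is a pure fold from one device's batch values plus its persisted state to new state, and Spark invokes it once per key per batch, which is what gives you the per-device isolation the original question asks for. Below is a minimal, Spark-free sketch of that contract in Python; the event shape (`(device_id, value)` pairs) and the running-sum state are illustrative assumptions, not from the thread.

```python
# Sketch of the updateStateByKey contract, assuming (device_id, value)
# events and a (total, count) running aggregate per device.
def update_device_state(new_values, running_state):
    """Fold this batch's values for one device into its persisted state."""
    if not new_values and running_state is None:
        return None
    total, count = running_state or (0, 0)
    for v in new_values:
        total += v
        count += 1
    return (total, count)

# Simulate two micro-batches. Spark calls the update function once per
# key, so no two tasks update the same device's state concurrently.
state = {}
batches = [
    [("dev-1", 10), ("dev-2", 5), ("dev-1", 20)],
    [("dev-2", 7)],
]
for batch in batches:
    per_key = {}
    for device_id, value in batch:
        per_key.setdefault(device_id, []).append(value)
    for device_id in set(state) | set(per_key):
        state[device_id] = update_device_state(
            per_key.get(device_id, []), state.get(device_id))

print(sorted(state.items()))  # [('dev-1', (30, 2)), ('dev-2', (12, 2))]
```

In actual Spark Streaming code you would pass `update_device_state` to `updateStateByKey` on a DStream keyed by device id; the simulation above only illustrates the per-key fold semantics.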