* We have an inbound stream of sensor data for millions of devices (which have unique identifiers). - *Spark Streaming can handle events in the ballpark of 100-500K records/sec/node, so you need to size your cluster accordingly. And it's scalable.*
* We need to perform aggregation of this stream on a per device level. The aggregation will read data that has already been processed (and persisted) in previous batches. - *You need to do stateful stream processing; Spark Streaming allows you to do that. Check out **updateStateByKey** - http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html*

* Key point: When we process data for a particular device we need to ensure that no other processes are processing data for that particular device. This is because the outcome of our processing will affect the downstream processing for that device. Effectively we need a distributed lock. - *You can make the source device the key and then use updateStateByKey in Spark on that key.*

* In addition the event device data needs to be processed in the order that the events occurred. - *You would need to implement this in your own code, carrying a timestamp as a data item; Spark Streaming doesn't ensure in-order delivery of your events.*

On Thu, Feb 12, 2015 at 4:51 PM, Legg John <john.l...@axonvibe.com> wrote:

> Hi
>
> After doing lots of reading and building a POC for our use case we are
> still unsure as to whether Spark Streaming can handle our use case:
>
> * We have an inbound stream of sensor data for millions of devices (which
> have unique identifiers).
> * We need to perform aggregation of this stream on a per device level.
> The aggregation will read data that has already been processed (and
> persisted) in previous batches.
> * Key point: When we process data for a particular device we need to
> ensure that no other processes are processing data for that particular
> device. This is because the outcome of our processing will affect the
> downstream processing for that device. Effectively we need a distributed
> lock.
> * In addition the event device data needs to be processed in the order
> that the events occurred.
>
> Essentially we can't have two batches for the same device being processed
> at the same time.
>
> Can Spark handle our use case?
>
> Any advice appreciated.
>
> Regards
> John
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

--
*Arush Kharbanda* || Technical Teamlead
ar...@sigmoidanalytics.com || www.sigmoidanalytics.com
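To make the updateStateByKey suggestion concrete: the function you hand to it is a pure fold from one device's batch values plus its persisted state to new state, and Spark invokes it once per key per batch, which is what gives you the per-device isolation the original question asks for. Below is a minimal, Spark-free sketch of that contract in Python; the event shape (`(device_id, value)` pairs) and the running-sum state are illustrative assumptions, not from the thread.

```python
# Sketch of the updateStateByKey contract, assuming (device_id, value)
# events and a (total, count) running aggregate per device.
def update_device_state(new_values, running_state):
    """Fold this batch's values for one device into its persisted state."""
    if not new_values and running_state is None:
        return None
    total, count = running_state or (0, 0)
    for v in new_values:
        total += v
        count += 1
    return (total, count)

# Simulate two micro-batches. Spark calls the update function once per
# key, so no two tasks update the same device's state concurrently.
state = {}
batches = [
    [("dev-1", 10), ("dev-2", 5), ("dev-1", 20)],
    [("dev-2", 7)],
]
for batch in batches:
    per_key = {}
    for device_id, value in batch:
        per_key.setdefault(device_id, []).append(value)
    for device_id in set(state) | set(per_key):
        state[device_id] = update_device_state(
            per_key.get(device_id, []), state.get(device_id))

print(sorted(state.items()))  # [('dev-1', (30, 2)), ('dev-2', (12, 2))]
```

In actual Spark Streaming code you would pass `update_device_state` to `updateStateByKey` on a DStream keyed by device id; the simulation above only illustrates the per-key fold semantics.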