One of the challenges we need to prepare for with streaming apps is bursty
data. Typically we need to estimate our worst-case data load and make sure
we have enough capacity.


It is not obvious what the best practices are with Spark Streaming.

* We have implemented checkpointing as described in the programming guide.
* We use the standalone cluster manager and spark-submit.
* We use the management console to kill drivers when needed.
* We plan to enable the write-ahead log and set
  spark.streaming.backpressure.enabled to true (see the configuration
  sketch after this list).
* Our application runs a single unreliable receiver.
* We run multiple instances of the application, each configured to receive
  a predefined partition of the input.
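
For reference, this is roughly the configuration we are describing, as a
minimal sketch only; the app name, batch interval, and checkpoint path are
placeholders, not our actual values:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("burst-test")
      // throttle receivers based on observed scheduling delay
      .set("spark.streaming.backpressure.enabled", "true")
      // write-ahead log so received data survives a driver failure
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")

    def createContext(): StreamingContext = {
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint("hdfs:///tmp/checkpoints")  // hypothetical path
      // ... input DStreams and transformations go here ...
      ssc
    }

    // recover from the checkpoint if one exists, otherwise create a new context
    val ssc = StreamingContext.getOrCreate("hdfs:///tmp/checkpoints", createContext _)
    ssc.start()
    ssc.awaitTermination()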

As long as our processing time is less than our window/batch interval,
everything is fine.

In the streaming systems I have worked on in the past we scaled out by using
load balancers and proxy farms to create buffering capacity. It is not clear
how to scale out Spark Streaming.

In our limited testing it seems like each app instance is configured to
receive a predefined portion of the data. Once it is started we cannot add
additional resources; adding cores and memory does not seem to increase our
capacity.
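
Our reading of the programming guide is that receive-side parallelism comes
from running several receivers (one per input DStream) and unioning them,
rather than from adding cores to a single receiver. A minimal sketch of what
we have been experimenting with; the host names and ports are placeholders:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("multi-receiver-sketch"), Seconds(10))

    // one input DStream (and therefore one receiver task) per source partition
    val numReceivers = 4
    val streams = (1 to numReceivers).map { i =>
      ssc.socketTextStream("stream-source", 9000 + i)  // hypothetical endpoints
    }

    // union into a single DStream so the downstream logic stays unchanged
    val unified = ssc.union(streams)
    unified.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()

Even so, it is not clear to us whether this is the recommended way to add
capacity after the fact, which is why we are asking.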


Kind regards

Andy



