One of the challenges we need to prepare for with streaming apps is bursty data. Typically we need to estimate our worst-case data load and make sure we have enough capacity.
It's not obvious what the best practices are with Spark Streaming. So far:

* We have implemented checkpointing as described in the programming guide.
* We use the standalone cluster manager and spark-submit.
* We use the management console to kill drivers when needed.
* We plan to enable the write-ahead log and to set spark.streaming.backpressure.enabled to true.
* Our application runs a single unreliable receiver.
* We run multiple instances, each configured to receive a predefined partition of the input.

As long as our processing time is less than our batch window, everything is fine.

In the streaming systems I have worked on in the past, we scaled out by using load balancers and proxy farms to create buffering capacity. It's not clear how to scale out Spark. In our limited testing, each app instance is configured to receive a predefined portion of the data. Once it is started we cannot add additional resources; adding cores and memory does not seem to increase our capacity.

Kind regards,

Andy
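P.S. For reference, the write-ahead log and backpressure are separate settings. A minimal spark-defaults.conf fragment showing what we plan to enable (property names are from the Spark Streaming configuration docs; the rate value is purely illustrative):

```
# Let Spark adapt receiver ingest rate to processing rate:
spark.streaming.backpressure.enabled           true
# Write-ahead log for receiver-based sources (separate from backpressure):
spark.streaming.receiver.writeAheadLog.enable  true
# Optional hard cap on per-receiver ingest rate, in records/sec (illustrative):
spark.streaming.receiver.maxRate               10000
```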
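P.P.S. My current understanding of how receiver parallelism is supposed to work, as a sketch. It assumes a hypothetical helper `makeStream(ssc, partitionId)` that builds one receiver input stream per input partition, and a hypothetical `process` function; each receiver permanently occupies one executor core, so the cluster needs receiver cores plus processing cores:

```scala
// Sketch only -- makeStream and process are placeholders for our own code.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("bursty-ingest")
val ssc  = new StreamingContext(conf, Seconds(10))

val numReceivers = 4  // illustrative; sized from worst-case load estimate
val streams = (0 until numReceivers).map(i => makeStream(ssc, i))

// Union the per-partition streams into one DStream for downstream work.
val unioned = ssc.union(streams)

unioned.foreachRDD { rdd =>
  // Total processing time per batch must stay under the 10s batch interval.
  rdd.foreach(record => process(record))
}

ssc.start()
ssc.awaitTermination()
```

Is this the intended way to add ingest capacity, given that receivers cannot be added to a running application?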