Without knowing anything about your pipeline, the best way to estimate the 
resources needed is to run the job at the same ingestion rate as your normal 
production load and measure what it uses.
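
To make that measurement concrete, here is a minimal sketch (assuming a 
DStreams job; the log format is my own) of a StreamingListener that reports 
per-batch load. Comparing processing time against your batch interval shows 
how much headroom the job really has:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Logs record count and delays for every completed micro-batch.
    class BatchStatsListener extends StreamingListener {
      override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
        val info = batch.batchInfo
        println(s"records=${info.numRecords} " +
          s"processingMs=${info.processingDelay.getOrElse(-1L)} " +
          s"schedulingMs=${info.schedulingDelay.getOrElse(-1L)}")
      }
    }

    // ssc.addStreamingListener(new BatchStatsListener())  // ssc is your StreamingContext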

With Kafka you can enable backpressure, so under high load your latency will 
just increase and you don't have to provision capacity to handle the spikes. 
If you want, you can then e.g. autoscale the cluster to respond to the load.
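
For example, a sketch of the relevant knobs for a direct Kafka DStream (the 
rate numbers are illustrative assumptions, not recommendations):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("kafka-ingest")  // hypothetical app name
      // Let Spark adapt the ingestion rate to observed processing times.
      .set("spark.streaming.backpressure.enabled", "true")
      // Cap records/sec per Kafka partition so a spike can't flood one batch.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      // Rate to use before the backpressure estimator has any feedback.
      .set("spark.streaming.backpressure.initialRate", "1000")

    val ssc = new StreamingContext(conf, Seconds(10))  // 10s micro-batches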

If you are using YARN you can isolate and limit resources, so you can also 
run other workloads in the same cluster if you need a lot of elasticity.
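
A sketch of what that can look like, assuming a dedicated, capacity-limited 
queue named "streaming" has been set up in the YARN scheduler (the queue name 
and sizes here are hypothetical):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("kafka-ingest")
      // Pin the job to its own YARN queue so its resource share is bounded
      // and other workloads keep theirs.
      .set("spark.yarn.queue", "streaming")
      // Fix the footprint explicitly; a 24/7 job should not grab the cluster.
      .set("spark.executor.instances", "4")
      .set("spark.executor.cores", "2")
      .set("spark.executor.memory", "4g")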

Usually with streaming jobs the concerns are not about compute capacity but 
more about network bandwidth and memory consumption.
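
If memory turns out to be the bottleneck, these are the knobs I would look at 
first on YARN (values are illustrative):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Off-heap/JVM overhead per executor (MB); Kafka consumer and network
      // buffers live here, and undersizing it gets containers killed by YARN.
      .set("spark.yarn.executor.memoryOverhead", "1024")
      // Fraction of heap shared by execution and storage (Spark 2.x default).
      .set("spark.memory.fraction", "0.6")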


> On 14.11.2017, at 14.54, Nadeem Lalani <nadeem...@gmail.com> wrote:
> 
> Hi,
> 
> I was wondering if anyone has done some work around measuring the cluster 
> resource utilization of a "typical" spark streaming job.
> 
> We are trying to build a message ingestion system which will read from Kafka 
> and do some processing. We have had some concerns raised in the team that a 
> 24*7 streaming job might not be the best use of cluster resources, especially 
> when our use cases are to process data in a micro-batch fashion and are not 
> truly streaming.
> 
> We wanted to measure how much resource a Spark streaming process takes. Any 
> pointers on where one would start?
> 
> We are on YARN and plan to use Spark 2.1.
> 
> Thanks in advance,
> Nadeem 

