Hi all,

I have a Spark Streaming job running on YARN. It consumes data from Kafka
and groups the data by a certain field. The data volume is about 480k lines
per minute, and the batch interval is 1 minute.
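The job looks roughly like this (a simplified sketch, not the actual code; the ZooKeeper quorum, topic name, and grouping field are placeholders, and I'm assuming the receiver-based KafkaUtils.createStream API here):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object KafkaGroupByJob {
      // Placeholder: pick the grouping field out of each line.
      def extractField(line: String): String = line.split(",")(0)

      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaGroupByJob")
        // 1-minute batch interval, matching the batch size described above.
        val ssc = new StreamingContext(conf, Seconds(60))

        // Kafka records arrive as (key, message) pairs; keep only the message.
        val lines = KafkaUtils.createStream(
          ssc, "zk-host:2181", "consumer-group", Map("my-topic" -> 1)).map(_._2)

        // Key each line by the grouping field, then group with 300 shuffle
        // partitions, matching the partition number I specified for groupBy.
        val grouped = lines
          .map(line => (extractField(line), line))
          .groupByKey(300)

        grouped.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }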

For some batches, the groupBy operation sometimes takes more than 3 minutes
to finish, which seems slow to me. I allocated 300 workers and specified 300
as the partition number for groupBy. When I checked the slow stage
*"combineByKey at ShuffledDStream.scala:42"*, sometimes only 2 executors were
allocated for it. During other batches, however, several hundred executors
can be used for the same stage, which means the number of executors for the
same operation changes from batch to batch.

Does anyone know how Spark allocates executors to different stages, and how
I can make these tasks run more efficiently? Thanks!

Bill
