Hi all,

I have a Spark Streaming job running on YARN. It consumes data from Kafka and groups the data by a certain field. The batch interval is 1 minute, and the data volume is about 480k lines per minute.
For some batches, the job takes more than 3 minutes to finish the groupBy operation, which seems slow to me. I allocated 300 workers and specified 300 as the partition number for groupBy. When I checked the slow stage *"combineByKey at ShuffledDStream.scala:42"*, sometimes only 2 executors are allocated for it. During other batches, however, several hundred executors can be used for the same stage, so the number of executors for the same operation changes from batch to batch.

Does anyone know how Spark allocates executors to different stages, and how I can make this task more efficient?

Thanks!
Bill
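For reference, here is a minimal sketch of the kind of job being described, using the receiver-based Kafka API. The ZooKeeper address, consumer group, topic name, and key-extraction logic are all assumptions for illustration, not details from the original job:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical reconstruction of the job described above.
val conf = new SparkConf().setAppName("GroupByExample")
val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batch interval

// Assumed Kafka parameters; the real topic/group names are not given in the post.
val lines = KafkaUtils
  .createStream(ssc, "zkhost:2181", "consumer-group", Map("topic" -> 1))
  .map(_._2) // keep only the message payload

// Group by some field, explicitly requesting 300 shuffle partitions.
val grouped = lines
  .map(line => (line.split(",")(0), line)) // key = first field (assumption)
  .groupByKey(300)

grouped.count().print()
ssc.start()
ssc.awaitTermination()
```

One design note: if the per-key work can be expressed as a reduce or aggregate, `reduceByKey(_ + _, 300)` (or `combineByKey`) combines values map-side before the shuffle, which typically moves far less data than `groupByKey` and can shorten exactly the kind of slow shuffle stage described above.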