Hi Tathagata, I set the default parallelism to 300 in my configuration file. Sometimes a job does get more executors, but it is still slow. I further observed that most executors take less than 20 seconds, while two of them take much longer, around 2 minutes. The data size is very small (fewer than 480k lines with only 4 fields), so I am not sure why the groupBy operation takes more than 3 minutes. Thanks!
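For reference, here is roughly how the parallelism is set (a minimal Scala sketch; in my job the same value actually lives in the configuration file as spark.default.parallelism, and the app name below is hypothetical):

    import org.apache.spark.SparkConf

    // Equivalent to `spark.default.parallelism 300` in spark-defaults.conf:
    // the default number of partitions used by shuffle operations such as
    // groupByKey when no explicit partition count is passed.
    val conf = new SparkConf()
      .setAppName("streaming-group-by") // hypothetical app name
      .set("spark.default.parallelism", "300")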
Bill

On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote:

> Are you specifying the number of reducers in all the DStream.*ByKey
> operations? If the number of reducers is not set, the number used in
> those stages can keep changing across batches.
>
> TD
>
> On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay <bill.jaypeter...@gmail.com> wrote:
>
>> Hi all,
>>
>> I have a Spark Streaming job running on YARN. It consumes data from Kafka
>> and groups the data by a certain field. The data size is 480k lines per
>> minute, and the batch size is 1 minute.
>>
>> For some batches, the program takes more than 3 minutes to finish the
>> groupBy operation, which seems slow to me. I allocated 300 workers and
>> specified 300 as the partition number for the groupBy. When I checked the
>> slow stage "combineByKey at ShuffledDStream.scala:42", there were sometimes
>> only 2 executors allocated for this stage. During other batches, however,
>> there could be several hundred executors for the same stage, which means
>> the number of executors for the same operation changes.
>>
>> Does anyone know how Spark allocates the number of executors for different
>> stages, and how to increase the efficiency of the task? Thanks!
>>
>> Bill
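To make TD's suggestion concrete, here is a minimal Scala sketch of pinning the number of reducers in the *ByKey calls on a DStream (the socket source, field layout, and all names are hypothetical stand-ins for the real Kafka job):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object GroupBySketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("group-by-sketch")
        val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches, as in the job

        // Placeholder source; the real job reads from Kafka.
        val lines = ssc.socketTextStream("localhost", 9999)

        // Key each record by its first field (field layout is hypothetical).
        val pairs = lines.map { line =>
          val fields = line.split(",")
          (fields(0), 1L)
        }

        // Passing an explicit partition count pins the number of reducers,
        // so every batch runs the shuffle with the same parallelism instead
        // of a default that can change across batches.
        val counts = pairs.reduceByKey(_ + _, 300)
        val groups = pairs.groupByKey(300) // groupByKey takes the same argument

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The same explicit-partition-count overload exists on the other *ByKey operations (e.g. combineByKey), so the idea is simply to pass it everywhere a shuffle happens.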