Hi Tathagata,

I set the default parallelism to 300 in my configuration file. Sometimes a
job gets more executors than that, but it is still slow. I further observed
that most executors finish in less than 20 seconds, while two of them take
much longer, around 2 minutes. The data size is very small (fewer than 480k
lines with only 4 fields), so I am not sure why the group-by operation takes
more than 3 minutes. Thanks!
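
For reference, the relevant part of the job looks roughly like the sketch
below. The stream setup, field names, and parsing are simplified
placeholders (the real input is a Kafka DStream), but the partition count
is passed explicitly to the groupBy, as you suggested:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Sketch only: the setting mirrors what I have in my configuration file.
    val conf = new SparkConf()
      .setAppName("GroupByJob")
      .set("spark.default.parallelism", "300")

    val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batches

    // Placeholder input; the real job reads the 4-field records from Kafka.
    val records = ssc.socketTextStream("localhost", 9999)
      .map { line =>
        val fields = line.split(",")
        (fields(0), line) // key on the field we group by (illustrative)
      }

    // Partition count passed explicitly, per your suggestion, so the number
    // of reduce tasks is fixed at 300 rather than changing across batches.
    val grouped = records.groupByKey(300)

    grouped.count().print()

    ssc.start()
    ssc.awaitTermination()

Since the count is given directly to groupByKey, the shuffle should not fall
back on spark.default.parallelism, so I would expect the number of reduce
tasks to stay at 300 in every batch.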

Bill


On Thu, Jul 10, 2014 at 4:28 PM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> Are you specifying the number of reducers in all the DStream.*ByKey
> operations? If the number of reducers is not set, the number used in
> those stages can keep changing across batches.
>
> TD
>
>
> On Wed, Jul 9, 2014 at 4:05 PM, Bill Jay <bill.jaypeter...@gmail.com>
> wrote:
>
>> Hi all,
>>
>> I have a Spark Streaming job running on YARN. It consumes data from Kafka
>> and groups the data by a certain field. The data size is about 480k lines
>> per minute, and the batch size is 1 minute.
>>
>> For some batches, the program takes more than 3 minutes to finish the
>> groupBy operation, which seems slow to me. I allocated 300 workers and
>> specified 300 as the partition number for groupBy. When I checked the slow
>> stage "combineByKey at ShuffledDStream.scala:42", sometimes only 2
>> executors were allocated to this stage. However, during other batches,
>> several hundred executors were used for the same stage, which means the
>> number of executors for the same operation changes.
>>
>> Does anyone know how Spark allocates the number of executors for different
>> stages, and how to make these tasks run more efficiently? Thanks!
>>
>> Bill
>>
>
>
