Could you use "coalesce" to reduce the number of partitions?
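
For example, in the spark-shell (a minimal sketch; the path and the
target partition count are placeholders, so tune them for your data):

    // Read one day's Parquet data (7000+ partitions) and shrink the
    // partition count before running the count action.
    val df = sqlContext.read.parquet("/path/to/day01")
    val n = df.coalesce(200).count()

Unlike repartition, coalesce on a DataFrame only merges existing
partitions and does not trigger a full shuffle, so it may also avoid
the GC problems you hit when repartitioning.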

Shixiong Zhu

On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> Here is more info.
>
> The job stuck at:
> INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks
>
> Then got the error:
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out
> after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>
> So I increased spark.network.timeout from 120s to 600s.  It sometimes
> works.
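>
> For reference, I pass it when launching the shell (a sketch; other
> flags omitted):
>
>   spark-shell --conf spark.network.timeout=600s ...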
>
> Each task corresponds to one Parquet file.  I could not repartition
> due to GC problems.
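>
> What I tried was roughly this (a sketch; df and the target partition
> count are illustrative, not the exact code):
>
>   // repartition triggers a full shuffle of every row, which is
>   // likely what ran into GC trouble at this scale
>   val merged = df.repartition(1000)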
>
> Is there any way I could improve the performance?
>
> Thanks,
> Gavin
>
>
> On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>
>> Hey,
>>
>> I have 10 days of data; each day has a Parquet directory with over
>> 7000 partitions.
>> So when I union the 10 days and do a count, it submits over 70K tasks.
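>>
>> The job is essentially this (a sketch; the paths are placeholders):
>>
>>   // Union the 10 daily Parquet directories and count the rows;
>>   // each input file becomes its own task, hence the 70K+ tasks.
>>   val days = (1 to 10).map(d => sqlContext.read.parquet(s"/data/day$d"))
>>   val total = days.reduce((a, b) => a.unionAll(b)).count()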
>>
>> Then the job fails silently, with one container exiting with code 1.
>> A union of only 5 or 6 days of data works fine.
>> In the spark-shell, it just hangs after showing that the YARN scheduler
>> submitted 70000+ tasks.
>>
>> I am running Spark 1.6 on Hadoop 2.7.  Is there any setting I could
>> change to make this work?
>>
>> Thanks,
>> Gavin
>>
>>
>>
>
