Could you use "coalesce" to reduce the number of partitions?
Shixiong Zhu

On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> Here is more info.
>
> The job got stuck at:
> INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks
>
> Then it failed with:
> Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out
> after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>
> So I increased spark.network.timeout from 120s to 600s. That sometimes
> works.
>
> Each task reads one parquet file. I could not repartition because of GC
> (out-of-memory) problems.
>
> Is there any way I could improve the performance?
>
> Thanks,
> Gavin
>
> On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue <yue.yuany...@gmail.com> wrote:
>
>> Hey,
>>
>> I have 10 days of data; each day has a parquet directory with over 7,000
>> partitions. So when I union the 10 days and do a count, it submits over
>> 70K tasks.
>>
>> The job then failed silently, with one container exiting with code 1. A
>> union of 5 or 6 days of data works fine. In the spark-shell, it just
>> hangs after showing: YarnScheduler submitting 70,000+ tasks.
>>
>> I am running Spark 1.6 on Hadoop 2.7. Is there any setting I could
>> change to make this work?
>>
>> Thanks,
>> Gavin
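For reference, the timeouts Gavin raised above can also be set when the context is created, rather than per spark-shell session. A minimal sketch; spark.network.timeout and spark.rpc.askTimeout are real Spark settings, the app name is a placeholder, and 600s is the value from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("example-app")  // placeholder name
      // Raise the general network timeout; spark.rpc.askTimeout falls
      // back to this value unless it is set explicitly.
      .set("spark.network.timeout", "600s")
      .set("spark.rpc.askTimeout", "600s")
    val sc = new SparkContext(conf)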