As I understand it, the initial number of partitions is determined by how
the input data is laid out (roughly one partition per file or split). I'm
not aware of any way to change this at read time, other than changing the
layout of the underlying data store.

Have you tried reading the data as several data frames (e.g. one data frame
per day), coalescing each data frame, and *then* unioning them? You could
try it both without a shuffle (coalesce) and with one (repartition). Not
sure if it'll work, but it might be worth a shot.
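Roughly what I have in mind, as a sketch for Spark 1.6. The paths and the
per-day partition count are made up for illustration, and I haven't verified
that this reduces the task count on your data:

```scala
// Read each day separately, shrink its partition count, then union.
// Paths and numbers below are illustrative only.
val days = (1 to 10).map(d => f"/data/events/2016-01-$d%02d")

val perDay = days.map { path =>
  // coalesce avoids a shuffle; swap in repartition(100) to force one
  sqlContext.read.parquet(path).coalesce(100)
}

// unionAll in Spark 1.6 (renamed union in 2.x);
// the result should have about 10 * 100 partitions
val all = perDay.reduce(_ unionAll _)
all.count()
```

If coalesce alone still leaves too many tasks, repartition on the per-day
frames is smaller-scale than repartitioning the full union, which might
sidestep the out-of-memory failures you saw.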

On Mon, Jan 11, 2016 at 8:39 PM, Gavin Yue <yue.yuany...@gmail.com> wrote:

> Thank you for the suggestion.
>
> I tried df.coalesce(1000).write.parquet() and yes, the parquet file
> count drops to 1000, but the partition count of the parquet data is still
> 5000+. When I read the parquet back and do a count, it still runs 5000+
> tasks.
>
> So I guess I need to do a repartition here to drop the task number? But
> repartition never works for me; it always fails with out-of-memory errors.
>
> And regarding the large number task delay problem, I found a similar
> problem: https://issues.apache.org/jira/browse/SPARK-7447.
>
> I am unionAll-ing about 10 parquet folders, with 70K+ parquet files in
> total, generating 70K+ tasks. It took around 5-8 minutes before any tasks
> started, just like the ticket above.
>
> It also happens if I do partition discovery with a base path.    Is there
> any schema inference or checking going on that causes the slowness?
>
> Thanks,
> Gavin
>
>
>
> On Mon, Jan 11, 2016 at 1:21 PM, Shixiong(Ryan) Zhu <
> shixi...@databricks.com> wrote:
>
>> Could you use "coalesce" to reduce the number of partitions?
>>
>>
>> Shixiong Zhu
>>
>>
>> On Mon, Jan 11, 2016 at 12:21 AM, Gavin Yue <yue.yuany...@gmail.com>
>> wrote:
>>
>>> Here is more info.
>>>
>>> The job got stuck at:
>>> INFO cluster.YarnScheduler: Adding task set 1.0 with 79212 tasks
>>>
>>> Then I got this error:
>>> Caused by: org.apache.spark.rpc.RpcTimeoutException: Futures timed out
>>> after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>>>
>>> So I increased spark.network.timeout from 120s to 600s.  It sometimes
>>> works.
>>>
>>> Each task reads one parquet file.  I could not repartition due to GC /
>>> out-of-memory problems.
>>>
>>> Is there any way I could to improve the performance?
>>>
>>> Thanks,
>>> Gavin
>>>
>>>
>>> On Sun, Jan 10, 2016 at 1:51 AM, Gavin Yue <yue.yuany...@gmail.com>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> I have 10 days of data; each day has a parquet directory with over 7000
>>>> partitions.
>>>> So when I union the 10 days and do a count, it submits over 70K tasks.
>>>>
>>>> Then the job failed silently, with one container exiting with code 1.
>>>> The union of 5 or 6 days of data works fine.
>>>> In the spark-shell, it just hangs after showing: Yarn scheduler submit
>>>> 70000+ tasks.
>>>>
>>>> I am running spark 1.6 over hadoop 2.7.  Is there any setting I could
>>>> change to make this work?
>>>>
>>>> Thanks,
>>>> Gavin
>>>>
>>>>
>>>>
>>>
>>
>
