Thanks, everyone, for the replies. The execution plan showed one giant query; after breaking it into smaller steps (divide and conquer), the save is quick.
On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama <kathleenli...@gmail.com> wrote:

> Hi Lian,
> Since you are using repartition(1), do you want to decrease the number of
> partitions? If so, have you tried to use coalesce instead?
>
> Kathleen
>
> On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Hi,
>>
>> Writing a csv to HDFS takes about 1 hour:
>>
>> df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
>>
>> The generated csv file is only about 150kb. The job uses 3 containers
>> (13 cores, 23g mem).
>>
>> Other people have similar issues but I don't see a good explanation
>> and solution.
>>
>> Any clue is highly appreciated! Thanks.