Re: writing a small csv to HDFS is super slow

Apostolos N. Papadopoulos Fri, 22 Mar 2019 14:55:06 -0700

Is it also slow when you do not repartition (i.e., when you let Spark write multiple output files)?


Also, did you try simply using saveAsTextFile?


Also, before repartition, how many partitions are there?
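
A minimal PySpark sketch of the three checks above, assuming df is the DataFrame from your job. The function names and the output paths you pass in are hypothetical, not taken from your code, and the sketch uses the built-in csv writer rather than com.databricks.spark.csv:

```python
import time


def check_partitions_and_time_write(df, out_dir):
    """Return (partition count, seconds taken) for a CSV write without repartition(1).

    df is assumed to be a pyspark.sql.DataFrame; out_dir is a hypothetical path.
    """
    # How many partitions exist before any repartition call.
    num_partitions = df.rdd.getNumPartitions()

    # Write one output file per partition and time the whole job.
    # The write is an action, so this also includes the time to compute df.
    start = time.time()
    df.write.mode('overwrite').option('header', 'true').csv(out_dir)
    elapsed = time.time() - start

    return num_partitions, elapsed


def save_as_text(df, out_dir):
    """Write df as plain comma-joined lines via the RDD API (no header, no quoting).

    saveAsTextFile fails if out_dir already exists, so use a fresh path.
    """
    df.rdd.map(lambda row: ','.join(str(c) for c in row)).saveAsTextFile(out_dir)
```

If the write without repartition takes about as long as the one with repartition(1), the time is probably going into computing df rather than into writing the file, but that is only a guess until you measure it.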

a.


On 22/3/19 23:34, Lian Jiang wrote:

Hi,

Writing a csv to HDFS takes about 1 hour:

df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
The generated CSV file is only about 150 KB. The job uses 3 containers (13 cores, 23 GB memory).
Other people have reported similar issues, but I have not found a good explanation or solution.
Any clue is highly appreciated! Thanks.

--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol


