Is it also slow when you do not repartition (i.e., when you let Spark write multiple output files)?

Also, did you try simply using saveAsTextFile?

Also, before repartition, how many partitions are there?
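
A rough way to check these points in one run is sketched below in PySpark. The helper name diagnose_csv_write is hypothetical, and df and the output path are assumed to be the ones from your job. It reports the partition count, times the upstream computation on its own, and then times a write without repartition(1). Treat it as a sketch, not a definitive benchmark: count() may let Spark skip work that a full write needs, and the write recomputes the lineage unless df is cached.

```python
import time


def diagnose_csv_write(df, path):
    """Time the computation and the write separately (hypothetical helper)."""
    # How many partitions exist before any repartition call.
    num_partitions = df.rdd.getNumPartitions()

    # Run an action that triggers the upstream computation without writing.
    start = time.perf_counter()
    row_count = df.count()
    compute_seconds = time.perf_counter() - start

    # Write WITHOUT repartition(1): one part file per non-empty partition.
    start = time.perf_counter()
    (df.write
       .format('com.databricks.spark.csv')
       .mode('overwrite')
       .options(header='true')
       .save(path))
    write_seconds = time.perf_counter() - start

    return {
        'num_partitions': num_partitions,
        'row_count': row_count,
        'compute_seconds': compute_seconds,
        'write_seconds': write_seconds,
    }
```

If compute_seconds already accounts for most of the hour, the time is going into producing df rather than into writing the csv. In that case repartition(1) is not the main problem.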

a.


On 22/3/19 23:34, Lian Jiang wrote:
Hi,

Writing a csv to HDFS takes about 1 hour:

df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)

The generated csv file is only about 150 KB. The job uses 3 containers (13 cores, 23 GB of memory).

Other people have reported similar issues, but I have not seen a good explanation or solution.

Any clue is highly appreciated! Thanks.


--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol

