Is it also slow when you do not repartition (i.e., when you let Spark
write multiple output files)?
Did you also try simply saveAsTextFile?
And how many partitions are there before the repartition?
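
A rough sketch of how these checks could be run from PySpark, to see whether the time goes into computing the DataFrame or into the write itself. The helper names (count_partitions, timed, diagnose) and the out_dir scratch path are made up for illustration; the write call simply mirrors the one in your message minus repartition(1). Note that count() is only an approximate stand-in for the upstream cost, since Spark may skip work that a full write would need.

```python
import time


def count_partitions(df):
    """Return how many partitions back the DataFrame before any repartition."""
    return df.rdd.getNumPartitions()


def timed(label, action):
    """Run a zero-argument callable and print its wall-clock time in seconds."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f}s")
    return result, elapsed


def diagnose(df, out_dir):
    """Time the upstream computation and a write without repartition(1).

    out_dir is a hypothetical scratch location; it will be overwritten.
    """
    n = count_partitions(df)
    print(f"partitions before repartition: {n}")

    # Step 1: trigger the upstream plan without writing anything.
    # If this alone is slow, the CSV writer is not the bottleneck.
    _, compute_s = timed("count (no write)", df.count)

    # Step 2: same write as in the original message, but without
    # repartition(1), so each partition produces its own part file.
    _, write_s = timed(
        "write without repartition",
        lambda: df.write.format('com.databricks.spark.csv')
        .mode('overwrite')
        .options(header='true')
        .save(out_dir),
    )
    return n, compute_s, write_s
```

If step 1 already takes most of the hour, the problem is in whatever produces df rather than in writing 150kb of CSV. In that case repartition(1) is probably just the point where the whole upstream job gets executed.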
a.
On 22/3/19 23:34, Lian Jiang wrote:
Hi,
Writing a csv to HDFS takes about 1 hour:
df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
The generated CSV file is only about 150 KB. The job uses 3 containers
(13 cores, 23 GB of memory).
Other people have similar issues but I don't see a good explanation
and solution.
Any clue is highly appreciated! Thanks.
--
Apostolos N. Papadopoulos, Associate Professor
Department of Informatics
Aristotle University of Thessaloniki
Thessaloniki, GREECE
tel: ++0030312310991918
email: papad...@csd.auth.gr
twitter: @papadopoulos_ap
web: http://datalab.csd.auth.gr/~apostol