Thanks, everyone, for the replies. The execution plan showed one giant query; after breaking it into smaller steps (divide and conquer), the save is quick.
On Fri, Mar 22, 2019 at 4:01 PM kathy Harayama <kathleenli...@gmail.com> wrote:

> Hi Lian,
> Since you are using repartition(1), do you want to decrease the number of
> partitions? If so, have you tried to use coalesce instead?
>
> Kathleen
>
> On Fri, Mar 22, 2019 at 2:43 PM Lian Jiang <jiangok2...@gmail.com> wrote:
>
>> Hi,
>>
>> Writing a csv to HDFS takes about 1 hour:
>>
>> df.repartition(1).write.format('com.databricks.spark.csv').mode('overwrite').options(header='true').save(csv)
>>
>> The generated csv file is only about 150kb. The job uses 3 containers
>> (13 cores, 23g mem).
>>
>> Other people have similar issues but I don't see a good explanation
>> and solution.
>>
>> Any clue is highly appreciated! Thanks.