You can use `.checkpoint` for that.

`df.sort(…).coalesce(1).write...`: `coalesce` forces the `sort` to run with only one partition, so the sort will take a long time.
`df.sort(…).repartition(1).write...`: `repartition` adds an explicit stage, but the sort order is lost, since a repartition reshuffles the rows.

```
sc.setCheckpointDir("/tmp/test")
val checkpointedDf = df.sort(…).checkpoint(eager = true) // will save all partitions
checkpointedDf.coalesce(1).write.csv(…) // will load the checkpointed partitions in one task, concatenate them, and write them out as a single file
```

On Fri, Mar 9, 2018 at 9:47 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:

> I would suggest repartitioning it to a reasonable number of partitions, maybe 500, and
> saving it to some intermediate working directory.
> Finally, read all the files from this working dir, then coalesce to 1
> and save to the final location.
>
> Thanks
> Deepak
>
> On Fri, Mar 9, 2018, 20:12 Vadim Semenov <va...@datadoghq.com> wrote:
>
>> because `coalesce` gets propagated further up in the DAG in the last
>> stage, so your last stage only has one task.
>>
>> You need to break your DAG so your expensive operations would be in a
>> previous stage before the stage with `.coalesce(1)`
>>
>> On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <
>> rezaul.ka...@insight-centre.org> wrote:
>>
>>> Dear All,
>>>
>>> I have a tiny CSV file, which is around 250MB. There are only 30 columns
>>> in the DataFrame. Now I'm trying to save the pre-processed DataFrame as
>>> another CSV file on disk for later usage.
>>>
>>> However, I'm getting pissed off as writing the resultant DataFrame is
>>> taking too long, which is about 4 to 5 hours. Nevertheless, the size of the
>>> file written on the disk is about 58GB!
>>>
>>> Here's the sample code that I tried:
>>>
>>> # Using repartition()
>>> myDF.repartition(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>>
>>> # Using coalesce()
>>> myDF.coalesce(1).write.format("com.databricks.spark.csv").save("data/file.csv")
>>>
>>>
>>> Any better suggestion?
>>>
>>>
>>> ----
>>> Md. Rezaul Karim, BSc, MSc
>>> Research Scientist, Fraunhofer FIT, Germany
>>>
>>> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>>>
>>> eMail: rezaul.ka...@fit.fraunhofer.de
>>> <andrea.berna...@fit.fraunhofer.de>
>>> Tel: +49 241 80-21527
>>>
>>
>> --
>> Sent from my iPhone