But overall, I think the original approach is not correct. If you end up with a single file in the tens of gigabytes, the approach probably needs to be reworked. I don't see why you can't just write multiple CSV files using Spark and then concatenate them without Spark; a sketch of that follows below.
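For illustration, here is a minimal Scala sketch of that idea. It assumes a Hadoop 2.x classpath (FileUtil.copyMerge was removed in Hadoop 3), the hypothetical paths "data/parts" and "data/file.csv", and that the CSV is written without per-file headers:

```
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val hadoopConf = spark.sparkContext.hadoopConfiguration
val fs = FileSystem.get(hadoopConf)

// 1) Let Spark write the CSV in parallel, one part file per partition.
myDF.write.format("com.databricks.spark.csv").save("data/parts")

// 2) Concatenate the part files into a single file, outside of Spark.
//    copyMerge picks the files up in name order (part-00000, part-00001, ...).
FileUtil.copyMerge(
  fs, new Path("data/parts"),    // source directory holding the part files
  fs, new Path("data/file.csv"), // single destination file
  false,                         // keep the source part files
  hadoopConf,
  null                           // no separator string between files
)
```

Since copyMerge concatenates the files byte for byte, make sure headers are disabled, or strip them from all but the first part file before merging.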
On Fri, Mar 9, 2018 at 10:02 AM, Vadim Semenov <va...@datadoghq.com> wrote:

> You can use `.checkpoint` for that.
>
> `df.sort(…).coalesce(1).write...`: `coalesce` will force `sort` to run with
> only one partition, so sorting will take a lot of time.
>
> `df.sort(…).repartition(1).write...`: `repartition` will add an explicit
> stage, but the sort order will be lost, since `repartition` shuffles the data.
>
> ```
> sc.setCheckpointDir("/tmp/test")
> // will save all partitions
> val checkpointedDf = df.sort(…).checkpoint(eager = true)
> // will load the checkpointed partitions in one task, concatenate them,
> // and write them out as a single file
> checkpointedDf.coalesce(1).write.csv(…)
> ```
>
> On Fri, Mar 9, 2018 at 9:47 AM, Deepak Sharma <deepakmc...@gmail.com> wrote:
>
>> I would suggest repartitioning it to a reasonable number of partitions,
>> maybe 500, and saving it to some intermediate working directory.
>> Finally, read all the files from this working directory, then coalesce
>> to 1 and save to the final location (see the sketch after the thread).
>>
>> Thanks
>> Deepak
>>
>> On Fri, Mar 9, 2018, 20:12 Vadim Semenov <va...@datadoghq.com> wrote:
>>
>>> Because `coalesce` gets propagated further up the DAG into the last
>>> stage, your last stage ends up with only one task.
>>>
>>> You need to break your DAG so that your expensive operations run in a
>>> previous stage, before the stage with `.coalesce(1)`.
>>>
>>> On Fri, Mar 9, 2018 at 5:23 AM, Md. Rezaul Karim <
>>> rezaul.ka...@insight-centre.org> wrote:
>>>
>>>> Dear All,
>>>>
>>>> I have a tiny CSV file, which is around 250MB. There are only 30
>>>> columns in the DataFrame. Now I'm trying to save the pre-processed
>>>> DataFrame as another CSV file on disk for later use.
>>>>
>>>> However, I'm getting frustrated because writing the resultant
>>>> DataFrame is taking far too long, about 4 to 5 hours. What's more,
>>>> the size of the file written to disk is about 58GB!
>>>>
>>>> Here's the sample code that I tried:
>>>>
>>>> # Using repartition()
>>>> myDF.repartition(1).write.format("com.databricks.spark.csv")
>>>>     .save("data/file.csv")
>>>>
>>>> # Using coalesce()
>>>> myDF.coalesce(1).write.format("com.databricks.spark.csv")
>>>>     .save("data/file.csv")
>>>>
>>>> Any better suggestion?
>>>>
>>>> ----
>>>> Md. Rezaul Karim, BSc, MSc
>>>> Research Scientist, Fraunhofer FIT, Germany
>>>> Ph.D. Researcher, Information Systems, RWTH Aachen University, Germany
>>>> eMail: rezaul.ka...@fit.fraunhofer.de
>>>> Tel: +49 241 80-21527
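A minimal sketch of Deepak's intermediate-directory suggestion, assuming a SparkSession named `spark` and a hypothetical working directory; the partition count of 500 is his example, not a rule:

```
val tmpDir = "/tmp/intermediate" // hypothetical working directory

// Stage 1: do the expensive work with full parallelism and materialize it.
myDF.repartition(500)
  .write.format("com.databricks.spark.csv")
  .save(tmpDir)

// Stage 2: read the materialized rows back; coalesce(1) is cheap here
// because nothing upstream has to be recomputed in the single task.
spark.read.format("com.databricks.spark.csv").load(tmpDir)
  .coalesce(1)
  .write.format("com.databricks.spark.csv")
  .save("data/file.csv")
```

This buys the same DAG break as `.checkpoint`, at the cost of one extra round trip to disk.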