These are not CSV files; they are UTF-8 files with a specific delimiter. I tried this out with a 3 GB file:
myDF.write.json("output/myJson")
Time taken: ~60 secs.

myDF.rdd.repartition(1).saveAsTextFile("output/text")
Time taken: ~160 secs.

That is where I am concerned: writing a text file takes almost three times as long as writing JSON.

On Wednesday, September 14, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> These intermediate files, what sort of files are they? Are they CSV-type files?
>
> I agree that a DF is more efficient than an RDD as it follows a tabular format (I assume that is what you mean by "columnar" format). So if you read these files in a batch process, you may not worry too much about execution time?
>
> Saving as a text file is simply a one-to-one mapping from your DF to HDFS. I think it is pretty efficient.
>
> For myself, I would do something like below:
>
> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 14 September 2016 at 12:46, sanat kumar Patnaik <patnaik.sa...@gmail.com> wrote:
>
>> Hi All,
>>
>> - I am writing a batch application using Spark SQL and DataFrames. This application has a bunch of file joins, and there are intermediate points where I need to drop a file for downstream applications to consume.
>> - The problem is that all these downstream applications are still on legacy systems, so they still require us to drop them a text file. As you all must know, a DataFrame stores its data in columnar format internally.
>>
>> The only way I have found to do this, and it looks awfully slow, is:
>>
>> myDF = sc.textFile("inputpath").toDF()
>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>
>> Is there any better way to do this?
>>
>> *P.S:* The other workaround would be to use RDDs for all my operations. But I am wary of using them, as the documentation says DataFrames are much faster because of the Catalyst engine running behind the scenes.
>>
>> Please suggest if any of you might have tried something similar.

--
Regards,
Sanat Patnaik
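For what it's worth, the per-row formatting that an RDD-based text dump performs can be sketched without Spark. This is only an illustrative sketch, not anything from the thread: the helper name `to_delimited`, the `|` delimiter, and the `\N` null marker are all assumptions; in a Spark job such a function would plug in as `myDF.rdd.map(to_delimited).saveAsTextFile("mypath/output")`.

```python
# Spark-free sketch of the per-row conversion that a call like
# myDF.rdd.map(to_delimited).saveAsTextFile("mypath/output") would apply.
# The "|" delimiter and the "\N" null marker are illustrative assumptions.

def to_delimited(row, sep="|", null="\\N"):
    """Render one row (a tuple of column values) as one delimited line."""
    return sep.join(null if v is None else str(v) for v in row)

rows = [(1, "alice", 3.5), (2, None, 7.0)]
print("\n".join(to_delimited(r) for r in rows))
# 1|alice|3.5
# 2|\N|7.0
```

Note that the delimiter choice matters: if a column value can itself contain the delimiter, it needs quoting or escaping, which plain string joining does not handle.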