The performance figures I mentioned here are all from local runs (my laptop). I have tried the same thing on a cluster (Elastic MapReduce) and have seen even worse results.

Is there a way this can be done efficiently? Please share if any of you have tried it.
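For reference, a direct DataFrame-level write of delimited text (the spark-csv route mentioned below) would look roughly like this. This is only a sketch, assuming Spark 1.x with the Databricks spark-csv package on the classpath; the pipe delimiter, gzip codec and output path are placeholders:

# Sketch (assumptions as above): write the DataFrame straight out as
# delimited text, skipping the DataFrame -> RDD conversion entirely;
# coalesce(1) narrows to a single output file without forcing a full shuffle.
myDF.coalesce(1).write \
    .format("com.databricks.spark.csv") \
    .option("delimiter", "|") \
    .option("codec", "gzip") \
    .save("output/delimited")

spark-csv accepts an arbitrary single-character delimiter, so a DataFrame-level write does not require the files to be literal CSV.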
On Wednesday, September 14, 2016, Jörn Franke <jornfra...@gmail.com> wrote:

> It could be that using the RDD converts the data from the internal
> format to Java objects (much more memory is needed), which may lead to
> spill-over to disk. This conversion takes a lot of time. Then you need
> to transfer these Java objects over the network to one single node
> (repartition ...); on a 1 Gbit network, 3 GB takes about 25 seconds
> under optimal conditions (no other transfers happening at the same
> time, jumbo frames activated, etc.), and since Java objects are being
> transferred, the volume may be even more than 3 GB. On the destination
> node we may again have spill-over to disk. Then you store the data to a
> single disk (potentially multiple, if you have and use HDFS), which
> also takes time (assuming no other process uses that disk).
>
> Btw, spark-csv can be used with different DataFrames.
> As said, other options are compression, avoiding repartitioning (to
> avoid the network transfer), avoiding spilling to disk (provide enough
> memory in YARN etc.), increasing network bandwidth ...
>
> On 14 Sep 2016, at 14:22, sanat kumar Patnaik <patnaik.sa...@gmail.com> wrote:
>
> These are not CSV files; they are UTF-8 files with a specific delimiter.
> I tried this out with a file (3 GB):
>
> myDF.write.json("output/myJson")
> Time taken: approximately 60 secs.
>
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken: 160 secs.
>
> That is where I am concerned: the time to write a text file is well
> over twice that for JSON.
>
> On Wednesday, September 14, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> These intermediate files, what sort of files are they? Are they
>> CSV-type files?
>>
>> I agree that a DF is more efficient than an RDD as it follows a
>> tabular format (I assume that is what you mean by "columnar" format).
>> So if you read these files in a batch process, you may not need to
>> worry too much about execution time?
>>
>> A textFile save is simply a one-to-one mapping from your DF to HDFS.
>> I think it is pretty efficient.
>>
>> For myself, I would do something like the below:
>>
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>>
>> Disclaimer: Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which
>> may arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary
>> damages arising from such loss, damage or destruction.
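For comparison, a closely related variant of Mich's one-liner above, again only a sketch: coalesce(1) narrows the existing partitions instead of triggering the full shuffle that repartition(1) does, and the cache is dropped because an RDD that is written exactly once gains nothing from being cached.

# Sketch: same single-file output as above, minus the shuffle and the cache.
myDF.rdd.coalesce(1).saveAsTextFile("mypath/output")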
>> On 14 September 2016 at 12:46, sanat kumar Patnaik <patnaik.sa...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> - I am writing a batch application using Spark SQL and DataFrames.
>>> This application has a bunch of file joins, and there are
>>> intermediate points where I need to drop a file for downstream
>>> applications to consume.
>>> - The problem is that all these downstream applications are still
>>> legacy, so they still require us to drop them a text file. As you all
>>> probably know, a DataFrame stores its data internally in a columnar
>>> format.
>>>
>>> The only way I have found to do this, and it looks awfully slow, is:
>>>
>>> myDF = sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>
>>> Is there any better way to do this?
>>>
>>> P.S.: The other workaround would be to use RDDs for all my
>>> operations. But I am wary of using them, as the documentation says
>>> DataFrames are much faster because of the Catalyst engine running
>>> behind the scenes.
>>>
>>> Please suggest if any of you might have tried something similar.

--
Regards,
Sanat Patnaik
Cell->804-882-6424
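For completeness, one way to act on Jörn's "avoid repartitioning" suggestion is to write the delimited text in parallel and merge the part files afterwards, outside Spark. Again a sketch only; the pipe delimiter, the paths and the hadoop fs -getmerge step are placeholders for whatever the downstream hand-off allows:

# Sketch: render each Row as delimited text and write one part file per
# partition, so nothing is shuffled to a single node.
textRDD = myDF.rdd.map(lambda row: "|".join(str(c) for c in row))
textRDD.saveAsTextFile("output/parts")

# Then merge the part files into a single local file outside Spark, e.g.:
#   hadoop fs -getmerge output/parts /local/path/output.txt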