Hi,

DataFrames are more efficient if Tungsten is enabled as the underlying 
processing engine (it normally is by default). However, Tungsten only speeds 
up processing; saving is an I/O-bound operation, so it does not necessarily 
benefit.
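
For reference, a quick way to confirm the setting, assuming Spark 1.5/1.6 
where the flag still exists (from 2.0 onwards Tungsten is always on and the 
flag is gone):

    # spark.sql.tungsten.enabled defaulted to true in Spark 1.5/1.6
    print(sqlContext.getConf("spark.sql.tungsten.enabled", "true"))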

What exactly is slow? The write?
You could write the DataFrame out directly through the DataFrameWriter, e.g. 
myDF.write.save(...).
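
For example, a sketch assuming Spark 2.0, where the csv writer is built in 
(the path and separator are placeholders):

    # Write the DataFrame directly instead of round-tripping through an RDD.
    # "|" stands for whatever delimiter the legacy consumer expects.
    myDF.write.mode("overwrite").option("sep", "|").csv("mypath/output")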

However, repartition(1) means that everything is shuffled to a single executor, 
and if there is a lot of data this can lead to network congestion.
It is better (if the legacy application supports it) to write each partition 
to its own file, as in the sketch below.
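
A minimal sketch of that, reusing the path from your example (the getmerge 
step is only needed if the consumer insists on a single physical file):

    # Each executor writes its own partition as a separate part file,
    # so nothing is shuffled to a single node.
    myDF.write.mode("overwrite").csv("mypath/output")  # mypath/output/part-*

    # If one physical file is required, merge on an edge node, not in Spark:
    #   hadoop fs -getmerge mypath/output /local/path/output.txt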

If your processing is slow then you need to provide more concrete examples.


Best regards

> On 14 Sep 2016, at 14:10, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> What sort of files are these intermediate files? Are they CSV-type files?
> 
> I agree that a DF is more efficient than an RDD as it follows a tabular 
> format (I assume that is what you mean by "columnar" format). So if you read 
> these files in a batch process, you may not need to worry too much about 
> execution time?
> 
> Saving as a text file is simply a one-to-one mapping from your DF to HDFS. I 
> think it is pretty efficient.
> 
> For myself, I would do something like below
> 
> myDF.rdd.repartition(1).cache().saveAsTextFile("mypath/output")
> 
> HTH
> 
> Dr Mich Talebzadeh
>  
> LinkedIn  
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>  
> http://talebzadehmich.wordpress.com
> 
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
> damage or destruction of data or any other property which may arise from 
> relying on this email's technical content is explicitly disclaimed. The 
> author will in no case be liable for any monetary damages arising from such 
> loss, damage or destruction.
>  
> 
>> On 14 September 2016 at 12:46, sanat kumar Patnaik <patnaik.sa...@gmail.com> 
>> wrote:
>> Hi All,
>> 
>> I am writing a batch application using Spark SQL and DataFrames. This 
>> application has a bunch of file joins, and there are intermediate points 
>> where I need to drop a file for downstream applications to consume.
>> The problem is that all these downstream applications are still legacy, so 
>> they still require us to drop them a text file. As you may know, a DataFrame 
>> stores its data in a columnar format internally.
>> The only way I have found to do this, and it looks awfully slow, is this:
>> 
>> myDF=sc.textFile("inputpath").toDF()
>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>  
>> Is there any better way to do this?
>> 
>> P.S.: The other workaround would be to use RDDs for all my operations, but I 
>> am wary of using them, as the documentation says DataFrames are much faster 
>> because of the Catalyst engine running behind the scenes.
>> 
>> Please suggest if any of you might have tried something similar.
> 
