These are not CSV files; they are UTF-8 files with a specific delimiter. I tried this out with a 3 GB file:
myDF.write.json("output/myJson")
Time taken: ~60 secs.

myDF.rdd.repartition(1).saveAsTextFile("output/text")
Time taken: ~160 secs.

That is where I am concerned: writing a text file takes almost three times as long as writing JSON.

On Wednesday, September 14, 2016, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> These intermediate files, what sort of files are they? Are they CSV-type files?
>
> I agree that a DF is more efficient than an RDD as it follows a tabular format (I assume that is what you mean by "columnar" format). So if you read these files in a batch process, you may not worry too much about execution time?
>
> Saving as a text file is simply a one-to-one mapping from your DF to HDFS. I think it is pretty efficient.
>
> For myself, I would do something like below:
>
> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>
> HTH
>
> Dr Mich Talebzadeh
>
> On 14 September 2016 at 12:46, sanat kumar Patnaik <patnaik.sa...@gmail.com> wrote:
>
>> Hi All,
>>
>> - I am writing a batch application using Spark SQL and DataFrames. This application has a bunch of file joins, and there are intermediate points where I need to drop a file for downstream applications to consume.
>> - The problem is that all these downstream applications are still on legacy systems, so they still require us to drop them a text file. As you all must know, a DataFrame stores its data in columnar format internally.
>>
>> The only way I have found to do this, and it looks awfully slow, is:
>>
>> myDF = sc.textFile("inputpath").toDF()
>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>
>> Is there any better way to do this?
>>
>> *P.S:* The other workaround would be to use RDDs for all my operations. But I am wary of using them, as the documentation says DataFrames are much faster because of the Catalyst engine running behind the scenes.
>>
>> Please suggest if any of you might have tried something similar.

--
Regards,
Sanat Patnaik
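For what it's worth, the per-row formatting that an RDD-based text dump performs can be sketched without Spark. This is only an illustrative sketch, not anything from the thread: the helper name `to_delimited`, the `|` delimiter, and the `\N` null marker are all assumptions; in a Spark job such a function would plug in as `myDF.rdd.map(to_delimited).saveAsTextFile("mypath/output")`.

```python
# Spark-free sketch of the per-row conversion that a call like
# myDF.rdd.map(to_delimited).saveAsTextFile("mypath/output") would apply.
# The "|" delimiter and the "\N" null marker are illustrative assumptions.

def to_delimited(row, sep="|", null="\\N"):
    """Render one row (a tuple of column values) as one delimited line."""
    return sep.join(null if v is None else str(v) for v in row)

rows = [(1, "alice", 3.5), (2, None, 7.0)]
print("\n".join(to_delimited(r) for r in rows))
# 1|alice|3.5
# 2|\N|7.0
```

Note that the delimiter choice matters: if a column value can itself contain the delimiter, it needs quoting or escaping, which plain string joining does not handle.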