As I understand it, you cannot deliver a JSON file downstream because the
consumers want text format.

If it is batch processing, what is the window of delivery within the SLA?

Writing a 3 GB file in 160 seconds means that it takes more than 50 seconds
to write 1 GB, which looks like a long time to me. Even taking one minute for
JSON looks excessive.
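
The arithmetic behind those figures can be checked with a short sketch. It
assumes decimal units (1 GB = 1000 MB) and uses the approximate timings
reported in this thread; the function name write_rate is illustrative only.

```python
def write_rate(size_gb, seconds):
    """Return (seconds per GB, MB per second) for one write."""
    # Assumption: decimal units, 1 GB = 1000 MB.
    secs_per_gb = seconds / size_gb
    mb_per_sec = size_gb * 1000 / seconds
    return secs_per_gb, mb_per_sec


# Timings quoted in the thread for the same 3 GB input.
text_secs_per_gb, text_mb_per_sec = write_rate(3, 160)  # saveAsTextFile
json_secs_per_gb, json_mb_per_sec = write_rate(3, 60)   # write.json

print(f"text: {text_secs_per_gb:.1f} s/GB, {text_mb_per_sec:.2f} MB/s")
print(f"json: {json_secs_per_gb:.1f} s/GB, {json_mb_per_sec:.2f} MB/s")
print(f"text/json time ratio: {160 / 60:.2f}")
```

On these assumptions the text write runs at under 20 MB/s, which is why the
network path to HDFS is worth checking.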

If HDFS and Spark are not sharing the same hardware, is your Spark cluster on
the same subnet as your HDFS?

HTH


Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 14 September 2016 at 13:22, sanat kumar Patnaik <patnaik.sa...@gmail.com>
wrote:

> These are not CSV files; they are UTF-8 files with a specific delimiter.
> I tried this out with a 3 GB file:
>
> myDF.write.json("output/myJson")
> Time taken: approximately 60 secs
>
> myDF.rdd.repartition(1).saveAsTextFile("output/text")
> Time taken: 160 secs
>
> That is what concerns me: writing a text file takes much longer than
> writing JSON (about 160 secs versus 60 secs here).
>
> On Wednesday, September 14, 2016, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> What sort of files are these intermediate files? Are they CSV-type
>> files?
>>
>> I agree that a DF is more efficient than an RDD as it follows a tabular
>> format (I assume that is what you mean by "columnar" format). So if you
>> read these files in a batch process, perhaps you need not worry too much
>> about execution time?
>>
>> Saving as a text file is simply a one-to-one mapping from your DF to HDFS.
>> I think it is pretty efficient.
>>
>> Myself, I would do something like the following:
>>
>> myDF.rdd.repartition(1).cache.saveAsTextFile("mypath/output")
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> On 14 September 2016 at 12:46, sanat kumar Patnaik <
>> patnaik.sa...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>>
>>>    - I am writing a batch application using Spark SQL and DataFrames.
>>>    This application has a bunch of file joins, and there are intermediate
>>>    points where I need to drop a file for downstream applications to
>>>    consume.
>>>    - The problem is that all these downstream applications are still on
>>>    legacy systems, so they still require us to drop them a text file. As
>>>    you all probably know, a DataFrame stores its data in columnar format
>>>    internally.
>>>
>>> The only way I have found to do this, which looks awfully slow, is:
>>>
>>> myDF=sc.textFile("inputpath").toDF()
>>> myDF.rdd.repartition(1).saveAsTextFile("mypath/output")
>>>
>>> Is there any better way to do this?
>>>
>>> *P.S.:* The other workaround would be to use RDDs for all my operations.
>>> But I am wary of using them, as the documentation says DataFrames are much
>>> faster because of the Catalyst engine running behind the scenes.
>>>
>>> Please suggest if any of you might have tried something similar.
>>>
>>
>>
>
> --
> Regards,
> Sanat Patnaik
> Cell->804-882-6424
>
