I would push the Spark folks to provide equivalent functionality. In the end 
it is a deserialization/serialization process, and it should not be done back 
and forth, because it is one of the more costly parts of processing: Java 
objects have to be converted to a binary representation. Doing it once is 
fine, because afterwards access in the binary form is much more efficient, 
but that advantage disappears entirely if you convert back and forth all the 
time.
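As a rough illustration, here is a plain-Python sketch in which `pickle` stands in for Spark's internal (de)serialization. This is an analogy, not Spark code, and the record layout, function names, and round counts are made up for the example; the only point is that a repeated round-trip multiplies a cost you could pay once.

```python
import pickle
import time

# Hypothetical stand-in data for the rows of an RDD/Dataframe.
records = [{"id": i, "value": i * 0.5} for i in range(100_000)]

def convert_once(data):
    # Pay the serialization cost a single time, then keep working
    # on the deserialized result.
    blob = pickle.dumps(data)
    return pickle.loads(blob)

def convert_back_and_forth(data, rounds=10):
    # Pay the full round-trip cost on every "operation" -- the analogue
    # of converting RDD <-> Dataframe between each step.
    for _ in range(rounds):
        data = pickle.loads(pickle.dumps(data))
    return data

t0 = time.perf_counter()
once = convert_once(records)
t_once = time.perf_counter() - t0

t0 = time.perf_counter()
many = convert_back_and_forth(records, rounds=10)
t_many = time.perf_counter() - t0

# Both paths preserve the data; only the time spent differs.
assert once == records and many == records
print(f"one round-trip:  {t_once:.3f}s")
print(f"ten round-trips: {t_many:.3f}s")
```

The ten-round version does roughly ten times the serialization work for the same result, which is the shape of the problem with frequent RDD/Dataframe conversions, whatever the absolute numbers are in Spark itself.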

I have heard somewhere the figure that serialization/deserialization accounts 
for about 80% of the time spent in big-data workloads, but I would be happy 
to see that figure confirmed empirically for different scenarios. 
Unfortunately I do not have a source for it, so do not take it for granted.

> On 24 Jun 2016, at 08:00, pan <pranav.na...@gmail.com> wrote:
> 
> Hello,
>   I am trying to understand the cost of converting an RDD to a Dataframe and
> back. Would converting back and forth very frequently cost performance?
> 
> I do observe that some operations, like join, are implemented very
> differently for RDDs (pair RDDs) and Dataframes, so I am trying to figure
> out the cost of converting one to the other.
> 
> Regards,
> Pranav
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Cost-of-converting-RDD-s-to-dataframe-and-back-tp27222.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
