I don't think that's generally true, but it is true to the extent that you
can push down the work of higher-level logical operators like select and
groupBy on common types, which the engine can understand and optimize.
Your arbitrary user code is opaque and can't be optimized. So
DataFrame.groupBy.max is likely to be more efficient, if that's all you're
doing, than executing a groupBy on opaque user objects.
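A rough sketch of the contrast, not a benchmark: the "dept"/"salary"
columns, the data, and the local-mode setup are all made up, and it's
written against the 1.6-era API:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object GroupBySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("groupby-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Hypothetical data: (department, salary)
        val df = sc.parallelize(
          Seq(("eng", 100.0), ("eng", 120.0), ("sales", 90.0)))
          .toDF("dept", "salary")

        // DataFrame version: Catalyst sees both groupBy and max as
        // logical operators, so it can e.g. plan a partial aggregation
        // before the shuffle.
        df.groupBy("dept").max("salary").show()

        // Equivalent on an RDD of opaque rows: the lambdas are black
        // boxes to the engine, so no such planning is possible.
        df.rdd
          .map(row => (row.getString(0), row.getDouble(1)))
          .reduceByKey((a, b) => math.max(a, b))
          .collect()
          .foreach(println)

        sc.stop()
      }
    }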
If you really need to apply a user function to each row, I'm not sure
DataFrames help much, since you're not using them qua DataFrames; indeed,
you end up treating them as RDDs. You should instead see whether you can
express more of your operations as standard DataFrame operations,
registering small UDFs where needed, since that is how you get the
speedups. (There's a sketch of this after the quoted message below.)

On Thu, Apr 21, 2016 at 2:49 PM, Apurva Nandan <apurva3...@gmail.com> wrote:
> Hello everyone,
>
> Generally speaking, I guess it's well known that dataframes are much
> faster than RDDs when it comes to performance.
> My question is how you go about transforming a dataframe using map.
> The dataframe gets converted into an RDD, so do you then convert that
> RDD back into a new dataframe for better performance?
> Further, if you have a process that involves a series of
> transformations, i.e. from one RDD to another, do you keep converting
> each RDD to a dataframe first, every time?
>
> It's also possible that I might be missing something here; please share
> your experiences.
>
> Thanks and Regards,
> Apurva
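Here's the sketch mentioned above: only the custom per-value logic lives
in a UDF, while the filtering and aggregation around it stay as ordinary
DataFrame operations the optimizer can still reason about. The column
names, the normalize function, and the setup are again illustrative
(1.6-era API):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf

    object UdfSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("udf-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df = sc.parallelize(
          Seq((" Eng", 100.0), ("eng ", 120.0), ("Sales", 90.0)))
          .toDF("dept", "salary")

        // The only opaque piece is this small UDF; everything else
        // remains visible to the optimizer.
        val normalize = udf((s: String) => s.trim.toLowerCase)

        df.withColumn("dept", normalize($"dept"))
          .filter($"salary" > 0)
          .groupBy("dept")
          .max("salary")
          .show()

        sc.stop()
      }
    }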