I don't think that's generally true, but it is true to the extent that you
can push down the work of higher-level logical operators like select and
groupBy on common types, which the engine can understand and optimize.
Your arbitrary user code is opaque and can't be optimized. So
DataFrame.groupBy.max is likely to be more efficient, if that's all you're
doing, than executing a groupBy on opaque user objects.
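A rough sketch of the contrast, not a benchmark: the "dept"/"salary"
columns, the data, and the local-mode setup are all made up, and it's
written against the 1.6-era API:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object GroupBySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("groupby-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Hypothetical data: (department, salary)
        val df = sc.parallelize(
          Seq(("eng", 100.0), ("eng", 120.0), ("sales", 90.0)))
          .toDF("dept", "salary")

        // DataFrame version: Catalyst sees both groupBy and max as
        // logical operators, so it can e.g. plan a partial aggregation
        // before the shuffle.
        df.groupBy("dept").max("salary").show()

        // Equivalent on an RDD of opaque rows: the lambdas are black
        // boxes to the engine, so no such planning is possible.
        df.rdd
          .map(row => (row.getString(0), row.getDouble(1)))
          .reduceByKey((a, b) => math.max(a, b))
          .collect()
          .foreach(println)

        sc.stop()
      }
    }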
If you really need to apply a user function to each row, I'm not sure
DataFrames help much, since you're not using them qua DataFrames; indeed,
you end up treating them as RDDs. You should instead see whether you can
express more of your operations as standard DataFrame operations,
registering small UDFs where needed, since that is how you get the
speedups. (There's a sketch of this after the quoted message below.)

On Thu, Apr 21, 2016 at 2:49 PM, Apurva Nandan <apurva3...@gmail.com> wrote:
> Hello everyone,
>
> Generally speaking, I guess it's well known that dataframes are much
> faster than RDDs when it comes to performance.
> My question is how you go about transforming a dataframe using map.
> The dataframe gets converted into an RDD, so do you then convert that
> RDD back into a new dataframe for better performance?
> Further, if you have a process that involves a series of
> transformations, i.e. from one RDD to another, do you keep converting
> each RDD to a dataframe first, every time?
>
> It's also possible that I might be missing something here; please share
> your experiences.
>
> Thanks and Regards,
> Apurva
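Here's the sketch mentioned above: only the custom per-value logic lives
in a UDF, while the filtering and aggregation around it stay as ordinary
DataFrame operations the optimizer can still reason about. The column
names, the normalize function, and the setup are again illustrative
(1.6-era API):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.sql.functions.udf

    object UdfSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("udf-sketch").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        val df = sc.parallelize(
          Seq((" Eng", 100.0), ("eng ", 120.0), ("Sales", 90.0)))
          .toDF("dept", "salary")

        // The only opaque piece is this small UDF; everything else
        // remains visible to the optimizer.
        val normalize = udf((s: String) => s.trim.toLowerCase)

        df.withColumn("dept", normalize($"dept"))
          .filter($"salary" > 0)
          .groupBy("dept")
          .max("salary")
          .show()

        sc.stop()
      }
    }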