Re: is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-07-02 Thread Davies Liu
On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl a...@whisperstream.com wrote:
 In pyspark, when I convert from rdds to dataframes it looks like the rdd is
 being materialized/collected/repartitioned before it's converted to a
 dataframe.

That's not the case. When converting an RDD to a DataFrame, Spark only takes a few
rows to infer the types; no other collect or repartition happens.
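
To make the point concrete, here is a minimal sketch of the two conversion paths. It assumes a recent PySpark where `SparkSession` is the entry point (on the 1.x releases the equivalent call lives on `sqlContext`), and the names `ROWS`, `SCHEMA`, `convert` and `main` are made up for illustration. Without a schema, Spark peeks at the first rows to infer types; with an explicit schema there is nothing to infer.

```python
# Hypothetical sample data; any RDD of tuples would behave the same way.
ROWS = [("alice", 34), ("bob", 45), ("carol", 29)]

# Explicit schema in the "name: type" string form accepted by createDataFrame.
SCHEMA = "name: string, age: int"


def convert(spark):
    """Build a DataFrame from the same RDD twice: inferred vs. explicit schema."""
    rdd = spark.sparkContext.parallelize(ROWS)

    # Only column names are given, so Spark inspects the first row(s) to infer
    # the column types. This runs a small job; it does not collect the RDD.
    inferred = spark.createDataFrame(rdd, ["name", "age"])

    # Types are supplied up front, so no rows need to be sampled for inference.
    explicit = spark.createDataFrame(rdd, SCHEMA)
    return inferred, explicit


def main():
    # Imported here so the module loads even without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("rdd-to-df").getOrCreate()
    inferred, explicit = convert(spark)
    inferred.printSchema()
    explicit.printSchema()
    spark.stop()
```

Calling `main()` on a machine with a local PySpark install prints both schemas, so you can compare what inference picked with what was declared.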

 Just wondering if there are any guidelines for doing this conversion and
 whether it's best to do it early to get the performance benefits of
 dataframes or weigh that against the size/number of items in the rdd.

It's better to do it as early as possible, I think.
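
As a rough sketch of what "early" can look like: convert right after the RDD is created, then express the filtering and aggregation as DataFrame operations, so they are planned by Spark's optimizer rather than run as opaque Python lambdas. The data, the column names, and the helpers `average_age_by_city` and `reference_average` are hypothetical; `reference_average` is only a plain-Python statement of what the pipeline is meant to compute.

```python
# Hypothetical sample data: (name, city, age).
ROWS = [("alice", "nyc", 34), ("bob", "sf", 45), ("carol", "nyc", 29), ("dave", "sf", 31)]


def average_age_by_city(spark, min_age=30):
    """Convert the RDD to a DataFrame first, then do the work on the DataFrame."""
    # Imported here so the module loads even without PySpark installed.
    from pyspark.sql import functions as F

    rdd = spark.sparkContext.parallelize(ROWS)
    # Convert as early as possible, with an explicit schema.
    df = spark.createDataFrame(rdd, "name: string, city: string, age: int")
    # Everything after this point is a DataFrame operation.
    return (
        df.filter(F.col("age") >= min_age)
        .groupBy("city")
        .agg(F.avg("age").alias("avg_age"))
    )


def reference_average(rows, min_age=30):
    """Plain-Python reference for the same aggregate, for cross-checking."""
    totals = {}
    for _, city, age in rows:
        if age >= min_age:
            total, count = totals.get(city, (0, 0))
            totals[city] = (total + age, count + 1)
    return {city: total / count for city, (total, count) in totals.items()}
```

Given a `SparkSession` named `spark`, `average_age_by_city(spark).show()` runs the DataFrame version, and `reference_average(ROWS)` gives the values to compare against.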

 Thanks,

 -Axel





is there any significant performance issue converting between rdd and dataframes in pyspark?

2015-06-29 Thread Axel Dahl
In pyspark, when I convert from rdds to dataframes it looks like the rdd is
being materialized/collected/repartitioned before it's converted to a
dataframe.

Just wondering if there are any guidelines for doing this conversion and
whether it's best to do it early to get the performance benefits of
dataframes or weigh that against the size/number of items in the rdd.

Thanks,

-Axel