On Mon, Jun 29, 2015 at 1:27 PM, Axel Dahl a...@whisperstream.com wrote:
In pyspark, when I convert from rdds to dataframes it looks like the rdd is
being materialized/collected/repartitioned before it's converted to a
dataframe.
That's not the case. When converting an RDD to a DataFrame, it only takes a few rows to
infer the types; no other collect or repartition happens.
Just wondering if there's any guidelines for doing this conversion and
whether it's best to do it early to get the performance benefits of
dataframes or weigh that against the size/number of items in the rdd.
It's better to do it as early as possible, I think.
Thanks,
-Axel