In PySpark, when I convert from RDDs to DataFrames, it looks like the RDD is being materialized/collected/repartitioned before it's converted to a DataFrame.
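Here's a minimal sketch of the kind of conversion I mean (using the Spark 2.x SparkSession API; the app name and sample data are just placeholders, and the comments reflect my guess about where the extra work comes from):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("rdd-to-df-sketch").getOrCreate()

# A small RDD of (key, count) pairs, standing in for my real data.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])

# Without an explicit schema, createDataFrame() has to sample the RDD to
# infer one, which runs a job on the RDD before the DataFrame exists.
# Possibly this is what I'm seeing as materialization?
df_inferred = spark.createDataFrame(rdd)

# Supplying a schema up front should avoid that inference pass.
schema = StructType([
    StructField("key", StringType(), nullable=False),
    StructField("count", IntegerType(), nullable=False),
])
df = spark.createDataFrame(rdd, schema)
df.show()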
Just wondering if there are any guidelines for doing this conversion, and whether it's best to do it early to get the performance benefits of DataFrames, or to weigh that against the size/number of items in the RDD. Thanks, -Axel