In PySpark, when I convert from RDDs to DataFrames, it looks like the RDD is
being materialized/collected/repartitioned before it's converted to a
DataFrame.

I'm just wondering whether there are any guidelines for this conversion:
is it best to do it early to get the performance benefits of DataFrames,
or should that be weighed against the size/number of items in the RDD?

Thanks,

-Axel
