In PySpark, when I convert from RDDs to DataFrames, it looks like the RDD is being materialized/collected/repartitioned before it's converted to a DataFrame.
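Here's a minimal sketch of the kind of conversion I mean (using the Spark 2.x SparkSession API; the app name and sample data are just placeholders, and the comments reflect my guess about where the extra work comes from):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("rdd-to-df-sketch").getOrCreate()

# A small RDD of (key, count) pairs, standing in for my real data.
rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2)])

# Without an explicit schema, createDataFrame() has to sample the RDD to
# infer one, which runs a job on the RDD before the DataFrame exists.
# Possibly this is what I'm seeing as materialization?
df_inferred = spark.createDataFrame(rdd)

# Supplying a schema up front should avoid that inference pass.
schema = StructType([
    StructField("key", StringType(), nullable=False),
    StructField("count", IntegerType(), nullable=False),
])
df = spark.createDataFrame(rdd, schema)
df.show()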
Just wondering if there are any guidelines for doing this conversion, and whether it's best to do it early to get the performance benefits of DataFrames, or to weigh that against the size/number of items in the RDD. Thanks, -Axel