We have a join of two RDDs that used to be fast (< 1 min), and suddenly it would not finish even overnight. The only change was that the RDD is now derived from a DataFrame.
The new code that runs forever is something like this:

    dataframe.rdd.map(row => (Row(row(0)), row)).join(...)

Any idea why? I imagined it had something to do with recomputing parts of the DataFrame, but even a small change like this makes the issue go away:

    dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)), row)).join(...)
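
In case it helps anyone reproduce this, here is a minimal sketch of the two variants side by side. It assumes a Spark version where dataframe.rdd returns an RDD[Row]; the object name, the method names, and the `other` RDD are hypothetical stand-ins (the real right-hand side of the join is not shown above), so treat this as an illustration rather than our exact code.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

object JoinVariants {
  // Slow variant: key and value are taken straight from the DataFrame's rows.
  def keyedDirect(dataframe: DataFrame): RDD[(Row, Row)] =
    dataframe.rdd.map(row => (Row(row(0)), row))

  // Fast variant: each row is first copied into a new Row via Row.fromSeq.
  def keyedCopied(dataframe: DataFrame): RDD[(Row, Row)] =
    dataframe.rdd
      .map(row => Row.fromSeq(row.toSeq))
      .map(row => (Row(row(0)), row))

  // `other` is a hypothetical stand-in for the RDD being joined against.
  def joinDirect(dataframe: DataFrame, other: RDD[(Row, Row)]): RDD[(Row, (Row, Row))] =
    keyedDirect(dataframe).join(other)

  def joinCopied(dataframe: DataFrame, other: RDD[(Row, Row)]): RDD[(Row, (Row, Row))] =
    keyedCopied(dataframe).join(other)
}
```

The only difference between the two paths is the extra Row.fromSeq(row.toSeq) copy, which is what makes the join fast again for us.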