We have a join of two RDDs that used to be fast (< 1 min), and suddenly it
wouldn't even finish overnight anymore. The only change was that the RDD is
now derived from a DataFrame.

So the new code, which runs forever, is something like this:
dataframe.rdd.map(row => (Row(row(0)), row)).join(...)

Any idea why?
I imagined it had something to do with recomputing parts of the DataFrame,
but even a small change like this makes the issue go away:
dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)), row)).join(...)
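
In case it helps anyone compare the two, here is a minimal self-contained sketch of both variants side by side. Everything around the two map/join lines is an assumption made for illustration: the local SparkSession, the toy two-column data, and the names JoinVariants, run, and other are hypothetical and not from the real job. On data this small it will not show the slowdown; it only pins down exactly which two pipelines are being compared.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object JoinVariants {
  // Returns the number of joined records for (variant 1, variant 2).
  def run(): (Long, Long) = {
    // Assumption: a local session and toy data stand in for the real job.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-variants")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical DataFrame; column 0 is used as the join key.
    val dataframe = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

    // Hypothetical right-hand side of the join, keyed by a single-field Row.
    val other = spark.sparkContext.parallelize(Seq((Row(1), "x"), (Row(2), "y")))

    // Variant 1: key and value taken directly from the DataFrame's rows.
    val direct = dataframe.rdd
      .map(row => (Row(row(0)), row))
      .join(other)

    // Variant 2: each row is first rebuilt with Row.fromSeq, then keyed.
    val rebuilt = dataframe.rdd
      .map(row => Row.fromSeq(row.toSeq))
      .map(row => (Row(row(0)), row))
      .join(other)

    val counts = (direct.count(), rebuilt.count())
    spark.stop()
    counts
  }

  def main(args: Array[String]): Unit = {
    val (directCount, rebuiltCount) = run()
    println(s"direct: $directCount, rebuilt: $rebuiltCount")
  }
}
```

The only difference between the two pipelines is the extra Row.fromSeq(row.toSeq) step, so whatever causes the slowdown presumably has to do with the Row objects that dataframe.rdd hands back, but that is a guess.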
