Can you please provide the high-level schema of the entities that you are attempting to join? I think you may be able to use a more efficient technique to join these together, perhaps by registering the DataFrames as temp tables and constructing a Spark SQL query, along the lines of the sketch below.
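
For example, here is a minimal sketch of what I mean, assuming two hypothetical DataFrames with a shared "id" column (the names and data are made up; Spark 1.x API):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(
      new SparkConf().setAppName("temp-table-join").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // hypothetical DataFrames sharing an "id" join key
    val df1 = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "left_val")
    val df2 = sc.parallelize(Seq((1, "x"), (2, "y"))).toDF("id", "right_val")

    df1.registerTempTable("left_table")
    df2.registerTempTable("right_table")

    // lets Catalyst plan the join instead of joining on Row keys at the RDD level
    val joined = sqlContext.sql(
      "SELECT l.id, l.left_val, r.right_val " +
      "FROM left_table l JOIN right_table r ON l.id = r.id")
    joined.show()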
Also, which version of Spark are you using?

On Tue, Jan 12, 2016 at 4:16 PM, Koert Kuipers <ko...@tresata.com> wrote:
> we are having a join of 2 RDDs that's fast (< 1 min), and suddenly it
> wouldn't even finish overnight anymore. the change was that the RDD was
> now derived from a DataFrame.
>
> so the new code that runs forever is something like this:
> dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
>
> any idea why?
> i imagined it had something to do with recomputing parts of the
> DataFrame, but even a small change like this makes the issue go away:
> dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row =>
>   (Row(row(0)), row)).join(...)
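
For anyone following along, here is a sketch of the two variants from the message above as they would appear in context, reusing the sc and sqlContext from the earlier sketch and a made-up stand-in DataFrame (the other side of the join is elided):

    import org.apache.spark.sql.Row
    import sqlContext.implicits._

    // hypothetical stand-in for the poster's DataFrame
    val dataframe = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("id", "v")

    // variant reported as slow: key directly on the Rows from dataframe.rdd
    val slowKeyed = dataframe.rdd.map(row => (Row(row(0)), row))

    // reported workaround: copy each Row via Row.fromSeq(row.toSeq) before
    // keying, so every record is an independent Row object
    val fastKeyed = dataframe.rdd
      .map(row => Row.fromSeq(row.toSeq))
      .map(row => (Row(row(0)), row))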