it is spark 1.5.1. the dataframe has just 2 columns, both strings. a sql query would probably be more efficient, but it doesn't fit our purpose (we are doing a lot more stuff where we need rdds).
also, i am just trying to understand in general what in that rdd coming from a dataframe could slow things down from 1 min to overnight...

On Tue, Jan 12, 2016 at 5:29 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:

> Can you please provide the high-level schema of the entities that you are
> attempting to join? I think that you may be able to use a more efficient
> technique to join these together; perhaps by registering the DataFrames as
> temp tables and constructing a Spark SQL query.
>
> Also, which version of Spark are you using?
>
> On Tue, Jan 12, 2016 at 4:16 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> we have a join of 2 rdds that's fast (< 1 min), and suddenly it
>> wouldn't even finish overnight anymore. the change was that the rdd was now
>> derived from a dataframe.
>>
>> so the new code that runs forever is something like this:
>> dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
>>
>> any idea why?
>> i imagined it had something to do with recomputing parts of the data
>> frame, but even a small change like this makes the issue go away:
>> dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)),
>> row)).join(...)
>>
>
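for reference, the two variants being compared can be sketched as below. this is just an illustrative sketch against the spark 1.5.x API: the `sqlContext`, the toy two-string-column dataframe, and the `otherRdd` join target are all assumptions (e.g. a spark-shell session), not code from the thread.

```scala
import org.apache.spark.sql.Row

// assumed: a dataframe with 2 string columns, as described above
val df = sqlContext
  .createDataFrame(Seq(("a", "1"), ("b", "2")))
  .toDF("key", "value")

// slow variant: key the Rows coming straight out of dataframe.rdd
val slow = df.rdd.map(row => (Row(row(0)), row))

// fast variant: first materialize a plain copy of each Row via
// Row.fromSeq(row.toSeq), then key it the same way
val fast = df.rdd
  .map(row => Row.fromSeq(row.toSeq))
  .map(row => (Row(row(0)), row))

// both are then joined against some other keyed rdd (hypothetical here):
// slow.join(otherRdd)   // runs "forever" per the report
// fast.join(otherRdd)   // finishes in under a minute per the report
```

the only difference between the two pipelines is the extra `Row.fromSeq(row.toSeq)` copy step, which is what makes the reported slowdown surprising.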