RDD join very slow when RDD created from a DataFrame
We are having a join of 2 RDDs that is fast (< 1 min), and suddenly it wouldn't even finish overnight anymore. The change was that the RDD was now derived from a DataFrame.

So the new code that runs forever is something like this:

    dataframe.rdd.map(row => (Row(row(0)), row)).join(...)

Any idea why? I imagined it had something to do with recomputing parts of the DataFrame, but even a small change like this makes the issue go away:

    dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)), row)).join(...)
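For concreteness, a minimal self-contained sketch of the two variants being compared. This is an illustration only: the data, column names, and the RDD being joined against are made up, and it assumes Spark 1.5.x with a plain SQLContext.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.{Row, SQLContext}

    object JoinRepro {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("join-repro").setMaster("local[*]"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Two-column (String, String) DataFrame, matching the schema described later in the thread.
        val dataframe = sc.parallelize(Seq(("a", "1"), ("b", "2"))).toDF("k", "v")

        // The other side of the join: an ordinary pair RDD keyed the same way.
        val other = sc.parallelize(Seq(("a", "x"), ("b", "y"))).map { case (k, v) => (Row(k), v) }

        // Variant 1 (reported slow): key directly off the rows coming out of dataframe.rdd.
        val slow = dataframe.rdd.map(row => (Row(row(0)), row)).join(other)

        // Variant 2 (reported fast): copy each row into a plain Row via Row.fromSeq first.
        val fast = dataframe.rdd
          .map(row => Row.fromSeq(row.toSeq))
          .map(row => (Row(row(0)), row))
          .join(other)

        slow.collect().foreach(println)
        fast.collect().foreach(println)
        sc.stop()
      }
    }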
Re: RDD join very slow when RDD created from a DataFrame
It's Spark 1.5.1, and the DataFrame simply has 2 columns, both strings. A SQL query would probably be more efficient, but it doesn't fit our purpose (we are doing a lot more stuff where we need RDDs). Also, I am just trying to understand in general what in that RDD coming from a DataFrame could slow things down from 1 min to overnight...
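One way to probe the recomputation hypothesis from the original post is to persist the row RDD coming out of the DataFrame before keying and joining it. This is only a diagnostic sketch, assuming a Spark 1.5 spark-shell session (where sc and sqlContext are predefined) and made-up data:

    import org.apache.spark.sql.Row
    import org.apache.spark.storage.StorageLevel
    import sqlContext.implicits._

    val dataframe = sc.parallelize(Seq(("a", "1"), ("b", "2"))).toDF("k", "v")
    val other = sc.parallelize(Seq(("a", "x"), ("b", "y"))).map { case (k, v) => (Row(k), v) }

    // Persist the row RDD once so the DataFrame is not re-evaluated by the join stages;
    // if the join is fast again afterwards, recomputation was at least part of the problem.
    val rows = dataframe.rdd.persist(StorageLevel.MEMORY_AND_DISK)
    rows.count()  // force materialization before the join

    val joined = rows.map(row => (Row(row(0)), row)).join(other)
    joined.collect().foreach(println)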
Re: RDD join very slow when RDD created from a DataFrame
Can you please provide the high-level schema of the entities that you are attempting to join? I think you may be able to use a more efficient technique to join these together, perhaps by registering the DataFrames as temp tables and constructing a Spark SQL query.

Also, which version of Spark are you using?
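A rough sketch of the temp-table approach being suggested here, again assuming a Spark 1.5 spark-shell session; the table names, column names, and data are invented for illustration:

    import sqlContext.implicits._

    // Two small illustrative DataFrames, each with two string columns.
    val left  = sc.parallelize(Seq(("a", "1"), ("b", "2"))).toDF("k", "v")
    val right = sc.parallelize(Seq(("a", "x"), ("b", "y"))).toDF("k", "w")

    // Register both as temp tables and let the SQL planner handle the join.
    left.registerTempTable("left_table")
    right.registerTempTable("right_table")

    val joined = sqlContext.sql(
      """SELECT l.k, l.v, r.w
        |FROM left_table l
        |JOIN right_table r ON l.k = r.k""".stripMargin)

    joined.show()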