rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
we had a join of 2 rdds that was fast (< 1 min), and suddenly it wouldn't even finish overnight anymore. the only change was that one rdd is now derived from a dataframe. so the new code that runs forever is something like this: dataframe.rdd.map(row => (Row(row(0)), row)).join(...) any idea

Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Koert Kuipers
it is spark 1.5.1. the dataframe simply has 2 columns, both string. a sql query would probably be more efficient, but doesn't fit our purpose (we are doing a lot more stuff where we need rdds). also i am just trying to understand in general what in that rdd coming from a dataframe could slow things
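
For readers trying to reproduce this, here is a minimal sketch of the pipeline shape described above, next to a variant keyed by the plain String value instead of a wrapping Row. It assumes Spark 1.5.x in local mode; the column names (k, v), the sample data, and the second rdd (other) are made up for illustration. The thread does not establish that the Row key is the cause of the slowdown, so the String-keyed variant is only an experiment to compare against, not a confirmed fix.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}

object RowKeyVsStringKey {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("row-key-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical stand-in for the 2-column, all-string dataframe.
    val dataframe = sc.parallelize(Seq(("a", "1"), ("b", "2"))).toDF("k", "v")

    // Hypothetical second rdd to join against.
    val other: RDD[(String, Int)] = sc.parallelize(Seq(("a", 10), ("b", 20)))

    // Shape from the thread: the join key is a Row wrapping the first column.
    val byRow: RDD[(Row, Row)] = dataframe.rdd.map(row => (Row(row(0)), row))
    val joinedByRow = byRow.join(other.map { case (k, n) => (Row(k), n) })

    // Variant to compare: the join key is the String itself.
    val byString: RDD[(String, Row)] = dataframe.rdd.map(row => (row.getString(0), row))
    val joinedByString = byString.join(other)

    println(s"row-keyed join count:    ${joinedByRow.count()}")
    println(s"string-keyed join count: ${joinedByString.count()}")

    sc.stop()
  }
}
```

Timing the two joins on the real data would show whether the key type matters at all in this case.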

Re: rdd join very slow when rdd created from data frame

2016-01-12 Thread Kevin Mellott
Can you please provide the high-level schema of the entities that you are attempting to join? I think you may be able to use a more efficient technique to join them, perhaps by registering the DataFrames as temp tables and constructing a Spark SQL query. Also, which version of
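
The temp-table suggestion could look roughly like the sketch below. It assumes Spark 1.5.x, where DataFrame.registerTempTable and SQLContext.sql are available; the table names (left_t, right_t), column names, and sample rows are hypothetical. The result is converted back with .rdd at the end, since the original poster needs rdds for later processing.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TempTableJoin {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("temp-table-join").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical inputs: two small DataFrames sharing a string key column k.
    val left = sc.parallelize(Seq(("a", "1"), ("b", "2"))).toDF("k", "v")
    val right = sc.parallelize(Seq(("a", "x"), ("b", "y"))).toDF("k", "w")

    // Register both DataFrames so they can be referenced from SQL.
    left.registerTempTable("left_t")
    right.registerTempTable("right_t")

    // Express the join as a Spark SQL query.
    val joined = sqlContext.sql(
      "SELECT l.k, l.v, r.w FROM left_t l JOIN right_t r ON l.k = r.k")

    // Drop back to an RDD[Row] for any further rdd-based processing.
    val joinedRdd = joined.rdd
    joinedRdd.collect().foreach(println)

    sc.stop()
  }
}
```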