it is spark 1.5.1. the dataframe has just 2 columns, both strings. a sql query would probably be more efficient, but it doesn't fit our purpose (we are doing a lot more stuff where we need rdds).
also, i am just trying to understand in general what in that rdd coming from a dataframe could slow things down from 1 min to overnight...

On Tue, Jan 12, 2016 at 5:29 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:

> Can you please provide the high-level schema of the entities that you are
> attempting to join? I think that you may be able to use a more efficient
> technique to join these together; perhaps by registering the DataFrames as
> temp tables and constructing a Spark SQL query.
>
> Also, which version of Spark are you using?
>
> On Tue, Jan 12, 2016 at 4:16 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> we have a join of 2 rdds that's fast (< 1 min), and suddenly it
>> wouldn't even finish overnight anymore. the change was that the rdd was now
>> derived from a dataframe.
>>
>> so the new code that runs forever is something like this:
>> dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
>>
>> any idea why?
>> i imagined it had something to do with recomputing parts of the data
>> frame, but even a small change like this makes the issue go away:
>> dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)),
>> row)).join(...)
>>
>
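for reference, the two variants being compared can be sketched as below. this is just an illustrative sketch against the spark 1.5.x API: the `sqlContext`, the toy two-string-column dataframe, and the `otherRdd` join target are all assumptions (e.g. a spark-shell session), not code from the thread.

```scala
import org.apache.spark.sql.Row

// assumed: a dataframe with 2 string columns, as described above
val df = sqlContext
  .createDataFrame(Seq(("a", "1"), ("b", "2")))
  .toDF("key", "value")

// slow variant: key the Rows coming straight out of dataframe.rdd
val slow = df.rdd.map(row => (Row(row(0)), row))

// fast variant: first materialize a plain copy of each Row via
// Row.fromSeq(row.toSeq), then key it the same way
val fast = df.rdd
  .map(row => Row.fromSeq(row.toSeq))
  .map(row => (Row(row(0)), row))

// both are then joined against some other keyed rdd (hypothetical here):
// slow.join(otherRdd)   // runs "forever" per the report
// fast.join(otherRdd)   // finishes in under a minute per the report
```

the only difference between the two pipelines is the extra `Row.fromSeq(row.toSeq)` copy step, which is what makes the reported slowdown surprising.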