Can you please provide the high-level schema of the entities that you are
attempting to join? You may be able to use a more efficient technique to join
these together, for example by registering the DataFrames as temp tables and
constructing a Spark SQL query (see the sketch below).
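
For concreteness, a minimal sketch of that approach, assuming a Spark 1.x
SQLContext; df1, df2, and the table/column names here are hypothetical
placeholders for your actual entities:

    // assuming sqlContext is the SQLContext the DataFrames were created with
    df1.registerTempTable("left_side")
    df2.registerTempTable("right_side")
    // let Catalyst plan the join instead of hand-rolling an RDD join
    val joined = sqlContext.sql(
      "SELECT * FROM left_side l JOIN right_side r ON l.key = r.key")

Going through Spark SQL lets the planner choose a join strategy (e.g.
broadcasting one side if it is small) rather than always shuffling both RDDs.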

Also, which version of Spark are you using?

On Tue, Jan 12, 2016 at 4:16 PM, Koert Kuipers <ko...@tresata.com> wrote:

> we have a join of 2 RDDs that used to be fast (< 1 min), and suddenly it
> wouldn't even finish overnight anymore. the change was that the RDD is now
> derived from a DataFrame.
>
> so the new code that runs forever is something like this:
> dataframe.rdd.map(row => (Row(row(0)), row)).join(...)
>
> any idea why? i imagined it had something to do with recomputing parts of
> the DataFrame, but even a small change like this makes the issue go away:
> dataframe.rdd.map(row => Row.fromSeq(row.toSeq)).map(row => (Row(row(0)),
> row)).join(...)
>
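
For completeness, one alternative sketch that avoids wrapping the join key in
a Row (whose equals/hashCode the RDD join then has to rely on) is to key by
the raw column value directly. This assumes column 0 is a String; leftDf and
rightDf are placeholder names, and whether this avoids your slowdown would
still need testing:

    // assuming leftDf and rightDf are DataFrames whose column 0 is a String key
    val left  = leftDf.rdd.map(row => (row.getString(0), row))
    val right = rightDf.rdd.map(row => (row.getString(0), row))
    val joined = left.join(right)

That said, if the schemas allow it, the temp-table join above is usually the
better option, since it keeps the whole operation inside Spark SQL.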
