Your code should be:
df_one = df.select('col1', 'col2')
df_two = df.select('col1', 'col3')
Your current code is generating a tuple, and of course df_one and df_two are
different, so the join is yielding a Cartesian product.
Best
Ayan
On Wed, Apr 22, 2015 at 12:42 AM, Karlson <[email protected]> wrote:
> Hi,
>
> can anyone confirm (and if so elaborate on) the following problem?
>
> When I join two DataFrames that originate from the same source DataFrame,
> the resulting DF will explode to a huge number of rows. A quick example:
>
> I load a DataFrame with n rows from disk:
>
> df = sql_context.parquetFile('data.parquet')
>
> Then I create two DataFrames from that source.
>
> df_one = df.select(['col1', 'col2'])
> df_two = df.select(['col1', 'col3'])
>
> Finally I want to (inner) join them back together:
>
> df_joined = df_one.join(df_two, df_one['col1'] == df_two['col1'],
> 'inner')
>
> The key in col1 is unique. The resulting DataFrame should therefore have n
> rows; instead it has n*n rows.
>
> That does not happen when I load df_one and df_two from disk directly. I
> am on Spark 1.3.0, but this also happens on the current 1.4.0 snapshot.
>
--
Best Regards,
Ayan Guha