Re: Join on DataFrames from the same source (Pyspark)

Karlson Wed, 22 Apr 2015 00:20:48 -0700

DataFrames do not have the attributes 'alias' or 'as' in the Python API.


On 2015-04-21 20:41, Michael Armbrust wrote:

This is https://issues.apache.org/jira/browse/SPARK-6231

Unfortunately this is pretty hard to fix, as it's hard for us to
differentiate the two sides of the join without aliases.  However, you
can add an alias as follows:

    from pyspark.sql.functions import col
    df.alias("a").join(df.alias("b"), col("a.col1") == col("b.col1"))
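
To illustrate why the row count can blow up, here is a toy model in plain
Python. It is not Spark code and not how Spark is implemented. It assumes
that, without aliases, both sides of the join condition end up referring
to the same column, so the predicate is always true. The list-of-dicts
rows and the `join` helper are made up for this sketch.

```python
# Toy model only: rows are dicts, a join is a naive nested loop.
rows = [{"col1": i, "col2": i * 10, "col3": i * 100} for i in range(4)]

def join(left, right, condition):
    """Inner join over two lists of dict rows (hypothetical helper)."""
    return [(l, r) for l in left for r in right if condition(l, r)]

# Two projections of the same source, as in the original question.
df_one = [{"col1": r["col1"], "col2": r["col2"]} for r in rows]
df_two = [{"col1": r["col1"], "col3": r["col3"]} for r in rows]

# Assumed failure mode: both sides of == resolve to the same column,
# so every pair of rows matches and the result has n * n rows.
ambiguous = join(df_one, df_two, lambda l, r: l["col1"] == l["col1"])

# With distinct names (the role of aliases "a" and "b"), each side of
# == refers to its own input and the unique key gives n rows.
aliased = join(df_one, df_two, lambda l, r: l["col1"] == r["col1"])

print(len(ambiguous), len(aliased))
```

With n = 4 source rows, the ambiguous predicate keeps all 16 pairs, while
the disambiguated one keeps 4.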

On Tue, Apr 21, 2015 at 8:10 AM, Karlson <ksonsp...@siberie.de> wrote:

Sorry, my code actually was

    df_one = df.select('col1', 'col2')
    df_two = df.select('col1', 'col3')
But in Spark 1.4.0 this does not seem to make any difference anyway, and
the problem is the same with both versions.



On 2015-04-21 17:04, ayan guha wrote:
Your code should be

    df_one = df.select('col1', 'col2')
    df_two = df.select('col1', 'col3')

Your current code is generating a tuple, and of course df_1 and df_2 are
different, so the join is yielding a Cartesian product.

Best
Ayan
On Wed, Apr 22, 2015 at 12:42 AM, Karlson <ksonsp...@siberie.de> wrote:

Hi,

can anyone confirm (and if so elaborate on) the following problem?

When I join two DataFrames that originate from the same source
DataFrame, the resulting DF will explode to a huge number of rows. A
quick example:

I load a DataFrame with n rows from disk:

    df = sql_context.parquetFile('data.parquet')

Then I create two DataFrames from that source.

    df_one = df.select(['col1', 'col2'])
    df_two = df.select(['col1', 'col3'])

Finally I want to (inner) join them back together:

    df_joined = df_one.join(df_two, df_one['col1'] == df_two['col2'], 'inner')

The key in col1 is unique. The resulting DataFrame should have n rows,
however it does have n*n rows.

That does not happen when I load df_one and df_two from disk directly. I
am on Spark 1.3.0, but this also happens on the current 1.4.0 snapshot.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
