It would be good to support this. Could you create a JIRA for it and target it for 1.6?
On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise <michal.monsel...@gmail.com> wrote:

> Hello All,
>
> PySpark currently has two ways of performing a join: specifying a join
> condition or column names.
>
> I would like to perform a join using a list of columns that appear in both
> the left and right DataFrames. I have created an example in this question
> on Stack Overflow.
>
> Basically, I would like to do the following as specified in the
> documentation in /spark/python/pyspark/sql/dataframe.py, row 560, and
> specify a list of column names:
>
>     >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
>
> However, this produces an error.
>
> In JIRA issue SPARK-7197, it is mentioned that the syntax is actually
> different from the one specified in the documentation for joining using a
> condition.
>
> Documentation:
>
>     >>> cond = [df.name == df3.name, df.age == df3.age]
>     >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
>
> JIRA issue:
>
>     a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')
>
> In other words, the join function cannot take a list.
> I was wondering if you could also clarify the correct syntax for
> providing a list of columns.
>
> Thanks,
> Michal

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
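For readers unfamiliar with the requested behavior: joining on a list of column names is shorthand for an equi-join that requires the values to match in every listed column, with each join column appearing only once in the result. A minimal plain-Python sketch of those semantics (the helper function and sample rows are hypothetical, for illustration only, not PySpark API):

```python
# Sketch of multi-column equi-join semantics using lists of dicts.
# Joining on ['name', 'age'] keeps row pairs whose values match in
# BOTH columns; the join columns appear once in each output row.

def join_on_columns(left, right, cols):
    """Inner equi-join of two lists of dicts on the given column names."""
    result = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in cols):
                merged = dict(l)  # start with the left row
                # add right-side columns, skipping the shared join columns
                merged.update({k: v for k, v in r.items() if k not in cols})
                result.append(merged)
    return result

df = [{"name": "Alice", "age": 2, "height": 80}]
df4 = [{"name": "Alice", "age": 2, "grade": "A"},
       {"name": "Bob", "age": 5, "grade": "B"}]

print(join_on_columns(df, df4, ["name", "age"]))
# [{'name': 'Alice', 'age': 2, 'height': 80, 'grade': 'A'}]
```

Only the (Alice, 2) pair survives, since Bob's row matches in neither column; this is the behavior the documented `df.join(df4, ['name', 'age'])` form is expected to provide.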