Re: Join with multiple conditions (In reference to SPARK-7197)

Michal Monselise Wed, 26 Aug 2015 17:00:07 -0700

Davies, I created an issue - SPARK-10246
<https://issues.apache.org/jira/browse/SPARK-10246>


On Tue, Aug 25, 2015 at 12:53 PM, Davies Liu <dav...@databricks.com> wrote:

> It would be good to support this. Could you create a JIRA for it and
> target it for 1.6?
>
> On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise
> <michal.monsel...@gmail.com> wrote:
> >
> > Hello All,
> >
> > PySpark currently has two ways of performing a join: specifying a join
> > condition or column names.
> >
> > I would like to perform a join using a list of columns that appear in
> > both the left and right DataFrames. I have created an example in this
> > question on Stack Overflow.
> >
> > Basically, I would like to do the following, as specified in the
> > documentation in /spark/python/pyspark/sql/dataframe.py (line 560), and
> > specify a list of column names:
> >
> > >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
> >
> > However, this produces an error.
> >
> > In JIRA issue SPARK-7197, it is mentioned that the syntax is actually
> > different from the one specified in the documentation for joining using a
> > condition.
> >
> > Documentation:
> > >>> cond = [df.name == df3.name, df.age == df3.age]
> > >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
> >
> > JIRA Issue:
> >
> > a.join(b, (a.year==b.year) & (a.month==b.month), 'inner')
> >
> >
> > In other words, the join function cannot take a list.
> > I was wondering if you could also clarify the correct syntax for
> > providing a list of columns.
> >
> >
> > Thanks,
> > Michal
> >
> >
>

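For anyone hitting the same error before list support lands, one possible
workaround is to build the single combined condition that SPARK-7197
describes from a list of column names. The sketch below is only an
illustration: join_condition is a hypothetical helper, not part of PySpark,
and it assumes both DataFrames expose every listed column under the same
name.

```python
from functools import reduce
from operator import and_


def join_condition(left, right, names):
    """Build left[n] == right[n] for every n in names, combined with &.

    Hypothetical helper, not a PySpark API. `left` and `right` are meant
    to be DataFrames, but any objects that support lookup by name work,
    which keeps the helper runnable without a Spark session.
    """
    if not names:
        raise ValueError("names must contain at least one column name")
    # One equality test per column name, folded pairwise with &.
    return reduce(and_, [left[n] == right[n] for n in names])


# Assumed usage with the DataFrames from the message above:
# df.join(df4, join_condition(df, df4, ['name', 'age']), 'inner')
```

Because this is still a condition-based join, the result keeps both copies
of each join column, so select from df or df4 explicitly afterwards.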