Davies, I created an issue: SPARK-10246 <https://issues.apache.org/jira/browse/SPARK-10246>
On Tue, Aug 25, 2015 at 12:53 PM, Davies Liu <dav...@databricks.com> wrote:
> It's good to support this, could you create a JIRA for it and target for
> 1.6?
>
> On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise
> <michal.monsel...@gmail.com> wrote:
> >
> > Hello All,
> >
> > PySpark currently has two ways of performing a join: specifying a join
> > condition or column names.
> >
> > I would like to perform a join using a list of columns that appear in
> > both the left and right DataFrames. I have created an example in this
> > question on Stack Overflow.
> >
> > Basically, I would like to do the following as specified in the
> > documentation in /spark/python/pyspark/sql/dataframe.py line 560 and
> > specify a list of column names:
> >
> > >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
> >
> > However, this produces an error.
> >
> > In JIRA issue SPARK-7197, it is mentioned that the syntax is actually
> > different from the one specified in the documentation for joining using
> > a condition.
> >
> > Documentation:
> > >>> cond = [df.name == df3.name, df.age == df3.age]
> > >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
> >
> > JIRA issue:
> > a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')
> >
> > In other words, the join function cannot take a list.
> > I was wondering if you could also clarify what the correct syntax is
> > for providing a list of columns.
> >
> > Thanks,
> > Michal
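For reference, a minimal sketch contrasting the syntaxes discussed above
(assuming a Spark 1.5-era PySpark; the SparkContext setup, app name, and
sample data below are illustrative, not taken from the thread):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "join-syntax-demo")  # hypothetical app name
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame([("Alice", 2), ("Bob", 5)], ["name", "age"])
df4 = sqlContext.createDataFrame([("Alice", 2), ("Bob", 8)], ["name", "age"])

# Working form per SPARK-7197: combine Column conditions with `&`.
df.join(df4, (df.name == df4.name) & (df.age == df4.age), 'inner').collect()

# Documented forms that error as of this thread (the subject of SPARK-10246):
# a list of conditions, and a list of column names common to both sides.
# cond = [df.name == df4.name, df.age == df4.age]
# df.join(df4, cond, 'outer').select(df.name, df4.age).collect()
# df.join(df4, ['name', 'age']).select(df.name, df.age).collect()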