Hello All,

PySpark currently has two ways of performing a join: specifying a join condition or specifying column names.
I would like to perform a join using a list of columns that appear in both the left and right DataFrames. I have created an example in this question on Stack Overflow: <http://stackoverflow.com/questions/32193488/joining-multiple-columns-in-pyspark>.

Basically, I would like to do the following, as specified in the documentation in /spark/python/pyspark/sql/dataframe.py at line 560, and pass a list of column names:

>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()

However, this produces an error. In JIRA issue SPARK-7197 <https://issues.apache.org/jira/browse/SPARK-7197>, it is mentioned that the actual syntax for joining on a condition differs from the one shown in the documentation.

Documentation:

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()

JIRA issue:

a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')

In other words, the join function cannot take a list. I was wondering if you could also clarify what the correct syntax is for providing a list of columns.

Thanks,
Michal
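
P.S. In the meantime, a possible workaround (my own sketch, not something taken from the documentation, and assuming every name in the list exists in both df and df4) is to build a single boolean condition from the column names with functools.reduce and then join on that:

>>> from functools import reduce
>>> cols = ['name', 'age']
>>> # AND together one equality comparison per shared column
>>> cond = reduce(lambda acc, c: acc & (df[c] == df4[c]),
...               cols[1:], df[cols[0]] == df4[cols[0]])
>>> df.join(df4, cond, 'inner').select(df.name, df.age).collect()

If the list syntax from the documentation is supposed to work, I would of course prefer it over this manual construction.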