Hello All,

PySpark currently has two ways of performing a join: specifying a join
condition or column names.

I would like to perform a join using a list of columns that appear in both
the left and right DataFrames. I have created an example in this question
on Stack Overflow:
http://stackoverflow.com/questions/32193488/joining-multiple-columns-in-pyspark

Basically, I would like to pass a list of column names, as shown in the
documentation in /spark/python/pyspark/sql/dataframe.py at line 560:

>>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()

However, this produces an error.
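
For completeness, this is roughly the self-contained snippet I used to
reproduce the error (a minimal sketch; the data and column names are just
placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "join-example")
sqlContext = SQLContext(sc)

# Two small DataFrames that share the 'name' and 'age' columns
df = sqlContext.createDataFrame(
    [("Alice", 5, 80), ("Bob", 10, 85)], ["name", "age", "height"])
df4 = sqlContext.createDataFrame(
    [("Alice", 5, 1), ("Bob", 10, 2)], ["name", "age", "grade"])

# The call from the documentation that raises the error for me
df.join(df4, ['name', 'age']).select(df.name, df.age).collect()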

In JIRA issue SPARK-7197 <https://issues.apache.org/jira/browse/SPARK-7197>,
it is mentioned that the syntax for joining on a condition is actually
different from the one specified in the documentation.

Documentation:

>>> cond = [df.name == df3.name, df.age == df3.age]
>>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()

JIRA Issue:

a.join(b, (a.year==b.year) & (a.month==b.month), 'inner')


In other words, the join condition cannot be given as a list.
I was wondering if you could also clarify the correct syntax for
providing a list of columns.
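
In the meantime, the workaround I am using builds the join condition from
the column list explicitly (a rough sketch; the helper name is just for
illustration, and it assumes every column in the list exists in both
DataFrames):

from functools import reduce

def join_on_columns(left, right, columns, how='inner'):
    # Fold the per-column equality tests into a single join condition
    cond = reduce(lambda acc, c: acc & (left[c] == right[c]),
                  columns[1:],
                  left[columns[0]] == right[columns[0]])
    return left.join(right, cond, how)

# e.g. join_on_columns(df, df4, ['name', 'age']).collect()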


Thanks,
Michal
