It would be good to support this. Could you create a JIRA for it and target it for 1.6?
On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise <michal.monsel...@gmail.com> wrote:

> Hello All,
>
> PySpark currently has two ways of performing a join: specifying a join
> condition or column names.
>
> I would like to perform a join using a list of columns that appear in both
> the left and right DataFrames. I have created an example in this question
> on Stack Overflow.
>
> Basically, I would like to do the following as specified in the
> documentation in /spark/python/pyspark/sql/dataframe.py, row 560, and
> specify a list of column names:
>
>     >>> df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
>
> However, this produces an error.
>
> In JIRA issue SPARK-7197, it is mentioned that the syntax is actually
> different from the one specified in the documentation for joining using a
> condition.
>
> Documentation:
>
>     >>> cond = [df.name == df3.name, df.age == df3.age]
>     >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
>
> JIRA issue:
>
>     a.join(b, (a.year == b.year) & (a.month == b.month), 'inner')
>
> In other words, the join function cannot take a list.
> I was wondering if you could also clarify the correct syntax for
> providing a list of columns.
>
> Thanks,
> Michal

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
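For readers unfamiliar with the requested behavior: joining on a list of column names is shorthand for an equi-join that requires the values to match in every listed column, with each join column appearing only once in the result. A minimal plain-Python sketch of those semantics (the helper function and sample rows are hypothetical, for illustration only, not PySpark API):

```python
# Sketch of multi-column equi-join semantics using lists of dicts.
# Joining on ['name', 'age'] keeps row pairs whose values match in
# BOTH columns; the join columns appear once in each output row.

def join_on_columns(left, right, cols):
    """Inner equi-join of two lists of dicts on the given column names."""
    result = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in cols):
                merged = dict(l)  # start with the left row
                # add right-side columns, skipping the shared join columns
                merged.update({k: v for k, v in r.items() if k not in cols})
                result.append(merged)
    return result

df = [{"name": "Alice", "age": 2, "height": 80}]
df4 = [{"name": "Alice", "age": 2, "grade": "A"},
       {"name": "Bob", "age": 5, "grade": "B"}]

print(join_on_columns(df, df4, ["name", "age"]))
# [{'name': 'Alice', 'age': 2, 'height': 80, 'grade': 'A'}]
```

Only the (Alice, 2) pair survives, since Bob's row matches in neither column; this is the behavior the documented `df.join(df4, ['name', 'age'])` form is expected to provide.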