Re: Join with multiple conditions (In reference to SPARK-7197)

2015-08-26 Thread Michal Monselise
Davies, I created an issue - SPARK-10246
https://issues.apache.org/jira/browse/SPARK-10246

On Tue, Aug 25, 2015 at 12:53 PM, Davies Liu dav...@databricks.com wrote:

 It's good to support this, could you create a JIRA for it and target for
 1.6?

 On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise
 michal.monsel...@gmail.com wrote:
 
  Hello All,
 
  PySpark currently has two ways of performing a join: specifying a join
 condition or column names.
 
  I would like to perform a join using a list of columns that appear in
 both the left and right DataFrames. I have created an example in this
 question on Stack Overflow.
 
  Basically, I would like to do the following as specified in the
 documentation in  /spark/python/pyspark/sql/dataframe.py row 560 and
 specify a list of column names:
 
   df.join(df4, ['name', 'age']).select(df.name, df.age).collect()
 
  However, this produces an error.
 
  In JIRA issue SPARK-7197, it is mentioned that the syntax is actually
 different from the one specified in the documentation for joining using a
 condition.
 
  Documentation:
   cond = [df.name == df3.name, df.age == df3.age]
   df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
 
  JIRA Issue:
 
  a.join(b, (a.year==b.year) & (a.month==b.month), 'inner')
 
 
  In other words, the join function cannot take a list.
  I was wondering if you could also clarify the correct syntax for
 providing a list of columns.
 
 
  Thanks,
  Michal
 
 



Fwd: Join with multiple conditions (In reference to SPARK-7197)

2015-08-25 Thread Michal Monselise
Hello All,

PySpark currently has two ways of performing a join: specifying a join
condition or column names.

I would like to perform a join using a list of columns that appear in both
the left and right DataFrames. I have created an example in this question
on Stack Overflow:
http://stackoverflow.com/questions/32193488/joining-multiple-columns-in-pyspark

Basically, I would like to do the following, as specified in the
documentation in /spark/python/pyspark/sql/dataframe.py at line 560, and
specify a list of column names:

 df.join(df4, ['name', 'age']).select(df.name, df.age).collect()

However, this produces an error.

In JIRA issue SPARK-7197 (https://issues.apache.org/jira/browse/SPARK-7197),
it is mentioned that the syntax for joining using a condition is actually
different from the one specified in the documentation.

Documentation:

 cond = [df.name == df3.name, df.age == df3.age]
 df.join(df3, cond, 'outer').select(df.name, df3.age).collect()

JIRA Issue:

a.join(b, (a.year==b.year) & (a.month==b.month), 'inner')


In other words, the join function cannot take a list.
I was wondering if you could also clarify the correct syntax for
providing a list of columns.


Thanks,
Michal
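(A side note on the `&` in the JIRA example above: Spark's Column objects
overload `==` and `&` to build an expression tree rather than evaluate to a
boolean, and Python's `and` keyword cannot be overloaded, which is why the
condition syntax uses `&` with parentheses. The following is a minimal
pure-Python sketch of that idea, not actual Spark code; the `Col` class and
its string representation are invented for illustration.)

```python
# Sketch (not Spark code): why DataFrame conditions use `&`, not `and`.
# A Column-like object overloads `==` and `&` to build an expression tree;
# Python's `and` cannot be overloaded, so it would not compose expressions.

class Col:
    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Return a comparison node instead of a bool
        return Col("(%s = %s)" % (self.expr, other.expr))

    def __and__(self, other):
        # `&` binds tighter than `==`, hence the parentheses around each
        # comparison in (a.year == b.year) & (a.month == b.month)
        return Col("(%s AND %s)" % (self.expr, other.expr))

cond = (Col("a.year") == Col("b.year")) & (Col("a.month") == Col("b.month"))
print(cond.expr)  # ((a.year = b.year) AND (a.month = b.month))
```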


Re: Join with multiple conditions (In reference to SPARK-7197)

2015-08-25 Thread Davies Liu
It's good to support this, could you create a JIRA for it and target for 1.6?

On Tue, Aug 25, 2015 at 11:21 AM, Michal Monselise
michal.monsel...@gmail.com wrote:

 Hello All,

 PySpark currently has two ways of performing a join: specifying a join 
 condition or column names.

 I would like to perform a join using a list of columns that appear in both 
 the left and right DataFrames. I have created an example in this question on 
 Stack Overflow.

 Basically, I would like to do the following as specified in the documentation 
 in  /spark/python/pyspark/sql/dataframe.py row 560 and specify a list of 
 column names:

  df.join(df4, ['name', 'age']).select(df.name, df.age).collect()

 However, this produces an error.

 In JIRA issue SPARK-7197, it is mentioned that the syntax is actually 
 different from the one specified in the documentation for joining using a 
 condition.

 Documentation:
  cond = [df.name == df3.name, df.age == df3.age]
  df.join(df3, cond, 'outer').select(df.name, df3.age).collect()

 JIRA Issue:

 a.join(b, (a.year==b.year) & (a.month==b.month), 'inner')


 In other words, the join function cannot take a list.
 I was wondering if you could also clarify the correct syntax for
 providing a list of columns.


 Thanks,
 Michal
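(For readers of the archive: the semantics being asked for — joining on a
list of column names — is an inner equi-join that requires equality on every
listed column and emits each join column once. A pure-Python sketch of that
behavior, with made-up sample rows rather than real PySpark DataFrames:)

```python
# Pure-Python sketch (not PySpark) of joining on a list of column names:
# an inner equi-join requiring equality on every listed column.

def join_on_columns(left, right, on):
    """Inner-join two lists of row dicts on the named columns."""
    out = []
    for l in left:
        for r in right:
            if all(l[c] == r[c] for c in on):
                # Merge the rows; the join columns appear only once,
                # as they would in the DataFrame result
                merged = dict(r)
                merged.update(l)
                out.append(merged)
    return out

df = [{"name": "Alice", "age": 2, "height": 80}]
df4 = [{"name": "Alice", "age": 2, "weight": 20},
       {"name": "Bob", "age": 5, "weight": 45}]

print(join_on_columns(df, df4, ["name", "age"]))
# [{'name': 'Alice', 'age': 2, 'weight': 20, 'height': 80}]
```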



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org