Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
still feels like a bug to have to create unique names before a join.

On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote:

 You can declare the schema with unique names before creation of df.
 On 27 Jun 2015 13:01, Axel Dahl a...@whisperstream.com wrote:


 I have the following code:

 from pyspark import SQLContext

 d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
 d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice',
 'country': 'ire', 'colour':'green'}]

 r1 = sc.parallelize(d1)
 r2 = sc.parallelize(d2)

 sqlContext = SQLContext(sc)
 df1 = sqlContext.createDataFrame(d1)
 df2 = sqlContext.createDataFrame(d2)
 df1.join(df2, df1.name == df2.name and df1.country == df2.country,
 'left_outer').collect()


 When I run it I get the following (notice in the first row, all join
 keys are taken from the right side and so are blanked out):

 [Row(age=2, country=None, name=None, colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 I would expect to get (though ideally without duplicate columns):
 [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 The workaround for now is this rather clunky piece of code:
 df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
 'name2').withColumnRenamed('country', 'country2')
 df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
 'left_outer').collect()

 So to me it looks like a bug, but am I doing something wrong?

 Thanks,

 -Axel







Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
Yeah, you shouldn't have to rename the columns before joining them.

Do you see the same behavior on 1.3 vs 1.4?

Nick
On Sat, 27 Jun 2015 at 02:51, Axel Dahl a...@whisperstream.com wrote:

 still feels like a bug to have to create unique names before a join.








Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
I've only tested on 1.4, but I imagine 1.3 is the same, or a lot of people's
code would be failing right now.

On Saturday, June 27, 2015, Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Yeah, you shouldn't have to rename the columns before joining them.

 Do you see the same behavior on 1.3 vs 1.4?

 Nick
 On Sat, 27 Jun 2015 at 02:51, Axel Dahl a...@whisperstream.com wrote:

 still feels like a bug to have to create unique names before a join.








Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
I would test it against 1.3 to be sure, because it could -- though unlikely
-- be a regression. For example, I recently stumbled upon this issue
https://issues.apache.org/jira/browse/SPARK-8670 which was specific to
1.4.

On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl a...@whisperstream.com wrote:

 I've only tested on 1.4, but I imagine 1.3 is the same, or a lot of people's
 code would be failing right now.








Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Axel Dahl
created as SPARK-8685

https://issues.apache.org/jira/browse/SPARK-8685

@Yin, thx, have fixed sample code with the correct names.

On Sat, Jun 27, 2015 at 1:56 PM, Yin Huai yh...@databricks.com wrote:

 Axel,

 Can you file a jira and attach your code in the description of the jira?
 This looks like a bug.

 For the third row of df1, the name is alice instead of carol, right?
 Otherwise, carol should appear in the expected output.

 Btw, to get rid of those columns with the same name after the join, you
 can use select to pick columns you want to include in the results.

 Thanks,

 Yin

 On Sat, Jun 27, 2015 at 11:29 AM, Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I would test it against 1.3 to be sure, because it could -- though
 unlikely -- be a regression. For example, I recently stumbled upon this
 issue https://issues.apache.org/jira/browse/SPARK-8670 which was
 specific to 1.4.

 On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl a...@whisperstream.com
 wrote:

 I've only tested on 1.4, but I imagine 1.3 is the same, or a lot of
 people's code would be failing right now.










dataframe left joins are not working as expected in pyspark

2015-06-26 Thread Axel Dahl
I have the following code:

from pyspark import SQLContext

d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice',
'country': 'ire', 'colour':'green'}]

r1 = sc.parallelize(d1)
r2 = sc.parallelize(d2)

sqlContext = SQLContext(sc)
df1 = sqlContext.createDataFrame(d1)
df2 = sqlContext.createDataFrame(d2)
df1.join(df2, df1.name == df2.name and df1.country == df2.country,
'left_outer').collect()


When I run it I get the following (notice in the first row, all join keys
are taken from the right side and so are blanked out):

[Row(age=2, country=None, name=None, colour=None, country=None, name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
name=u'bob'),
Row(age=3, country=u'ire', name=u'alice', colour=u'green', country=u'ire',
name=u'alice')]

I would expect to get (though ideally without duplicate columns):
[Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
name=None),
Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
name=u'bob'),
Row(age=3, country=u'ire', name=u'alice', colour=u'green', country=u'ire',
name=u'alice')]

The workaround for now is this rather clunky piece of code:
df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
'name2').withColumnRenamed('country', 'country2')
df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
'left_outer').collect()

So to me it looks like a bug, but am I doing something wrong?

Thanks,

-Axel
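
One plausible mechanism for the behavior reported above (hedged; it is not confirmed anywhere in this thread): Python's `and` keyword cannot be overloaded, so for two Column objects `cond1 and cond2` just checks the truthiness of the left operand (true for any plain object) and returns the right operand, silently collapsing the join condition to the `country` comparison alone. The collapse can be shown with a stand-in class; `FakeColumn` is hypothetical and is not the real pyspark `Column`:

```python
class FakeColumn:
    """Minimal stand-in for a pyspark Column expression (hypothetical)."""
    def __init__(self, expr):
        self.expr = expr

    # `&` can be overloaded, so it can build a real compound expression ...
    def __and__(self, other):
        return FakeColumn("(%s AND %s)" % (self.expr, other.expr))

    # ... but `and` cannot: Python evaluates the left operand's truthiness
    # (True for any ordinary object) and simply returns the right operand.

name_eq = FakeColumn("name = name2")
country_eq = FakeColumn("country = country2")

collapsed = name_eq and country_eq   # just country_eq; name_eq is dropped
combined = name_eq & country_eq      # a genuine two-key condition

print(collapsed.expr)  # country = country2
print(combined.expr)   # (name = name2 AND country = country2)
```

If this is the mechanism, the intended two-key join would be spelled `df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer')`, with parentheses because `&` binds more tightly than `==`. I believe later PySpark versions raise an error when a Column is used in a boolean context, precisely to catch this mistake.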