Re: dataframe left joins are not working as expected in pyspark
Still feels like a bug to have to create unique names before a join.

On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote:
> You can declare the schema with unique names before creation of the df.
Re: dataframe left joins are not working as expected in pyspark
Yeah, you shouldn't have to rename the columns before joining them. Do you see the same behavior on 1.3 vs 1.4?

Nick

On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl a...@whisperstream.com wrote:
> Still feels like a bug to have to create unique names before a join.
Re: dataframe left joins are not working as expected in pyspark
I've only tested on 1.4, but I imagine 1.3 is the same, or a lot of people's code would be failing right now.

On Saturday, June 27, 2015, Nicholas Chammas nicholas.cham...@gmail.com wrote:
> Yeah, you shouldn't have to rename the columns before joining them. Do you see the same behavior on 1.3 vs 1.4?
Re: dataframe left joins are not working as expected in pyspark
I would test it against 1.3 to be sure, because it could, though unlikely, be a regression. For example, I recently stumbled upon SPARK-8670 (https://issues.apache.org/jira/browse/SPARK-8670), which was specific to 1.4.

On Sat, Jun 27, 2015 at 12:28 PM, Axel Dahl a...@whisperstream.com wrote:
> I've only tested on 1.4, but I imagine 1.3 is the same, or a lot of people's code would be failing right now.
Re: dataframe left joins are not working as expected in pyspark
Created as SPARK-8685 (https://issues.apache.org/jira/browse/SPARK-8685).

@Yin, thanks, I have fixed the sample code with the correct names.

On Sat, Jun 27, 2015 at 1:56 PM, Yin Huai yh...@databricks.com wrote:
> Axel,
>
> Can you file a JIRA and attach your code in its description? This looks like a bug. For the third row of df1, the name is alice instead of carol, right? Otherwise, carol should appear in the expected output.
>
> By the way, to get rid of the columns with the same name after the join, you can use select to pick the columns you want to include in the results.
>
> Thanks,
> Yin
dataframe left joins are not working as expected in pyspark
I have the following code:

    from pyspark import SQLContext

    d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
          {'name': 'alice', 'country': 'jpn', 'age': 2},
          {'name': 'carol', 'country': 'ire', 'age': 3}]
    d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
          {'name': 'alice', 'country': 'ire', 'colour': 'green'}]

    r1 = sc.parallelize(d1)
    r2 = sc.parallelize(d2)

    sqlContext = SQLContext(sc)
    df1 = sqlContext.createDataFrame(d1)
    df2 = sqlContext.createDataFrame(d2)
    df1.join(df2,
             df1.name == df2.name and df1.country == df2.country,
             'left_outer').collect()

When I run it I get the following (notice that in the first row, all join keys are taken from the right side and so are blanked out):

    [Row(age=2, country=None, name=None, colour=None, country=None, name=None),
     Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'),
     Row(age=3, country=u'ire', name=u'alice', colour=u'green', country=u'ire', name=u'alice')]

I would expect to get (though ideally without the duplicate columns):

    [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None, name=None),
     Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', name=u'bob'),
     Row(age=3, country=u'ire', name=u'alice', colour=u'green', country=u'ire', name=u'alice')]

The workaround for now is this rather clunky piece of code:

    df2 = sqlContext.createDataFrame(d2) \
        .withColumnRenamed('name', 'name2') \
        .withColumnRenamed('country', 'country2')
    df1.join(df2,
             df1.name == df2.name2 and df1.country == df2.country2,
             'left_outer').collect()

So to me it looks like a bug, but am I doing something wrong?

Thanks,
-Axel
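[Editor's note] A likely explanation for the behavior above: Python's `and` keyword cannot be overloaded, so in `df1.name == df2.name and df1.country == df2.country` the interpreter takes the truthiness of the first Column (any plain object is truthy) and simply returns the second one, silently discarding the name condition. PySpark's operator for combining Column conditions is `&`, with parentheses: `(df1.name == df2.name) & (df1.country == df2.country)`. A minimal sketch of the mechanism, using a hypothetical `Col` class as a stand-in (this is not the real pyspark.sql.Column):

```python
# Hypothetical stand-in for a Column-like expression object, used only to
# show why `and` misbehaves while `&` works.
class Col:
    def __init__(self, expr):
        self.expr = expr

    def __eq__(self, other):
        # Like Column.__eq__: builds an expression instead of returning a bool.
        return Col("(%s = %s)" % (self.expr, getattr(other, "expr", other)))

    def __and__(self, other):
        # Like Column.__and__ (the `&` operator): combines two expressions.
        return Col("(%s AND %s)" % (self.expr, other.expr))

c1 = Col("df1.name") == Col("df2.name")
c2 = Col("df1.country") == Col("df2.country")

# `and` calls bool() on c1; Col defines no __bool__, so c1 is truthy and the
# whole expression evaluates to just c2 -- the name condition is discarded.
print((c1 and c2).expr)   # (df1.country = df2.country)

# `&` actually combines the two conditions.
print((c1 & c2).expr)     # ((df1.name = df2.name) AND (df1.country = df2.country))
```

Under this reading, the reported result is consistent with a join on country alone. (The duplicate `country`/`name` fields also make the Row repr ambiguous, so the displayed values may come from either side of the join.)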