created as SPARK-8685 https://issues.apache.org/jira/browse/SPARK-8685
@Yin, thx, have fixed sample code with the correct names.

On Sat, Jun 27, 2015 at 1:56 PM, Yin Huai <yh...@databricks.com> wrote:

> Axel,
>
> Can you file a jira and attach your code in the description of the jira?
> This looks like a bug.
>
> For the third row of df1, the name is "alice" instead of "carol", right?
> Otherwise, "carol" should appear in the expected output.
>
> Btw, to get rid of those columns with the same name after the join, you
> can use select to pick the columns you want to include in the results.
>
> Thanks,
>
> Yin
>
> On Sat, Jun 27, 2015 at 11:29 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> I would test it against 1.3 to be sure, because it could -- though
>> unlikely -- be a regression. For example, I recently stumbled upon this
>> issue <https://issues.apache.org/jira/browse/SPARK-8670> which was
>> specific to 1.4.
>>
>> On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl <a...@whisperstream.com> wrote:
>>
>>> I've only tested on 1.4, but imagine 1.3 is the same, or a lot of
>>> people's code would be failing right now.
>>>
>>> On Saturday, June 27, 2015, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Yeah, you shouldn't have to rename the columns before joining them.
>>>>
>>>> Do you see the same behavior on 1.3 vs 1.4?
>>>>
>>>> Nick
>>>>
>>>> On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl <a...@whisperstream.com> wrote:
>>>>
>>>>> still feels like a bug to have to create unique names before a join.
>>>>>
>>>>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> You can declare the schema with unique names before creation of df.
>>>>>> On 27 Jun 2015 13:01, "Axel Dahl" <a...@whisperstream.com> wrote:
>>>>>>
>>>>>>> I have the following code:
>>>>>>>
>>>>>>> from pyspark.sql import SQLContext
>>>>>>>
>>>>>>> d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
>>>>>>>       {'name': 'alice', 'country': 'jpn', 'age': 2},
>>>>>>>       {'name': 'carol', 'country': 'ire', 'age': 3}]
>>>>>>> d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
>>>>>>>       {'name': 'alice', 'country': 'ire', 'colour': 'green'}]
>>>>>>>
>>>>>>> r1 = sc.parallelize(d1)
>>>>>>> r2 = sc.parallelize(d2)
>>>>>>>
>>>>>>> sqlContext = SQLContext(sc)
>>>>>>> df1 = sqlContext.createDataFrame(d1)
>>>>>>> df2 = sqlContext.createDataFrame(d2)
>>>>>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>>>>>>>          'left_outer').collect()
>>>>>>>
>>>>>>> When I run it I get the following (notice that in the first row, all
>>>>>>> join keys are taken from the right side and so are blanked out):
>>>>>>>
>>>>>>> [Row(age=2, country=None, name=None, colour=None, country=None,
>>>>>>>      name=None),
>>>>>>>  Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>>>>>>>      country=u'usa', name=u'bob'),
>>>>>>>  Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>>>>      country=u'ire', name=u'alice')]
>>>>>>>
>>>>>>> I would expect to get (though ideally without duplicate columns):
>>>>>>>
>>>>>>> [Row(age=2, country=u'ire', name=u'alice', colour=None,
>>>>>>>      country=None, name=None),
>>>>>>>  Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>>>>>>>      country=u'usa', name=u'bob'),
>>>>>>>  Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>>>>      country=u'ire', name=u'alice')]
>>>>>>>
>>>>>>> The workaround for now is this rather clunky piece of code:
>>>>>>>
>>>>>>> df2 = sqlContext.createDataFrame(d2) \
>>>>>>>     .withColumnRenamed('name', 'name2') \
>>>>>>>     .withColumnRenamed('country', 'country2')
>>>>>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>>>>>>>          'left_outer').collect()
>>>>>>>
>>>>>>> So to me it looks like a bug, but am I doing something wrong?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Axel
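[Editor's note, not raised in the thread above.] A likely cause of the odd output is the join condition itself rather than the join: Python's `and` keyword cannot be overloaded, so in `df1.name == df2.name and df1.country == df2.country` the first Column is checked for truthiness and discarded, leaving a join on `country` alone. That would explain why carol ('ire') picks up alice's d2 row. PySpark Column conjunction requires `&`, with parentheses because `&` binds tighter than `==`, e.g. `df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer')`. The toy class below (a hypothetical stand-in, not PySpark) sketches the mechanism:

```python
# Why `and` misbehaves on column expressions: `x and y` evaluates x for
# truthiness (an expression object is truthy by default) and, if truthy,
# simply returns y. `&` dispatches to __and__ and can build a combined
# expression. Minimal stand-in column class to demonstrate:

class Col:
    """Toy column expression that records how it was combined."""
    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # `&` lets us construct a real conjunction of both conditions.
        return Col("({} AND {})".format(self.expr, other.expr))

name_match = Col("name1 = name2")
country_match = Col("country1 = country2")

# `and` silently discards the first condition:
wrong = name_match and country_match
# `&` keeps both:
right = name_match & country_match

print(wrong.expr)   # country1 = country2
print(right.expr)   # (name1 = name2 AND country1 = country2)
```

With `&` and parentheses in the original snippet, the left outer join keeps all three df1 rows and matches only bob on both keys, so renaming columns is not needed to get a correct join condition (though select/rename is still useful for deduplicating the output columns, as Yin notes above).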