created as SPARK-8685 https://issues.apache.org/jira/browse/SPARK-8685
@Yin, thx, have fixed sample code with the correct names.

On Sat, Jun 27, 2015 at 1:56 PM, Yin Huai <yh...@databricks.com> wrote:

> Axel,
>
> Can you file a jira and attach your code in the description of the jira?
> This looks like a bug.
>
> For the third row of df1, the name is "alice" instead of "carol", right?
> Otherwise, "carol" should appear in the expected output.
>
> Btw, to get rid of those columns with the same name after the join, you
> can use select to pick the columns you want to include in the results.
>
> Thanks,
>
> Yin
>
> On Sat, Jun 27, 2015 at 11:29 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> I would test it against 1.3 to be sure, because it could -- though
>> unlikely -- be a regression. For example, I recently stumbled upon this
>> issue <https://issues.apache.org/jira/browse/SPARK-8670> which was
>> specific to 1.4.
>>
>> On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl <a...@whisperstream.com> wrote:
>>
>>> I've only tested on 1.4, but imagine 1.3 is the same, or a lot of
>>> people's code would be failing right now.
>>>
>>> On Saturday, June 27, 2015, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>
>>>> Yeah, you shouldn't have to rename the columns before joining them.
>>>>
>>>> Do you see the same behavior on 1.3 vs 1.4?
>>>>
>>>> Nick
>>>>
>>>> On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl <a...@whisperstream.com> wrote:
>>>>
>>>>> still feels like a bug to have to create unique names before a join.
>>>>>
>>>>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> You can declare the schema with unique names before creation of df.
>>>>>> On 27 Jun 2015 13:01, "Axel Dahl" <a...@whisperstream.com> wrote:
>>>>>>
>>>>>>> I have the following code:
>>>>>>>
>>>>>>> from pyspark.sql import SQLContext
>>>>>>>
>>>>>>> d1 = [{'name': 'bob', 'country': 'usa', 'age': 1},
>>>>>>>       {'name': 'alice', 'country': 'jpn', 'age': 2},
>>>>>>>       {'name': 'carol', 'country': 'ire', 'age': 3}]
>>>>>>> d2 = [{'name': 'bob', 'country': 'usa', 'colour': 'red'},
>>>>>>>       {'name': 'alice', 'country': 'ire', 'colour': 'green'}]
>>>>>>>
>>>>>>> r1 = sc.parallelize(d1)
>>>>>>> r2 = sc.parallelize(d2)
>>>>>>>
>>>>>>> sqlContext = SQLContext(sc)
>>>>>>> df1 = sqlContext.createDataFrame(d1)
>>>>>>> df2 = sqlContext.createDataFrame(d2)
>>>>>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country,
>>>>>>>          'left_outer').collect()
>>>>>>>
>>>>>>> When I run it I get the following (notice that in the first row, all
>>>>>>> join keys are taken from the right side and so are blanked out):
>>>>>>>
>>>>>>> [Row(age=2, country=None, name=None, colour=None, country=None,
>>>>>>>      name=None),
>>>>>>>  Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>>>>>>>      country=u'usa', name=u'bob'),
>>>>>>>  Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>>>>      country=u'ire', name=u'alice')]
>>>>>>>
>>>>>>> I would expect to get (though ideally without duplicate columns):
>>>>>>>
>>>>>>> [Row(age=2, country=u'ire', name=u'alice', colour=None,
>>>>>>>      country=None, name=None),
>>>>>>>  Row(age=1, country=u'usa', name=u'bob', colour=u'red',
>>>>>>>      country=u'usa', name=u'bob'),
>>>>>>>  Row(age=3, country=u'ire', name=u'alice', colour=u'green',
>>>>>>>      country=u'ire', name=u'alice')]
>>>>>>>
>>>>>>> The workaround for now is this rather clunky piece of code:
>>>>>>>
>>>>>>> df2 = sqlContext.createDataFrame(d2) \
>>>>>>>     .withColumnRenamed('name', 'name2') \
>>>>>>>     .withColumnRenamed('country', 'country2')
>>>>>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
>>>>>>>          'left_outer').collect()
>>>>>>>
>>>>>>> So to me it looks like a bug, but am I doing something wrong?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> -Axel
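[Editor's note, not raised in the thread above.] A likely cause of the odd output is the join condition itself rather than the join: Python's `and` keyword cannot be overloaded, so in `df1.name == df2.name and df1.country == df2.country` the first Column is checked for truthiness and discarded, leaving a join on `country` alone. That would explain why carol ('ire') picks up alice's d2 row. PySpark Column conjunction requires `&`, with parentheses because `&` binds tighter than `==`, e.g. `df1.join(df2, (df1.name == df2.name) & (df1.country == df2.country), 'left_outer')`. The toy class below (a hypothetical stand-in, not PySpark) sketches the mechanism:

```python
# Why `and` misbehaves on column expressions: `x and y` evaluates x for
# truthiness (an expression object is truthy by default) and, if truthy,
# simply returns y. `&` dispatches to __and__ and can build a combined
# expression. Minimal stand-in column class to demonstrate:

class Col:
    """Toy column expression that records how it was combined."""
    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # `&` lets us construct a real conjunction of both conditions.
        return Col("({} AND {})".format(self.expr, other.expr))

name_match = Col("name1 = name2")
country_match = Col("country1 = country2")

# `and` silently discards the first condition:
wrong = name_match and country_match
# `&` keeps both:
right = name_match & country_match

print(wrong.expr)   # country1 = country2
print(right.expr)   # (name1 = name2 AND country1 = country2)
```

With `&` and parentheses in the original snippet, the left outer join keeps all three df1 rows and matches only bob on both keys, so renaming columns is not needed to get a correct join condition (though select/rename is still useful for deduplicating the output columns, as Yin notes above).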