I would test it against 1.3 to be sure, because it could -- though unlikely -- be a regression. For example, I recently stumbled upon this issue <https://issues.apache.org/jira/browse/SPARK-8670> which was specific to 1.4.
On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl <a...@whisperstream.com> wrote: > I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's > code would be failing right now. > > On Saturday, June 27, 2015, Nicholas Chammas <nicholas.cham...@gmail.com> > wrote: > >> Yeah, you shouldn't have to rename the columns before joining them. >> >> Do you see the same behavior on 1.3 vs 1.4? >> >> Nick >> 2015년 6월 27일 (토) 오전 2:51, Axel Dahl <a...@whisperstream.com>님이 작성: >> >>> still feels like a bug to have to create unique names before a join. >>> >>> On Fri, Jun 26, 2015 at 9:51 PM, ayan guha <guha.a...@gmail.com> wrote: >>> >>>> You can declare the schema with unique names before creation of df. >>>> On 27 Jun 2015 13:01, "Axel Dahl" <a...@whisperstream.com> wrote: >>>> >>>>> >>>>> I have the following code: >>>>> >>>>> from pyspark import SQLContext >>>>> >>>>> d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice', >>>>> 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}] >>>>> d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, >>>>> {'name':'alice', 'country': 'ire', 'colour':'green'}] >>>>> >>>>> r1 = sc.parallelize(d1) >>>>> r2 = sc.parallelize(d2) >>>>> >>>>> sqlContext = SQLContext(sc) >>>>> df1 = sqlContext.createDataFrame(d1) >>>>> df2 = sqlContext.createDataFrame(d2) >>>>> df1.join(df2, df1.name == df2.name and df1.country == df2.country, >>>>> 'left_outer').collect() >>>>> >>>>> >>>>> When I run it I get the following, (notice in the first row, all join >>>>> keys are take from the right-side and so are blanked out): >>>>> >>>>> [Row(age=2, country=None, name=None, colour=None, country=None, >>>>> name=None), >>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', >>>>> name=u'bob'), >>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green', >>>>> country=u'ire', name=u'alice')] >>>>> >>>>> I would expect to get (though ideally without duplicate columns): >>>>> [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None, >>>>> name=None), >>>>> Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa', >>>>> name=u'bob'), >>>>> Row(age=3, country=u'ire', name=u'alice', colour=u'green', >>>>> country=u'ire', name=u'alice')] >>>>> >>>>> The workaround for now is this rather clunky piece of code: >>>>> df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name', >>>>> 'name2').withColumnRenamed('country', 'country2') >>>>> df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2, >>>>> 'left_outer').collect() >>>>> >>>>> So to me it looks like a bug, but am I doing something wrong? >>>>> >>>>> Thanks, >>>>> >>>>> -Axel >>>>> >>>>> >>>>> >>>>> >>>>> >>>