Oh, I didn't know about that. Thanks for the pointer, Rakesh. I wonder why they did that instead of taking the cue from SQL and prefixing column names with a specifiable DataFrame alias. The suffix approach seems quite ugly.
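For reference, a minimal sketch of the suffix behaviour in pandas (the `_df1`/`_df2` suffix strings are just illustrative choices, not anything pandas mandates):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [4], "other": ["I know"]})
df2 = pd.DataFrame({"a": [4], "other": ["I dunno"]})

# join on the index; identically named columns are disambiguated
# by appending lsuffix to the left frame's names and rsuffix to the right's
joined = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
print(list(joined.columns))  # → ['a_df1', 'other_df1', 'a_df2', 'other_df2']
```

So pandas pushes the disambiguation into the column names themselves at join time, rather than keeping a notion of which parent frame a column came from.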
Nick

On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com> wrote:

> To add to the above discussion, Pandas allows suffixing and prefixing to
> solve this issue:
>
> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
>
> Rakesh
>
> On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
>> aliases.
>>
>> This is essential when joining DataFrames that have identically named
>> columns.
>>
>>     >>> # PySpark 1.3.1
>>     >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>     >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>     >>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>     >>> df12
>>     DataFrame[a: bigint, other: string, a: bigint, other: string]
>>     >>> df12.printSchema()
>>     root
>>      |-- a: long (nullable = true)
>>      |-- other: string (nullable = true)
>>      |-- a: long (nullable = true)
>>      |-- other: string (nullable = true)
>>
>> Now, trying any one of the following:
>>
>>     df12.select('a')
>>     df12['a']
>>     df12.a
>>
>> yields this:
>>
>>     org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>>     could be: a#360L, a#358L.;
>>
>> The same goes for accessing the other field.
>>
>> This is good, but what are we supposed to do in this case?
>>
>> SQL solves this by fully qualifying the column name with the table name,
>> and also by offering table aliasing
>> <http://dba.stackexchange.com/a/5991/2660> for the case where you are
>> joining a table to itself.
>>
>> If we translate this directly into DataFrames lingo, perhaps it would look
>> something like:
>>
>>     df12['df1.a']
>>     df12['df2.other']
>>
>> But I’m not sure how this fits into the larger API. This certainly isn’t
>> backwards compatible with how joins are done now.
>>
>> So what’s the recommended course of action here? Having to unique-ify all
>> your column names before joining doesn’t sound like a nice solution.
>>
>> Nick
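For what it's worth, something close to the SQL-alias style can at least be emulated by prefixing every column before the join. A sketch in pandas terms (the `df1.`/`df2.` prefix strings are just illustrative, and this is still the "unique-ify your column names" workaround rather than true aliasing):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [4], "other": ["I know"]})
df2 = pd.DataFrame({"a": [4], "other": ["I dunno"]})

# emulate SQL-style table aliases by prefixing each frame's columns up front
left = df1.add_prefix("df1.")
right = df2.add_prefix("df2.")

# join on the index; no name collisions remain, so no suffixes are needed
joined = left.join(right)
print(list(joined.columns))  # → ['df1.a', 'df1.other', 'df2.a', 'df2.other']
print(joined["df1.a"][0])    # → 4
```

The resulting `df['alias.column']` access pattern is essentially the hypothetical `df12['df1.a']` syntax from the message above, just paid for eagerly at join time instead of being tracked by the API.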