You can actually just use df1['a'] in the projection to differentiate between the two columns. For example, in Scala (similar code works in Python):
scala> val df1 = Seq((1, "one")).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: string]

scala> val df2 = Seq((2, "two")).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: string]

scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
+-+
|a|
+-+
|1|
+-+

On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>
> I wonder why they did that, as opposed to taking the cue from SQL and
> prefixing column names with a specifiable dataframe alias. The suffix
> approach seems quite ugly.
>
> Nick
>
> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com>
> wrote:
>
> > To add to the above discussion, Pandas allows suffixing and prefixing
> > to solve this issue:
> >
> > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
> >
> > Rakesh
> >
> > On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <
> > nicholas.cham...@gmail.com> wrote:
> >
> >> DataFrames, as far as I can tell, don't have an equivalent to SQL's
> >> table aliases.
> >>
> >> This is essential when joining dataframes that have identically named
> >> columns.
> >>
> >> >>> # PySpark 1.3.1
> >> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >> >>> df12
> >> DataFrame[a: bigint, other: string, a: bigint, other: string]
> >> >>> df12.printSchema()
> >> root
> >>  |-- a: long (nullable = true)
> >>  |-- other: string (nullable = true)
> >>  |-- a: long (nullable = true)
> >>  |-- other: string (nullable = true)
> >>
> >> Now, trying any one of the following:
> >>
> >> df12.select('a')
> >> df12['a']
> >> df12.a
> >>
> >> yields this:
> >>
> >> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
> >> could be: a#360L, a#358L.;
> >>
> >> Same goes for accessing the other field.
> >>
> >> This is good, but what are we supposed to do in this case?
> >>
> >> SQL solves this by fully qualifying the column name with the table
> >> name, and also offering table aliasing
> >> <http://dba.stackexchange.com/a/5991/2660> in the case where you are
> >> joining a table to itself.
> >>
> >> If we translate this directly into DataFrames lingo, perhaps it would
> >> look something like:
> >>
> >> df12['df1.a']
> >> df12['df2.other']
> >>
> >> But I'm not sure how this fits into the larger API. This certainly
> >> isn't backwards compatible with how joins are done now.
> >>
> >> So what's the recommended course of action here?
> >>
> >> Having to unique-ify all your column names before joining doesn't
> >> sound like a nice solution.
> >>
> >> Nick
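For reference, the SQL table-aliasing behavior described above can be sketched with the standard-library sqlite3 module; the table and alias names here are illustrative, not from the thread:

```python
import sqlite3

# Two tables with identically named columns, joined via aliases --
# the SQL mechanism the thread contrasts with DataFrame joins.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER, other TEXT)")
conn.execute("CREATE TABLE t2 (a INTEGER, other TEXT)")
conn.execute("INSERT INTO t1 VALUES (4, 'I know')")
conn.execute("INSERT INTO t2 VALUES (4, 'I dunno')")

# The aliases x and y fully qualify the otherwise-ambiguous columns,
# so selecting x.other vs. y.other is unambiguous.
row = conn.execute(
    "SELECT x.a, x.other, y.other "
    "FROM t1 AS x JOIN t2 AS y ON x.a = y.a"
).fetchone()
print(row)  # (4, 'I know', 'I dunno')
```

The same aliasing also handles a self-join (`FROM t1 AS x JOIN t1 AS y`), which is the case the dba.stackexchange link addresses.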
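And the pandas suffixing approach Rakesh points to can be sketched like this (column names mirror the PySpark example above; assumes pandas is installed):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [4], "other": ["I know"]})
df2 = pd.DataFrame({"a": [4], "other": ["I dunno"]})

# lsuffix/rsuffix rename the overlapping columns up front,
# instead of leaving an ambiguous reference to resolve later.
joined = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
print(list(joined.columns))  # ['a_df1', 'other_df1', 'a_df2', 'other_df2']
```

This illustrates why the suffix approach reads as "ugly" in the discussion: every overlapping column is renamed, rather than being qualified only where a reference is actually ambiguous.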