Ah, neat. So in the example I gave earlier, I’d do this to get columns from specific dataframes:
>>> df12.select(df1['a'], df2['other'])
DataFrame[a: bigint, other: string]
>>> df12.select(df1['a'], df2['other']).show()
a other
4 I dunno

This perhaps should be documented in an example in the docs somewhere. I'll open a PR for that, I suppose.

Nick

On Fri, May 8, 2015 at 3:01 PM Reynold Xin <r...@databricks.com> wrote:

> You can actually just use df1['a'] in a projection to differentiate.
>
> e.g. in Scala (similar things work in Python):
>
> scala> val df1 = Seq((1, "one")).toDF("a", "b")
> df1: org.apache.spark.sql.DataFrame = [a: int, b: string]
>
> scala> val df2 = Seq((2, "two")).toDF("a", "b")
> df2: org.apache.spark.sql.DataFrame = [a: int, b: string]
>
> scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
> +-+
> |a|
> +-+
> |1|
> +-+
>
> On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>>
>> I wonder why they did that, as opposed to taking the cue from SQL and prefixing column names with a specifiable dataframe alias. The suffix approach seems quite ugly.
>>
>> Nick
>>
>> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com> wrote:
>>
>> > To add to the above discussion, pandas allows suffixing and prefixing to solve this issue:
>> >
>> > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
>> >
>> > Rakesh
>> >
>> > On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>> >
>> >> DataFrames, as far as I can tell, don't have an equivalent to SQL's table aliases.
>> >>
>> >> This is essential when joining dataframes that have identically named columns.
>> >>
>> >> # PySpark 1.3.1
>> >> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>> >> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>> >> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
>> >> >>> df12
>> >> DataFrame[a: bigint, other: string, a: bigint, other: string]
>> >> >>> df12.printSchema()
>> >> root
>> >>  |-- a: long (nullable = true)
>> >>  |-- other: string (nullable = true)
>> >>  |-- a: long (nullable = true)
>> >>  |-- other: string (nullable = true)
>> >>
>> >> Now, trying any one of the following:
>> >>
>> >> df12.select('a')
>> >> df12['a']
>> >> df12.a
>> >>
>> >> yields this:
>> >>
>> >> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#360L, a#358L.;
>> >>
>> >> Same goes for accessing the other field.
>> >>
>> >> This is good, but what are we supposed to do in this case?
>> >>
>> >> SQL solves this by fully qualifying the column name with the table name, and also offering table aliasing <http://dba.stackexchange.com/a/5991/2660> in the case where you are joining a table to itself.
>> >>
>> >> If we translate this directly into DataFrames lingo, perhaps it would look something like:
>> >>
>> >> df12['df1.a']
>> >> df12['df2.other']
>> >>
>> >> But I'm not sure how this fits into the larger API. This certainly isn't backwards compatible with how joins are done now.
>> >>
>> >> So what's the recommended course of action here?
>> >>
>> >> Having to unique-ify all your column names before joining doesn't sound like a nice solution.
>> >>
>> >> Nick
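A minimal PySpark sketch of the alias route the original question asks about, assuming DataFrame.alias and pyspark.sql.functions.col from the Spark 1.3 Python API (the alias names "d1" and "d2" here are just illustrative):

    from pyspark.sql.functions import col

    # Give each side of the join its own alias, much like a SQL table alias.
    d1 = df1.alias("d1")
    d2 = df2.alias("d2")

    joined = d1.join(d2, col("d1.a") == col("d2.a"))

    # Columns can now be qualified by alias to resolve the ambiguity.
    joined.select(col("d1.a"), col("d2.other")).show()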
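For comparison, the pandas suffix behavior Rakesh links to looks roughly like this (a sketch against pandas.DataFrame.join; the _l/_r suffixes are arbitrary):

    import pandas as pd

    df1 = pd.DataFrame({'a': [4], 'other': ['I know']})
    df2 = pd.DataFrame({'a': [4], 'other': ['I dunno']})

    # join() is index-on-index by default; overlapping column names must be
    # disambiguated via lsuffix/rsuffix, otherwise pandas raises an error.
    joined = df1.join(df2, lsuffix='_l', rsuffix='_r')
    print(joined.columns.tolist())  # ['a_l', 'other_l', 'a_r', 'other_r']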