Ah, neat. So in the example I gave earlier, I’d do this to get columns from
specific dataframes:

>>> df12.select(df1['a'], df2['other'])
DataFrame[a: bigint, other: string]
>>> df12.select(df1['a'], df2['other']).show()
a other
4 I dunno

Perhaps this should be documented with an example somewhere in the docs.
I’ll open a PR for that, I suppose.

Nick

On Fri, May 8, 2015 at 3:01 PM Reynold Xin <r...@databricks.com> wrote:

> You can actually just use df1['a'] in projection to differentiate.
>
> e.g. in Scala (similar things work in Python):
>
>
> scala> val df1 = Seq((1, "one")).toDF("a", "b")
> df1: org.apache.spark.sql.DataFrame = [a: int, b: string]
>
> scala> val df2 = Seq((2, "two")).toDF("a", "b")
> df2: org.apache.spark.sql.DataFrame = [a: int, b: string]
>
> scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
> +-+
> |a|
> +-+
> |1|
> +-+
>
>
> On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>>
>> I wonder why they did that, as opposed to taking the cue from SQL and
>> prefixing column names with a specifiable dataframe alias. The suffix
>> approach seems quite ugly.
>>
>> Nick
>>
>> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com>
>> wrote:
>>
>> > To add to the above discussion, Pandas allows suffixing and prefixing
>> > to solve this issue:
>> >
>> > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
>> >
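>> > For instance, a rough sketch of how that looks (the data and suffix
>> > names here are just illustrative):
>> >
>> > import pandas as pd
>> >
>> > left = pd.DataFrame({'a': [4], 'other': ['I know']})
>> > right = pd.DataFrame({'a': [4], 'other': ['I dunno']})
>> >
>> > # join on the index; overlapping column names get the given suffixes
>> > joined = left.join(right, lsuffix='_left', rsuffix='_right')
>> > # columns are now: a_left, other_left, a_right, other_right
>> >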
>> > Rakesh
>> >
>> > On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <
>> > nicholas.cham...@gmail.com> wrote:
>> >
>> >> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s
>> >> table aliases.
>> >>
>> >> This is essential when joining dataframes that have identically named
>> >> columns.
>> >>
>> >> >>> # PySpark 1.3.1
>> >> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>> >> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>> >> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
>> >> >>> df12
>> >> DataFrame[a: bigint, other: string, a: bigint, other: string]
>> >> >>> df12.printSchema()
>> >> root
>> >>  |-- a: long (nullable = true)
>> >>  |-- other: string (nullable = true)
>> >>  |-- a: long (nullable = true)
>> >>  |-- other: string (nullable = true)
>> >>
>> >> Now, trying any one of the following:
>> >>
>> >> df12.select('a')
>> >> df12['a']
>> >> df12.a
>> >>
>> >> yields this:
>> >>
>> >> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>> >> could be: a#360L, a#358L.;
>> >>
>> >> Same goes for accessing the 'other' field.
>> >>
>> >> This is good, but what are we supposed to do in this case?
>> >>
>> >> SQL solves this by fully qualifying the column name with the table
>> >> name, and also offering table aliasing
>> >> <http://dba.stackexchange.com/a/5991/2660> in the case where you are
>> >> joining a table to itself.
>> >>
>> >> If we translate this directly into DataFrames lingo, perhaps it would
>> >> look something like:
>> >>
>> >> df12['df1.a']
>> >> df12['df2.other']
>> >>
>> >> But I’m not sure how this fits into the larger API. This certainly
>> >> isn’t backwards compatible with how joins are done now.
>> >>
>> >> So what’s the recommended course of action here?
>> >>
>> >> Having to unique-ify all your column names before joining doesn’t sound
>> >> like a nice solution.
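>> >>
>> >> To be concrete, the unique-ify workaround would be a sketch like this
>> >> (the renamed columns are arbitrary):
>> >>
>> >> df1r = df1.withColumnRenamed('a', 'df1_a').withColumnRenamed('other', 'df1_other')
>> >> df2r = df2.withColumnRenamed('a', 'df2_a').withColumnRenamed('other', 'df2_other')
>> >> df12 = df1r.join(df2r, df1r['df1_a'] == df2r['df2_a'])
>> >> df12.select('df1_a', 'df2_other')  # no longer ambiguous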
>> >>
>> >> Nick
>> >>
