I've opened an issue for a few doc fixes that the PySpark DataFrame API
needs: SPARK-7505 <https://issues.apache.org/jira/browse/SPARK-7505>

On Fri, May 8, 2015 at 3:10 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Ah, neat. So in the example I gave earlier, I’d do this to get columns
> from specific dataframes:
>
> >>> df12.select(df1['a'], df2['other'])
> DataFrame[a: bigint, other: string]
> >>> df12.select(df1['a'], df2['other']).show()
> a other
> 4 I dunno
>
> This perhaps should be documented in an example in the docs somewhere.
> I’ll open a PR for that I suppose.
>
> Nick
>
> On Fri, May 8, 2015 at 3:01 PM Reynold Xin <r...@databricks.com> wrote:
>
>> You can actually just use df1['a'] in projection to differentiate.
>>
>> e.g. in Scala (similar things work in Python):
>>
>>
>> scala> val df1 = Seq((1, "one")).toDF("a", "b")
>> df1: org.apache.spark.sql.DataFrame = [a: int, b: string]
>>
>> scala> val df2 = Seq((2, "two")).toDF("a", "b")
>> df2: org.apache.spark.sql.DataFrame = [a: int, b: string]
>>
>> scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
>> +-+
>> |a|
>> +-+
>> |1|
>> +-+
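>>
>> In PySpark the same pattern would look roughly like this (a sketch against
>> the 1.3 API, not run here; createDataFrame is only used to build small
>> test DataFrames):
>>
>> >>> df1 = sqlContext.createDataFrame([(1, "one")], ["a", "b"])
>> >>> df2 = sqlContext.createDataFrame([(2, "two")], ["a", "b"])
>> >>> # refer to each column through the DataFrame it came from
>> >>> df1.join(df2, df1["a"] == df2["a"] - 1).select(df1["a"]).show()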
>>
>>
>>
>>
>> On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>>>
>>> I wonder why they did that, as opposed to taking the cue from SQL and
>>> prefixing column names with a specifiable dataframe alias. The suffix
>>> approach seems quite ugly.
>>>
>>> Nick
>>>
>>> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com>
>>> wrote:
>>>
>>> > To add to the above discussion, Pandas allows suffixing and prefixing
>>> > to solve this issue:
>>> >
>>> > http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
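>>> >
>>> > Roughly, the suffix approach looks like this (a sketch; the suffix
>>> > strings are just illustrative):
>>> >
>>> > >>> import pandas as pd
>>> > >>> left = pd.DataFrame({'a': [4], 'other': ['I know']})
>>> > >>> right = pd.DataFrame({'a': [4], 'other': ['I dunno']})
>>> > >>> # join() works on the index here; the overlapping column names
>>> > >>> # come out as a_df1, other_df1, a_df2, other_df2
>>> > >>> left.join(right, lsuffix='_df1', rsuffix='_df2')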
>>> >
>>> > Rakesh
>>> >
>>> > On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <
>>> > nicholas.cham...@gmail.com> wrote:
>>> >
>>> >> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s
>>> >> table aliases.
>>> >>
>>> >> This is essential when joining dataframes that have identically named
>>> >> columns.
>>> >>
>>> >> >>> # PySpark 1.3.1
>>> >> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>> >> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>> >> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>> >> >>> df12
>>> >> DataFrame[a: bigint, other: string, a: bigint, other: string]
>>> >> >>> df12.printSchema()
>>> >> root
>>> >>  |-- a: long (nullable = true)
>>> >>  |-- other: string (nullable = true)
>>> >>  |-- a: long (nullable = true)
>>> >>  |-- other: string (nullable = true)
>>> >>
>>> >> Now, trying any one of the following:
>>> >>
>>> >> df12.select('a')
>>> >> df12['a']
>>> >> df12.a
>>> >>
>>> >> yields this:
>>> >>
>>> >> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>>> >> could be: a#360L, a#358L.;
>>> >>
>>> >> Same goes for accessing the other field.
>>> >>
>>> >> This is good, but what are we supposed to do in this case?
>>> >>
>>> >> SQL solves this by fully qualifying the column name with the table
>>> >> name, and also offering table aliasing
>>> >> <http://dba.stackexchange.com/a/5991/2660> in the case where you are
>>> >> joining a table to itself.
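>>> >>
>>> >> For comparison, that SQL route is reachable from PySpark via temp
>>> >> tables, where the registered names play the role of the qualifiers
>>> >> (rough sketch; the table names are arbitrary):
>>> >>
>>> >> >>> df1.registerTempTable('t1')
>>> >> >>> df2.registerTempTable('t2')
>>> >> >>> sqlContext.sql('SELECT t1.a, t2.other FROM t1 JOIN t2 ON t1.a = t2.a').show()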
>>> >>
>>> >> If we translate this directly into DataFrames lingo, perhaps it would
>>> >> look something like:
>>> >>
>>> >> df12['df1.a']
>>> >> df12['df2.other']
>>> >>
>>> >> But I’m not sure how this fits into the larger API. This certainly
>>> >> isn’t backwards compatible with how joins are done now.
>>> >>
>>> >> So what’s the recommended course of action here?
>>> >>
>>> >> Having to unique-ify all your column names before joining doesn’t
>>> >> sound like a nice solution.
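>>> >>
>>> >> That workaround would be something along these lines (a sketch; the
>>> >> renamed columns are picked arbitrarily):
>>> >>
>>> >> >>> df2r = df2.withColumnRenamed('a', 'a2').withColumnRenamed('other', 'other2')
>>> >> >>> df12 = df1.join(df2r, df1['a'] == df2r['a2'])
>>> >> >>> df12.select('a', 'other2')  # no longer ambiguous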
>>> >>
>>> >> Nick
>>> >>
>>> >
>>>
>>
>>
