Oh, I didn't know about that. Thanks for the pointer, Rakesh. I wonder why they did that instead of taking the cue from SQL and prefixing column names with a specifiable DataFrame alias. The suffix approach seems quite ugly.
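For reference, a minimal sketch of the suffix behaviour in pandas (the `_df1`/`_df2` suffix strings are just illustrative choices, not anything pandas mandates):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [4], "other": ["I know"]})
df2 = pd.DataFrame({"a": [4], "other": ["I dunno"]})

# join on the index; identically named columns are disambiguated
# by appending lsuffix to the left frame's names and rsuffix to the right's
joined = df1.join(df2, lsuffix="_df1", rsuffix="_df2")
print(list(joined.columns))  # → ['a_df1', 'other_df1', 'a_df2', 'other_df2']
```

So pandas pushes the disambiguation into the column names themselves at join time, rather than keeping a notion of which parent frame a column came from.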
Nick

On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com> wrote:

> To add to the above discussion, Pandas allows suffixing and prefixing to
> solve this issue:
>
> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
>
> Rakesh
>
> On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s table
>> aliases.
>>
>> This is essential when joining DataFrames that have identically named
>> columns.
>>
>>     >>> # PySpark 1.3.1
>>     >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>     >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>     >>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>     >>> df12
>>     DataFrame[a: bigint, other: string, a: bigint, other: string]
>>     >>> df12.printSchema()
>>     root
>>      |-- a: long (nullable = true)
>>      |-- other: string (nullable = true)
>>      |-- a: long (nullable = true)
>>      |-- other: string (nullable = true)
>>
>> Now, trying any one of the following:
>>
>>     df12.select('a')
>>     df12['a']
>>     df12.a
>>
>> yields this:
>>
>>     org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>>     could be: a#360L, a#358L.;
>>
>> The same goes for accessing the other field.
>>
>> This is good, but what are we supposed to do in this case?
>>
>> SQL solves this by fully qualifying the column name with the table name,
>> and also by offering table aliasing
>> <http://dba.stackexchange.com/a/5991/2660> for the case where you are
>> joining a table to itself.
>>
>> If we translate this directly into DataFrames lingo, perhaps it would look
>> something like:
>>
>>     df12['df1.a']
>>     df12['df2.other']
>>
>> But I’m not sure how this fits into the larger API. This certainly isn’t
>> backwards compatible with how joins are done now.
>>
>> So what’s the recommended course of action here? Having to unique-ify all
>> your column names before joining doesn’t sound like a nice solution.
>>
>> Nick
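For what it's worth, something close to the SQL-alias style can at least be emulated by prefixing every column before the join. A sketch in pandas terms (the `df1.`/`df2.` prefix strings are just illustrative, and this is still the "unique-ify your column names" workaround rather than true aliasing):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [4], "other": ["I know"]})
df2 = pd.DataFrame({"a": [4], "other": ["I dunno"]})

# emulate SQL-style table aliases by prefixing each frame's columns up front
left = df1.add_prefix("df1.")
right = df2.add_prefix("df2.")

# join on the index; no name collisions remain, so no suffixes are needed
joined = left.join(right)
print(list(joined.columns))  # → ['df1.a', 'df1.other', 'df2.a', 'df2.other']
print(joined["df1.a"][0])    # → 4
```

The resulting `df['alias.column']` access pattern is essentially the hypothetical `df12['df1.a']` syntax from the message above, just paid for eagerly at join time instead of being tracked by the API.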