You can actually just use df1['a'] in the projection to disambiguate the columns.

e.g. in Scala (similar things work in Python):


scala> val df1 = Seq((1, "one")).toDF("a", "b")
df1: org.apache.spark.sql.DataFrame = [a: int, b: string]

scala> val df2 = Seq((2, "two")).toDF("a", "b")
df2: org.apache.spark.sql.DataFrame = [a: int, b: string]

scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
+---+
|  a|
+---+
|  1|
+---+




On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>
> I wonder why they did that, as opposed to taking the cue from SQL and
> prefixing column names with a specifiable dataframe alias. The suffix
> approach seems quite ugly.
>
> Nick
>
> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com>
> wrote:
>
> > To add to the above discussion, Pandas, allows suffixing and prefixing to
> > solve this issue
> >
> >
> >
> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
> >
> > Rakesh
> >
> > On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <
> > nicholas.cham...@gmail.com> wrote:
> >
> >> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s
> >> table aliases.
> >>
> >> This is essential when joining dataframes that have identically named
> >> columns.
> >>
> >> >>> # PySpark 1.3.1
> >> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >> >>> df12
> >> DataFrame[a: bigint, other: string, a: bigint, other: string]
> >> >>> df12.printSchema()
> >> root
> >>  |-- a: long (nullable = true)
> >>  |-- other: string (nullable = true)
> >>  |-- a: long (nullable = true)
> >>  |-- other: string (nullable = true)
> >>
> >> Now, trying any one of the following:
> >>
> >> df12.select('a')
> >> df12['a']
> >> df12.a
> >>
> >> yields this:
> >>
> >> org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
> >> could be: a#360L, a#358L.;
> >>
> >> Same goes for accessing the other field.
> >>
> >> This is good, but what are we supposed to do in this case?
> >>
> >> SQL solves this by fully qualifying the column name with the table name,
> >> and also offering table aliasing
> >> <http://dba.stackexchange.com/a/5991/2660> in the case where you are
> >> joining a table to itself.
> >>
> >> If we translate this directly into DataFrames lingo, perhaps it would
> >> look something like:
> >>
> >> df12['df1.a']
> >> df12['df2.other']
> >>
> >> But I’m not sure how this fits into the larger API. This certainly isn’t
> >> backwards compatible with how joins are done now.
> >>
> >> So what’s the recommended course of action here?
> >>
> >> Having to unique-ify all your column names before joining doesn’t sound
> >> like a nice solution.
> >>
> >> Nick
> >> ​
> >>
> >
>
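
For comparison, the pandas suffixing approach Rakesh points to looks roughly like this (column values here are illustrative, mirroring Nick's example):

```python
import pandas as pd

df1 = pd.DataFrame({"a": [4], "other": ["I know"]})
df2 = pd.DataFrame({"a": [4], "other": ["I dunno"]})

# DataFrame.join joins on the index; with overlapping column names,
# lsuffix/rsuffix rename the clashing columns instead of qualifying
# them with a table alias the way SQL does.
joined = df1.join(df2, lsuffix="_l", rsuffix="_r")
print(list(joined.columns))  # ['a_l', 'other_l', 'a_r', 'other_r']
```

Hence the "suffix approach seems quite ugly" remark above: the rename is baked into the result schema rather than being a scoped alias you can drop later.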
