I've opened an issue for a few doc fixes that the PySpark DataFrame API needs: SPARK-7505 <https://issues.apache.org/jira/browse/SPARK-7505>
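In the meantime, here is the disambiguation pattern from this thread in one self-contained snippet (a minimal sketch against the 1.3-era PySpark API, assuming a live `sc` and `sqlContext` as provided by the PySpark shell; untested as written):

    # Sketch, Spark 1.3-era PySpark; sc and sqlContext come from the shell.
    df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
    df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))

    # Both frames have columns `a` and `other`, so the joined frame carries
    # duplicate names and a bare df12.select('a') raises an AnalysisException.
    df12 = df1.join(df2, df1['a'] == df2['a'])

    # Selecting through the parent frames resolves the ambiguity.
    df12.select(df1['a'], df2['other']).show()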
On Fri, May 8, 2015 at 3:10 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:

> Ah, neat. So in the example I gave earlier, I’d do this to get columns
> from specific dataframes:
>
>     >>> df12.select(df1['a'], df2['other'])
>     DataFrame[a: bigint, other: string]
>     >>> df12.select(df1['a'], df2['other']).show()
>     a other
>     4 I dunno
>
> This perhaps should be documented in an example in the docs somewhere.
> I’ll open a PR for that, I suppose.
>
> Nick
>
> On Fri, May 8, 2015 at 3:01 PM Reynold Xin <r...@databricks.com> wrote:
>
>> You can actually just use df1['a'] in a projection to differentiate.
>>
>> E.g. in Scala (similar things work in Python):
>>
>>     scala> val df1 = Seq((1, "one")).toDF("a", "b")
>>     df1: org.apache.spark.sql.DataFrame = [a: int, b: string]
>>
>>     scala> val df2 = Seq((2, "two")).toDF("a", "b")
>>     df2: org.apache.spark.sql.DataFrame = [a: int, b: string]
>>
>>     scala> df1.join(df2, df1("a") === df2("a") - 1).select(df1("a")).show()
>>     +-+
>>     |a|
>>     +-+
>>     |1|
>>     +-+
>>
>> On Fri, May 8, 2015 at 11:53 AM, Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>
>>> Oh, I didn't know about that. Thanks for the pointer, Rakesh.
>>>
>>> I wonder why they did that, as opposed to taking the cue from SQL and
>>> prefixing column names with a specifiable dataframe alias. The suffix
>>> approach seems quite ugly.
>>>
>>> Nick
>>>
>>> On Fri, May 8, 2015 at 2:47 PM Rakesh Chalasani <vnit.rak...@gmail.com> wrote:
>>>
>>>> To add to the above discussion: Pandas allows suffixing and prefixing
>>>> column names to solve this issue:
>>>>
>>>> http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.join.html
>>>>
>>>> Rakesh
>>>>
>>>> On Fri, May 8, 2015 at 2:42 PM Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> DataFrames, as far as I can tell, don’t have an equivalent to SQL’s
>>>>> table aliases.
>>>>>
>>>>> This is essential when joining dataframes that have identically named
>>>>> columns.
>>>>>
>>>>>     # PySpark 1.3.1
>>>>>     >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
>>>>>     >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
>>>>>     >>> df12 = df1.join(df2, df1['a'] == df2['a'])
>>>>>     >>> df12
>>>>>     DataFrame[a: bigint, other: string, a: bigint, other: string]
>>>>>     >>> df12.printSchema()
>>>>>     root
>>>>>      |-- a: long (nullable = true)
>>>>>      |-- other: string (nullable = true)
>>>>>      |-- a: long (nullable = true)
>>>>>      |-- other: string (nullable = true)
>>>>>
>>>>> Now, trying any one of the following:
>>>>>
>>>>>     df12.select('a')
>>>>>     df12['a']
>>>>>     df12.a
>>>>>
>>>>> yields this:
>>>>>
>>>>>     org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous,
>>>>>     could be: a#360L, a#358L.;
>>>>>
>>>>> The same goes for accessing the other field.
>>>>>
>>>>> This is good, but what are we supposed to do in this case?
>>>>>
>>>>> SQL solves this by fully qualifying the column name with the table name,
>>>>> and also offering table aliasing <http://dba.stackexchange.com/a/5991/2660>
>>>>> in the case where you are joining a table to itself.
>>>>>
>>>>> If we translate this directly into DataFrames lingo, perhaps it would
>>>>> look something like:
>>>>>
>>>>>     df12['df1.a']
>>>>>     df12['df2.other']
>>>>>
>>>>> But I’m not sure how this fits into the larger API. This certainly isn’t
>>>>> backwards compatible with how joins are done now.
>>>>>
>>>>> So what’s the recommended course of action here?
>>>>>
>>>>> Having to unique-ify all your column names before joining doesn’t sound
>>>>> like a nice solution.
>>>>>
>>>>> Nick
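P.S. For the self-join case specifically, PySpark’s DataFrame does have an alias() method (added in 1.3) which, combined with pyspark.sql.functions.col, behaves much like a SQL table alias. A sketch (untested; the alias names df_l and df_r are arbitrary):

    from pyspark.sql.functions import col

    # Alias the same frame twice so qualified names can tell the sides apart.
    df_l = df1.alias('df_l')
    df_r = df1.alias('df_r')

    joined = df_l.join(df_r, col('df_l.a') == col('df_r.a'))

    # Qualified column references resolve the otherwise-ambiguous names.
    joined.select(col('df_l.a'), col('df_r.other')).show()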