In Spark 1.4 there is an extra heuristic to detect self-joins, which means that even option 1 will still work.
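
For code that also has to run on Spark 1.3, where that heuristic is not available, a function that may be handed the same DataFrame twice can fall back on aliases (option 2 in the quoted message). The following is only a minimal sketch under some assumptions: it uses a local master, and `SelfJoinSketch` and `joinOnKey` are hypothetical names made up for illustration, not part of Spark.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object SelfJoinSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local master, just to keep the sketch self-contained.
    val conf = new SparkConf().setAppName("self-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // brings the $"..." column syntax into scope

    val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), (4, "b")))

    // Hypothetical helper: joins two DataFrames on their `_1` column.
    // Each side gets its own alias, so the join condition refers to the
    // columns by qualified name rather than through `left("_1")` and
    // `right("_1")`, which are the same attribute when left and right
    // are the same DataFrame.
    def joinOnKey(left: DataFrame, right: DataFrame): DataFrame =
      left.as("l").join(right.as("r"), $"l._1" === $"r._1")

    // The self-join case: the same DataFrame is passed on both sides.
    joinOnKey(df, df).printSchema()

    sc.stop()
  }
}
```

The result still contains two `_1` columns, so any later reference to them has to go through the `l` and `r` qualifiers as well.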

On Sun, May 17, 2015 at 9:31 AM, Jan-Paul Bultmann <janpaulbultm...@me.com> wrote:

> It’s probably not advisable to use option 1, though, since it will break
> when `df = df2`, which can easily happen when you’ve written a function
> that does such a join internally.
>
> This could be solved by an identity-like function that returns the
> DataFrame unchanged but with a different identity.
> `.as` would be such a candidate, but that doesn’t work.
>
> Thoughts?
>
>
> On 16 May 2015, at 00:55, Michael Armbrust <mich...@databricks.com> wrote:
>
> There are several ways to solve this ambiguity:
>
> *1. Use the DataFrames to get the attribute, so it's already "resolved"
> and not just a string we need to map to a DataFrame.*
>
> df.join(df2, df("_1") === df2("_1"))
>
> *2. Use aliases.*
>
> df.as('a).join(df2.as('b), $"a._1" === $"b._1")
>
> *3. Rename the columns as you suggested.*
>
> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
>
> *4. (Spark 1.4 only) Use def join(right: DataFrame, usingColumn: String):
> DataFrame*
>
> df.join(df2, "_1")
>
> This has the added benefit of only outputting a single _1 column.
>
> On Fri, May 15, 2015 at 3:44 PM, Justin Yip <yipjus...@prediction.io>
> wrote:
>
>> Hello,
>>
>> I would like to know if there are recommended ways of preventing
>> ambiguous columns when joining DataFrames. When we join DataFrames, we
>> often join on columns with identical names. I could rename the columns
>> on the right DataFrame, as shown in the following code. Is there a
>> better way to achieve this?
>>
>> scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), (4, "b")))
>> df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>
>> scala> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20), (3, 30), (4, 40)))
>> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
>>
>> scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
>>
>> Thanks.
>>
>> Justin