Re: Best practice to avoid ambiguous columns in DataFrame.join

Michael Armbrust Sun, 17 May 2015 12:42:12 -0700

In Spark 1.4 there is an extra heuristic to detect self-joins, which means
that even option 1 will still work.
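
For code that also has to run on releases before 1.4, a helper that joins internally can avoid relying on that heuristic by aliasing both sides (option 2 in the quoted message). The following is only a minimal sketch under assumptions: `SelfJoinSketch` and `joinOnKey` are hypothetical names, the key column is assumed to be `_1`, and the context is created locally purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

object SelfJoinSketch {
  // Hypothetical helper: joins two DataFrames on their "_1" column.
  // Both sides are aliased, so the join condition refers to "l._1" and "r._1"
  // and stays unambiguous even when left and right are the same DataFrame.
  def joinOnKey(left: DataFrame, right: DataFrame): DataFrame =
    left.as("l").join(right.as("r"), col("l._1") === col("r._1"))

  def main(args: Array[String]): Unit = {
    // Local context for illustration only; in spark-shell, sqlContext already exists.
    val conf = new SparkConf().setAppName("self-join-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sqlContext = new SQLContext(sc)
      val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), (4, "b")))
      // Same DataFrame passed on both sides: the case raised in the quoted message.
      joinOnKey(df, df).printSchema()
    } finally {
      sc.stop()
    }
  }
}
```

The result still carries both `_1` columns; they can be told apart through the `l` and `r` qualifiers.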


On Sun, May 17, 2015 at 9:31 AM, Jan-Paul Bultmann <janpaulbultm...@me.com>
wrote:

> It’s probably not advisable to use option 1, though, since it will break
> when `df` and `df2` are the same DataFrame, which can easily happen when
> you’ve written a function that does such a join internally.
>
> This could be solved by an identity-like function that returns the
> DataFrame unchanged but with a different identity.
> `.as` would be such a candidate, but that doesn’t work.
>
> Thoughts?
>
>
> On 16 May 2015, at 00:55, Michael Armbrust <mich...@databricks.com> wrote:
>
> There are several ways to solve this ambiguity:
>
> *1. Use the DataFrames to get the attribute, so it's already "resolved" and
> not just a string we need to map to a DataFrame.*
>
> df.join(df2, df("_1") === df2("_1"))
>
> *2. Use aliases*
>
> df.as('a).join(df2.as('b), $"a._1" === $"b._1")
>
> *3. Rename the columns as you suggested.*
>
> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
>
> *4. (Spark 1.4 only) Use def join(right: DataFrame, usingColumn: String):
> DataFrame*
>
> df.join(df2, "_1")
>
> This has the added benefit of only outputting a single _1 column.
>
> On Fri, May 15, 2015 at 3:44 PM, Justin Yip <yipjus...@prediction.io>
> wrote:
>
>> Hello,
>>
>> I would like to know if there are recommended ways of preventing
>> ambiguous columns when joining DataFrames. When we join DataFrames, we
>> usually join on columns with identical names. I could rename the columns
>> on the right DataFrame, as in the following code. Is there a better way
>> to achieve this?
>>
>> scala> val df = sqlContext.createDataFrame(Seq((1, "a"), (2, "b"), (3, "b"), (4, "b")))
>> df: org.apache.spark.sql.DataFrame = [_1: int, _2: string]
>>
>> scala> val df2 = sqlContext.createDataFrame(Seq((1, 10), (2, 20), (3, 30), (4, 40)))
>> df2: org.apache.spark.sql.DataFrame = [_1: int, _2: int]
>>
>> scala> df.join(df2.withColumnRenamed("_1", "right_key"), $"_1" === $"right_key").printSchema
>>
>> Thanks.
>>
>> Justin
>>
>
>
>
