Hi Devs,

I am forwarding this from the user mailing list. I agree that a `<=>` version of `join(Dataset[_], Seq[String])` would be useful.
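
For concreteness, such an overload might look something like the sketch below. This is purely one possible shape, not an existing or agreed-upon API; the method name `joinNullSafe` is my own:

```scala
import org.apache.spark.sql.{DataFrame, Dataset}

// Hypothetical counterpart to the existing
// join(right: Dataset[_], usingColumns: Seq[String]), which joins with ===;
// this variant would join the given columns with <=> instead and, like the
// existing method, keep only one copy of each join column in the result.
trait NullSafeJoinSketch {
  def joinNullSafe(right: Dataset[_], usingColumns: Seq[String]): DataFrame
}
```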

Does anyone on the PMC consider this useful enough to be added to the Dataset API? I'd be happy to create a PR in that case.

Enrico



-------- Forwarded Message --------
Subject:        dataframe null safe joins given a list of columns
Date:   Thu, 6 Feb 2020 12:45:11 +0000
From:   Marcelo Valle <marcelo.va...@ktech.com>
To:     user @spark <u...@spark.apache.org>



I was surprised that I couldn't find a way of solving this in Spark, as it must be a very common problem for users, so I decided to ask here.

Consider the code below:

```
import spark.implicits._  // for toDF; assumes a SparkSession named spark (the shell imports this automatically)

val joinColumns = Seq("a", "b")
val df1 = Seq(("a1", "b1", "c1"), ("a2", "b2", "c2"), ("a4", null, "c4")).toDF("a", "b", "c")
val df2 = Seq(("a1", "b1", "d1"), ("a3", "b3", "d3"), ("a4", null, "d4")).toDF("a", "b", "d")
df1.join(df2, joinColumns).show()
```

The output is:

```
+---+---+---+---+
|  a|  b|  c|  d|
+---+---+---+---+
| a1| b1| c1| d1|
+---+---+---+---+
```

But I want it to be:

```
+---+-----+---+---+
|  a|    b|  c|  d|
+---+-----+---+---+
| a1|   b1| c1| d1|
| a4| null| c4| d4|
+---+-----+---+---+
```

The `df1.join(df2, joinColumns)` syntax has the advantage that it doesn't create duplicate columns by default. However, it joins using the `===` operator, not the null-safe `<=>`.
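
The difference matters because `===` follows SQL three-valued logic: `null === null` evaluates to null rather than true, so rows where `b` is null on both sides never satisfy the join condition. A quick way to see this (a minimal sketch, assuming an active `spark` session):

```scala
import org.apache.spark.sql.functions.lit

// null === null evaluates to null; null <=> null evaluates to true
spark.range(1).select(
  (lit(null) === lit(null)).as("eq"),          // null
  (lit(null) <=> lit(null)).as("eqNullSafe")   // true
).show()
```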

Using the following syntax:

```
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b")).show()
```

would produce:

```
+---+----+---+---+----+---+
|  a|   b|  c|  a|   b|  d|
+---+----+---+---+----+---+
| a1|  b1| c1| a1|  b1| d1|
| a4|null| c4| a4|null| d4|
+---+----+---+---+----+---+
```

So to get the result I really want, I must do:

```
df1.join(df2, df1("a") <=> df2("a") && df1("b") <=> df2("b"))
  .drop(df2("a"))
  .drop(df2("b"))
  .show()
+---+----+---+---+
|  a|   b|  c|  d|
+---+----+---+---+
| a1|  b1| c1| d1|
| a4|null| c4| d4|
+---+----+---+---+
```

This works, but it is really verbose, especially when you have many join columns.

Is there a better way of solving this without needing a utility method? I run into this same problem in every Spark project.
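
For reference, the utility I keep rewriting looks roughly like this (a sketch only; the name `nullSafeJoin` is my own, not an existing Spark API):

```scala
import org.apache.spark.sql.DataFrame

// Build the <=> condition over all join columns, then drop the duplicated
// right-side columns, mimicking what join(df2, joinColumns) does for ===.
def nullSafeJoin(left: DataFrame, right: DataFrame, usingColumns: Seq[String]): DataFrame = {
  val condition = usingColumns.map(c => left(c) <=> right(c)).reduce(_ && _)
  usingColumns.foldLeft(left.join(right, condition)) { (joined, c) =>
    joined.drop(right(c))
  }
}

// nullSafeJoin(df1, df2, joinColumns).show() then prints the desired table above.
```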


