[spark] branch branch-3.2 updated: [SPARK-36732][BUILD][FOLLOWUP] Fix dependency manifest

2021-09-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 6352846  [SPARK-36732][BUILD][FOLLOWUP] Fix dependency manifest
6352846 is described below

commit 6352846f1c7bb9390fd0b944fffc7025e08f95b2
Author: Dongjoon Hyun 
AuthorDate: Wed Sep 15 23:38:48 2021 -0700

[SPARK-36732][BUILD][FOLLOWUP] Fix dependency manifest
---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index d0d8d5e..d299528 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -102,6 +102,7 @@ janino/3.0.16//janino-3.0.16.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
+jaxb-api/2.2.11//jaxb-api-2.2.11.jar
 jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/1.7.30//jcl-over-slf4j-1.7.30.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11

2021-09-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 63b8417  [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
63b8417 is described below

commit 63b8417794c33fe76adc64d74dea6227debc59f5
Author: Dongjoon Hyun 
AuthorDate: Wed Sep 15 23:36:26 2021 -0700

[SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11

### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.

### Why are the changes needed?

Apache ORC 1.6.11 has the following fixes.
- https://issues.apache.org/jira/projects/ORC/versions/12350499

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33971 from dongjoon-hyun/SPARK-36732.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit c21779729765f640272e0249c6e71145aa5ccfdf)
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3| 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3| 7 +++
 pom.xml| 7 +--
 .../test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala | 4 ++--
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index e92f8a7..8400413 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -195,9 +195,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index b8426e2..d0d8d5e 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -102,7 +102,6 @@ janino/3.0.16//janino-3.0.16.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
-jaxb-api/2.2.11//jaxb-api-2.2.11.jar
 jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/1.7.30//jcl-over-slf4j-1.7.30.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar
@@ -166,9 +165,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/pom.xml b/pom.xml
index ab71632..9f2b2c9 100644
--- a/pom.xml
+++ b/pom.xml
@@ -137,7 +137,7 @@
 
     <derby.version>10.14.2.0</derby.version>
     <parquet.version>1.12.1</parquet.version>
-    <orc.version>1.6.10</orc.version>
+    <orc.version>1.6.11</orc.version>
     <jetty.version>9.4.43.v20210629</jetty.version>
     <javaxservlet.version>4.0.3</javaxservlet.version>
     <chill.version>0.10.0</chill.version>
@@ -2306,11 +2306,6 @@
         </exclusions>
       </dependency>
       <dependency>
-        <groupId>io.airlift</groupId>
-        <artifactId>aircompressor</artifactId>
-        <version>0.21</version>
-      </dependency>
-      <dependency>
         <groupId>org.apache.orc</groupId>
         <artifactId>orc-mapreduce</artifactId>
         <version>${orc.version}</version>
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
index 2436392..001b6a0 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@@ -695,9 +695,9 @@ class FileBasedDataSourceSuite extends QueryTest
   test("SPARK-22790,SPARK-27668: spark.sql.sources.compressionFactor takes 
effect") {
 Seq(1.0, 0.5).foreach { compressionFactor =>
   withSQLConf(SQLConf.FILE_COMPRESSION_FACTOR.key -> 
compressionFactor.toString,
-SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "250") {
+SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "350") {
 withTempPath { workDir =>
-  // the file size is 486 bytes
+  // the file size is 504 bytes
   val workDirPath = workDir.getAbsolutePath
   val data1 = Seq(100, 200, 300, 400).toDF("count")
   data1.write.orc(workDirPath + "/data1")
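
For context on the test constants above (an inference from the diff, not stated in the commit): `spark.sql.sources.fileCompressionFactor` multiplies the on-disk file size when estimating a scan's data size, and the ORC file written by the test grows from 486 to 504 bytes under ORC 1.6.11, so the broadcast threshold moves from 250 to 350 to keep one factor above it and the other below.

```scala
// Rough arithmetic behind the updated constants (assumed semantics:
// estimated scan size = file size * fileCompressionFactor).
val fileSize  = 504   // bytes, per the updated comment
val threshold = 350   // spark.sql.autoBroadcastJoinThreshold in the test
fileSize * 1.0 > threshold  // true  -> no broadcast join expected
fileSize * 0.5 > threshold  // false -> broadcast join expected
```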

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch master updated: [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11

2021-09-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c217797  [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
c217797 is described below

commit c21779729765f640272e0249c6e71145aa5ccfdf
Author: Dongjoon Hyun 
AuthorDate: Wed Sep 15 23:36:26 2021 -0700

[SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11

### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.

### Why are the changes needed?

Apache ORC 1.6.11 has the following fixes.
- https://issues.apache.org/jira/projects/ORC/versions/12350499

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33971 from dongjoon-hyun/SPARK-36732.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3| 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3| 7 +++
 pom.xml| 7 +--
 .../test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala | 4 ++--
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index e55782d..6f91caf 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -195,9 +195,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index b49998e..ecf448f 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -102,7 +102,6 @@ janino/3.0.16//janino-3.0.16.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
-jaxb-api/2.2.11//jaxb-api-2.2.11.jar
 jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/1.7.30//jcl-over-slf4j-1.7.30.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar
@@ -166,9 +165,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
diff --git a/pom.xml b/pom.xml
index e1503fe..b16c03d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -137,7 +137,7 @@
 
     <derby.version>10.14.2.0</derby.version>
     <parquet.version>1.12.1</parquet.version>
-    <orc.version>1.6.10</orc.version>
+    <orc.version>1.6.11</orc.version>
     <jetty.version>9.4.43.v20210629</jetty.version>
     <javaxservlet.version>4.0.3</javaxservlet.version>
     <chill.version>0.10.0</chill.version>
@@ -2302,11 +2302,6 @@
         </exclusions>
       </dependency>
       <dependency>
-        <groupId>io.airlift</groupId>
-        <artifactId>aircompressor</artifactId>
-        <version>0.21</version>
-      </dependency>
-      <dependency>
         <groupId>org.apache.orc</groupId>
         <artifactId>orc-mapreduce</artifactId>
         <version>${orc.version}</version>
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
index ab28b05..910f159 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@@ -695,9 +695,9 @@ class FileBasedDataSourceSuite extends QueryTest
   test("SPARK-22790,SPARK-27668: spark.sql.sources.compressionFactor takes 
effect") {
 Seq(1.0, 0.5).foreach { compressionFactor =>
   withSQLConf(SQLConf.FILE_COMPRESSION_FACTOR.key -> 
compressionFactor.toString,
-SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "250") {
+SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "350") {
 withTempPath { workDir =>
-  // the file size is 486 bytes
+  // the file size is 504 bytes
   val workDirPath = workDir.getAbsolutePath
   val data1 = Seq(100, 200, 300, 400).toDF("count")
   data1.write.orc(workDirPath + "/data1")

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (bbb33af -> afd406e)

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from bbb33af  [SPARK-36735][SQL] Adjust overhead of cached relation for DPP
 add afd406e  [SPARK-36745][SQL] ExtractEquiJoinKeys should return the 
original predicates on join keys

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/analysis/StreamingJoinHelper.scala |  2 +-
 .../optimizer/NormalizeFloatingNumbers.scala|  2 +-
 .../spark/sql/catalyst/planning/patterns.scala  | 21 ++---
 .../logical/statsEstimation/JoinEstimation.scala|  2 +-
 .../spark/sql/execution/SparkStrategies.scala   | 10 +-
 .../execution/adaptive/DynamicJoinSelection.scala   |  2 +-
 .../adaptive/LogicalQueryStageStrategy.scala|  8 +---
 .../execution/dynamicpruning/PartitionPruning.scala |  2 +-
 .../sql/execution/joins/ExistenceJoinSuite.scala|  6 +++---
 .../spark/sql/execution/joins/InnerJoinSuite.scala  | 10 +-
 .../spark/sql/execution/joins/OuterJoinSuite.scala  |  6 +++---
 11 files changed, 40 insertions(+), 31 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated: [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new c0c121d  [SPARK-36755][SQL] ArraysOverlap should handle duplicated 
Double.NaN and Float.NaN
c0c121d is described below

commit c0c121d772b8af428b9c3edc48d1e2f153a51349
Author: Angerszh 
AuthorDate: Wed Sep 15 22:31:46 2021 +0800

[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and 
Float.NaN

### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as 
double)))
```
This returns [false], but it should return [true].
This issue is caused by `scala.mutable.HashSet` not being able to handle `Double.NaN` and `Float.NaN`.
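
The root cause is easy to reproduce outside Spark. A minimal illustration in the plain Scala REPL (not Spark code): Scala's `==` applies numeric equality to boxed doubles, under which `NaN != NaN`, while `java.lang.Double.equals` treats `NaN` as equal to itself.

```scala
import scala.collection.mutable

val scalaSet = mutable.HashSet[Any](Double.NaN)
scalaSet.contains(Double.NaN)  // false: Scala's == says NaN != NaN

val javaSet = new java.util.HashSet[Any]()
javaSet.add(Double.NaN)
javaSet.contains(Double.NaN)   // true: Double.equals treats NaN as equal to itself
```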

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. Previously `arrays_overlap` did not treat equal `NaN` values as overlapping; now it does.

### How was this patch tested?
Added UT

Closes #34006 from AngersZh/SPARK-36755.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collectionOperations.scala |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala   | 13 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 7adb1ec..f0f71fb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -1253,12 +1253,12 @@ case class ArraysOverlap(left: Expression, right: 
Expression)
   (arr2, arr1)
 }
 if (smaller.numElements() > 0) {
-  val smallestSet = new mutable.HashSet[Any]
+  val smallestSet = new java.util.HashSet[Any]()
   smaller.foreach(elementType, (_, v) =>
 if (v == null) {
   hasNull = true
 } else {
-  smallestSet += v
+  smallestSet.add(v)
 })
   bigger.foreach(elementType, (_, v1) =>
 if (v1 == null) {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 639ed47..9221feb 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -1911,4 +1911,17 @@ class CollectionExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper
   Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))),
   Seq(Float.NaN, null, 1f))
   }
+
+  test("SPARK-36755: ArraysOverlap hould handle duplicated Double.NaN and 
Float.Nan") {
+checkEvaluation(ArraysOverlap(
+  Literal.apply(Array(Double.NaN, 1d)), Literal.apply(Array(Double.NaN))), 
true)
+checkEvaluation(ArraysOverlap(
+  Literal.create(Seq(Double.NaN, null), ArrayType(DoubleType)),
+  Literal.create(Seq(Double.NaN, null, 1d), ArrayType(DoubleType))), true)
+checkEvaluation(ArraysOverlap(
+  Literal.apply(Array(Float.NaN)), Literal.apply(Array(Float.NaN, 1f))), 
true)
+checkEvaluation(ArraysOverlap(
+  Literal.create(Seq(Float.NaN, null), ArrayType(FloatType)),
+  Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))), true)
+  }
 }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.0 updated: [SPARK-36702][SQL][3.0] ArrayUnion handle duplicated Double.NaN and F…

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.0 by this push:
 new 46c5def  [SPARK-36702][SQL][3.0] ArrayUnion handle duplicated 
Double.NaN and F…
46c5def is described below

commit 46c5def22e1ad8b74389aba2ccd499d849c0e9ea
Author: Angerszh 
AuthorDate: Thu Sep 16 12:37:02 2021 +0800

[SPARK-36702][SQL][3.0] ArrayUnion handle duplicated Double.NaN and F…

### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), 
array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` not being able to handle `Double.NaN` and `Float.NaN` either.
In this PR we add a wrapper for OpenHashSet that can handle `null`, `Double.NaN`, and `Float.NaN` together.
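
A minimal sketch of the wrapper idea (hypothetical simplified names, not the actual `SQLOpenHashSet` API): track null and NaN membership with dedicated flags, so the wrapped set never has to store the two values that break hashing or equality.

```scala
// Hypothetical simplification; Spark's real class is
// org.apache.spark.sql.util.SQLOpenHashSet and differs in detail.
class NaNAwareSet[T] {
  private val underlying = new java.util.HashSet[T]()
  private var hasNull = false
  private var hasNaN = false

  def addNull(): Unit = hasNull = true      // remember null without storing it
  def addNaN(): Unit = hasNaN = true        // remember NaN without hashing it
  def add(v: T): Unit = underlying.add(v)   // everything else goes into the set

  def containsNull: Boolean = hasNull
  def containsNaN: Boolean = hasNaN
  def contains(v: T): Boolean = underlying.contains(v)
}
```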

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. `array_union` no longer returns duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes #34011 from AngersZh/SPARK-36702-3.0.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
---
 .../expressions/collectionOperations.scala | 65 +-
 .../org/apache/spark/sql/util/SQLOpenHashSet.scala | 80 ++
 .../expressions/CollectionExpressionsSuite.scala   | 17 +
 3 files changed, 145 insertions(+), 17 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 13363e9..7adb1ec 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -32,6 +32,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.catalyst.util.DateTimeUtils._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
+import org.apache.spark.sql.util.SQLOpenHashSet
 import org.apache.spark.unsafe.UTF8StringBuilder
 import org.apache.spark.unsafe.array.ByteArrayMethods
 import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
@@ -3321,24 +3322,32 @@ case class ArrayUnion(left: Expression, right: 
Expression) extends ArrayBinaryLi
 if (TypeUtils.typeWithProperEquals(elementType)) {
   (array1, array2) =>
 val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-val hs = new OpenHashSet[Any]
-var foundNullElement = false
+val hs = new SQLOpenHashSet[Any]()
+val isNaN = SQLOpenHashSet.isNaN(elementType)
+val valueNaN = SQLOpenHashSet.valueNaN(elementType)
 Seq(array1, array2).foreach { array =>
   var i = 0
   while (i < array.numElements()) {
 if (array.isNullAt(i)) {
-  if (!foundNullElement) {
+  if (!hs.containsNull) {
+hs.addNull
 arrayBuffer += null
-foundNullElement = true
   }
 } else {
   val elem = array.get(i, elementType)
-  if (!hs.contains(elem)) {
-if (arrayBuffer.size > 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-  
ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+  if (isNaN(elem)) {
+if (!hs.containsNaN) {
+  arrayBuffer += valueNaN
+  hs.addNaN
+}
+  } else {
+if (!hs.contains(elem)) {
+  if (arrayBuffer.size > 
ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+
ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+  }
+  arrayBuffer += elem
+  hs.add(elem)
 }
-arrayBuffer += elem
-hs.add(elem)
   }
 }
 i += 1
@@ -3395,13 +3404,12 @@ case class ArrayUnion(left: Expression, right: 
Expression) extends ArrayBinaryLi
   val ptName = CodeGenerator.primitiveTypeName(jt)
 
   nullSafeCodeGen(ctx, ev, (array1, array2) => {
-val foundNullElement = ctx.freshName("foundNullElement")
 val nullElementIndex = ctx.freshName("nullElementIndex")
 val builder = ctx.freshName("builder")
 val array = ctx.freshName("array")
 val arrays = ctx.freshName("arrays")
 val arrayDataIdx = ctx.freshName("arrayDataIdx")
-val openHashSet = classOf[OpenHashSet[_]].getName
+val openHashSet = classOf[SQLOpenHashSet[_]].getName
 val classTag = s"scala.refle

[spark] branch master updated (16f1f71 -> bbb33af)

2021-09-15 Thread viirya
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 16f1f71  [SPARK-36759][BUILD] Upgrade Scala to 2.12.15
 add bbb33af  [SPARK-36735][SQL] Adjust overhead of cached relation for DPP

No new revisions were added by this update.

Summary of changes:
 .../dynamicpruning/PartitionPruning.scala  | 27 +++-
 .../spark/sql/DynamicPartitionPruningSuite.scala   | 76 --
 2 files changed, 82 insertions(+), 21 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (a927b08 -> 16f1f71)

2021-09-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from a927b08  [SPARK-36726] Upgrade Parquet to 1.12.1
 add 16f1f71  [SPARK-36759][BUILD] Upgrade Scala to 2.12.15

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 6 +++---
 pom.xml | 4 ++--
 project/SparkBuild.scala| 2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-36759][BUILD] Upgrade Scala to 2.12.15

2021-09-15 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 2067661  [SPARK-36759][BUILD] Upgrade Scala to 2.12.15
2067661 is described below

commit 2067661869037a1d391587c2f39fce9e101bb247
Author: Dongjoon Hyun 
AuthorDate: Wed Sep 15 13:43:25 2021 -0700

[SPARK-36759][BUILD] Upgrade Scala to 2.12.15

### What changes were proposed in this pull request?

This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.

### Why are the changes needed?

Scala 2.12.15 improves compatibility with JDK 17 and 18:

https://github.com/scala/scala/releases/tag/v2.12.15

- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer

### Does this PR introduce _any_ user-facing change?

Yes, this is a Scala version change.

### How was this patch tested?

Pass the CIs

Closes #33999 from dongjoon-hyun/SPARK-36759.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 16f1f71ba5e5b174bd0e964cfad2f466725dd6a5)
Signed-off-by: Dongjoon Hyun 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 6 +++---
 pom.xml | 4 ++--
 project/SparkBuild.scala| 2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 44f210f..e92f8a7 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -212,10 +212,10 @@ py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
 rocksdbjni/6.20.3//rocksdbjni-6.20.3.jar
 scala-collection-compat_2.12/2.1.1//scala-collection-compat_2.12-2.1.1.jar
-scala-compiler/2.12.14//scala-compiler-2.12.14.jar
-scala-library/2.12.14//scala-library-2.12.14.jar
+scala-compiler/2.12.15//scala-compiler-2.12.15.jar
+scala-library/2.12.15//scala-library-2.12.15.jar
 scala-parser-combinators_2.12/1.1.2//scala-parser-combinators_2.12-1.1.2.jar
-scala-reflect/2.12.14//scala-reflect-2.12.14.jar
+scala-reflect/2.12.15//scala-reflect-2.12.15.jar
 scala-xml_2.12/1.2.0//scala-xml_2.12-1.2.0.jar
 shapeless_2.12/2.3.3//shapeless_2.12-2.3.3.jar
 shims/0.9.0//shims-0.9.0.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index 1e28dd1..b8426e2 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -183,10 +183,10 @@ py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
 rocksdbjni/6.20.3//rocksdbjni-6.20.3.jar
 scala-collection-compat_2.12/2.1.1//scala-collection-compat_2.12-2.1.1.jar
-scala-compiler/2.12.14//scala-compiler-2.12.14.jar
-scala-library/2.12.14//scala-library-2.12.14.jar
+scala-compiler/2.12.15//scala-compiler-2.12.15.jar
+scala-library/2.12.15//scala-library-2.12.15.jar
 scala-parser-combinators_2.12/1.1.2//scala-parser-combinators_2.12-1.1.2.jar
-scala-reflect/2.12.14//scala-reflect-2.12.14.jar
+scala-reflect/2.12.15//scala-reflect-2.12.15.jar
 scala-xml_2.12/1.2.0//scala-xml_2.12-1.2.0.jar
 shapeless_2.12/2.3.3//shapeless_2.12-2.3.3.jar
 shims/0.9.0//shims-0.9.0.jar
diff --git a/pom.xml b/pom.xml
index e8d3863..ab71632 100644
--- a/pom.xml
+++ b/pom.xml
@@ -160,7 +160,7 @@
     <commons.math3.version>3.4.1</commons.math3.version>
 
     <commons.collections.version>3.2.2</commons.collections.version>
-    <scala.version>2.12.14</scala.version>
+    <scala.version>2.12.15</scala.version>
     <scala.binary.version>2.12</scala.binary.version>
     <scalatest-maven-plugin.version>2.0.2</scalatest-maven-plugin.version>
     <scalafmt.parameters>--test</scalafmt.parameters>
@@ -2652,7 +2652,7 @@
               <compilerPlugin>
                 <groupId>com.github.ghik</groupId>
                 <artifactId>silencer-plugin_${scala.version}</artifactId>
-                <version>1.7.5</version>
+                <version>1.7.6</version>
               </compilerPlugin>
             </compilerPlugins>
           </configuration>
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 2f82b1c..b9068cc 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -208,7 +208,7 @@ object SparkBuild extends PomBuild {
   lazy val compilerWarningSettings: Seq[sbt.Def.Setting[_]] = Seq(
 libraryDependencies ++= {
   if 
(VersionNumber(scalaVersion.value).matchesSemVer(SemanticSelector("<2.13.2"))) {
-val silencerVersion = "1.7.5"
+val silencerVersion = "1.7.6"
 Seq(
   "org.scala-lang.modules" %% "scala-collection-compat" % "2.2.0",
   compilerPlugin("com.github.ghik" % "silencer-plugin" % 
silencerVersion cross CrossVersion.full),

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-36726] Upgrade Parquet to 1.12.1

2021-09-15 Thread dbtsai
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new a7dc824  [SPARK-36726] Upgrade Parquet to 1.12.1
a7dc824 is described below

commit a7dc8242ea913841a2627949fde4cd2953d0b053
Author: Chao Sun 
AuthorDate: Wed Sep 15 19:17:34 2021 +

[SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary 
encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 
release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun 
Signed-off-by: DB Tsai 
(cherry picked from commit a927b0836bd59d6731b4970957e82ac1e403ddc4)
Signed-off-by: DB Tsai 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 |  12 ++--
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 |  12 ++--
 pom.xml |   2 +-
 .../resources/test-data/malformed-file-offset.parquet   | Bin 0 -> 37968 bytes
 .../execution/datasources/parquet/ParquetIOSuite.scala  |   6 ++
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 6ba3c86..44f210f 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -201,12 +201,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index 326229b..1e28dd1 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -172,12 +172,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
diff --git a/pom.xml b/pom.xml
index 54025de..e8d3863 100644
--- a/pom.xml
+++ b/pom.xml
@@ -136,7 +136,7 @@
     <kafka.version>2.8.0</kafka.version>
 
     <derby.version>10.14.2.0</derby.version>
-    <parquet.version>1.12.0</parquet.version>
+    <parquet.version>1.12.1</parquet.version>
     <orc.version>1.6.10</orc.version>
     <jetty.version>9.4.43.v20210629</jetty.version>
     <javaxservlet.version>4.0.3</javaxservlet.version>
diff --git 
a/sql/core/src/test/resources/test-data/malformed-file-offset.parquet 
b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet
new file mode 100644
index 000..5abeabe
Binary files /dev/null and 
b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet differ

[spark] branch master updated: [SPARK-36726] Upgrade Parquet to 1.12.1

2021-09-15 Thread dbtsai
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a927b08  [SPARK-36726] Upgrade Parquet to 1.12.1
a927b08 is described below

commit a927b0836bd59d6731b4970957e82ac1e403ddc4
Author: Chao Sun 
AuthorDate: Wed Sep 15 19:17:34 2021 +

[SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary 
encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 
release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun 
Signed-off-by: DB Tsai 
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 |  12 ++--
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 |  12 ++--
 pom.xml |   2 +-
 .../resources/test-data/malformed-file-offset.parquet   | Bin 0 -> 37968 bytes
 .../execution/datasources/parquet/ParquetIOSuite.scala  |   6 ++
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 
b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index e3ddf97..0a83bc0 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -201,12 +201,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 
b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index e8c7ad9..30c901b 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -172,12 +172,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
diff --git a/pom.xml b/pom.xml
index 02627b4..7b472cd 100644
--- a/pom.xml
+++ b/pom.xml
@@ -136,7 +136,7 @@
     <kafka.version>2.8.0</kafka.version>
 
     <derby.version>10.14.2.0</derby.version>
-    <parquet.version>1.12.0</parquet.version>
+    <parquet.version>1.12.1</parquet.version>
     <orc.version>1.6.10</orc.version>
     <jetty.version>9.4.43.v20210629</jetty.version>
     <javaxservlet.version>4.0.3</javaxservlet.version>
diff --git 
a/sql/core/src/test/resources/test-data/malformed-file-offset.parquet 
b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet
new file mode 100644
index 000..5abeabe
Binary files /dev/null and 
b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet differ
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetIOSuite.scala 

[spark] branch branch-3.2 updated: [SPARK-36722][PYTHON] Fix Series.update with another in same frame

2021-09-15 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 017bce7  [SPARK-36722][PYTHON] Fix Series.update with another in same 
frame
017bce7 is described below

commit 017bce7b118cd643e652d5a7914294e281b05e6e
Author: dgd-contributor 
AuthorDate: Wed Sep 15 11:08:01 2021 -0700

[SPARK-36722][PYTHON] Fix Series.update with another in same frame

### What changes were proposed in this pull request?
Fix Series.update with another in same frame

Also add a test for updating a Series in a different frame.

### Why are the changes needed?
Fix Series.update with another in same frame

Pandas behavior:
``` python
>>> pdf = pd.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 
1, None, 0, 0]},
... )
>>> pdf
 ab
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
 ab
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```

### Does this PR introduce _any_ user-facing change?
Before
```python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 
1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in 
update
combined = combine_frames(self._psdf, other._psdf, how="leftouter")
  File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in 
combine_frames
assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```

After
```python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 
1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
>>> psdf
 ab
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```

### How was this patch tested?
unit tests

Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.

Authored-by: dgd-contributor 
Signed-off-by: Takuya UESHIN 
(cherry picked from commit c15072cc7397cb59496b7da1153d663d8201865c)
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/series.py| 35 ++
 .../pandas/tests/test_ops_on_diff_frames.py|  9 ++
 python/pyspark/pandas/tests/test_series.py | 32 
 3 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 568754c..0eebcc9 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -4498,22 +4498,33 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
 if not isinstance(other, Series):
 raise TypeError("'other' must be a Series")
 
-combined = combine_frames(self._psdf, other._psdf, how="leftouter")
+if same_anchor(self, other):
+scol = (
+F.when(other.spark.column.isNotNull(), other.spark.column)
+.otherwise(self.spark.column)
+
.alias(self._psdf._internal.spark_column_name_for(self._column_label))
+)
+internal = self._psdf._internal.with_new_spark_column(
+self._column_label, scol  # TODO: dtype?
+)
+self._psdf._update_internal_frame(internal)
+else:
+combined = combine_frames(self._psdf, other._psdf, how="leftouter")
 
-this_scol = 
combined["this"]._internal.spark_column_for(self._column_label)
-that_scol = 
combined["that"]._internal.spark_column_for(other._column_label)
+this_scol = 
combined["this"]._internal.spark_column_for(self._column_label)
+that_scol = 
combined["that"]._internal.spark_column_for(other._column_label)
 
-scol = (
-F.when(that_scol.isNotNull(), that_scol)
-.otherwise(this_scol)
-
.alias(self._psdf._internal.spark_column_name_for(self._column_label))
-)
+scol = (
+F.when(that_scol.isNotNull(), that_scol)
+.otherwise(this_scol)
+
.alias(self._psdf._internal.spark_column_name_for(self._column_label))
+)
 
-internal = combined["this"]._internal.with_new_spark_column(
-self._column_label, scol  # TODO: dtype?
-)
+internal = combined["this"]._internal.with_new_spark_column(

[spark] branch master updated: [SPARK-36722][PYTHON] Fix Series.update with another in same frame

2021-09-15 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c15072c  [SPARK-36722][PYTHON] Fix Series.update with another in same 
frame
c15072c is described below

commit c15072cc7397cb59496b7da1153d663d8201865c
Author: dgd-contributor 
AuthorDate: Wed Sep 15 11:08:01 2021 -0700

[SPARK-36722][PYTHON] Fix Series.update with another in same frame

### What changes were proposed in this pull request?
Fix Series.update with another in same frame

Also add a test for updating a Series in a different frame.

### Why are the changes needed?
Fix Series.update with another in same frame

Pandas behavior:
``` python
>>> pdf = pd.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 
1, None, 0, 0]},
... )
>>> pdf
 ab
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
 ab
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```

### Does this PR introduce _any_ user-facing change?
Before
```python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 
1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in 
update
combined = combine_frames(self._psdf, other._psdf, how="leftouter")
  File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in 
combine_frames
assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```

After
```python
>>> psdf = ps.DataFrame(
... {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 
1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
>>> psdf
 ab
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```

### How was this patch tested?
unit tests

Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.

Authored-by: dgd-contributor 
Signed-off-by: Takuya UESHIN 
---
 python/pyspark/pandas/series.py| 35 ++
 .../pandas/tests/test_ops_on_diff_frames.py|  9 ++
 python/pyspark/pandas/tests/test_series.py | 32 
 3 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 8cbfdbf..d72c08d 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -4536,22 +4536,33 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
 if not isinstance(other, Series):
 raise TypeError("'other' must be a Series")
 
-combined = combine_frames(self._psdf, other._psdf, how="leftouter")
+if same_anchor(self, other):
+scol = (
+F.when(other.spark.column.isNotNull(), other.spark.column)
+.otherwise(self.spark.column)
+
.alias(self._psdf._internal.spark_column_name_for(self._column_label))
+)
+internal = self._psdf._internal.with_new_spark_column(
+self._column_label, scol  # TODO: dtype?
+)
+self._psdf._update_internal_frame(internal)
+else:
+combined = combine_frames(self._psdf, other._psdf, how="leftouter")
 
-this_scol = 
combined["this"]._internal.spark_column_for(self._column_label)
-that_scol = 
combined["that"]._internal.spark_column_for(other._column_label)
+this_scol = 
combined["this"]._internal.spark_column_for(self._column_label)
+that_scol = 
combined["that"]._internal.spark_column_for(other._column_label)
 
-scol = (
-F.when(that_scol.isNotNull(), that_scol)
-.otherwise(this_scol)
-
.alias(self._psdf._internal.spark_column_name_for(self._column_label))
-)
+scol = (
+F.when(that_scol.isNotNull(), that_scol)
+.otherwise(this_scol)
+
.alias(self._psdf._internal.spark_column_name_for(self._column_label))
+)
 
-internal = combined["this"]._internal.with_new_spark_column(
-self._column_label, scol  # TODO: dtype?
-)
+internal = combined["this"]._internal.with_new_spark_column(
+self._column_label, scol  # TODO: dtype?

[spark] branch branch-3.1 updated: [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 682442a  [SPARK-36755][SQL] ArraysOverlap should handle duplicated 
Double.NaN and Float.NaN
682442a is described below

commit 682442a5aa46d50b6b4b74537fccf04fdd33fe0f
Author: Angerszh 
AuthorDate: Wed Sep 15 22:31:46 2021 +0800

[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and 
Float.NaN

### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as 
double)))
```
This returns [false], but it should return [true].
This issue is caused by `scala.mutable.HashSet` not being able to handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. Previously `arrays_overlap` did not treat equal `NaN` values as overlapping; now it does.

### How was this patch tested?
Added UT

Closes #34006 from AngersZh/SPARK-36755.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit b665782f0d3729928be4ca897ec2eb990b714879)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collectionOperations.scala |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala   | 13 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 7b231fe..9f922d1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -1262,12 +1262,12 @@ case class ArraysOverlap(left: Expression, right: 
Expression)
   (arr2, arr1)
 }
 if (smaller.numElements() > 0) {
-  val smallestSet = new mutable.HashSet[Any]
+  val smallestSet = new java.util.HashSet[Any]()
   smaller.foreach(elementType, (_, v) =>
 if (v == null) {
   hasNull = true
 } else {
-  smallestSet += v
+  smallestSet.add(v)
 })
   bigger.foreach(elementType, (_, v1) =>
 if (v1 == null) {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 25e40c4..69a24d9 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -1965,4 +1965,17 @@ class CollectionExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper
   Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))),
   Seq(Float.NaN, null, 1f))
   }
+
+  test("SPARK-36755: ArraysOverlap hould handle duplicated Double.NaN and 
Float.Nan") {
+checkEvaluation(ArraysOverlap(
+  Literal.apply(Array(Double.NaN, 1d)), Literal.apply(Array(Double.NaN))), 
true)
+checkEvaluation(ArraysOverlap(
+  Literal.create(Seq(Double.NaN, null), ArrayType(DoubleType)),
+  Literal.create(Seq(Double.NaN, null, 1d), ArrayType(DoubleType))), true)
+checkEvaluation(ArraysOverlap(
+  Literal.apply(Array(Float.NaN)), Literal.apply(Array(Float.NaN, 1f))), 
true)
+checkEvaluation(ArraysOverlap(
+  Literal.create(Seq(Float.NaN, null), ArrayType(FloatType)),
+  Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))), true)
+  }
 }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.2 updated: [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new 75bffd9  [SPARK-36755][SQL] ArraysOverlap should handle duplicated 
Double.NaN and Float.NaN
75bffd9 is described below

commit 75bffd972d53488a732e2e21695e3283cb32e9dc
Author: Angerszh 
AuthorDate: Wed Sep 15 22:31:46 2021 +0800

[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and 
Float.NaN

### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as 
double)))
```
This returns [false], but it should return [true].
This issue is caused by `scala.mutable.HashSet` not being able to handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. Previously `arrays_overlap` did not treat equal `NaN` values as overlapping; now it does.

### How was this patch tested?
Added UT

Closes #34006 from AngersZh/SPARK-36755.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit b665782f0d3729928be4ca897ec2eb990b714879)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collectionOperations.scala |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala   | 13 +
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 47b2719..73a45d7 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -1297,12 +1297,12 @@ case class ArraysOverlap(left: Expression, right: 
Expression)
   (arr2, arr1)
 }
 if (smaller.numElements() > 0) {
-  val smallestSet = new mutable.HashSet[Any]
+  val smallestSet = new java.util.HashSet[Any]()
   smaller.foreach(elementType, (_, v) =>
 if (v == null) {
   hasNull = true
 } else {
-  smallestSet += v
+  smallestSet.add(v)
 })
   bigger.foreach(elementType, (_, v1) =>
 if (v1 == null) {
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 496a4c3..caca24a 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -2309,4 +2309,17 @@ class CollectionExpressionsSuite extends SparkFunSuite 
with ExpressionEvalHelper
   Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))),
   Seq(Float.NaN, null, 1f))
   }
+
+  test("SPARK-36755: ArraysOverlap hould handle duplicated Double.NaN and 
Float.Nan") {
+checkEvaluation(ArraysOverlap(
+  Literal.apply(Array(Double.NaN, 1d)), Literal.apply(Array(Double.NaN))), 
true)
+checkEvaluation(ArraysOverlap(
+  Literal.create(Seq(Double.NaN, null), ArrayType(DoubleType)),
+  Literal.create(Seq(Double.NaN, null, 1d), ArrayType(DoubleType))), true)
+checkEvaluation(ArraysOverlap(
+  Literal.apply(Array(Float.NaN)), Literal.apply(Array(Float.NaN, 1f))), 
true)
+checkEvaluation(ArraysOverlap(
+  Literal.create(Seq(Float.NaN, null), ArrayType(FloatType)),
+  Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))), true)
+  }
 }

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (6380859 -> b665782)

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.


from 6380859  [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated 
Double.NaN and Float.NaN
 add b665782  [SPARK-36755][SQL] ArraysOverlap should handle duplicated 
Double.NaN and Float.NaN

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/collectionOperations.scala |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala   | 13 +
 2 files changed, 15 insertions(+), 2 deletions(-)

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.1 updated: [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.1 by this push:
 new 4a266f0  [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated 
Double.NaN and Float.NaN
4a266f0 is described below

commit 4a266f0b59e4432c09ac5bc1bdd266a84b94a0b9
Author: Angerszh 
AuthorDate: Wed Sep 15 22:04:09 2021 +0800

[SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and 
Float.NaN

### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/33955#discussion_r708570515, use the normalized NaN value.

### Why are the changes needed?
Use the normalized NaN for duplicated NaN values.
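
For background (an inference, not stated in the commit): NaN has many bit patterns, so without normalization the NaN that survives deduplication would depend on which input happened to be seen first. A small plain-Scala illustration:

```scala
// Two distinct NaN bit patterns: both are NaN, but their raw bits differ.
val canonical = java.lang.Double.longBitsToDouble(0x7ff8000000000000L)
val payload   = java.lang.Double.longBitsToDouble(0x7ff8000000000001L)

canonical.isNaN && payload.isNaN                   // true: both are NaN
java.lang.Double.doubleToRawLongBits(canonical) ==
  java.lang.Double.doubleToRawLongBits(payload)    // false: different bits

// Always emitting java.lang.Double.NaN (the canonical pattern) keeps
// the deduplicated result deterministic.
```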

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #34003 from AngersZh/SPARK-36702-FOLLOWUP.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 638085953f931f98241856c9f652e5f15202fcc0)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collectionOperations.scala  | 13 ++++++++-----
 .../scala/org/apache/spark/sql/util/SQLOpenHashSet.scala |  8 ++++++++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index b829ac0..7b231fe 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3370,6 +3370,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
     val hs = new SQLOpenHashSet[Any]()
     val isNaN = SQLOpenHashSet.isNaN(elementType)
+    val valueNaN = SQLOpenHashSet.valueNaN(elementType)
     Seq(array1, array2).foreach { array =>
       var i = 0
       while (i < array.numElements()) {
@@ -3382,7 +3383,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
           val elem = array.get(i, elementType)
           if (isNaN(elem)) {
             if (!hs.containsNaN) {
-              arrayBuffer += elem
+              arrayBuffer += valueNaN
               hs.addNaN
             }
           } else {
@@ -3480,16 +3481,18 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
 
     def withNaNCheck(body: String): String = {
       (elementType match {
-        case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)")
-        case FloatType => Some(s"java.lang.Float.isNaN((float)$value)")
+        case DoubleType =>
+          Some((s"java.lang.Double.isNaN((double)$value)", "java.lang.Double.NaN"))
+        case FloatType =>
+          Some((s"java.lang.Float.isNaN((float)$value)", "java.lang.Float.NaN"))
         case _ => None
-      }).map { isNaN =>
+      }).map { case (isNaN, valueNaN) =>
         s"""
            |if ($isNaN) {
            |  if (!$hashSet.containsNaN()) {
            |    $size++;
            |    $hashSet.addNaN();
-           |    $builder.$$plus$$eq($value);
+           |    $builder.$$plus$$eq($valueNaN);
            |  }
            |} else {
            |  $body
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
index 5ffe733..083cfdd 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
@@ -69,4 +69,12 @@ object SQLOpenHashSet {
       case _ => (_: Any) => false
     }
   }
+
+  def valueNaN(dataType: DataType): Any = {
+    dataType match {
+      case DoubleType => java.lang.Double.NaN
+      case FloatType => java.lang.Float.NaN
+      case _ => null
+    }
+  }
 }
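
For readers skimming the diff: a self-contained Scala sketch (hypothetical helper, not Spark code) of the rule evalUnion now applies for DoubleType — the first NaN encountered is kept, but it is stored as the normalized java.lang.Double.NaN rather than the incoming bit pattern:

    import scala.collection.mutable

    def unionNormalizingNaN(a: Array[Double], b: Array[Double]): Array[Double] = {
      val seen = mutable.HashSet.empty[Double]   // deduplicates non-NaN values
      var sawNaN = false                         // plays the role of hs.containsNaN
      val out = mutable.ArrayBuffer.empty[Double]
      for (v <- a.iterator ++ b.iterator) {
        if (v.isNaN) {
          if (!sawNaN) { out += Double.NaN; sawNaN = true } // normalized value, not v
        } else if (seen.add(v)) {
          out += v
        }
      }
      out.toArray
    }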




[spark] branch branch-3.2 updated: [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.2 by this push:
 new e641556  [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
e641556 is described below

commit e64155691f23d54d650353eb3e5a083cb10dbfc2
Author: Angerszh 
AuthorDate: Wed Sep 15 22:04:09 2021 +0800

[SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?
Per the review discussion at https://github.com/apache/spark/pull/33955#discussion_r708570515, use the normalized NaN value.

### Why are the changes needed?
To keep a single, normalized NaN value when duplicated NaNs are collapsed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #34003 from AngersZh/SPARK-36702-FOLLOWUP.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
(cherry picked from commit 638085953f931f98241856c9f652e5f15202fcc0)
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collectionOperations.scala  | 13 ++++++++-----
 .../scala/org/apache/spark/sql/util/SQLOpenHashSet.scala |  8 ++++++++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index e5620a1..47b2719 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3578,6 +3578,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
     val hs = new SQLOpenHashSet[Any]()
     val isNaN = SQLOpenHashSet.isNaN(elementType)
+    val valueNaN = SQLOpenHashSet.valueNaN(elementType)
     Seq(array1, array2).foreach { array =>
       var i = 0
       while (i < array.numElements()) {
@@ -3590,7 +3591,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
           val elem = array.get(i, elementType)
           if (isNaN(elem)) {
             if (!hs.containsNaN) {
-              arrayBuffer += elem
+              arrayBuffer += valueNaN
               hs.addNaN
             }
           } else {
@@ -3688,16 +3689,18 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
 
     def withNaNCheck(body: String): String = {
       (elementType match {
-        case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)")
-        case FloatType => Some(s"java.lang.Float.isNaN((float)$value)")
+        case DoubleType =>
+          Some((s"java.lang.Double.isNaN((double)$value)", "java.lang.Double.NaN"))
+        case FloatType =>
+          Some((s"java.lang.Float.isNaN((float)$value)", "java.lang.Float.NaN"))
         case _ => None
-      }).map { isNaN =>
+      }).map { case (isNaN, valueNaN) =>
         s"""
            |if ($isNaN) {
            |  if (!$hashSet.containsNaN()) {
            |    $size++;
            |    $hashSet.addNaN();
-           |    $builder.$$plus$$eq($value);
+           |    $builder.$$plus$$eq($valueNaN);
            |  }
            |} else {
            |  $body
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
index 5ffe733..083cfdd 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
@@ -69,4 +69,12 @@ object SQLOpenHashSet {
       case _ => (_: Any) => false
     }
   }
+
+  def valueNaN(dataType: DataType): Any = {
+    dataType match {
+      case DoubleType => java.lang.Double.NaN
+      case FloatType => java.lang.Float.NaN
+      case _ => null
+    }
+  }
 }




[spark] branch master updated: [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN

2021-09-15 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6380859  [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
6380859 is described below

commit 638085953f931f98241856c9f652e5f15202fcc0
Author: Angerszh 
AuthorDate: Wed Sep 15 22:04:09 2021 +0800

[SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?
Per the review discussion at https://github.com/apache/spark/pull/33955#discussion_r708570515, use the normalized NaN value.

### Why are the changes needed?
To keep a single, normalized NaN value when duplicated NaNs are collapsed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #34003 from AngersZh/SPARK-36702-FOLLOWUP.

Authored-by: Angerszh 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/expressions/collectionOperations.scala  | 13 ++++++++-----
 .../scala/org/apache/spark/sql/util/SQLOpenHashSet.scala |  8 ++++++++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index e5620a1..47b2719 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3578,6 +3578,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
     val hs = new SQLOpenHashSet[Any]()
     val isNaN = SQLOpenHashSet.isNaN(elementType)
+    val valueNaN = SQLOpenHashSet.valueNaN(elementType)
     Seq(array1, array2).foreach { array =>
       var i = 0
       while (i < array.numElements()) {
@@ -3590,7 +3591,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
           val elem = array.get(i, elementType)
           if (isNaN(elem)) {
             if (!hs.containsNaN) {
-              arrayBuffer += elem
+              arrayBuffer += valueNaN
               hs.addNaN
             }
           } else {
@@ -3688,16 +3689,18 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
 
     def withNaNCheck(body: String): String = {
       (elementType match {
-        case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)")
-        case FloatType => Some(s"java.lang.Float.isNaN((float)$value)")
+        case DoubleType =>
+          Some((s"java.lang.Double.isNaN((double)$value)", "java.lang.Double.NaN"))
+        case FloatType =>
+          Some((s"java.lang.Float.isNaN((float)$value)", "java.lang.Float.NaN"))
         case _ => None
-      }).map { isNaN =>
+      }).map { case (isNaN, valueNaN) =>
         s"""
            |if ($isNaN) {
            |  if (!$hashSet.containsNaN()) {
            |    $size++;
            |    $hashSet.addNaN();
-           |    $builder.$$plus$$eq($value);
+           |    $builder.$$plus$$eq($valueNaN);
            |  }
            |} else {
            |  $body
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
index 5ffe733..083cfdd 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
@@ -69,4 +69,12 @@ object SQLOpenHashSet {
       case _ => (_: Any) => false
     }
   }
+
+  def valueNaN(dataType: DataType): Any = {
+    dataType match {
+      case DoubleType => java.lang.Double.NaN
+      case FloatType => java.lang.Float.NaN
+      case _ => null
+    }
+  }
 }




[spark] branch master updated: [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R

2021-09-15 Thread sarutak
This is an automated email from the ASF dual-hosted git repository.

sarutak pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0666f5c  [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R
0666f5c is described below

commit 0666f5c00393acccecdd82d3794e5a2b88f3210b
Author: Leona Yoda 
AuthorDate: Wed Sep 15 16:27:13 2021 +0900

[SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R

### What changes were proposed in this pull request?

octet_length: calculate the byte length of strings
bit_length: calculate the bit length of strings
These two string functions were previously implemented only in Spark SQL, not in the Scala, Python, and R APIs.

### Why are the changes needed?

These functions are useful for users who work with multi-byte characters, mainly from Scala, Python, or R.

### Does this PR introduce _any_ user-facing change?

Yes. Users can call the octet_length/bit_length APIs from Scala (DataFrame), Python, and R.

### How was this patch tested?

Unit tests
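
Illustrative usage (not from the commit itself), assuming Spark 3.3.0 or later and a local SparkSession; the multi-byte string shows where octet_length and bit_length diverge from the character count:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{bit_length, col, length, octet_length}

    val spark = SparkSession.builder().master("local[*]").appName("length-demo").getOrCreate()
    import spark.implicits._

    // "あいう" is 3 characters but 9 UTF-8 bytes (3 bytes per character).
    val df = Seq("abc", "あいう").toDF("s")
    df.select(length(col("s")), octet_length(col("s")), bit_length(col("s"))).show()
    // length:       abc -> 3,  あいう -> 3
    // octet_length: abc -> 3,  あいう -> 9
    // bit_length:   abc -> 24, あいう -> 72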

Closes #33992 from yoda-mon/add-bit-octet-length.

Authored-by: Leona Yoda 
Signed-off-by: Kousuke Saruta 
---
 R/pkg/NAMESPACE                                    |  2 +
 R/pkg/R/functions.R                                | 26 +++
 R/pkg/R/generics.R                                 |  8 +
 R/pkg/tests/fulltests/test_sparkSQL.R              | 11 +
 python/docs/source/reference/pyspark.sql.rst       |  2 +
 python/pyspark/sql/functions.py                    | 52 ++
 python/pyspark/sql/functions.pyi                   |  2 +
 python/pyspark/sql/tests/test_functions.py         | 14 +-
 .../scala/org/apache/spark/sql/functions.scala     | 16 +++
 .../apache/spark/sql/StringFunctionsSuite.scala    | 52 ++
 10 files changed, 184 insertions(+), 1 deletion(-)

diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE
index 7fa8085..686a49e 100644
--- a/R/pkg/NAMESPACE
+++ b/R/pkg/NAMESPACE
@@ -243,6 +243,7 @@ exportMethods("%<=>%",
               "base64",
               "between",
               "bin",
+              "bit_length",
               "bitwise_not",
               "bitwiseNOT",
               "bround",
@@ -364,6 +365,7 @@ exportMethods("%<=>%",
               "not",
               "nth_value",
               "ntile",
+              "octet_length",
               "otherwise",
               "over",
               "overlay",
diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R
index 62066da1..f0768c7 100644
--- a/R/pkg/R/functions.R
+++ b/R/pkg/R/functions.R
@@ -647,6 +647,19 @@ setMethod("bin",
           })
 
 #' @details
+#' \code{bit_length}: Calculates the bit length for the specified string column.
+#'
+#' @rdname column_string_functions
+#' @aliases bit_length bit_length,Column-method
+#' @note bit_length since 3.3.0
+setMethod("bit_length",
+          signature(x = "Column"),
+          function(x) {
+            jc <- callJStatic("org.apache.spark.sql.functions", "bit_length", x@jc)
+            column(jc)
+          })
+
+#' @details
 #' \code{bitwise_not}: Computes bitwise NOT.
 #'
 #' @rdname column_nonaggregate_functions
@@ -1570,6 +1583,19 @@ setMethod("negate",
           })
 
 #' @details
+#' \code{octet_length}: Calculates the byte length for the specified string column.
+#'
+#' @rdname column_string_functions
+#' @aliases octet_length octet_length,Column-method
+#' @note octet_length since 3.3.0
+setMethod("octet_length",
+          signature(x = "Column"),
+          function(x) {
+            jc <- callJStatic("org.apache.spark.sql.functions", "octet_length", x@jc)
+            column(jc)
+          })
+
+#' @details
 #' \code{overlay}: Overlay the specified portion of \code{x} with \code{replace},
 #' starting from byte position \code{pos} of \code{src} and proceeding for
 #' \code{len} bytes.
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index 9ebea3f..1abde65 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -884,6 +884,10 @@ setGeneric("base64", function(x) { standardGeneric("base64") })
 #' @name NULL
 setGeneric("bin", function(x) { standardGeneric("bin") })
 
+#' @rdname column_string_functions
+#' @name NULL
+setGeneric("bit_length", function(x, ...) { standardGeneric("bit_length") })
+
 #' @rdname column_nonaggregate_functions
 #' @name NULL
 setGeneric("bitwise_not", function(x) { standardGeneric("bitwise_not") })
@@ -1232,6 +1236,10 @@ setGeneric("n_distinct", function(x, ...) { standardGeneric("n_distinct") })
 
 #' @rdname column_string_functions
 #' @name NULL
+setGeneric("octet_length", function(x, ...) { standardGeneric("octet_length") })
+
+#' @rdname column_string_functions
+#' @name NULL
 setGeneric("overlay", function(x, replace, pos, ...) { standardGeneric("overlay") })
 
 #' @rdna