[spark] branch branch-3.2 updated: [SPARK-36732][BUILD][FOLLOWUP] Fix dependency manifest
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 6352846  [SPARK-36732][BUILD][FOLLOWUP] Fix dependency manifest
6352846 is described below

commit 6352846f1c7bb9390fd0b944fffc7025e08f95b2
Author: Dongjoon Hyun
AuthorDate: Wed Sep 15 23:38:48 2021 -0700

[SPARK-36732][BUILD][FOLLOWUP] Fix dependency manifest
---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 1 +
 1 file changed, 1 insertion(+)

diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index d0d8d5e..d299528 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -102,6 +102,7 @@ janino/3.0.16//janino-3.0.16.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
+jaxb-api/2.2.11//jaxb-api-2.2.11.jar
 jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/1.7.30//jcl-over-slf4j-1.7.30.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar
[spark] branch branch-3.2 updated: [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 63b8417  [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
63b8417 is described below

commit 63b8417794c33fe76adc64d74dea6227debc59f5
Author: Dongjoon Hyun
AuthorDate: Wed Sep 15 23:36:26 2021 -0700

[SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11

### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.

### Why are the changes needed?

Apache ORC 1.6.11 has the following fixes.
- https://issues.apache.org/jira/projects/ORC/versions/12350499

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33971 from dongjoon-hyun/SPARK-36732.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
(cherry picked from commit c21779729765f640272e0249c6e71145aa5ccfdf)
Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3                            | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3                            | 7 +++----
 pom.xml                                                            | 7 +------
 .../test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala | 4 ++--
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index e92f8a7..8400413 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -195,9 +195,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar

diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index b8426e2..d0d8d5e 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -102,7 +102,6 @@ janino/3.0.16//janino-3.0.16.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
-jaxb-api/2.2.11//jaxb-api-2.2.11.jar
 jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/1.7.30//jcl-over-slf4j-1.7.30.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar
@@ -166,9 +165,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar

diff --git a/pom.xml b/pom.xml
index ab71632..9f2b2c9 100644
--- a/pom.xml
+++ b/pom.xml
@@ -137,7 +137,7 @@
     10.14.2.0
     1.12.1
-    1.6.10
+    1.6.11
     9.4.43.v20210629
     4.0.3
     0.10.0
@@ -2306,11 +2306,6 @@
-        io.airlift
-        aircompressor
-        0.21
-
-
         org.apache.orc
         orc-mapreduce
         ${orc.version}

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
index 2436392..001b6a0 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@@ -695,9 +695,9 @@ class FileBasedDataSourceSuite extends QueryTest
   test("SPARK-22790,SPARK-27668: spark.sql.sources.compressionFactor takes effect") {
     Seq(1.0, 0.5).foreach { compressionFactor =>
       withSQLConf(SQLConf.FILE_COMPRESSION_FACTOR.key -> compressionFactor.toString,
-        SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "250") {
+        SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "350") {
         withTempPath { workDir =>
-          // the file size is 486 bytes
+          // the file size is 504 bytes
           val workDirPath = workDir.getAbsolutePath
           val data1 = Seq(100, 200, 300, 400).toDF("count")
           data1.write.orc(workDirPath + "/data1")
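For context on the test change above: Spark estimates the in-memory size of a file-based relation roughly as fileSize × spark.sql.sources.fileCompressionFactor, and that estimate feeds the auto-broadcast-join decision. The ORC files written by the test grew from 486 to 504 bytes under ORC 1.6.11, so the threshold moved from 250 to 350 for the two factors (1.0 and 0.5) to keep landing on opposite sides of it. A hedged sketch of that mechanism; the scratch path is an assumption and the reported sizeInBytes may include estimator overhead:

```scala
import org.apache.spark.sql.SparkSession

object CompressionFactorDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    // Halve the size estimate of file-based relations, and set the
    // broadcast threshold the adjusted test now uses.
    spark.conf.set("spark.sql.sources.fileCompressionFactor", "0.5")
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "350")

    val path = "/tmp/orc-size-demo" // assumed scratch location
    Seq(100, 200, 300, 400).toDF("count").write.mode("overwrite").orc(path)

    // Roughly 504 * 0.5 = 252 < 350, so a join against this relation would
    // qualify for a broadcast join under these settings; with factor 1.0
    // the estimate (~504) exceeds the threshold and it would not.
    println(spark.read.orc(path).queryExecution.optimizedPlan.stats.sizeInBytes)
    spark.stop()
  }
}
```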
[spark] branch master updated: [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master... dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c217797  [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
c217797 is described below

commit c21779729765f640272e0249c6e71145aa5ccfdf
Author: Dongjoon Hyun
AuthorDate: Wed Sep 15 23:36:26 2021 -0700

[SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11

### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.

### Why are the changes needed?

Apache ORC 1.6.11 has the following fixes.
- https://issues.apache.org/jira/projects/ORC/versions/12350499

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33971 from dongjoon-hyun/SPARK-36732.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3                            | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3                            | 7 +++----
 pom.xml                                                            | 7 +------
 .../test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala | 4 ++--
 4 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index e55782d..6f91caf 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -195,9 +195,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar

diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index b49998e..ecf448f 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -102,7 +102,6 @@ janino/3.0.16//janino-3.0.16.jar
 javassist/3.25.0-GA//javassist-3.25.0-GA.jar
 javax.jdo/3.2.0-m3//javax.jdo-3.2.0-m3.jar
 javolution/5.5.1//javolution-5.5.1.jar
-jaxb-api/2.2.11//jaxb-api-2.2.11.jar
 jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/1.7.30//jcl-over-slf4j-1.7.30.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar
@@ -166,9 +165,9 @@ objenesis/2.6//objenesis-2.6.jar
 okhttp/3.12.12//okhttp-3.12.12.jar
 okio/1.14.0//okio-1.14.0.jar
 opencsv/2.3//opencsv-2.3.jar
-orc-core/1.6.10//orc-core-1.6.10.jar
-orc-mapreduce/1.6.10//orc-mapreduce-1.6.10.jar
-orc-shims/1.6.10//orc-shims-1.6.10.jar
+orc-core/1.6.11//orc-core-1.6.11.jar
+orc-mapreduce/1.6.11//orc-mapreduce-1.6.11.jar
+orc-shims/1.6.11//orc-shims-1.6.11.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar

diff --git a/pom.xml b/pom.xml
index e1503fe..b16c03d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -137,7 +137,7 @@
     10.14.2.0
     1.12.1
-    1.6.10
+    1.6.11
     9.4.43.v20210629
     4.0.3
     0.10.0
@@ -2302,11 +2302,6 @@
-        io.airlift
-        aircompressor
-        0.21
-
-
         org.apache.orc
         orc-mapreduce
         ${orc.version}

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
index ab28b05..910f159 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/FileBasedDataSourceSuite.scala
@@ -695,9 +695,9 @@ class FileBasedDataSourceSuite extends QueryTest
   test("SPARK-22790,SPARK-27668: spark.sql.sources.compressionFactor takes effect") {
     Seq(1.0, 0.5).foreach { compressionFactor =>
       withSQLConf(SQLConf.FILE_COMPRESSION_FACTOR.key -> compressionFactor.toString,
-        SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "250") {
+        SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "350") {
         withTempPath { workDir =>
-          // the file size is 486 bytes
+          // the file size is 504 bytes
           val workDirPath = workDir.getAbsolutePath
           val data1 = Seq(100, 200, 300, 400).toDF("count")
           data1.write.orc(workDirPath + "/data1")
[spark] branch master updated (bbb33af -> afd406e)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from bbb33af  [SPARK-36735][SQL] Adjust overhead of cached relation for DPP
 add afd406e  [SPARK-36745][SQL] ExtractEquiJoinKeys should return the original predicates on join keys

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/analysis/StreamingJoinHelper.scala     |  2 +-
 .../optimizer/NormalizeFloatingNumbers.scala            |  2 +-
 .../spark/sql/catalyst/planning/patterns.scala          | 21 ++---
 .../logical/statsEstimation/JoinEstimation.scala        |  2 +-
 .../spark/sql/execution/SparkStrategies.scala           | 10 +-
 .../execution/adaptive/DynamicJoinSelection.scala       |  2 +-
 .../adaptive/LogicalQueryStageStrategy.scala            |  8 +---
 .../execution/dynamicpruning/PartitionPruning.scala     |  2 +-
 .../sql/execution/joins/ExistenceJoinSuite.scala        |  6 +++---
 .../spark/sql/execution/joins/InnerJoinSuite.scala      | 10 +-
 .../spark/sql/execution/joins/OuterJoinSuite.scala      |  6 +++---
 11 files changed, 40 insertions(+), 31 deletions(-)
[spark] branch branch-3.0 updated: [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new c0c121d  [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
c0c121d is described below

commit c0c121d772b8af428b9c3edc48d1e2f153a51349
Author: Angerszh
AuthorDate: Wed Sep 15 22:31:46 2021 +0800

[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?

For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [false], but it should return [true].
The issue is caused by the fact that `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes. arrays_overlap now treats equal `NaN` values as overlapping.

### How was this patch tested?

Added UT.

Closes #34006 from AngersZh/SPARK-36755.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan
---
 .../sql/catalyst/expressions/collectionOperations.scala    |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala  | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 7adb1ec..f0f71fb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -1253,12 +1253,12 @@ case class ArraysOverlap(left: Expression, right: Expression)
       (arr2, arr1)
     }
     if (smaller.numElements() > 0) {
-      val smallestSet = new mutable.HashSet[Any]
+      val smallestSet = new java.util.HashSet[Any]()
       smaller.foreach(elementType, (_, v) =>
         if (v == null) {
           hasNull = true
         } else {
-          smallestSet += v
+          smallestSet.add(v)
         })
       bigger.foreach(elementType, (_, v1) =>
         if (v1 == null) {

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 639ed47..9221feb 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -1911,4 +1911,17 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))),
       Seq(Float.NaN, null, 1f))
   }
+
+  test("SPARK-36755: ArraysOverlap should handle duplicated Double.NaN and Float.NaN") {
+    checkEvaluation(ArraysOverlap(
+      Literal.apply(Array(Double.NaN, 1d)), Literal.apply(Array(Double.NaN))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(Double.NaN, null), ArrayType(DoubleType)),
+      Literal.create(Seq(Double.NaN, null, 1d), ArrayType(DoubleType))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.apply(Array(Float.NaN)), Literal.apply(Array(Float.NaN, 1f))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(Float.NaN, null), ArrayType(FloatType)),
+      Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))), true)
+  }
 }
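The root cause is easy to reproduce outside Spark. A minimal sketch (plain Scala, no Spark dependencies) of the set-membership difference the fix relies on: Scala's `==` unboxes numerics and compares with IEEE semantics, where `NaN != NaN`, while `java.util.HashSet` goes through `java.lang.Double.equals`, which treats `NaN` as equal to itself.

```scala
object NaNSetDemo {
  def main(args: Array[String]): Unit = {
    // scala.mutable.HashSet[Any] compares boxed doubles by unboxing to
    // IEEE ==, and NaN != NaN, so the stored NaN is never found again.
    val scalaSet = new scala.collection.mutable.HashSet[Any]
    scalaSet += Double.NaN
    println(scalaSet.contains(Double.NaN)) // false

    // java.util.HashSet uses java.lang.Double.equals, which defines
    // NaN.equals(NaN) == true, so membership works as arrays_overlap needs.
    val javaSet = new java.util.HashSet[Any]()
    javaSet.add(Double.NaN)
    println(javaSet.contains(Double.NaN)) // true
  }
}
```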
[spark] branch branch-3.0 updated: [SPARK-36702][SQL][3.0] ArrayUnion handle duplicated Double.NaN and F…
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.0
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.0 by this push:
     new 46c5def  [SPARK-36702][SQL][3.0] ArrayUnion handle duplicated Double.NaN and F…
46c5def is described below

commit 46c5def22e1ad8b74389aba2ccd499d849c0e9ea
Author: Angerszh
AuthorDate: Thu Sep 16 12:37:02 2021 +0800

[SPARK-36702][SQL][3.0] ArrayUnion handle duplicated Double.NaN and F…

### What changes were proposed in this pull request?

For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by the fact that `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either. In this PR we add a wrapper around OpenHashSet that can handle `null`, `Double.NaN`, and `Float.NaN` together.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes. ArrayUnion no longer returns duplicated `NaN` values.

### How was this patch tested?

Added UT.

Closes #34011 from AngersZh/SPARK-36702-3.0.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan
---
 .../expressions/collectionOperations.scala         | 65 +-
 .../org/apache/spark/sql/util/SQLOpenHashSet.scala | 80 ++
 .../expressions/CollectionExpressionsSuite.scala   | 17 +
 3 files changed, 145 insertions(+), 17 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 13363e9..7adb1ec 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -32,6 +32,7 @@ import org.apache.spark.sql.catalyst.util.DateTimeConstants._
 import org.apache.spark.sql.catalyst.util.DateTimeUtils._
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
+import org.apache.spark.sql.util.SQLOpenHashSet
 import org.apache.spark.unsafe.UTF8StringBuilder
 import org.apache.spark.unsafe.array.ByteArrayMethods
 import org.apache.spark.unsafe.array.ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH
@@ -3321,24 +3322,32 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     if (TypeUtils.typeWithProperEquals(elementType)) {
       (array1, array2) =>
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
-        val hs = new OpenHashSet[Any]
-        var foundNullElement = false
+        val hs = new SQLOpenHashSet[Any]()
+        val isNaN = SQLOpenHashSet.isNaN(elementType)
+        val valueNaN = SQLOpenHashSet.valueNaN(elementType)
         Seq(array1, array2).foreach { array =>
           var i = 0
           while (i < array.numElements()) {
             if (array.isNullAt(i)) {
-              if (!foundNullElement) {
+              if (!hs.containsNull) {
+                hs.addNull
                 arrayBuffer += null
-                foundNullElement = true
               }
             } else {
               val elem = array.get(i, elementType)
-              if (!hs.contains(elem)) {
-                if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
-                  ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+              if (isNaN(elem)) {
+                if (!hs.containsNaN) {
+                  arrayBuffer += valueNaN
+                  hs.addNaN
+                }
+              } else {
+                if (!hs.contains(elem)) {
+                  if (arrayBuffer.size > ByteArrayMethods.MAX_ROUNDED_ARRAY_LENGTH) {
+                    ArrayBinaryLike.throwUnionLengthOverflowException(arrayBuffer.size)
+                  }
+                  arrayBuffer += elem
+                  hs.add(elem)
                 }
-                arrayBuffer += elem
-                hs.add(elem)
               }
             }
             i += 1
@@ -3395,13 +3404,12 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     val ptName = CodeGenerator.primitiveTypeName(jt)

     nullSafeCodeGen(ctx, ev, (array1, array2) => {
-      val foundNullElement = ctx.freshName("foundNullElement")
       val nullElementIndex = ctx.freshName("nullElementIndex")
       val builder = ctx.freshName("builder")
       val array = ctx.freshName("array")
       val arrays = ctx.freshName("arrays")
       val arrayDataIdx = ctx.freshName("arrayDataIdx")
-      val openHashSet = classOf[OpenHashSet[_]].getName
+      val openHashSet = classOf[SQLOpenHashSet[_]].getName
       val classTag = s"scala.refle
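For readers following the backport, here is a hedged sketch of the wrapper idea the commit message describes: track `null` and `NaN` with dedicated flags so the underlying hash set only ever sees well-behaved values. The class and method names below are illustrative, not Spark's actual `SQLOpenHashSet` API.

```scala
// Illustrative only: a null/NaN-aware set in the spirit of SQLOpenHashSet.
class NaNAwareSet {
  private val underlying = new java.util.HashSet[Any]()
  private var hasNull = false
  private var hasNaN = false

  // Only Double and Float can be NaN; everything else falls through.
  private def isNaN(v: Any): Boolean = v match {
    case d: Double => d.isNaN
    case f: Float  => f.isNaN
    case _         => false
  }

  def add(v: Any): Unit =
    if (v == null) hasNull = true       // null tracked without hashing
    else if (isNaN(v)) hasNaN = true    // all NaNs collapse to one flag
    else underlying.add(v)

  def contains(v: Any): Boolean =
    if (v == null) hasNull
    else if (isNaN(v)) hasNaN
    else underlying.contains(v)
}

object NaNAwareSetDemo {
  def main(args: Array[String]): Unit = {
    val s = new NaNAwareSet
    s.add(Double.NaN)
    s.add(null)
    println(s.contains(Double.NaN)) // true: NaN deduplicated via the flag
    println(s.contains(null))       // true: null handled without hashing
  }
}
```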
[spark] branch master updated (16f1f71 -> bbb33af)
This is an automated email from the ASF dual-hosted git repository.

viirya pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 16f1f71  [SPARK-36759][BUILD] Upgrade Scala to 2.12.15
 add bbb33af  [SPARK-36735][SQL] Adjust overhead of cached relation for DPP

No new revisions were added by this update.

Summary of changes:
 .../dynamicpruning/PartitionPruning.scala        | 27 +++-
 .../spark/sql/DynamicPartitionPruningSuite.scala | 76 --
 2 files changed, 82 insertions(+), 21 deletions(-)
[spark] branch master updated (a927b08 -> 16f1f71)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from a927b08  [SPARK-36726] Upgrade Parquet to 1.12.1
 add 16f1f71  [SPARK-36759][BUILD] Upgrade Scala to 2.12.15

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 6 +++---
 pom.xml                                 | 4 ++--
 project/SparkBuild.scala                | 2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)
[spark] branch branch-3.2 updated: [SPARK-36759][BUILD] Upgrade Scala to 2.12.15
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 2067661  [SPARK-36759][BUILD] Upgrade Scala to 2.12.15
2067661 is described below

commit 2067661869037a1d391587c2f39fce9e101bb247
Author: Dongjoon Hyun
AuthorDate: Wed Sep 15 13:43:25 2021 -0700

[SPARK-36759][BUILD] Upgrade Scala to 2.12.15

### What changes were proposed in this pull request?

This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.

### Why are the changes needed?

Scala 2.12.15 improves compatibility with JDK 17 and 18: https://github.com/scala/scala/releases/tag/v2.12.15
- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer

### Does this PR introduce _any_ user-facing change?

Yes, this is a Scala version change.

### How was this patch tested?

Pass the CIs.

Closes #33999 from dongjoon-hyun/SPARK-36759.

Authored-by: Dongjoon Hyun
Signed-off-by: Dongjoon Hyun
(cherry picked from commit 16f1f71ba5e5b174bd0e964cfad2f466725dd6a5)
Signed-off-by: Dongjoon Hyun
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3 | 6 +++---
 dev/deps/spark-deps-hadoop-3.2-hive-2.3 | 6 +++---
 pom.xml                                 | 4 ++--
 project/SparkBuild.scala                | 2 +-
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 44f210f..e92f8a7 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -212,10 +212,10 @@ py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
 rocksdbjni/6.20.3//rocksdbjni-6.20.3.jar
 scala-collection-compat_2.12/2.1.1//scala-collection-compat_2.12-2.1.1.jar
-scala-compiler/2.12.14//scala-compiler-2.12.14.jar
-scala-library/2.12.14//scala-library-2.12.14.jar
+scala-compiler/2.12.15//scala-compiler-2.12.15.jar
+scala-library/2.12.15//scala-library-2.12.15.jar
 scala-parser-combinators_2.12/1.1.2//scala-parser-combinators_2.12-1.1.2.jar
-scala-reflect/2.12.14//scala-reflect-2.12.14.jar
+scala-reflect/2.12.15//scala-reflect-2.12.15.jar
 scala-xml_2.12/1.2.0//scala-xml_2.12-1.2.0.jar
 shapeless_2.12/2.3.3//shapeless_2.12-2.3.3.jar
 shims/0.9.0//shims-0.9.0.jar

diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index 1e28dd1..b8426e2 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -183,10 +183,10 @@ py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar
 rocksdbjni/6.20.3//rocksdbjni-6.20.3.jar
 scala-collection-compat_2.12/2.1.1//scala-collection-compat_2.12-2.1.1.jar
-scala-compiler/2.12.14//scala-compiler-2.12.14.jar
-scala-library/2.12.14//scala-library-2.12.14.jar
+scala-compiler/2.12.15//scala-compiler-2.12.15.jar
+scala-library/2.12.15//scala-library-2.12.15.jar
 scala-parser-combinators_2.12/1.1.2//scala-parser-combinators_2.12-1.1.2.jar
-scala-reflect/2.12.14//scala-reflect-2.12.14.jar
+scala-reflect/2.12.15//scala-reflect-2.12.15.jar
 scala-xml_2.12/1.2.0//scala-xml_2.12-1.2.0.jar
 shapeless_2.12/2.3.3//shapeless_2.12-2.3.3.jar
 shims/0.9.0//shims-0.9.0.jar

diff --git a/pom.xml b/pom.xml
index e8d3863..ab71632 100644
--- a/pom.xml
+++ b/pom.xml
@@ -160,7 +160,7 @@
     3.4.1
     3.2.2
-    2.12.14
+    2.12.15
     2.12
     2.0.2
     --test
@@ -2652,7 +2652,7 @@
         com.github.ghik
         silencer-plugin_${scala.version}
-        1.7.5
+        1.7.6

diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 2f82b1c..b9068cc 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -208,7 +208,7 @@ object SparkBuild extends PomBuild {
   lazy val compilerWarningSettings: Seq[sbt.Def.Setting[_]] = Seq(
     libraryDependencies ++= {
       if (VersionNumber(scalaVersion.value).matchesSemVer(SemanticSelector("<2.13.2"))) {
-        val silencerVersion = "1.7.5"
+        val silencerVersion = "1.7.6"
         Seq(
           "org.scala-lang.modules" %% "scala-collection-compat" % "2.2.0",
           compilerPlugin("com.github.ghik" % "silencer-plugin" % silencerVersion cross CrossVersion.full),
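To see why lambda deserialization matters to Spark at all: closures are shipped to executors through Java serialization, so every lambda must survive a serialize/deserialize round trip. A minimal sketch of that round trip in plain Scala; per the release notes quoted above, on JDK 17 the `readObject` step could throw `IllegalArgumentException` for code compiled with 2.12.14, and 2.12.15 fixes it.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object LambdaRoundTrip {
  def main(args: Array[String]): Unit = {
    // Scala 2.12 lambdas are serializable by default, which is what lets
    // Spark ship closures from the driver to executors.
    val f: Int => Int = x => x + 1

    val buf = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buf)
    out.writeObject(f)
    out.close()

    // The deserialization step that was fragile on JDK 17 with 2.12.14.
    val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
    val g = in.readObject().asInstanceOf[Int => Int]
    println(g(41)) // 42 if the round trip succeeded
  }
}
```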
[spark] branch branch-3.2 updated: [SPARK-36726] Upgrade Parquet to 1.12.1
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new a7dc824  [SPARK-36726] Upgrade Parquet to 1.12.1
a7dc824 is described below

commit a7dc8242ea913841a2627949fde4cd2953d0b053
Author: Chao Sun
AuthorDate: Wed Sep 15 19:17:34 2021 +0000

[SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1.

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696.

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun
Signed-off-by: DB Tsai
(cherry picked from commit a927b0836bd59d6731b4970957e82ac1e403ddc4)
Signed-off-by: DB Tsai
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3                 |  12 ++--
 dev/deps/spark-deps-hadoop-3.2-hive-2.3                 |  12 ++--
 pom.xml                                                 |   2 +-
 .../resources/test-data/malformed-file-offset.parquet   | Bin 0 -> 37968 bytes
 .../execution/datasources/parquet/ParquetIOSuite.scala  |   6 ++
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index 6ba3c86..44f210f 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -201,12 +201,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar

diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index 326229b..1e28dd1 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -172,12 +172,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar

diff --git a/pom.xml b/pom.xml
index 54025de..e8d3863 100644
--- a/pom.xml
+++ b/pom.xml
@@ -136,7 +136,7 @@
     2.8.0
     10.14.2.0
-    1.12.0
+    1.12.1
     1.6.10
     9.4.43.v20210629
     4.0.3

diff --git a/sql/core/src/test/resources/test-data/malformed-file-offset.parquet b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet
new file mode 100644
index 000..5abeabe
Binary files /dev/null and b/sql/core/src/test/resources/test-data/malform
[spark] branch master updated: [SPARK-36726] Upgrade Parquet to 1.12.1
This is an automated email from the ASF dual-hosted git repository.

dbtsai pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a927b08  [SPARK-36726] Upgrade Parquet to 1.12.1
a927b08 is described below

commit a927b0836bd59d6731b4970957e82ac1e403ddc4
Author: Chao Sun
AuthorDate: Wed Sep 15 19:17:34 2021 +0000

[SPARK-36726] Upgrade Parquet to 1.12.1

### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1.

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696.

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun
Signed-off-by: DB Tsai
---
 dev/deps/spark-deps-hadoop-2.7-hive-2.3                 |  12 ++--
 dev/deps/spark-deps-hadoop-3.2-hive-2.3                 |  12 ++--
 pom.xml                                                 |   2 +-
 .../resources/test-data/malformed-file-offset.parquet   | Bin 0 -> 37968 bytes
 .../execution/datasources/parquet/ParquetIOSuite.scala  |   6 ++
 5 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-2.7-hive-2.3 b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
index e3ddf97..0a83bc0 100644
--- a/dev/deps/spark-deps-hadoop-2.7-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-2.7-hive-2.3
@@ -201,12 +201,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar

diff --git a/dev/deps/spark-deps-hadoop-3.2-hive-2.3 b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
index e8c7ad9..30c901b 100644
--- a/dev/deps/spark-deps-hadoop-3.2-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3.2-hive-2.3
@@ -172,12 +172,12 @@ orc-shims/1.6.10//orc-shims-1.6.10.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.12.0//parquet-column-1.12.0.jar
-parquet-common/1.12.0//parquet-common-1.12.0.jar
-parquet-encoding/1.12.0//parquet-encoding-1.12.0.jar
-parquet-format-structures/1.12.0//parquet-format-structures-1.12.0.jar
-parquet-hadoop/1.12.0//parquet-hadoop-1.12.0.jar
-parquet-jackson/1.12.0//parquet-jackson-1.12.0.jar
+parquet-column/1.12.1//parquet-column-1.12.1.jar
+parquet-common/1.12.1//parquet-common-1.12.1.jar
+parquet-encoding/1.12.1//parquet-encoding-1.12.1.jar
+parquet-format-structures/1.12.1//parquet-format-structures-1.12.1.jar
+parquet-hadoop/1.12.1//parquet-hadoop-1.12.1.jar
+parquet-jackson/1.12.1//parquet-jackson-1.12.1.jar
 protobuf-java/2.5.0//protobuf-java-2.5.0.jar
 py4j/0.10.9.2//py4j-0.10.9.2.jar
 pyrolite/4.30//pyrolite-4.30.jar

diff --git a/pom.xml b/pom.xml
index 02627b4..7b472cd 100644
--- a/pom.xml
+++ b/pom.xml
@@ -136,7 +136,7 @@
     2.8.0
     10.14.2.0
-    1.12.0
+    1.12.1
     1.6.10
     9.4.43.v20210629
     4.0.3

diff --git a/sql/core/src/test/resources/test-data/malformed-file-offset.parquet b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet
new file mode 100644
index 000..5abeabe
Binary files /dev/null and b/sql/core/src/test/resources/test-data/malformed-file-offset.parquet differ

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasource
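A hedged sketch of the write-then-read path that PARQUET-2078 regressed, against the public DataFrame API. The scratch path is an assumption, and this simple case does not by itself reproduce the bug; the new test reads the checked-in malformed-file-offset.parquet file for that.

```scala
import org.apache.spark.sql.SparkSession

object ParquetRoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    val path = "/tmp/parquet-roundtrip" // assumed scratch location

    // Write with whatever parquet-mr version Spark bundles, then read back.
    // PARQUET-2078 was about files that could be written but not read again.
    Seq(1, 2, 3).toDF("id").write.mode("overwrite").parquet(path)
    spark.read.parquet(path).show()

    spark.stop()
  }
}
```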
[spark] branch branch-3.2 updated: [SPARK-36722][PYTHON] Fix Series.update with another in same frame
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 017bce7  [SPARK-36722][PYTHON] Fix Series.update with another in same frame
017bce7 is described below

commit 017bce7b118cd643e652d5a7914294e281b05e6e
Author: dgd-contributor
AuthorDate: Wed Sep 15 11:08:01 2021 -0700

[SPARK-36722][PYTHON] Fix Series.update with another in same frame

### What changes were proposed in this pull request?

Fix Series.update with another Series in the same frame; also add a test for updating with a Series from a different frame.

### Why are the changes needed?

Fix Series.update with another Series in the same frame, to match the pandas behavior:

``` python
>>> pdf = pd.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> pdf
     a    b
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```

### Does this PR introduce _any_ user-facing change?

Before:

```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in update
    combined = combine_frames(self._psdf, other._psdf, how="leftouter")
  File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in combine_frames
    assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```

After:

```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> psdf.a.update(psdf.b)
>>> psdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```

### How was this patch tested?

Unit tests.

Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.

Authored-by: dgd-contributor
Signed-off-by: Takuya UESHIN
(cherry picked from commit c15072cc7397cb59496b7da1153d663d8201865c)
Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/series.py             | 35 ++
 .../pandas/tests/test_ops_on_diff_frames.py |  9 ++
 python/pyspark/pandas/tests/test_series.py  | 32
 3 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 568754c..0eebcc9 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -4498,22 +4498,33 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
         if not isinstance(other, Series):
             raise TypeError("'other' must be a Series")

-        combined = combine_frames(self._psdf, other._psdf, how="leftouter")
+        if same_anchor(self, other):
+            scol = (
+                F.when(other.spark.column.isNotNull(), other.spark.column)
+                .otherwise(self.spark.column)
+                .alias(self._psdf._internal.spark_column_name_for(self._column_label))
+            )
+            internal = self._psdf._internal.with_new_spark_column(
+                self._column_label, scol  # TODO: dtype?
+            )
+            self._psdf._update_internal_frame(internal)
+        else:
+            combined = combine_frames(self._psdf, other._psdf, how="leftouter")

-        this_scol = combined["this"]._internal.spark_column_for(self._column_label)
-        that_scol = combined["that"]._internal.spark_column_for(other._column_label)
+            this_scol = combined["this"]._internal.spark_column_for(self._column_label)
+            that_scol = combined["that"]._internal.spark_column_for(other._column_label)

-        scol = (
-            F.when(that_scol.isNotNull(), that_scol)
-            .otherwise(this_scol)
-            .alias(self._psdf._internal.spark_column_name_for(self._column_label))
-        )
+            scol = (
+                F.when(that_scol.isNotNull(), that_scol)
+                .otherwise(this_scol)
+                .alias(self._psdf._internal.spark_column_name_for(self._column_label))
+            )

-        internal = combined["this"]._internal.with_new_spark_column(
-            self._column_label, scol  # TODO: dtype?
-        )
+            i
[spark] branch master updated: [SPARK-36722][PYTHON] Fix Series.update with another in same frame
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new c15072c  [SPARK-36722][PYTHON] Fix Series.update with another in same frame
c15072c is described below

commit c15072cc7397cb59496b7da1153d663d8201865c
Author: dgd-contributor
AuthorDate: Wed Sep 15 11:08:01 2021 -0700

[SPARK-36722][PYTHON] Fix Series.update with another in same frame

### What changes were proposed in this pull request?

Fix Series.update with another Series in the same frame; also add a test for updating with a Series from a different frame.

### Why are the changes needed?

Fix Series.update with another Series in the same frame, to match the pandas behavior:

``` python
>>> pdf = pd.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> pdf
     a    b
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```

### Does this PR introduce _any_ user-facing change?

Before:

```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in update
    combined = combine_frames(self._psdf, other._psdf, how="leftouter")
  File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in combine_frames
    assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```

After:

```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> psdf.a.update(psdf.b)
>>> psdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```

### How was this patch tested?

Unit tests.

Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.

Authored-by: dgd-contributor
Signed-off-by: Takuya UESHIN
---
 python/pyspark/pandas/series.py             | 35 ++
 .../pandas/tests/test_ops_on_diff_frames.py |  9 ++
 python/pyspark/pandas/tests/test_series.py  | 32
 3 files changed, 64 insertions(+), 12 deletions(-)

diff --git a/python/pyspark/pandas/series.py b/python/pyspark/pandas/series.py
index 8cbfdbf..d72c08d 100644
--- a/python/pyspark/pandas/series.py
+++ b/python/pyspark/pandas/series.py
@@ -4536,22 +4536,33 @@ class Series(Frame, IndexOpsMixin, Generic[T]):
         if not isinstance(other, Series):
             raise TypeError("'other' must be a Series")

-        combined = combine_frames(self._psdf, other._psdf, how="leftouter")
+        if same_anchor(self, other):
+            scol = (
+                F.when(other.spark.column.isNotNull(), other.spark.column)
+                .otherwise(self.spark.column)
+                .alias(self._psdf._internal.spark_column_name_for(self._column_label))
+            )
+            internal = self._psdf._internal.with_new_spark_column(
+                self._column_label, scol  # TODO: dtype?
+            )
+            self._psdf._update_internal_frame(internal)
+        else:
+            combined = combine_frames(self._psdf, other._psdf, how="leftouter")

-        this_scol = combined["this"]._internal.spark_column_for(self._column_label)
-        that_scol = combined["that"]._internal.spark_column_for(other._column_label)
+            this_scol = combined["this"]._internal.spark_column_for(self._column_label)
+            that_scol = combined["that"]._internal.spark_column_for(other._column_label)

-        scol = (
-            F.when(that_scol.isNotNull(), that_scol)
-            .otherwise(this_scol)
-            .alias(self._psdf._internal.spark_column_name_for(self._column_label))
-        )
+            scol = (
+                F.when(that_scol.isNotNull(), that_scol)
+                .otherwise(this_scol)
+                .alias(self._psdf._internal.spark_column_name_for(self._column_label))
+            )

-        internal = combined["this"]._internal.with_new_spark_column(
-            self._column_label, scol  # TODO: dtype?
-        )
+            internal = combined["this"]._internal.with_new_spark_column(
+                self._column_label, scol  # TODO: dtyp
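The heart of the fix is a single column expression: Series.update keeps the existing value only where the other series is null, i.e. when(other.isNotNull, other).otherwise(self), which is exactly coalesce(other, self). A hedged illustration in the Scala DataFrame API (not the pandas-on-Spark code itself):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

object UpdateSemantics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
    import spark.implicits._

    // A few rows in the spirit of the pandas frame in the commit message:
    // column b updates column a.
    val df = Seq(
      (Some(2.0), Some(5.0)),
      (Some(3.0), Option.empty[Double]),
      (Option.empty[Double], Some(0.0))
    ).toDF("a", "b")

    // a.update(b): take b where it is non-null, otherwise keep a.
    df.withColumn("a", coalesce(col("b"), col("a"))).show()

    spark.stop()
  }
}
```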
[spark] branch branch-3.1 updated: [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 682442a  [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
682442a is described below

commit 682442a5aa46d50b6b4b74537fccf04fdd33fe0f
Author: Angerszh
AuthorDate: Wed Sep 15 22:31:46 2021 +0800

[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?

For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [false], but it should return [true].
The issue is caused by the fact that `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes. arrays_overlap now treats equal `NaN` values as overlapping.

### How was this patch tested?

Added UT.

Closes #34006 from AngersZh/SPARK-36755.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan
(cherry picked from commit b665782f0d3729928be4ca897ec2eb990b714879)
Signed-off-by: Wenchen Fan
---
 .../sql/catalyst/expressions/collectionOperations.scala    |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala  | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 7b231fe..9f922d1 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -1262,12 +1262,12 @@ case class ArraysOverlap(left: Expression, right: Expression)
       (arr2, arr1)
     }
     if (smaller.numElements() > 0) {
-      val smallestSet = new mutable.HashSet[Any]
+      val smallestSet = new java.util.HashSet[Any]()
       smaller.foreach(elementType, (_, v) =>
         if (v == null) {
           hasNull = true
         } else {
-          smallestSet += v
+          smallestSet.add(v)
         })
       bigger.foreach(elementType, (_, v1) =>
         if (v1 == null) {

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 25e40c4..69a24d9 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -1965,4 +1965,17 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))),
       Seq(Float.NaN, null, 1f))
   }
+
+  test("SPARK-36755: ArraysOverlap should handle duplicated Double.NaN and Float.NaN") {
+    checkEvaluation(ArraysOverlap(
+      Literal.apply(Array(Double.NaN, 1d)), Literal.apply(Array(Double.NaN))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(Double.NaN, null), ArrayType(DoubleType)),
+      Literal.create(Seq(Double.NaN, null, 1d), ArrayType(DoubleType))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.apply(Array(Float.NaN)), Literal.apply(Array(Float.NaN, 1f))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(Float.NaN, null), ArrayType(FloatType)),
+      Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))), true)
+  }
 }
[spark] branch branch-3.2 updated: [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.2
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.2 by this push:
     new 75bffd9  [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
75bffd9 is described below

commit 75bffd972d53488a732e2e21695e3283cb32e9dc
Author: Angerszh
AuthorDate: Wed Sep 15 22:31:46 2021 +0800

[SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?

For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [false], but it should return [true].
The issue is caused by the fact that `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?

Fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes. arrays_overlap now treats equal `NaN` values as overlapping.

### How was this patch tested?

Added UT.

Closes #34006 from AngersZh/SPARK-36755.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan
(cherry picked from commit b665782f0d3729928be4ca897ec2eb990b714879)
Signed-off-by: Wenchen Fan
---
 .../sql/catalyst/expressions/collectionOperations.scala    |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala  | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index 47b2719..73a45d7 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -1297,12 +1297,12 @@ case class ArraysOverlap(left: Expression, right: Expression)
       (arr2, arr1)
     }
     if (smaller.numElements() > 0) {
-      val smallestSet = new mutable.HashSet[Any]
+      val smallestSet = new java.util.HashSet[Any]()
       smaller.foreach(elementType, (_, v) =>
         if (v == null) {
           hasNull = true
         } else {
-          smallestSet += v
+          smallestSet.add(v)
         })
       bigger.foreach(elementType, (_, v1) =>
         if (v1 == null) {

diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
index 496a4c3..caca24a 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala
@@ -2309,4 +2309,17 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper
       Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))),
       Seq(Float.NaN, null, 1f))
   }
+
+  test("SPARK-36755: ArraysOverlap should handle duplicated Double.NaN and Float.NaN") {
+    checkEvaluation(ArraysOverlap(
+      Literal.apply(Array(Double.NaN, 1d)), Literal.apply(Array(Double.NaN))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(Double.NaN, null), ArrayType(DoubleType)),
+      Literal.create(Seq(Double.NaN, null, 1d), ArrayType(DoubleType))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.apply(Array(Float.NaN)), Literal.apply(Array(Float.NaN, 1f))), true)
+    checkEvaluation(ArraysOverlap(
+      Literal.create(Seq(Float.NaN, null), ArrayType(FloatType)),
+      Literal.create(Seq(Float.NaN, null, 1f), ArrayType(FloatType))), true)
+  }
 }
[spark] branch master updated (6380859 -> b665782)
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git.

from 6380859  [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
 add b665782  [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN

No new revisions were added by this update.

Summary of changes:
 .../sql/catalyst/expressions/collectionOperations.scala    |  4 ++--
 .../catalyst/expressions/CollectionExpressionsSuite.scala  | 13 +++++++++++++
 2 files changed, 15 insertions(+), 2 deletions(-)
[spark] branch branch-3.1 updated: [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch branch-3.1
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.1 by this push:
     new 4a266f0  [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
4a266f0 is described below

commit 4a266f0b59e4432c09ac5bc1bdd266a84b94a0b9
Author: Angerszh
AuthorDate: Wed Sep 15 22:04:09 2021 +0800

[SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN

### What changes were proposed in this pull request?

Per https://github.com/apache/spark/pull/33955#discussion_r708570515, use a normalized NaN.

### Why are the changes needed?

Use a normalized NaN for duplicated NaN values.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #34003 from AngersZh/SPARK-36702-FOLLOWUP.

Authored-by: Angerszh
Signed-off-by: Wenchen Fan
(cherry picked from commit 638085953f931f98241856c9f652e5f15202fcc0)
Signed-off-by: Wenchen Fan
---
 .../sql/catalyst/expressions/collectionOperations.scala  | 13 -
 .../scala/org/apache/spark/sql/util/SQLOpenHashSet.scala |  8
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
index b829ac0..7b231fe 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala
@@ -3370,6 +3370,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
         val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any]
         val hs = new SQLOpenHashSet[Any]()
         val isNaN = SQLOpenHashSet.isNaN(elementType)
+        val valueNaN = SQLOpenHashSet.valueNaN(elementType)
         Seq(array1, array2).foreach { array =>
           var i = 0
           while (i < array.numElements()) {
@@ -3382,7 +3383,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
               val elem = array.get(i, elementType)
               if (isNaN(elem)) {
                 if (!hs.containsNaN) {
-                  arrayBuffer += elem
+                  arrayBuffer += valueNaN
                   hs.addNaN
                 }
               } else {
@@ -3480,16 +3481,18 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi
     def withNaNCheck(body: String): String = {
       (elementType match {
-        case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)")
-        case FloatType => Some(s"java.lang.Float.isNaN((float)$value)")
+        case DoubleType =>
+          Some((s"java.lang.Double.isNaN((double)$value)", "java.lang.Double.NaN"))
+        case FloatType =>
+          Some((s"java.lang.Float.isNaN((float)$value)", "java.lang.Float.NaN"))
         case _ => None
-      }).map { isNaN =>
+      }).map { case (isNaN, valueNaN) =>
         s"""
            |if ($isNaN) {
            |  if (!$hashSet.containsNaN()) {
            |    $size++;
            |    $hashSet.addNaN();
-           |    $builder.$$plus$$eq($value);
+           |    $builder.$$plus$$eq($valueNaN);
            |  }
            |} else {
            |  $body

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
index 5ffe733..083cfdd 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala
@@ -69,4 +69,12 @@ object SQLOpenHashSet {
       case _ => (_: Any) => false
     }
   }
+
+  def valueNaN(dataType: DataType): Any = {
+    dataType match {
+      case DoubleType => java.lang.Double.NaN
+      case FloatType => java.lang.Float.NaN
+      case _ => null
+    }
+  }
 }
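Why insert a normalized NaN instead of the first NaN encountered? IEEE 754 admits many NaN bit patterns, and the union's output should not depend on which payload arrived first. A small self-contained illustration in plain Scala; the exotic bit pattern below is an arbitrary example, not anything from the commit:

```scala
object NaNNormalization {
  def main(args: Array[String]): Unit = {
    // The canonical NaN (what java.lang.Double.NaN uses) ...
    val canonical = java.lang.Double.doubleToRawLongBits(Double.NaN)
    // ... and a different NaN payload: exponent all ones, non-zero mantissa.
    val exotic = java.lang.Double.longBitsToDouble(0x7ff8000000000001L)

    println(exotic.isNaN) // true: still a NaN
    println(java.lang.Double.doubleToRawLongBits(exotic) == canonical) // false: different bits

    // Normalization, as in the follow-up: any NaN maps to the canonical
    // constant before being appended to the result buffer.
    val normalized = if (exotic.isNaN) Double.NaN else exotic
    println(java.lang.Double.doubleToRawLongBits(normalized) == canonical) // true
  }
}
```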
[spark] branch branch-3.2 updated: [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.2 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.2 by this push: new e641556 [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN e641556 is described below commit e64155691f23d54d650353eb3e5a083cb10dbfc2 Author: Angerszh AuthorDate: Wed Sep 15 22:04:09 2021 +0800 [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? According to https://github.com/apache/spark/pull/33955#discussion_r708570515 use normalized NaN ### Why are the changes needed? Use normalized NaN for duplicated NaN value ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Exiting UT Closes #34003 from AngersZh/SPARK-36702-FOLLOWUP. Authored-by: Angerszh Signed-off-by: Wenchen Fan (cherry picked from commit 638085953f931f98241856c9f652e5f15202fcc0) Signed-off-by: Wenchen Fan --- .../sql/catalyst/expressions/collectionOperations.scala | 13 - .../scala/org/apache/spark/sql/util/SQLOpenHashSet.scala| 8 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index e5620a1..47b2719 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -3578,6 +3578,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] val hs = new SQLOpenHashSet[Any]() val isNaN = SQLOpenHashSet.isNaN(elementType) +val valueNaN = SQLOpenHashSet.valueNaN(elementType) Seq(array1, array2).foreach { array => var i = 0 while (i < array.numElements()) { @@ -3590,7 +3591,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi val elem = array.get(i, elementType) if (isNaN(elem)) { if (!hs.containsNaN) { - arrayBuffer += elem + arrayBuffer += valueNaN hs.addNaN } } else { @@ -3688,16 +3689,18 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi def withNaNCheck(body: String): String = { (elementType match { -case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)") -case FloatType => Some(s"java.lang.Float.isNaN((float)$value)") +case DoubleType => + Some((s"java.lang.Double.isNaN((double)$value)", "java.lang.Double.NaN")) +case FloatType => + Some((s"java.lang.Float.isNaN((float)$value)", "java.lang.Float.NaN")) case _ => None - }).map { isNaN => + }).map { case (isNaN, valueNaN) => s""" |if ($isNaN) { | if (!$hashSet.containsNaN()) { | $size++; | $hashSet.addNaN(); - | $builder.$$plus$$eq($value); + | $builder.$$plus$$eq($valueNaN); | } |} else { | $body diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala index 5ffe733..083cfdd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala @@ -69,4 +69,12 @@ object SQLOpenHashSet { case _ => (_: Any) => false } } + + def valueNaN(dataType: DataType): Any = { 
+dataType match { + case DoubleType => java.lang.Double.NaN + case FloatType => java.lang.Float.NaN + case _ => null +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
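As a minimal sketch of the user-visible behavior this follow-up normalizes (an illustration assuming a spark-shell session with a SparkSession named `spark`; not part of the commit itself):

```scala
// array_union is the public API backed by the ArrayUnion expression patched
// above. With this fix, when both inputs contain NaN, only one canonical
// java.lang.Double.NaN is kept in the result, instead of whichever NaN
// bit pattern happened to be encountered first.
import spark.implicits._
import org.apache.spark.sql.functions.array_union

val df = Seq((Array(Double.NaN, 1.0), Array(Double.NaN, 2.0))).toDF("a", "b")
df.select(array_union($"a", $"b")).show(false)
// Expected: [NaN, 1.0, 2.0] -- a single normalized NaN
```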
[spark] branch master updated: [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6380859 [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN 6380859 is described below commit 638085953f931f98241856c9f652e5f15202fcc0 Author: Angerszh AuthorDate: Wed Sep 15 22:04:09 2021 +0800 [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN ### What changes were proposed in this pull request? According to https://github.com/apache/spark/pull/33955#discussion_r708570515, use a normalized NaN. ### Why are the changes needed? Use a normalized NaN for duplicated NaN values. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs. Closes #34003 from AngersZh/SPARK-36702-FOLLOWUP. Authored-by: Angerszh Signed-off-by: Wenchen Fan --- .../sql/catalyst/expressions/collectionOperations.scala | 13 - .../scala/org/apache/spark/sql/util/SQLOpenHashSet.scala| 8 2 files changed, 16 insertions(+), 5 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala index e5620a1..47b2719 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/collectionOperations.scala @@ -3578,6 +3578,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi val arrayBuffer = new scala.collection.mutable.ArrayBuffer[Any] val hs = new SQLOpenHashSet[Any]() val isNaN = SQLOpenHashSet.isNaN(elementType) +val valueNaN = SQLOpenHashSet.valueNaN(elementType) Seq(array1, array2).foreach { array => var i = 0 while (i < array.numElements()) { @@ -3590,7 +3591,7 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi val elem = array.get(i, elementType) if (isNaN(elem)) { if (!hs.containsNaN) { - arrayBuffer += elem + arrayBuffer += valueNaN hs.addNaN } } else { @@ -3688,16 +3689,18 @@ case class ArrayUnion(left: Expression, right: Expression) extends ArrayBinaryLi def withNaNCheck(body: String): String = { (elementType match { -case DoubleType => Some(s"java.lang.Double.isNaN((double)$value)") -case FloatType => Some(s"java.lang.Float.isNaN((float)$value)") +case DoubleType => + Some((s"java.lang.Double.isNaN((double)$value)", "java.lang.Double.NaN")) +case FloatType => + Some((s"java.lang.Float.isNaN((float)$value)", "java.lang.Float.NaN")) case _ => None - }).map { isNaN => + }).map { case (isNaN, valueNaN) => s""" |if ($isNaN) { | if (!$hashSet.containsNaN()) { | $size++; | $hashSet.addNaN(); - | $builder.$$plus$$eq($value); + | $builder.$$plus$$eq($valueNaN); | } |} else { | $body diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala index 5ffe733..083cfdd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/util/SQLOpenHashSet.scala @@ -69,4 +69,12 @@ object SQLOpenHashSet { case _ => (_: Any) => false } } + + def valueNaN(dataType: DataType): Any = { +dataType match { + case DoubleType => java.lang.Double.NaN + case FloatType => java.lang.Float.NaN + case
_ => null +} + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
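Why store a canonical NaN rather than the element that was encountered? IEEE 754 admits many distinct NaN bit patterns, all of which satisfy isNaN yet differ at the bit level, so keeping the first NaN seen would surface an arbitrary representation. A short self-contained sketch of that fact (an illustration, not from the commit):

```scala
// Any double with an all-ones exponent and a non-zero mantissa is a NaN,
// so there are many NaN bit patterns besides the canonical Double.NaN.
val customNaN = java.lang.Double.longBitsToDouble(0x7ff8000000000001L)
println(customNaN.isNaN) // true: it is a NaN
println(java.lang.Double.doubleToRawLongBits(customNaN) ==
  java.lang.Double.doubleToRawLongBits(Double.NaN)) // false: different bits
```

This is why `SQLOpenHashSet.valueNaN` returns java.lang.Double.NaN / java.lang.Float.NaN: the value written into the result is always the canonical representation.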
[spark] branch master updated: [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R
This is an automated email from the ASF dual-hosted git repository. sarutak pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0666f5c [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R 0666f5c is described below commit 0666f5c00393acccecdd82d3794e5a2b88f3210b Author: Leona Yoda AuthorDate: Wed Sep 15 16:27:13 2021 +0900 [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R ### What changes were proposed in this pull request? octet_length: calculate the byte length of strings. bit_length: calculate the bit length of strings. These two string-related functions were previously implemented only in Spark SQL, not in Scala, Python, or R. ### Why are the changes needed? These functions would be useful for users handling multi-byte characters who mainly work with Scala, Python, or R. ### Does this PR introduce _any_ user-facing change? Yes. Users can call the octet_length/bit_length APIs from Scala (DataFrame), Python, and R. ### How was this patch tested? Unit tests. Closes #33992 from yoda-mon/add-bit-octet-length. Authored-by: Leona Yoda Signed-off-by: Kousuke Saruta --- R/pkg/NAMESPACE| 2 + R/pkg/R/functions.R| 26 +++ R/pkg/R/generics.R | 8 R/pkg/tests/fulltests/test_sparkSQL.R | 11 + python/docs/source/reference/pyspark.sql.rst | 2 + python/pyspark/sql/functions.py| 52 ++ python/pyspark/sql/functions.pyi | 2 + python/pyspark/sql/tests/test_functions.py | 14 +- .../scala/org/apache/spark/sql/functions.scala | 16 +++ .../apache/spark/sql/StringFunctionsSuite.scala| 52 ++ 10 files changed, 184 insertions(+), 1 deletion(-) diff --git a/R/pkg/NAMESPACE b/R/pkg/NAMESPACE index 7fa8085..686a49e 100644 --- a/R/pkg/NAMESPACE +++ b/R/pkg/NAMESPACE @@ -243,6 +243,7 @@ exportMethods("%<=>%", "base64", "between", "bin", + "bit_length", "bitwise_not", "bitwiseNOT", "bround", @@ -364,6 +365,7 @@ exportMethods("%<=>%", "not", "nth_value", "ntile", + "octet_length", "otherwise", "over", "overlay", diff --git a/R/pkg/R/functions.R b/R/pkg/R/functions.R index 62066da1..f0768c7 100644 --- a/R/pkg/R/functions.R +++ b/R/pkg/R/functions.R @@ -647,6 +647,19 @@ setMethod("bin", }) #' @details +#' \code{bit_length}: Calculates the bit length for the specified string column. +#' +#' @rdname column_string_functions +#' @aliases bit_length bit_length,Column-method +#' @note bit_length since 3.3.0 +setMethod("bit_length", + signature(x = "Column"), + function(x) { +jc <- callJStatic("org.apache.spark.sql.functions", "bit_length", x@jc) +column(jc) + }) + +#' @details #' \code{bitwise_not}: Computes bitwise NOT. #' #' @rdname column_nonaggregate_functions @@ -1570,6 +1583,19 @@ setMethod("negate", }) #' @details +#' \code{octet_length}: Calculates the byte length for the specified string column. +#' +#' @rdname column_string_functions +#' @aliases octet_length octet_length,Column-method +#' @note octet_length since 3.3.0 +setMethod("octet_length", + signature(x = "Column"), + function(x) { +jc <- callJStatic("org.apache.spark.sql.functions", "octet_length", x@jc) +column(jc) + }) + +#' @details #' \code{overlay}: Overlay the specified portion of \code{x} with \code{replace}, #' starting from byte position \code{pos} of \code{src} and proceeding for #' \code{len} bytes.
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R index 9ebea3f..1abde65 100644 --- a/R/pkg/R/generics.R +++ b/R/pkg/R/generics.R @@ -884,6 +884,10 @@ setGeneric("base64", function(x) { standardGeneric("base64") }) #' @name NULL setGeneric("bin", function(x) { standardGeneric("bin") }) +#' @rdname column_string_functions +#' @name NULL +setGeneric("bit_length", function(x, ...) { standardGeneric("bit_length") }) + #' @rdname column_nonaggregate_functions #' @name NULL setGeneric("bitwise_not", function(x) { standardGeneric("bitwise_not") }) @@ -1232,6 +1236,10 @@ setGeneric("n_distinct", function(x, ...) { standardGeneric("n_distinct") }) #' @rdname column_string_functions #' @name NULL +setGeneric("octet_length", function(x, ...) { standardGeneric("octet_length") }) + +#' @rdname column_string_functions +#' @name NULL setGeneric("overlay", function(x, replace, pos, ...) { standardGeneric("overlay") }) #' @rdna
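A minimal usage sketch of the two new Scala APIs described above (an illustration, not part of the commit; it assumes a spark-shell session with a SparkSession named `spark`, and uses the bit_length/octet_length functions this patch adds to sql/functions.scala):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{bit_length, length, octet_length}

// "cat" is 3 ASCII characters and 3 bytes (24 bits); "猫" is a single
// character but 3 bytes (24 bits) in UTF-8 -- the multi-byte case these
// APIs are meant to help with.
val df = Seq("cat", "猫").toDF("s")
df.select(length($"s"), octet_length($"s"), bit_length($"s")).show()
// length: 3 and 1 | octet_length: 3 and 3 | bit_length: 24 and 24
```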