[spark-website] branch asf-site updated: Add note on generative tooling to developer tools
This is an automated email from the ASF dual-hosted git repository. srowen pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new fc89ca1ed2 Add note on generative tooling to developer tools fc89ca1ed2 is described below commit fc89ca1ed20551c66dc31ebbe28664d12689bd13 Author: zero323 AuthorDate: Mon Aug 14 21:04:28 2023 -0500 Add note on generative tooling to developer tools This PR adds notes on generative tooling and a link to the relevant ASF policy. As requested in comments to https://github.com/apache/spark/pull/42469 Author: zero323 Closes #472 from zero323/SPARK-44782-generative-tooling-notes. --- developer-tools.md | 9 + site/developer-tools.html | 9 + 2 files changed, 18 insertions(+) diff --git a/developer-tools.md b/developer-tools.md index e0a1844ae7..73e708116e 100644 --- a/developer-tools.md +++ b/developer-tools.md @@ -549,3 +549,12 @@ When running Spark tests through SBT, add `javaOptions in Test += "-agentpath:/p to `SparkBuild.scala` to launch the tests with the YourKit profiler agent enabled. The platform-specific paths to the profiler agents are listed in the <a href="https://www.yourkit.com/docs/java/help/agent.jsp">YourKit documentation</a>. + +Generative tooling usage + +In general, the ASF allows contributions co-authored using generative AI tools. However, there are several considerations when you submit a patch containing generated content. + +Foremost, you are required to disclose usage of such a tool. Furthermore, you are responsible for ensuring that the terms and conditions of the tool in question are +compatible with usage in an Open Source project and that inclusion of the generated content doesn't pose a risk of copyright violation. + +Please refer to <a href="https://www.apache.org/legal/generative-tooling.html">The ASF Generative Tooling Guidance</a> for details and developments.
diff --git a/site/developer-tools.html b/site/developer-tools.html index a43786ff91..de94619481 100644 --- a/site/developer-tools.html +++ b/site/developer-tools.html @@ -657,6 +657,15 @@ to SparkBuild.scala to The platform-specific paths to the profiler agents are listed in the <a href="https://www.yourkit.com/docs/java/help/agent.jsp">YourKit documentation</a>. +Generative tooling usage + +In general, the ASF allows contributions co-authored using generative AI tools. However, there are several considerations when you submit a patch containing generated content. + +Foremost, you are required to disclose usage of such a tool. Furthermore, you are responsible for ensuring that the terms and conditions of the tool in question are +compatible with usage in an Open Source project and that inclusion of the generated content doesn't pose a risk of copyright violation. + +Please refer to <a href="https://www.apache.org/legal/generative-tooling.html">The ASF Generative Tooling Guidance</a> for details and developments. + - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] srowen closed pull request #472: Add note on generative tooling to developer tools
srowen closed pull request #472: Add note on generative tooling to developer tools URL: https://github.com/apache/spark-website/pull/472 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[spark] branch branch-3.5 updated: [SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin related configuration from the `connect/connect-client-jvm` module
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 9336a155e0a [SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin related configuration from the `connect/connect-client-jvm` module 9336a155e0a is described below commit 9336a155e0a7321be50580d5fb5351c291209166 Author: yangjie01 AuthorDate: Tue Aug 15 11:42:25 2023 +0800 [SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin related configuration from the `connect/connect-client-jvm` module ### What changes were proposed in this pull request? This PR removes the `grpc-java` plugin related configuration from the `connect/connect-client-jvm` module for SBT, because the code generation related to `grpc-java` is handled by the `connect-common` module. This PR does not touch the Maven `pom.xml` because the corresponding Maven modules never had that configuration. ### Why are the changes needed? Clean up unnecessary SBT configuration. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions Closes #42477 from LuciferYang/SPARK-44796.
Authored-by: yangjie01 Signed-off-by: yangjie01 (cherry picked from commit dc5fdb4aa1cd8ae727fc6d9dce434cbd1f0ddafc) Signed-off-by: yangjie01 --- project/SparkBuild.scala | 19 ++- 1 file changed, 2 insertions(+), 17 deletions(-) diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala index bd65d3c4bd4..331ebe0c05a 100644 --- a/project/SparkBuild.scala +++ b/project/SparkBuild.scala @@ -774,7 +774,6 @@ object SparkConnect { SbtPomKeys.effectivePom.value.getProperties.get( "guava.failureaccess.version").asInstanceOf[String] Seq( -"io.grpc" % "protoc-gen-grpc-java" % BuildCommons.gprcVersion asProtocPlugin(), "com.google.guava" % "guava" % guavaVersion, "com.google.guava" % "failureaccess" % guavaFailureaccessVersion, "com.google.protobuf" % "protobuf-java" % protoVersion % "protobuf" @@ -840,14 +839,7 @@ object SparkConnect { case m if m.toLowerCase(Locale.ROOT).endsWith(".proto") => MergeStrategy.discard case _ => MergeStrategy.first } - ) ++ { -Seq( - (Compile / PB.targets) := Seq( -PB.gens.java -> (Compile / sourceManaged).value, -PB.gens.plugin("grpc-java") -> (Compile / sourceManaged).value - ) -) - } + ) } object SparkConnectClient { @@ -924,14 +916,7 @@ object SparkConnectClient { case m if m.toLowerCase(Locale.ROOT).endsWith(".proto") => MergeStrategy.discard case _ => MergeStrategy.first } - ) ++ { -Seq( - (Compile / PB.targets) := Seq( -PB.gens.java -> (Compile / sourceManaged).value, -PB.gens.plugin("grpc-java") -> (Compile / sourceManaged).value - ) -) - } + ) } object SparkProtobuf { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
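For context, the duplicated per-module setting being removed is visible in the diff above; in sbt-protoc, `PB.targets` tells protoc which code generators to run and where to emit the sources. A sketch of what the removed block configured:

```scala
// Removed from the SparkConnect and SparkConnectClient SBT module settings:
// generate plain Java protobuf classes plus grpc-java stubs into sourceManaged.
// After this change only the connect-common module carries this configuration,
// so the gRPC code generation is no longer duplicated per module.
(Compile / PB.targets) := Seq(
  PB.gens.java -> (Compile / sourceManaged).value,
  PB.gens.plugin("grpc-java") -> (Compile / sourceManaged).value
)
```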
[spark] branch branch-3.5 updated: [SPARK-44719][SQL][3.5] Fix NoClassDefFoundError when using Hive UDF
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new e12a79b6ebe [SPARK-44719][SQL][3.5] Fix NoClassDefFoundError when using Hive UDF e12a79b6ebe is described below commit e12a79b6ebe0bd5ef9220e0e0979d6579638f3ae Author: Yuming Wang AuthorDate: Tue Aug 15 11:42:23 2023 +0800 [SPARK-44719][SQL][3.5] Fix NoClassDefFoundError when using Hive UDF ### What changes were proposed in this pull request? This PR partially reverts SPARK-43225. Added jackson-core-asl and jackson-mapper-asl back to pre-built distributions. ### Why are the changes needed? Fix `NoClassDefFoundError` when using Hive UDF: ``` spark-sql (default)> add jar /Users/yumwang/Downloads/HiveUDFs-1.0-SNAPSHOT.jar; Time taken: 0.413 seconds spark-sql (default)> CREATE TEMPORARY FUNCTION long_to_ip as 'net.petrabarus.hiveudfs.LongToIP'; Time taken: 0.038 seconds spark-sql (default)> SELECT long_to_ip(2130706433L) FROM range(10); 23/08/08 20:17:58 ERROR SparkSQLDriver: Failed in [SELECT long_to_ip(2130706433L) FROM range(10)] java.lang.NoClassDefFoundError: org/codehaus/jackson/map/type/TypeFactory at org.apache.hadoop.hive.ql.udf.UDFJson.<clinit>(UDFJson.java:64) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:348) ... ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #42447 from wangyum/SPARK-44719-branch-3.5.
Authored-by: Yuming Wang Signed-off-by: Kent Yao --- core/pom.xml | 8 dev/deps/spark-deps-hadoop-3-hive-2.3 | 2 ++ pom.xml | 13 +++-- sql/hive-thriftserver/pom.xml | 9 - 4 files changed, 21 insertions(+), 11 deletions(-) diff --git a/core/pom.xml b/core/pom.xml index 674005c96ee..eb4a563f1f3 100644 --- a/core/pom.xml +++ b/core/pom.xml @@ -481,6 +481,14 @@ commons-logging commons-logging + + org.codehaus.jackson + jackson-mapper-asl + + + org.codehaus.jackson + jackson-core-asl + com.fasterxml.jackson.core jackson-core diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 b/dev/deps/spark-deps-hadoop-3-hive-2.3 index d182f39f882..14af8ea340f 100644 --- a/dev/deps/spark-deps-hadoop-3-hive-2.3 +++ b/dev/deps/spark-deps-hadoop-3-hive-2.3 @@ -100,11 +100,13 @@ ini4j/0.5.4//ini4j-0.5.4.jar istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar ivy/2.5.1//ivy-2.5.1.jar jackson-annotations/2.15.2//jackson-annotations-2.15.2.jar +jackson-core-asl/1.9.13//jackson-core-asl-1.9.13.jar jackson-core/2.15.2//jackson-core-2.15.2.jar jackson-databind/2.15.2//jackson-databind-2.15.2.jar jackson-dataformat-cbor/2.15.2//jackson-dataformat-cbor-2.15.2.jar jackson-dataformat-yaml/2.15.2//jackson-dataformat-yaml-2.15.2.jar jackson-datatype-jsr310/2.15.2//jackson-datatype-jsr310-2.15.2.jar +jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar jackson-module-scala_2.12/2.15.2//jackson-module-scala_2.12-2.15.2.jar jakarta.annotation-api/1.3.5//jakarta.annotation-api-1.3.5.jar jakarta.inject/2.6.1//jakarta.inject-2.6.1.jar diff --git a/pom.xml b/pom.xml index 9438a112c24..d0c3bfc361a 100644 --- a/pom.xml +++ b/pom.xml @@ -1321,6 +1321,10 @@ asm asm + +org.codehaus.jackson +jackson-mapper-asl + org.ow2.asm asm @@ -1821,12 +1825,17 @@ - + +org.codehaus.jackson +jackson-core-asl +${codehaus.jackson.version} +${hadoop.deps.scope} + org.codehaus.jackson jackson-mapper-asl ${codehaus.jackson.version} -test +${hadoop.deps.scope} ${hive.group} diff --git a/sql/hive-thriftserver/pom.xml 
b/sql/hive-thriftserver/pom.xml index 7bcfb8e1908..3659a0f846a 100644 --- a/sql/hive-thriftserver/pom.xml +++ b/sql/hive-thriftserver/pom.xml @@ -147,15 +147,6 @@ org.apache.httpcomponents httpcore - - - org.codehaus.jackson - jackson-mapper-asl - test - target/scala-${scala.binary.version}/classes
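The failure mode above can be checked without a full Hive setup: Hive's `UDFJson` statically references `org.codehaus.jackson.map.type.TypeFactory`, so class initialization fails with `NoClassDefFoundError` when the codehaus Jackson jars are missing from the distribution. A minimal, hypothetical classpath probe (not part of the patch) illustrating the check:

```scala
import scala.util.Try

// Returns true when `className` can be resolved by the current classloader.
// On a pre-built distribution missing jackson-mapper-asl, probing
// "org.codehaus.jackson.map.type.TypeFactory" would return false, matching
// the NoClassDefFoundError raised when the Hive UDF class initializes.
def isOnClasspath(className: String): Boolean =
  Try(Class.forName(className)).isSuccess
```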
[spark] branch master updated (c9ff7025399 -> dc5fdb4aa1c)
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from c9ff7025399 [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should not fail add dc5fdb4aa1c [SPARK-44796][BUILD][CONNECT] Remove `grpc-java` plugin related configuration from the `connect/connect-client-jvm` module No new revisions were added by this update. Summary of changes: project/SparkBuild.scala | 19 ++- 1 file changed, 2 insertions(+), 17 deletions(-)
[spark] branch branch-3.5 updated: [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues`
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 24bf4297f82 [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues` 24bf4297f82 is described below commit 24bf4297f8280b1db42026197282542863420b98 Author: zeruibao AuthorDate: Mon Aug 14 19:36:40 2023 -0700 [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues` ### What changes were proposed in this pull request? Revert my last PR https://github.com/apache/spark/pull/41052 that causes AVRO read performance regression since I change the code structure. ### Why are the changes needed? Remove performance regression ### How was this patch tested? Unit test Closes #42458 from zeruibao/revert-avro-change. Authored-by: zeruibao Signed-off-by: Gengliang Wang (cherry picked from commit 46580ab4cb02390ba71dace1235015749f048fff) Signed-off-by: Gengliang Wang --- .../src/main/resources/error/error-classes.json| 10 - .../apache/spark/sql/avro/AvroDeserializer.scala | 456 + .../org/apache/spark/sql/avro/AvroSuite.scala | 161 docs/sql-error-conditions.md | 14 - docs/sql-migration-guide.md| 1 - .../spark/sql/errors/QueryCompilationErrors.scala | 30 -- .../org/apache/spark/sql/internal/SQLConf.scala| 12 - 7 files changed, 189 insertions(+), 495 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 74542f2b914..5fdb2fbe77e 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -69,16 +69,6 @@ } } }, - "AVRO_INCORRECT_TYPE" : { -"message" : [ - "Cannot convert Avro to SQL because the original encoded data type is , however you're trying to read the field as , which would lead to an incorrect answer. 
To allow reading this field, enable the SQL configuration: ." -] - }, - "AVRO_LOWER_PRECISION" : { -"message" : [ - "Cannot convert Avro to SQL because the original encoded data type is , however you're trying to read the field as , which leads to data being read as null. Please provide a wider decimal type to get the correct result. To allow reading null to this field, enable the SQL configuration: ." -] - }, "BATCH_METADATA_NOT_FOUND" : { "message" : [ "Unable to find batch ." diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala index d4d34a891e9..a78ee89a3e9 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala @@ -35,9 +35,8 @@ import org.apache.spark.sql.catalyst.expressions.{SpecificInternalRow, UnsafeArr import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, DateTimeUtils, GenericArrayData} import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY import org.apache.spark.sql.catalyst.util.RebaseDateTime.RebaseSpec -import org.apache.spark.sql.errors.QueryCompilationErrors import org.apache.spark.sql.execution.datasources.DataSourceUtils -import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf} +import org.apache.spark.sql.internal.LegacyBehaviorPolicy import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String @@ -118,268 +117,178 @@ private[sql] class AvroDeserializer( val incompatibleMsg = errorPrefix + s"schema is incompatible (avroType = $avroType, sqlType = ${catalystType.sql})" -val confKey = SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA -val preventReadingIncorrectType = !SQLConf.get.getConf(confKey) +(avroType.getType, catalystType) match { + case (NULL, NullType) => (updater, ordinal, _) => +updater.setNullAt(ordinal) -val logicalDataType = 
SchemaConverters.toSqlType(avroType).dataType -avroType.getType match { - case NULL => -(logicalDataType, catalystType) match { - case (_, NullType) => (updater, ordinal, _) => -updater.setNullAt(ordinal) - case _ => throw new IncompatibleSchemaException(incompatibleMsg) -} // TODO: we can avoid boxing if future version of avro provide primitive accessors. - case BOOLEAN => -(logicalDataType, catalystType) match { - case (_, BooleanType) => (updater, ordinal, value) => -updater.setBoolean(ordinal, value.asInstanceOf[Boolean]) - case _ => throw new IncompatibleSchemaException(incompatibleMsg) -}
[spark] branch master updated: [SPARK-44798][BUILD] Fix Scala 2.13 mima check after SPARK-44705 merged
This is an automated email from the ASF dual-hosted git repository. yangjie01 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9f897190bdb [SPARK-44798][BUILD] Fix Scala 2.13 mima check after SPARK-44705 merged 9f897190bdb is described below commit 9f897190bdb29dd80e85bc0926ebacb70825b4ed Author: yangjie01 AuthorDate: Tue Aug 15 11:44:17 2023 +0800 [SPARK-44798][BUILD] Fix Scala 2.13 mima check after SPARK-44705 merged ### What changes were proposed in this pull request? This pr aims add a new `ProblemFilters` to `MimaExcludes.scala` to fix mima check for Scala 2.13 after SPARK-44705 merged. ### Why are the changes needed? Scala 2.13's daily tests have been failing the mima check for several days: - https://github.com/apache/spark/actions/runs/5849050388/job/15856979487 https://github.com/apache/spark/assets/1475305/fa4f859a-293b-429b-a030-a332e6a587d3;> ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass GitHub Actions - Manual verification: 1. The mima check was passing before SPARK-44705 ``` // [SPARK-44765][CONNECT] Simplify retries of ReleaseExecute git reset --hard 9bde882fcb39e9fedced0df9702df2a36c1a84e6 dev/change-scala-version.sh 2.13 dev/mima -Pscala-2.13 ``` ``` [success] Total time: 48 s, completed 2023-8-14 13:02:11 ``` 2. The mima check failed after SPARK-44705 was merged ``` // [SPARK-44705][PYTHON] Make PythonRunner single-threaded git reset --hard 8aaff55839493e80e3ce376f928c04aa8f31d18c dev/change-scala-version.sh 2.13 dev/mima -Pscala-2.13 ``` ``` [error] spark-core: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! 
Found 1 potential problems (filtered 4019) [error] * method this(org.apache.spark.api.python.BasePythonRunner,java.io.DataInputStream,org.apache.spark.api.python.BasePythonRunner#WriterThread,Long,org.apache.spark.SparkEnv,java.net.Socket,scala.Option,java.util.concurrent.atomic.AtomicBoolean,org.apache.spark.TaskContext)Unit in class org.apache.spark.api.python.BasePythonRunner#ReaderIterator's type is different in current version, where it is (org.apache.spark.api.python.BasePythonRunner,java.io.DataInputStream,org.apache.s [...] [error]filter with: ProblemFilters.exclude[IncompatibleMethTypeProblem]("org.apache.spark.api.python.BasePythonRunner#ReaderIterator.this") [error] java.lang.RuntimeException: Failed binary compatibility check against org.apache.spark:spark-core_2.13:3.4.0! Found 1 potential problems (filtered 4019) [error] at scala.sys.package$.error(package.scala:30) [error] at com.typesafe.tools.mima.plugin.SbtMima$.reportModuleErrors(SbtMima.scala:89) [error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2(MimaPlugin.scala:36) [error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$2$adapted(MimaPlugin.scala:26) [error] at scala.collection.Iterator.foreach(Iterator.scala:943) [error] at scala.collection.Iterator.foreach$(Iterator.scala:943) [error] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) [error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1(MimaPlugin.scala:26) [error] at com.typesafe.tools.mima.plugin.MimaPlugin$.$anonfun$projectSettings$1$adapted(MimaPlugin.scala:25) [error] at scala.Function1.$anonfun$compose$1(Function1.scala:49) [error] at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:63) [error] at sbt.std.Transform$$anon$4.work(Transform.scala:69) [error] at sbt.Execute.$anonfun$submit$2(Execute.scala:283) [error] at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:24) [error] at 
sbt.Execute.work(Execute.scala:292) [error] at sbt.Execute.$anonfun$submit$1(Execute.scala:283) [error] at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:265) [error] at sbt.CompletionService$$anon$2.call(CompletionService.scala:65) [error] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [error] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [error] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [error] at java.lang.Thread.run(Thread.java:750) [error] (core /
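The fix itself is the one-line exclusion that MiMa suggests in the log above, added to `project/MimaExcludes.scala` (a sketch; the exact position in the exclusion list may differ):

```scala
// Exclude the ReaderIterator constructor-signature change introduced by
// SPARK-44705 ([PYTHON] Make PythonRunner single-threaded) from MiMa's
// binary compatibility check against spark-core 3.4.0.
ProblemFilters.exclude[IncompatibleMethTypeProblem](
  "org.apache.spark.api.python.BasePythonRunner#ReaderIterator.this")
```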
[spark] branch branch-3.5 updated: [SPARK-43063][SQL][FOLLOWUP] Add a space between -> and value when first value is null
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 247948194a0 [SPARK-43063][SQL][FOLLOWUP] Add a space between -> and value when first value is null 247948194a0 is described below commit 247948194a08c6d090bfd26daf6b63a68374e6f2 Author: Andy Grove AuthorDate: Tue Aug 15 10:12:55 2023 +0800 [SPARK-43063][SQL][FOLLOWUP] Add a space between -> and value when first value is null As noted by cloud-fan https://github.com/apache/spark/pull/41432#discussion_r1242593592, https://github.com/apache/spark/pull/41432 fixed a formatting issue when casting map to string but did not fix it in the codegen case. This PR fixes the issue and updates a unit test to test the codegen path. Without the fix, the test fails with: ``` - SPARK-22973 Cast map to string *** FAILED *** Incorrect evaluation (fallback mode = CODEGEN_ONLY): cast(map(keys: [1,2,3], values: [null,[B5c3fd9f3,[B70f84210]) as string), actual: {1 ->null, 2 -> a, 3 -> c}, expected: {1 -> null, 2 -> a, 3 -> c} (ExpressionEvalHelper.scala:270) ``` Closes #42434 from andygrove/space-before-null-cast-codegen. 
Authored-by: Andy Grove Signed-off-by: Wenchen Fan (cherry picked from commit 7e52169433575a7df164368106fa3f6c73e4233e) Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/catalyst/expressions/ToStringBase.scala | 2 +- .../org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala| 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala index f903863bec6..1eac386d873 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala @@ -352,7 +352,7 @@ trait ToStringBase { self: UnaryExpression with TimeZoneAwareExpression => | $buffer.append($keyToStringFunc($getMapFirstKey)); | $buffer.append(" ->"); | if ($map.valueArray().isNullAt(0)) { - |${appendNull(buffer, isFirstElement = true)} + |${appendNull(buffer, isFirstElement = false)} | } else { |$buffer.append(" "); |$buffer.append($valueToStringFunc($getMapFirstValue)); diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala index 34f87f940a7..0172fd9b3e4 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala @@ -826,9 +826,10 @@ abstract class CastSuiteBase extends SparkFunSuite with ExpressionEvalHelper { val ret1 = cast(Literal.create(Map(1 -> "a", 2 -> "b", 3 -> "c")), StringType) checkEvaluation(ret1, s"${lb}1 -> a, 2 -> b, 3 -> c$rb") val ret2 = cast( - Literal.create(Map("1" -> "a".getBytes, "2" -> null, "3" -> "c".getBytes)), + Literal.create(Map("1" -> null, "2" -> "a".getBytes, "3" -> null, "4" -> 
"c".getBytes)), StringType) -checkEvaluation(ret2, s"${lb}1 -> a, 2 ->${if (legacyCast) "" else " null"}, 3 -> c$rb") +val nullStr = if (legacyCast) "" else " null" +checkEvaluation(ret2, s"${lb}1 ->$nullStr, 2 -> a, 3 ->$nullStr, 4 -> c$rb") val ret3 = cast( Literal.create(Map( 1 -> Date.valueOf("2014-12-03"),
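The expected rendering can be illustrated with a small, self-contained sketch (not Spark's actual code path): after the fix, a null value is printed as `-> null` with exactly one space, regardless of whether it appears in the first map entry or a later one.

```scala
// Hypothetical re-implementation of the spacing rule the fix enforces:
// every entry renders as "key -> value", with null values printed as "null",
// so "->" is always followed by exactly one space (the bug dropped the space
// only when the first value was null: "{1 ->null, ...}").
def formatMap(entries: Seq[(String, String)]): String =
  entries
    .map { case (k, v) => s"$k -> ${Option(v).getOrElse("null")}" }
    .mkString("{", ", ", "}")

println(formatMap(Seq("1" -> null, "2" -> "a")))  // {1 -> null, 2 -> a}
```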
[spark] branch master updated (420e6878c68 -> 7e521694335)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 420e6878c68 [SPARK-43780][SQL] Support correlated references in join predicates for scalar and lateral subqueries add 7e521694335 [SPARK-43063][SQL][FOLLOWUP] Add a space between -> and value when first value is null No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/catalyst/expressions/ToStringBase.scala | 2 +- .../org/apache/spark/sql/catalyst/expressions/CastSuiteBase.scala | 5 +++-- 2 files changed, 4 insertions(+), 3 deletions(-)
[spark] branch master updated: [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1d562904e4e [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific 1d562904e4e is described below commit 1d562904e4e75aec3ea8d4999ede0183fda326c7 Author: Herman van Hovell AuthorDate: Tue Aug 15 03:10:42 2023 +0200 [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific ### What changes were proposed in this pull request? When you currently use a REPL generated class in a UDF you can get an error saying that that class is not equal to that class. This error is thrown in a code generated class. The problem is that the classes have been loaded by different classloaders. We cache generated code and use the textual code as the string. The problem with this is that in Spark Connect users are free in supplying user classes that can have arbitrary names, a name can point to an entirely different class, or it [...] There are roughly two ways how this problem can arise: 1. Two sessions use the same class names. This is particularly easy when you use the REPL because this always generates the same names. 2. You run in single process mode. In this case wholestage codegen will test compile the class using a different classloader then the 'executor', while sharing the same code generator cache. ### Why are the changes needed? We want to be able to use REPL (and other) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test to the `ReplE2ESuite`. Closes #42478 from hvanhovell/SPARK-44795. 
Authored-by: Herman van Hovell Signed-off-by: Herman van Hovell --- .../spark/sql/application/ReplE2ESuite.scala | 9 .../spark/sql/catalyst/encoders/OuterScopes.scala | 49 +- .../expressions/codegen/CodeGenerator.scala| 30 ++--- 3 files changed, 54 insertions(+), 34 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala index 0e69b5afa45..0cab66eef3d 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala @@ -276,4 +276,13 @@ class ReplE2ESuite extends RemoteSparkSession with BeforeAndAfterEach { val output = runCommandsInShell(input) assertContains("Array[MyTestClass] = Array(MyTestClass(1), MyTestClass(3))", output) } + + test("REPL class in UDF") { +val input = """ +|case class MyTestClass(value: Int) +|spark.range(2).map(i => MyTestClass(i.toInt)).collect() + """.stripMargin +val output = runCommandsInShell(input) +assertContains("Array[MyTestClass] = Array(MyTestClass(0), MyTestClass(1))", output) + } } diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala index c2ac504c846..6c10e8ece80 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala @@ -26,28 +26,9 @@ import org.apache.spark.util.SparkClassUtils object OuterScopes { private[this] val queue = new ReferenceQueue[AnyRef] - private class HashableWeakReference(v: AnyRef) extends WeakReference[AnyRef](v, queue) { -private[this] val hash = v.hashCode() -override def hashCode(): Int = hash -override def equals(obj: Any): Boolean = { - obj match { -case 
other: HashableWeakReference => - // Note that referential equality is used to identify & purge - // references from the map whose' referent went out of scope. - if (this eq other) { -true - } else { -val referent = get() -val otherReferent = other.get() -referent != null && otherReferent != null && Objects.equals(referent, otherReferent) - } -case _ => false - } -} - } private def classLoaderRef(c: Class[_]): HashableWeakReference = { -new HashableWeakReference(c.getClassLoader) +new HashableWeakReference(c.getClassLoader, queue) } private[this] val outerScopes = { @@ -154,3 +135,31 @@ object OuterScopes { // e.g. `ammonite.$sess.cmd8$Helper$Foo` -> `ammonite.$sess.cmd8.instance.Foo` private[this] val AmmoniteREPLClass = """^(ammonite\.\$sess\.cmd(?:\d+)\$).*""".r } + +/** + * A [[WeakReference]] that has a stable hash-key. When the referent is still
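The core idea of the fix — keying the generated-code cache by classloader as well as source text — can be sketched independently of Spark. All names below are illustrative, not Spark's; Spark's actual change additionally wraps the classloader in a weak reference (`HashableWeakReference`) so loaders can still be garbage-collected, a detail omitted here for brevity.

```scala
import scala.collection.concurrent.TrieMap

// A cache keyed by (classloader, source) rather than source alone, so two
// sessions that submit identical class names under different classloaders
// get separate compiled entries instead of colliding. ClassLoader uses
// reference equality, which is exactly the distinction we want in the key.
final case class CodeKey(loader: ClassLoader, source: String)

object CodegenCache {
  private val cache = TrieMap.empty[CodeKey, String]

  // `compileFn` stands in for the expensive compilation step (Janino in Spark);
  // it only runs on a cache miss for this (loader, source) pair.
  def getOrCompile(loader: ClassLoader, source: String)(compileFn: => String): String =
    cache.getOrElseUpdate(CodeKey(loader, source), compileFn)
}
```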
[spark] branch branch-3.5 updated: [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific
This is an automated email from the ASF dual-hosted git repository. hvanhovell pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 76e61c44aa2 [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific 76e61c44aa2 is described below commit 76e61c44aa28e12d363d86a401fd581bf3b054c1 Author: Herman van Hovell AuthorDate: Tue Aug 15 03:10:42 2023 +0200 [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific ### What changes were proposed in this pull request? When you currently use a REPL generated class in a UDF you can get an error saying that that class is not equal to that class. This error is thrown in a code generated class. The problem is that the classes have been loaded by different classloaders. We cache generated code and use the textual code as the string. The problem with this is that in Spark Connect users are free in supplying user classes that can have arbitrary names, a name can point to an entirely different class, or it [...] There are roughly two ways how this problem can arise: 1. Two sessions use the same class names. This is particularly easy when you use the REPL because this always generates the same names. 2. You run in single process mode. In this case wholestage codegen will test compile the class using a different classloader then the 'executor', while sharing the same code generator cache. ### Why are the changes needed? We want to be able to use REPL (and other) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I added a test to the `ReplE2ESuite`. Closes #42478 from hvanhovell/SPARK-44795. 
Authored-by: Herman van Hovell Signed-off-by: Herman van Hovell (cherry picked from commit 1d562904e4e75aec3ea8d4999ede0183fda326c7) Signed-off-by: Herman van Hovell --- .../spark/sql/application/ReplE2ESuite.scala | 9 .../spark/sql/catalyst/encoders/OuterScopes.scala | 49 +- .../expressions/codegen/CodeGenerator.scala| 30 ++--- 3 files changed, 54 insertions(+), 34 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala index 0e69b5afa45..0cab66eef3d 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/application/ReplE2ESuite.scala @@ -276,4 +276,13 @@ class ReplE2ESuite extends RemoteSparkSession with BeforeAndAfterEach { val output = runCommandsInShell(input) assertContains("Array[MyTestClass] = Array(MyTestClass(1), MyTestClass(3))", output) } + + test("REPL class in UDF") { +val input = """ +|case class MyTestClass(value: Int) +|spark.range(2).map(i => MyTestClass(i.toInt)).collect() + """.stripMargin +val output = runCommandsInShell(input) +assertContains("Array[MyTestClass] = Array(MyTestClass(0), MyTestClass(1))", output) + } } diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala index c2ac504c846..6c10e8ece80 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/encoders/OuterScopes.scala @@ -26,28 +26,9 @@ import org.apache.spark.util.SparkClassUtils object OuterScopes { private[this] val queue = new ReferenceQueue[AnyRef] - private class HashableWeakReference(v: AnyRef) extends WeakReference[AnyRef](v, queue) { -private[this] val hash = v.hashCode() 
-override def hashCode(): Int = hash -override def equals(obj: Any): Boolean = { - obj match { -case other: HashableWeakReference => - // Note that referential equality is used to identify & purge - // references from the map whose' referent went out of scope. - if (this eq other) { -true - } else { -val referent = get() -val otherReferent = other.get() -referent != null && otherReferent != null && Objects.equals(referent, otherReferent) - } -case _ => false - } -} - } private def classLoaderRef(c: Class[_]): HashableWeakReference = { -new HashableWeakReference(c.getClassLoader) +new HashableWeakReference(c.getClassLoader, queue) } private[this] val outerScopes = { @@ -154,3 +135,31 @@ object OuterScopes { // e.g. `ammonite.$sess.cmd8$Helper$Foo` -> `ammonite.$sess.cmd8.instance.Foo` private[this] val AmmoniteREPLClass =
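The diff's `HashableWeakReference` computes the referent's hash eagerly so the reference remains a usable map key even after the referent is garbage-collected, while dead references compare unequal so they can be purged. A minimal Python sketch of the same idea — names are illustrative, and the reference-queue-based purging is omitted:

```python
import weakref


class HashableWeakRef:
    """Weak reference with a stable hash, usable as a dict key for the
    lifetime of the entry, even after the referent is collected."""

    def __init__(self, obj):
        self._hash = hash(obj)        # captured eagerly: stays stable after collection
        self._ref = weakref.ref(obj)

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        # Referential equality short-circuit, then compare live referents.
        # Two refs whose referents are gone compare unequal (unless identical),
        # which lets a cache purge stale entries.
        if self is other:
            return True
        if not isinstance(other, HashableWeakRef):
            return NotImplemented
        a, b = self._ref(), other._ref()
        return a is not None and b is not None and a == b
```

The eager hash is the key design point: a hash computed lazily from the referent would change (or fail) once the referent is collected, breaking hash-map invariants.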
[spark] branch master updated: [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues`
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 46580ab4cb0 [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues` 46580ab4cb0 is described below commit 46580ab4cb02390ba71dace1235015749f048fff Author: zeruibao AuthorDate: Mon Aug 14 19:36:40 2023 -0700 [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues` ### What changes were proposed in this pull request? Revert my last PR https://github.com/apache/spark/pull/41052, which caused an Avro read performance regression because it changed the code structure. ### Why are the changes needed? Remove the performance regression ### How was this patch tested? Unit test Closes #42458 from zeruibao/revert-avro-change. Authored-by: zeruibao Signed-off-by: Gengliang Wang --- .../src/main/resources/error/error-classes.json| 10 - .../apache/spark/sql/avro/AvroDeserializer.scala | 456 + .../org/apache/spark/sql/avro/AvroSuite.scala | 161 docs/sql-error-conditions.md | 14 - docs/sql-migration-guide.md| 1 - .../spark/sql/errors/QueryCompilationErrors.scala | 30 -- .../org/apache/spark/sql/internal/SQLConf.scala| 12 - 7 files changed, 189 insertions(+), 495 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 08f79bcecbb..6ce24bc3b90 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -75,16 +75,6 @@ } } }, - "AVRO_INCORRECT_TYPE" : { -"message" : [ - "Cannot convert Avro to SQL because the original encoded data type is , however you're trying to read the field as , which would lead to an incorrect answer. To allow reading this field, enable the SQL configuration: ." 
-] - }, - "AVRO_LOWER_PRECISION" : { -"message" : [ - "Cannot convert Avro to SQL because the original encoded data type is , however you're trying to read the field as , which leads to data being read as null. Please provide a wider decimal type to get the correct result. To allow reading null to this field, enable the SQL configuration: ." -] - }, "BATCH_METADATA_NOT_FOUND" : { "message" : [ "Unable to find batch ." diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala index d4d34a891e9..a78ee89a3e9 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala @@ -35,9 +35,8 @@ import org.apache.spark.sql.catalyst.expressions.{SpecificInternalRow, UnsafeArr import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, ArrayData, DateTimeUtils, GenericArrayData} import org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_DAY import org.apache.spark.sql.catalyst.util.RebaseDateTime.RebaseSpec -import org.apache.spark.sql.errors.QueryCompilationErrors import org.apache.spark.sql.execution.datasources.DataSourceUtils -import org.apache.spark.sql.internal.{LegacyBehaviorPolicy, SQLConf} +import org.apache.spark.sql.internal.LegacyBehaviorPolicy import org.apache.spark.sql.types._ import org.apache.spark.unsafe.types.UTF8String @@ -118,268 +117,178 @@ private[sql] class AvroDeserializer( val incompatibleMsg = errorPrefix + s"schema is incompatible (avroType = $avroType, sqlType = ${catalystType.sql})" -val confKey = SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA -val preventReadingIncorrectType = !SQLConf.get.getConf(confKey) +(avroType.getType, catalystType) match { + case (NULL, NullType) => (updater, ordinal, _) => +updater.setNullAt(ordinal) -val logicalDataType = SchemaConverters.toSqlType(avroType).dataType 
-avroType.getType match { - case NULL => -(logicalDataType, catalystType) match { - case (_, NullType) => (updater, ordinal, _) => -updater.setNullAt(ordinal) - case _ => throw new IncompatibleSchemaException(incompatibleMsg) -} // TODO: we can avoid boxing if future version of avro provide primitive accessors. - case BOOLEAN => -(logicalDataType, catalystType) match { - case (_, BooleanType) => (updater, ordinal, value) => -updater.setBoolean(ordinal, value.asInstanceOf[Boolean]) - case _ => throw new IncompatibleSchemaException(incompatibleMsg) -} + case (BOOLEAN, BooleanType) => (updater, ordinal, value) => +updater.setBoolean(ordinal,
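The restored structure in the diff resolves one converter per (Avro type, Catalyst type) pair up front and fails fast on incompatible pairs, instead of re-checking types per row. A toy sketch of that dispatch shape — the type names and converters below are illustrative, not Spark's:

```python
# Illustrative dispatch table: one converter function per (source, target) type pair.
_CONVERTERS = {
    ("boolean", "boolean"): lambda v: bool(v),
    ("int", "int"): lambda v: int(v),
    ("long", "long"): lambda v: int(v),
    ("string", "string"): lambda v: str(v),
}


def make_converter(avro_type, sql_type):
    """Resolve the converter once, at schema-resolution time, so an
    incompatible pair raises immediately rather than per record."""
    try:
        return _CONVERTERS[(avro_type, sql_type)]
    except KeyError:
        raise TypeError(
            f"schema is incompatible (avroType = {avro_type}, sqlType = {sql_type})"
        ) from None
```

Hoisting the pair lookup out of the per-row path is what the revert restores; the reverted change had restructured the match in a way that regressed read performance.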
[spark] branch branch-3.5 updated: [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should not fail
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 4fd5e770209 [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should not fail 4fd5e770209 is described below commit 4fd5e770209ee4eb1c3eb5c0210588362cb53966 Author: Wenchen Fan AuthorDate: Tue Aug 15 10:55:57 2023 +0800 [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should not fail ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/41448 . As an optimizer rule, the produced plan should be resolved, and resolved expressions should be able to report their data type. The `Instruction` expression fails to report a data type and may break external optimizer rules. This PR makes it return a dummy NullType. ### Why are the changes needed? to not break external optimizer rules. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #42482 from cloud-fan/merge. 
Authored-by: Wenchen Fan Signed-off-by: Wenchen Fan (cherry picked from commit c9ff70253999bf396b07eec84a3a86ca3191efa3) Signed-off-by: Wenchen Fan --- .../org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala index ecca758..9b1c8bc733a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala @@ -21,7 +21,7 @@ import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Expre import org.apache.spark.sql.catalyst.plans.logical.MergeRows.{Instruction, ROW_ID} import org.apache.spark.sql.catalyst.trees.UnaryLike import org.apache.spark.sql.catalyst.util.truncatedString -import org.apache.spark.sql.types.DataType +import org.apache.spark.sql.types.{DataType, NullType} case class MergeRows( isSourceRowPresent: Expression, @@ -74,7 +74,11 @@ object MergeRows { def condition: Expression def outputs: Seq[Seq[Expression]] override def nullable: Boolean = false -override def dataType: DataType = throw new UnsupportedOperationException("dataType") +// We return NullType here as only the `MergeRows` operator can contain `Instruction` +// expressions and it doesn't care about the data type. Some external optimizer rules may +// assume optimized plan is always resolved and Expression#dataType is always available, so +// we can't just fail here. +override def dataType: DataType = NullType } case class Keep(condition: Expression, output: Seq[Expression]) extends Instruction { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
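The trade-off in this fix — return a harmless dummy type rather than throw — can be sketched with a toy expression class (illustrative Python, not Spark's API):

```python
class Instruction:
    """Plan-internal expression: only its containing operator consumes it,
    so its reported type is never actually used."""

    @property
    def data_type(self):
        # Returning a dummy type instead of raising keeps generic tree walks
        # (e.g. an external rule that queries the type of every resolved
        # expression) from failing on plan-internal expressions.
        return "null"


def all_typed(expressions):
    """A generic rule that assumes every resolved expression reports a type."""
    return all(e.data_type is not None for e in expressions)
```

With a raising `data_type`, `all_typed` would crash on any plan containing an `Instruction`; the dummy value makes such generic traversals safe at the cost of a technically meaningless type.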
[spark] branch master updated (46580ab4cb0 -> c9ff7025399)
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 46580ab4cb0 [SPARK-43380][SQL] Revert `Fix Avro data type conversion issues` add c9ff7025399 [SPARK-43885][SQL][FOLLOWUP] Instruction#dataType should not fail No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/catalyst/plans/logical/MergeRows.scala | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-)
[spark] branch master updated (1d562904e4e -> f7002fb25ca)
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 1d562904e4e [SPARK-44795][CONNECT] CodeGenerator Cache should be classloader specific add f7002fb25ca [SPARK-44749][PYTHON][FOLLOWUP][TESTS] Add more tests for named arguments in Python UDTF No new revisions were added by this update. Summary of changes: python/pyspark/sql/functions.py | 2 +- python/pyspark/sql/tests/test_udtf.py | 88 +++ 2 files changed, 80 insertions(+), 10 deletions(-)
[spark] branch master updated: [SPARK-43780][SQL] Support correlated references in join predicates for scalar and lateral subqueries
This is an automated email from the ASF dual-hosted git repository. wenchen pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 420e6878c68 [SPARK-43780][SQL] Support correlated references in join predicates for scalar and lateral subqueries 420e6878c68 is described below commit 420e6878c68f9ff68171cc6d43ca95d69c49eac4 Author: Andrey Gubichev AuthorDate: Tue Aug 15 09:52:12 2023 +0800 [SPARK-43780][SQL] Support correlated references in join predicates for scalar and lateral subqueries ### What changes were proposed in this pull request? This PR adds support for subqueries that involve joins with correlated references in join predicates, e.g. ``` select * from t0 join lateral (select * from t1 join t2 on t1a = t2a and t1a = t0a); ``` (full example in https://issues.apache.org/jira/browse/SPARK-43780) Currently we only handle scalar and lateral subqueries. ### Why are the changes needed? This is valid SQL that is not yet supported by Spark SQL. ### Does this PR introduce _any_ user-facing change? Yes, previously unsupported queries become supported. ### How was this patch tested? Query and unit tests Closes #41301 from agubichev/spark-43780-corr-predicate. 
Authored-by: Andrey Gubichev Signed-off-by: Wenchen Fan --- .../sql/catalyst/analysis/CheckAnalysis.scala | 1 + .../catalyst/optimizer/DecorrelateInnerQuery.scala | 92 -- .../org/apache/spark/sql/internal/SQLConf.scala| 10 ++ .../optimizer/DecorrelateInnerQuerySuite.scala | 133 - .../analyzer-results/join-lateral.sql.out | 66 ++ .../scalar-subquery-predicate.sql.out | 81 + .../scalar-subquery/scalar-subquery-set-op.sql.out | 80 + .../resources/sql-tests/inputs/join-lateral.sql| 5 + .../scalar-subquery/scalar-subquery-predicate.sql | 20 .../scalar-subquery/scalar-subquery-set-op.sql | 20 .../sql-tests/results/join-lateral.sql.out | 39 ++ .../scalar-subquery-predicate.sql.out | 37 ++ .../scalar-subquery/scalar-subquery-set-op.sql.out | 32 + 13 files changed, 605 insertions(+), 11 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 48c38a9bd4c..c7346809f3f 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -1173,6 +1173,7 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB def canHostOuter(plan: LogicalPlan): Boolean = plan match { case _: Filter => true case _: Project => usingDecorrelateInnerQueryFramework + case _: Join => usingDecorrelateInnerQueryFramework case _ => false } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala index 86fa78e96a5..a3e264579f4 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala @@ -804,11 +804,73 @@ 
object DecorrelateInnerQuery extends PredicateHelper { (d.copy(child = newChild), joinCond, outerReferenceMap) case j @ Join(left, right, joinType, condition, _) => -val outerReferences = collectOuterReferences(j.expressions) -// Join condition containing outer references is not supported. -assert(outerReferences.isEmpty, s"Correlated column is not allowed in join: $j") -val newOuterReferences = parentOuterReferences ++ outerReferences -val shouldPushToLeft = joinType match { +// Given 'condition', computes the tuple of +// (correlated, uncorrelated, equalityCond, predicates, equivalences). +// 'correlated' and 'uncorrelated' are the conjuncts with (resp. without) +// outer (correlated) references. Furthermore, correlated conjuncts are split +// into 'equalityCond' (those that are equalities) and all rest ('predicates'). +// 'equivalences' track equivalent attributes given 'equalityCond'. +// The split is only performed if 'shouldDecorrelatePredicates' is true. +// The input parameter 'isInnerJoin' is set to true for INNER joins and helps +// determine whether some predicates can be lifted up from the join (this
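The split that the comment in the diff describes can be illustrated standalone, modeling each conjunct as a (referenced-attributes, is-equality) pair — purely a sketch, not Spark's internal representation:

```python
def split_join_condition(conjuncts, outer_attrs):
    """Split a join condition's conjuncts into:
      - equality_cond: correlated conjuncts that are equalities,
      - predicates:    all other correlated conjuncts,
      - uncorrelated:  conjuncts with no outer (correlated) references.
    Each conjunct is modeled as (referenced_attrs: set, is_equality: bool)."""
    equality_cond, predicates, uncorrelated = [], [], []
    for attrs, is_equality in conjuncts:
        if attrs & outer_attrs:  # touches an outer (correlated) column
            (equality_cond if is_equality else predicates).append((attrs, is_equality))
        else:
            uncorrelated.append((attrs, is_equality))
    return equality_cond, predicates, uncorrelated
```

In the decorrelation framework, the correlated equalities feed the equivalence tracking while the remaining correlated predicates are lifted out of the join; this sketch only shows the partitioning step itself.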
[spark] branch master updated: [SPARK-44802][INFRA] Token based ASF JIRA authentication
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c36d54a569e [SPARK-44802][INFRA] Token based ASF JIRA authentication c36d54a569e is described below commit c36d54a569e59e823aab83f47b97ae286fd7f4c4 Author: Kent Yao AuthorDate: Tue Aug 15 13:39:16 2023 +0800 [SPARK-44802][INFRA] Token based ASF JIRA authentication ### What changes were proposed in this pull request? This PR adds an env var JIRA_ACCESS_TOKEN to the merge script to enable token auth ### Why are the changes needed? Tokens are more secure and can be easily revoked or expired. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? 1. locally tested ```python >>> JIRA_ACCESS_TOKEN is not None True >>> asf_jira.issue("SPARK-44801") >>> if JIRA_ACCESS_TOKEN is not None: ... asf_jira = jira.client.JIRA(jira_server, token_auth=JIRA_ACCESS_TOKEN) ... else: ... asf_jira = jira.client.JIRA( ... jira_server, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD) ... ) ... >>> asf_jira.issue("SPARK-44801") ``` 2. merged https://github.com/apache/spark/pull/42470 with this Closes #42484 from yaooqinn/SPARK-44802. Authored-by: Kent Yao Signed-off-by: Kent Yao --- dev/merge_spark_pr.py | 14 +++--- 1 file changed, 11 insertions(+), 3 deletions(-) diff --git a/dev/merge_spark_pr.py b/dev/merge_spark_pr.py index c96415c7aeb..0fd42e237a4 100755 --- a/dev/merge_spark_pr.py +++ b/dev/merge_spark_pr.py @@ -52,6 +52,11 @@ PUSH_REMOTE_NAME = os.environ.get("PUSH_REMOTE_NAME", "apache") JIRA_USERNAME = os.environ.get("JIRA_USERNAME", "") # ASF JIRA password JIRA_PASSWORD = os.environ.get("JIRA_PASSWORD", "") +# ASF JIRA access token +# If it is configured, username and password are dismissed +# Go to https://issues.apache.org/jira/secure/ViewProfile.jspa -> Personal Access Tokens for +# your own token management. 
+JIRA_ACCESS_TOKEN = os.environ.get("JIRA_ACCESS_TOKEN") # OAuth key used for issuing requests against the GitHub API. If this is not defined, then requests # will be unauthenticated. You should only need to configure this if you find yourself regularly # exceeding your IP's unauthenticated request rate limit. You can create an OAuth key at @@ -238,9 +243,12 @@ def cherry_pick(pr_num, merge_hash, default_branch): def resolve_jira_issue(merge_branches, comment, default_jira_id=""): -asf_jira = jira.client.JIRA( -{"server": JIRA_API_BASE}, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD) -) +jira_server = {"server": JIRA_API_BASE} + +if JIRA_ACCESS_TOKEN is not None: +asf_jira = jira.client.JIRA(jira_server, token_auth=JIRA_ACCESS_TOKEN) +else: +asf_jira = jira.client.JIRA(jira_server, basic_auth=(JIRA_USERNAME, JIRA_PASSWORD)) jira_id = input("Enter a JIRA id [%s]: " % default_jira_id) if jira_id == "": - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
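The precedence the patch implements — a configured token wins over username/password — can be factored into a small helper. A sketch, with env-var names mirroring the merge script's (the helper itself is hypothetical, not part of the script):

```python
import os


def jira_auth_kwargs(env=os.environ):
    """Build keyword arguments for jira.client.JIRA: prefer a personal
    access token; fall back to basic auth when no token is configured."""
    token = env.get("JIRA_ACCESS_TOKEN")
    if token is not None:
        return {"token_auth": token}
    return {"basic_auth": (env.get("JIRA_USERNAME", ""), env.get("JIRA_PASSWORD", ""))}
```

Usage would then collapse the if/else in the diff to `jira.client.JIRA(jira_server, **jira_auth_kwargs())`, keeping the token-first policy in one place.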
[spark] branch master updated: [SPARK-44783][SQL][TESTS] Checks arrays as named and positional parameters
This is an automated email from the ASF dual-hosted git repository. yao pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0fc85a24a15 [SPARK-44783][SQL][TESTS] Checks arrays as named and positional parameters 0fc85a24a15 is described below commit 0fc85a24a1531c74e7bc5b2d0a36635580a395a6 Author: Max Gekk AuthorDate: Mon Aug 14 18:32:58 2023 +0800 [SPARK-44783][SQL][TESTS] Checks arrays as named and positional parameters ### What changes were proposed in this pull request? In the PR, I propose to add a new test which checks arrays as named and positional parameters. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? By running the modified test suite: ``` $ build/sbt "test:testOnly *ParametersSuite" ``` Closes #42470 from MaxGekk/sql-parameterized-by-array. Authored-by: Max Gekk Signed-off-by: Kent Yao --- .../org/apache/spark/sql/ParametersSuite.scala | 27 ++ 1 file changed, 27 insertions(+) diff --git a/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala index a72c9a600ad..6310a5a50e0 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/ParametersSuite.scala @@ -502,4 +502,31 @@ class ParametersSuite extends QueryTest with SharedSparkSession { start = 24, stop = 36)) } + + test("SPARK-44783: arrays as parameters") { +checkAnswer( + spark.sql("SELECT array_position(:arrParam, 'abc')", Map("arrParam" -> Array.empty[String])), + Row(0)) +checkAnswer( + spark.sql("SELECT array_position(?, 0.1D)", Array(Array.empty[Double])), + Row(0)) +checkAnswer( + spark.sql("SELECT array_contains(:arrParam, 10)", Map("arrParam" -> Array(10, 20, 30))), + Row(true)) +checkAnswer( + spark.sql("SELECT array_contains(?, 
?)", Array(Array("a", "b", "c"), "b")), + Row(true)) +checkAnswer( + spark.sql("SELECT :arr[1]", Map("arr" -> Array(10, 20, 30))), + Row(20)) +checkAnswer( + spark.sql("SELECT ?[?]", Array(Array(1f, 2f, 3f), 0)), + Row(1f)) +checkAnswer( + spark.sql("SELECT :arr[0][1]", Map("arr" -> Array(Array(1, 2), Array(20), Array.empty[Int]))), + Row(2)) +checkAnswer( + spark.sql("SELECT ?[?][?]", Array(Array(Array(1f, 2f), Array.empty[Float], Array(3f)), 0, 1)), + Row(2f)) + } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[GitHub] [spark-website] zero323 opened a new pull request, #472: Add note on generative tooling to developer tools
zero323 opened a new pull request, #472: URL: https://github.com/apache/spark-website/pull/472 This PR adds notes on generative tooling and a link to the relevant ASF policy. As requested in comments to https://github.com/apache/spark/pull/42469
[GitHub] [spark-website] zero323 commented on pull request #472: Add note on generative tooling to developer tools
zero323 commented on PR #472: URL: https://github.com/apache/spark-website/pull/472#issuecomment-1677700165 cc @srowen @zhengruifeng
[spark] branch master updated: [SPARK-44749][SQL][PYTHON] Support named arguments in Python UDTF
This is an automated email from the ASF dual-hosted git repository. ueshin pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d4629563492 [SPARK-44749][SQL][PYTHON] Support named arguments in Python UDTF d4629563492 is described below commit d4629563492ec3090b5bd5b924790507c42f4e86 Author: Takuya UESHIN AuthorDate: Mon Aug 14 08:57:55 2023 -0700 [SPARK-44749][SQL][PYTHON] Support named arguments in Python UDTF ### What changes were proposed in this pull request? Supports named arguments in Python UDTF. For example: ```py >>> @udtf(returnType="a: int") ... class TestUDTF: ... def eval(self, a, b): ... yield a, ... >>> spark.udtf.register("test_udtf", TestUDTF) >>> TestUDTF(a=lit(10), b=lit("x")).show() +---+ | a| +---+ | 10| +---+ >>> TestUDTF(b=lit("x"), a=lit(10)).show() +---+ | a| +---+ | 10| +---+ >>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x')").show() +---+ | a| +---+ | 10| +---+ >>> spark.sql("SELECT * FROM test_udtf(b=>'x', a=>10)").show() +---+ | a| +---+ | 10| +---+ ``` or: ```py >>> @udtf ... class TestUDTF: ... @staticmethod ... def analyze(**kwargs: AnalyzeArgument) -> AnalyzeResult: ... return AnalyzeResult( ... StructType( ... [StructField(key, arg.data_type) for key, arg in sorted(kwargs.items())] ... ) ... ) ... def eval(self, **kwargs): ... yield tuple(value for _, value in sorted(kwargs.items())) ... >>> spark.udtf.register("test_udtf", TestUDTF) >>> spark.sql("SELECT * FROM test_udtf(a=>10, b=>'x', x=>100.0)").show() +---+---+-+ | a| b|x| +---+---+-+ | 10| x|100.0| +---+---+-+ >>> spark.sql("SELECT * FROM test_udtf(x=>10, a=>'x', z=>100.0)").show() +---+---+-+ | a| x|z| +---+---+-+ | x| 10|100.0| +---+---+-+ ``` ### Why are the changes needed? Now that named arguments are supported (https://github.com/apache/spark/pull/41796, https://github.com/apache/spark/pull/42020), they should be supported in Python UDTF as well. 
### Does this PR introduce _any_ user-facing change? Yes, named arguments will be available for Python UDTF. ### How was this patch tested? Added related tests. Closes #42422 from ueshin/issues/SPARK-44749/kwargs. Authored-by: Takuya UESHIN Signed-off-by: Takuya UESHIN --- .../main/protobuf/spark/connect/expressions.proto | 9 ++ .../sql/connect/planner/SparkConnectPlanner.scala | 7 ++ python/pyspark/sql/column.py | 14 +++ python/pyspark/sql/connect/expressions.py | 20 .../pyspark/sql/connect/proto/expressions_pb2.py | 122 +++-- .../pyspark/sql/connect/proto/expressions_pb2.pyi | 34 ++ python/pyspark/sql/connect/udtf.py | 18 +-- python/pyspark/sql/functions.py| 38 +++ python/pyspark/sql/tests/test_udtf.py | 88 +++ python/pyspark/sql/udtf.py | 36 -- python/pyspark/sql/worker/analyze_udtf.py | 20 +++- python/pyspark/worker.py | 29 - .../plans/logical/FunctionBuilderBase.scala| 67 +++ .../execution/python/ArrowEvalPythonUDTFExec.scala | 5 +- .../execution/python/ArrowPythonUDTFRunner.scala | 8 +- .../execution/python/BatchEvalPythonUDTFExec.scala | 30 +++-- .../sql/execution/python/EvalPythonUDTFExec.scala | 33 -- .../python/UserDefinedPythonFunction.scala | 42 +-- 18 files changed, 472 insertions(+), 148 deletions(-) diff --git a/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto b/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto index 557b9db9123..b222f663cd0 100644 --- a/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto +++ b/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto @@ -47,6 +47,7 @@ message Expression { UnresolvedNamedLambdaVariable unresolved_named_lambda_variable = 14; CommonInlineUserDefinedFunction common_inline_user_defined_function = 15; CallFunction call_function = 16; +NamedArgumentExpression named_argument_expression = 17; // This field is used to mark extensions to the protocol. When plugins generate arbitrary // relations they can add them here. 
During the planning the correct resolution is done. @@ -380,3 +381,11 @@ message CallFunction { // (Optional) Function arguments. Empty arguments are allowed.
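The key-sorted `**kwargs` convention used by the example UDTF's `analyze`/`eval` above can be tried in plain Python, independent of Spark:

```python
def eval_row(**kwargs):
    """Mirror the example UDTF's eval: emit one tuple of values in key-sorted
    order, so the output column order is deterministic regardless of the
    order in which the named arguments were passed at the call site."""
    return tuple(value for _, value in sorted(kwargs.items()))
```

This is why `test_udtf(a=>10, b=>'x')` and `test_udtf(b=>'x', a=>10)` produce the same row in the example output: sorting on the argument names decouples column order from call-site order.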