(spark) branch master updated: [SPARK-47579][CORE][PART3][FOLLOWUP] Fix KubernetesSuite
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 541158fe0352 [SPARK-47579][CORE][PART3][FOLLOWUP] Fix KubernetesSuite 541158fe0352 is described below commit 541158fe03529d5a28eaeb61d801d065ff4ef664 Author: panbingkun AuthorDate: Sun May 26 08:35:45 2024 -0700 [SPARK-47579][CORE][PART3][FOLLOWUP] Fix KubernetesSuite ### What changes were proposed in this pull request? The pr is following up https://github.com/apache/spark/pull/46739, and aims to fix `KubernetesSuite `. 1.Unfortunately, after `correcting` the `typo` from `decommision` to `decommission`, it seems that GA has been broken. https://github.com/apache/spark/assets/15246973/6212debb-0ff6-4d22-8999-e37aa2cb2c10;> 2.https://github.com/panbingkun/spark/actions/runs/9232744348/job/25406127982 https://github.com/apache/spark/assets/15246973/4e71598c-22f3-4fd2-afba-fd91ddce5f55;> ### Why are the changes needed? Only fix `KubernetesSuite`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46746 from panbingkun/fix_KubernetesSuite. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala index 1b9b5310c2ee..ae5f037c6b7d 100644 --- a/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala +++ b/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala @@ -175,7 +175,7 @@ private[spark] trait DecommissionSuite { k8sSuite: KubernetesSuite => expectedDriverLogOnCompletion = Seq( "Finished waiting, stopping Spark", "Decommission executors", - "Remove reason statistics: (gracefully decommissioned: 1, decommision unfinished: 0, " + + "Remove reason statistics: (gracefully decommissioned: 1, decommission unfinished: 0, " + "driver killed: 0, unexpectedly exited: 0)."), appArgs = Array.empty[String], driverPodChecker = doBasicDriverPyPodCheck, - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48320][CORE][DOCS] Add structured logging guide to the scala and java doc
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 468aa842c643 [SPARK-48320][CORE][DOCS] Add structured logging guide to the scala and java doc 468aa842c643 is described below commit 468aa842c6435b3c3ff49df30e8958d08ec2edb0 Author: panbingkun AuthorDate: Sat May 25 09:43:01 2024 -0700 [SPARK-48320][CORE][DOCS] Add structured logging guide to the scala and java doc ### What changes were proposed in this pull request? The pr aims to add `external third-party ecosystem access` guide to the `scala/java` doc. The external third-party ecosystem is very extensive. Currently, the document covers two scenarios: - Pure java (for example, an application only uses the java language - many of our internal production applications are like this) - java + scala ### Why are the changes needed? Provide instructions for external third-party ecosystem access to the structured log framework. ### Does this PR introduce _any_ user-facing change? Yes, When an external third-party ecosystem wants to access the structured log framework, developers can get help through this document. ### How was this patch tested? - Add new UT. - Manually test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46634 from panbingkun/SPARK-48320. Lead-authored-by: panbingkun Co-authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../org/apache/spark/internal/SparkLogger.java | 43 + .../scala/org/apache/spark/internal/LogKey.scala | 30 +++- .../scala/org/apache/spark/internal/Logging.scala | 38 +++ .../main/scala/org/apache/spark/internal/README.md | 47 -- .../apache/spark/util/PatternSparkLoggerSuite.java | 9 +++- .../apache/spark/util/SparkLoggerSuiteBase.java| 55 +- .../spark/util/StructuredSparkLoggerSuite.java | 20 ++-- .../apache/spark/util/PatternLoggingSuite.scala| 4 +- .../apache/spark/util/StructuredLoggingSuite.scala | 29 ++-- 9 files changed, 193 insertions(+), 82 deletions(-) diff --git a/common/utils/src/main/java/org/apache/spark/internal/SparkLogger.java b/common/utils/src/main/java/org/apache/spark/internal/SparkLogger.java index bf8adb70637e..32dd8f1f26b5 100644 --- a/common/utils/src/main/java/org/apache/spark/internal/SparkLogger.java +++ b/common/utils/src/main/java/org/apache/spark/internal/SparkLogger.java @@ -28,6 +28,49 @@ import org.apache.logging.log4j.message.ParameterizedMessageFactory; import org.slf4j.Logger; // checkstyle.on: RegexpSinglelineJava +// checkstyle.off: RegexpSinglelineJava +/** + * Guidelines for the Structured Logging Framework - Java Logging + * + * + * Use the `org.apache.spark.internal.SparkLoggerFactory` to get the logger instance in Java code: + * Getting Logger Instance: + * Instead of using `org.slf4j.LoggerFactory`, use `org.apache.spark.internal.SparkLoggerFactory` + * to ensure structured logging. + * + * + * import org.apache.spark.internal.SparkLogger; + * import org.apache.spark.internal.SparkLoggerFactory; + * private static final SparkLogger logger = SparkLoggerFactory.getLogger(JavaUtils.class); + * + * + * Logging Messages with Variables: + * When logging messages with variables, wrap all the variables with `MDC`s and they will be + * automatically added to the Mapped Diagnostic Context (MDC). 
+ * + * + * import org.apache.spark.internal.LogKeys; + * import org.apache.spark.internal.MDC; + * logger.error("Unable to delete file for partition {}", MDC.of(LogKeys.PARTITION_ID$.MODULE$, i)); + * + * + * Constant String Messages: + * For logging constant string messages, use the standard logging methods. + * + * + * logger.error("Failed to abort the writer after failing to write map output.", e); + * + * + * If you want to output logs in `java code` through the structured log framework, + * you can define `custom LogKey` and use it in `java` code as follows: + * + * + * // To add a `custom LogKey`, implement `LogKey` + * public static class CUSTOM_LOG_KEY implements LogKey { } + * import org.apache.spark.internal.MDC; + * logger.error("Unable to delete key {} for cache", MDC.of(CUSTOM_LOG_KEY, "key")); + */ +// checkstyle.on: RegexpSinglelineJava public class SparkLogger { private static final MessageFactory MESSAGE_FACTORY = ParameterizedMessageFactory.INSTANCE; diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 534f00911922..1366277827f7 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.sca
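The excerpt above covers only the Java-facing half of the new guide; the same commit also extends Logging.scala with the Scala-side conventions. For comparison, a minimal Scala sketch of the corresponding pattern is below — the class and method names are invented for illustration, and the PARTITION_ID key is borrowed from the Java example in the diff:

    import org.apache.spark.internal.{Logging, MDC}
    import org.apache.spark.internal.LogKeys.PARTITION_ID

    class PartitionCleaner extends Logging {
      def cleanup(partitionId: Int): Unit = {
        // The log"..." interpolator plus MDC records `partitionId` under the
        // PARTITION_ID key when structured (JSON) logging is enabled, while
        // still rendering as plain text otherwise.
        logError(log"Unable to delete file for partition ${MDC(PARTITION_ID, partitionId)}")
      }
    }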
(spark) branch master updated: [SPARK-47579][SQL][FOLLOWUP] Restore the `--help` print format of spark sql shell
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3cb30c2366b2 [SPARK-47579][SQL][FOLLOWUP] Restore the `--help` print format of spark sql shell 3cb30c2366b2 is described below commit 3cb30c2366b27c5a65ec02121c30bd1a4eb20584 Author: Kent Yao AuthorDate: Fri May 24 09:43:03 2024 -0700 [SPARK-47579][SQL][FOLLOWUP] Restore the `--help` print format of spark sql shell ### What changes were proposed in this pull request? Restore the print format of spark sql shell ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? manually ![image](https://github.com/apache/spark/assets/8326978/17b9d009-5d93-4d84-9367-7308b4cda426) ![image](https://github.com/apache/spark/assets/8326978/a5e333bd-0e22-4d5a-83f1-843767f6d5f5) ### Was this patch authored or co-authored using generative AI tooling? no Closes #46735 from yaooqinn/SPARK-47579. Authored-by: Kent Yao Signed-off-by: Gengliang Wang --- common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala | 1 - core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala | 3 ++- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 1f67a211c01f..99fc58b03503 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -585,7 +585,6 @@ object LogKeys { case object SESSION_KEY extends LogKey case object SET_CLIENT_INFO_REQUEST extends LogKey case object SHARD_ID extends LogKey - case object SHELL_OPTIONS extends LogKey case object SHORT_USER_NAME extends LogKey case object SHUFFLE_BLOCK_INFO extends LogKey case object SHUFFLE_DB_BACKEND_KEY extends LogKey diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala index 61235a701907..e47596a6ae43 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala @@ -588,7 +588,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S ) if (SparkSubmit.isSqlShell(mainClass)) { - logInfo(log"CLI options:\n${MDC(SHELL_OPTIONS, getSqlShellOptions())}") + logInfo("CLI options:") + logInfo(getSqlShellOptions()) } throw SparkUserAppException(exitCode) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (4a471cceebed -> 2516fd8439df)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 4a471cceebed [MINOR][TESTS] Add a helper function for `spark.table` in dsl
     add 2516fd8439df [SPARK-45009][SQL][FOLLOW UP] Add error class and tests for decorrelation of predicate subqueries in join condition which reference both join child

No new revisions were added by this update.

Summary of changes:
 .../src/main/resources/error/error-conditions.json |  6 +++
 .../exists-in-join-condition.sql.out               | 44 ++
 .../in-subquery-in-join-condition.sql.out          | 44 ++
 .../exists-subquery/exists-in-join-condition.sql   |  4 ++
 .../in-subquery/in-subquery-in-join-condition.sql  |  4 ++
 .../exists-in-join-condition.sql.out               | 30 +++
 .../in-subquery-in-join-condition.sql.out          | 30 +++
 7 files changed, 162 insertions(+)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
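The new golden files exercise EXISTS/IN predicate subqueries placed in a join's ON clause. A query of the shape the added tests and error condition appear to target might look like the following sketch (table and column names are invented; assumes a SparkSession named `spark` with these tables defined):

    // t1(a), t2(b), t3(a, b) are hypothetical tables. The EXISTS predicate sits in the
    // join condition and references columns from both join children (t1.a and t2.b),
    // the shape named in the commit title above.
    spark.sql("""
      SELECT *
      FROM t1 JOIN t2
        ON t1.a = t2.b
       AND EXISTS (SELECT 1 FROM t3 WHERE t3.a = t1.a AND t3.b = t2.b)
    """)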
(spark) branch master updated (6b3a88195e30 -> febdbf56fb22)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 6b3a88195e30 [SPARK-48329][SQL] Enable `spark.sql.sources.v2.bucketing.pushPartValues.enabled` by default
     add febdbf56fb22 [SPARK-48031] Grandfather legacy views to SCHEMA BINDING

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/catalog/interface.scala     |   4 +-
 .../sql/catalyst/catalog/SessionCatalogSuite.scala |   5 +-
 .../view-schema-binding-config.sql.out             | 136 +++
 .../inputs/view-schema-binding-config.sql          |  29 +++
 .../sql-tests/results/show-tables.sql.out          |   2 +-
 .../results/view-schema-binding-config.sql.out     | 256 +
 .../execution/command/ShowTablesSuiteBase.scala    |   6 +-
 7 files changed, 429 insertions(+), 9 deletions(-)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated (c1dd4a5df693 -> 1a454287c01e)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git

    from c1dd4a5df693 [SPARK-48297][SQL] Fix a regression TRANSFORM clause with char/varchar
     add 1a454287c01e [SPARK-48294][SQL][3.5] Handle lowercase in nestedTypeMissingElementTypeError

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/sql/errors/QueryParsingErrors.scala |  2 +-
 .../spark/sql/errors/QueryParsingErrorsSuite.scala   | 19 +++
 2 files changed, 20 insertions(+), 1 deletion(-)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48303][CORE] Reorganize `LogKeys`
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5643cfb71d34 [SPARK-48303][CORE] Reorganize `LogKeys` 5643cfb71d34 is described below commit 5643cfb71d343133a185aa257f137074f41abfb3 Author: panbingkun AuthorDate: Thu May 16 23:20:23 2024 -0700 [SPARK-48303][CORE] Reorganize `LogKeys` ### What changes were proposed in this pull request? The pr aims to `reorganize` `LogKeys`, includes: - remove some unused `LogLeys` ACTUAL_BROADCAST_OUTPUT_STATUS_SIZE DEFAULT_COMPACTION_INTERVAL DRIVER_LIBRARY_PATH_KEY EXISTING_JARS EXPECTED_ANSWER FILTERS HAS_R_PACKAGE JAR_ENTRY LOG_KEY_FILE NUM_ADDED_MASTERS NUM_ADDED_WORKERS NUM_PARTITION_VALUES OUTPUT_LINE OUTPUT_LINE_NUMBER PARTITIONS_SIZE RULE_BATCH_NAME SERIALIZE_OUTPUT_LENGTH SHELL_COMMAND STREAM_SOURCE - merge `PARAMETER` into `PARAM` (because some are `full` spelled, and some are `abbreviations`, which are not unified) ESTIMATOR_PARAMETER_MAP -> ESTIMATOR_PARAM_MAP FUNCTION_PARAMETER -> FUNCTION_PARAM METHOD_PARAMETER_TYPES -> METHOD_PARAM_TYPES - merge `NUMBER` into `NUM` (abbreviations) MIN_VERSION_NUMBER -> MIN_VERSION_NUM RULE_NUMBER_OF_RUNS -> NUM_RULE_OF_RUNS VERSION_NUMBER -> VERSION_NUM - merge `TOTAL` into `NUM` TOTAL_RECORDS_READ -> NUM_RECORDS_READ TRAIN_WORD_COUNT -> NUM_TRAIN_WORD - `NUM` as prefix CHECKSUM_FILE_NUM -> NUM_CHECKSUM_FILE DATA_FILE_NUM -> NUM_DATA_FILE INDEX_FILE_NUM -> NUM_INDEX_FILE - COUNR -> NUM EXECUTOR_DESIRED_COUNT -> NUM_EXECUTOR_DESIRED EXECUTOR_LAUNCH_COUNT -> NUM_EXECUTOR_LAUNCH EXECUTOR_TARGET_COUNT -> NUM_EXECUTOR_TARGET KAFKA_PULLS_COUNT -> NUM_KAFKA_PULLS KAFKA_RECORDS_PULLED_COUNT -> NUM_KAFKA_RECORDS_PULLED MIN_FREQUENT_PATTERN_COUNT -> MIN_NUM_FREQUENT_PATTERN POD_COUNT -> NUM_POD POD_SHARED_SLOT_COUNT -> NUM_POD_SHARED_SLOT POD_TARGET_COUNT -> NUM_POD_TARGET RETRY_COUNT -> NUM_RETRY - fix some `typo` MALFORMATTED_STIRNG -> MALFORMATTED_STRING - other MAX_LOG_NUM_POLICY -> MAX_NUM_LOG_POLICY WEIGHTED_NUM -> NUM_WEIGHTED_EXAMPLES Changes in other code are additional changes caused by the above adjustments. ### Why are the changes needed? Let's make `LogKeys` easier to understand and more consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46612 from panbingkun/reorganize_logkey. 
Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../network/shuffle/RetryingBlockTransferor.java | 6 +- .../scala/org/apache/spark/internal/LogKey.scala | 68 -- .../sql/connect/client/GrpcRetryHandler.scala | 8 +-- .../sql/kafka010/KafkaOffsetReaderAdmin.scala | 4 +- .../sql/kafka010/KafkaOffsetReaderConsumer.scala | 4 +- .../sql/kafka010/consumer/KafkaDataConsumer.scala | 6 +- .../streaming/kinesis/KinesisBackedBlockRDD.scala | 4 +- .../org/apache/spark/api/r/RBackendHandler.scala | 4 +- .../spark/deploy/history/FsHistoryProvider.scala | 2 +- .../org/apache/spark/deploy/master/Master.scala| 2 +- .../apache/spark/ml/tree/impl/RandomForest.scala | 4 +- .../apache/spark/ml/tuning/CrossValidator.scala| 4 +- .../spark/ml/tuning/TrainValidationSplit.scala | 4 +- .../org/apache/spark/mllib/feature/Word2Vec.scala | 4 +- .../org/apache/spark/mllib/fpm/PrefixSpan.scala| 4 +- .../apache/spark/mllib/linalg/VectorsSuite.scala | 4 +- .../cluster/k8s/ExecutorPodsAllocator.scala| 6 +- ...ernetesLocalDiskShuffleExecutorComponents.scala | 6 +- .../apache/spark/deploy/yarn/YarnAllocator.scala | 6 +- .../catalyst/expressions/V2ExpressionUtils.scala | 4 +- .../spark/sql/catalyst/rules/RuleExecutor.scala| 6 +- .../sql/execution/streaming/state/RocksDB.scala| 18 +++--- .../streaming/state/RocksDBFileManager.scala | 22 +++ .../state/RocksDBStateStoreProvider.scala | 6 +- .../apache/hive/service/server/HiveServer2.java| 2 +- .../spark/sql/hive/client/HiveClientImpl.scala | 2 +- .../org/apache/spark/streaming/Checkpoint.scala| 4 +- .../streaming/util/FileBasedWriteAheadLog.scala| 4 +- 28 files changed, 101 insertions(+), 117 deletions(-) diff --git a/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/RetryingBlockTransferor.jav
(spark) branch master updated: [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 59f88c372522 [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError 59f88c372522 is described below commit 59f88c3725222b84b2d0b51ba40a769d99866b56 Author: Michael Zhang AuthorDate: Thu May 16 14:58:25 2024 -0700 [SPARK-48294][SQL] Handle lowercase in nestedTypeMissingElementTypeError ### What changes were proposed in this pull request? Handle lowercase values inside of nestTypeMissingElementTypeError to prevent match errors. ### Why are the changes needed? The previous match error was not user-friendly. Now it gives an actionable `INCOMPLETE_TYPE_DEFINITION` error. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? Newly added tests pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46623 from michaelzhan-db/SPARK-48294. Authored-by: Michael Zhang Signed-off-by: Gengliang Wang --- .../apache/spark/sql/errors/QueryParsingErrors.scala | 2 +- .../spark/sql/errors/QueryParsingErrorsSuite.scala| 19 +++ 2 files changed, 20 insertions(+), 1 deletion(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala b/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala index 5eafd4d915a4..816fa546a138 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/errors/QueryParsingErrors.scala @@ -289,7 +289,7 @@ private[sql] object QueryParsingErrors extends DataTypeErrorsBase { def nestedTypeMissingElementTypeError( dataType: String, ctx: PrimitiveDataTypeContext): Throwable = { -dataType match { +dataType.toUpperCase(Locale.ROOT) match { case "ARRAY" => new ParseException( errorClass = "INCOMPLETE_TYPE_DEFINITION.ARRAY", diff --git a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala index 29ab6e994e42..b7fb65091ef7 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryParsingErrorsSuite.scala @@ -647,6 +647,13 @@ class QueryParsingErrorsSuite extends QueryTest with SharedSparkSession with SQL sqlState = "42K01", parameters = Map("elementType" -> ""), context = ExpectedContext(fragment = "ARRAY", start = 30, stop = 34)) +// Create column of array type without specifying element type in lowercase +checkError( + exception = parseException("CREATE TABLE tbl_120691 (col1 array)"), + errorClass = "INCOMPLETE_TYPE_DEFINITION.ARRAY", + sqlState = "42K01", + parameters = Map("elementType" -> ""), + context = ExpectedContext(fragment = "array", start = 30, stop = 34)) } test("INCOMPLETE_TYPE_DEFINITION: struct type definition is incomplete") { @@ -674,6 +681,12 @@ class QueryParsingErrorsSuite extends QueryTest with SharedSparkSession with SQL errorClass = "PARSE_SYNTAX_ERROR", sqlState = "42601", parameters = Map("error" -> "'<'", "hint" -> ": missing ')'")) +// Create column of struct type without specifying field type in lowercase +checkError( + exception = parseException("CREATE TABLE tbl_120691 (col1 struct)"), + errorClass = "INCOMPLETE_TYPE_DEFINITION.STRUCT", + sqlState = "42K01", + context = ExpectedContext(fragment = 
"struct", start = 30, stop = 35)) } test("INCOMPLETE_TYPE_DEFINITION: map type definition is incomplete") { @@ -695,6 +708,12 @@ class QueryParsingErrorsSuite extends QueryTest with SharedSparkSession with SQL errorClass = "PARSE_SYNTAX_ERROR", sqlState = "42601", parameters = Map("error" -> "'<'", "hint" -> ": missing ')'")) +// Create column of map type without specifying key/value types in lowercase +checkError( + exception = parseException("SELECT CAST(map('1',2) AS map)"), + errorClass = "INCOMPLETE_TYPE_DEFINITION.MAP", + sqlState = "42K01", + context = ExpectedContext(fragment = "map", start = 26, stop = 28)) } test("INVALID_ESC: Escape string must contain only one character") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite*
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 283b2ff42221 [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite* 283b2ff42221 is described below commit 283b2ff422218b025e7b0170e4b7ed31a1294a80 Author: panbingkun AuthorDate: Thu May 16 11:55:20 2024 -0700 [SPARK-48291][CORE][FOLLOWUP] Rename Java *LoggerSuite* as *SparkLoggerSuite* ### What changes were proposed in this pull request? The pr is follow up https://github.com/apache/spark/pull/46600 to . Similarly, to maintain consistency, should be renamed to ### Why are the changes needed? After `org.apache.spark.internal.Logger` is renamed to `org.apache.spark.internal.SparkLogger` and `org.apache.spark.internal.LoggerFactory` is renamed to `org.apache.spark.internal.SparkLoggerFactory.`, the related UT's names should also be `renamed`, so that developers can easily locate the related UT. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46615 from panbingkun/SPARK-48291_follow_up. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../util/{PatternLoggerSuite.java => PatternSparkLoggerSuite.java} | 7 --- .../spark/util/{LoggerSuiteBase.java => SparkLoggerSuiteBase.java} | 2 +- ...{StructuredLoggerSuite.java => StructuredSparkLoggerSuite.java} | 6 +++--- common/utils/src/test/resources/log4j2.properties | 4 ++-- 4 files changed, 10 insertions(+), 9 deletions(-) diff --git a/common/utils/src/test/java/org/apache/spark/util/PatternLoggerSuite.java b/common/utils/src/test/java/org/apache/spark/util/PatternSparkLoggerSuite.java similarity index 91% rename from common/utils/src/test/java/org/apache/spark/util/PatternLoggerSuite.java rename to common/utils/src/test/java/org/apache/spark/util/PatternSparkLoggerSuite.java index 33de91697efa..2d370bad4cc8 100644 --- a/common/utils/src/test/java/org/apache/spark/util/PatternLoggerSuite.java +++ b/common/utils/src/test/java/org/apache/spark/util/PatternSparkLoggerSuite.java @@ -22,9 +22,10 @@ import org.apache.logging.log4j.Level; import org.apache.spark.internal.SparkLogger; import org.apache.spark.internal.SparkLoggerFactory; -public class PatternLoggerSuite extends LoggerSuiteBase { +public class PatternSparkLoggerSuite extends SparkLoggerSuiteBase { - private static final SparkLogger LOGGER = SparkLoggerFactory.getLogger(PatternLoggerSuite.class); + private static final SparkLogger LOGGER = +SparkLoggerFactory.getLogger(PatternSparkLoggerSuite.class); private String toRegexPattern(Level level, String msg) { return msg @@ -39,7 +40,7 @@ public class PatternLoggerSuite extends LoggerSuiteBase { @Override String className() { -return PatternLoggerSuite.class.getSimpleName(); +return PatternSparkLoggerSuite.class.getSimpleName(); } @Override diff --git a/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.java b/common/utils/src/test/java/org/apache/spark/util/SparkLoggerSuiteBase.java similarity index 99% rename from common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.java rename to common/utils/src/test/java/org/apache/spark/util/SparkLoggerSuiteBase.java index ecc0a75070c7..46bfe3415080 100644 --- a/common/utils/src/test/java/org/apache/spark/util/LoggerSuiteBase.java +++ 
b/common/utils/src/test/java/org/apache/spark/util/SparkLoggerSuiteBase.java @@ -30,7 +30,7 @@ import org.apache.spark.internal.SparkLogger; import org.apache.spark.internal.LogKeys; import org.apache.spark.internal.MDC; -public abstract class LoggerSuiteBase { +public abstract class SparkLoggerSuiteBase { abstract SparkLogger logger(); abstract String className(); diff --git a/common/utils/src/test/java/org/apache/spark/util/StructuredLoggerSuite.java b/common/utils/src/test/java/org/apache/spark/util/StructuredSparkLoggerSuite.java similarity index 95% rename from common/utils/src/test/java/org/apache/spark/util/StructuredLoggerSuite.java rename to common/utils/src/test/java/org/apache/spark/util/StructuredSparkLoggerSuite.java index 110e7cc7794e..416f0b6172c0 100644 --- a/common/utils/src/test/java/org/apache/spark/util/StructuredLoggerSuite.java +++ b/common/utils/src/test/java/org/apache/spark/util/StructuredSparkLoggerSuite.java @@ -24,10 +24,10 @@ import org.apache.logging.log4j.Level; import org.apache.spark.internal.SparkLogger; import org.apache.spark.internal.SparkLoggerFactory; -public class StructuredLoggerSuite extends LoggerSuiteBase { +public class StructuredSparkLoggerSuite extends SparkLoggerSuiteBase { private sta
(spark) branch master updated: [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory`
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dec910ba3c36 [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` dec910ba3c36 is described below commit dec910ba3c36e27b9cff5b5e139be82af6c799ab Author: panbingkun AuthorDate: Wed May 15 21:50:48 2024 -0700 [SPARK-48214][INFRA] Ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` ### What changes were proposed in this pull request? The pr aims to ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory`. ### Why are the changes needed? After the migration of structured logs on the `java side` is completed, we need to ban import `org.slf4j.Logger` & `org.slf4j.LoggerFactory` in the code to avoid the log format that is not written as required in the future new java code. ### Does this PR introduce _any_ user-facing change? Yes, only for spark developers. ### How was this patch tested? - Manually test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46502 from panbingkun/ban_import_slf4j. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- common/kvstore/pom.xml | 6 ++ .../java/org/apache/spark/util/kvstore/LevelDBIterator.java| 7 --- .../java/org/apache/spark/util/kvstore/DBIteratorSuite.java| 2 ++ .../java/org/apache/spark/util/kvstore/LevelDBBenchmark.java | 2 ++ .../java/org/apache/spark/util/kvstore/RocksDBBenchmark.java | 2 ++ .../apache/spark/network/util/TransportFrameDecoderSuite.java | 2 ++ .../spark/network/shuffle/RemoteBlockPushResolverSuite.java| 2 ++ .../apache/spark/network/shuffle/TestShuffleDataContext.java | 2 ++ .../src/main/java/org/apache/spark/internal/SparkLogger.java | 9 ++--- .../java/org/apache/spark/internal/SparkLoggerFactory.java | 10 ++ dev/checkstyle.xml | 5 + 11 files changed, 39 insertions(+), 10 deletions(-) diff --git a/common/kvstore/pom.xml b/common/kvstore/pom.xml index 3820d1b8e395..046648e9c2ae 100644 --- a/common/kvstore/pom.xml +++ b/common/kvstore/pom.xml @@ -40,6 +40,12 @@ spark-tags_${scala.binary.version} + + org.apache.spark + spark-common-utils_${scala.binary.version} + ${project.version} + + com.google.guava guava diff --git a/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java b/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java index b830e6afc617..69757fdc65d6 100644 --- a/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java +++ b/common/kvstore/src/main/java/org/apache/spark/util/kvstore/LevelDBIterator.java @@ -29,8 +29,9 @@ import com.google.common.annotations.VisibleForTesting; import com.google.common.base.Preconditions; import com.google.common.base.Throwables; import org.iq80.leveldb.DBIterator; -import org.slf4j.Logger; -import org.slf4j.LoggerFactory; + +import org.apache.spark.internal.SparkLogger; +import org.apache.spark.internal.SparkLoggerFactory; class LevelDBIterator implements KVStoreIterator { @@ -302,7 +303,7 @@ class LevelDBIterator implements KVStoreIterator { } static class ResourceCleaner implements Runnable { -private static final Logger LOG = LoggerFactory.getLogger(ResourceCleaner.class); +private static final SparkLogger LOG = SparkLoggerFactory.getLogger(ResourceCleaner.class); private final DBIterator dbIterator; diff --git a/common/kvstore/src/test/java/org/apache/spark/util/kvstore/DBIteratorSuite.java 
b/common/kvstore/src/test/java/org/apache/spark/util/kvstore/DBIteratorSuite.java index daedd56890a6..72c3690d1a18 100644 --- a/common/kvstore/src/test/java/org/apache/spark/util/kvstore/DBIteratorSuite.java +++ b/common/kvstore/src/test/java/org/apache/spark/util/kvstore/DBIteratorSuite.java @@ -32,8 +32,10 @@ import org.junit.jupiter.api.AfterAll; import org.junit.jupiter.api.BeforeEach; import org.junit.jupiter.api.BeforeAll; import org.junit.jupiter.api.Test; +// checkstyle.off: RegexpSinglelineJava import org.slf4j.Logger; import org.slf4j.LoggerFactory; +// checkstyle.on: RegexpSinglelineJava import static org.junit.jupiter.api.Assertions.*; public abstract class DBIteratorSuite { diff --git a/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBBenchmark.java b/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBBenchmark.java index 3158c18f9e1d..ff6db8fc34c9 100644 --- a/common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBBenchmark.java +++ b/common/kvstore/src/test/java/org/a
(spark) branch master updated: [SPARK-48291][CORE] Rename Java Logger as SparkLogger
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a252cbd5ca13 [SPARK-48291][CORE] Rename Java Logger as SparkLogger a252cbd5ca13 is described below commit a252cbd5ca13fb7b758c839edc92b50336747d82 Author: Gengliang Wang AuthorDate: Wed May 15 16:43:45 2024 -0700 [SPARK-48291][CORE] Rename Java Logger as SparkLogger ### What changes were proposed in this pull request? Two new classes `org.apache.spark.internal.Logger` and `org.apache.spark.internal.LoggerFactory` were introduced from https://github.com/apache/spark/pull/46301. Given that Logger is a widely recognized **interface** in Log4j, it may lead to confusion to have a class with the same name. To avoid this and clarify its purpose within the Spark framework, I propose renaming `org.apache.spark.internal.Logger` to `org.apache.spark.internal.SparkLogger`. Similarly, to maintain consistency, `org.apache.spark.internal.LoggerFactory` should be renamed to `org.apache.spark.internal.SparkLoggerFactory`. ### Why are the changes needed? To avoid naming confusion and clarify the java Spark logger purpose within the logging framework ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46600 from gengliangwang/refactorLogger. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../java/org/apache/spark/network/TransportContext.java | 6 +++--- .../org/apache/spark/network/client/TransportClient.java| 6 +++--- .../apache/spark/network/client/TransportClientFactory.java | 7 --- .../spark/network/client/TransportResponseHandler.java | 7 --- .../apache/spark/network/crypto/AuthClientBootstrap.java| 6 +++--- .../org/apache/spark/network/crypto/AuthRpcHandler.java | 6 +++--- .../org/apache/spark/network/protocol/MessageDecoder.java | 6 +++--- .../org/apache/spark/network/protocol/MessageEncoder.java | 6 +++--- .../apache/spark/network/protocol/SslMessageEncoder.java| 6 +++--- .../org/apache/spark/network/sasl/SaslClientBootstrap.java | 6 +++--- .../java/org/apache/spark/network/sasl/SaslRpcHandler.java | 6 +++--- .../java/org/apache/spark/network/sasl/SparkSaslClient.java | 6 +++--- .../java/org/apache/spark/network/sasl/SparkSaslServer.java | 6 +++--- .../spark/network/server/ChunkFetchRequestHandler.java | 7 --- .../apache/spark/network/server/OneForOneStreamManager.java | 7 --- .../java/org/apache/spark/network/server/RpcHandler.java| 6 +++--- .../spark/network/server/TransportChannelHandler.java | 7 --- .../spark/network/server/TransportRequestHandler.java | 7 --- .../org/apache/spark/network/server/TransportServer.java| 6 +++--- .../apache/spark/network/ssl/ReloadingX509TrustManager.java | 7 --- .../main/java/org/apache/spark/network/ssl/SSLFactory.java | 6 +++--- .../main/java/org/apache/spark/network/util/DBProvider.java | 6 +++--- .../java/org/apache/spark/network/util/LevelDBProvider.java | 8 .../java/org/apache/spark/network/util/NettyLogger.java | 6 +++--- .../java/org/apache/spark/network/util/RocksDBProvider.java | 8 .../org/apache/spark/network/sasl/ShuffleSecretManager.java | 7 --- .../org/apache/spark/network/shuffle/BlockStoreClient.java | 6 +++--- .../apache/spark/network/shuffle/ExternalBlockHandler.java | 7 --- .../spark/network/shuffle/ExternalShuffleBlockResolver.java | 7 --- 
.../apache/spark/network/shuffle/OneForOneBlockFetcher.java | 7 --- .../apache/spark/network/shuffle/OneForOneBlockPusher.java | 7 --- .../spark/network/shuffle/RemoteBlockPushResolver.java | 7 --- .../spark/network/shuffle/RetryingBlockTransferor.java | 7 --- .../spark/network/shuffle/ShuffleTransportContext.java | 9 + .../network/shuffle/checksum/ShuffleChecksumHelper.java | 8 .../org/apache/spark/network/yarn/YarnShuffleService.java | 13 +++-- .../apache/spark/internal/{Logger.java => SparkLogger.java} | 4 ++-- .../{LoggerFactory.java => SparkLoggerFactory.java} | 10 +- .../main/java/org/apache/spark/network/util/JavaUtils.java | 6 +++--- .../test/java/org/apache/spark/util/LoggerSuiteBase.java| 4 ++-- .../test/java/org/apache/spark/util/PatternLoggerSuite.java | 8 .../java/org/apache/spark/util/StructuredLoggerSuite.java | 9 + .../java/com/codahale/metrics/ganglia/GangliaReporter.java | 6 +++--- .../main/java/org/apache/spark/io/ReadAheadInputStream.java | 7 --- .../java/org/apache/spark/
(spark) branch master updated: [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3ae78c4c39a7 [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework 3ae78c4c39a7 is described below commit 3ae78c4c39a7084a321f2e01b4745cb6c442b7a5 Author: panbingkun AuthorDate: Tue May 14 17:23:44 2024 -0700 [SPARK-47599][MLLIB] MLLib: Migrate logWarn with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logWarn` in module `MLLib` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46527 from panbingkun/SPARK-47599. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 18 +++ .../main/scala/org/apache/spark/ml/Predictor.scala | 5 +++-- .../spark/ml/classification/Classifier.scala | 5 +++-- .../apache/spark/ml/classification/LinearSVC.scala | 4 ++-- .../ml/classification/LogisticRegression.scala | 10 + .../apache/spark/ml/classification/OneVsRest.scala | 8 --- .../classification/ProbabilisticClassifier.scala | 5 +++-- .../spark/ml/clustering/GaussianMixture.scala | 5 +++-- .../org/apache/spark/ml/clustering/KMeans.scala| 4 ++-- .../org/apache/spark/ml/feature/Binarizer.scala| 6 +++-- .../apache/spark/ml/feature/StopWordsRemover.scala | 7 +++--- .../apache/spark/ml/feature/StringIndexer.scala| 5 +++-- .../spark/ml/optim/WeightedLeastSquares.scala | 16 ++--- .../org/apache/spark/ml/recommendation/ALS.scala | 5 +++-- .../ml/regression/AFTSurvivalRegression.scala | 10 - .../ml/regression/DecisionTreeRegressor.scala | 5 +++-- .../apache/spark/ml/regression/GBTRegressor.scala | 6 ++--- .../regression/GeneralizedLinearRegression.scala | 6 ++--- .../spark/ml/regression/LinearRegression.scala | 18 +++ .../ml/regression/RandomForestRegressor.scala | 5 +++-- .../spark/ml/tree/impl/DecisionTreeMetadata.scala | 8 --- .../apache/spark/ml/tree/impl/RandomForest.scala | 9 .../spark/mllib/clustering/LocalKMeans.scala | 6 ++--- .../mllib/linalg/distributed/BlockMatrix.scala | 7 -- .../spark/mllib/linalg/distributed/RowMatrix.scala | 26 +- .../spark/mllib/optimization/GradientDescent.scala | 11 + .../recommendation/MatrixFactorizationModel.scala | 9 .../apache/spark/mllib/stat/test/ChiSqTest.scala | 7 +++--- .../spark/mllib/tree/model/DecisionTreeModel.scala | 18 +-- .../mllib/tree/model/treeEnsembleModels.scala | 18 +-- 30 files changed, 164 insertions(+), 108 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index bf5b7daab705..e03987933306 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -82,6 +82,7 @@ object LogKeys { case object CHECKPOINT_TIME extends LogKey case object CHECKSUM_FILE_NUM extends LogKey case object CHOSEN_WATERMARK extends LogKey + case object CLASSIFIER extends LogKey case object CLASS_LOADER extends LogKey case object CLASS_NAME extends LogKey case object CLASS_PATH extends LogKey @@ 
-157,12 +158,14 @@ object LogKeys { case object DEPRECATED_KEY extends LogKey case object DESCRIPTION extends LogKey case object DESIRED_NUM_PARTITIONS extends LogKey + case object DESIRED_TREE_DEPTH extends LogKey case object DESTINATION_PATH extends LogKey case object DFS_FILE extends LogKey case object DIFF_DELTA extends LogKey case object DIVISIBLE_CLUSTER_INDICES_SIZE extends LogKey case object DRIVER_ID extends LogKey case object DRIVER_LIBRARY_PATH_KEY extends LogKey + case object DRIVER_MEMORY_SIZE extends LogKey case object DRIVER_STATE extends LogKey case object DROPPED_PARTITIONS extends LogKey case object DURATION extends LogKey @@ -196,6 +199,7 @@ object LogKeys { case object EXECUTOR_IDS extends LogKey case object EXECUTOR_LAUNCH_COMMANDS extends LogKey case object EXECUTOR_LAUNCH_COUNT extends LogKey + case object EXECUTOR_MEMORY_SIZE extends LogKey case object EXECUTOR_RESOURCES extends LogKey case object EXECUTOR_SHUFFLE_INFO extends LogKey case object EXECUTOR_STATE
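At a call site, the migration is mechanical: the plain s-interpolator becomes the log interpolator and each variable is wrapped in an MDC keyed by an entry from LogKeys. A before/after sketch using the CLASSIFIER key added in this commit (the surrounding class and message text are illustrative):

    import org.apache.spark.internal.{Logging, MDC}
    import org.apache.spark.internal.LogKeys.CLASSIFIER

    class WeightChecker extends Logging {
      // Before: plain string interpolation, no structured key attached.
      def warnBefore(name: String): Unit =
        logWarning(s"Classifier $name does not support instance weights")

      // After: the same message, but `name` is tagged with LogKeys.CLASSIFIER
      // so it becomes a queryable field when JSON logging is enabled.
      def warnAfter(name: String): Unit =
        logWarning(log"Classifier ${MDC(CLASSIFIER, name)} does not support instance weights")
    }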
(spark) branch master updated: [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 79aeae1a9aaa [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework 79aeae1a9aaa is described below commit 79aeae1a9aaa2e9cfaf03a1f5d88e1447a3f9b19 Author: Tuan Pham AuthorDate: Tue May 14 13:07:08 2024 -0700 [SPARK-47579][CORE][PART2] Migrate logInfo with variables to structured logging framework The PR aims to migrate `logInfo` in Core module with variables to structured logging framework. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46494 from zeotuan/coreInfo2. Lead-authored-by: Tuan Pham Co-authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 33 ++ .../org/apache/spark/api/python/PythonUtils.scala | 5 +- .../apache/spark/broadcast/TorrentBroadcast.scala | 9 +- .../org/apache/spark/deploy/SparkSubmit.scala | 35 --- .../apache/spark/deploy/SparkSubmitArguments.scala | 12 +-- .../spark/deploy/history/ApplicationCache.scala| 8 +- .../deploy/history/EventLogFileCompactor.scala | 3 +- .../spark/deploy/history/EventLogFileWriters.scala | 2 +- .../spark/deploy/history/HistoryServer.scala | 7 +- .../deploy/history/HistoryServerDiskManager.scala | 19 ++-- .../history/HistoryServerMemoryManager.scala | 17 ++-- .../org/apache/spark/deploy/master/Master.scala| 111 - .../spark/deploy/master/ui/MasterWebUI.scala | 6 +- .../spark/deploy/rest/RestSubmissionServer.scala | 6 +- .../security/HBaseDelegationTokenProvider.scala| 4 +- .../security/HadoopDelegationTokenManager.scala| 12 ++- .../security/HadoopFSDelegationTokenProvider.scala | 8 +- .../apache/spark/deploy/worker/DriverRunner.scala | 10 +- .../apache/spark/deploy/worker/DriverWrapper.scala | 5 +- .../spark/deploy/worker/ExecutorRunner.scala | 2 +- .../apache/spark/deploy/worker/WorkerWatcher.scala | 4 +- .../spark/mapred/SparkHadoopMapRedUtil.scala | 15 +-- .../apache/spark/memory/ExecutionMemoryPool.scala | 3 +- .../apache/spark/memory/UnifiedMemoryManager.scala | 8 +- .../org/apache/spark/metrics/sink/StatsdSink.scala | 5 +- .../main/scala/org/apache/spark/rdd/JdbcRDD.scala | 5 +- .../spark/shuffle/ShuffleWriteProcessor.scala | 9 +- .../apache/spark/storage/DiskBlockManager.scala| 4 +- .../org/apache/spark/storage/FallbackStorage.scala | 4 +- .../src/main/scala/org/apache/spark/ui/WebUI.scala | 5 +- .../org/apache/spark/util/HadoopFSUtils.scala | 7 +- .../main/scala/org/apache/spark/util/Utils.scala | 9 +- .../spark/util/collection/ExternalSorter.scala | 8 +- .../apache/spark/util/logging/DriverLogger.scala | 4 +- .../apache/spark/util/logging/FileAppender.scala | 14 ++- .../spark/util/logging/RollingFileAppender.scala | 5 +- 36 files changed, 260 insertions(+), 163 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index a3c93a4b9f5e..bf5b7daab705 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -85,6 +85,7 @@ object LogKeys { case object CLASS_LOADER 
extends LogKey case object CLASS_NAME extends LogKey case object CLASS_PATH extends LogKey + case object CLASS_PATHS extends LogKey case object CLAUSES extends LogKey case object CLEANUP_LOCAL_DIRS extends LogKey case object CLUSTER_CENTROIDS extends LogKey @@ -122,6 +123,7 @@ object LogKeys { case object COST extends LogKey case object COUNT extends LogKey case object CREATED_POOL_NAME extends LogKey + case object CREDENTIALS_RENEWAL_INTERVAL_RATIO extends LogKey case object CROSS_VALIDATION_METRIC extends LogKey case object CROSS_VALIDATION_METRICS extends LogKey case object CSV_HEADER_COLUMN_NAME extends LogKey @@ -215,6 +217,7 @@ object LogKeys { case object FALLBACK_VERSION extends LogKey case object FEATURE_COLUMN extends LogKey case object FEATURE_DIMENSION extends LogKey + case object FETCH_SIZE extends LogKey case object FIELD_NAME extends LogKey case object FILE_ABSOLUTE_PATH extends LogKey case object FILE_END_OFFSET extends LogKey @@ -226,6 +229,7 @@ object LogKeys { case object FILE_NAME2 extends LogKey
(spark) branch master updated: [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 78d2a86a927f [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework 78d2a86a927f is described below commit 78d2a86a927f64403e485b14715a119e282cbdc8 Author: panbingkun AuthorDate: Mon May 13 21:04:11 2024 -0700 [SPARK-48209][CORE] Common (java side): Migrate `error/warn/info` with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to 1.migrate `error/warn/info` in module `common` with variables to `structured logging framework` for java side. 2.convert all dependencies on `org.slf4j.Logger & org.slf4j.LoggerFactory` to `org.apache.spark.internal.Logger & org.apache.spark.internal.LoggerFactory`, in order to completely `prohibit` importing `org.slf4j.Logger & org.slf4j.LoggerFactory` in java code later. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46493 from panbingkun/common_java_sl. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../org/apache/spark/network/TransportContext.java | 4 +- .../spark/network/client/TransportClient.java | 14 ++- .../network/client/TransportClientFactory.java | 24 ++-- .../network/client/TransportResponseHandler.java | 38 +-- .../spark/network/crypto/AuthClientBootstrap.java | 4 +- .../spark/network/crypto/AuthRpcHandler.java | 8 +- .../spark/network/protocol/MessageDecoder.java | 5 +- .../spark/network/protocol/MessageEncoder.java | 12 +- .../spark/network/protocol/SslMessageEncoder.java | 11 +- .../spark/network/sasl/SaslClientBootstrap.java| 4 +- .../apache/spark/network/sasl/SaslRpcHandler.java | 4 +- .../apache/spark/network/sasl/SparkSaslClient.java | 5 +- .../apache/spark/network/sasl/SparkSaslServer.java | 5 +- .../network/server/ChunkFetchRequestHandler.java | 23 ++-- .../network/server/OneForOneStreamManager.java | 4 +- .../apache/spark/network/server/RpcHandler.java| 5 +- .../network/server/TransportChannelHandler.java| 14 ++- .../network/server/TransportRequestHandler.java| 30 +++-- .../spark/network/server/TransportServer.java | 4 +- .../network/ssl/ReloadingX509TrustManager.java | 9 +- .../org/apache/spark/network/ssl/SSLFactory.java | 7 +- .../org/apache/spark/network/util/DBProvider.java | 4 +- .../apache/spark/network/util/LevelDBProvider.java | 16 +-- .../org/apache/spark/network/util/NettyLogger.java | 5 +- .../apache/spark/network/util/RocksDBProvider.java | 14 ++- .../spark/network/sasl/ShuffleSecretManager.java | 12 +- .../spark/network/shuffle/BlockStoreClient.java| 14 ++- .../network/shuffle/ExternalBlockHandler.java | 10 +- .../network/shuffle/ExternalBlockStoreClient.java | 21 ++-- .../shuffle/ExternalShuffleBlockResolver.java | 46 +--- .../network/shuffle/OneForOneBlockFetcher.java | 4 +- .../network/shuffle/OneForOneBlockPusher.java | 4 +- .../network/shuffle/RemoteBlockPushResolver.java | 123 ++--- .../network/shuffle/RetryingBlockTransferor.java | 34 -- .../network/shuffle/ShuffleTransportContext.java | 5 +- .../shuffle/checksum/ShuffleChecksumHelper.java| 13 ++- .../spark/network/yarn/YarnShuffleService.java | 44 +--- 
.../org/apache/spark/internal/LoggerFactory.java | 5 + .../org/apache/spark/network/util/JavaUtils.java | 11 +- .../scala/org/apache/spark/internal/LogKey.scala | 31 +- .../sql/connect/client/GrpcRetryHandler.scala | 4 +- .../codahale/metrics/ganglia/GangliaReporter.java | 23 ++-- .../network/netty/NettyBlockTransferService.scala | 14 ++- 43 files changed, 452 insertions(+), 239 deletions(-) diff --git a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java index 9f3b9c59256b..815f4dc6e6cd 100644 --- a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java +++ b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java @@ -34,9 +34,9 @@ import io.netty.handler.ssl.SslHandler; import io.netty.handler.stream.ChunkedWriteHandler; import io.netty.handler.timeout.IdleStateHandler; import io.netty.handler.codec.MessageToMessageEncoder; -import org.slf4j.Logger; -import org.slf4j.L
(spark) branch master updated (a101c48dd965 -> d9ff78e2e341)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from a101c48dd965 [SPARK-44953][CORE] Log a warning when shuffle tracking is enabled along side another DA supported mechanism
     add d9ff78e2e341 [SPARK-48260][SQL] Disable output committer coordination in one test of ParquetIOSuite

No new revisions were added by this update.

Summary of changes:
 .../datasources/parquet/ParquetIOSuite.scala | 89 +-
 1 file changed, 51 insertions(+), 38 deletions(-)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (b14abb3a2ed0 -> 8d8cc623085e)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from b14abb3a2ed0 [SPARK-48241][SQL] CSV parsing failure with char/varchar type columns
     add 8d8cc623085e [SPARK-41794][SQL] Add `try_remainder` function and re-enable column tests

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/sql/functions.scala     |  8
 .../sql/streaming/StreamingQueryListenerBus.scala  |  6 ++-
 .../CheckConnectJvmClientCompatibility.scala       |  3 +-
 docs/sql-ref-ansi-compliance.md                    |  1 +
 python/pyspark/sql/connect/functions/builtin.py    |  7 +++
 python/pyspark/sql/functions/builtin.py            | 52 +-
 .../sql/tests/connect/test_connect_column.py       | 16 +++
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  1 +
 .../spark/sql/catalyst/expressions/TryEval.scala   | 37 +++
 .../sql/catalyst/expressions/arithmetic.scala      |  4 ++
 .../sql/catalyst/expressions/TryEvalSuite.scala    | 13 ++
 .../scala/org/apache/spark/sql/functions.scala     |  9
 .../sql-functions/sql-expression-schema.md         |  1 +
 .../org/apache/spark/sql/MathFunctionsSuite.scala  | 11 +
 14 files changed, 156 insertions(+), 13 deletions(-)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
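A sketch of how the new function might be used from Scala, assuming the two-Column signature added to `functions.scala`; consistent with the rest of the `try_*` family, the intent is to return NULL where the ANSI `%` operator would raise (for example, a zero divisor):

    import org.apache.spark.sql.functions.{lit, try_remainder}

    // 7 % 2 = 1, while 7 % 0 yields NULL instead of a divide-by-zero error
    // under ANSI mode. Assumes a SparkSession named `spark`.
    spark.range(1)
      .select(
        try_remainder(lit(7), lit(2)).as("ok"),
        try_remainder(lit(7), lit(0)).as("null_on_error"))
      .show()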
(spark) branch master updated (5891b20ef492 -> 85a6e35d834e)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

    from 5891b20ef492 [SPARK-47186][TESTS][FOLLOWUP] Correct the name of spark.test.docker.connectionTimeout
     add 85a6e35d834e [SPARK-48182][SQL] SQL (java side): Migrate `error/warn/info` with variables to structured logging framework

No new revisions were added by this update.

Summary of changes:
 .../java/org/apache/spark/internal/Logger.java      |  4 ++
 .../scala/org/apache/spark/internal/LogKey.scala    | 13 ++
 .../expressions/RowBasedKeyValueBatch.java          | 11 ++---
 .../spark/sql/util/CaseInsensitiveStringMap.java    | 18
 .../org/apache/hive/service/AbstractService.java    | 13 +++---
 .../org/apache/hive/service/CompositeService.java   | 14 ---
 .../java/org/apache/hive/service/CookieSigner.java  |  5 ++-
 .../org/apache/hive/service/ServiceOperations.java  | 12 +++---
 .../java/org/apache/hive/service/ServiceUtils.java  |  2 +-
 .../apache/hive/service/auth/HiveAuthFactory.java   | 15 ---
 .../apache/hive/service/auth/HttpAuthUtils.java     | 12 --
 .../hive/service/auth/TSetIpAddressProcessor.java   |  7 ++--
 .../org/apache/hive/service/cli/CLIService.java     | 21 ++
 .../apache/hive/service/cli/ColumnBasedSet.java     |  9 ++--
 .../cli/operation/ClassicTableTypeMapping.java      | 13 --
 .../hive/service/cli/operation/Operation.java       | 28 -
 .../service/cli/operation/OperationManager.java     | 10 +++--
 .../hive/service/cli/session/HiveSessionImpl.java   | 49 +-
 .../hive/service/cli/session/SessionManager.java    | 49 +-
 .../hive/service/cli/thrift/ThriftCLIService.java   | 16 ---
 .../hive/service/cli/thrift/ThriftHttpServlet.java  | 14 ---
 .../apache/hive/service/server/HiveServer2.java     | 12 +++---
 .../service/server/ThreadWithGarbageCleanup.java    |  5 ++-
 .../sql/hive/thriftserver/SparkSQLCLIService.scala  |  2 +-
 24 files changed, 222 insertions(+), 132 deletions(-)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-48126][CORE] Make `spark.log.structuredLogging.enabled` effective
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 6bbf6b1eff2c [SPARK-48126][CORE] Make `spark.log.structuredLogging.enabled` effective 6bbf6b1eff2c is described below commit 6bbf6b1eff2cffe8d116ebba0194fac233b42348 Author: Gengliang Wang AuthorDate: Tue May 7 19:10:27 2024 -0700 [SPARK-48126][CORE] Make `spark.log.structuredLogging.enabled` effective ### What changes were proposed in this pull request? Currently, the spark conf `spark.log.structuredLogging.enabled` is not taking effect. The current code base checks this config in the method `prepareSubmitEnvironment`. However, Log4j is already initialized before that. This PR is to fix it by checking the config `spark.log.structuredLogging.enabled` before the initialization of Log4j. Also, this PR enhances the doc for this configuration. ### Why are the changes needed? Bug fix. After the fix, the Spark conf `spark.log.structuredLogging.enabled` takes effect. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: GPT-4 I used GPT-4 to improve the documents. Closes #46452 from gengliangwang/makeConfEffective. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../org/apache/spark/deploy/SparkSubmit.scala | 33 -- .../org/apache/spark/internal/config/package.scala | 9 +++--- docs/configuration.md | 6 +++- docs/core-migration-guide.md | 4 ++- 4 files changed, 31 insertions(+), 21 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala index 076aa8387dc5..5a7e5542cbd0 100644 --- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala +++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala @@ -69,10 +69,20 @@ private[spark] class SparkSubmit extends Logging { def doSubmit(args: Array[String]): Unit = { val appArgs = parseArguments(args) +val sparkConf = appArgs.toSparkConf() + // For interpreters, structured logging is disabled by default to avoid generating mixed // plain text and structured logs on the same console. if (isShell(appArgs.primaryResource) || isSqlShell(appArgs.mainClass)) { Logging.disableStructuredLogging() +} else { + // For non-shell applications, enable structured logging if it's not explicitly disabled + // via the configuration `spark.log.structuredLogging.enabled`. + if (sparkConf.getBoolean(STRUCTURED_LOGGING_ENABLED.key, defaultValue = true)) { +Logging.enableStructuredLogging() + } else { +Logging.disableStructuredLogging() + } } // Initialize logging if it hasn't been done yet. Keep track of whether logging needs to // be reset before the application starts. 
@@ -82,9 +92,9 @@ private[spark] class SparkSubmit extends Logging { logInfo(appArgs.toString) } appArgs.action match { - case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog) - case SparkSubmitAction.KILL => kill(appArgs) - case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs) + case SparkSubmitAction.SUBMIT => submit(appArgs, uninitLog, sparkConf) + case SparkSubmitAction.KILL => kill(appArgs, sparkConf) + case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs, sparkConf) case SparkSubmitAction.PRINT_VERSION => printVersion() } } @@ -96,12 +106,11 @@ private[spark] class SparkSubmit extends Logging { /** * Kill an existing submission. */ - private def kill(args: SparkSubmitArguments): Unit = { + private def kill(args: SparkSubmitArguments, sparkConf: SparkConf): Unit = { if (RestSubmissionClient.supportsRestClient(args.master)) { new RestSubmissionClient(args.master) .killSubmission(args.submissionToKill) } else { - val sparkConf = args.toSparkConf() sparkConf.set("spark.master", args.master) SparkSubmitUtils .getSubmitOperations(args.master) @@ -112,12 +121,11 @@ private[spark] class SparkSubmit extends Logging { /** * Request the status of an existing submission. */ - private def requestStatus(args: SparkSubmitArguments): Unit = { + private def requestStatus(args: SparkSubmitArguments, sparkConf: SparkConf): Unit = { if (RestSubmissionClient.supportsRestClient(args.master)) { new RestSubmissionClient(args.master) .requestSubmissionStatus(args.submissionToRequestStatusFor)
(spark) branch master updated: [SPARK-47240][CORE][PART1] Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a15adeb3a215 [SPARK-47240][CORE][PART1] Migrate logInfo with variables to structured logging framework a15adeb3a215 is described below commit a15adeb3a215ad2ef7222e18112d23cdffa8569a Author: Tuan Pham AuthorDate: Tue May 7 17:35:35 2024 -0700 [SPARK-47240][CORE][PART1] Migrate logInfo with variables to structured logging framework The PR aims to migrate `logInfo` in Core module with variables to structured logging framework. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46362 from zeotuan/coreInfo. Lead-authored-by: Tuan Pham Co-authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 48 ++- .../org/apache/spark/BarrierCoordinator.scala | 17 --- .../org/apache/spark/BarrierTaskContext.scala | 35 +++--- .../apache/spark/ExecutorAllocationManager.scala | 13 -- .../scala/org/apache/spark/MapOutputTracker.scala | 48 --- .../main/scala/org/apache/spark/SparkContext.scala | 54 ++ .../apache/spark/api/python/PythonHadoopUtil.scala | 2 +- .../org/apache/spark/api/python/PythonRDD.scala| 6 ++- .../org/apache/spark/api/python/PythonRunner.scala | 2 +- .../org/apache/spark/api/python/PythonUtils.scala | 7 +-- .../spark/api/python/StreamingPythonRunner.scala | 12 +++-- .../scala/org/apache/spark/deploy/Client.scala | 18 .../spark/deploy/ExternalShuffleService.scala | 10 ++-- .../apache/spark/rdd/ReliableCheckpointRDD.scala | 9 ++-- .../spark/rdd/ReliableRDDCheckpointData.scala | 6 ++- .../scala/org/apache/spark/ui/JettyUtils.scala | 7 ++- .../main/scala/org/apache/spark/ui/SparkUI.scala | 4 +- .../scala/org/apache/spark/util/ListenerBus.scala | 7 +-- .../scala/org/apache/spark/util/SignalUtils.scala | 2 +- 19 files changed, 205 insertions(+), 102 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index c127f9c3d1f9..14e822c6349f 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -26,13 +26,15 @@ trait LogKey { } /** - * Various keys used for mapped diagnostic contexts(MDC) in logging. - * All structured logging keys should be defined here for standardization. + * Various keys used for mapped diagnostic contexts(MDC) in logging. All structured logging keys + * should be defined here for standardization. 
*/ object LogKeys { case object ACCUMULATOR_ID extends LogKey + case object ACTUAL_BROADCAST_OUTPUT_STATUS_SIZE extends LogKey case object ACTUAL_NUM_FILES extends LogKey case object ACTUAL_PARTITION_COLUMN extends LogKey + case object ADDED_JARS extends LogKey case object AGGREGATE_FUNCTIONS extends LogKey case object ALPHA extends LogKey case object ANALYSIS_ERROR extends LogKey @@ -42,7 +44,10 @@ object LogKeys { case object APP_NAME extends LogKey case object APP_STATE extends LogKey case object ARGS extends LogKey + case object AUTH_ENABLED extends LogKey case object BACKUP_FILE extends LogKey + case object BARRIER_EPOCH extends LogKey + case object BARRIER_ID extends LogKey case object BATCH_ID extends LogKey case object BATCH_NAME extends LogKey case object BATCH_TIMESTAMP extends LogKey @@ -55,6 +60,7 @@ object LogKeys { case object BOOT extends LogKey case object BROADCAST extends LogKey case object BROADCAST_ID extends LogKey + case object BROADCAST_OUTPUT_STATUS_SIZE extends LogKey case object BUCKET extends LogKey case object BYTECODE_SIZE extends LogKey case object CACHED_TABLE_PARTITION_METADATA_SIZE extends LogKey @@ -62,6 +68,7 @@ object LogKeys { case object CACHE_UNTIL_HIGHEST_CONSUMED_SIZE extends LogKey case object CACHE_UNTIL_LAST_PRODUCED_SIZE extends LogKey case object CALL_SITE_LONG_FORM extends LogKey + case object CALL_SITE_SHORT_FORM extends LogKey case object CATALOG_NAME extends LogKey case object CATEGORICAL_FEATURES extends LogKey case object CHECKPOINT_FILE extends LogKey @@ -142,11 +149,13 @@ object LogKeys { case object DEPRECATED_KEY extends LogKey case object DESCRIPTION extends LogKey case object DESIRED_NUM_PARTITIONS extends LogKey + case object
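For readers new to the framework, the before/after shape of one of these migrations is easier to see outside the flattened diff. The sketch below assumes a hypothetical class that mixes in Spark's `Logging` trait; only the `MDC`/`LogKeys` API and the `ACCUMULATOR_ID` key come from the patch itself, and the message text is illustrative rather than a literal call site.

```scala
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKeys

// Hypothetical component, used only to illustrate the migration pattern applied
// across the core module in this change.
class AccumulatorRegistry extends Logging {

  def register(accumulatorId: Long): Unit = {
    // Before: plain string interpolation; the value is lost once the text is rendered.
    // logInfo(s"Registered accumulator $accumulatorId")

    // After: the value is wrapped in an MDC keyed by a standardized LogKey, so the
    // JSON layout emits it as a separate, queryable field alongside the message.
    logInfo(log"Registered accumulator ${MDC(LogKeys.ACCUMULATOR_ID, accumulatorId)}")
  }
}
```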
(spark) branch master updated: [SPARK-48134][CORE] Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3d9d1f3dc05a [SPARK-48134][CORE] Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework 3d9d1f3dc05a is described below commit 3d9d1f3dc05a2825bf315c68fc4e4232354dbd00 Author: panbingkun AuthorDate: Tue May 7 13:08:00 2024 -0700 [SPARK-48134][CORE] Spark core (java side): Migrate `error/warn/info` with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to 1.migrate `error/warn/info` in module `core` with variables to `structured logging framework` for java side. 2.convert all dependencies on `org.slf4j.Logger & org.slf4j.LoggerFactory` to `org.apache.spark.internal.Logger & org.apache.spark.internal.LoggerFactory`, in order to completely `prohibit` importing `org.slf4j.Logger & org.slf4j.LoggerFactory` in java code later. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46390 from panbingkun/core_java_sl. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../java/org/apache/spark/internal/Logger.java | 21 +++- .../scala/org/apache/spark/internal/LogKey.scala | 9 +++ .../org/apache/spark/io/ReadAheadInputStream.java | 19 --- .../org/apache/spark/memory/TaskMemoryManager.java | 28 +++--- .../shuffle/sort/BypassMergeSortShuffleWriter.java | 9 --- .../spark/shuffle/sort/ShuffleExternalSorter.java | 25 ++- .../spark/shuffle/sort/UnsafeShuffleWriter.java| 9 --- .../sort/io/LocalDiskShuffleMapOutputWriter.java | 10 .../apache/spark/unsafe/map/BytesToBytesMap.java | 12 ++ .../unsafe/sort/UnsafeExternalSorter.java | 21 +--- .../unsafe/sort/UnsafeSorterSpillReader.java | 4 ++-- 11 files changed, 113 insertions(+), 54 deletions(-) diff --git a/common/utils/src/main/java/org/apache/spark/internal/Logger.java b/common/utils/src/main/java/org/apache/spark/internal/Logger.java index 2b4dd3bb45bc..d8ab26424bae 100644 --- a/common/utils/src/main/java/org/apache/spark/internal/Logger.java +++ b/common/utils/src/main/java/org/apache/spark/internal/Logger.java @@ -34,6 +34,10 @@ public class Logger { this.slf4jLogger = slf4jLogger; } + public boolean isErrorEnabled() { +return slf4jLogger.isErrorEnabled(); + } + public void error(String msg) { slf4jLogger.error(msg); } @@ -58,6 +62,10 @@ public class Logger { } } + public boolean isWarnEnabled() { +return slf4jLogger.isWarnEnabled(); + } + public void warn(String msg) { slf4jLogger.warn(msg); } @@ -82,6 +90,10 @@ public class Logger { } } + public boolean isInfoEnabled() { +return slf4jLogger.isInfoEnabled(); + } + public void info(String msg) { slf4jLogger.info(msg); } @@ -106,6 +118,10 @@ public class Logger { } } + public boolean isDebugEnabled() { +return slf4jLogger.isDebugEnabled(); + } + public void debug(String msg) { slf4jLogger.debug(msg); } @@ -126,6 +142,10 @@ public class Logger { slf4jLogger.debug(msg, throwable); } + public boolean isTraceEnabled() { +return slf4jLogger.isTraceEnabled(); + } + public void trace(String msg) { slf4jLogger.trace(msg); } @@ -146,7 +166,6 @@ public class Logger { slf4jLogger.trace(msg, throwable); } - private void withLogContext( 
String pattern, MDC[] mdcs, diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index d4e1d9f535af..c127f9c3d1f9 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -168,6 +168,7 @@ object LogKeys { case object EXCEPTION extends LogKey case object EXECUTE_INFO extends LogKey case object EXECUTE_KEY extends LogKey + case object EXECUTION_MEMORY_SIZE extends LogKey case object EXECUTION_PLAN_LEAVES extends LogKey case object EXECUTOR_BACKEND extends LogKey case object EXECUTOR_DESIRED_COUNT extends LogKey @@ -302,6 +303,7 @@ object LogKeys { case object MAX_SLOTS extends LogKey case object MAX_SPLIT_BYTES extends LogKey case object MAX_TABLE_PARTITION_METADATA_SIZE extends LogKey + case object MEMORY_CONSUMER extends LogKey case object MEMORY_POOL_NAME extends Lo
(spark) branch master updated: [SPARK-48124][CORE] Disable structured logging for Connect-Repl by default
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new b924e689942d [SPARK-48124][CORE] Disable structured logging for Connect-Repl by default b924e689942d is described below commit b924e689942d735f165d31660d26efad057f4827 Author: panbingkun AuthorDate: Sat May 4 22:47:24 2024 -0700 [SPARK-48124][CORE] Disable structured logging for Connect-Repl by default ### What changes were proposed in this pull request? The pr is followup https://github.com/apache/spark/pull/46383, to `disable` structured logging for` Connect-Repl` by default. ### Why are the changes needed? Before: https://github.com/apache/spark/assets/15246973/10d93a09-f098-4653-9e95-571481dd03e9;> After: https://github.com/apache/spark/assets/15246973/e3354359-d6bc-4b2c-801b-8a2c3697f78e;> ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46387 from panbingkun/SPARK-48124_FOLLOWUP. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../main/scala/org/apache/spark/sql/application/ConnectRepl.scala| 5 + 1 file changed, 5 insertions(+) diff --git a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/application/ConnectRepl.scala b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/application/ConnectRepl.scala index 0360a4057886..9fd3ae4368f4 100644 --- a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/application/ConnectRepl.scala +++ b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/application/ConnectRepl.scala @@ -26,6 +26,7 @@ import ammonite.compiler.iface.CodeWrapper import ammonite.util.{Bind, Imports, Name, Util} import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging import org.apache.spark.sql.SparkSession import org.apache.spark.sql.connect.client.{SparkConnectClient, SparkConnectClientParser} @@ -55,6 +56,10 @@ object ConnectRepl { inputStream: InputStream = System.in, outputStream: OutputStream = System.out, errorStream: OutputStream = System.err): Unit = { +// For interpreters, structured logging is disabled by default to avoid generating mixed +// plain text and structured logs on the same console. +Logging.disableStructuredLogging() + // Build the client. val client = try { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
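As a rough illustration of where the toggle from this follow-up sits, the sketch below shows a Spark-internal, REPL-style entry point switching structured logging off before it writes anything to the console, the same move `ConnectRepl.doMain` makes in the patch. The object name, package, and surrounding code are assumptions; `Logging.disableStructuredLogging()` (and its `enableStructuredLogging()` counterpart) are the actual hooks, and they are intended for Spark's own code rather than third-party applications.

```scala
package org.apache.spark.example  // assumed: the toggles below are Spark-internal helpers

import org.apache.spark.internal.Logging

// Hypothetical interactive entry point, shown only to illustrate the ordering:
// the toggle runs before any console output or client construction.
object ExampleInteractiveShell {
  def main(args: Array[String]): Unit = {
    // Disable structured logging up front so the console does not interleave
    // JSON log records with plain-text user output.
    Logging.disableStructuredLogging()

    // ... build the Spark Connect client / session and run the interactive loop here ...
    println("shell ready")
  }
}
```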
(spark) branch master updated: [SPARK-48123][CORE] Provide a constant table schema for querying structured logs
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new e4453b480f98 [SPARK-48123][CORE] Provide a constant table schema for querying structured logs e4453b480f98 is described below commit e4453b480f988bf6683930ae14b7043a2cecffc4 Author: Gengliang Wang AuthorDate: Sat May 4 00:18:18 2024 -0700 [SPARK-48123][CORE] Provide a constant table schema for querying structured logs ### What changes were proposed in this pull request? Providing a table schema LOG_SCHEMA, so that users can load structured logs with the following code: ``` import org.apache.spark.util.LogUtils.LOG_SCHEMA val logDf = spark.read.schema(LOG_SCHEMA).json("path/to/logs") ``` ### Why are the changes needed? Provide a convenient way to query Spark logs using Spark SQL. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #46375 from gengliangwang/logSchema. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/util/LogUtils.scala | 50 + docs/configuration.md | 10 ++- sql/core/src/test/resources/log4j2.properties | 12 .../scala/org/apache/spark/sql/LogQuerySuite.scala | 83 ++ 4 files changed, 154 insertions(+), 1 deletion(-) diff --git a/common/utils/src/main/scala/org/apache/spark/util/LogUtils.scala b/common/utils/src/main/scala/org/apache/spark/util/LogUtils.scala new file mode 100644 index ..5a798ffad3a9 --- /dev/null +++ b/common/utils/src/main/scala/org/apache/spark/util/LogUtils.scala @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.util + +import org.apache.spark.annotation.DeveloperApi + +/** + * :: : DeveloperApi :: + * Utils for querying Spark logs with Spark SQL. + * + * @since 4.0.0 + */ +@DeveloperApi +object LogUtils { + /** + * Schema for structured Spark logs. + * Example usage: + * val logDf = spark.read.schema(LOG_SCHEMA).json("path/to/logs") + */ + val LOG_SCHEMA: String = """ +|ts TIMESTAMP, +|level STRING, +|msg STRING, +|context map, +|exception STRUCT< +| class STRING, +| msg STRING, +| stacktrace ARRAY> +|>, +|logger STRING""".stripMargin +} diff --git a/docs/configuration.md b/docs/configuration.md index a3b4e731f057..7966aceccdea 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -3675,7 +3675,15 @@ Spark uses [log4j](http://logging.apache.org/log4j/) for logging. 
You can config ## Structured Logging Starting from version 4.0.0, Spark has adopted the [JSON Template Layout](https://logging.apache.org/log4j/2.x/manual/json-template-layout.html) for logging, which outputs logs in JSON format. This format facilitates querying logs using Spark SQL with the JSON data source. Additionally, the logs include all Mapped Diagnostic Context (MDC) information for search and debugging purposes. -To implement structured logging, start with the `log4j2.properties.template` file. +To configure the layout of structured logging, start with the `log4j2.properties.template` file. + +To query Spark logs using Spark SQL, you can use the following Scala code snippet: + +```scala +import org.apache.spark.util.LogUtils.LOG_SCHEMA + +val logDf = spark.read.schema(LOG_SCHEMA).json("path/to/logs") +``` ## Plain Text Logging If you prefer plain text logging, you can use the `log4j2.properties.pattern-layout-template` file as a starting point. This is the default configuration used by Spark before the 4.0.0 release. This configuration uses the [PatternLayout](https://logging.apache.org/log4j/2.x/manual/layouts.html#PatternLayout) to log all the logs in plain text. MDC
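The snippet from the commit message extends naturally into a small, self-contained log-analysis session. The log path, session setup, and the particular filter are assumptions for illustration; the available columns (`ts`, `level`, `msg`, `context`, `exception`, `logger`) come from the `LOG_SCHEMA` definition added in this change.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.util.LogUtils.LOG_SCHEMA

val spark = SparkSession.builder()
  .appName("structured-log-query")
  .master("local[*]")  // assumption: an ad-hoc local session for log analysis
  .getOrCreate()

// Load JSON-formatted Spark logs using the constant schema introduced by this change.
val logDf = spark.read.schema(LOG_SCHEMA).json("/tmp/spark-logs/*.json")  // hypothetical path

// Example query: most recent errors with their MDC context and exception details.
logDf
  .where(col("level") === "ERROR")
  .orderBy(col("ts").desc)
  .select("ts", "msg", "context", "exception")
  .show(20, truncate = false)
```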
(spark) branch master updated: [SPARK-48059][CORE] Implement the structured log framework on the java side
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 5c01f196afc3 [SPARK-48059][CORE] Implement the structured log framework on the java side 5c01f196afc3 is described below commit 5c01f196afc3ba75f10c4aedf2c8405b6f59336a Author: panbingkun AuthorDate: Fri May 3 16:30:36 2024 -0700 [SPARK-48059][CORE] Implement the structured log framework on the java side ### What changes were proposed in this pull request? The pr aims to implement the structured log framework on the `java side`. ### Why are the changes needed? Currently, the structured log framework on the `scala side` is basically available, but the`Spark Core` code also includes some `Java code`, which also needs to be connected to the structured log framework. ### Does this PR introduce _any_ user-facing change? Yes, only for developers. ### How was this patch tested? - Add some new UT. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46301 from panbingkun/structured_logger_java. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../java/org/apache/spark/internal/Logger.java | 184 +++ .../org/apache/spark/internal/LoggerFactory.java | 26 +++ .../scala/org/apache/spark/internal/Logging.scala | 4 + .../org/apache/spark/util/LoggerSuiteBase.java | 248 + .../org/apache/spark/util/PatternLoggerSuite.java | 89 .../apache/spark/util/StructuredLoggerSuite.java | 164 ++ common/utils/src/test/resources/log4j2.properties | 28 ++- .../apache/spark/util/StructuredLoggingSuite.scala | 8 +- 8 files changed, 739 insertions(+), 12 deletions(-) diff --git a/common/utils/src/main/java/org/apache/spark/internal/Logger.java b/common/utils/src/main/java/org/apache/spark/internal/Logger.java new file mode 100644 index ..f252f44b3b76 --- /dev/null +++ b/common/utils/src/main/java/org/apache/spark/internal/Logger.java @@ -0,0 +1,184 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.internal; + +import java.util.HashMap; +import java.util.Map; +import java.util.function.Consumer; + +import org.apache.logging.log4j.CloseableThreadContext; +import org.apache.logging.log4j.message.MessageFactory; +import org.apache.logging.log4j.message.ParameterizedMessageFactory; + +public class Logger { + + private static final MessageFactory MESSAGE_FACTORY = ParameterizedMessageFactory.INSTANCE; + private final org.slf4j.Logger slf4jLogger; + + Logger(org.slf4j.Logger slf4jLogger) { +this.slf4jLogger = slf4jLogger; + } + + public void error(String msg) { +slf4jLogger.error(msg); + } + + public void error(String msg, Throwable throwable) { +slf4jLogger.error(msg, throwable); + } + + public void error(String msg, MDC... mdcs) { +if (mdcs == null || mdcs.length == 0) { + slf4jLogger.error(msg); +} else if (slf4jLogger.isErrorEnabled()) { + withLogContext(msg, mdcs, null, mt -> slf4jLogger.error(mt.message)); +} + } + + public void error(String msg, Throwable throwable, MDC... mdcs) { +if (mdcs == null || mdcs.length == 0) { + slf4jLogger.error(msg, throwable); +} else if (slf4jLogger.isErrorEnabled()) { + withLogContext(msg, mdcs, throwable, mt -> slf4jLogger.error(mt.message, mt.throwable)); +} + } + + public void warn(String msg) { +slf4jLogger.warn(msg); + } + + public void warn(String msg, Throwable throwable) { +slf4jLogger.warn(msg, throwable); + } + + public void warn(String msg, MDC... mdcs) { +if (mdcs == null || mdcs.length == 0) { + slf4jLogger.warn(msg); +} else if (slf4jLogger.isWarnEnabled()) { + withLogContext(msg, mdcs, null, mt -> slf4jLogger.warn(mt.message)); +} + } + + public void warn(String msg, Throwable throwable, MDC... mdcs) { +if (mdcs == null || mdcs.length == 0) { + slf4j
(spark) branch master updated: [SPARK-48067][SQL] Fix variant default columns
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ffa4d198cec6 [SPARK-48067][SQL] Fix variant default columns ffa4d198cec6 is described below commit ffa4d198cec6620f0385a0e428b023d2ac4e3d5c Author: Richard Chen AuthorDate: Thu May 2 12:22:02 2024 -0700 [SPARK-48067][SQL] Fix variant default columns ### What changes were proposed in this pull request? Changes the literal `sql` representation of a variant value to `parse_json(variant.toJson)`. This is because there is no other representation of a literal variant. This allows variant default columns to work because default columns store a literal string representation in the schema struct fields metadata as the default value. ### Why are the changes needed? previously we could not set a variant default column like ``` create table t( v6 variant default parse_json('{\"k\": \"v\"}') ) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? added UT ### Was this patch authored or co-authored using generative AI tooling? no Closes #46312 from richardc-db/fix_variant_default_cols. Authored-by: Richard Chen Signed-off-by: Gengliang Wang --- .../spark/sql/catalyst/expressions/literals.scala | 4 + .../scala/org/apache/spark/sql/VariantSuite.scala | 145 - 2 files changed, 146 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala index 0fad3eff2da5..4cffc7f0b53a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala @@ -42,6 +42,7 @@ import org.json4s.JsonAST._ import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow, ScalaReflection} import org.apache.spark.sql.catalyst.expressions.codegen._ +import org.apache.spark.sql.catalyst.expressions.variant.VariantExpressionEvalUtils import org.apache.spark.sql.catalyst.trees.TreePattern import org.apache.spark.sql.catalyst.trees.TreePattern.{LITERAL, NULL_LITERAL, TRUE_OR_FALSE_LITERAL} import org.apache.spark.sql.catalyst.types._ @@ -204,6 +205,8 @@ object Literal { create(new GenericInternalRow( struct.fields.map(f => default(f.dataType).value)), struct) case udt: UserDefinedType[_] => Literal(default(udt.sqlType).value, udt) +case VariantType => + create(VariantExpressionEvalUtils.castToVariant(0, IntegerType), VariantType) case other => throw QueryExecutionErrors.noDefaultForDataTypeError(dataType) } @@ -549,6 +552,7 @@ case class Literal (value: Any, dataType: DataType) extends LeafExpression { s"${Literal(kv._1, mapType.keyType).sql}, ${Literal(kv._2, mapType.valueType).sql}" } s"MAP(${keysAndValues.mkString(", ")})" +case (v: VariantVal, variantType: VariantType) => s"PARSE_JSON('${v.toJson(timeZoneId)}')" case _ => value.toString } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala index 19e5f9ba63e6..caab98b6239a 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/VariantSuite.scala @@ -26,15 +26,17 @@ import scala.jdk.CollectionConverters._ import scala.util.Random import 
org.apache.spark.SparkRuntimeException -import org.apache.spark.sql.catalyst.expressions.CodegenObjectFactoryMode +import org.apache.spark.sql.catalyst.expressions.{CodegenObjectFactoryMode, ExpressionEvalHelper, Literal} +import org.apache.spark.sql.catalyst.expressions.variant.{VariantExpressionEvalUtils, VariantGet} +import org.apache.spark.sql.catalyst.util.{ArrayBasedMapData, DateTimeUtils, GenericArrayData} import org.apache.spark.sql.functions._ import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.test.SharedSparkSession import org.apache.spark.sql.types._ -import org.apache.spark.unsafe.types.VariantVal +import org.apache.spark.unsafe.types.{UTF8String, VariantVal} import org.apache.spark.util.ArrayImplicits._ -class VariantSuite extends QueryTest with SharedSparkSession { +class VariantSuite extends QueryTest with SharedSparkSession with ExpressionEvalHelper { import testImplicits._ test("basic tests") { @@ -445,4 +447,141 @@ class VariantSuite extends QueryTest with SharedSparkSession { } } } +
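A quick end-to-end illustration of what this fix enables, run through `spark.sql`. The table name, data source, and inserted row are assumptions for the example (it presumes a provider that supports both `DEFAULT` column values and the `VARIANT` type); the `DEFAULT parse_json(...)` clause is the case from the commit message that previously could not be declared.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Assumption: Parquet is used here as a data source that accepts DEFAULT values.
spark.sql("""
  CREATE TABLE variant_default_demo (
    id INT,
    v  VARIANT DEFAULT parse_json('{"k": "v"}')
  ) USING parquet
""")

// Rows that omit `v` pick up the default, which is now representable as a literal
// because its SQL form is rendered as parse_json('<json text>').
spark.sql("INSERT INTO variant_default_demo (id) VALUES (1)")
spark.sql("SELECT id, v FROM variant_default_demo").show(truncate = false)
```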
(spark) branch branch-3.5 updated: [SPARK-48016][SQL][3.5] Fix a bug in try_divide function when with decimals
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 6a4475c0b8cb [SPARK-48016][SQL][3.5] Fix a bug in try_divide function when with decimals 6a4475c0b8cb is described below commit 6a4475c0b8cbcc6fca5fe7a9cd499d05c428c418 Author: Gengliang Wang AuthorDate: Wed May 1 14:32:52 2024 -0700 [SPARK-48016][SQL][3.5] Fix a bug in try_divide function when with decimals ### What changes were proposed in this pull request? Currently, the following query will throw DIVIDE_BY_ZERO error instead of returning null ``` SELECT try_divide(1, decimal(0)); ``` This is caused by the rule `DecimalPrecision`: ``` case b BinaryOperator(left, right) if left.dataType != right.dataType => (left, right) match { ... case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) ``` The result of the above makeCopy will contain `ANSI` as the `evalMode`, instead of `TRY`. This PR is to fix this bug by replacing the makeCopy method calls with withNewChildren ### Why are the changes needed? Bug fix in try_* functions. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a long-standing bug in the try_divide function. ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #46323 from gengliangwang/pickFix. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../sql/catalyst/analysis/DecimalPrecision.scala | 14 ++--- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 10 ++-- .../analyzer-results/ansi/try_arithmetic.sql.out | 56 +++ .../analyzer-results/try_arithmetic.sql.out| 56 +++ .../resources/sql-tests/inputs/try_arithmetic.sql | 8 +++ .../sql-tests/results/ansi/try_arithmetic.sql.out | 64 ++ .../sql-tests/results/try_arithmetic.sql.out | 64 ++ 7 files changed, 260 insertions(+), 12 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala index 09cf61a77955..f51127f53b38 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala @@ -83,7 +83,7 @@ object DecimalPrecision extends TypeCoercionRule { val resultType = widerDecimalType(p1, s1, p2, s2) val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType) val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType) - b.makeCopy(Array(newE1, newE2)) + b.withNewChildren(Seq(newE1, newE2)) } /** @@ -202,21 +202,21 @@ object DecimalPrecision extends TypeCoercionRule { case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) + b.withNewChildren(Seq(Cast(l, DataTypeUtils.fromLiteral(l)), r)) case (l, r: Literal) if l.dataType.isInstanceOf[DecimalType] && r.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(l, Cast(r, DataTypeUtils.fromLiteral(r + b.withNewChildren(Seq(l, Cast(r, DataTypeUtils.fromLiteral(r // Promote integers inside a binary expression with fixed-precision 
decimals to decimals, // and fixed-precision decimals in an expression with floats / doubles to doubles case (l @ IntegralTypeExpression(), r @ DecimalExpression(_, _)) => - b.makeCopy(Array(Cast(l, DecimalType.forType(l.dataType)), r)) + b.withNewChildren(Seq(Cast(l, DecimalType.forType(l.dataType)), r)) case (l @ DecimalExpression(_, _), r @ IntegralTypeExpression()) => - b.makeCopy(Array(l, Cast(r, DecimalType.forType(r.dataType + b.withNewChildren(Seq(l, Cast(r, DecimalType.forType(r.dataType case (l, r @ DecimalExpression(_, _)) if isFloat(l.dataType) => - b.makeCopy(Array(l, Cast(r, DoubleType))) + b.withNewChildren(Seq(l, Cast(r, DoubleType))) case (l @ DecimalExpression(_, _), r) if isFloat(r.dataType) => - b.makeCopy(Array(Cast(l,
(spark) branch master updated: [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 87b20b166c41 [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework 87b20b166c41 is described below commit 87b20b166c41d4c265ac54eed75707b7726d371f Author: panbingkun AuthorDate: Mon Apr 29 22:10:59 2024 -0700 [SPARK-47585][SQL] SQL core: Migrate logInfo with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logInfo` in module `SQL core` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46264 from panbingkun/SPARK-47585. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 55 +++--- .../org/apache/spark/ml/util/Instrumentation.scala | 4 +- .../spark/sql/catalyst/optimizer/Optimizer.scala | 9 ++-- .../spark/sql/catalyst/rules/RuleExecutor.scala| 6 +-- .../spark/sql/columnar/CachedBatchSerializer.scala | 6 ++- .../spark/sql/execution/DataSourceScanExec.scala | 11 +++-- .../ExternalAppendOnlyUnsafeRowArray.scala | 8 ++-- .../sql/execution/WholeStageCodegenExec.scala | 14 +++--- .../sql/execution/adaptive/AQEOptimizer.scala | 9 ++-- .../aggregate/AggregateCodegenSupport.scala| 10 ++-- .../execution/aggregate/HashAggregateExec.scala| 6 ++- .../aggregate/ObjectAggregationIterator.scala | 13 +++-- .../spark/sql/execution/command/CommandUtils.scala | 9 ++-- .../execution/command/createDataSourceTables.scala | 2 +- .../apache/spark/sql/execution/command/ddl.scala | 30 +++- .../datasources/BasicWriteStatsTracker.scala | 8 ++-- .../sql/execution/datasources/DataSource.scala | 4 +- .../execution/datasources/DataSourceStrategy.scala | 5 +- .../datasources/FileFormatDataWriter.scala | 11 +++-- .../execution/datasources/FileFormatWriter.scala | 9 ++-- .../sql/execution/datasources/FilePartition.scala | 8 ++-- .../sql/execution/datasources/FileScanRDD.scala| 4 +- .../execution/datasources/FileSourceStrategy.scala | 12 ++--- .../execution/datasources/InMemoryFileIndex.scala | 7 +-- .../datasources/PartitioningAwareFileIndex.scala | 7 +-- .../SQLHadoopMapReduceCommitProtocol.scala | 9 ++-- .../sql/execution/datasources/jdbc/JDBCRDD.scala | 5 +- .../execution/datasources/jdbc/JDBCRelation.scala | 7 +-- .../datasources/parquet/ParquetUtils.scala | 7 +-- .../execution/datasources/v2/FileBatchWrite.scala | 9 ++-- .../datasources/v2/FilePartitionReader.scala | 4 +- .../GroupBasedRowLevelOperationScanPlanning.scala | 17 --- .../datasources/v2/V2ScanRelationPushDown.scala| 32 +++-- .../datasources/v2/WriteToDataSourceV2Exec.scala | 52 +++- .../python/PythonStreamingSinkCommitRunner.scala | 5 +- .../execution/exchange/EnsureRequirements.scala| 42 ++--- .../python/PythonStreamingSourceRunner.scala | 5 +- .../spark/sql/execution/r/ArrowRRunner.scala | 24 +- .../WriteToContinuousDataSourceExec.scala | 2 +- .../state/HDFSBackedStateStoreProvider.scala | 16 --- .../apache/spark/sql/internal/SharedState.scala| 15 +++--- .../sql/streaming/StreamingQueryManager.scala | 4 +- 42 files changed, 318 insertions(+), 204 deletions(-) diff --git 
a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 2ca80a496ccb..238432d354f6 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -33,6 +33,7 @@ object LogKeys { case object ACCUMULATOR_ID extends LogKey case object ACTUAL_NUM_FILES extends LogKey case object ACTUAL_PARTITION_COLUMN extends LogKey + case object AGGREGATE_FUNCTIONS extends LogKey case object ALPHA extends LogKey case object ANALYSIS_ERROR extends LogKey case object APP_ATTEMPT_ID extends LogKey @@ -43,10 +44,13 @@ object LogKeys { case object ARGS extends LogKey case object BACKUP_FILE extends LogKey case object BATCH_ID extends LogKey + case object BATCH_NAME extends LogKey case object BATCH_TIMESTAMP extends LogKey case object BATCH_WRITE extends LogKey case object BLOCK_ID extends LogKey case object BLOCK_MANAGER_ID
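Since this migration, like the others, adds new entries to `LogKeys`, a brief sketch of the convention helps: every MDC key is declared exactly once, centrally and alphabetically, and longer messages are composed from concatenated `log"..."` parts. `AGGREGATE_FUNCTIONS` and `BATCH_NAME` are keys this patch adds; the reporter class, message text, and the assumption that log parts concatenate with `+` (as they do at the migrated call sites) are illustrative.

```scala
// Keys are declared once, centrally, in LogKey.scala and kept alphabetical, e.g. the
// entries added by this patch:
//   case object AGGREGATE_FUNCTIONS extends LogKey
//   case object BATCH_NAME extends LogKey

import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKeys

// Hypothetical reporter, used only to show the call-site side of the convention and
// how a longer structured message is built from concatenated parts.
class ExampleOptimizerReporter extends Logging {
  def reportBatchDone(batchName: String, aggregateFunctions: Seq[String]): Unit = {
    logInfo(log"Finished optimizer batch ${MDC(LogKeys.BATCH_NAME, batchName)} " +
      log"covering ${MDC(LogKeys.AGGREGATE_FUNCTIONS, aggregateFunctions.mkString(", "))}")
  }
}
```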
(spark) branch branch-3.4 updated: [SPARK-48016][SQL][3.4] Fix a bug in try_divide function when with decimals
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 2870c76fb582 [SPARK-48016][SQL][3.4] Fix a bug in try_divide function when with decimals 2870c76fb582 is described below commit 2870c76fb58266db69419e05a204c554f9733357 Author: Gengliang Wang AuthorDate: Mon Apr 29 22:01:16 2024 -0700 [SPARK-48016][SQL][3.4] Fix a bug in try_divide function when with decimals ### What changes were proposed in this pull request? Currently, the following query will throw DIVIDE_BY_ZERO error instead of returning null ``` SELECT try_divide(1, decimal(0)); ``` This is caused by the rule `DecimalPrecision`: ``` case b BinaryOperator(left, right) if left.dataType != right.dataType => (left, right) match { ... case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) ``` The result of the above makeCopy will contain `ANSI` as the `evalMode`, instead of `TRY`. This PR is to fix this bug by replacing the makeCopy method calls with withNewChildren ### Why are the changes needed? Bug fix in try_* functions. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a long-standing bug in the try_divide function. ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #46289 from gengliangwang/PICK_PR_46286_BRANCH-3.4. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../sql/catalyst/analysis/DecimalPrecision.scala | 14 +- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 10 +- .../analyzer-results/ansi/try_arithmetic.sql.out | 491 + .../analyzer-results/try_arithmetic.sql.out| 491 + .../resources/sql-tests/inputs/try_arithmetic.sql | 8 + .../sql-tests/results/ansi/try_arithmetic.sql.out | 64 +++ .../sql-tests/results/try_arithmetic.sql.out | 64 +++ 7 files changed, 1130 insertions(+), 12 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala index 46fbf071f437..19b6f2cf8dab 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala @@ -82,7 +82,7 @@ object DecimalPrecision extends TypeCoercionRule { val resultType = widerDecimalType(p1, s1, p2, s2) val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType) val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType) - b.makeCopy(Array(newE1, newE2)) + b.withNewChildren(Seq(newE1, newE2)) } /** @@ -201,21 +201,21 @@ object DecimalPrecision extends TypeCoercionRule { case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(Cast(l, DecimalType.fromLiteral(l)), r)) + b.withNewChildren(Seq(Cast(l, DecimalType.fromLiteral(l)), r)) case (l, r: Literal) if l.dataType.isInstanceOf[DecimalType] && r.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(l, Cast(r, DecimalType.fromLiteral(r + b.withNewChildren(Seq(l, Cast(r, DecimalType.fromLiteral(r // Promote integers inside a binary expression with fixed-precision 
decimals to decimals, // and fixed-precision decimals in an expression with floats / doubles to doubles case (l @ IntegralType(), r @ DecimalType.Expression(_, _)) => - b.makeCopy(Array(Cast(l, DecimalType.forType(l.dataType)), r)) + b.withNewChildren(Seq(Cast(l, DecimalType.forType(l.dataType)), r)) case (l @ DecimalType.Expression(_, _), r @ IntegralType()) => - b.makeCopy(Array(l, Cast(r, DecimalType.forType(r.dataType + b.withNewChildren(Seq(l, Cast(r, DecimalType.forType(r.dataType case (l, r @ DecimalType.Expression(_, _)) if isFloat(l.dataType) => - b.makeCopy(Array(l, Cast(r, DoubleType))) + b.withNewChildren(Seq(l, Cast(r, DoubleType))) case (l @ DecimalType.Expression(_, _), r) if isFloat(r.dataType) => - b.makeCopy(Array(Cast(l, DoubleType), r)) +
(spark) branch branch-3.5 updated: [SPARK-48016][SQL] Fix a bug in try_divide function when with decimals
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new e78ee2c57702 [SPARK-48016][SQL] Fix a bug in try_divide function when with decimals e78ee2c57702 is described below commit e78ee2c5770218a521340cb84f57a02dd00f7f3a Author: Gengliang Wang AuthorDate: Mon Apr 29 16:40:56 2024 -0700 [SPARK-48016][SQL] Fix a bug in try_divide function when with decimals Currently, the following query will throw DIVIDE_BY_ZERO error instead of returning null ``` SELECT try_divide(1, decimal(0)); ``` This is caused by the rule `DecimalPrecision`: ``` case b BinaryOperator(left, right) if left.dataType != right.dataType => (left, right) match { ... case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) ``` The result of the above makeCopy will contain `ANSI` as the `evalMode`, instead of `TRY`. This PR is to fix this bug by replacing the makeCopy method calls with withNewChildren Bug fix in try_* functions. Yes, it fixes a long-standing bug in the try_divide function. New UT No Closes #46286 from gengliangwang/avoidMakeCopy. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit 3fbcb26d8e992c65a2778b96da4142e234786e53) Signed-off-by: Gengliang Wang --- .../sql/catalyst/analysis/DecimalPrecision.scala | 14 ++--- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 10 ++-- sql/core/src/test/resources/log4j2.properties | 2 +- .../analyzer-results/ansi/try_arithmetic.sql.out | 56 +++ .../analyzer-results/try_arithmetic.sql.out| 56 +++ .../resources/sql-tests/inputs/try_arithmetic.sql | 8 +++ .../sql-tests/results/ansi/try_arithmetic.sql.out | 64 ++ .../sql-tests/results/try_arithmetic.sql.out | 64 ++ 8 files changed, 261 insertions(+), 13 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala index 09cf61a77955..f51127f53b38 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala @@ -83,7 +83,7 @@ object DecimalPrecision extends TypeCoercionRule { val resultType = widerDecimalType(p1, s1, p2, s2) val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType) val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType) - b.makeCopy(Array(newE1, newE2)) + b.withNewChildren(Seq(newE1, newE2)) } /** @@ -202,21 +202,21 @@ object DecimalPrecision extends TypeCoercionRule { case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) + b.withNewChildren(Seq(Cast(l, DataTypeUtils.fromLiteral(l)), r)) case (l, r: Literal) if l.dataType.isInstanceOf[DecimalType] && r.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(l, Cast(r, DataTypeUtils.fromLiteral(r + b.withNewChildren(Seq(l, Cast(r, DataTypeUtils.fromLiteral(r // Promote integers inside a binary expression with fixed-precision decimals to decimals, // and fixed-precision decimals in an expression with floats / doubles to 
doubles case (l @ IntegralTypeExpression(), r @ DecimalExpression(_, _)) => - b.makeCopy(Array(Cast(l, DecimalType.forType(l.dataType)), r)) + b.withNewChildren(Seq(Cast(l, DecimalType.forType(l.dataType)), r)) case (l @ DecimalExpression(_, _), r @ IntegralTypeExpression()) => - b.makeCopy(Array(l, Cast(r, DecimalType.forType(r.dataType + b.withNewChildren(Seq(l, Cast(r, DecimalType.forType(r.dataType case (l, r @ DecimalExpression(_, _)) if isFloat(l.dataType) => - b.makeCopy(Array(l, Cast(r, DoubleType))) + b.withNewChildren(Seq(l, Cast(r, DoubleType))) case (l @ DecimalExpression(_, _), r) if isFloat(r.dataType) => - b.makeCopy(Array(Cast(l, DoubleType), r)) + b.withNewChildren(Seq(Cast(l, DoubleType), r)) case _ => b } } diff
(spark) branch master updated: [SPARK-48016][SQL] Fix a bug in try_divide function when with decimals
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3fbcb26d8e99 [SPARK-48016][SQL] Fix a bug in try_divide function when with decimals 3fbcb26d8e99 is described below commit 3fbcb26d8e992c65a2778b96da4142e234786e53 Author: Gengliang Wang AuthorDate: Mon Apr 29 16:40:56 2024 -0700 [SPARK-48016][SQL] Fix a bug in try_divide function when with decimals ### What changes were proposed in this pull request? Currently, the following query will throw DIVIDE_BY_ZERO error instead of returning null ``` SELECT try_divide(1, decimal(0)); ``` This is caused by the rule `DecimalPrecision`: ``` case b BinaryOperator(left, right) if left.dataType != right.dataType => (left, right) match { ... case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) ``` The result of the above makeCopy will contain `ANSI` as the `evalMode`, instead of `TRY`. This PR is to fix this bug by replacing the makeCopy method calls with withNewChildren ### Why are the changes needed? Bug fix in try_* functions. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a long-standing bug in the try_divide function. ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #46286 from gengliangwang/avoidMakeCopy. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../sql/catalyst/analysis/DecimalPrecision.scala | 14 ++--- .../spark/sql/catalyst/analysis/TypeCoercion.scala | 10 ++-- .../analyzer-results/ansi/try_arithmetic.sql.out | 56 +++ .../analyzer-results/try_arithmetic.sql.out| 56 +++ .../resources/sql-tests/inputs/try_arithmetic.sql | 8 +++ .../sql-tests/results/ansi/try_arithmetic.sql.out | 64 ++ .../sql-tests/results/try_arithmetic.sql.out | 64 ++ 7 files changed, 260 insertions(+), 12 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala index 9ad8368d007e..6524ff9b2c57 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala @@ -92,7 +92,7 @@ object DecimalPrecision extends TypeCoercionRule { val resultType = widerDecimalType(p1, s1, p2, s2) val newE1 = if (e1.dataType == resultType) e1 else Cast(e1, resultType) val newE2 = if (e2.dataType == resultType) e2 else Cast(e2, resultType) - b.makeCopy(Array(newE1, newE2)) + b.withNewChildren(Seq(newE1, newE2)) } /** @@ -211,21 +211,21 @@ object DecimalPrecision extends TypeCoercionRule { case (l: Literal, r) if r.dataType.isInstanceOf[DecimalType] && l.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(Cast(l, DataTypeUtils.fromLiteral(l)), r)) + b.withNewChildren(Seq(Cast(l, DataTypeUtils.fromLiteral(l)), r)) case (l, r: Literal) if l.dataType.isInstanceOf[DecimalType] && r.dataType.isInstanceOf[IntegralType] && literalPickMinimumPrecision => - b.makeCopy(Array(l, Cast(r, DataTypeUtils.fromLiteral(r + b.withNewChildren(Seq(l, Cast(r, DataTypeUtils.fromLiteral(r // Promote integers inside a binary expression with fixed-precision decimals to 
decimals, // and fixed-precision decimals in an expression with floats / doubles to doubles case (l @ IntegralTypeExpression(), r @ DecimalExpression(_, _)) => - b.makeCopy(Array(Cast(l, DecimalType.forType(l.dataType)), r)) + b.withNewChildren(Seq(Cast(l, DecimalType.forType(l.dataType)), r)) case (l @ DecimalExpression(_, _), r @ IntegralTypeExpression()) => - b.makeCopy(Array(l, Cast(r, DecimalType.forType(r.dataType + b.withNewChildren(Seq(l, Cast(r, DecimalType.forType(r.dataType case (l, r @ DecimalExpression(_, _)) if isFloat(l.dataType) => - b.makeCopy(Array(l, Cast(r, DoubleType))) + b.withNewChildren(Seq(l, Cast(r, DoubleType))) case (l @ DecimalExpression(_, _), r) if isFloat(r.dataType) => - b.makeCopy(Array(Cast(l, DoubleType), r))
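To see the user-visible effect of the fix, the failing case from the description can be run directly; only the session setup is an assumption of the sketch.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Before the fix, DecimalPrecision rebuilt the binary operator with makeCopy, which
// reset the evaluation mode to the default (ANSI) and raised DIVIDE_BY_ZERO.
// With withNewChildren the TRY mode survives the decimal promotion, so this query
// returns a single row containing NULL.
spark.sql("SELECT try_divide(1, decimal(0))").show()
```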
(spark) branch master updated: [SPARK-47597][STREAMING] Streaming: Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d540786d9cea [SPARK-47597][STREAMING] Streaming: Migrate logInfo with variables to structured logging framework d540786d9cea is described below commit d540786d9ceacd7426803ad615f7ab32ec6faf67 Author: Daniel Tenedorio AuthorDate: Thu Apr 25 13:23:21 2024 -0700 [SPARK-47597][STREAMING] Streaming: Migrate logInfo with variables to structured logging framework ### What changes were proposed in this pull request? Migrate logInfo with variables of the streaming module to structured logging framework. This transforms the logInfo entries of the following API ``` def logInfo(msg: => String): Unit ``` to ``` def logInfo(entry: LogEntry): Unit ``` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? Yes, Spark core logs will contain additional MDC ### How was this patch tested? Compiler and scala style checks, as well as code review. ### Was this patch authored or co-authored using generative AI tooling? No Closes #46192 from dtenedor/streaming-log-info. Authored-by: Daniel Tenedorio Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 47 - .../execution/adaptive/AdaptiveSparkPlanExec.scala | 2 +- .../AsyncProgressTrackingMicroBatchExecution.scala | 5 +- .../streaming/CheckpointFileManager.scala | 8 ++- .../streaming/CompactibleFileStreamLog.scala | 11 ++-- .../sql/execution/streaming/FileStreamSink.scala | 4 +- .../execution/streaming/FileStreamSinkLog.scala| 4 +- .../sql/execution/streaming/FileStreamSource.scala | 13 ++-- .../sql/execution/streaming/HDFSMetadataLog.scala | 7 +- .../execution/streaming/IncrementalExecution.scala | 4 +- .../streaming/ManifestFileCommitProtocol.scala | 4 +- .../execution/streaming/MetadataLogFileIndex.scala | 4 +- .../execution/streaming/MicroBatchExecution.scala | 34 ++ .../sql/execution/streaming/ProgressReporter.scala | 8 +-- .../execution/streaming/ResolveWriteToStream.scala | 5 +- .../sql/execution/streaming/StreamExecution.scala | 7 +- .../sql/execution/streaming/WatermarkTracker.scala | 7 +- .../streaming/continuous/ContinuousExecution.scala | 13 ++-- .../continuous/ContinuousQueuedDataReader.scala| 6 +- .../streaming/continuous/ContinuousWriteRDD.scala | 10 +-- .../WriteToContinuousDataSourceExec.scala | 7 +- .../sources/RateStreamMicroBatchStream.scala | 5 +- .../state/HDFSBackedStateStoreProvider.scala | 34 ++ .../sql/execution/streaming/state/RocksDB.scala| 54 --- .../streaming/state/RocksDBFileManager.scala | 77 +- .../streaming/state/RocksDBMemoryManager.scala | 7 +- .../state/RocksDBStateStoreProvider.scala | 12 ++-- .../sql/execution/streaming/state/StateStore.scala | 23 --- .../streaming/state/StateStoreChangelog.scala | 8 ++- .../state/StreamingSessionWindowStateManager.scala | 5 +- .../state/SymmetricHashJoinStateManager.scala | 6 +- 31 files changed, 286 insertions(+), 155 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index fab5e80dd0e6..6df7cb5a5867 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -34,6 +34,7 @@ object LogKey extends Enumeration { 
val ARGS = Value val BACKUP_FILE = Value val BATCH_ID = Value + val BATCH_TIMESTAMP = Value val BATCH_WRITE = Value val BLOCK_ID = Value val BLOCK_MANAGER_ID = Value @@ -48,8 +49,12 @@ object LogKey extends Enumeration { val CATALOG_NAME = Value val CATEGORICAL_FEATURES = Value val CHECKPOINT_FILE = Value + val CHECKPOINT_LOCATION = Value + val CHECKPOINT_PATH = Value + val CHECKPOINT_ROOT = Value val CHECKPOINT_TIME = Value val CHECKSUM_FILE_NUM = Value + val CHOSEN_WATERMARK = Value val CLASS_LOADER = Value val CLASS_NAME = Value val CLUSTER_CENTROIDS = Value @@ -66,6 +71,8 @@ object LogKey extends Enumeration { val COLUMN_NAME = Value val COMMAND = Value val COMMAND_OUTPUT = Value + val COMMITTED_VERSION = Value + val COMPACT_INTERVAL = Value val COMPONENT = Value val CONFIG = Value val CONFIG2 = Value @@ -86,6 +93,7 @@ object LogKey extends Enumeration { val CSV_SCHEMA_FIELD_NAME = Value val CSV_SCHEMA_FIELD_NAMES = Value val CSV_SOURCE = Value + val CURRENT_BATCH_ID = Value
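The commit message calls out the signature change from `logInfo(msg: => String)` to `logInfo(entry: LogEntry)`, which means a message and its MDC context can be built as a value and then logged. A minimal sketch of that shape follows; the reporter class and message text are assumptions, `BATCH_ID` and `BATCH_TIMESTAMP` are real keys, and the example is written against the newer `LogKeys` object spelling used in the more recent commits above (this patch predates that rename and still uses the `LogKey` enumeration).

```scala
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKeys

// Hypothetical micro-batch reporter; illustrates building the structured entry as a
// value before handing it to logInfo.
class ExampleBatchReporter extends Logging {
  def batchCommitted(batchId: Long, batchTimestampMs: Long): Unit = {
    val entry = log"Committed batch ${MDC(LogKeys.BATCH_ID, batchId)} with event-time " +
      log"timestamp ${MDC(LogKeys.BATCH_TIMESTAMP, batchTimestampMs)} ms"
    logInfo(entry)
  }
}
```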
(spark) branch master updated (08caa567fb29 -> 775bc54fcd0d)
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

 from 08caa567fb29 [SPARK-47980][SQL][TESTS] Reactivate test 'Empty float/double array columns raise EOFException'
 add 775bc54fcd0d [SPARK-47580][SQL] SQL catalyst: eliminate unnamed variables in error logs

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/expressions/codegen/CodeGenerator.scala   | 6 ++
 .../scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala | 2 +-
 .../expressions/CodeGeneratorWithInterpretedFallbackSuite.scala  | 2 +-
 3 files changed, 4 insertions(+), 6 deletions(-)

- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47583][CORE] SQL core: Migrate logError with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 62dd64a5d13d [SPARK-47583][CORE] SQL core: Migrate logError with variables to structured logging framework 62dd64a5d13d is described below commit 62dd64a5d13d14a4e3bce50d9c264f8e494c7863 Author: Daniel Tenedorio AuthorDate: Wed Apr 24 13:43:05 2024 -0700 [SPARK-47583][CORE] SQL core: Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? Migrate logError with variables of the sql/core module to structured logging framework. This transforms the logError entries of the following API ``` def logError(msg: => String): Unit ``` to ``` def logError(entry: LogEntry): Unit ``` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? Yes, Spark core logs will contain additional MDC ### How was this patch tested? Compiler and scala style checks, as well as code review. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45969 from dtenedor/log-error-sql-core. Authored-by: Daniel Tenedorio Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 6 + .../execution/BaseScriptTransformationExec.scala | 10 --- .../execution/adaptive/AdaptiveSparkPlanExec.scala | 31 +- .../command/InsertIntoDataSourceDirCommand.scala | 4 ++- .../execution/command/createDataSourceTables.scala | 5 +++- .../apache/spark/sql/execution/command/ddl.scala | 13 - .../execution/datasources/FileFormatWriter.scala | 7 ++--- .../datasources/v2/WriteToDataSourceV2Exec.scala | 20 -- .../execution/exchange/BroadcastExchangeExec.scala | 4 ++- 9 files changed, 63 insertions(+), 37 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index b9b0e372a2b0..fab5e80dd0e6 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -34,6 +34,7 @@ object LogKey extends Enumeration { val ARGS = Value val BACKUP_FILE = Value val BATCH_ID = Value + val BATCH_WRITE = Value val BLOCK_ID = Value val BLOCK_MANAGER_ID = Value val BROADCAST_ID = Value @@ -116,6 +117,7 @@ object LogKey extends Enumeration { val ESTIMATOR_PARAMETER_MAP = Value val EVENT_LOOP = Value val EVENT_QUEUE = Value + val EXCEPTION = Value val EXECUTE_INFO = Value val EXECUTE_KEY = Value val EXECUTION_PLAN_LEAVES = Value @@ -162,6 +164,7 @@ object LogKey extends Enumeration { val HIVE_OPERATION_TYPE = Value val HOST = Value val HOST_PORT = Value + val IDENTIFIER = Value val INCOMPATIBLE_TYPES = Value val INDEX = Value val INDEX_FILE_NUM = Value @@ -330,11 +333,13 @@ object LogKey extends Enumeration { val SPARK_PLAN_ID = Value val SQL_TEXT = Value val SRC_PATH = Value + val STAGE_ATTEMPT = Value val STAGE_ID = Value val START_INDEX = Value val STATEMENT_ID = Value val STATE_STORE_PROVIDER = Value val STATUS = Value + val STDERR = Value val STORAGE_LEVEL = Value val STORAGE_LEVEL_DESERIALIZED = Value val STORAGE_LEVEL_REPLICATION = Value @@ -402,6 +407,7 @@ object LogKey extends Enumeration { val WEIGHTED_NUM = Value val WORKER_URL = Value val WRITE_AHEAD_LOG_INFO = Value + val WRITE_JOB_UUID = Value val XSD_PATH = Value 
type LogKey = Value diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala index 91042b59677b..6e54bde46942 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/BaseScriptTransformationExec.scala @@ -27,7 +27,8 @@ import scala.util.control.NonFatal import org.apache.hadoop.conf.Configuration import org.apache.spark.{SparkFiles, TaskContext} -import org.apache.spark.internal.Logging +import org.apache.spark.internal.{Logging, MDC} +import org.apache.spark.internal.LogKey._ import org.apache.spark.rdd.RDD import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow} import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSet, Cast, Expression, GenericInternalRow, JsonToStructs, Literal, StructsToJson, UnsafeProjection} @@ -185,7 +186,7 @@ trait BaseScriptTransformationExec extends UnaryExecN
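The hunks above show the shape of this migration: a plain interpolated string becomes a log"..." entry whose variables are wrapped in MDC. Below is a minimal sketch of that before/after pattern, assuming a class that mixes in org.apache.spark.internal.Logging; the class, method, and message text are invented for illustration, while BATCH_WRITE is one of the LogKey values added in the diff above.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey._

class ExampleWriter extends Logging {
  def reportAbort(batchWrite: String): Unit = {
    // Before the migration: a plain interpolated string with no structured fields.
    // logError(s"Data source write support $batchWrite aborted.")

    // After the migration: the log"" interpolator builds a LogEntry and the
    // variable is wrapped in MDC under the BATCH_WRITE key, so it is emitted
    // as a structured field as well as inside the rendered message.
    logError(log"Data source write support ${MDC(BATCH_WRITE, batchWrite)} aborted.")
  }
}
```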
(spark) branch master updated: [SPARK-47604][CORE] Resource managers: Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c88fabfee41d [SPARK-47604][CORE] Resource managers: Migrate logInfo with variables to structured logging framework c88fabfee41d is described below commit c88fabfee41df1ca4729058450ec6f798641c936 Author: panbingkun AuthorDate: Tue Apr 23 11:00:44 2024 -0700 [SPARK-47604][CORE] Resource managers: Migrate logInfo with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logInfo` in module `Resource managers` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46130 from panbingkun/SPARK-47604. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 45 +++- .../execution/ExecuteResponseObserver.scala| 2 +- .../deploy/k8s/SparkKubernetesClientFactory.scala | 9 +- .../k8s/submit/KubernetesClientApplication.scala | 8 +- .../deploy/k8s/submit/KubernetesClientUtils.scala | 4 +- .../k8s/submit/LoggingPodStatusWatcher.scala | 20 ++-- .../cluster/k8s/ExecutorPodsAllocator.scala| 25 ++-- .../cluster/k8s/ExecutorPodsLifecycleManager.scala | 8 +- .../scheduler/cluster/k8s/ExecutorRollPlugin.scala | 4 +- .../cluster/k8s/KubernetesClusterManager.scala | 5 +- .../k8s/KubernetesClusterSchedulerBackend.scala| 11 +- ...ernetesLocalDiskShuffleExecutorComponents.scala | 21 ++-- .../spark/deploy/yarn/ApplicationMaster.scala | 40 --- .../org/apache/spark/deploy/yarn/Client.scala | 59 ++ .../apache/spark/deploy/yarn/ClientArguments.scala | 5 +- .../spark/deploy/yarn/ExecutorRunnable.scala | 29 ++--- .../spark/deploy/yarn/SparkRackResolver.scala | 7 +- .../apache/spark/deploy/yarn/YarnAllocator.scala | 126 - .../yarn/YarnAllocatorNodeHealthTracker.scala | 12 +- .../cluster/YarnClientSchedulerBackend.scala | 4 +- .../scheduler/cluster/YarnSchedulerBackend.scala | 17 +-- 21 files changed, 283 insertions(+), 178 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 585373f1782b..b9b0e372a2b0 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -26,9 +26,12 @@ object LogKey extends Enumeration { val ACTUAL_PARTITION_COLUMN = Value val ALPHA = Value val ANALYSIS_ERROR = Value + val APP_ATTEMPT_ID = Value val APP_DESC = Value val APP_ID = Value + val APP_NAME = Value val APP_STATE = Value + val ARGS = Value val BACKUP_FILE = Value val BATCH_ID = Value val BLOCK_ID = Value @@ -45,6 +48,7 @@ object LogKey extends Enumeration { val CATEGORICAL_FEATURES = Value val CHECKPOINT_FILE = Value val CHECKPOINT_TIME = Value + val CHECKSUM_FILE_NUM = Value val CLASS_LOADER = Value val CLASS_NAME = Value val CLUSTER_CENTROIDS = Value @@ -70,6 +74,7 @@ object LogKey extends Enumeration { val CONSUMER = Value val CONTAINER = Value val CONTAINER_ID = Value + val CONTAINER_STATE = Value val COST = Value val COUNT = Value val CROSS_VALIDATION_METRIC = Value @@ -85,6 +90,7 @@ object 
LogKey extends Enumeration { val DATABASE_NAME = Value val DATAFRAME_CACHE_ENTRY = Value val DATAFRAME_ID = Value + val DATA_FILE_NUM = Value val DATA_SOURCE = Value val DATA_SOURCES = Value val DATA_SOURCE_PROVIDER = Value @@ -113,10 +119,16 @@ object LogKey extends Enumeration { val EXECUTE_INFO = Value val EXECUTE_KEY = Value val EXECUTION_PLAN_LEAVES = Value + val EXECUTOR_DESIRED_COUNT = Value + val EXECUTOR_ENVS = Value val EXECUTOR_ENV_REGEX = Value val EXECUTOR_ID = Value val EXECUTOR_IDS = Value + val EXECUTOR_LAUNCH_COMMANDS = Value + val EXECUTOR_LAUNCH_COUNT = Value + val EXECUTOR_RESOURCES = Value val EXECUTOR_STATE = Value + val EXECUTOR_TARGET_COUNT = Value val EXIT_CODE = Value val EXPECTED_NUM_FILES = Value val EXPECTED_PARTITION_COLUMN = Value @@ -129,8 +141,10 @@ object LogKey extends Enumeration { val FEATURE_COLUMN = Value val FEATURE_DIMENSION = Value val FIELD_NAME = Value + val FILE_ABSOLUTE_PATH = Value val FILE_FORMAT = Value val FILE_FORMAT2 = Value + val FILE_NAME = Value val FILE_VERSION
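Many of the migrated call sites carry several variables; those are expressed by concatenating log"..." segments with `+`, one MDC per variable. A short sketch under the same assumptions as above — the allocator class and message wording are invented, EXECUTOR_DESIRED_COUNT is one of the keys added in this commit, and COUNT and APP_ID are pre-existing keys.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey._

class ExampleAllocator extends Logging {
  def reportAllocation(desired: Int, running: Int, appId: String): Unit = {
    // Multi-variable messages are built by concatenating log"" parts;
    // each variable gets its own MDC key.
    logInfo(log"Requesting ${MDC(EXECUTOR_DESIRED_COUNT, desired)} executors " +
      log"(${MDC(COUNT, running)} currently running) for " +
      log"application ${MDC(APP_ID, appId)}.")
  }
}
```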
(spark) branch master updated (e01ac581f46a -> 3c7905e00d2e)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from e01ac581f46a [SPARK-47933][PYTHON] Parent Column class for Spark Connect and Spark Classic add 3c7905e00d2e [SPARK-47600][CORE] MLLib: Migrate logInfo with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 47 ++ .../org/apache/spark/ml/clustering/KMeans.scala| 22 +- .../optim/IterativelyReweightedLeastSquares.scala | 10 +++-- .../spark/ml/optim/WeightedLeastSquares.scala | 5 ++- .../org/apache/spark/ml/r/RWrapperUtils.scala | 11 ++--- .../ml/regression/RandomForestRegressor.scala | 2 +- .../spark/ml/tree/impl/GradientBoostedTrees.scala | 5 ++- .../apache/spark/ml/tree/impl/RandomForest.scala | 14 --- .../apache/spark/ml/tuning/CrossValidator.scala| 10 +++-- .../spark/ml/tuning/TrainValidationSplit.scala | 11 +++-- .../org/apache/spark/ml/util/DatasetUtils.scala| 8 ++-- .../org/apache/spark/ml/util/Instrumentation.scala | 18 ++--- .../scala/org/apache/spark/ml/util/ReadWrite.scala | 5 ++- .../spark/mllib/clustering/BisectingKMeans.scala | 18 + .../org/apache/spark/mllib/clustering/KMeans.scala | 23 ++- .../spark/mllib/clustering/LocalKMeans.scala | 8 ++-- .../clustering/PowerIterationClustering.scala | 15 +++ .../spark/mllib/clustering/StreamingKMeans.scala | 9 +++-- .../evaluation/BinaryClassificationMetrics.scala | 8 ++-- .../org/apache/spark/mllib/feature/Word2Vec.scala | 11 +++-- .../org/apache/spark/mllib/fpm/PrefixSpan.scala| 24 ++- .../regression/StreamingLinearAlgorithm.scala | 7 ++-- .../apache/spark/ml/recommendation/ALSSuite.scala | 13 +++--- .../apache/spark/mllib/linalg/VectorsSuite.scala | 4 +- 24 files changed, 206 insertions(+), 102 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (86563169eef8 -> f2d0cf23018f)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 86563169eef8 [SPARK-47940][BUILD][TESTS] Upgrade `guava` dependency to `33.1.0-jre` in Docker IT add f2d0cf23018f [SPARK-47907][SQL] Put bang under a config No new revisions were added by this update. Summary of changes: .../src/main/resources/error/error-conditions.json | 15 ++ ...-conditions-syntax-discontinued-error-class.md} | 16 +- docs/sql-migration-guide.md| 1 + .../spark/sql/catalyst/parser/SqlBaseParser.g4 | 55 ++--- .../spark/sql/catalyst/parser/AstBuilder.scala | 62 +- .../org/apache/spark/sql/internal/SQLConf.scala| 10 + .../sql/catalyst/parser/ErrorParserSuite.scala | 33 +++ .../analyzer-results/predicate-functions.sql.out | 194 ++ .../sql-tests/inputs/predicate-functions.sql | 36 .../sql-tests/results/predicate-functions.sql.out | 224 + 10 files changed, 602 insertions(+), 44 deletions(-) copy docs/{sql-error-conditions-illegal-state-store-value-error-class.md => sql-error-conditions-syntax-discontinued-error-class.md} (71%) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (fe47edece059 -> 8aa2dad46b79)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from fe47edece059 [SPARK-47883][SQL] Make `CollectTailExec.doExecute` lazy with RowQueue add 8aa2dad46b79 [SPARK-47596][DSTREAMS] Streaming: Migrate logWarn with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala| 18 ++ .../org/apache/spark/streaming/Checkpoint.scala | 21 + .../apache/spark/streaming/dstream/DStream.scala| 9 ++--- .../streaming/dstream/DStreamCheckpointData.scala | 9 ++--- .../spark/streaming/dstream/FileInputDStream.scala | 13 ++--- .../spark/streaming/dstream/InputDStream.scala | 6 -- .../spark/streaming/receiver/BlockGenerator.scala | 6 -- .../streaming/receiver/ReceivedBlockHandler.scala | 18 +++--- .../streaming/receiver/ReceiverSupervisor.scala | 6 +++--- .../streaming/receiver/ReceiverSupervisorImpl.scala | 5 +++-- .../spark/streaming/scheduler/JobGenerator.scala| 6 -- .../streaming/scheduler/ReceivedBlockTracker.scala | 5 +++-- .../spark/streaming/scheduler/ReceiverTracker.scala | 14 -- .../spark/streaming/util/BatchedWriteAheadLog.scala | 5 +++-- .../streaming/util/FileBasedWriteAheadLog.scala | 5 +++-- 15 files changed, 95 insertions(+), 51 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (47d783bc6489 -> 9718573ce748)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 47d783bc6489 [SPARK-47882][SQL] createTableColumnTypes need to be mapped to database types instead of using directly add 9718573ce748 [SPARK-47591][SQL] Hive-thriftserver: Migrate logInfo with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../main/scala/org/apache/spark/internal/LogKey.scala | 3 +++ .../thriftserver/SparkExecuteStatementOperation.scala | 18 -- .../hive/thriftserver/SparkGetCatalogsOperation.scala | 5 +++-- .../hive/thriftserver/SparkGetColumnsOperation.scala | 12 ++-- .../hive/thriftserver/SparkGetSchemasOperation.scala | 11 +-- .../thriftserver/SparkGetTableTypesOperation.scala | 5 +++-- .../hive/thriftserver/SparkGetTablesOperation.scala| 12 ++-- .../hive/thriftserver/SparkGetTypeInfoOperation.scala | 5 +++-- .../spark/sql/hive/thriftserver/SparkOperation.scala | 2 +- 9 files changed, 54 insertions(+), 19 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47584][SQL] SQL core: Migrate logWarn with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 4957a40d6e6b [SPARK-47584][SQL] SQL core: Migrate logWarn with variables to structured logging framework 4957a40d6e6b is described below commit 4957a40d6e6bf68226c8047687e8f30c93adb8ce Author: panbingkun AuthorDate: Wed Apr 17 11:59:09 2024 -0700 [SPARK-47584][SQL] SQL core: Migrate logWarn with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logWarning` in module `SQL core` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #46057 from panbingkun/SPARK-47584. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 65 +- .../catalyst/analysis/StreamingJoinHelper.scala| 4 +- .../ReplaceNullWithFalseInPredicate.scala | 4 +- .../main/scala/org/apache/spark/sql/Column.scala | 13 +++-- .../scala/org/apache/spark/sql/SparkSession.scala | 20 --- .../spark/sql/api/python/PythonSQLUtils.scala | 7 ++- .../org/apache/spark/sql/api/r/SQLUtils.scala | 9 +-- .../catalyst/analysis/ResolveSessionCatalog.scala | 9 ++- .../apache/spark/sql/execution/ExistingRDD.scala | 12 ++-- .../spark/sql/execution/QueryExecution.scala | 6 +- .../spark/sql/execution/SparkStrategies.scala | 10 ++-- .../sql/execution/WholeStageCodegenExec.scala | 8 ++- .../adaptive/InsertAdaptiveSparkPlan.scala | 6 +- .../execution/command/AnalyzeTablesCommand.scala | 6 +- .../spark/sql/execution/command/CommandUtils.scala | 9 +-- .../spark/sql/execution/command/SetCommand.scala | 28 ++ .../apache/spark/sql/execution/command/ddl.scala | 8 ++- .../datasources/BasicWriteStatsTracker.scala | 9 +-- .../sql/execution/datasources/DataSource.scala | 10 ++-- .../execution/datasources/DataSourceManager.scala | 6 +- .../sql/execution/datasources/FilePartition.scala | 11 ++-- .../sql/execution/datasources/FileScanRDD.scala| 8 ++- .../execution/datasources/FileStatusCache.scala| 14 +++-- .../execution/datasources/csv/CSVDataSource.scala | 9 +-- .../execution/datasources/jdbc/JDBCRelation.scala | 15 ++--- .../sql/execution/datasources/jdbc/JdbcUtils.scala | 11 ++-- .../datasources/json/JsonOutputWriter.scala| 8 ++- .../sql/execution/datasources/orc/OrcUtils.scala | 5 +- .../datasources/parquet/ParquetFileFormat.scala| 13 +++-- .../datasources/parquet/ParquetUtils.scala | 9 +-- .../execution/datasources/v2/CacheTableExec.scala | 4 +- .../execution/datasources/v2/CreateIndexExec.scala | 5 +- .../datasources/v2/CreateNamespaceExec.scala | 5 +- .../execution/datasources/v2/CreateTableExec.scala | 5 +- .../datasources/v2/DataSourceV2Strategy.scala | 5 +- .../execution/datasources/v2/DropIndexExec.scala | 4 +- .../datasources/v2/FilePartitionReader.scala | 7 ++- .../sql/execution/datasources/v2/FileScan.scala| 7 ++- .../v2/V2ScanPartitioningAndOrdering.scala | 8 ++- .../ApplyInPandasWithStatePythonRunner.scala | 8 ++- .../python/AttachDistributedSequenceExec.scala | 5 +- .../streaming/AvailableNowDataStreamWrapper.scala | 15 ++--- .../streaming/CheckpointFileManager.scala | 24 
.../streaming/CompactibleFileStreamLog.scala | 5 +- .../sql/execution/streaming/FileStreamSink.scala | 9 +-- .../sql/execution/streaming/FileStreamSource.scala | 16 -- .../execution/streaming/IncrementalExecution.scala | 9 +-- .../streaming/ManifestFileCommitProtocol.scala | 6 +- .../execution/streaming/MicroBatchExecution.scala | 24 .../spark/sql/execution/streaming/OffsetSeq.scala | 15 +++-- .../sql/execution/streaming/ProgressReporter.scala | 19 --- .../execution/streaming/ResolveWriteToStream.scala | 15 +++-- .../sql/execution/streaming/StreamExecution.scala | 8 ++- .../sql/execution/streaming/TimerStateImpl.scala | 11 ++-- .../sql/execution/streaming/TriggerExecutor.scala | 8 ++- .../continuous/ContinuousTextSocketSource.scala| 5 +- .../sources/TextSocketMicroBatchStream.scala | 5 +- .../state/HDFSBackedStateStoreProvider.scala | 25 + .../sql/execution/streaming/state/RocksDB.scala| 6 +- .../streaming/state/RocksDBFileManager.scala | 14 +++-- .../sql/execution/streaming/state
(spark) branch master updated: [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 898838a239d3 [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution 898838a239d3 is described below commit 898838a239d370429e49108a56c6a7fb22d6b399 Author: Paddy Xu AuthorDate: Wed Apr 17 10:53:02 2024 -0700 [SPARK-47627][SQL] Add SQL MERGE syntax to enable schema evolution ### Why are the changes needed? This PR introduces a syntax `WITH SCHEMA EVOLUTION` to the SQL MERGE command, which allows the user to specify automatic schema evolution for a specific operation. ```sql MERGE WITH SCHEMA EVOLUTION INTO tgt USING src ON ... WHEN ... ``` When `WITH SCHEMA EVOLUTION` is specified, schema evolution-related features must be turned on for this single statement and only in this statement. Spark is only responsible for recognizing the existence or absence of the syntax `WITH SCHEMA EVOLUTION`, and the result is passed down to the MERGE command. Data sources must respect the syntax and give appropriate reactions: turn on features that are categorised as "schema evolution" when the syntax does exist. For example, when the underlying table is Delta Lake, the feature "mergeSchema" will be turned on (see https://github.com/delta-io/delta/blob/c41977db3529a3139d6306abe5ded161 [...] ### Does this PR introduce _any_ user-facing change? Yes, see the previous section. ### How was this patch tested? New tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45748 from xupefei/merge-schema-evolution. Authored-by: Paddy Xu Signed-off-by: Gengliang Wang --- .../CheckConnectJvmClientCompatibility.scala | 1 + docs/sql-ref-ansi-compliance.md| 1 + .../spark/sql/catalyst/parser/SqlBaseLexer.g4 | 1 + .../spark/sql/catalyst/parser/SqlBaseParser.g4 | 4 +- .../spark/sql/catalyst/analysis/Analyzer.scala | 2 +- .../catalyst/analysis/RewriteMergeIntoTable.scala | 6 +-- .../spark/sql/catalyst/parser/AstBuilder.scala | 4 +- .../sql/catalyst/plans/logical/v2Commands.scala| 3 +- .../sql/catalyst/analysis/AnalysisSuite.scala | 7 ++-- .../PullupCorrelatedPredicatesSuite.scala | 5 ++- .../ReplaceNullWithFalseInPredicateSuite.scala | 6 ++- .../spark/sql/catalyst/parser/DDLParserSuite.scala | 42 + .../org/apache/spark/sql/MergeIntoWriter.scala | 19 +- .../sql-tests/results/ansi/keywords.sql.out| 1 + .../resources/sql-tests/results/keywords.sql.out | 1 + .../execution/command/PlanResolutionSuite.scala| 43 +- .../ThriftServerWithSparkContextSuite.scala| 2 +- 17 files changed, 113 insertions(+), 35 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala index 0f383d007f29..f73290c5ce29 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala @@ -304,6 +304,7 @@ object CheckConnectJvmClientCompatibility { // MergeIntoWriter ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.MergeIntoWriter"), + ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.MergeIntoWriter$"), 
ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.WhenMatched"), ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.WhenMatched$"), ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.sql.WhenNotMatched"), diff --git a/docs/sql-ref-ansi-compliance.md b/docs/sql-ref-ansi-compliance.md index bf1819b9767b..0256a3e0869d 100644 --- a/docs/sql-ref-ansi-compliance.md +++ b/docs/sql-ref-ansi-compliance.md @@ -492,6 +492,7 @@ Below is a list of all the keywords in Spark SQL. |END|reserved|non-reserved|reserved| |ESCAPE|reserved|non-reserved|reserved| |ESCAPED|non-reserved|non-reserved|non-reserved| +|EVOLUTION|non-reserved|non-reserved|non-reserved| |EXCEPT|reserved|strict-non-reserved|reserved| |EXCHANGE|non-reserved|non-reserved|non-reserved| |EXCLUDE|non-reserved|non-reserved|non-reserved| diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 inde
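Building on the snippet in the description above, a hedged sketch of issuing the new clause from Scala code: it assumes a SparkSession named `spark` and existing tables `tgt` and `src`; the join condition and WHEN actions are invented for illustration, and whether new columns are actually added remains up to the underlying data source, as the description notes.
```
// Assumes a SparkSession `spark` and existing tables `tgt` and `src`;
// names and WHEN actions are illustrative.
spark.sql("""
  MERGE WITH SCHEMA EVOLUTION INTO tgt
  USING src
  ON tgt.id = src.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
```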
(spark) branch master updated: [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f77495909b29 [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework f77495909b29 is described below commit f77495909b29fe4883afcfd8fec7be048fe494a3 Author: Gengliang Wang AuthorDate: Tue Apr 16 22:32:34 2024 -0700 [SPARK-47588][CORE] Hive module: Migrate logInfo with variables to structured logging framework ### What changes were proposed in this pull request? Migrate logInfo in Hive module with variables to structured logging framework. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GA tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #46086 from gengliangwang/hive_loginfo. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 4 +++ .../spark/sql/hive/HiveExternalCatalog.scala | 30 -- .../spark/sql/hive/HiveMetastoreCatalog.scala | 9 --- .../org/apache/spark/sql/hive/HiveUtils.scala | 27 +++ .../spark/sql/hive/client/HiveClientImpl.scala | 5 ++-- .../sql/hive/client/IsolatedClientLoader.scala | 4 +-- .../spark/sql/hive/orc/OrcFileOperator.scala | 4 +-- 7 files changed, 48 insertions(+), 35 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index bfeb733af30a..838ef0355e3a 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -95,10 +95,13 @@ object LogKey extends Enumeration { val GROUP_ID = Value val HADOOP_VERSION = Value val HISTORY_DIR = Value + val HIVE_CLIENT_VERSION = Value + val HIVE_METASTORE_VERSION = Value val HIVE_OPERATION_STATE = Value val HIVE_OPERATION_TYPE = Value val HOST = Value val HOST_PORT = Value + val INCOMPATIBLE_TYPES = Value val INDEX = Value val INFERENCE_MODE = Value val INITIAL_CAPACITY = Value @@ -152,6 +155,7 @@ object LogKey extends Enumeration { val POLICY = Value val PORT = Value val PRODUCER_ID = Value + val PROVIDER = Value val QUERY_CACHE_VALUE = Value val QUERY_HINT = Value val QUERY_ID = Value diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala index 8c35e10b383f..60f2d2f3e5fe 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala @@ -34,7 +34,7 @@ import org.apache.thrift.TException import org.apache.spark.{SparkConf, SparkException} import org.apache.spark.internal.{Logging, MDC} -import org.apache.spark.internal.LogKey.{DATABASE_NAME, SCHEMA, SCHEMA2, TABLE_NAME} +import org.apache.spark.internal.LogKey.{DATABASE_NAME, INCOMPATIBLE_TYPES, PROVIDER, SCHEMA, SCHEMA2, TABLE_NAME} import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException @@ -338,35 +338,37 @@ private[spark] class HiveExternalCatalog(conf: SparkConf, hadoopConf: Configurat val (hiveCompatibleTable, logMessage) = 
maybeSerde match { case _ if options.skipHiveMetadata => val message = - s"Persisting data source table $qualifiedTableName into Hive metastore in" + -"Spark SQL specific format, which is NOT compatible with Hive." + log"Persisting data source table ${MDC(TABLE_NAME, qualifiedTableName)} into Hive " + +log"metastore in Spark SQL specific format, which is NOT compatible with Hive." (None, message) case _ if incompatibleTypes.nonEmpty => +val incompatibleTypesStr = incompatibleTypes.mkString(", ") val message = - s"Hive incompatible types found: ${incompatibleTypes.mkString(", ")}. " + -s"Persisting data source table $qualifiedTableName into Hive metastore in " + -"Spark SQL specific format, which is NOT compatible with Hive." + log"Hive incompatible types found: ${MDC(INCOMPATIBLE_TYPES, incompatibleTypesStr)}. " + +log"Persisting data source table ${MDC(TABLE_NAME, qualifiedTableN
(spark) branch master updated: [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f7440f384191 [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework f7440f384191 is described below commit f7440f3841918f2cdb4a8e710cfe31d3fc85230c Author: Haejoon Lee AuthorDate: Tue Apr 16 13:56:03 2024 -0700 [SPARK-47590][SQL] Hive-thriftserver: Migrate logWarn with variables to structured logging framework ### What changes were proposed in this pull request? This PR proposes to migrate `logWarning` with variables of Hive-thriftserver module to structured logging framework. ### Why are the changes needed? To improve the existing logging system by migrating into structured logging. ### Does this PR introduce _any_ user-facing change? No API changes, but the SQL catalyst logs will contain MDC(Mapped Diagnostic Context) from now. ### How was this patch tested? Run Scala auto formatting and style check. Also the existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45923 from itholic/hive-ts-logwarn. Lead-authored-by: Haejoon Lee Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 1 + .../SparkExecuteStatementOperation.scala | 4 ++- .../sql/hive/thriftserver/SparkSQLCLIDriver.scala | 15 - .../ui/HiveThriftServer2Listener.scala | 36 -- 4 files changed, 38 insertions(+), 18 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 41289c641424..bfeb733af30a 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -94,6 +94,7 @@ object LogKey extends Enumeration { val FUNCTION_PARAMETER = Value val GROUP_ID = Value val HADOOP_VERSION = Value + val HISTORY_DIR = Value val HIVE_OPERATION_STATE = Value val HIVE_OPERATION_TYPE = Value val HOST = Value diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala index 628925007f7e..f8f58cd422b6 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala @@ -256,7 +256,9 @@ private[hive] class SparkExecuteStatementOperation( val currentState = getStatus().getState() if (currentState.isTerminal) { // This may happen if the execution was cancelled, and then closed from another thread. 
- logWarning(s"Ignore exception in terminal state with $statementId: $e") + logWarning( +log"Ignore exception in terminal state with ${MDC(STATEMENT_ID, statementId)}", e + ) } else { logError(log"Error executing query with ${MDC(STATEMENT_ID, statementId)}, " + log"currentState ${MDC(HIVE_OPERATION_STATE, currentState)}, ", e) diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala index 03d8fd0c8ff2..888c086e9042 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala @@ -41,7 +41,7 @@ import sun.misc.{Signal, SignalHandler} import org.apache.spark.{ErrorMessageFormat, SparkConf, SparkThrowable, SparkThrowableHelper} import org.apache.spark.deploy.SparkHadoopUtil import org.apache.spark.internal.{Logging, MDC} -import org.apache.spark.internal.LogKey.ERROR +import org.apache.spark.internal.LogKey._ import org.apache.spark.sql.AnalysisException import org.apache.spark.sql.catalyst.analysis.FunctionRegistry import org.apache.spark.sql.catalyst.util.SQLKeywordUtils @@ -232,14 +232,14 @@ private[hive] object SparkSQLCLIDriver extends Logging { val historyFile = historyDirectory + File.separator + ".hivehistory" reader.setHistory(new FileHistory(new File(historyFile))) } else { -logWarning("WARNING: Directory for Hive history file: "
(spark) branch master updated (9a1fc112677f -> 6919febfcc87)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9a1fc112677f [SPARK-47871][SQL] Oracle: Map TimestampType to TIMESTAMP WITH LOCAL TIME ZONE add 6919febfcc87 [SPARK-47594] Connector module: Migrate logInfo with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 36 +- .../org/apache/spark/sql/avro/AvroUtils.scala | 7 +++-- .../execution/ExecuteGrpcResponseSender.scala | 33 +++- .../execution/ExecuteResponseObserver.scala| 19 +++- .../sql/connect/planner/SparkConnectPlanner.scala | 7 +++-- .../planner/StreamingForeachBatchHelper.scala | 20 .../planner/StreamingQueryListenerHelper.scala | 7 +++-- .../sql/connect/service/LoggingInterceptor.scala | 9 -- .../spark/sql/connect/service/SessionHolder.scala | 15 ++--- .../service/SparkConnectExecutionManager.scala | 17 ++ .../sql/connect/service/SparkConnectServer.scala | 7 +++-- .../sql/connect/service/SparkConnectService.scala | 5 +-- .../service/SparkConnectSessionManager.scala | 11 +-- .../service/SparkConnectStreamingQueryCache.scala | 26 +++- .../spark/sql/connect/utils/ErrorUtils.scala | 4 +-- .../sql/kafka010/KafkaBatchPartitionReader.scala | 14 ++--- .../spark/sql/kafka010/KafkaContinuousStream.scala | 4 +-- .../spark/sql/kafka010/KafkaMicroBatchStream.scala | 4 +-- .../sql/kafka010/KafkaOffsetReaderAdmin.scala | 4 +-- .../sql/kafka010/KafkaOffsetReaderConsumer.scala | 4 +-- .../apache/spark/sql/kafka010/KafkaRelation.scala | 7 +++-- .../org/apache/spark/sql/kafka010/KafkaSink.scala | 5 +-- .../apache/spark/sql/kafka010/KafkaSource.scala| 11 --- .../apache/spark/sql/kafka010/KafkaSourceRDD.scala | 6 ++-- .../sql/kafka010/consumer/KafkaDataConsumer.scala | 13 +--- .../kafka010/producer/CachedKafkaProducer.scala| 5 +-- .../apache/spark/sql/kafka010/KafkaTestUtils.scala | 10 +++--- .../kafka010/DirectKafkaInputDStream.scala | 9 -- .../streaming/kafka010/KafkaDataConsumer.scala | 18 ++- .../apache/spark/streaming/kafka010/KafkaRDD.scala | 12 +--- .../spark/streaming/kinesis/KinesisReceiver.scala | 7 +++-- .../streaming/kinesis/KinesisRecordProcessor.scala | 12 +--- .../executor/profiler/ExecutorJVMProfiler.scala| 5 +-- .../executor/profiler/ExecutorProfilerPlugin.scala | 6 ++-- .../scala/org/apache/spark/deploy/Client.scala | 6 ++-- .../spark/deploy/yarn/ApplicationMaster.scala | 4 +-- .../scheduler/cluster/YarnSchedulerBackend.scala | 6 ++-- 37 files changed, 257 insertions(+), 138 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47804] Add Dataframe cache debug log
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f10ad3d56896 [SPARK-47804] Add Dataframe cache debug log f10ad3d56896 is described below commit f10ad3d56896f8a0eb9b0c73a6ee628cfc7df3a2 Author: Xinyi Yu AuthorDate: Mon Apr 15 18:37:36 2024 -0700 [SPARK-47804] Add Dataframe cache debug log ### What changes were proposed in this pull request? This PR adds a debug log for Dataframe cache that uses SQL conf to turn on. It logs necessary information on * cache hit during cache application (the application happens basically on every query) * cache miss * adding new cache entries * removing cache entries (including clear all entries) Because every query applies cache, this log could be huge and should be only turned on during some debugging process, and should not enabled by default in production. Example: ``` spark.conf.set("spark.sql.dataframeCache.logLevel", "warn") val df = spark.range(1, 10) df.collect() {"ts":"2024-04-10T16:41:10.010-0700","level":"WARN","msg":"Dataframe cache miss for input plan:\nRange (1, 10, step=1, splits=Some(10))\n","logger":"org.apache.spark.sql.execution.CacheManager"} {"ts":"2024-04-10T16:41:10.010-0700","level":"WARN","msg":"Last 20 Dataframe cache entry logical plans:\n[]","logger":"org.apache.spark.sql.execution.CacheManager"} df.cache() {"ts":"2024-04-10T16:42:18.647-0700","level":"WARN","msg":"Dataframe cache miss for input plan:\nRange (1, 10, step=1, splits=Some(10))\n","logger":"org.apache.spark.sql.execution.CacheManager"} {"ts":"2024-04-10T16:42:18.647-0700","level":"WARN","msg":"Last 20 Dataframe cache entry logical plans:\n[]","logger":"org.apache.spark.sql.execution.CacheManager"} {"ts":"2024-04-10T16:42:18.662-0700","level":"WARN","msg":"Added Dataframe cache entry:\nCachedData(\nlogicalPlan=Range (1, 10, step=1, splits=Some(10))\n\nInMemoryRelation=InMemoryRelation [id#2L], StorageLevel(disk, memory, deserialized, 1 replicas)\n +- *(1) Range (1, 10, step=1, splits=10)\n)\n","logger":"org.apache.spark.sql.execution.CacheManager"} df.count() {"ts":"2024-04-10T16:43:36.033-0700","level":"WARN","msg":"Dataframe cache hit for input plan:\nRange (1, 10, step=1, splits=Some(10))\nmatched with cache entry:\nCachedData(\nlogicalPlan=Range (1, 10, step=1, splits=Some(10))\n\nInMemoryRelation=InMemoryRelation [id#2L], StorageLevel(disk, memory, deserialized, 1 replicas)\n +- *(1) Range (1, 10, step=1, splits=10)\n)\n","logger":"org.apache.spark.sql.execution.CacheManager"} {"ts":"2024-04-10T16:43:36.041-0700","level":"WARN","msg":"Dataframe cache hit plan change summary:\n Aggregate [count(1) AS count#13L] Aggregate [count(1) AS count#13L]\n!+- Range (1, 10, step=1, splits=Some(10)) +- InMemoryRelation [id#2L], StorageLevel(disk, memory, deserialized, 1 replicas)\n! +- *(1) Range (1, 10, step=1, splits=10)","logger":"org.apache.spark.sql.execution.CacheManager"} df.unpersist() {"ts":"2024-04-10T16:44:15.965-0700","level":"WARN","msg":"Removed 1 Dataframe cache entries, with logical plans being \n[Range (1, 10, step=1, splits=Some(10))\n]","logger":"org.apache.spark.sql.execution.CacheManager"} ``` ### Why are the changes needed? Easier debugging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Run local spark shell. ### Was this patch authored or co-authored using generative AI tooling? No. 
Closes #45990 from anchovYu/SPARK-47804. Authored-by: Xinyi Yu Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 2 + .../scala/org/apache/spark/internal/Logging.scala | 32 common/utils/src/test/resources/log4j2.properties | 4 +- .../apache/spark/util/StructuredLoggingSuite.scala | 20 ++-- .../org/apache/spark/sql/internal/SQLConf.scala| 16 ++ .../apache/spark/sql/execution/CacheManager.scala | 59 -- 6 fil
(spark) branch master updated (83fe9b16ab5a -> 61264f77fd68)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 83fe9b16ab5a [SPARK-47694][CONNECT] Make max message size configurable on the client side add 61264f77fd68 [SPARK-47603][KUBERNETES][YARN] Resource managers: Migrate logWarn with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 10 +++- .../apache/spark/deploy/k8s/KubernetesConf.scala | 11 +++-- .../apache/spark/deploy/k8s/KubernetesUtils.scala | 7 +-- .../k8s/features/DriverCommandFeatureStep.scala| 12 +++-- .../deploy/k8s/submit/KubernetesClientUtils.scala | 12 +++-- .../cluster/k8s/ExecutorPodsAllocator.scala| 11 +++-- .../cluster/k8s/ExecutorPodsSnapshot.scala | 8 +-- .../scheduler/cluster/k8s/ExecutorRollPlugin.scala | 11 +++-- .../spark/deploy/yarn/ApplicationMaster.scala | 7 +-- .../org/apache/spark/deploy/yarn/Client.scala | 27 +- .../spark/deploy/yarn/ResourceRequestHelper.scala | 9 ++-- .../apache/spark/deploy/yarn/YarnAllocator.scala | 57 -- .../cluster/YarnClientSchedulerBackend.scala | 4 +- .../scheduler/cluster/YarnSchedulerBackend.scala | 15 +++--- 14 files changed, 118 insertions(+), 83 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (9ffdbc65029a -> 1ee3496f4836)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9ffdbc65029a [SPARK-47784][SS] Merge TTLMode and TimeoutMode into a single TimeMode add 1ee3496f4836 [SPARK-47792][CORE] Make the value of MDC can support `null` & cannot be `MessageWithContext` No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/Logging.scala | 10 +--- .../scala/org/apache/spark/util/MDCSuite.scala | 15 +++ .../apache/spark/util/PatternLoggingSuite.scala| 3 +++ .../apache/spark/util/StructuredLoggingSuite.scala | 30 ++ 4 files changed, 55 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
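Per the commit title above, an MDC value may now be null (previously this was not supported), while a MessageWithContext is rejected as an MDC value. A small sketch of the null case, with the key, class, and message chosen only for illustration:
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.EXIT_CODE

class ExampleMonitor extends Logging {
  def reportExit(exitCode: java.lang.Integer): Unit = {
    // `exitCode` may legitimately be null here; after SPARK-47792 a null MDC
    // value is accepted rather than failing.
    logWarning(log"Process exited with code ${MDC(EXIT_CODE, exitCode)}.")
  }
}
```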
(spark) branch master updated: [SPARK-47587][SQL] Hive module: Migrate logWarn with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new eaf6b518f67c [SPARK-47587][SQL] Hive module: Migrate logWarn with variables to structured logging framework eaf6b518f67c is described below commit eaf6b518f67c0e3ed04f264c3a89573bd7e74fe7 Author: panbingkun AuthorDate: Wed Apr 10 22:34:14 2024 -0700 [SPARK-47587][SQL] Hive module: Migrate logWarn with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logWarning` in module `Hive` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45927 from panbingkun/SPARK-47587. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 9 +++ .../security/HBaseDelegationTokenProvider.scala| 7 ++--- .../main/scala/org/apache/spark/util/Utils.scala | 10 .../spark/sql/hive/HiveExternalCatalog.scala | 28 ++-- .../spark/sql/hive/HiveMetastoreCatalog.scala | 30 ++ .../org/apache/spark/sql/hive/HiveUtils.scala | 5 ++-- .../spark/sql/hive/client/HiveClientImpl.scala | 8 +++--- .../apache/spark/sql/hive/client/HiveShim.scala| 23 + .../sql/hive/client/IsolatedClientLoader.scala | 13 ++ .../spark/sql/hive/execution/HiveFileFormat.scala | 11 .../spark/sql/hive/execution/HiveTempPath.scala| 5 ++-- .../spark/sql/hive/orc/OrcFileOperator.scala | 5 ++-- .../security/HiveDelegationTokenProvider.scala | 8 +++--- 13 files changed, 97 insertions(+), 65 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index a9a79de05c27..28b06f448784 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -44,6 +44,7 @@ object LogKey extends Enumeration { val COMPONENT = Value val CONFIG = Value val CONFIG2 = Value + val CONFIG3 = Value val CONTAINER = Value val CONTAINER_ID = Value val COUNT = Value @@ -58,6 +59,7 @@ object LogKey extends Enumeration { val DRIVER_ID = Value val DROPPED_PARTITIONS = Value val END_POINT = Value + val ENGINE = Value val ERROR = Value val EVENT_LOOP = Value val EVENT_QUEUE = Value @@ -66,14 +68,19 @@ object LogKey extends Enumeration { val EXIT_CODE = Value val EXPRESSION_TERMS = Value val FAILURES = Value + val FALLBACK_VERSION = Value val FIELD_NAME = Value + val FILE_FORMAT = Value + val FILE_FORMAT2 = Value val FUNCTION_NAME = Value val FUNCTION_PARAMETER = Value val GROUP_ID = Value + val HADOOP_VERSION = Value val HIVE_OPERATION_STATE = Value val HIVE_OPERATION_TYPE = Value val HOST = Value val INDEX = Value + val INFERENCE_MODE = Value val JOB_ID = Value val JOIN_CONDITION = Value val JOIN_CONDITION_SUB_EXPRESSION = Value @@ -132,6 +139,8 @@ object LogKey extends Enumeration { val RULE_BATCH_NAME = Value val RULE_NAME = Value val RULE_NUMBER_OF_RUNS = Value + val SCHEMA = Value + val SCHEMA2 = Value val SERVICE_NAME = Value val SESSION_ID = Value val SHARD_ID = Value diff --git 
a/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala b/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala index d60e5975071d..1b2e41bc0a2e 100644 --- a/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala +++ b/core/src/main/scala/org/apache/spark/deploy/security/HBaseDelegationTokenProvider.scala @@ -27,7 +27,8 @@ import org.apache.hadoop.security.Credentials import org.apache.hadoop.security.token.{Token, TokenIdentifier} import org.apache.spark.SparkConf -import org.apache.spark.internal.Logging +import org.apache.spark.internal.{Logging, MDC} +import org.apache.spark.internal.LogKey.SERVICE_NAME import org.apache.spark.security.HadoopDelegationTokenProvider import org.apache.spark.util.Utils @@ -53,8 +54,8 @@ private[security] class HBaseDelegationTokenProvider creds.addToken(token.getService, token) } catch { case NonFatal(e) => -logWarning(Utils.createFailedToGetTokenMessage(serviceName, e) + - s" Retrying to fetch HBase security token with $serviceName connection p
(spark) branch master updated (3da52fb4490e -> 75d43dd05757)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 3da52fb4490e [SPARK-47798][SQL] Enrich the error message for the reading failures of decimal values add 75d43dd05757 [SPARK-47601][GRAPHX] Graphx: Migrate logs with variables to structured logging framework No new revisions were added by this update. Summary of changes: graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala | 8 +--- graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala | 5 +++-- graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala | 9 + 3 files changed, 13 insertions(+), 9 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47595][STREAMING] Streaming: Migrate logError with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 52fd8c01cc8b [SPARK-47595][STREAMING] Streaming: Migrate logError with variables to structured logging framework 52fd8c01cc8b is described below commit 52fd8c01cc8b2a6ce1db3e059b0b962d258f4342 Author: panbingkun AuthorDate: Wed Apr 10 15:21:13 2024 -0700 [SPARK-47595][STREAMING] Streaming: Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logError` in module `Streaming` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45910 from panbingkun/SPARK-47595. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../utils/src/main/scala/org/apache/spark/internal/LogKey.scala | 2 ++ .../org/apache/spark/streaming/dstream/FileInputDStream.scala | 8 +--- .../org/apache/spark/streaming/receiver/ReceiverSupervisor.scala | 8 +--- .../apache/spark/streaming/scheduler/ReceivedBlockTracker.scala | 6 -- .../org/apache/spark/streaming/scheduler/ReceiverTracker.scala| 6 -- .../org/apache/spark/streaming/util/FileBasedWriteAheadLog.scala | 5 +++-- .../test/scala/org/apache/spark/streaming/MasterFailureTest.scala | 5 +++-- 7 files changed, 26 insertions(+), 14 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 6cdec011e2ae..a9a79de05c27 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -121,6 +121,7 @@ object LogKey extends Enumeration { val RANGE = Value val RDD_ID = Value val REASON = Value + val RECEIVED_BLOCK_INFO = Value val REDUCE_ID = Value val RELATION_NAME = Value val REMAINING_PARTITIONS = Value @@ -143,6 +144,7 @@ object LogKey extends Enumeration { val STAGE_ID = Value val STATEMENT_ID = Value val STATUS = Value + val STREAM_ID = Value val STREAM_NAME = Value val SUBMISSION_ID = Value val SUBSAMPLING_RATE = Value diff --git a/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala b/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala index 414fdf5d619d..e301311c922a 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala @@ -26,6 +26,8 @@ import org.apache.hadoop.conf.Configuration import org.apache.hadoop.fs.{FileStatus, FileSystem, Path} import org.apache.hadoop.mapreduce.{InputFormat => NewInputFormat} +import org.apache.spark.internal.LogKey.PATH +import org.apache.spark.internal.MDC import org.apache.spark.rdd.{RDD, UnionRDD} import org.apache.spark.streaming._ import org.apache.spark.streaming.scheduler.StreamInputInfo @@ -288,9 +290,9 @@ class FileInputDStream[K, V, F <: NewInputFormat[K, V]]( case None => context.sparkContext.newAPIHadoopFile[K, V, F](file) } if (rdd.partitions.isEmpty) { -logError("File " + file + " has no data in it. 
Spark Streaming can only ingest " + - "files that have been \"moved\" to the directory assigned to the file stream. " + - "Refer to the streaming programming guide for more details.") +logError(log"File ${MDC(PATH, file)} has no data in it. Spark Streaming can only ingest " + + log"""files that have been "moved" to the directory assigned to the file stream. """ + + log"Refer to the streaming programming guide for more details.") } rdd } diff --git a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala index 672452a4af4f..15f346484864 100644 --- a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala +++ b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala @@ -25,7 +25,8 @@ import scala.concurrent._ import scala.util.control.NonFatal import org.apache.spark.SparkConf -import org.apache.spark.internal.Logging +import org.apache.spark.internal.{Logging,
(spark) branch master updated: [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 520f3b1c192b [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework 520f3b1c192b is described below commit 520f3b1c192b1bae53509fdad770f5711ca3791f Author: panbingkun AuthorDate: Tue Apr 9 21:42:39 2024 -0700 [SPARK-47593][CORE] Connector module: Migrate logWarn with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logWarning` in module `Connector` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45879 from panbingkun/SPARK-47593_warning. Lead-authored-by: panbingkun Co-authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 23 + .../scala/org/apache/spark/util/MDCSuite.scala | 15 +++- .../org/apache/spark/sql/avro/AvroUtils.scala | 9 ++--- .../ExecutePlanResponseReattachableIterator.scala | 2 +- .../sql/connect/client/GrpcRetryHandler.scala | 18 +- .../execution/ExecuteGrpcResponseSender.scala | 15 +--- .../service/SparkConnectStreamingQueryCache.scala | 14 +--- .../connect/ui/SparkConnectServerListener.scala| 36 +-- .../sql/jdbc/DockerJDBCIntegrationSuite.scala | 7 ++-- .../spark/sql/kafka010/KafkaContinuousStream.scala | 5 +-- .../spark/sql/kafka010/KafkaMicroBatchStream.scala | 5 +-- .../sql/kafka010/KafkaOffsetReaderAdmin.scala | 10 +++--- .../sql/kafka010/KafkaOffsetReaderConsumer.scala | 10 +++--- .../apache/spark/sql/kafka010/KafkaSource.scala| 5 +-- .../sql/kafka010/consumer/FetchedDataPool.scala| 7 ++-- .../sql/kafka010/consumer/KafkaDataConsumer.scala | 40 +- .../producer/InternalKafkaProducerPool.scala | 6 ++-- .../apache/spark/sql/kafka010/KafkaTestUtils.scala | 9 ++--- .../kafka010/KafkaDelegationTokenProvider.scala| 10 +++--- .../streaming/kafka010/ConsumerStrategy.scala | 11 +++--- .../streaming/kafka010/KafkaDataConsumer.scala | 7 ++-- .../spark/streaming/kafka010/KafkaUtils.scala | 16 + .../streaming/kinesis/KinesisBackedBlockRDD.scala | 6 ++-- .../streaming/kinesis/KinesisRecordProcessor.scala | 2 +- .../spark/streaming/kinesis/KinesisTestUtils.scala | 7 ++-- 25 files changed, 195 insertions(+), 100 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 2cb5eac4548c..6cdec011e2ae 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -34,6 +34,7 @@ object LogKey extends Enumeration { val CATEGORICAL_FEATURES = Value val CLASS_LOADER = Value val CLASS_NAME = Value + val CLUSTER_ID = Value val COLUMN_DATA_TYPE_SOURCE = Value val COLUMN_DATA_TYPE_TARGET = Value val COLUMN_DEFAULT_VALUE = Value @@ -43,6 +44,7 @@ object LogKey extends Enumeration { val COMPONENT = Value val CONFIG = Value val CONFIG2 = Value + val CONTAINER = Value val CONTAINER_ID = Value val COUNT = Value val CSV_HEADER_COLUMN_NAME = Value @@ -51,6 +53,7 @@ object LogKey extends Enumeration { val 
CSV_SCHEMA_FIELD_NAME = Value val CSV_SCHEMA_FIELD_NAMES = Value val CSV_SOURCE = Value + val DATA = Value val DATABASE_NAME = Value val DRIVER_ID = Value val DROPPED_PARTITIONS = Value @@ -70,9 +73,11 @@ object LogKey extends Enumeration { val HIVE_OPERATION_STATE = Value val HIVE_OPERATION_TYPE = Value val HOST = Value + val INDEX = Value val JOB_ID = Value val JOIN_CONDITION = Value val JOIN_CONDITION_SUB_EXPRESSION = Value + val KEY = Value val LEARNING_RATE = Value val LINE = Value val LINE_NUM = Value @@ -80,17 +85,23 @@ object LogKey extends Enumeration { val LOG_TYPE = Value val MASTER_URL = Value val MAX_ATTEMPTS = Value + val MAX_CAPACITY = Value val MAX_CATEGORIES = Value val MAX_EXECUTOR_FAILURES = Value val MAX_SIZE = Value val MERGE_DIR_NAME = Value val METHOD_NAME = Value val MIN_SIZE = Value + val NEW_VALUE = Value val NUM_COLUMNS = Value val NUM_ITERATIONS = Value val OBJECT_ID = Value + val OFFSET = Value + val OFFSETS = Value val OLD_BLOCK_MANAGER_ID = Value + val OLD_VALUE = Value
(spark) branch master updated: [SPARK-47586][SQL] Hive module: Migrate logError with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ec509b49dcaa [SPARK-47586][SQL] Hive module: Migrate logError with variables to structured logging framework ec509b49dcaa is described below commit ec509b49dcaa21d6dcdf18c1b40ac9d6df1827d7 Author: Haejoon Lee AuthorDate: Tue Apr 9 18:22:40 2024 -0700 [SPARK-47586][SQL] Hive module: Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? This PR proposes to migrate `logError` with variables of Hive module to structured logging framework. ### Why are the changes needed? To improve the existing logging system by migrating into structured logging. ### Does this PR introduce _any_ user-facing change? No API changes, but the SQL catalyst logs will contain MDC(Mapped Diagnostic Context) from now. ### How was this patch tested? Run Scala auto formatting and style check. Also the existing CI should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45876 from itholic/hive-logerror. Lead-authored-by: Haejoon Lee Co-authored-by: Haejoon Lee <44108233+itho...@users.noreply.github.com> Signed-off-by: Gengliang Wang --- .../main/scala/org/apache/spark/internal/LogKey.scala | 6 ++ .../scala/org/apache/spark/sql/hive/TableReader.scala | 8 ++-- .../apache/spark/sql/hive/client/HiveClientImpl.scala | 17 ++--- 3 files changed, 22 insertions(+), 9 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 7fa0331515cb..2cb5eac4548c 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -51,7 +51,9 @@ object LogKey extends Enumeration { val CSV_SCHEMA_FIELD_NAME = Value val CSV_SCHEMA_FIELD_NAMES = Value val CSV_SOURCE = Value + val DATABASE_NAME = Value val DRIVER_ID = Value + val DROPPED_PARTITIONS = Value val END_POINT = Value val ERROR = Value val EVENT_LOOP = Value @@ -61,6 +63,7 @@ object LogKey extends Enumeration { val EXIT_CODE = Value val EXPRESSION_TERMS = Value val FAILURES = Value + val FIELD_NAME = Value val FUNCTION_NAME = Value val FUNCTION_PARAMETER = Value val GROUP_ID = Value @@ -92,6 +95,7 @@ object LogKey extends Enumeration { val PARSE_MODE = Value val PARTITION_ID = Value val PARTITION_SPECIFICATION = Value + val PARTITION_SPECS = Value val PATH = Value val PATHS = Value val POD_ID = Value @@ -105,6 +109,7 @@ object LogKey extends Enumeration { val REASON = Value val REDUCE_ID = Value val RELATION_NAME = Value + val REMAINING_PARTITIONS = Value val REMOTE_ADDRESS = Value val RETRY_COUNT = Value val RETRY_INTERVAL = Value @@ -124,6 +129,7 @@ object LogKey extends Enumeration { val STATEMENT_ID = Value val SUBMISSION_ID = Value val SUBSAMPLING_RATE = Value + val TABLE_NAME = Value val TASK_ATTEMPT_ID = Value val TASK_ID = Value val TASK_NAME = Value diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala index d72406f094a6..60970eecc2df 100644 --- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala +++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala @@ -36,7 +36,8 @@ import 
org.apache.hadoop.mapred.{FileInputFormat, InputFormat => oldInputClass, import org.apache.hadoop.mapreduce.{InputFormat => newInputClass} import org.apache.spark.deploy.SparkHadoopUtil -import org.apache.spark.internal.Logging +import org.apache.spark.internal.{Logging, MDC} +import org.apache.spark.internal.LogKey._ import org.apache.spark.rdd.{EmptyRDD, HadoopRDD, NewHadoopRDD, RDD, UnionRDD} import org.apache.spark.sql.SparkSession import org.apache.spark.sql.catalyst.{InternalRow, SQLConfHelper} @@ -518,7 +519,10 @@ private[hive] object HadoopTableReader extends HiveInspectors with Logging { i += 1 } catch { case ex: Throwable => -logError(s"Exception thrown in field <${fieldRefs(i).getFieldName}>") +logError( + log"Exception thrown in field <${MDC(FIELD_NAME, fieldRefs(i).getFieldName)}>", + ex +) throw ex } } diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala index 46dc56372334..92561be
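A minimal sketch of the pattern this commit applies across the Hive module. The wrapper class and method are hypothetical and only illustrate the call shape; `Logging`, `MDC`, the `log""` interpolator, and the `FIELD_NAME` key are taken from the diff above.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.FIELD_NAME

// Hypothetical caller, shown only to illustrate the migrated call shape.
class FieldAccessLogger extends Logging {
  def reportFailure(fieldName: String, ex: Throwable): Unit = {
    // Before: logError(s"Exception thrown in field <$fieldName>")
    // After: the variable is wrapped in an MDC, so structured output carries it
    // under the key "field_name" in addition to the rendered message text.
    logError(log"Exception thrown in field <${MDC(FIELD_NAME, fieldName)}>", ex)
  }
}
```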
(spark) branch master updated (07b1346db477 -> a68337892246)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 07b1346db477 [SPARK-47581][CORE][FOLLOWUP] Fix GA failure add a68337892246 [SPARK-47783] Add some missing SQLSTATEs and clean up the YY000 to use… No new revisions were added by this update. Summary of changes: .../src/main/resources/error/error-categories.json | 2 +- .../src/main/resources/error/error-states.json | 356 + 2 files changed, 219 insertions(+), 139 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (2793397140af -> 07b1346db477)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 2793397140af [SPARK-47782][BUILD] Remove redundant json4s-jackson definition in sql/api POM add 07b1346db477 [SPARK-47581][CORE][FOLLOWUP] Fix GA failure No new revisions were added by this update. Summary of changes: .../org/apache/spark/sql/catalyst/csv/CSVHeaderChecker.scala | 10 ++ 1 file changed, 6 insertions(+), 4 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 149ac0f8893b [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework 149ac0f8893b is described below commit 149ac0f8893b5be8b8b0556ef47a2384aaad1850 Author: Daniel Tenedorio AuthorDate: Mon Apr 8 22:56:10 2024 -0700 [SPARK-47581][CORE] SQL catalyst: Migrate logWarning with variables to structured logging framework ### What changes were proposed in this pull request? Migrate logWarning with variables of the Catalyst module to structured logging framework. This transforms the logWarning entries of the following API ``` def logWarning(msg: => String): Unit ``` to ``` def logWarning(entry: LogEntry): Unit ``` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? Yes, Spark core logs will contain additional MDC ### How was this patch tested? Compiler and scala style checks, as well as code review. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45904 from dtenedor/log-warn-catalyst. Lead-authored-by: Daniel Tenedorio Co-authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 26 ++ .../sql/catalyst/analysis/FunctionRegistry.scala | 6 +++-- .../sql/catalyst/analysis/HintErrorLogger.scala| 19 ++-- .../catalyst/analysis/StreamingJoinHelper.scala| 19 ++-- .../analysis/UnsupportedOperationChecker.scala | 6 +++-- .../spark/sql/catalyst/catalog/interface.scala | 6 +++-- .../spark/sql/catalyst/csv/CSVHeaderChecker.scala | 25 - .../catalyst/expressions/V2ExpressionUtils.scala | 10 + .../spark/sql/catalyst/optimizer/Optimizer.scala | 4 ++-- .../ReplaceNullWithFalseInPredicate.scala | 7 -- .../spark/sql/catalyst/optimizer/joins.scala | 7 -- .../spark/sql/catalyst/parser/AstBuilder.scala | 8 --- .../spark/sql/catalyst/rules/RuleExecutor.scala| 14 +++- .../spark/sql/catalyst/util/CharVarcharUtils.scala | 11 - .../apache/spark/sql/catalyst/util/ParseMode.scala | 9 +--- .../catalyst/util/ResolveDefaultColumnsUtil.scala | 10 ++--- .../spark/sql/catalyst/util/StringUtils.scala | 11 + 17 files changed, 133 insertions(+), 65 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index a0e99f1edc34..7fa0331515cb 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -22,6 +22,7 @@ package org.apache.spark.internal */ object LogKey extends Enumeration { val ACCUMULATOR_ID = Value + val ANALYSIS_ERROR = Value val APP_DESC = Value val APP_ID = Value val APP_STATE = Value @@ -33,6 +34,10 @@ object LogKey extends Enumeration { val CATEGORICAL_FEATURES = Value val CLASS_LOADER = Value val CLASS_NAME = Value + val COLUMN_DATA_TYPE_SOURCE = Value + val COLUMN_DATA_TYPE_TARGET = Value + val COLUMN_DEFAULT_VALUE = Value + val COLUMN_NAME = Value val COMMAND = Value val COMMAND_OUTPUT = Value val COMPONENT = Value @@ -40,6 +45,12 @@ object LogKey extends Enumeration { val CONFIG2 = Value val CONTAINER_ID = Value val COUNT = Value + val CSV_HEADER_COLUMN_NAME = Value + val CSV_HEADER_COLUMN_NAMES = Value + val CSV_HEADER_LENGTH = Value + val 
CSV_SCHEMA_FIELD_NAME = Value + val CSV_SCHEMA_FIELD_NAMES = Value + val CSV_SOURCE = Value val DRIVER_ID = Value val END_POINT = Value val ERROR = Value @@ -48,13 +59,17 @@ object LogKey extends Enumeration { val EXECUTOR_ID = Value val EXECUTOR_STATE = Value val EXIT_CODE = Value + val EXPRESSION_TERMS = Value val FAILURES = Value + val FUNCTION_NAME = Value + val FUNCTION_PARAMETER = Value val GROUP_ID = Value val HIVE_OPERATION_STATE = Value val HIVE_OPERATION_TYPE = Value val HOST = Value val JOB_ID = Value val JOIN_CONDITION = Value + val JOIN_CONDITION_SUB_EXPRESSION = Value val LEARNING_RATE = Value val LINE = Value val LINE_NUM = Value @@ -68,21 +83,28 @@ object LogKey extends Enumeration { val MERGE_DIR_NAME = Value val METHOD_NAME = Value val MIN_SIZE = Value + val NUM_COLUMNS = Value val NUM_ITERATIONS = Value val OBJECT_ID = Value val OLD_BLOCK_MANAGER_ID = Value val OPTIMIZER_CLASS_NAME = Value val OP_TYPE = Value +
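A short sketch of the `logWarning(entry: LogEntry)` form this change migrates to. The class and message are hypothetical; the `COLUMN_NAME` key is one of the entries added to `LogKey` in this diff.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.COLUMN_NAME

// Hypothetical example; the catalyst call sites follow the same shape.
class SchemaWarningExample extends Logging {
  def warnDroppedColumn(name: String): Unit = {
    // With structured logging enabled, "column_name" -> name is emitted as MDC;
    // with it disabled, the message renders as a plain interpolated string.
    logWarning(log"Dropping unresolved column ${MDC(COLUMN_NAME, name)}")
  }
}
```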
(spark) branch master updated: [SPARK-47589][SQL] Hive-Thriftserver: Migrate logError with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new f8e652e88320 [SPARK-47589][SQL] Hive-Thriftserver: Migrate logError with variables to structured logging framework f8e652e88320 is described below commit f8e652e88320528a70e605a6a3cf986725e153a5 Author: Gengliang Wang AuthorDate: Mon Apr 8 17:13:28 2024 -0700 [SPARK-47589][SQL] Hive-Thriftserver: Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? Migrate logError with variables of Hive-thriftserver module to the structured logging framework. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #45936 from gengliangwang/LogError_HiveThriftServer2. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../main/scala/org/apache/spark/internal/LogKey.scala | 3 +++ .../thriftserver/SparkExecuteStatementOperation.scala | 17 +++-- .../spark/sql/hive/thriftserver/SparkOperation.scala| 6 -- .../spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala | 5 +++-- .../spark/sql/hive/thriftserver/SparkSQLDriver.scala| 5 +++-- 5 files changed, 24 insertions(+), 12 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 1144887e0b47..a0e99f1edc34 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -50,6 +50,8 @@ object LogKey extends Enumeration { val EXIT_CODE = Value val FAILURES = Value val GROUP_ID = Value + val HIVE_OPERATION_STATE = Value + val HIVE_OPERATION_TYPE = Value val HOST = Value val JOB_ID = Value val JOIN_CONDITION = Value @@ -96,6 +98,7 @@ object LogKey extends Enumeration { val SIZE = Value val SLEEP_TIME = Value val STAGE_ID = Value + val STATEMENT_ID = Value val SUBMISSION_ID = Value val SUBSAMPLING_RATE = Value val TASK_ATTEMPT_ID = Value diff --git a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala index 77b2aa131a24..628925007f7e 100644 --- a/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala +++ b/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala @@ -30,9 +30,11 @@ import org.apache.hive.service.cli.operation.ExecuteStatementOperation import org.apache.hive.service.cli.session.HiveSession import org.apache.hive.service.rpc.thrift.{TCLIServiceConstants, TColumnDesc, TPrimitiveTypeEntry, TRowSet, TTableSchema, TTypeDesc, TTypeEntry, TTypeId, TTypeQualifiers, TTypeQualifierValue} -import org.apache.spark.internal.Logging +import org.apache.spark.internal.{Logging, MDC} +import org.apache.spark.internal.LogKey.{HIVE_OPERATION_STATE, STATEMENT_ID, TIMEOUT, USER_NAME} import org.apache.spark.sql.{DataFrame, Row, SQLContext} import org.apache.spark.sql.catalyst.util.CharVarcharUtils +import 
org.apache.spark.sql.catalyst.util.DateTimeConstants.MILLIS_PER_SECOND import org.apache.spark.sql.execution.HiveResult.getTimeFormatters import org.apache.spark.sql.internal.{SQLConf, VariableSubstitution} import org.apache.spark.sql.types._ @@ -142,7 +144,9 @@ private[hive] class SparkExecuteStatementOperation( } catch { case NonFatal(e) => setOperationException(new HiveSQLException(e)) - logError(s"Error cancelling the query after timeout: $timeout seconds") + val timeout_ms = timeout * MILLIS_PER_SECOND + logError( +log"Error cancelling the query after timeout: ${MDC(TIMEOUT, timeout_ms)} ms") } finally { timeoutExecutor.shutdown() } @@ -178,8 +182,8 @@ private[hive] class SparkExecuteStatementOperation( } catch { case e: Exception => setOperationException(new HiveSQLException(e)) - logError("Error running hive query as user : " + -sparkServiceUGI.getShortUserName(), e) + logError(log"Error running hive query as user : " + +log"${MDC(
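The diff above also shows longer messages split into several `log""` fragments joined with `+`. A rough sketch of that style, with a hypothetical wrapper class; the `STATEMENT_ID` and `USER_NAME` keys are imported in the diff.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.{STATEMENT_ID, USER_NAME}

// Hypothetical wrapper illustrating fragment concatenation plus an exception.
class StatementErrorExample extends Logging {
  def reportFailure(statementId: String, user: String, e: Throwable): Unit = {
    // Each fragment contributes its MDC entries to the merged structured record.
    logError(log"Error running statement ${MDC(STATEMENT_ID, statementId)} " +
      log"as user ${MDC(USER_NAME, user)}", e)
  }
}
```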
(spark) branch master updated (42dc815b8446 -> 7385f19c7539)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 42dc815b8446 [SPARK-47743][CORE] Use milliseconds as the time unit in logging add 7385f19c7539 [SPARK-47592][CORE] Connector module: Migrate logError with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/spark/internal/LogKey.scala | 6 ++ .../org/apache/spark/sql/connect/utils/ErrorUtils.scala | 7 --- .../spark/sql/connect/ProtoToParsedPlanTestSuite.scala| 4 +++- .../spark/sql/jdbc/DockerJDBCIntegrationSuite.scala | 5 - .../org/apache/spark/streaming/kafka010/KafkaUtils.scala | 6 -- .../spark/streaming/kinesis/KinesisCheckpointer.scala | 8 +--- .../spark/streaming/kinesis/KinesisRecordProcessor.scala | 15 +-- 7 files changed, 35 insertions(+), 16 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (11abc64a731d -> 42dc815b8446)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 11abc64a731d [SPARK-47094][SQL] SPJ : Dynamically rebalance number of buckets when they are not equal add 42dc815b8446 [SPARK-47743][CORE] Use milliseconds as the time unit in logging No new revisions were added by this update. Summary of changes: .../src/main/scala/org/apache/spark/internal/LogKey.scala | 5 +++-- .../utils/src/main/scala/org/apache/spark/internal/README.md | 1 + .../src/main/scala/org/apache/spark/storage/BlockManager.scala | 8 .../spark/sql/catalyst/expressions/codegen/CodeGenerator.scala | 2 +- .../org/apache/spark/sql/catalyst/rules/RuleExecutor.scala | 10 +- 5 files changed, 14 insertions(+), 12 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
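The convention recorded here is that durations are logged in milliseconds. A hedged sketch of a compliant call site; the class, the message, and the reuse of the existing `TIMEOUT` key for a generic duration are assumptions for illustration only.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.TIMEOUT

// Hypothetical example: normalize the duration to milliseconds before logging.
class DurationLoggingExample extends Logging {
  def finished(startNs: Long): Unit = {
    val elapsedMs = (System.nanoTime() - startNs) / 1000000L
    logInfo(log"Operation finished in ${MDC(TIMEOUT, elapsedMs.toString)} ms")
  }
}
```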
(spark) branch master updated (1efbf43160aa -> d1ace24f8fac)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 1efbf43160aa [SPARK-47310][SS] Add micro-benchmark for merge operations for multiple values in value portion of state store add d1ace24f8fac [SPARK-47582][SQL] Migrate Catalyst logInfo with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 9 + .../scala/org/apache/spark/internal/Logging.scala | 2 ++ .../catalyst/analysis/StreamingJoinHelper.scala| 8 +++-- .../expressions/codegen/CodeGenerator.scala| 7 ++-- .../spark/sql/catalyst/optimizer/Optimizer.scala | 9 +++-- .../spark/sql/catalyst/rules/RuleExecutor.scala| 42 +++--- .../spark/sql/catalyst/xml/ValidatorUtil.scala | 5 ++- 7 files changed, 54 insertions(+), 28 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47577][CORE][PART2] Migrate logError with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 18072b5357a5 [SPARK-47577][CORE][PART2] Migrate logError with variables to structured logging framework 18072b5357a5 is described below commit 18072b5357a5fd671829e312ca359fcf34d47c63 Author: Gengliang Wang AuthorDate: Fri Apr 5 14:04:51 2024 -0700 [SPARK-47577][CORE][PART2] Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? Migrate logError with variables of core module to structured logging framework. This is part2 which transforms the logError entries of the following API ``` def logError(msg: => String, throwable: Throwable): Unit ``` to ``` def logError(entry: LogEntry, throwable: Throwable): Unit ``` migration Part1 was in https://github.com/apache/spark/pull/45834 ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? Yes, Spark core logs will contain additional MDC ### How was this patch tested? Compiler and scala style checks, as well as code review. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45890 from gengliangwang/coreError2. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 23 +- .../scala/org/apache/spark/ContextCleaner.scala| 19 +++--- .../scala/org/apache/spark/MapOutputTracker.scala | 2 +- .../main/scala/org/apache/spark/SparkContext.scala | 15 +- .../scala/org/apache/spark/TaskContextImpl.scala | 5 +++-- .../org/apache/spark/api/r/RBackendHandler.scala | 7 --- .../scala/org/apache/spark/deploy/Client.scala | 4 ++-- .../spark/deploy/StandaloneResourceUtils.scala | 8 +--- .../main/scala/org/apache/spark/deploy/Utils.scala | 6 -- .../spark/deploy/history/FsHistoryProvider.scala | 7 --- .../org/apache/spark/deploy/worker/Worker.scala| 18 ++--- .../apache/spark/deploy/worker/ui/LogPage.scala| 6 -- .../scala/org/apache/spark/executor/Executor.scala | 11 ++- .../spark/executor/ExecutorClassLoader.scala | 5 +++-- .../spark/internal/io/SparkHadoopWriter.scala | 4 ++-- .../spark/mapred/SparkHadoopMapRedUtil.scala | 7 +-- .../org/apache/spark/metrics/MetricsConfig.scala | 5 +++-- .../org/apache/spark/metrics/MetricsSystem.scala | 3 ++- .../main/scala/org/apache/spark/rdd/PipedRDD.scala | 7 --- .../scala/org/apache/spark/rpc/netty/Inbox.scala | 6 -- .../org/apache/spark/rpc/netty/MessageLoop.scala | 5 +++-- .../org/apache/spark/scheduler/DAGScheduler.scala | 10 ++ .../apache/spark/scheduler/ReplayListenerBus.scala | 4 ++-- .../spark/scheduler/SchedulableBuilder.scala | 14 - .../apache/spark/scheduler/TaskSetManager.scala| 4 ++-- .../apache/spark/serializer/KryoSerializer.scala | 5 +++-- .../org/apache/spark/status/AppStatusStore.scala | 5 +++-- .../org/apache/spark/storage/BlockManager.scala| 7 --- .../spark/storage/BlockManagerDecommissioner.scala | 10 ++ .../spark/storage/BlockManagerMasterEndpoint.scala | 5 +++-- .../storage/BlockManagerStorageEndpoint.scala | 21 +++- .../apache/spark/storage/DiskBlockManager.scala| 11 +++ .../spark/storage/DiskBlockObjectWriter.scala | 3 ++- .../spark/storage/PushBasedFetchHelper.scala | 15 -- .../storage/ShuffleBlockFetcherIterator.scala | 5 +++-- .../scala/org/apache/spark/ui/DriverLogPage.scala | 6 -- 
.../main/scala/org/apache/spark/ui/SparkUI.scala | 5 +++-- .../src/main/scala/org/apache/spark/ui/WebUI.scala | 5 +++-- .../scala/org/apache/spark/util/EventLoop.scala| 7 --- .../scala/org/apache/spark/util/ListenerBus.scala | 8 +--- .../apache/spark/util/ShutdownHookManager.scala| 6 -- .../spark/util/SparkUncaughtExceptionHandler.scala | 20 +-- .../main/scala/org/apache/spark/util/Utils.scala | 22 + .../apache/spark/util/logging/FileAppender.scala | 5 +++-- .../spark/util/logging/RollingFileAppender.scala | 8 +--- 45 files changed, 244 insertions(+), 140 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 66f3b803c0d4..1d8161282c5b 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/inter
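A compact sketch of the two-argument overload this part migrates to, `logError(entry: LogEntry, throwable: Throwable)`. The class is hypothetical; `TASK_NAME` is one of the keys already present in `LogKey`.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.TASK_NAME

// Hypothetical example mirroring the pattern used across the core module.
class TaskFailureExample extends Logging {
  def failed(taskName: String, t: Throwable): Unit = {
    // The throwable keeps its stack trace; the task name also lands in MDC.
    logError(log"Exception in task ${MDC(TASK_NAME, taskName)}", t)
  }
}
```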
(spark) branch master updated: [SPARK-47719][SQL] Change spark.sql.legacy.timeParserPolicy default to CORRECTED
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c34baebb36d4 [SPARK-47719][SQL] Change spark.sql.legacy.timeParserPolicy default to CORRECTED c34baebb36d4 is described below commit c34baebb36d4e4c8895085b3114da8dc07165469 Author: Serge Rielau AuthorDate: Fri Apr 5 11:35:38 2024 -0700 [SPARK-47719][SQL] Change spark.sql.legacy.timeParserPolicy default to CORRECTED ### What changes were proposed in this pull request? We changed the time parser policy in Spark 3.0.0. The config has since defaulted to raise an exception if there is a potential conflict between teh legacy and the new policy. Spark 4.0.0 is a good time to default to the new policy ### Why are the changes needed? Move the product forward and retire legacy behavior over time. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Run existing unit tests and verify changes. ### Was this patch authored or co-authored using generative AI tooling? No Closes #45859 from srielau/SPARK-47719-parser-policy-default-to-corrected. Lead-authored-by: Serge Rielau Co-authored-by: Wenchen Fan Signed-off-by: Gengliang Wang --- .../org/apache/spark/sql/ClientE2ETestSuite.scala | 4 +- docs/sql-migration-guide.md| 2 + .../sql/tests/connect/test_connect_session.py | 1 + .../org/apache/spark/sql/internal/SqlApiConf.scala | 2 +- .../org/apache/spark/sql/internal/SQLConf.scala| 6 +- .../sql/catalyst/util/DateFormatterSuite.scala | 2 +- .../sql/catalyst/util/DatetimeFormatterSuite.scala | 59 +++ .../catalyst/util/TimestampFormatterSuite.scala| 36 +- .../results/ansi/datetime-parsing-invalid.sql.out | 72 +-- .../results/datetime-parsing-invalid.sql.out | 84 -- .../sql-tests/results/json-functions.sql.out | 24 ++- .../sql-tests/results/xml-functions.sql.out| 24 ++- 12 files changed, 115 insertions(+), 201 deletions(-) diff --git a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala index f2f1571452c0..95ee69d2a47d 100644 --- a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala +++ b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/ClientE2ETestSuite.scala @@ -74,7 +74,9 @@ class ClientE2ETestSuite extends RemoteSparkSession with SQLHelper with PrivateM for (enrichErrorEnabled <- Seq(false, true)) { test(s"cause exception - ${enrichErrorEnabled}") { - withSQLConf("spark.sql.connect.enrichError.enabled" -> enrichErrorEnabled.toString) { + withSQLConf( +"spark.sql.connect.enrichError.enabled" -> enrichErrorEnabled.toString, +"spark.sql.legacy.timeParserPolicy" -> "EXCEPTION") { val ex = intercept[SparkUpgradeException] { spark .sql(""" diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 13d6702c4cf9..019728a45f40 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -46,6 +46,8 @@ license: | - Since Spark 4.0, MySQL JDBC datasource will read FLOAT as FloatType, while in Spark 3.5 and previous, it was read as DoubleType. To restore the previous behavior, you can cast the column to the old type. - Since Spark 4.0, MySQL JDBC datasource will read BIT(n > 1) as BinaryType, while in Spark 3.5 and previous, read as LongType. 
To restore the previous behavior, set `spark.sql.legacy.mysql.bitArrayMapping.enabled` to `true`. - Since Spark 4.0, MySQL JDBC datasource will write ShortType as SMALLINT, while in Spark 3.5 and previous, write as INTEGER. To restore the previous behavior, you can replace the column with IntegerType whenever before writing. +- Since Spark 4.0, The default value for `spark.sql.legacy.ctePrecedencePolicy` has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an error, inner CTE definitions take precedence over outer definitions. +- Since Spark 4.0, The default value for `spark.sql.legacy.timeParserPolicy` has been changed from `EXCEPTION` to `CORRECTED`. Instead of raising an `INCONSISTENT_BEHAVIOR_CROSS_VERSION` error, `CANNOT_PARSE_TIMESTAMP` will be raised if ANSI mode is enable. `NULL` will be returned if ANSI mode is disabled. See [Datetime Patterns for Formatting and Parsing](sql-ref-datetime-pattern.html). ## Upgrading from Spark SQL 3.5.1 to 3.5.2 diff --git a/python/pyspark/sql/tests/connect/test_connect_session.p
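Users who depend on the old parsing behavior can still opt back in explicitly. A minimal sketch, assuming an active `SparkSession` named `spark`; the config name comes from the migration guide above, and `LEGACY`/`EXCEPTION`/`CORRECTED` are the policy values.
```
// Restore pre-4.0 datetime parsing semantics for the current session.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

// Or keep raising INCONSISTENT_BEHAVIOR_CROSS_VERSION on ambiguous patterns,
// as the pre-4.0 default did.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "EXCEPTION")
```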
(spark) branch master updated: [SPARK-47723][CORE][TESTS] Introduce a tool that can sort alphabetically enumeration field in `LogEntry` automatically
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new fb96b1a8d648 [SPARK-47723][CORE][TESTS] Introduce a tool that can sort alphabetically enumeration field in `LogEntry` automatically fb96b1a8d648 is described below commit fb96b1a8d6480612ca61ec39f62c8db0b341327b Author: panbingkun AuthorDate: Thu Apr 4 17:04:53 2024 -0700 [SPARK-47723][CORE][TESTS] Introduce a tool that can sort alphabetically enumeration field in `LogEntry` automatically ### What changes were proposed in this pull request? The pr aims to `introduce` a `tool` that can `sort alphabetically` enumeration field in `LogEntry` automatically. ### Why are the changes needed? Enable developers to more conveniently write the enumeration values in `LogEntry` in alphabetical order according to the requirements of structured log development documents. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test. ``` SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "common-utils/testOnly *LogKeySuite -- -t \"LogKey enumeration fields are correctly sorted\"" ``` - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45867 from panbingkun/SPARK-47723. Lead-authored-by: panbingkun Co-authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/util/LogKeySuite.scala | 71 -- 1 file changed, 67 insertions(+), 4 deletions(-) diff --git a/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala b/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala index 1f3c2d77d35f..24a24538ad72 100644 --- a/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala +++ b/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala @@ -17,17 +17,80 @@ package org.apache.spark.util +import java.nio.charset.StandardCharsets +import java.nio.file.{Files, Path} +import java.util.{ArrayList => JList} + +import scala.jdk.CollectionConverters._ + +import org.apache.commons.io.FileUtils import org.scalatest.funsuite.AnyFunSuite // scalastyle:ignore funsuite import org.apache.spark.internal.{Logging, LogKey} +import org.apache.spark.internal.LogKey.LogKey +// scalastyle:off line.size.limit +/** + * To re-generate the LogKey class file, run: + * {{{ + * SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "common-utils/testOnly org.apache.spark.util.LogKeySuite" + * }}} + */ +// scalastyle:on line.size.limit class LogKeySuite extends AnyFunSuite // scalastyle:ignore funsuite with Logging { - test("LogKey enumeration fields must be sorted alphabetically") { -val keys = LogKey.values.toSeq -assert(keys === keys.sortBy(_.toString), - "LogKey enumeration fields must be sorted alphabetically") + /** + * Get a Path relative to the root project. It is assumed that a spark home is set. 
+ */ + protected final def getWorkspaceFilePath(first: String, more: String*): Path = { +if (!(sys.props.contains("spark.test.home") || sys.env.contains("SPARK_HOME"))) { + fail("spark.test.home or SPARK_HOME is not set.") +} +val sparkHome = sys.props.getOrElse("spark.test.home", sys.env("SPARK_HOME")) +java.nio.file.Paths.get(sparkHome, first +: more: _*) + } + + private val regenerateGoldenFiles: Boolean = System.getenv("SPARK_GENERATE_GOLDEN_FILES") == "1" + + private val logKeyFilePath = getWorkspaceFilePath("common", "utils", "src", "main", "scala", +"org", "apache", "spark", "internal", "LogKey.scala") + + // regenerate the file `LogKey.scala` with its enumeration fields sorted alphabetically + private def regenerateLogKeyFile( + originalKeys: Seq[LogKey], sortedKeys: Seq[LogKey]): Unit = { +if (originalKeys != sortedKeys) { + val logKeyFile = logKeyFilePath.toFile + logInfo(s"Regenerating LogKey file $logKeyFile") + val originalContents = FileUtils.readLines(logKeyFile, StandardCharsets.UTF_8) + val sortedContents = new JList[String]() + var firstMatch = false + originalContents.asScala.foreach { line => +if (line.trim.startsWith("val ") && line.trim.endsWith(" = Value")) { + if (!firstMatch) { +sortedKeys.foreach { logKey => + sortedContents.add(s" val ${logKey.toString} = Value") +} +firstMatch = true + } +} else { + sortedConten
(spark) branch master updated: [SPARK-47598][CORE] MLLib: Migrate logError with variables to structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 3fd0cd609df6 [SPARK-47598][CORE] MLLib: Migrate logError with variables to structured logging framework 3fd0cd609df6 is described below commit 3fd0cd609df65051920c56861fa6da54caf4cc9e Author: panbingkun AuthorDate: Thu Apr 4 10:46:54 2024 -0700 [SPARK-47598][CORE] MLLib: Migrate logError with variables to structured logging framework ### What changes were proposed in this pull request? The pr aims to migrate `logError` in module `MLLib` with variables to `structured logging framework`. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45837 from panbingkun/SPARK-47598. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 7 + .../scala/org/apache/spark/internal/Logging.scala | 2 +- .../apache/spark/ml/classification/LinearSVC.scala | 15 +- .../ml/classification/LogisticRegression.scala | 14 - .../ml/regression/AFTSurvivalRegression.scala | 5 +--- .../spark/ml/regression/LinearRegression.scala | 5 +--- .../org/apache/spark/ml/util/Instrumentation.scala | 35 +- .../apache/spark/mllib/util/DataValidators.scala | 11 --- .../org/apache/spark/mllib/util/MLUtils.scala | 10 ++- .../spark/ml/feature/VectorIndexerSuite.scala | 17 ++- .../mllib/tree/GradientBoostedTreesSuite.scala | 20 + 11 files changed, 99 insertions(+), 42 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 608c0c6d521e..66f3b803c0d4 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -28,6 +28,7 @@ object LogKey extends Enumeration { val BLOCK_MANAGER_ID = Value val BROADCAST_ID = Value val BUCKET = Value + val CATEGORICAL_FEATURES = Value val CLASS_LOADER = Value val CLASS_NAME = Value val COMMAND = Value @@ -44,17 +45,22 @@ object LogKey extends Enumeration { val EXIT_CODE = Value val HOST = Value val JOB_ID = Value + val LEARNING_RATE = Value val LINE = Value val LINE_NUM = Value val MASTER_URL = Value val MAX_ATTEMPTS = Value + val MAX_CATEGORIES = Value val MAX_EXECUTOR_FAILURES = Value val MAX_SIZE = Value val MIN_SIZE = Value + val NUM_ITERATIONS = Value val OLD_BLOCK_MANAGER_ID = Value + val OPTIMIZER_CLASS_NAME = Value val PARTITION_ID = Value val PATH = Value val POD_ID = Value + val RANGE = Value val REASON = Value val REMOTE_ADDRESS = Value val RETRY_COUNT = Value @@ -63,6 +69,7 @@ object LogKey extends Enumeration { val SIZE = Value val STAGE_ID = Value val SUBMISSION_ID = Value + val SUBSAMPLING_RATE = Value val TASK_ATTEMPT_ID = Value val TASK_ID = Value val TASK_NAME = Value diff --git a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala index 84b9debb2afd..2132e166eacf 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala @@ -117,7 +117,7 @@ trait Logging { } } - private def 
withLogContext(context: java.util.HashMap[String, String])(body: => Unit): Unit = { + protected def withLogContext(context: java.util.HashMap[String, String])(body: => Unit): Unit = { val threadContext = CloseableThreadContext.putAll(context) try { body diff --git a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala index 13898a304b3d..024693ba06f2 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala @@ -25,7 +25,8 @@ import org.apache.hadoop.fs.Path import org.apache.spark.SparkException import org.apache.spark.annotation.Since -import org.apache.spark.internal.Logging +import org.apache.spark.internal.{Logging, MDC} +import org.apache.spark.internal.LogKey.{COUNT, RANGE} import org.apache.spark.ml.feature._ import org.apache.spark.ml.linalg._ import org.apache.spark.ml.optim.aggregator._ @@ -36,6 +37,7 @@ import org.apache.spark.ml.stat._
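The LinearSVC change imports the `COUNT` and `RANGE` keys; a plausible reconstruction of the migrated call follows. The surrounding class and the exact wording are illustrative, not copied from the source.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.{COUNT, RANGE}

// Hypothetical validator showing how label-range errors carry MDC values.
class LabelValidationExample extends Logging {
  def check(numInvalid: Long): Unit = {
    if (numInvalid != 0L) {
      logError(log"Classification labels should be in ${MDC(RANGE, "[0 to 1]")}. " +
        log"Found ${MDC(COUNT, numInvalid.toString)} invalid labels.")
    }
  }
}
```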
(spark) branch master updated (d75c77562d27 -> 3f6ac60e9966)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from d75c77562d27 [SPARK-46812][PYTHON][TESTS][FOLLOWUP] Skip `pandas`-required tests if pandas is not available add 3f6ac60e9966 [SPARK-47577][CORE][PART1] Migrate logError with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 43 +- .../scala/org/apache/spark/MapOutputTracker.scala | 22 ++- .../spark/api/python/PythonGatewayServer.scala | 7 ++-- .../apache/spark/api/python/PythonHadoopUtil.scala | 5 ++- .../apache/spark/broadcast/TorrentBroadcast.scala | 7 +++- .../scala/org/apache/spark/deploy/Client.scala | 16 .../org/apache/spark/deploy/SparkSubmit.scala | 9 +++-- .../apache/spark/deploy/history/EventFilter.scala | 7 ++-- .../org/apache/spark/deploy/master/Master.scala| 12 -- .../spark/deploy/rest/RestSubmissionClient.scala | 20 ++ .../org/apache/spark/deploy/worker/Worker.scala| 12 +++--- .../apache/spark/deploy/worker/WorkerWatcher.scala | 10 +++-- .../executor/CoarseGrainedExecutorBackend.scala| 11 +++--- .../scala/org/apache/spark/executor/Executor.scala | 14 --- .../spark/internal/io/SparkHadoopWriter.scala | 5 ++- .../org/apache/spark/metrics/MetricsSystem.scala | 5 ++- .../main/scala/org/apache/spark/rdd/PipedRDD.scala | 6 ++- .../apache/spark/scheduler/AsyncEventQueue.scala | 9 +++-- .../org/apache/spark/scheduler/DAGScheduler.scala | 21 +-- .../org/apache/spark/scheduler/HealthTracker.scala | 8 ++-- .../apache/spark/scheduler/LiveListenerBus.scala | 8 ++-- .../apache/spark/scheduler/ReplayListenerBus.scala | 5 ++- .../apache/spark/scheduler/TaskResultGetter.scala | 6 ++- .../apache/spark/scheduler/TaskSchedulerImpl.scala | 17 + .../apache/spark/scheduler/TaskSetManager.scala| 24 +++- .../cluster/CoarseGrainedSchedulerBackend.scala| 7 ++-- .../cluster/StandaloneSchedulerBackend.scala | 5 ++- .../spark/shuffle/IndexShuffleBlockResolver.scala | 10 +++-- .../org/apache/spark/storage/BlockManager.scala| 6 +-- .../spark/storage/BlockManagerMasterEndpoint.scala | 8 ++-- .../spark/storage/DiskBlockObjectWriter.scala | 11 +++--- .../storage/ShuffleBlockFetcherIterator.scala | 17 + .../main/scala/org/apache/spark/util/Utils.scala | 12 -- .../org/apache/spark/deploy/yarn/Client.scala | 4 +- .../cluster/YarnClientSchedulerBackend.scala | 4 +- 35 files changed, 242 insertions(+), 151 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47705][INFRA][FOLLOWUP] Sort LogKey alphabetically and build a test to ensure it
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c25fd93199cc [SPARK-47705][INFRA][FOLLOWUP] Sort LogKey alphabetically and build a test to ensure it c25fd93199cc is described below commit c25fd93199cc2d8795414cdb09a7129793a3e206 Author: panbingkun AuthorDate: Wed Apr 3 19:38:37 2024 -0700 [SPARK-47705][INFRA][FOLLOWUP] Sort LogKey alphabetically and build a test to ensure it ### What changes were proposed in this pull request? The pr aims to fix bug about https://github.com/apache/spark/pull/45857 ### Why are the changes needed? In fact, `LogKey.values.toSeq.sorted` did not sort alphabetically as expected. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Pass GA. - Manually test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45864 from panbingkun/fix_sort_logkey. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala | 2 +- common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index b8a43a03d8b6..86ea648d47c1 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -30,8 +30,8 @@ object LogKey extends Enumeration { val MAX_EXECUTOR_FAILURES = Value val MAX_SIZE = Value val MIN_SIZE = Value - val REMOTE_ADDRESS = Value val POD_ID = Value + val REMOTE_ADDRESS = Value type LogKey = Value } diff --git a/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala b/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala index 39229f4b910b..1f3c2d77d35f 100644 --- a/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala +++ b/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala @@ -27,6 +27,7 @@ class LogKeySuite test("LogKey enumeration fields must be sorted alphabetically") { val keys = LogKey.values.toSeq -assert(keys === keys.sorted, "LogKey enumeration fields must be sorted alphabetically") +assert(keys === keys.sortBy(_.toString), + "LogKey enumeration fields must be sorted alphabetically") } } - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
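The reason the original assertion could not fail is that `scala.Enumeration` values order by their numeric id (declaration order), so `values.toSeq` is already "sorted" under the default ordering and comparing it with `keys.sorted` is vacuously true. A small standalone illustration:
```
object EnumOrderingDemo extends App {
  object Demo extends Enumeration {
    val ZEBRA, APPLE = Value               // deliberately not alphabetical
  }
  val keys = Demo.values.toSeq             // id order: ZEBRA, APPLE
  println(keys == keys.sorted)             // true  -- Ordering[Value] compares ids
  println(keys == keys.sortBy(_.toString)) // false -- name order is APPLE, ZEBRA
}
```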
(spark) branch master updated: [SPARK-47705][INFRA] Sort LogKey alphabetically and build a test to ensure it
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7dec5eb14644 [SPARK-47705][INFRA] Sort LogKey alphabetically and build a test to ensure it 7dec5eb14644 is described below commit 7dec5eb14644aee6c0562bad1d14421d9fa07f17 Author: Daniel Tenedorio AuthorDate: Wed Apr 3 14:38:52 2024 -0700 [SPARK-47705][INFRA] Sort LogKey alphabetically and build a test to ensure it ### What changes were proposed in this pull request? This PR adds a unit test to ensure that the fields of the `LogKey` enumeration are sorted alphabetically, as specified by https://issues.apache.org/jira/browse/SPARK-47705. ### Why are the changes needed? This will make sure that the fields of the enumeration remain easy to read in the future as we add more cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR adds testing coverage only. ### Was this patch authored or co-authored using generative AI tooling? GitHub copilot offered some suggestions, but I rejected them Closes #45857 from dtenedor/logs. Authored-by: Daniel Tenedorio Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/util/LogKeySuite.scala | 32 ++ 1 file changed, 32 insertions(+) diff --git a/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala b/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala new file mode 100644 index ..39229f4b910b --- /dev/null +++ b/common/utils/src/test/scala/org/apache/spark/util/LogKeySuite.scala @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.util + +import org.scalatest.funsuite.AnyFunSuite // scalastyle:ignore funsuite + +import org.apache.spark.internal.{Logging, LogKey} + +class LogKeySuite +extends AnyFunSuite // scalastyle:ignore funsuite +with Logging { + + test("LogKey enumeration fields must be sorted alphabetically") { +val keys = LogKey.values.toSeq +assert(keys === keys.sorted, "LogKey enumeration fields must be sorted alphabetically") + } +} - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (8d4e9647c971 -> db0975cb2a1c)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 8d4e9647c971 [SPARK-47684][SQL] Postgres: Map length unspecified bpchar to StringType add db0975cb2a1c [SPARK-47602][CORE] Resource managers: Migrate logError with variables to structured logging framework No new revisions were added by this update. Summary of changes: .../scala/org/apache/spark/internal/LogKey.scala | 14 ++- .../scala/org/apache/spark/internal/Logging.scala | 7 ++-- .../scala/org/apache/spark/util/MDCSuite.scala | 49 ++ .../apache/spark/util/PatternLoggingSuite.scala| 6 ++- .../apache/spark/util/StructuredLoggingSuite.scala | 40 -- .../cluster/k8s/ExecutorPodsAllocator.scala| 8 ++-- .../k8s/integrationtest/DepsTestsSuite.scala | 3 +- .../spark/deploy/yarn/ApplicationMaster.scala | 11 +++-- .../org/apache/spark/deploy/yarn/Client.scala | 5 ++- .../apache/spark/deploy/yarn/YarnAllocator.scala | 6 ++- .../cluster/YarnClientSchedulerBackend.scala | 7 ++-- 11 files changed, 133 insertions(+), 23 deletions(-) create mode 100644 common/utils/src/test/scala/org/apache/spark/util/MDCSuite.scala - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47659][CORE][TESTS] Improve `*LoggingSuite*`
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 14811974338e [SPARK-47659][CORE][TESTS] Improve `*LoggingSuite*` 14811974338e is described below commit 14811974338ed30d990039c84a563f5e6cd0b26e Author: panbingkun AuthorDate: Mon Apr 1 10:16:49 2024 -0700 [SPARK-47659][CORE][TESTS] Improve `*LoggingSuite*` ### What changes were proposed in this pull request? The pr aims to improve `UT` related to `structured logs`, including: `LoggingSuiteBase`, `StructuredLoggingSuite` and `PatternLoggingSuite`. ### Why are the changes needed? Enhance readability and make it more elegant. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? - Manually test. - Pass GA. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #45784 from panbingkun/SPARK-47659. Authored-by: panbingkun Signed-off-by: Gengliang Wang --- .../apache/spark/util/PatternLoggingSuite.scala| 21 +-- .../apache/spark/util/StructuredLoggingSuite.scala | 148 ++--- 2 files changed, 115 insertions(+), 54 deletions(-) diff --git a/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala b/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala index 7e4318306c82..02895f708ff0 100644 --- a/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala +++ b/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala @@ -16,30 +16,33 @@ */ package org.apache.spark.util +import org.apache.logging.log4j.Level import org.scalatest.BeforeAndAfterAll import org.apache.spark.internal.Logging class PatternLoggingSuite extends LoggingSuiteBase with BeforeAndAfterAll { - override protected def logFilePath: String = "target/pattern.log" + override def className: String = classOf[PatternLoggingSuite].getSimpleName + override def logFilePath: String = "target/pattern.log" override def beforeAll(): Unit = Logging.disableStructuredLogging() override def afterAll(): Unit = Logging.enableStructuredLogging() - override def expectedPatternForBasicMsg(level: String): String = -s""".*$level PatternLoggingSuite: This is a log message\n""" + override def expectedPatternForBasicMsg(level: Level): String = { +s""".*$level $className: This is a log message\n""" + } - override def expectedPatternForMsgWithMDC(level: String): String = -s""".*$level PatternLoggingSuite: Lost executor 1.\n""" + override def expectedPatternForMsgWithMDC(level: Level): String = +s""".*$level $className: Lost executor 1.\n""" - override def expectedPatternForMsgWithMDCAndException(level: String): String = -s""".*$level PatternLoggingSuite: Error in executor 1.\njava.lang.RuntimeException: OOM\n.*""" + override def expectedPatternForMsgWithMDCAndException(level: Level): String = +s""".*$level $className: Error in executor 1.\njava.lang.RuntimeException: OOM\n.*""" - override def verifyMsgWithConcat(level: String, logOutput: String): Unit = { + override def verifyMsgWithConcat(level: Level, logOutput: String): Unit = { val pattern = - s""".*$level PatternLoggingSuite: Min Size: 2, Max Size: 4. Please double check.\n""" + s""".*$level $className: Min Size: 2, Max Size: 4. 
Please double check.\n""" assert(pattern.r.matches(logOutput)) } } diff --git a/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala b/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala index 8165c5f5b751..fe42c7fec990 100644 --- a/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala +++ b/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala @@ -19,23 +19,28 @@ package org.apache.spark.util import java.io.File import java.nio.file.Files +import com.fasterxml.jackson.databind.ObjectMapper +import com.fasterxml.jackson.module.scala.DefaultScalaModule +import org.apache.logging.log4j.Level import org.scalatest.funsuite.AnyFunSuite // scalastyle:ignore funsuite import org.apache.spark.internal.{LogEntry, Logging, MDC} import org.apache.spark.internal.LogKey.{EXECUTOR_ID, MAX_SIZE, MIN_SIZE} -abstract class LoggingSuiteBase extends AnyFunSuite // scalastyle:ignore funsuite - with Logging { +trait LoggingSuiteBase +extends AnyFunSuite // scalastyle:ignore funsuite +with Logging { - protected def logFilePath:
(spark) branch master updated: [SPARK-47654][INFRA] Structured logging framework: support log concatenation
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new a5ca5867298f [SPARK-47654][INFRA] Structured logging framework: support log concatenation a5ca5867298f is described below commit a5ca5867298f8ad6d40f3132ad74cbf078cc62b3 Author: Gengliang Wang AuthorDate: Mon Apr 1 00:15:24 2024 -0700 [SPARK-47654][INFRA] Structured logging framework: support log concatenation ### What changes were proposed in this pull request? Support the log concatenation in the structured logging framework. For example ``` log"${MDC(CONFIG, SHUFFLE_MAPOUTPUT_MIN_SIZE_FOR_BROADCAST.key)} " + log"(${MDC(MIN_SIZE, minSizeForBroadcast.toString)} bytes) " + log"must be <= spark.rpc.message.maxSize (${MDC(MAX_SIZE, maxRpcMessageSize.toString)} " + log"bytes) to prevent sending an rpc message that is too large." ``` ### Why are the changes needed? Although most of the Spark logs are short, we need this convenient syntax when handling long logs with multiple variables. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? Yes, GitHub copilot Closes #45779 from gengliangwang/logConcat. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/LogKey.scala | 2 +- .../scala/org/apache/spark/internal/Logging.scala | 62 +- .../apache/spark/util/PatternLoggingSuite.scala| 6 +++ .../apache/spark/util/StructuredLoggingSuite.scala | 33 +++- 4 files changed, 77 insertions(+), 26 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala index 6ab6ac0eb58a..760077af6d3e 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/LogKey.scala @@ -21,5 +21,5 @@ package org.apache.spark.internal * All structured logging keys should be defined here for standardization. */ object LogKey extends Enumeration { - val EXECUTOR_ID = Value + val EXECUTOR_ID, MIN_SIZE, MAX_SIZE = Value } diff --git a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala index 0aa93d6289d1..5765a6eed542 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala @@ -22,7 +22,6 @@ import java.util.Locale import scala.jdk.CollectionConverters._ import org.apache.logging.log4j.{CloseableThreadContext, Level, LogManager} -import org.apache.logging.log4j.CloseableThreadContext.Instance import org.apache.logging.log4j.core.{Filter, LifeCycle, LogEvent, Logger => Log4jLogger, LoggerContext} import org.apache.logging.log4j.core.appender.ConsoleAppender import org.apache.logging.log4j.core.config.DefaultConfiguration @@ -43,7 +42,13 @@ case class MDC(key: LogKey.Value, value: String) * Wrapper class for log messages that include a logging context. * This is used as the return type of the string interpolator `LogStringContext`. 
*/ -case class MessageWithContext(message: String, context: Option[Instance]) +case class MessageWithContext(message: String, context: java.util.HashMap[String, String]) { + def +(mdc: MessageWithContext): MessageWithContext = { +val resultMap = new java.util.HashMap(context) +resultMap.putAll(mdc.context) +MessageWithContext(message + mdc.message, resultMap) + } +} /** * Companion class for lazy evaluation of the MessageWithContext instance. @@ -51,7 +56,7 @@ case class MessageWithContext(message: String, context: Option[Instance]) class LogEntry(messageWithContext: => MessageWithContext) { def message: String = messageWithContext.message - def context: Option[Instance] = messageWithContext.context + def context: java.util.HashMap[String, String] = messageWithContext.context } /** @@ -94,12 +99,12 @@ trait Logging { def log(args: MDC*): MessageWithContext = { val processedParts = sc.parts.iterator val sb = new StringBuilder(processedParts.next()) - lazy val map = new java.util.HashMap[String, String]() + val context = new java.util.HashMap[String, String]() args.foreach { mdc => sb.append(mdc.value) if (Logging.isStructuredLoggingEnabled) { - map.put(mdc.key.toString.toLowerCase(Locale.ROOT), mdc.value) + context.put(mdc.key.toString.toLowerCase(Locale.ROOT), mdc.value) }
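Concatenation merges both the rendered strings and the MDC maps. A minimal sketch using the `MIN_SIZE`/`MAX_SIZE` keys from this change; the wrapper object is hypothetical, and context keys are lower-cased as in the implementation above.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.{MAX_SIZE, MIN_SIZE}

// Hypothetical holder so the log"" interpolator from Logging is in scope.
object ConcatExample extends Logging {
  def warn(): Unit = {
    // The two fragments are joined: the rendered message is
    // "Min Size: 2, Max Size: 4. Please double check." and the structured
    // context holds both "min_size" -> "2" and "max_size" -> "4".
    logWarning(log"Min Size: ${MDC(MIN_SIZE, "2")}, " +
      log"Max Size: ${MDC(MAX_SIZE, "4")}. Please double check.")
  }
}
```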
(spark) branch master updated: [SPARK-47576][INFRA] Implement logInfo API in structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 11d76c96554c [SPARK-47576][INFRA] Implement logInfo API in structured logging framework 11d76c96554c is described below commit 11d76c96554cc71c6a941c99222c08c76bd04bf2 Author: Gengliang Wang AuthorDate: Fri Mar 29 13:10:40 2024 -0700 [SPARK-47576][INFRA] Implement logInfo API in structured logging framework ### What changes were proposed in this pull request? Implement logWarning API in structured logging framework. Also, revise the test case names to make it more reasonable for the `PatternLoggingSuite` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #45777 from gengliangwang/logInfo. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/Logging.scala | 14 ++ .../apache/spark/util/StructuredLoggingSuite.scala | 21 +++-- 2 files changed, 25 insertions(+), 10 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala index 2fed115f3dbb..0aa93d6289d1 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala @@ -122,6 +122,20 @@ trait Logging { if (log.isInfoEnabled) log.info(msg) } + protected def logInfo(entry: LogEntry): Unit = { +if (log.isInfoEnabled) { + log.info(entry.message) + entry.context.map(_.close()) +} + } + + protected def logInfo(entry: LogEntry, throwable: Throwable): Unit = { +if (log.isInfoEnabled) { + log.info(entry.message, throwable) + entry.context.map(_.close()) +} + } + protected def logDebug(msg: => String): Unit = { if (log.isDebugEnabled) log.debug(msg) } diff --git a/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala b/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala index b032649170bc..5dfd3bb46021 100644 --- a/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala +++ b/common/utils/src/test/scala/org/apache/spark/util/StructuredLoggingSuite.scala @@ -58,33 +58,34 @@ abstract class LoggingSuiteBase extends AnyFunSuite // scalastyle:ignore funsuit def expectedPatternForMsgWithMDCAndException(level: String): String - test("Structured logging") { + test("Basic logging") { val msg = "This is a log message" Seq( ("ERROR", () => logError(msg)), - ("WARN", () => logWarning(msg))).foreach { case (level, logFunc) => + ("WARN", () => logWarning(msg)), + ("INFO", () => logInfo(msg))).foreach { case (level, logFunc) => val logOutput = captureLogOutput(logFunc) assert(expectedPatternForBasicMsg(level).r.matches(logOutput)) } } - test("Structured logging with MDC") { + test("Logging with MDC") { Seq( - ("ERROR", () => logError(log"Lost executor ${MDC(EXECUTOR_ID, "1")}.")), - ("WARN", () => logWarning(log"Lost executor ${MDC(EXECUTOR_ID, "1")}."))) - .foreach { + ("ERROR", () => logError(msgWithMDC)), + ("WARN", () => logWarning(msgWithMDC)), + ("INFO", () => logInfo(msgWithMDC))).foreach { case (level, logFunc) => val logOutput = captureLogOutput(logFunc) 
assert(expectedPatternForMsgWithMDC(level).r.matches(logOutput)) } } - test("Structured exception logging with MDC") { + test("Logging with MDC and Exception") { val exception = new RuntimeException("OOM") Seq( - ("ERROR", () => logError(log"Error in executor ${MDC(EXECUTOR_ID, "1")}.", exception)), - ("WARN", () => logWarning(log"Error in executor ${MDC(EXECUTOR_ID, "1")}.", exception))) - .foreach { + ("ERROR", () => logError(msgWithMDCAndException, exception)), + ("WARN", () => logWarning(msgWithMDCAndException, exception)), + ("INFO", () => logInfo(msgWithMDCAndException, exception))).foreach { case (level, logFunc) => val logOutput = captureLogOutput(logFunc) assert(expectedPatternForMsgWithMDCAndException(level).r.findFirstIn(logOutput).isDefined) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
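A short usage sketch of the new INFO-level overloads, matching the pattern exercised by the tests above. The class and messages are hypothetical; `EXECUTOR_ID` is the key used in the suite.
```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.EXECUTOR_ID

// Hypothetical caller; INFO-level entries carry MDC just like ERROR and WARN.
class ExecutorEventsExample extends Logging {
  def lost(executorId: String, cause: Throwable): Unit = {
    logInfo(log"Lost executor ${MDC(EXECUTOR_ID, executorId)}.")
    logInfo(log"Error in executor ${MDC(EXECUTOR_ID, executorId)}.", cause)
  }
}
```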
(spark) branch master updated (d182810abcd8 -> db14be8ab5f7)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from d182810abcd8 [SPARK-47575][INFRA] Implement logWarning API in structured logging framework add db14be8ab5f7 [SPARK-47637][SQL] Use errorCapturingIdentifier in more places No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/parser/SqlBaseParser.g4 | 18 +- .../sql/catalyst/parser/DataTypeAstBuilder.scala| 2 +- .../spark/sql/catalyst/parser/AstBuilder.scala | 14 +++--- .../sql/catalyst/parser/ErrorParserSuite.scala | 21 + .../apache/spark/sql/execution/SparkSqlParser.scala | 4 ++-- 5 files changed, 40 insertions(+), 19 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47575][INFRA] Implement logWarning API in structured logging framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d182810abcd8 [SPARK-47575][INFRA] Implement logWarning API in structured logging framework d182810abcd8 is described below commit d182810abcd8ff6a86211b90f0b4217100546688 Author: Gengliang Wang AuthorDate: Fri Mar 29 11:13:21 2024 -0700 [SPARK-47575][INFRA] Implement logWarning API in structured logging framework ### What changes were proposed in this pull request? Implement logWarning API in structured logging framework. Also, refactor the logging test suites to reduce duplicated code. ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45770 from gengliangwang/logWarning. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../scala/org/apache/spark/internal/Logging.scala | 14 .../apache/spark/util/PatternLoggingSuite.scala| 33 ++ .../apache/spark/util/StructuredLoggingSuite.scala | 74 +++--- 3 files changed, 72 insertions(+), 49 deletions(-) diff --git a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala index 7f380a9c7887..2fed115f3dbb 100644 --- a/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala +++ b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala @@ -134,6 +134,20 @@ trait Logging { if (log.isWarnEnabled) log.warn(msg) } + protected def logWarning(entry: LogEntry): Unit = { +if (log.isWarnEnabled) { + log.warn(entry.message) + entry.context.map(_.close()) +} + } + + protected def logWarning(entry: LogEntry, throwable: Throwable): Unit = { +if (log.isWarnEnabled) { + log.warn(entry.message, throwable) + entry.context.map(_.close()) +} + } + protected def logError(msg: => String): Unit = { if (log.isErrorEnabled) log.error(msg) } diff --git a/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala b/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala index 0c6ed89172e0..ef0aa7050b07 100644 --- a/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala +++ b/common/utils/src/test/scala/org/apache/spark/util/PatternLoggingSuite.scala @@ -18,8 +18,7 @@ package org.apache.spark.util import org.scalatest.BeforeAndAfterAll -import org.apache.spark.internal.{Logging, MDC} -import org.apache.spark.internal.LogKey.EXECUTOR_ID +import org.apache.spark.internal.Logging class PatternLoggingSuite extends LoggingSuiteBase with BeforeAndAfterAll { @@ -29,30 +28,12 @@ class PatternLoggingSuite extends LoggingSuiteBase with BeforeAndAfterAll { override def afterAll(): Unit = Logging.enableStructuredLogging() - test("Pattern layout logging") { -val msg = "This is a log message" + override def expectedPatternForBasicMsg(level: String): String = +s""".*$level PatternLoggingSuite: This is a log message\n""" -val logOutput = captureLogOutput(() => logError(msg)) -// scalastyle:off line.size.limit -val pattern = """.*ERROR PatternLoggingSuite: This is a log message\n""".r -// scalastyle:on -assert(pattern.matches(logOutput)) - } + override def expectedPatternForMsgWithMDC(level: String): String = +s""".*$level PatternLoggingSuite: 
Lost executor 1.\n""" - test("Pattern layout logging with MDC") { -logError(log"Lost executor ${MDC(EXECUTOR_ID, "1")}.") - -val logOutput = captureLogOutput(() => logError(log"Lost executor ${MDC(EXECUTOR_ID, "1")}.")) -val pattern = """.*ERROR PatternLoggingSuite: Lost executor 1.\n""".r -assert(pattern.matches(logOutput)) - } - - test("Pattern layout exception logging") { -val exception = new RuntimeException("OOM") - -val logOutput = captureLogOutput(() => - logError(log"Error in executor ${MDC(EXECUTOR_ID, "1")}.", exception)) -assert(logOutput.contains("ERROR PatternLoggingSuite: Error in executor 1.")) -assert(logOutput.contains("java.lang.RuntimeException: OOM")) - } + override def expectedPatternForMsgWithMDCAndException(level: String): String = +s""".*$level PatternLoggingSuite: Error in executor 1.\n
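And, for symmetry, a sketch of the `logWarning` overloads added in the commit above, including the variant that carries a throwable; the class name and failure scenario are made up for illustration:

```
import org.apache.spark.internal.{Logging, MDC}
import org.apache.spark.internal.LogKey.EXECUTOR_ID

// Hypothetical caller, not part of this patch.
class ExecutorWatcher extends Logging {
  def onExecutorFailure(executorId: String, cause: Throwable): Unit = {
    // LogEntry overload without a throwable.
    logWarning(log"Lost executor ${MDC(EXECUTOR_ID, executorId)}.")
    // LogEntry overload with a throwable; with the JSON layout the stack trace
    // is emitted under the structured "exception" field.
    logWarning(log"Error in executor ${MDC(EXECUTOR_ID, executorId)}.", cause)
  }
}
```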
(spark) branch master updated: [SPARK-47574][INFRA] Introduce Structured Logging Framework
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 874d033fc61b [SPARK-47574][INFRA] Introduce Structured Logging Framework 874d033fc61b is described below commit 874d033fc61becb5679db70c804592a0f9cc37ed Author: Gengliang Wang AuthorDate: Thu Mar 28 22:58:51 2024 -0700 [SPARK-47574][INFRA] Introduce Structured Logging Framework ### What changes were proposed in this pull request? Introduce Structured Logging Framework as per [SPIP: Structured Logging Framework for Apache Spark](https://docs.google.com/document/d/1rATVGmFLNVLmtxSpWrEceYm7d-ocgu8ofhryVs4g3XU/edit?usp=sharing) . * The default logging output format will be json lines. For example ``` { "ts":"2023-03-12T12:02:46.661-0700", "level":"ERROR", "msg":"Cannot determine whether executor 289 is alive or not", "context":{ "executor_id":"289" }, "exception":{ "class":"org.apache.spark.SparkException", "msg":"Exception thrown in awaitResult", "stackTrace":"..." }, "source":"BlockManagerMasterEndpoint" } ``` * Introduce a new configuration `spark.log.structuredLogging.enabled` to set the default log4j configuration. It is true by default. Users can disable it to get plain text log outputs. * The change will start with the `logError` method. Example changes on the API: from ``` logError(s"Cannot determine whether executor $executorId is alive or not.", e) ``` to ``` logError(log"Cannot determine whether executor ${MDC(EXECUTOR_ID, executorId)} is alive or not.", e) ``` ### Why are the changes needed? To enhance Apache Spark's logging system by implementing structured logging. This transition will change the format of the default log output from plain text to JSON lines, making it more analyzable. ### Does this PR introduce _any_ user-facing change? Yes, the default log output format will be json lines instead of plain text. User can restore the default plain text output when disabling configuration `spark.log.structuredLogging.enabled`. If a user is a customized log4j configuration, there is no changes in the log output. ### How was this patch tested? New Unit tests ### Was this patch authored or co-authored using generative AI tooling? Yes, some of the code comments are from github copilot Closes #45729 from gengliangwang/LogInterpolator. 
Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- common/utils/pom.xml | 4 + .../resources/org/apache/spark/SparkLayout.json| 38 .../org/apache/spark/log4j2-defaults.properties| 4 +- ...s => log4j2-pattern-layout-defaults.properties} | 0 .../scala/org/apache/spark/internal/LogKey.scala | 25 + .../scala/org/apache/spark/internal/Logging.scala | 105 - common/utils/src/test/resources/log4j2.properties | 50 ++ .../apache/spark/util/PatternLoggingSuite.scala| 58 .../apache/spark/util/StructuredLoggingSuite.scala | 83 .../org/apache/spark/deploy/SparkSubmit.scala | 5 + .../org/apache/spark/internal/config/package.scala | 10 ++ dev/deps/spark-deps-hadoop-3-hive-2.3 | 1 + docs/core-migration-guide.md | 2 + pom.xml| 5 + 14 files changed, 386 insertions(+), 4 deletions(-) diff --git a/common/utils/pom.xml b/common/utils/pom.xml index d360e041dd64..1dbf2a769fff 100644 --- a/common/utils/pom.xml +++ b/common/utils/pom.xml @@ -98,6 +98,10 @@ org.apache.logging.log4j log4j-1.2-api + + org.apache.logging.log4j + log4j-layout-template-json + target/scala-${scala.binary.version}/classes diff --git a/common/utils/src/main/resources/org/apache/spark/SparkLayout.json b/common/utils/src/main/resources/org/apache/spark/SparkLayout.json new file mode 100644 index ..b0d8ea27ffbc --- /dev/null +++ b/common/utils/src/main/resources/org/apache/spark/SparkLayout.json @@ -0,0 +1,38 @@ +{ + "ts": { +"$resolver": "timestamp" + }, + "level": { +"$resolver": "level", +"field": "name" + }, + "msg": { +"$resolver": "message", +"stringified": true + }, + "context": { +"$resolver": "mdc
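One hedged illustration of the new `spark.log.structuredLogging.enabled` flag mentioned above, for applications that prefer plain-text logs. Whether setting it programmatically (as opposed to `--conf` at submit time) takes effect early enough is an assumption here, since the flag is consumed during startup:

```
import org.apache.spark.SparkConf

// Placeholder application setup; only the last setting relates to this commit.
val conf = new SparkConf()
  .setAppName("plain-text-logging-demo")
  // Disable the JSON-lines default and fall back to the pattern layout.
  .set("spark.log.structuredLogging.enabled", "false")
```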
(spark) branch master updated: [SPARK-47492][SQL] Widen whitespace rules in lexer
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c832e2ac1d04 [SPARK-47492][SQL] Widen whitespace rules in lexer c832e2ac1d04 is described below commit c832e2ac1d04668c77493577662c639785808657 Author: Serge Rielau AuthorDate: Thu Mar 28 15:51:32 2024 -0700 [SPARK-47492][SQL] Widen whitespace rules in lexer ### What changes were proposed in this pull request? In this pull PR we extend the Lexer's understanding of WhiteSpace (what separates tokens) from the ASCII: , to the various Unicode flavors of "space" such as "narrow" and "wide". ### Why are the changes needed? SQL statements are frequently copy pasted from various sources. Many of these sources are "rich text" and based on Unicode. When doing do it is inevitable that non ASCII whitespace characters are copied. This results today in often incomprehensible syntax errors. Incomprehensible because the error message prints the "bad" whitespace just like an ASCII whitespace. So the user stands little chance to find root cause unless they use possible editor options to to highlight non ASCII space or they, by sheer luck, happen to remove the whitespace. So in this PR we acknowledge the reality and stop "discriminating" against non-ASCII whitespace. ### Does this PR introduce _any_ user-facing change? Queries that used to fail before with a Syntax error, now succeed. ### How was this patch tested? Added a new set of unit tests in SparkSQLParserSuite ### Was this patch authored or co-authored using generative AI tooling? No Closes #45620 from srielau/SPARK-47492-Widen-whitespace-rules-in-lexer. Lead-authored-by: Serge Rielau Co-authored-by: Serge Rielau Signed-off-by: Gengliang Wang --- .../spark/sql/catalyst/parser/SqlBaseLexer.g4 | 2 +- .../spark/sql/execution/SparkSqlParserSuite.scala | 80 ++ 2 files changed, 81 insertions(+), 1 deletion(-) diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 index 7c376e226850..f5565f0a63fb 100644 --- a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 +++ b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 @@ -554,7 +554,7 @@ BRACKETED_COMMENT ; WS -: [ \r\n\t]+ -> channel(HIDDEN) +: [ \t\n\f\r\u000B\u00A0\u1680\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A\u2028\u202F\u205F\u3000]+ -> channel(HIDDEN) ; // Catch-all for anything we can't recognize. 
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala index c3768afa90f1..f60df77b7e9b 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/SparkSqlParserSuite.scala @@ -800,4 +800,84 @@ class SparkSqlParserSuite extends AnalysisTest with SharedSparkSession { start = 0, stop = 63)) } + + test("verify whitespace handling - standard whitespace") { +parser.parsePlan("SELECT 1") // ASCII space +parser.parsePlan("SELECT\r1") // ASCII carriage return +parser.parsePlan("SELECT\n1") // ASCII line feed +parser.parsePlan("SELECT\t1") // ASCII tab +parser.parsePlan("SELECT\u000B1") // ASCII vertical tab +parser.parsePlan("SELECT\f1") // ASCII form feed + } + + // Need to switch off scala style for Unicode characters + // scalastyle:off + test("verify whitespace handling - Unicode no-break space") { +parser.parsePlan("SELECT\u00A01") // Unicode no-break space + } + + test("verify whitespace handling - Unicode ogham space mark") { +parser.parsePlan("SELECT\u16801") // Unicode ogham space mark + } + + test("verify whitespace handling - Unicode en quad") { +parser.parsePlan("SELECT\u20001") // Unicode en quad + } + + test("verify whitespace handling - Unicode em quad") { +parser.parsePlan("SELECT\u20011") // Unicode em quad + } + + test("verify whitespace handling - Unicode en space") { +parser.parsePlan("SELECT\u20021") // Unicode en space + } + + test("verify whitespace handling - Unicode em space") { +parser.parsePlan("SELECT\u20031") // Unicode em space + } + + test("verify whitespace handling - Unicode three-p
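To show the user-facing effect of the widened `WS` rule, a small sketch mirroring the new parser tests; `spark` is assumed to be an active `SparkSession` (for example in `spark-shell`):

```
// "\u00A0" is the Unicode no-break space that is easily copied in from
// rich-text sources; the lexer previously rejected it with a syntax error.
val query = "SELECT\u00A01"
spark.sql(query).show()   // now parses and returns a single row containing 1
```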
(spark) branch master updated: [SPARK-47447][SQL] Allow reading Parquet TimestampLTZ as TimestampNTZ
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c3a04fa59ce1 [SPARK-47447][SQL] Allow reading Parquet TimestampLTZ as TimestampNTZ c3a04fa59ce1 is described below commit c3a04fa59ce1aabe4818430ae294fb8d210c0e4b Author: Gengliang Wang AuthorDate: Tue Mar 19 23:04:59 2024 -0700 [SPARK-47447][SQL] Allow reading Parquet TimestampLTZ as TimestampNTZ ### What changes were proposed in this pull request? Currently, Parquet TimestampNTZ type columns can be read as TimestampLTZ, while reading TimestampLTZ as TimestampNTZ will cause errors. This makes it impossible to read parquet files containing both TimestampLTZ and TimestampNTZ as TimestampNTZ. To make the data type system on Parquet simpler, this PR allows reading TimestampLTZ as TimestampNTZ in the Parquet data source. ### Why are the changes needed? * Make it possible to read parquet files containing both TimestampLTZ and TimestampNTZ as TimestampNTZ * Make the data type system on Parquet simpler ### Does this PR introduce _any_ user-facing change? Yes, Parquet TimestampLTZ type column are now allowed to be read as TimestampNTZ ### How was this patch tested? UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #45571 from gengliangwang/allowReadLTZAsNTZ. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../parquet/ParquetVectorUpdaterFactory.java | 19 ++- .../datasources/parquet/ParquetRowConverter.scala | 16 .../datasources/parquet/ParquetQuerySuite.scala| 22 +++--- 3 files changed, 21 insertions(+), 36 deletions(-) diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java index abb44915cbcd..b6065c24f2ec 100644 --- a/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java +++ b/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/ParquetVectorUpdaterFactory.java @@ -148,12 +148,10 @@ public class ParquetVectorUpdaterFactory { } } else if (sparkType == DataTypes.TimestampNTZType && isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MICROS)) { - validateTimestampNTZType(); // TIMESTAMP_NTZ is a new data type and has no legacy files that need to do rebase. return new LongUpdater(); } else if (sparkType == DataTypes.TimestampNTZType && isTimestampTypeMatched(LogicalTypeAnnotation.TimeUnit.MILLIS)) { - validateTimestampNTZType(); // TIMESTAMP_NTZ is a new data type and has no legacy files that need to do rebase. return new LongAsMicrosUpdater(); } else if (sparkType instanceof DayTimeIntervalType) { @@ -176,7 +174,8 @@ public class ParquetVectorUpdaterFactory { } case INT96 -> { if (sparkType == DataTypes.TimestampNTZType) { - convertErrorForTimestampNTZ(typeName.name()); + // TimestampNTZ type does not require rebasing due to its lack of time zone context. 
+ return new BinaryToSQLTimestampUpdater(); } else if (sparkType == DataTypes.TimestampType) { final boolean failIfRebase = "EXCEPTION".equals(int96RebaseMode); if (!shouldConvertTimestamps()) { @@ -232,20 +231,6 @@ public class ParquetVectorUpdaterFactory { annotation.getUnit() == unit; } - private void validateTimestampNTZType() { -assert(logicalTypeAnnotation instanceof TimestampLogicalTypeAnnotation); -// Throw an exception if the Parquet type is TimestampLTZ as the Catalyst type is TimestampNTZ. -// This is to avoid mistakes in reading the timestamp values. -if (((TimestampLogicalTypeAnnotation) logicalTypeAnnotation).isAdjustedToUTC()) { - convertErrorForTimestampNTZ("int64 time(" + logicalTypeAnnotation + ")"); -} - } - - void convertErrorForTimestampNTZ(String parquetType) { -throw new RuntimeException("Unable to create Parquet converter for data type " + - DataTypes.TimestampNTZType.json() + " whose Parquet type is " + parquetType); - } - boolean isUnsignedIntTypeMatched(int bitWidth) { return logicalTypeAnnotation instanceof IntLogicalTypeAnnotation annotation && !annotation.isSigned() && annotation.getBitWidth() == bitWidth; diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter
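A hedged sketch of the read path this change enables; the path and column name are placeholders, and `spark` is assumed to be an active `SparkSession`:

```
// Write an ordinary TIMESTAMP (TIMESTAMP_LTZ) column to Parquet.
spark.sql("SELECT timestamp'2024-03-19 23:04:59' AS ts")
  .write.mode("overwrite").parquet("/tmp/ltz_table")   // placeholder path

// Reading that column back with a TIMESTAMP_NTZ schema used to throw the
// "Unable to create Parquet converter ..." error removed in this diff;
// after this change the read succeeds.
val df = spark.read.schema("ts TIMESTAMP_NTZ").parquet("/tmp/ltz_table")
df.printSchema()
```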
(spark) branch branch-3.4 updated: [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 922f5f686dc7 [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc 922f5f686dc7 is described below commit 922f5f686dc72433a9028bbe471a35a5b84f2855 Author: Gengliang Wang AuthorDate: Wed Mar 13 21:00:35 2024 -0700 [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc ### What changes were proposed in this pull request? Correct the preferTimestampNTZ option description in JDBC doc as per https://github.com/apache/spark/pull/45496 ### Why are the changes needed? The current doc is wrong about the jdbc option preferTimestampNTZ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45502 from gengliangwang/ntzJdbc. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit abfbd2718159d62e3322cca8c2d4ef1c29781b21) Signed-off-by: Gengliang Wang --- docs/sql-data-sources-jdbc.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md index ef11a3a77dd8..004244b7328c 100644 --- a/docs/sql-data-sources-jdbc.md +++ b/docs/sql-data-sources-jdbc.md @@ -368,8 +368,9 @@ logging into the data sources. preferTimestampNTZ false - When the option is set to true, all timestamps are inferred as TIMESTAMP WITHOUT TIME ZONE. - Otherwise, timestamps are read as TIMESTAMP with local time zone. + When the option is set to true, TIMESTAMP WITHOUT TIME ZONE type are inferred as Spark's TimestampNTZ type. + Otherwise, it is interpreted as Spark's Timestamp type(equivalent to TIMESTAMP WITHOUT LOCAL TIME ZONE). + This setting specifically affects only the inference of TIMESTAMP WITHOUT TIME ZONE data type. Both TIMESTAMP WITHOUT LOCAL TIME ZONE and TIMESTAMP WITH TIME ZONE data types are consistently interpreted as Spark's Timestamp type regardless of this setting. read - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
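To anchor the corrected wording, a sketch of a JDBC read that sets the option; the URL, table, and column layout are placeholders rather than tested values:

```
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/mydb")    // placeholder connection
  .option("dbtable", "events")                              // placeholder table
  .option("preferTimestampNTZ", "true")
  .load()

// Per the corrected description: only TIMESTAMP WITHOUT TIME ZONE columns are
// affected and become Spark's TimestampNTZ; TIMESTAMP WITH TIME ZONE (and
// WITH LOCAL TIME ZONE) columns are still read as Spark's Timestamp type.
df.printSchema()
```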
(spark) branch branch-3.5 updated: [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 706f54f69fe7 [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc 706f54f69fe7 is described below commit 706f54f69fe797027b5fcf1cfb4867811fb41c3d Author: Gengliang Wang AuthorDate: Wed Mar 13 21:00:35 2024 -0700 [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc ### What changes were proposed in this pull request? Correct the preferTimestampNTZ option description in JDBC doc as per https://github.com/apache/spark/pull/45496 ### Why are the changes needed? The current doc is wrong about the jdbc option preferTimestampNTZ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45502 from gengliangwang/ntzJdbc. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit abfbd2718159d62e3322cca8c2d4ef1c29781b21) Signed-off-by: Gengliang Wang --- docs/sql-data-sources-jdbc.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/docs/sql-data-sources-jdbc.md b/docs/sql-data-sources-jdbc.md index edcdef4bf008..d794116091fe 100644 --- a/docs/sql-data-sources-jdbc.md +++ b/docs/sql-data-sources-jdbc.md @@ -368,8 +368,9 @@ logging into the data sources. preferTimestampNTZ false - When the option is set to true, all timestamps are inferred as TIMESTAMP WITHOUT TIME ZONE. - Otherwise, timestamps are read as TIMESTAMP with local time zone. + When the option is set to true, TIMESTAMP WITHOUT TIME ZONE type are inferred as Spark's TimestampNTZ type. + Otherwise, it is interpreted as Spark's Timestamp type(equivalent to TIMESTAMP WITHOUT LOCAL TIME ZONE). + This setting specifically affects only the inference of TIMESTAMP WITHOUT TIME ZONE data type. Both TIMESTAMP WITHOUT LOCAL TIME ZONE and TIMESTAMP WITH TIME ZONE data types are consistently interpreted as Spark's Timestamp type regardless of this setting. read - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (bc0ba7dccdd5 -> abfbd2718159)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from bc0ba7dccdd5 [SPARK-41762][PYTHON][CONNECT][TESTS] Enable column name comparsion in `test_column_arithmetic_ops` add abfbd2718159 [SPARK-47375][DOC][FOLLOWUP] Correct the preferTimestampNTZ option description in JDBC doc No new revisions were added by this update. Summary of changes: docs/sql-data-sources-jdbc.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al.
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new ebe9f669e3fd [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. ebe9f669e3fd is described below commit ebe9f669e3fd4f391336c12c2e15df048eaa11bc Author: Serge Rielau AuthorDate: Wed Mar 13 13:19:29 2024 -0700 [SPARK-47344] Extend INVALID_IDENTIFIER error beyond catching '-' in an unquoted identifier and fix "IS ! NULL" et al. ### What changes were proposed in this pull request? In this PR we propose to extend the lexing of IDENTIFIER beyond what is legitimate for unquoted identifiers to include "plausible" identifiers. We then use the "exit" hook in the parser raise INVALID_IDENTIFIER error which is more meaningful than a syntax error. Specifically we allow: * general letters beyond the ASCII a-z. This will catch locale specific names * URIs which are used for table's represented by a path. As part of this PR we also found that rolling `NOT` and `!` into one token is a "bad idea". We allow: CREATE TABLE t(c1 INT ! NULL); etc. This is clearly not intended. ! is now ONLY allowed as a boolean prefix operator. ### Why are the changes needed? This change improves the user experience in case of an error by returning a more meaningful error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test suite + new unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #45470 from srielau/SPARK-47344. Lead-authored-by: Serge Rielau Co-authored-by: Wenchen Fan Signed-off-by: Gengliang Wang --- .../utils/src/main/resources/error/error-classes.json | 5 - docs/sql-error-conditions.md| 5 - .../apache/spark/sql/catalyst/parser/SqlBaseLexer.g4| 14 -- .../apache/spark/sql/catalyst/parser/SqlBaseParser.g4 | 3 ++- .../org/apache/spark/sql/catalyst/parser/parsers.scala | 12 .../apache/spark/sql/errors/QueryParsingErrors.scala| 2 +- .../spark/sql/catalyst/parser/ErrorParserSuite.scala| 17 + .../resources/sql-tests/results/ansi/keywords.sql.out | 2 ++ .../test/resources/sql-tests/results/keywords.sql.out | 1 + .../ThriftServerWithSparkContextSuite.scala | 2 +- 10 files changed, 56 insertions(+), 7 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 92c72e03e483..8272c442ddfa 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -2098,7 +2098,10 @@ }, "INVALID_IDENTIFIER" : { "message" : [ - "The identifier is invalid. Please, consider quoting it with back-quotes as ``." + "The unquoted identifier is invalid and must be back quoted as: ``.", + "Unquoted identifiers can only contain ASCII letters ('a' - 'z', 'A' - 'Z'), digits ('0' - '9'), and underbar ('_').", + "Unquoted identifiers must also not start with a digit.", + "Different data sources and meta stores may impose additional restrictions on valid identifiers." 
], "sqlState" : "42602" }, diff --git a/docs/sql-error-conditions.md b/docs/sql-error-conditions.md index efead13251c1..dba87bf0136e 100644 --- a/docs/sql-error-conditions.md +++ b/docs/sql-error-conditions.md @@ -1240,7 +1240,10 @@ For more details see [INVALID_HANDLE](sql-error-conditions-invalid-handle-error- [SQLSTATE: 42602](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation) -The identifier `` is invalid. Please, consider quoting it with back-quotes as . +The unquoted identifier `` is invalid and must be back quoted as: . +Unquoted identifiers can only contain ASCII letters ('a' - 'z', 'A' - 'Z'), digits ('0' - '9'), and underbar ('_'). +Unquoted identifiers must also not start with a digit. +Different data sources and meta stores may impose additional restrictions on valid identifiers. ### INVALID_INDEX_OF_ZERO diff --git a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 index 174887def66d..7c376e226850 100644 --- a/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4 +++ b/sql/api/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBaseLexer.g4
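To make the extended error concrete, a short sketch run through `spark.sql`; the table name is illustrative, and the second statement of course still requires the table to exist:

```
// An unquoted identifier containing '-' (or, after this change, other
// "plausible" but invalid characters) raises INVALID_IDENTIFIER with the
// longer message above instead of a generic syntax error.
spark.sql("SELECT * FROM my-table")        // fails with INVALID_IDENTIFIER

// Back-quoting the identifier, as the message suggests, parses cleanly.
spark.sql("SELECT * FROM `my-table`")
```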
(spark) branch branch-3.4 updated: [SPARK-47368][SQL]][3.5] Remove inferTimestampNTZ config check in ParquetRo…
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 60b4c0b690ad [SPARK-47368][SQL]][3.5] Remove inferTimestampNTZ config check in ParquetRo… 60b4c0b690ad is described below commit 60b4c0b690ad980ff4eef93180c70d6e64e5e347 Author: Gengliang Wang AuthorDate: Tue Mar 12 22:42:45 2024 -0700 [SPARK-47368][SQL]][3.5] Remove inferTimestampNTZ config check in ParquetRo… ### What changes were proposed in this pull request? The configuration `spark.sql.parquet.inferTimestampNTZ.enabled` is not related the parquet row converter. This PR is the remove the config check `spark.sql.parquet.inferTimestampNTZ.enabled` in the ParquetRowConverter ### Why are the changes needed? Bug fix. Otherwise reading TimestampNTZ columns may fail when `spark.sql.parquet.inferTimestampNTZ.enabled` is disabled. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #45492 from gengliangwang/PR_TOOL_PICK_PR_45480_BRANCH-3.5. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit 3018a5d8cd96a569b3bfe7e11b4b26fb4fb54f32) Signed-off-by: Gengliang Wang --- .../datasources/parquet/ParquetRowConverter.scala | 9 +++--- .../parquet/ParquetSchemaConverter.scala | 7 - .../datasources/parquet/ParquetQuerySuite.scala| 36 +- 3 files changed, 25 insertions(+), 27 deletions(-) diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala index 9101e7d0ac52..1e07c6db2a06 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRowConverter.scala @@ -505,11 +505,10 @@ private[parquet] class ParquetRowConverter( // can be read as Spark's TimestampNTZ type. This is to avoid mistakes in reading the timestamp // values. private def canReadAsTimestampNTZ(parquetType: Type): Boolean = -schemaConverter.isTimestampNTZEnabled() && - parquetType.asPrimitiveType().getPrimitiveTypeName == INT64 && - parquetType.getLogicalTypeAnnotation.isInstanceOf[TimestampLogicalTypeAnnotation] && - !parquetType.getLogicalTypeAnnotation -.asInstanceOf[TimestampLogicalTypeAnnotation].isAdjustedToUTC +parquetType.asPrimitiveType().getPrimitiveTypeName == INT64 && + parquetType.getLogicalTypeAnnotation.isInstanceOf[TimestampLogicalTypeAnnotation] && +!parquetType.getLogicalTypeAnnotation + .asInstanceOf[TimestampLogicalTypeAnnotation].isAdjustedToUTC /** * Parquet converter for strings. A dictionary is used to minimize string decoding cost. 
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala index 9c9e7ce729c1..a78b96ae6fcc 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala @@ -72,13 +72,6 @@ class ParquetToSparkSchemaConverter( inferTimestampNTZ = conf.get(SQLConf.PARQUET_INFER_TIMESTAMP_NTZ_ENABLED.key).toBoolean, nanosAsLong = conf.get(SQLConf.LEGACY_PARQUET_NANOS_AS_LONG.key).toBoolean) - /** - * Returns true if TIMESTAMP_NTZ type is enabled in this ParquetToSparkSchemaConverter. - */ - def isTimestampNTZEnabled(): Boolean = { -inferTimestampNTZ - } - /** * Converts Parquet [[MessageType]] `parquetSchema` to a Spark SQL [[StructType]]. */ diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala index 828ec39c7d72..29cb224c8787 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala @@ -160,21 +160,27 @@ abstract class ParquetQuerySuite extends QueryTest with ParquetTest with SharedS } } - test("SPARK-36182: writing and reading TimestampNTZType column") { -withTable("ts") { - sql("create table ts (c1
(spark) branch branch-3.5 updated (6629b9a6a5ae -> 3018a5d8cd96)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git from 6629b9a6a5ae [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files add 3018a5d8cd96 [SPARK-47368][SQL]][3.5] Remove inferTimestampNTZ config check in ParquetRo… No new revisions were added by this update. Summary of changes: .../datasources/parquet/ParquetRowConverter.scala | 9 +++--- .../parquet/ParquetSchemaConverter.scala | 7 - .../datasources/parquet/ParquetQuerySuite.scala| 36 +- 3 files changed, 25 insertions(+), 27 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (5d32e62436dc -> 625589f736fe)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 5d32e62436dc [MINOR][TESTS] Enable nullability check in `test_create_dataframe_from_arrays` add 625589f736fe [SPARK-47368][SQL] Remove inferTimestampNTZ config check in ParquetRowConverter No new revisions were added by this update. Summary of changes: .../datasources/parquet/ParquetRowConverter.scala | 14 - .../parquet/ParquetSchemaConverter.scala | 7 - .../datasources/parquet/ParquetQuerySuite.scala| 36 +- 3 files changed, 27 insertions(+), 30 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.4 updated: [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new 982fbc5b63e6 [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files 982fbc5b63e6 is described below commit 982fbc5b63e61cbc280f8049caf60fbb6e178423 Author: Gengliang Wang AuthorDate: Tue Mar 12 15:11:34 2024 -0700 [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files ### What changes were proposed in this pull request? Add migration doc: TimestampNTZ type inference on Parquet files ### Why are the changes needed? Update docs. The behavior change was not mentioned in the SQL migration guide ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45482 from gengliangwang/ntzMigrationDoc. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit 621f2c88f3e56257ee517d65e093d32fb44b783e) Signed-off-by: Gengliang Wang --- docs/sql-migration-guide.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 1ad6c8faa3db..b83745e75c79 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -43,6 +43,7 @@ license: | - Since Spark 3.4, vectorized readers are enabled by default for the nested data types (array, map and struct). To restore the legacy behavior, set `spark.sql.orc.enableNestedColumnVectorizedReader` and `spark.sql.parquet.enableNestedColumnVectorizedReader` to `false`. - Since Spark 3.4, `BinaryType` is not supported in CSV datasource. In Spark 3.3 or earlier, users can write binary columns in CSV datasource, but the output content in CSV files is `Object.toString()` which is meaningless; meanwhile, if users read CSV tables with binary columns, Spark will throw an `Unsupported type: binary` exception. - Since Spark 3.4, bloom filter joins are enabled by default. To restore the legacy behavior, set `spark.sql.optimizer.runtime.bloomFilter.enabled` to `false`. + - Since Spark 3.4, when schema inference on external Parquet files, INT64 timestamps with annotation `isAdjustedToUTC=false` will be inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. ## Upgrading from Spark SQL 3.2 to 3.3 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch branch-3.5 updated: [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 6629b9a6a5ae [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files 6629b9a6a5ae is described below commit 6629b9a6a5ae35db486dc69ce1ce5a86246daf1d Author: Gengliang Wang AuthorDate: Tue Mar 12 15:11:34 2024 -0700 [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files ### What changes were proposed in this pull request? Add migration doc: TimestampNTZ type inference on Parquet files ### Why are the changes needed? Update docs. The behavior change was not mentioned in the SQL migration guide ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45482 from gengliangwang/ntzMigrationDoc. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit 621f2c88f3e56257ee517d65e093d32fb44b783e) Signed-off-by: Gengliang Wang --- docs/sql-migration-guide.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 2eba9500e907..0e54c33c6d12 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -49,6 +49,7 @@ license: | - Since Spark 3.4, vectorized readers are enabled by default for the nested data types (array, map and struct). To restore the legacy behavior, set `spark.sql.orc.enableNestedColumnVectorizedReader` and `spark.sql.parquet.enableNestedColumnVectorizedReader` to `false`. - Since Spark 3.4, `BinaryType` is not supported in CSV datasource. In Spark 3.3 or earlier, users can write binary columns in CSV datasource, but the output content in CSV files is `Object.toString()` which is meaningless; meanwhile, if users read CSV tables with binary columns, Spark will throw an `Unsupported type: binary` exception. - Since Spark 3.4, bloom filter joins are enabled by default. To restore the legacy behavior, set `spark.sql.optimizer.runtime.bloomFilter.enabled` to `false`. + - Since Spark 3.4, when schema inference on external Parquet files, INT64 timestamps with annotation `isAdjustedToUTC=false` will be inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. ## Upgrading from Spark SQL 3.2 to 3.3 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 621f2c88f3e5 [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files 621f2c88f3e5 is described below commit 621f2c88f3e56257ee517d65e093d32fb44b783e Author: Gengliang Wang AuthorDate: Tue Mar 12 15:11:34 2024 -0700 [SPARK-47370][DOC] Add migration doc: TimestampNTZ type inference on Parquet files ### What changes were proposed in this pull request? Add migration doc: TimestampNTZ type inference on Parquet files ### Why are the changes needed? Update docs. The behavior change was not mentioned in the SQL migration guide ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? It's just doc change ### Was this patch authored or co-authored using generative AI tooling? No Closes #45482 from gengliangwang/ntzMigrationDoc. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- docs/sql-migration-guide.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md index 3d0c7280496a..9f92d6fc8347 100644 --- a/docs/sql-migration-guide.md +++ b/docs/sql-migration-guide.md @@ -67,6 +67,7 @@ license: | - Since Spark 3.4, vectorized readers are enabled by default for the nested data types (array, map and struct). To restore the legacy behavior, set `spark.sql.orc.enableNestedColumnVectorizedReader` and `spark.sql.parquet.enableNestedColumnVectorizedReader` to `false`. - Since Spark 3.4, `BinaryType` is not supported in CSV datasource. In Spark 3.3 or earlier, users can write binary columns in CSV datasource, but the output content in CSV files is `Object.toString()` which is meaningless; meanwhile, if users read CSV tables with binary columns, Spark will throw an `Unsupported type: binary` exception. - Since Spark 3.4, bloom filter joins are enabled by default. To restore the legacy behavior, set `spark.sql.optimizer.runtime.bloomFilter.enabled` to `false`. + - Since Spark 3.4, when schema inference on external Parquet files, INT64 timestamps with annotation `isAdjustedToUTC=false` will be inferred as TimestampNTZ type instead of Timestamp type. To restore the legacy behavior, set `spark.sql.parquet.inferTimestampNTZ.enabled` to `false`. ## Upgrading from Spark SQL 3.2 to 3.3 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
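A brief sketch of the behavior the new migration note describes, assuming `spark.sql.parquet.inferTimestampNTZ.enabled` can be toggled as a runtime SQL conf and using a placeholder path to a Parquet file not written by Spark:

```
// Default since 3.4: INT64 timestamps annotated isAdjustedToUTC=false are
// inferred as TIMESTAMP_NTZ during schema inference.
spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "true")
spark.read.parquet("/data/external_parquet").printSchema()   // placeholder path

// Legacy behavior: the same columns are inferred as TIMESTAMP (LTZ).
spark.conf.set("spark.sql.parquet.inferTimestampNTZ.enabled", "false")
spark.read.parquet("/data/external_parquet").printSchema()
```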
(spark) branch branch-3.5 updated: [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 5067447bf9a4 [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option 5067447bf9a4 is described below commit 5067447bf9a420b2f972a03351058ebfa61e0e41 Author: Gengliang Wang AuthorDate: Fri Feb 16 18:21:19 2024 -0800 [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/39856. The configuration changes should be reflected in the Parquet data source doc ### Why are the changes needed? To fix doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview: https://github.com/apache/spark/assets/1097932/618df731-49ad-49e7-afa2-22381cb3bbef;> ### Was this patch authored or co-authored using generative AI tooling? No Closes #45145 from gengliangwang/changeConfigName. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit dc2f2673a73ccde44b59cada00e95e869ad64c01) Signed-off-by: Gengliang Wang --- docs/sql-data-sources-parquet.md | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md index f49bbd7a9d04..707871e79802 100644 --- a/docs/sql-data-sources-parquet.md +++ b/docs/sql-data-sources-parquet.md @@ -616,14 +616,15 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession 3.3.0 - spark.sql.parquet.timestampNTZ.enabled + spark.sql.parquet.inferTimestampNTZ.enabled true -Enables TIMESTAMP_NTZ support for Parquet reads and writes. -When enabled, TIMESTAMP_NTZ values are written as Parquet timestamp -columns with annotation isAdjustedToUTC = false and are inferred in a similar way. -When disabled, such values are read as TIMESTAMP_LTZ and have to be -converted to TIMESTAMP_LTZ for writes. +When enabled, Parquet timestamp columns with annotation isAdjustedToUTC = false +are inferred as TIMESTAMP_NTZ type during schema inference. Otherwise, all the Parquet +timestamp columns are inferred as TIMESTAMP_LTZ types. Note that Spark writes the +output schema into Parquet's footer metadata on file writing and leverages it on file +reading. Thus this configuration only affects the schema inference on Parquet files +which are not written by Spark. 3.4.0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new dc2f2673a73c [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option dc2f2673a73c is described below commit dc2f2673a73ccde44b59cada00e95e869ad64c01 Author: Gengliang Wang AuthorDate: Fri Feb 16 18:21:19 2024 -0800 [SPARK-42285][DOC] Update Parquet data source doc on the timestamp_ntz inference option ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/39856. The configuration changes should be reflected in the Parquet data source doc ### Why are the changes needed? To fix doc ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Preview: https://github.com/apache/spark/assets/1097932/618df731-49ad-49e7-afa2-22381cb3bbef;> ### Was this patch authored or co-authored using generative AI tooling? No Closes #45145 from gengliangwang/changeConfigName. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- docs/sql-data-sources-parquet.md | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/docs/sql-data-sources-parquet.md b/docs/sql-data-sources-parquet.md index e944db24d76b..f5c5ccd3b89a 100644 --- a/docs/sql-data-sources-parquet.md +++ b/docs/sql-data-sources-parquet.md @@ -616,14 +616,15 @@ Configuration of Parquet can be done via `spark.conf.set` or by running 3.3.0 - spark.sql.parquet.timestampNTZ.enabled + spark.sql.parquet.inferTimestampNTZ.enabled true -Enables TIMESTAMP_NTZ support for Parquet reads and writes. -When enabled, TIMESTAMP_NTZ values are written as Parquet timestamp -columns with annotation isAdjustedToUTC = false and are inferred in a similar way. -When disabled, such values are read as TIMESTAMP_LTZ and have to be -converted to TIMESTAMP_LTZ for writes. +When enabled, Parquet timestamp columns with annotation isAdjustedToUTC = false +are inferred as TIMESTAMP_NTZ type during schema inference. Otherwise, all the Parquet +timestamp columns are inferred as TIMESTAMP_LTZ types. Note that Spark writes the +output schema into Parquet's footer metadata on file writing and leverages it on file +reading. Thus this configuration only affects the schema inference on Parquet files +which are not written by Spark. 3.4.0 - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46849][SQL][FOLLOWUP] Column default value cannot reference session variables
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 1e13243ca394 [SPARK-46849][SQL][FOLLOWUP] Column default value cannot reference session variables 1e13243ca394 is described below commit 1e13243ca394b04e0b1d2972d7c8eab2c63414e5 Author: Wenchen Fan AuthorDate: Mon Feb 5 13:31:58 2024 -0800 [SPARK-46849][SQL][FOLLOWUP] Column default value cannot reference session variables ### What changes were proposed in this pull request? One more followup of https://github.com/apache/spark/pull/44876 . Previously, by using a fake analyzer, session variables can't be resolved and thus can't be in the default value expression. Now we use the actual analyzer and optimizer, session variables can be properly resolved and replaced with literals at the end. This is not expected as default value shouldn't references temporary things. This PR fixes this by explicitly failing the check if default value references session variables. ### Why are the changes needed? fix behavior changes. ### Does this PR introduce _any_ user-facing change? no, the behavior change is not released. ### How was this patch tested? new test ### Was this patch authored or co-authored using generative AI tooling? no Closes #45032 from cloud-fan/default-value. Authored-by: Wenchen Fan Signed-off-by: Gengliang Wang --- .../catalyst/util/ResolveDefaultColumnsUtil.scala | 5 ++- .../org/apache/spark/sql/sources/InsertSuite.scala | 36 +- 2 files changed, 39 insertions(+), 2 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala index da03de73557f..a2bfc6e08da8 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/ResolveDefaultColumnsUtil.scala @@ -468,10 +468,13 @@ object ResolveDefaultColumns extends QueryErrorsBase } // Our analysis check passes here. We do not further inspect whether the // expression is `foldable` here, as the plan is not optimized yet. -} else if (default.references.nonEmpty) { +} + +if (default.references.nonEmpty || default.exists(_.isInstanceOf[VariableReference])) { // Ideally we should let the rest of `CheckAnalysis` report errors about why the default // expression is unresolved. But we should report a better error here if the default // expression references columns, which means it's not a constant for sure. + // Note that, session variable should be considered as non-constant as well. 
throw QueryCompilationErrors.defaultValueNotConstantError( statement, colName, default.originalSQL) } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala index 704df9d78ffa..2cc434318aa2 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala @@ -28,6 +28,7 @@ import org.apache.spark.sql._ import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType} import org.apache.spark.sql.catalyst.parser.ParseException +import org.apache.spark.sql.connector.FakeV2Provider import org.apache.spark.sql.execution.datasources.DataSourceUtils import org.apache.spark.sql.internal.SQLConf import org.apache.spark.sql.internal.SQLConf.PartitionOverwriteMode @@ -1150,7 +1151,7 @@ class InsertSuite extends DataSourceTest with SharedSparkSession { } test("SPARK-38336 INSERT INTO statements with tables with default columns: negative tests") { -// The default value fails to analyze. +// The default value references columns. withTable("t") { checkError( exception = intercept[AnalysisException] { @@ -1162,6 +1163,39 @@ class InsertSuite extends DataSourceTest with SharedSparkSession { "colName" -> "`s`", "defaultValue" -> "badvalue")) } +try { + // The default value references session variables. + sql("DECLARE test_var INT") + withTable("t") { +checkError( + exception = intercept[AnalysisException] { +sql("create table t(i boolean, s int default test_var) using parquet") + }, + // V1 command still u
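To make the new negative test concrete, a sketch of the statements it exercises, expressed as `spark.sql` calls; the exact error class is not reproduced here because the diff only surfaces it through `checkError`:

```
// Session variables are temporary, so a column DEFAULT may not reference them.
// Analysis now fails via QueryCompilationErrors.defaultValueNotConstantError.
spark.sql("DECLARE test_var INT")
spark.sql("CREATE TABLE t(i BOOLEAN, s INT DEFAULT test_var) USING parquet")     // throws AnalysisException

// Defaults that reference columns were already rejected by the same check.
spark.sql("CREATE TABLE t2(i BOOLEAN, s BIGINT DEFAULT badvalue) USING parquet") // throws AnalysisException
```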
(spark) branch master updated: [SPARK-46964][SQL] Change the signature of the hllInvalidLgK query execution error to take an integer as 4th argument
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 0965412d5174 [SPARK-46964][SQL] Change the signature of the hllInvalidLgK query execution error to take an integer as 4th argument 0965412d5174 is described below commit 0965412d517441a15d4da0b5fc8fe34a9b5ec40f Author: Menelaos Karavelas AuthorDate: Fri Feb 2 11:55:21 2024 -0800 [SPARK-46964][SQL] Change the signature of the hllInvalidLgK query execution error to take an integer as 4th argument ### What changes were proposed in this pull request? The current signature of the `hllInvalidLgK` query execution error takes four arguments: 1. The SQL function (a string). 2. The minimum possible `lgk` value (an integer). 3. The maximum possible `lgk` value (an integer). 4. The actual invalid `lgk` value (a string). There is no meaningful reason for the 4th argument to be a string. In this PR we change it to be an integer, just like the minimum and maximum valid values. ### Why are the changes needed? Seeking to make the signature of the `hllInvalidLgK` error more meaningful and self-consistent. ### Does this PR introduce _any_ user-facing change? No, there is no user-facing changes because of this PR. This is just an internal change. ### How was this patch tested? Existing tests suffice. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44995 from mkaravel/hll-invalid-lgk-error-arg. Authored-by: Menelaos Karavelas Signed-off-by: Gengliang Wang --- .../sql/catalyst/expressions/aggregate/datasketchesAggregates.scala | 2 +- .../main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala index 595ae32d77b9..02925f3625d2 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/datasketchesAggregates.scala @@ -196,7 +196,7 @@ object HllSketchAgg { def checkLgK(lgConfigK: Int): Unit = { if (lgConfigK < minLgConfigK || lgConfigK > maxLgConfigK) { throw QueryExecutionErrors.hllInvalidLgK(function = "hll_sketch_agg", -min = minLgConfigK, max = maxLgConfigK, value = lgConfigK.toString) +min = minLgConfigK, max = maxLgConfigK, value = lgConfigK) } } } diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala index 9ff076c5fd50..af5cafdc8a3a 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala @@ -2601,14 +2601,14 @@ private[sql] object QueryExecutionErrors extends QueryErrorsBase with ExecutionE cause = e) } - def hllInvalidLgK(function: String, min: Int, max: Int, value: String): Throwable = { + def hllInvalidLgK(function: String, min: Int, max: Int, value: Int): Throwable = { new SparkRuntimeException( errorClass = "HLL_INVALID_LG_K", messageParameters = Map( "function" -> toSQLId(function), "min" -> toSQLValue(min, IntegerType), "max" -> 
toSQLValue(max, IntegerType), -"value" -> value)) +"value" -> toSQLValue(value, IntegerType))) } def hllInvalidInputSketchBuffer(function: String): Throwable = { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
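For the `hllInvalidLgK` change above, a minimal illustrative Scala sketch (assuming a running SparkSession named `spark`, and assuming an `lgConfigK` of 2 falls outside the range accepted by the DataSketches library) of how the error surfaces to a user; with this change the offending value is passed as an `Int` and rendered via `toSQLValue(value, IntegerType)`:
```
// Hypothetical repro: hll_sketch_agg validates lgConfigK before building the sketch,
// so an out-of-range value raises HLL_INVALID_LG_K via QueryExecutionErrors.hllInvalidLgK.
spark.sql("SELECT hll_sketch_agg(col, 2) FROM VALUES (1), (2), (3) AS tab(col)").collect()
// => org.apache.spark.SparkRuntimeException: [HLL_INVALID_LG_K] ...
//    the "value" message parameter is now formatted as an integer, consistent with min/max.
```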
(spark) branch branch-3.5 updated: [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new d3e30848084 [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website d3e30848084 is described below commit d3e3084808453769ba0cd4278ee8650e40c185ea Author: Gengliang Wang AuthorDate: Wed Jan 10 09:32:30 2024 +0900 [SPARK-46637][DOCS] Enhancing the Visual Appeal of Spark doc website ### What changes were proposed in this pull request? Enhance the visual appeal of the Spark doc website after https://github.com/apache/spark/pull/40269: 1. There is a weird indent on the top right side of the first paragraph of the Spark 3.5.0 doc overview page. Before this PR: https://github.com/apache/spark/assets/1097932/84d21ca1-a4d0-4bd4-8f20-a34fa5db4000 After this PR: https://github.com/apache/spark/assets/1097932/4ffc0d5a-ed75-44c5-b20a-475ff401afa8 2. All the titles are too big and therefore less readable. In the website https://spark.apache.org/downloads.html, titles are h2 while in the doc site https://spark.apache.org/docs/latest/ titles are h1. So we should make the font size of titles smaller. Before this PR: https://github.com/apache/spark/assets/1097932/5bbbd9eb-432a-42c0-98be-ff00a9099cd6 After this PR: https://github.com/apache/spark/assets/1097932/dc94c1fb-6ac1-41a8-b4a4-19b3034125d7 3. The banner image can't be displayed correctly. Even when it shows up, it will be overlapped by the text. To make it simple, let's not show the banner image, as we did in https://spark.apache.org/docs/3.4.2/ https://github.com/apache/spark/assets/1097932/f6d34261-a352-44e2-9633-6e96b311a0b3 https://github.com/apache/spark/assets/1097932/c49ce6b6-13d9-4d8f-97a9-7ed8b037be57 ### Why are the changes needed? Improve the visual appeal of the Spark doc website ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build doc and verify on local setup. ### Was this patch authored or co-authored using generative AI tooling? No Closes #44642 from gengliangwang/enhance_doc. Authored-by: Gengliang Wang Signed-off-by: Hyukjin Kwon --- docs/_layouts/global.html | 26 +++--- docs/css/custom.css| 35 ++- docs/img/spark-hero-thin-light.jpg | Bin 278664 -> 0 bytes 3 files changed, 25 insertions(+), 36 deletions(-) diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index 8c4435fdf31..5116472eaa7 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -138,25 +138,21 @@ {% if page.url == "/" %} - - Apache Spark - A Unified engine for large-scale data analytics - - -Apache Spark is a unified analytics engine for large-scale data processing. -It provides high-level APIs in Java, Scala, Python and R, -and an optimized engine that supports general execution graphs. -It also supports a rich set of higher-level tools including -Spark SQL for SQL and structured data processing, -pandas API on Spark for pandas workloads, -MLlib for machine learning, -GraphX for graph processing, - and Structured Streaming - for incremental computation and stream processing. - + + Apache Spark is a unified analytics engine for large-scale data processing. + It provides high-level APIs in Java, Scala, Python and R, + and an optimized engine that supports general execution graphs. 
+ It also supports a rich set of higher-level tools including + Spark SQL for SQL and structured data processing, + pandas API on Spark for pandas workloads, + MLlib for machine learning, + GraphX for graph processing, + and Structured Streaming + for incremental computation and stream processing. diff --git a/docs/css/custom.css b/docs/css/custom.css index 1239c0ed440..8158938866c 100644 --- a/docs/css/custom.css +++ b/docs/css/custom.css @@ -95,18 +95,7 @@ section { border-color: transparent; } -.hero-banner .bg { - background: url(/img/spark-hero-thin-light.jpg) no-repeat; - transform: translate(36%, 0%); - height: 475px; - top: 0;
(spark) branch branch-3.5 updated: [SPARK-46396][SQL] Timestamp inference should not throw exception
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 908c472728f2 [SPARK-46396][SQL] Timestamp inference should not throw exception 908c472728f2 is described below commit 908c472728f24034baf0b59f03b04ca148eabeca Author: Gengliang Wang AuthorDate: Thu Dec 14 00:06:22 2023 -0800 [SPARK-46396][SQL] Timestamp inference should not throw exception ### What changes were proposed in this pull request? When setting `spark.sql.legacy.timeParserPolicy=LEGACY`, Spark will use the LegacyFastTimestampFormatter to infer potential timestamp columns. The inference shouldn't throw an exception. However, when the input is 23012150952, an exception is thrown: ``` For input string: "23012150952" java.lang.NumberFormatException: For input string: "23012150952" at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67) at java.base/java.lang.Integer.parseInt(Integer.java:668) at java.base/java.lang.Integer.parseInt(Integer.java:786) at org.apache.commons.lang3.time.FastDateParser$NumberStrategy.parse(FastDateParser.java:304) at org.apache.commons.lang3.time.FastDateParser.parse(FastDateParser.java:1045) at org.apache.commons.lang3.time.FastDateFormat.parse(FastDateFormat.java:651) at org.apache.spark.sql.catalyst.util.LegacyFastTimestampFormatter.parseOptional(TimestampFormatter.scala:418) ``` This PR is to fix the issue. ### Why are the changes needed? Bug fix: timestamp inference should not throw an exception. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? New test case + existing tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #44338 from gengliangwang/fixParseOptional. 
Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang (cherry picked from commit 4a79ae9d821e9b04fbe949251050c3e4819dff92) Signed-off-by: Gengliang Wang --- .../apache/spark/sql/catalyst/util/TimestampFormatter.scala | 12 .../spark/sql/catalyst/util/TimestampFormatterSuite.scala| 3 ++- 2 files changed, 10 insertions(+), 5 deletions(-) diff --git a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala index 55eee41c14ca..0866cee9334c 100644 --- a/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala +++ b/sql/api/src/main/scala/org/apache/spark/sql/catalyst/util/TimestampFormatter.scala @@ -414,10 +414,14 @@ class LegacyFastTimestampFormatter( override def parseOptional(s: String): Option[Long] = { cal.clear() // Clear the calendar because it can be re-used many times -if (fastDateFormat.parse(s, new ParsePosition(0), cal)) { - Some(extractMicros(cal)) -} else { - None +try { + if (fastDateFormat.parse(s, new ParsePosition(0), cal)) { +Some(extractMicros(cal)) + } else { +None + } +} catch { + case NonFatal(_) => None } } diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala index 2134a0d6ecd3..27d60815766d 100644 --- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala +++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/TimestampFormatterSuite.scala @@ -502,10 +502,11 @@ class TimestampFormatterSuite extends DatetimeFormatterSuite { assert(fastFormatter.parseOptional("2023-12-31 23:59:59.9990").contains(170406719000L)) assert(fastFormatter.parseOptional("abc").isEmpty) +assert(fastFormatter.parseOptional("23012150952").isEmpty) assert(simpleFormatter.parseOptional("2023-12-31 23:59:59.9990").contains(170406720899L)) assert(simpleFormatter.parseOptional("abc").isEmpty) - +assert(simpleFormatter.parseOptional("23012150952").isEmpty) } test("SPARK-45424: do not return optional parse results when only prefix match") { - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
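For the timestamp-inference fix above, a minimal illustrative Scala sketch (hypothetical file path; assumes a SparkSession named `spark`) of the scenario described in the commit message — under the legacy parser policy, schema inference probes candidate timestamp strings with `parseOptional`, which now returns None instead of throwing:
```
// Use the legacy parser policy so LegacyFastTimestampFormatter is used for inference.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

// /tmp/sample.csv is a hypothetical file whose single column contains the value 23012150952.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/tmp/sample.csv")

// Before the fix: inference could abort with java.lang.NumberFormatException.
// After the fix: parseOptional returns None for that value and inference falls back
// to a non-timestamp type.
df.printSchema()
```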
(spark) branch master updated (f1e5a136fa79 -> 4a79ae9d821e)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from f1e5a136fa79 [SPARK-46393][SQL] Classify exceptions in the JDBC table catalog add 4a79ae9d821e [SPARK-46396][SQL] Timestamp inference should not throw exception No new revisions were added by this update. Summary of changes: .../apache/spark/sql/catalyst/util/TimestampFormatter.scala | 12 .../spark/sql/catalyst/util/TimestampFormatterSuite.scala| 3 ++- 2 files changed, 10 insertions(+), 5 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark-website) branch asf-site updated: Fix the CSS of Spark 3.5.0 doc's generated tables (#492)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch asf-site in repository https://gitbox.apache.org/repos/asf/spark-website.git The following commit(s) were added to refs/heads/asf-site by this push: new 0ceaaaf528 Fix the CSS of Spark 3.5.0 doc's generated tables (#492) 0ceaaaf528 is described below commit 0ceaaaf528ec1d0201e1eab1288f37cce607268b Author: Gengliang Wang AuthorDate: Thu Nov 30 15:06:18 2023 -0800 Fix the CSS of Spark 3.5.0 doc's generated tables (#492) After https://github.com/apache/spark/pull/40269, there is no border in the generated tables of Spark doc(for example, [sql-ref-ansi-compliance.html](https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html)) . Currently only the doc of Spark 3.5.0 is affected. This PR is to apply the changes in https://github.com/apache/spark/pull/44096 on the current Spark 3.5.0 doc by 1. change the `site/docs/3.5.0/css/custom.css` 2. Execute `sed -i '' 's/table class="table table-striped"/table/' *.html` under `site/docs/3.5.0/` directory. This should be a safe change. I have verified it on my local env. --- site/docs/3.5.0/building-spark.html| 2 +- site/docs/3.5.0/cluster-overview.html | 2 +- site/docs/3.5.0/configuration.html | 40 +++--- site/docs/3.5.0/css/custom.css | 13 +++ site/docs/3.5.0/ml-classification-regression.html | 14 site/docs/3.5.0/ml-clustering.html | 8 ++--- .../3.5.0/mllib-classification-regression.html | 2 +- site/docs/3.5.0/mllib-decision-tree.html | 2 +- site/docs/3.5.0/mllib-ensembles.html | 2 +- site/docs/3.5.0/mllib-evaluation-metrics.html | 10 +++--- site/docs/3.5.0/mllib-linear-methods.html | 4 +-- site/docs/3.5.0/mllib-pmml-model-export.html | 2 +- site/docs/3.5.0/monitoring.html| 10 +++--- site/docs/3.5.0/rdd-programming-guide.html | 8 ++--- site/docs/3.5.0/running-on-kubernetes.html | 8 ++--- site/docs/3.5.0/running-on-mesos.html | 2 +- site/docs/3.5.0/running-on-yarn.html | 8 ++--- site/docs/3.5.0/security.html | 26 +++--- site/docs/3.5.0/spark-standalone.html | 12 +++ site/docs/3.5.0/sparkr.html| 6 ++-- site/docs/3.5.0/sql-data-sources-avro.html | 12 +++ site/docs/3.5.0/sql-data-sources-csv.html | 2 +- site/docs/3.5.0/sql-data-sources-hive-tables.html | 4 +-- site/docs/3.5.0/sql-data-sources-jdbc.html | 2 +- site/docs/3.5.0/sql-data-sources-json.html | 2 +- .../sql-data-sources-load-save-functions.html | 2 +- site/docs/3.5.0/sql-data-sources-orc.html | 4 +-- site/docs/3.5.0/sql-data-sources-parquet.html | 4 +-- site/docs/3.5.0/sql-data-sources-text.html | 2 +- .../sql-distributed-sql-engine-spark-sql-cli.html | 4 +-- .../docs/3.5.0/sql-error-conditions-sqlstates.html | 26 +++--- site/docs/3.5.0/sql-migration-guide.html | 4 +-- site/docs/3.5.0/sql-performance-tuning.html| 16 - site/docs/3.5.0/storage-openstack-swift.html | 2 +- site/docs/3.5.0/streaming-custom-receivers.html| 2 +- site/docs/3.5.0/streaming-programming-guide.html | 10 +++--- .../structured-streaming-kafka-integration.html| 20 +-- .../structured-streaming-programming-guide.html| 12 +++ site/docs/3.5.0/submitting-applications.html | 2 +- site/docs/3.5.0/web-ui.html| 2 +- 40 files changed, 164 insertions(+), 151 deletions(-) diff --git a/site/docs/3.5.0/building-spark.html b/site/docs/3.5.0/building-spark.html index 0af9dd6517..672d686bc3 100644 --- a/site/docs/3.5.0/building-spark.html +++ b/site/docs/3.5.0/building-spark.html @@ -481,7 +481,7 @@ Change the major Scala version using (e.g. 
2.13): Related environment variables - + Variable NameDefaultMeaning SPARK_PROJECT_URL diff --git a/site/docs/3.5.0/cluster-overview.html b/site/docs/3.5.0/cluster-overview.html index d6015a8686..552b24b729 100644 --- a/site/docs/3.5.0/cluster-overview.html +++ b/site/docs/3.5.0/cluster-overview.html @@ -216,7 +216,7 @@ The job scheduling overview describes this in The following table summarizes terms youll see used to refer to cluster concepts: - + TermMeaning diff --git a/site/docs/3.5.0/configuration.html b/site/docs/3.5.0/configuration.html index d6c9255302..3ca1684ffd 100644 --- a/site/docs/3.5.0/configuration.html +++ b/site/docs/3.5.0/configuration.html @@ -309,7 +309,7 @@ of the most common options to set are: Application Properties - + Property NameDefaultMeaningSince Version spark.app.name @@ -694,7 +694,7 @@ of the most common options to set are: Runtime Environment - +
(spark) branch branch-3.5 updated: [SPARK-46188][DOC][3.5] Fix the CSS of Spark doc's generated tables
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 00bb4ad46e37 [SPARK-46188][DOC][3.5] Fix the CSS of Spark doc's generated tables 00bb4ad46e37 is described below commit 00bb4ad46e373311a6303952f3944680b08e03d7 Author: Gengliang Wang AuthorDate: Thu Nov 30 14:56:48 2023 -0800 [SPARK-46188][DOC][3.5] Fix the CSS of Spark doc's generated tables ### What changes were proposed in this pull request? After https://github.com/apache/spark/pull/40269, there is no border in the generated tables of Spark doc(for example, https://spark.apache.org/docs/latest/sql-ref-ansi-compliance.html) . This PR is to fix it by restoring part of the table style in https://github.com/apache/spark/pull/40269/files#diff-309b964023ca899c9505205f36d3f4d5b36a6487e5c9b2e242204ee06bbc9ce9L26 This PR also unifies all the styles of tables by removing the `class="table table-striped"` in HTML style tables in markdown docs. ### Why are the changes needed? Fix a regression in the table CSS of Spark docs ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build docs and verify. Before changes: https://github.com/apache/spark/assets/1097932/1eb7abff-65af-4c4c-bbd5-9077f38c1b43;> After changes: https://github.com/apache/spark/assets/1097932/be77d4c6-1279-43ec-a234-b69ee02e3dc6;> ### Was this patch authored or co-authored using generative AI tooling? Generated-by: ChatGPT 4 Closes #44097 from gengliangwang/fixTable3.5. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- docs/building-spark.md | 2 +- docs/cluster-overview.md | 2 +- docs/configuration.md| 40 docs/css/custom.css | 13 docs/ml-classification-regression.md | 14 - docs/ml-clustering.md| 8 ++--- docs/mllib-classification-regression.md | 2 +- docs/mllib-decision-tree.md | 2 +- docs/mllib-ensembles.md | 2 +- docs/mllib-evaluation-metrics.md | 10 +++--- docs/mllib-linear-methods.md | 4 +-- docs/mllib-pmml-model-export.md | 2 +- docs/monitoring.md | 10 +++--- docs/rdd-programming-guide.md| 8 ++--- docs/running-on-kubernetes.md| 8 ++--- docs/running-on-mesos.md | 2 +- docs/running-on-yarn.md | 8 ++--- docs/security.md | 26 +++ docs/spark-standalone.md | 12 +++ docs/sparkr.md | 6 ++-- docs/sql-data-sources-avro.md| 12 +++ docs/sql-data-sources-csv.md | 2 +- docs/sql-data-sources-hive-tables.md | 4 +-- docs/sql-data-sources-jdbc.md| 2 +- docs/sql-data-sources-json.md| 2 +- docs/sql-data-sources-load-save-functions.md | 2 +- docs/sql-data-sources-orc.md | 4 +-- docs/sql-data-sources-parquet.md | 4 +-- docs/sql-data-sources-text.md| 2 +- docs/sql-distributed-sql-engine-spark-sql-cli.md | 4 +-- docs/sql-error-conditions-sqlstates.md | 26 +++ docs/sql-migration-guide.md | 4 +-- docs/sql-performance-tuning.md | 16 +- docs/storage-openstack-swift.md | 2 +- docs/streaming-custom-receivers.md | 2 +- docs/streaming-programming-guide.md | 10 +++--- docs/structured-streaming-kafka-integration.md | 20 ++-- docs/structured-streaming-programming-guide.md | 12 +++ docs/submitting-applications.md | 2 +- docs/web-ui.md | 2 +- 40 files changed, 164 insertions(+), 151 deletions(-) diff --git a/docs/building-spark.md b/docs/building-spark.md index 4b8e70655d59..33d253a49dbf 100644 --- a/docs/building-spark.md +++ b/docs/building-spark.md @@ -286,7 +286,7 @@ If use an individual repository or a repository on GitHub Enterprise, export bel ### 
Related environment variables - + Variable NameDefaultMeaning SPARK_PROJECT_URL diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md index 7da06a852089..34913bd97a41 100644 --- a/docs/cluster-overview.md +++ b/docs/cluster-overview.md @@ -91,7 +91,7 @@ The [job scheduling overview](job-scheduling.html) describ
(spark) branch master updated (9bb358b51e30 -> 99b80a7f17e2)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 9bb358b51e30 [SPARK-46170][SQL] Support inject adaptive query post planner strategy rules in SparkSessionExtensions add 99b80a7f17e2 [SPARK-46188][DOC] Fix the CSS of Spark doc's generated tables No new revisions were added by this update. Summary of changes: docs/building-spark.md | 2 +- docs/cluster-overview.md | 2 +- docs/configuration.md| 38 docs/css/custom.css | 13 docs/ml-classification-regression.md | 14 - docs/ml-clustering.md| 8 ++--- docs/mllib-classification-regression.md | 2 +- docs/mllib-decision-tree.md | 2 +- docs/mllib-ensembles.md | 2 +- docs/mllib-evaluation-metrics.md | 10 +++ docs/mllib-linear-methods.md | 4 +-- docs/mllib-pmml-model-export.md | 2 +- docs/monitoring.md | 10 +++ docs/rdd-programming-guide.md| 8 ++--- docs/running-on-kubernetes.md| 8 ++--- docs/running-on-yarn.md | 8 ++--- docs/security.md | 26 docs/spark-standalone.md | 14 - docs/sparkr.md | 6 ++-- docs/sql-data-sources-avro.md| 12 docs/sql-data-sources-csv.md | 2 +- docs/sql-data-sources-hive-tables.md | 4 +-- docs/sql-data-sources-jdbc.md| 2 +- docs/sql-data-sources-json.md| 2 +- docs/sql-data-sources-load-save-functions.md | 2 +- docs/sql-data-sources-orc.md | 4 +-- docs/sql-data-sources-parquet.md | 4 +-- docs/sql-data-sources-protobuf.md| 6 ++-- docs/sql-data-sources-text.md| 2 +- docs/sql-data-sources-xml.md | 2 +- docs/sql-distributed-sql-engine-spark-sql-cli.md | 4 +-- docs/sql-error-conditions-sqlstates.md | 26 docs/sql-migration-guide.md | 4 +-- docs/sql-performance-tuning.md | 16 +- docs/storage-openstack-swift.md | 2 +- docs/streaming-custom-receivers.md | 2 +- docs/streaming-programming-guide.md | 10 +++ docs/structured-streaming-kafka-integration.md | 20 ++--- docs/structured-streaming-programming-guide.md | 12 docs/structured-streaming-state-data-source.md | 8 ++--- docs/submitting-applications.md | 2 +- docs/web-ui.md | 2 +- 42 files changed, 171 insertions(+), 158 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46144][SQL] Fail INSERT INTO ... REPLACE statement if the condition contains subquery
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new c162f6df1b3d [SPARK-46144][SQL] Fail INSERT INTO ... REPLACE statement if the condition contains subquery c162f6df1b3d is described below commit c162f6df1b3d6ccc2944b6fb6db033482c9f01ee Author: Gengliang Wang AuthorDate: Wed Nov 29 21:19:58 2023 -0800 [SPARK-46144][SQL] Fail INSERT INTO ... REPLACE statement if the condition contains subquery ### What changes were proposed in this pull request? For the following query: ``` INSERT INTO tbl REPLACE WHERE id = (select c2 from values(1) as t(c2)) SELECT * FROM source ``` There will be an analysis error: ``` [UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name `c2` cannot be resolved. SQLSTATE: 42703; line 1 pos 51; 'OverwriteByExpression RelationV2[id#27L, data#28] testcat.tbl testcat.tbl, (id#27L = scalar-subquery#26 []), false ``` The error message is confusing. The actual reason is the OverwriteByExpression plan doesn't support subqueries. While supporting the feature is non-trivial, this PR is to improve the error message as ``` [UNSUPPORTED_FEATURE.OVERWRITE_BY_SUBQUERY] The feature is not supported: INSERT OVERWRITE with a subquery condition. SQLSTATE: 0A000; line 1 pos 43; ``` ### Why are the changes needed? Error message improvement ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT ### Was this patch authored or co-authored using generative AI tooling? No Closes #44060 from gengliangwang/replace. Authored-by: Gengliang Wang Signed-off-by: Gengliang Wang --- .../src/main/resources/error/error-classes.json| 5 + ...r-conditions-unsupported-feature-error-class.md | 4 .../sql/catalyst/analysis/CheckAnalysis.scala | 5 + .../spark/sql/connector/DataSourceV2SQLSuite.scala | 26 ++ 4 files changed, 40 insertions(+) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 5b70edf249d1..9e0019b34728 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -3529,6 +3529,11 @@ "Unable to convert of Orc to data type ." ] }, + "OVERWRITE_BY_SUBQUERY" : { +"message" : [ + "INSERT OVERWRITE with a subquery condition." +] + }, "PANDAS_UDAF_IN_PIVOT" : { "message" : [ "Pandas user defined aggregate function in the PIVOT clause." diff --git a/docs/sql-error-conditions-unsupported-feature-error-class.md b/docs/sql-error-conditions-unsupported-feature-error-class.md index 0541b9d0589e..1143aff634c2 100644 --- a/docs/sql-error-conditions-unsupported-feature-error-class.md +++ b/docs/sql-error-conditions-unsupported-feature-error-class.md @@ -121,6 +121,10 @@ The target JDBC server hosting table `` does not support ALTER TABLE Unable to convert `` of Orc to data type ``. +## OVERWRITE_BY_SUBQUERY + +INSERT OVERWRITE with a subquery condition. + ## PANDAS_UDAF_IN_PIVOT Pandas user defined aggregate function in the PIVOT clause. 
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala index 3843901a2e01..ea1af1d3c8cd 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala @@ -271,6 +271,11 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB case _ => } + case o: OverwriteByExpression if o.deleteExpr.exists(_.isInstanceOf[SubqueryExpression]) => +o.deleteExpr.failAnalysis ( + errorClass = "UNSUPPORTED_FEATURE.OVERWRITE_BY_SUBQUERY", + messageParameters = Map.empty) + case operator: LogicalPlan => operator transformExpressionsDown { // Check argument data types of higher-order functions downwards first. diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala index c2e759efe402..b92b512aa1d3 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2SQLSuite.scala @@ -3226,
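For the REPLACE WHERE change above, a minimal illustrative Scala sketch (hypothetical table names; assumes `tbl` is a DataSource V2 table whose catalog supports overwrite by filter, and a SparkSession named `spark`) contrasting the supported form with the one that now fails fast:
```
// A subquery-free condition is planned as OverwriteByExpression and works as before.
spark.sql("INSERT INTO tbl REPLACE WHERE id = 1 SELECT * FROM source")

// A subquery in the condition is not supported; it now fails with
// [UNSUPPORTED_FEATURE.OVERWRITE_BY_SUBQUERY] instead of a confusing UNRESOLVED_COLUMN error.
spark.sql(
  "INSERT INTO tbl REPLACE WHERE id = (SELECT c2 FROM VALUES (1) AS t(c2)) SELECT * FROM source")
```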
(spark) branch branch-3.5 updated: [SPARK-43380][SQL][FOLLOW-UP] Fix slowdown in Avro read
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new 64242bf6a64 [SPARK-43380][SQL][FOLLOW-UP] Fix slowdown in Avro read 64242bf6a64 is described below commit 64242bf6a6425274b83bc1191230437c2d3fbc71 Author: zeruibao AuthorDate: Tue Oct 31 16:46:40 2023 -0700 [SPARK-43380][SQL][FOLLOW-UP] Fix slowdown in Avro read ### What changes were proposed in this pull request? Fix slowdown in Avro read. There is a https://github.com/apache/spark/pull/42503 that causes the performance regression. It seems that `SQLConf.get.getConf(confKey)` is very costly. Move it out of `newWriter` function. ### Why are the changes needed? Need to fix the performance regression of Avro read. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT test ### Was this patch authored or co-authored using generative AI tooling? No Closes #43606 from zeruibao/SPARK-43380-FIX-SLOWDOWN. Authored-by: zeruibao Signed-off-by: Gengliang Wang (cherry picked from commit 45f73bc69655a236323be1bcb2988341d2aa5203) Signed-off-by: Gengliang Wang --- .../src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala index fe0bd7392b6..ec34d10a5ff 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala @@ -105,6 +105,9 @@ private[sql] class AvroDeserializer( s"Cannot convert Avro type $rootAvroType to SQL type ${rootCatalystType.sql}.", ise) } + private lazy val preventReadingIncorrectType = !SQLConf.get +.getConf(SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA) + def deserialize(data: Any): Option[Any] = converter(data) /** @@ -122,8 +125,6 @@ private[sql] class AvroDeserializer( s"schema is incompatible (avroType = $avroType, sqlType = ${catalystType.sql})" val realDataType = SchemaConverters.toSqlType(avroType, useStableIdForUnionType).dataType -val confKey = SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA -val preventReadingIncorrectType = !SQLConf.get.getConf(confKey) (avroType.getType, catalystType) match { case (NULL, NullType) => (updater, ordinal, _) => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-43380][SQL][FOLLOW-UP] Fix slowdown in Avro read
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 45f73bc6965 [SPARK-43380][SQL][FOLLOW-UP] Fix slowdown in Avro read 45f73bc6965 is described below commit 45f73bc69655a236323be1bcb2988341d2aa5203 Author: zeruibao AuthorDate: Tue Oct 31 16:46:40 2023 -0700 [SPARK-43380][SQL][FOLLOW-UP] Fix slowdown in Avro read ### What changes were proposed in this pull request? Fix slowdown in Avro read. There is a https://github.com/apache/spark/pull/42503 that causes the performance regression. It seems that `SQLConf.get.getConf(confKey)` is very costly. Move it out of `newWriter` function. ### Why are the changes needed? Need to fix the performance regression of Avro read. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT test ### Was this patch authored or co-authored using generative AI tooling? No Closes #43606 from zeruibao/SPARK-43380-FIX-SLOWDOWN. Authored-by: zeruibao Signed-off-by: Gengliang Wang --- .../src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala index c04fe820f0b..29b9fdf9dfb 100644 --- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala +++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDeserializer.scala @@ -105,6 +105,9 @@ private[sql] class AvroDeserializer( s"Cannot convert Avro type $rootAvroType to SQL type ${rootCatalystType.sql}.", ise) } + private lazy val preventReadingIncorrectType = !SQLConf.get +.getConf(SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA) + def deserialize(data: Any): Option[Any] = converter(data) /** @@ -122,8 +125,6 @@ private[sql] class AvroDeserializer( s"schema is incompatible (avroType = $avroType, sqlType = ${catalystType.sql})" val realDataType = SchemaConverters.toSqlType(avroType, useStableIdForUnionType).dataType -val confKey = SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA -val preventReadingIncorrectType = !SQLConf.get.getConf(confKey) (avroType.getType, catalystType) match { case (NULL, NullType) => (updater, ordinal, _) => - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
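For the Avro read fix above, a minimal sketch of the underlying pattern (hypothetical class name; only the conf-lookup hoisting is shown): reading a `SQLConf` entry is cheap once but costly when repeated for every field or record, so the lookup is resolved lazily, at most once per deserializer instance:
```
import org.apache.spark.sql.internal.SQLConf

class ConfCachingDeserializer {  // hypothetical stand-in for AvroDeserializer
  // Evaluated at most once per instance instead of inside newWriter for every field.
  private lazy val preventReadingIncorrectType: Boolean =
    !SQLConf.get.getConf(SQLConf.LEGACY_AVRO_ALLOW_INCOMPATIBLE_SCHEMA)

  private def newWriter(avroFieldName: String): Unit = {
    if (preventReadingIncorrectType) {
      // strict path: reject Avro/Catalyst type pairs that could yield incorrect values
    }
  }
}
```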
(spark) branch master updated (49f9e74973f -> af8907a0873)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from 49f9e74973f [SPARK-45481][SPARK-45664][SPARK-45711][SQL][FOLLOWUP] Avoid magic strings copy from parquet|orc|avro compression codes add af8907a0873 [SPARK-45242][SQL][FOLLOWUP] Canonicalize DataFrame ID in CollectMetrics No new revisions were added by this update. Summary of changes: .../spark/sql/catalyst/plans/logical/basicLogicalOperators.scala | 4 .../scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala | 5 + 2 files changed, 9 insertions(+) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-45581] Make SQLSTATE mandatory
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7e82e1bc43e [SPARK-45581] Make SQLSTATE mandatory 7e82e1bc43e is described below commit 7e82e1bc43e0297c3036d802b3a151d2b93db2f6 Author: srielau AuthorDate: Wed Oct 18 11:04:44 2023 -0700 [SPARK-45581] Make SQLSTATE mandatory ### What changes were proposed in this pull request? We propose to make SQLSTATEs mandatory field when using error classes in the new error framework. ### Why are the changes needed? Being able to rely on the existence of SQLSTATEs allows easier classification of errors as well as usage of tooling to intercept SQLSTATEs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? A new test was added to SparkThrowableSuite to enforce SQLSTATE existence ### Was this patch authored or co-authored using generative AI tooling? No Closes #43412 from srielau/SPARK-45581-make-SQLSTATEs-mandatory. Authored-by: srielau Signed-off-by: Gengliang Wang --- common/utils/src/main/resources/error/README.md| 26 ++ .../src/main/resources/error/error-classes.json| 9 +--- .../org/apache/spark/ErrorClassesJSONReader.scala | 8 +++ .../org/apache/spark/SparkThrowableSuite.scala | 16 - docs/sql-error-conditions.md | 6 ++--- 5 files changed, 41 insertions(+), 24 deletions(-) diff --git a/common/utils/src/main/resources/error/README.md b/common/utils/src/main/resources/error/README.md index ac388c29250..8d8529bea56 100644 --- a/common/utils/src/main/resources/error/README.md +++ b/common/utils/src/main/resources/error/README.md @@ -1,6 +1,6 @@ # Guidelines -To throw a standardized user-facing error or exception, developers should specify the error class +To throw a standardized user-facing error or exception, developers should specify the error class, a SQLSTATE, and message parameters rather than an arbitrary error message. ## Usage @@ -10,7 +10,7 @@ and message parameters rather than an arbitrary error message. If true, use the error class `INTERNAL_ERROR` and skip to step 4. 2. Check if an appropriate error class already exists in `error-classes.json`. If true, use the error class and skip to step 4. -3. Add a new class to `error-classes.json`; keep in mind the invariants below. +3. Add a new class with a new or existing SQLSTATE to `error-classes.json`; keep in mind the invariants below. 4. Check if the exception type already extends `SparkThrowable`. If true, skip to step 6. 5. Mix `SparkThrowable` into the exception. @@ -26,9 +26,9 @@ Throw with arbitrary error message: `error-classes.json` -"PROBLEM_BECAUSE": { - "message": ["Problem because "], - "sqlState": "X" +"PROBLEM_BECAUSE" : { + "message" : ["Problem because "], + "sqlState" : "X" } `SparkException.scala` @@ -70,6 +70,8 @@ Error classes are a succinct, human-readable representation of the error categor An uncategorized errors can be assigned to a legacy error class with the prefix `_LEGACY_ERROR_TEMP_` and an unused sequential number, for instance `_LEGACY_ERROR_TEMP_0053`. +You should not introduce new uncategorized errors. Instead, convert them to proper errors whenever encountering them in new code. + Invariants - Unique @@ -79,7 +81,10 @@ An uncategorized errors can be assigned to a legacy error class with the prefix ### Message Error messages provide a descriptive, human-readable representation of the error. 
-The message format accepts string parameters via the C-style printf syntax. +The message format accepts string parameters via the HTML tag syntax: e.g. . + +The values passed to the message shoudl not themselves be messages. +They should be: runtime-values, keywords, identifiers, or other values that are not translated. The quality of the error message should match the [guidelines](https://spark.apache.org/error-message-guidelines.html). @@ -90,21 +95,24 @@ The quality of the error message should match the ### SQLSTATE -SQLSTATE is an optional portable error identifier across SQL engines. +SQLSTATE is an mandatory portable error identifier across SQL engines. SQLSTATE comprises a 2-character class value followed by a 3-character subclass value. Spark prefers to re-use existing SQLSTATEs, preferably used by multiple vendors. For extension Spark claims the 'K**' subclass range. If a new class is needed it will also claim the 'K0' class. +Internal errors should use the 'XX' class. You can subdivide internal errors by component. +For exam
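For the SQLSTATE change above, a minimal illustrative Scala sketch (parameter values are hypothetical) of raising an error through the framework once every class in error-classes.json carries a SQLSTATE; callers reference the class and supply only its message parameters, and the SQLSTATE travels with the class definition:
```
import org.apache.spark.SparkException

// PROBLEM_BECAUSE is the example class from the README diff above; its entry in
// error-classes.json must now declare a sqlState alongside the message template.
throw new SparkException(
  errorClass = "PROBLEM_BECAUSE",
  messageParameters = Map(
    "problem" -> "the ordering is invalid",   // hypothetical values
    "cause" -> "the column is missing"),
  cause = null)
```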
[spark] branch master updated (e3b1bb117fe9 -> 3593f8a8919d)
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git from e3b1bb117fe9 [SPARK-45262][SQL][TESTS][DOCS] Improve examples for regexp parameters add 3593f8a8919d [SPARK-45425][SQL] Mapped TINYINT to ShortType for MsSqlServerDialect No new revisions were added by this update. Summary of changes: .../spark/sql/jdbc/MsSqlServerIntegrationSuite.scala| 12 ++-- .../org/apache/spark/sql/jdbc/MsSqlServerDialect.scala | 4 +++- .../scala/org/apache/spark/sql/jdbc/JDBCSuite.scala | 17 + 3 files changed, 30 insertions(+), 3 deletions(-) - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
[spark] branch master updated: [SPARK-44838][SQL] raise_error improvement
This is an automated email from the ASF dual-hosted git repository. gengliang pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 9109d7037f4 [SPARK-44838][SQL] raise_error improvement 9109d7037f4 is described below commit 9109d7037f44158e72d14019eb33f9c7b8838868 Author: srielau AuthorDate: Wed Sep 27 10:02:44 2023 -0700 [SPARK-44838][SQL] raise_error improvement ### What changes were proposed in this pull request? Extend the raise_error() function to a two-argument version: raise_error(errorClassStr, errorParamMap) This new form will accept any error class defined in error-classes.json and require Map to provide values for the parameters in the error classes template. Externally an error raised via raise_error() is indistinguishable from an error raised from within the Spark engine. The single-parameter raise_error(str) will raise USER_RAISED_EXCEPTION (SQLSTATE P0001 - borrowed from PostgreSQL). USER_RAISED_EXCEPTION text is: "" which will be filled in with the str - value. We will also provide `spark.sql.legacy.raiseErrorWithoutErrorClass` (default: false) to revert to the old behavior for the single-parameter version. Naturally assert_true() will also return `USER_RAISED_EXCEPTION`. Examples ``` SELECT raise_error('VIEW_NOT_FOUND', map('relationName', '`v1`'); [VIEW_NOT_FOUND] The view `v1` cannot be found. Verify the spelling ... SELECT raise_error('Error!'); [USER_RAISED_EXCEPTION] Error! SELECT assert_true(1 < 0); [USER_RAISED_EXCEPTION] '(1 < 0)' is not true! SELECT assert_true(1 < 0, 'bad!') [USER_RAISED_EXCEPTION] bad! ``` ### Why are the changes needed? This change moves raise_error() and assert_true() to the new error frame work. It greatly expands the ability of users to raise error messages which can be intercepted via SQLSTATE and/or error class. ### Does this PR introduce _any_ user-facing change? Yes, the result of assert_true() changes and raise_error() gains a new signature. ### How was this patch tested? Run existing QA and add new tests for assert_true and raise_error ### Was this patch authored or co-authored using generative AI tooling? No Closes #42985 from srielau/SPARK-44838-raise_error. 
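For illustration, the SQL examples above as they might be driven from Scala (assumes a SparkSession named `spark`); each call throws with the error class shown:
```
// Two-argument form: any error class from error-classes.json plus its message parameters.
spark.sql("SELECT raise_error('VIEW_NOT_FOUND', map('relationName', '`v1`'))").collect()
// => [VIEW_NOT_FOUND] The view `v1` cannot be found. ...

// One-argument form: raises USER_RAISED_EXCEPTION (SQLSTATE P0001) with the given text.
spark.sql("SELECT raise_error('Error!')").collect()
// => [USER_RAISED_EXCEPTION] Error!

// assert_true now also reports USER_RAISED_EXCEPTION when the condition is false.
spark.sql("SELECT assert_true(1 < 0, 'bad!')").collect()
// => [USER_RAISED_EXCEPTION] bad!
```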
Lead-authored-by: srielau Co-authored-by: Serge Rielau Co-authored-by: Wenchen Fan Signed-off-by: Gengliang Wang --- .../src/main/resources/error/error-classes.json| 26 +++ .../org/apache/spark/ErrorClassesJSONReader.scala | 18 ++ .../org/apache/spark/SparkThrowableHelper.scala| 10 +- .../scala/org/apache/spark/sql/functions.scala | 8 + .../function_assert_true_with_message.explain | 2 +- .../explain-results/function_raise_error.explain | 2 +- .../org/apache/spark/SparkThrowableSuite.scala | 4 +- docs/sql-error-conditions.md | 20 ++ python/pyspark/sql/tests/test_functions.py | 4 +- .../spark/sql/catalyst/expressions/misc.scala | 71 --- .../spark/sql/errors/QueryExecutionErrors.scala| 49 - .../org/apache/spark/sql/internal/SQLConf.scala| 14 ++ .../expressions/ExpressionEvalHelper.scala | 4 + .../expressions/MiscExpressionsSuite.scala | 10 +- .../catalyst/optimizer/ConstantFoldingSuite.scala | 2 +- .../scala/org/apache/spark/sql/functions.scala | 8 + .../sql-functions/sql-expression-schema.md | 2 +- .../analyzer-results/misc-functions.sql.out| 86 +++- .../resources/sql-tests/inputs/misc-functions.sql | 22 +++ .../sql-tests/results/misc-functions.sql.out | 220 +++-- .../apache/spark/sql/ColumnExpressionSuite.scala | 26 ++- .../spark/sql/execution/ui/UISeleniumSuite.scala | 9 +- .../sql/expressions/ExpressionInfoSuite.scala | 1 + 23 files changed, 551 insertions(+), 67 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index dd0190c3462..0882e387176 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -3502,6 +3502,26 @@ "3. set \"spark.sql.legacy.allowUntypedScalaUDF\" to \"true\" and use this API with caution." ] }, + "USER_RAISED_EXCEPTION" : { +"message" : [ + "" +], +"sqlState" : "P0001" + }, + "USER_RAISED_EXCEPTION_PARAMETER_MISMATCH" : { +"message" : [ + "The `raise_error()` function was used to raise error class: which expects parameters: .", +