(spark) branch master updated: [SPARK-46360][PYTHON] Enhance error message debugging with new `getMessage` API
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 8d64cb4144d1 [SPARK-46360][PYTHON] Enhance error message debugging with new `getMessage` API 8d64cb4144d1 is described below commit 8d64cb4144d17107fd3758d2a46430439203c7ad Author: Haejoon Lee AuthorDate: Mon Dec 11 21:03:11 2023 -0800 [SPARK-46360][PYTHON] Enhance error message debugging with new `getMessage` API ### What changes were proposed in this pull request? This PR proposes to introduce `getMessage` to provide a standardized way for users to obtain a concise and clear error message. ### Why are the changes needed? Previously, extracting a simple and informative error message in PySpark was not straightforward. The internal `ErrorClassesReader.get_error_message` method was often used, but for JVM-originated errors not defined in `error_classes.py`, obtaining a succinct error message was challenging. The new `getMessage` API harmonizes error message retrieval across PySpark, leveraging existing JVM implementations to ensure consistency and clarity in the messages presented to the users. ### Does this PR introduce _any_ user-facing change? Yes, this PR introduces a `getMessage` for directly accessing simplified error messages in PySpark. - **Before**: No official API for simplified error messages; excessive details in the error output: ```python from pyspark.sql.utils import AnalysisException try: spark.sql("""SELECT a""") except AnalysisException as e: str(e) # "[UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name `a` cannot be resolved. 
SQLSTATE: 42703; line 1 pos 7;\n'Project ['a]\n+- OneRowRelation\n" ``` - **After**: The `getMessage` API provides streamlined, user-friendly error messages: ```python from pyspark.sql.utils import AnalysisException try: spark.sql("""SELECT a""") except AnalysisException as e: e.getMessage() # '[UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name `a` cannot be resolved. SQLSTATE: 42703' ``` ### How was this patch tested? Added UTs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44292 from itholic/getMessage. Authored-by: Haejoon Lee Signed-off-by: Hyukjin Kwon --- python/pyspark/errors/exceptions/base.py | 19 ++- python/pyspark/errors/exceptions/captured.py | 18 ++ python/pyspark/sql/tests/test_utils.py | 8 3 files changed, 44 insertions(+), 1 deletion(-) diff --git a/python/pyspark/errors/exceptions/base.py b/python/pyspark/errors/exceptions/base.py index b60800da3ff8..e40e1b2e93cb 100644 --- a/python/pyspark/errors/exceptions/base.py +++ b/python/pyspark/errors/exceptions/base.py @@ -60,6 +60,7 @@ class PySparkException(Exception): See Also +:meth:`PySparkException.getMessage` :meth:`PySparkException.getMessageParameters` :meth:`PySparkException.getSqlState` """ @@ -74,6 +75,7 @@ class PySparkException(Exception): See Also :meth:`PySparkException.getErrorClass` +:meth:`PySparkException.getMessage` :meth:`PySparkException.getSqlState` """ return self._message_parameters @@ -89,13 +91,28 @@ class PySparkException(Exception): See Also :meth:`PySparkException.getErrorClass` +:meth:`PySparkException.getMessage` :meth:`PySparkException.getMessageParameters` """ return None +def getMessage(self) -> str: +""" +Returns full error message. + +.. 
versionadded:: 4.0.0 + +See Also + +:meth:`PySparkException.getErrorClass` +:meth:`PySparkException.getMessageParameters` +:meth:`PySparkException.getSqlState` +""" +return f"[{self.getErrorClass()}] {self._message}" + def __str__(self) -> str: if self.getErrorClass() is not None: -return f"[{self.getErrorClass()}] {self._message}" +return self.getMessage() else: return self._message diff --git a/python/pyspark/errors/exceptions/captured.py b/python/pyspark/errors/exceptions/captured.py index ec987e0854ea..4164bb7b428d 100644 --- a/python/pyspark/errors/exceptions/captured.py +++ b/python/pyspark/errors/exceptions/captured.py @@ -118,6 +118,24 @@ class CapturedException(PySparkException): else: return None +def
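The diff above is truncated, but the shape of the new API is visible in the hunks: `getMessage` formats the error class and message, and `__str__` delegates to it. A minimal standalone sketch of that pattern (illustrative only — not the actual PySpark implementation, which lives in `pyspark/errors/exceptions/base.py`):

```python
from typing import Optional


class PySparkExceptionSketch(Exception):
    """Illustrative stand-in for pyspark.errors.PySparkException."""

    def __init__(self, message: str, error_class: Optional[str] = None):
        self._message = message
        self._error_class = error_class

    def getErrorClass(self) -> Optional[str]:
        return self._error_class

    def getMessage(self) -> str:
        # Full error message: "[ERROR_CLASS] message"
        return f"[{self.getErrorClass()}] {self._message}"

    def __str__(self) -> str:
        # After the change, __str__ delegates to getMessage() when an
        # error class is present, so both paths format identically.
        if self.getErrorClass() is not None:
            return self.getMessage()
        return self._message


e = PySparkExceptionSketch("Column `a` cannot be resolved.", "UNRESOLVED_COLUMN")
print(e.getMessage())  # [UNRESOLVED_COLUMN] Column `a` cannot be resolved.
```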
(spark) branch master updated: [SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer types
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
new 2a49feeb5d7 [SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer types
2a49feeb5d7 is described below

commit 2a49feeb5d727552758a75fdcfbc49e8f6eed72f
Author: Parth Upadhyay
AuthorDate: Mon Dec 11 16:37:04 2023 -0800

[SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer types

### What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-43427

Protobuf supports unsigned integer types, including uint32 and uint64. When deserializing protobuf values with fields of these types, `from_protobuf` currently maps them to the following Spark types:

```
uint32 => IntegerType
uint64 => LongType
```

IntegerType and LongType are [signed](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) integer types, so this can lead to confusing results. Namely, if a uint32 value in a stored proto is above 2^31, or a uint64 value is above 2^63, its binary representation will contain a 1 in the highest bit, which, when interpreted as a signed integer, is negative (i.e. it overflows). No information is lost, as `IntegerType` and `LongType` contain 32 and 64 bits respectively, however [...]

In this PR, we add an option (`upcast.unsigned.ints`) to allow upcasting unsigned integer types into a larger integer type that can represent them natively, i.e.

```
uint32 => LongType
uint64 => Decimal(20, 0)
```

I added an option so that it doesn't break any existing clients.
**Example of current behavior** Consider a protobuf message like: ``` syntax = "proto3"; message Test { uint64 val = 1; } ``` If we compile the above and then generate a message with a value for `val` above 2^63: ``` import test_pb2 s = test_pb2.Test() s.val = 9223372036854775809 # 2**63 + 1 serialized = s.SerializeToString() print(serialized) ``` This generates the binary representation: b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01' Then, deserializing this using `from_protobuf`, we can see that it is represented as a negative number. I did this in a notebook so its easier to see, but could reproduce in a scala test as well: ![image](https://github.com/apache/spark/assets/1002986/7144e6a9-3f43-455e-94c3-9065ae88206e) **Precedent** I believe that unsigned integer types in parquet are deserialized in a similar manner, i.e. put into a larger type so that the unsigned representation natively fits. https://issues.apache.org/jira/browse/SPARK-34817 and https://github.com/apache/spark/pull/31921. So an option to get similar behavior would be useful. ### Why are the changes needed? Improve unsigned integer deserialization behavior. ### Does this PR introduce any user-facing change? Yes, adds a new option. ### How was this patch tested? Unit Testing ### Was this patch authored or co-authored using generative AI tooling? No Closes #43773 from justaparth/parth/43427-add-option-to-expand-unsigned-integers. 
Authored-by: Parth Upadhyay Signed-off-by: Hyukjin Kwon --- .../spark/sql/protobuf/ProtobufDeserializer.scala | 12 ++ .../spark/sql/protobuf/ProtobufSerializer.scala| 11 +- .../spark/sql/protobuf/utils/ProtobufOptions.scala | 12 ++ .../sql/protobuf/utils/SchemaConverters.scala | 18 - .../sql/protobuf/ProtobufFunctionsSuite.scala | 46 ++ 5 files changed, 96 insertions(+), 3 deletions(-) diff --git a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala index a46baf51379..45f3419edf9 100644 --- a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala +++ b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala @@ -193,6 +193,11 @@ private[sql] class ProtobufDeserializer( case (INT, ShortType) => (updater, ordinal, value) => updater.setShort(ordinal, value.asInstanceOf[Short]) + case (INT, LongType) => +(updater, ordinal, value) => + updater.setLong( +ordinal, +Integer.toUnsignedLong(value.asInstanceOf[Int])) case ( MESSAGE | BOOLEAN | INT | FLOAT | DOUBLE | LONG | STRING | ENUM | BYTE_STRING, ArrayType(dataType: DataType, containsNull)) if protoType.isRepeated => @@ -201,6 +206,13 @@ private[sql] class ProtobufDeserializer( case (LONG,
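The sign flip described above can be reproduced without protobuf or Spark at all: reinterpret the byte pattern of a uint64 above 2^63 as a signed 64-bit integer (which is effectively what a `LongType` column shows), then undo it by widening the type. A standalone sketch — Python's arbitrary-precision `int` stands in for `Decimal(20, 0)` here:

```python
import struct

u = 2**63 + 1  # 9223372036854775809, a uint64 outside the signed-long range

# Reinterpret the same 8 bytes as a signed 64-bit integer: this is the
# confusing negative value a signed long interpretation produces.
signed = struct.unpack("<q", struct.pack("<Q", u))[0]
print(signed)  # -9223372036854775807

# The upcast recovers the unsigned value by moving to a wider type:
recovered = signed + 2**64 if signed < 0 else signed
print(recovered)  # 9223372036854775809
```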
(spark) branch branch-3.4 updated: [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.4 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.4 by this push: new b813e2e100f [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage` b813e2e100f is described below commit b813e2e100faf7bab88c23ba9bba6e3197b169aa Author: Dongjoon Hyun AuthorDate: Mon Dec 11 15:05:21 2023 -0800 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage` ### What changes were proposed in this pull request? This PR aims to remove `kill` hyperlink from `RELAUNCHING` drivers in `MasterPage`. ### Why are the changes needed? Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` hyperlinks in the `Completed Drivers` table. ![Screenshot 2023-12-11 at 1 02 29 PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012) However, this is a bug because the driver was already terminated by definition. Newly relaunched driver has an independent ID and there is no relationship with the previously terminated ID. https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27 If we clicked the `kill` link, `Master` always complains like the following. ``` 23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-0 23/12/11 21:25:50 WARN Master: Driver 202312112113-0 has already finished or does not exist ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44301 from dongjoon-hyun/SPARK-46369. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2) Signed-off-by: Dongjoon Hyun (cherry picked from commit ac031d68a01f14cc73f05e83a790a6787aa6453d) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala index a71eb33a2fe..e7e90aa0a37 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala @@ -322,8 +322,7 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") { private def driverRow(driver: DriverInfo, showDuration: Boolean): Seq[Node] = { val killLink = if (parent.killEnabled && (driver.state == DriverState.RUNNING || -driver.state == DriverState.SUBMITTED || -driver.state == DriverState.RELAUNCHING)) { +driver.state == DriverState.SUBMITTED)) { val confirm = s"if (window.confirm('Are you sure you want to kill driver ${driver.id} ?')) " + "{ this.parentNode.submit(); return true; } else { return false; }" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
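The changed condition itself is tiny: after the fix, a `kill` link is rendered only for states that can still be killed. Transliterated from the Scala diff above into a Python sketch (names are illustrative, not Spark's actual API):

```python
# Driver states, mirroring org.apache.spark.deploy.master.DriverState
RUNNING, SUBMITTED, RELAUNCHING = "RUNNING", "SUBMITTED", "RELAUNCHING"

def show_kill_link(kill_enabled: bool, state: str) -> bool:
    # RELAUNCHING is intentionally excluded: such a driver has already
    # terminated, so a kill request could never succeed.
    return kill_enabled and state in (RUNNING, SUBMITTED)

print(show_kill_link(True, RELAUNCHING))  # False
print(show_kill_link(True, RUNNING))      # True
```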
(spark) branch branch-3.5 updated: [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`
This is an automated email from the ASF dual-hosted git repository. dongjoon pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/branch-3.5 by this push: new ac031d68a01 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage` ac031d68a01 is described below commit ac031d68a01f14cc73f05e83a790a6787aa6453d Author: Dongjoon Hyun AuthorDate: Mon Dec 11 15:05:21 2023 -0800 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage` ### What changes were proposed in this pull request? This PR aims to remove `kill` hyperlink from `RELAUNCHING` drivers in `MasterPage`. ### Why are the changes needed? Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` hyperlinks in the `Completed Drivers` table. ![Screenshot 2023-12-11 at 1 02 29 PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012) However, this is a bug because the driver was already terminated by definition. Newly relaunched driver has an independent ID and there is no relationship with the previously terminated ID. https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27 If we clicked the `kill` link, `Master` always complains like the following. ``` 23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-0 23/12/11 21:25:50 WARN Master: Driver 202312112113-0 has already finished or does not exist ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual review. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #44301 from dongjoon-hyun/SPARK-46369. 
Authored-by: Dongjoon Hyun Signed-off-by: Dongjoon Hyun (cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2) Signed-off-by: Dongjoon Hyun --- core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala index a71eb33a2fe..e7e90aa0a37 100644 --- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala +++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala @@ -322,8 +322,7 @@ private[ui] class MasterPage(parent: MasterWebUI) extends WebUIPage("") { private def driverRow(driver: DriverInfo, showDuration: Boolean): Seq[Node] = { val killLink = if (parent.killEnabled && (driver.state == DriverState.RUNNING || -driver.state == DriverState.SUBMITTED || -driver.state == DriverState.RELAUNCHING)) { +driver.state == DriverState.SUBMITTED)) { val confirm = s"if (window.confirm('Are you sure you want to kill driver ${driver.id} ?')) " + "{ this.parentNode.submit(); return true; } else { return false; }" - To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated (3e0808c33f1 -> e434c9f0d57)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

from 3e0808c33f1 [SPARK-46351][SQL] Require an error class in `AnalysisException`
add e434c9f0d57 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`

No new revisions were added by this update.

Summary of changes:
core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
(spark) branch master updated: [SPARK-46351][SQL] Require an error class in `AnalysisException`
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
new 3e0808c33f1 [SPARK-46351][SQL] Require an error class in `AnalysisException`
3e0808c33f1 is described below

commit 3e0808c33f185c13808ce2d547ce9ba0057d31a6
Author: Max Gekk
AuthorDate: Tue Dec 12 01:29:26 2023 +0300

[SPARK-46351][SQL] Require an error class in `AnalysisException`

### What changes were proposed in this pull request?

In the PR, I propose to create `AnalysisException` only with an error class, by making the constructor that takes a raw `message` protected. This way, only subclasses can create an `AnalysisException` by passing a `message`; all other callers must provide an error class.

### Why are the changes needed?

To improve the user experience with Spark SQL by unifying error exceptions: the final goal is that all Spark exceptions should contain an error class.

### Does this PR introduce _any_ user-facing change?

No, since user code shouldn't throw `AnalysisException`; it can be affected, though, if it depends on error message formats.

### How was this patch tested?

By existing test suites like:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```
and the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveDDLSuite"
```

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44277 from MaxGekk/protected-AnalysisException.
Authored-by: Max Gekk Signed-off-by: Max Gekk --- .../src/main/resources/error/error-classes.json| 253 + .../apache/spark/sql/avro/AvroDataToCatalyst.scala | 19 +- .../org/apache/spark/sql/test/SQLHelper.scala | 4 +- .../connect/client/GrpcExceptionConverter.scala| 22 +- .../spark/sql/kafka010/KafkaSourceProvider.scala | 9 +- .../apache/spark/sql/kafka010/KafkaWriter.scala| 5 +- .../org/apache/spark/sql/AnalysisException.scala | 14 +- .../apache/spark/sql/catalyst/SQLConfHelper.scala | 4 +- .../catalyst/analysis/ColumnResolutionHelper.scala | 8 +- .../ResolveRowLevelCommandAssignments.scala| 4 +- .../catalyst/analysis/RewriteMergeIntoTable.scala | 12 +- .../catalyst/expressions/V2ExpressionUtils.scala | 12 +- .../spark/sql/catalyst/planning/patterns.scala | 5 +- .../sql/catalyst/plans/logical/v2Commands.scala| 8 +- .../org/apache/spark/sql/util/SchemaUtils.scala| 25 +- .../catalyst/catalog/ExternalCatalogSuite.scala| 4 +- .../spark/sql/RelationalGroupedDataset.scala | 4 +- .../spark/sql/execution/SparkStrategies.scala | 3 +- .../spark/sql/execution/aggregate/AggUtils.scala | 5 +- .../execution/datasources/FileSourceStrategy.scala | 13 +- .../parquet/ParquetSchemaConverter.scala | 4 +- .../spark/sql/execution/datasources/rules.scala| 5 +- .../execution/datasources/v2/MergeRowsExec.scala | 4 +- .../datasources/v2/state/utils/SchemaUtil.scala| 6 +- .../RowLevelOperationRuntimeGroupFiltering.scala | 5 +- .../execution/streaming/WatermarkPropagator.scala | 6 +- .../execution/streaming/statefulOperators.scala| 6 +- .../org/apache/spark/sql/jdbc/JdbcDialects.scala | 5 +- .../sql/connector/DataSourceV2FunctionSuite.scala | 10 +- .../spark/sql/connector/DataSourceV2SQLSuite.scala | 79 +-- .../connector/FileDataSourceV2FallBackSuite.scala | 28 ++- .../spark/sql/execution/QueryExecutionSuite.scala | 2 +- .../execution/datasources/orc/OrcFilterSuite.scala | 3 +- .../sql/execution/datasources/orc/OrcTest.scala| 3 +- .../datasources/parquet/ParquetFilterSuite.scala | 3 +- 
.../spark/sql/hive/HiveExternalCatalog.scala | 32 ++- .../org/apache/spark/sql/hive/HiveInspectors.scala | 15 +- .../spark/sql/hive/HiveMetastoreCatalog.scala | 17 +- .../spark/sql/hive/HiveSessionStateBuilder.scala | 7 +- .../org/apache/spark/sql/hive/HiveStrategies.scala | 8 +- .../sql/hive/execution/V1WritesHiveUtils.scala | 4 +- .../spark/sql/hive/HiveMetastoreCatalogSuite.scala | 11 +- .../spark/sql/hive/execution/HiveDDLSuite.scala| 72 +++--- .../spark/sql/hive/execution/HiveQuerySuite.scala | 8 +- .../sql/hive/execution/Hive_2_1_DDLSuite.scala | 8 +- 45 files changed, 611 insertions(+), 173 deletions(-) diff --git a/common/utils/src/main/resources/error/error-classes.json b/common/utils/src/main/resources/error/error-classes.json index 62d10c0d34c..d52ffc011b7 100644 --- a/common/utils/src/main/resources/error/error-classes.json +++ b/common/utils/src/main/resources/error/error-classes.json @@ -6705,6 +6705,259 @@ "Failed to get block , which is not a shuffle block" ] }, +
(spark) branch master updated: [SPARK-46273][SQL] Support INSERT INTO/OVERWRITE using DSv2 sources
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 7db85642600 [SPARK-46273][SQL] Support INSERT INTO/OVERWRITE using DSv2 sources 7db85642600 is described below commit 7db85642600b1e3b39ca11e41d4e3e0bf1c8962b Author: allisonwang-db AuthorDate: Mon Dec 11 10:32:48 2023 -0800 [SPARK-46273][SQL] Support INSERT INTO/OVERWRITE using DSv2 sources ### What changes were proposed in this pull request? This PR adds test cases for INSERT INTO and INSERT OVERWRITE queries with DSv2 sources. ### Why are the changes needed? To improve test coverage ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit tests ### Was this patch authored or co-authored using generative AI tooling? No Closes #44213 from allisonwang-db/spark-46273-dsv2-insert. Authored-by: allisonwang-db Signed-off-by: Hyukjin Kwon --- .../spark/sql/errors/QueryCompilationErrors.scala | 4 +- .../datasources/v2/TableCapabilityCheck.scala | 2 +- .../spark/sql/connector/DataSourceV2Suite.scala| 127 - 3 files changed, 129 insertions(+), 4 deletions(-) diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala index b7e10dc194a..1195e9dd78d 100644 --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala @@ -926,8 +926,8 @@ private[sql] object QueryCompilationErrors extends QueryErrorsBase with Compilat unsupportedTableOperationError(table.name(), "truncate in batch mode") } - def unsupportedOverwriteByFilterInBatchModeError(table: Table): Throwable = { -unsupportedTableOperationError(table.name(), "overwrite by filter in batch mode") + def 
unsupportedOverwriteByFilterInBatchModeError(name: String): Throwable = { +unsupportedTableOperationError(name, "overwrite by filter in batch mode") } def catalogOperationNotSupported(catalog: CatalogPlugin, operation: String): Throwable = { diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala index a3afaa36ab9..b1a93addc80 100644 --- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala +++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala @@ -63,7 +63,7 @@ object TableCapabilityCheck extends (LogicalPlan => Unit) { case _ => if (!supportsBatchWrite(r.table) || !r.table.supports(OVERWRITE_BY_FILTER)) { throw QueryCompilationErrors.unsupportedOverwriteByFilterInBatchModeError( -r.table) + r.name) } } diff --git a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala index f2e518e8acc..6e365e1d605 100644 --- a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala +++ b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala @@ -18,6 +18,7 @@ package org.apache.spark.sql.connector import java.io.File +import java.util import java.util.OptionalLong import test.org.apache.spark.sql.connector._ @@ -723,8 +724,116 @@ class DataSourceV2Suite extends QueryTest with SharedSparkSession with AdaptiveS } } } -} + test("SPARK-46273: insert into") { +val cls = classOf[SupportsExternalMetadataDataSource] +withTable("test") { + sql( +s""" + |CREATE TABLE test (x INT, y INT) USING ${cls.getName} + |""".stripMargin) + sql("INSERT INTO test VALUES (1, 2)") + checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2))) + // Insert by name + sql("INSERT INTO test(y, x) VALUES (3, 2)") + 
checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2), Row(2, 3))) + // Can be casted automatically + sql("INSERT INTO test(y, x) VALUES (4L, 3L)") + checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2), Row(2, 3), Row(3, 4))) + // Insert values by name + sql("INSERT INTO test BY NAME VALUES (5, 4) t(y, x)") + checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2), Row(2, 3), Row(3, 4), Row(4, 5))) + // Missing columns + checkError( +exception = intercept[AnalysisException] { + sql(s"INSERT INTO test VALUES (4)") +}, +errorClass
(spark) branch master updated: [SPARK-46341][PS][TESTS][FOLLOWUP] Sort indices before dataframe comparison
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 35bc0b4ecd2 [SPARK-46341][PS][TESTS][FOLLOWUP] Sort indices before dataframe comparison 35bc0b4ecd2 is described below commit 35bc0b4ecd266de2eb5f53f92491a53efe9c3eef Author: Ruifeng Zheng AuthorDate: Mon Dec 11 10:32:14 2023 -0800 [SPARK-46341][PS][TESTS][FOLLOWUP] Sort indices before dataframe comparison ### What changes were proposed in this pull request? ### Why are the changes needed? Sort indices before dataframe comparison ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? ci and manually check ### Was this patch authored or co-authored using generative AI tooling? no Closes #44295 from zhengruifeng/ps_test_interpolate_sort_index. Authored-by: Ruifeng Zheng Signed-off-by: Hyukjin Kwon --- python/pyspark/pandas/tests/series/test_interpolate.py | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/python/pyspark/pandas/tests/series/test_interpolate.py b/python/pyspark/pandas/tests/series/test_interpolate.py index 0008dd2ee91..fad684822a6 100644 --- a/python/pyspark/pandas/tests/series/test_interpolate.py +++ b/python/pyspark/pandas/tests/series/test_interpolate.py @@ -24,7 +24,10 @@ from pyspark.testing.pandasutils import PandasOnSparkTestCase class SeriesInterpolateMixin: def _test_interpolate(self, pobj): psobj = ps.from_pandas(pobj) -self.assert_eq(psobj.interpolate(), pobj.interpolate()) +self.assert_eq( +psobj.interpolate().sort_index(), +pobj.interpolate().sort_index(), +) for limit, limit_direction, limit_area in [ (1, None, None), (2, "forward", "inside"), @@ -35,10 +38,10 @@ class SeriesInterpolateMixin: self.assert_eq( psobj.interpolate( limit=limit, limit_direction=limit_direction, limit_area=limit_area -), +).sort_index(), pobj.interpolate( 
limit=limit, limit_direction=limit_direction, limit_area=limit_area -), +).sort_index(), ) def test_interpolate(self):
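The fix matters because pandas-on-Spark does not guarantee row order, so comparing an interpolated result against pandas index-for-index can fail spuriously even when the data is identical. Sorting both sides by index first makes the comparison order-insensitive — here is the effect in plain pandas (a sketch of the idea, not the PySpark test harness):

```python
import pandas as pd

expected = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
# A distributed engine may return the same rows in a different order:
actual = pd.Series([3.0, 1.0, 2.0], index=[2, 0, 1])

print(actual.equals(expected))               # False: order-sensitive comparison
print(actual.sort_index().equals(expected))  # True: order-insensitive
```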
(spark) branch master updated: [SPARK-46347][PS][TESTS] Reorganize `RollingTests `
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new d5241ff2689 [SPARK-46347][PS][TESTS] Reorganize `RollingTests ` d5241ff2689 is described below commit d5241ff26892fa615b27ae39b0be1b8907f59f29 Author: Ruifeng Zheng AuthorDate: Mon Dec 11 10:29:49 2023 -0800 [SPARK-46347][PS][TESTS] Reorganize `RollingTests ` ### What changes were proposed in this pull request? Reorganize `RollingTests`, break it into multiple small files ### Why are the changes needed? to be consistent with Pandas's tests ### Does this PR introduce _any_ user-facing change? no, test only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes #44281 from zhengruifeng/ps_test_rolling. Authored-by: Ruifeng Zheng Signed-off-by: Hyukjin Kwon --- dev/sparktestsupport/modules.py| 16 +- .../test_parity_groupby_rolling.py}| 12 +- .../test_parity_groupby_rolling_adv.py}| 12 +- .../test_parity_groupby_rolling_count.py} | 12 +- .../connect/{ => window}/test_parity_rolling.py| 10 +- .../test_parity_rolling_adv.py}| 12 +- .../test_parity_rolling_count.py} | 12 +- .../test_parity_rolling_error.py} | 12 +- python/pyspark/pandas/tests/test_rolling.py| 317 - .../pandas/tests/window/test_groupby_rolling.py| 132 + .../tests/window/test_groupby_rolling_adv.py | 60 .../tests/window/test_groupby_rolling_count.py | 113 python/pyspark/pandas/tests/window/test_rolling.py | 91 ++ .../test_rolling_adv.py} | 33 ++- .../pandas/tests/window/test_rolling_count.py | 72 + .../test_rolling_error.py} | 31 +- 16 files changed, 578 insertions(+), 369 deletions(-) diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py index 68e9ed8101d..c77a34f1d22 100644 --- a/dev/sparktestsupport/modules.py +++ b/dev/sparktestsupport/modules.py @@ -750,7 +750,13 @@ 
pyspark_pandas = Module( "pyspark.pandas.tests.resample.test_series", "pyspark.pandas.tests.resample.test_timezone", "pyspark.pandas.tests.test_reshape", -"pyspark.pandas.tests.test_rolling", +"pyspark.pandas.tests.window.test_rolling", +"pyspark.pandas.tests.window.test_rolling_adv", +"pyspark.pandas.tests.window.test_rolling_count", +"pyspark.pandas.tests.window.test_rolling_error", +"pyspark.pandas.tests.window.test_groupby_rolling", +"pyspark.pandas.tests.window.test_groupby_rolling_adv", +"pyspark.pandas.tests.window.test_groupby_rolling_count", "pyspark.pandas.tests.test_scalars", "pyspark.pandas.tests.test_series_conversion", "pyspark.pandas.tests.test_series_datetime", @@ -1120,7 +1126,13 @@ pyspark_pandas_connect_part2 = Module( "pyspark.pandas.tests.connect.window.test_parity_ewm_error", "pyspark.pandas.tests.connect.window.test_parity_ewm_mean", "pyspark.pandas.tests.connect.window.test_parity_groupby_ewm_mean", -"pyspark.pandas.tests.connect.test_parity_rolling", +"pyspark.pandas.tests.connect.window.test_parity_rolling", +"pyspark.pandas.tests.connect.window.test_parity_rolling_adv", +"pyspark.pandas.tests.connect.window.test_parity_rolling_count", +"pyspark.pandas.tests.connect.window.test_parity_rolling_error", +"pyspark.pandas.tests.connect.window.test_parity_groupby_rolling", +"pyspark.pandas.tests.connect.window.test_parity_groupby_rolling_adv", + "pyspark.pandas.tests.connect.window.test_parity_groupby_rolling_count", "pyspark.pandas.tests.connect.test_parity_expanding", "pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby_rolling", "pyspark.pandas.tests.connect.computation.test_parity_missing_data", diff --git a/python/pyspark/pandas/tests/connect/test_parity_rolling.py b/python/pyspark/pandas/tests/connect/window/test_parity_groupby_rolling.py similarity index 76% copy from python/pyspark/pandas/tests/connect/test_parity_rolling.py copy to python/pyspark/pandas/tests/connect/window/test_parity_groupby_rolling.py index 
8318bed24f0..0a3e0b1358f 100644 --- a/python/pyspark/pandas/tests/connect/test_parity_rolling.py +++ b/python/pyspark/pandas/tests/connect/window/test_parity_groupby_rolling.py @@ -16,19 +16,21 @@ # import unittest -from pyspark.pandas.tests.test_rolling import RollingTestsMixin +from pyspark.pandas.tests.window.test_groupby_rolling import GroupByRollingMixin from pyspark.testing.connectutils
(spark) branch master updated: [SPARK-46355][SQL] XML: Close InputStreamReader on read completion
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new fc5868791a5 [SPARK-46355][SQL] XML: Close InputStreamReader on read completion

fc5868791a5 is described below

commit fc5868791a5d261fb416b376cc6eb39e8e29b90a
Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
AuthorDate: Mon Dec 11 09:41:03 2023 -0800

    [SPARK-46355][SQL] XML: Close InputStreamReader on read completion

    ### What changes were proposed in this pull request?

    The XML InputStreamReader needs to be closed on read completion so that InputStream resources are released promptly.

    ### Why are the changes needed?

    Not closing the reader may delay the recycling of the underlying InputStream's connection resources for cloud objects.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Unit test and manual testing

    ### Was this patch authored or co-authored using generative AI tooling?

    No

    Closes #44287 from sandip-db/xml-reader-close.
Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon
---
 .../spark/sql/catalyst/xml/StaxXmlParser.scala  | 21 +++++++++++++------
 .../spark/sql/catalyst/xml/XmlInferSchema.scala |  4 +++-
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
index 567074bbf12..373ef410f05 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
@@ -133,7 +133,9 @@ class StaxXmlParser(
       val isRootAttributesOnly = schema.fields.forall { f =>
         f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
       }
-      Some(convertObject(parser, schema, rootAttributes, isRootAttributesOnly))
+      val result = Some(convertObject(parser, schema, rootAttributes, isRootAttributesOnly))
+      parser.close()
+      result
     } catch {
       case e: SparkUpgradeException => throw e
       case e @ (_: RuntimeException | _: XMLStreamException | _: MalformedInputException
@@ -576,7 +578,7 @@ class StaxXmlParser(
 class XmlTokenizer(
     inputStream: InputStream,
     options: XmlOptions) {
-  private val reader = new InputStreamReader(inputStream, Charset.forName(options.charset))
+  private var reader = new InputStreamReader(inputStream, Charset.forName(options.charset))
   private var currentStartTag: String = _
   private var buffer = new StringBuilder()
   private val startTag = s"<${options.rowTag}>"
@@ -591,17 +593,24 @@ class XmlTokenizer(
    * @return whether it reads successfully
    */
   def next(): Option[String] = {
-    if (readUntilStartElement()) {
-      try {
+    try {
+      if (readUntilStartElement()) {
         buffer.append(currentStartTag)
         // Don't check whether the end element was found. Even if not, return everything
         // that was read, which will invariably cause a parse error later
         readUntilEndElement(currentStartTag.endsWith(">"))
-        return Some(buffer.toString())
-      } finally {
+        val str = buffer.toString()
         buffer = new StringBuilder()
+        return Some(str)
       }
+    } catch {
+      case e: Throwable =>
+        reader.close()
+        reader = null
+        throw e
     }
+    reader.close()
+    reader = null
     None
   }

diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
index f81c476cd38..52c98bc43eb 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
@@ -118,7 +118,9 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: Boolean)
       }
       val parser = StaxXmlParserUtils.filteredReader(xml)
       val rootAttributes = StaxXmlParserUtils.gatherRootAttributes(parser)
-      Some(inferObject(parser, rootAttributes))
+      val schema = Some(inferObject(parser, rootAttributes))
+      parser.close()
+      schema
     } catch {
       case NonFatal(_) if options.parseMode == PermissiveMode =>
         Some(StructType(Seq(StructField(options.columnNameOfCorruptRecord, StringType

-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org
(spark) branch master updated: [SPARK-46358][CONNECT] Simplify the condition check in the `ResponseValidator#verifyResponse`
dongjoon pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 051b1781827 [SPARK-46358][CONNECT] Simplify the condition check in the `ResponseValidator#verifyResponse`

051b1781827 is described below

commit 051b1781827dd3a4e1e95a5354caa747ff41ae1a
Author: yangjie01
AuthorDate: Mon Dec 11 08:53:10 2023 -0800

    [SPARK-46358][CONNECT] Simplify the condition check in the `ResponseValidator#verifyResponse`

    ### What changes were proposed in this pull request?

    This PR refactors the `verifyResponse` function in `ResponseValidator` as follows:

    1. The check `response.hasField(field)` is moved before `value` is obtained, so `value` is read only when `response.hasField(field)` is true, which matches the existing comments more closely.
    2. The `value != ""` condition in the case match is removed, because the branch is only entered when `value.nonEmpty` is true, and `value.nonEmpty` already covers `value != ""`.
    3. The `value != id` check is moved inside `case Some(id)`. An `IllegalStateException` is still thrown when the id exists and `value != id`, but `serverSideSessionId` is no longer reassigned when the id exists and `value == id`.
    4. The redundant `toString` when reassigning `serverSideSessionId` is removed, because `value` is already of type String.
    5. The no-op `case _` match is removed, because it becomes unreachable after the above changes.

    ### Why are the changes needed?

    Simplify the condition check in the `verifyResponse` function.

    ### Does this PR introduce _any_ user-facing change?

    No

    ### How was this patch tested?

    Pass GitHub Actions

    ### Was this patch authored or co-authored using generative AI tooling?
    No

    Closes #44291 from LuciferYang/Simplify-ResponseValidator-verifyResponse.

Lead-authored-by: yangjie01
Co-authored-by: YangJie
Signed-off-by: Dongjoon Hyun
---
 .../spark/sql/connect/client/ResponseValidator.scala | 16 +++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
index 67f29c727ef..22c5505e7d4 100644
--- a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
+++ b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
@@ -35,18 +35,20 @@ class ResponseValidator extends Logging {
     val field = response.getDescriptorForType.findFieldByName("server_side_session_id")
     // If the field does not exist, we ignore it. New / Old message might not contain it and this
     // behavior allows us to be compatible.
-    if (field != null) {
+    if (field != null && response.hasField(field)) {
       val value = response.getField(field).asInstanceOf[String]
       // Ignore, if the value is unset.
-      if (response.hasField(field) && value != null && value.nonEmpty) {
+      if (value != null && value.nonEmpty) {
         serverSideSessionId match {
-          case Some(id) if value != id && value != "" =>
-            throw new IllegalStateException(s"Server side session ID changed from $id to $value")
-          case _ if value != "" =>
+          case Some(id) =>
+            if (value != id) {
+              throw new IllegalStateException(
+                s"Server side session ID changed from $id to $value")
+            }
+          case _ =>
             synchronized {
-              serverSideSessionId = Some(value.toString)
+              serverSideSessionId = Some(value)
             }
-          case _ => // No-op
         }
       }
     } else {
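The simplified control flow — read the field only when it is present, latch the session id the first time a non-empty value is seen, and raise only on a genuine change — can be sketched in Python (the real implementation is Scala over protobuf messages; the dict-based message here is a stand-in):

```python
# Illustrative sketch of the simplified verifyResponse logic. A plain dict
# stands in for the protobuf response; "field missing" maps to a missing key.
class ResponseValidator:
    def __init__(self):
        self.server_side_session_id = None

    def verify_response(self, response):
        # If the field does not exist, ignore it: old and new messages may
        # differ, and this keeps the client compatible with both.
        if "server_side_session_id" in response:
            value = response["server_side_session_id"]
            # Ignore unset (empty) values.
            if value:
                if self.server_side_session_id is None:
                    # First non-empty value observed: latch it.
                    self.server_side_session_id = value
                elif value != self.server_side_session_id:
                    raise RuntimeError(
                        "Server side session ID changed from "
                        f"{self.server_side_session_id} to {value}")
        return response
```

Note that once the id is latched, a matching value is a no-op rather than a redundant reassignment, mirroring point 3 of the refactoring.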
(spark) branch master updated (bab884082c0 -> bb886abcc4a)
dongjoon pushed a change to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

    from bab884082c0 [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide
     add bb886abcc4a [SPARK-46356][BUILD] Upgrade `sbt-assembly` to 2.1.5, `sbt-checkstyle-plugin` to 4.0.1

No new revisions were added by this update.

Summary of changes:
 project/plugins.sbt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
(spark) branch branch-3.5 updated: [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide
maxgekk pushed a commit to branch branch-3.5 in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/branch-3.5 by this push:
     new 9c83bf501cc [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide

9c83bf501cc is described below

commit 9c83bf501ccefa7c6c0ba071f69e2528f3504854
Author: Amy Tsai
AuthorDate: Mon Dec 11 18:35:31 2023 +0300

    [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide

    ### What changes were proposed in this pull request?

    Fixes an error in the SQL migration guide documentation for `spark.sql.legacy.doLooseUpcast`. I corrected the config name and moved it to the section for migration from Spark 2.4 to 3.0, since the config was not available until Spark 3.0.

    ### Why are the changes needed?

    The config was documented as `spark.sql.legacy.looseUpcast` and was inaccurately included in the Spark 2.4 to Spark 2.4.1 section. I changed the docs to match what is implemented in https://github.com/apache/spark/blob/20df062d85e80422a55afae80ddbf2060f26516c/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3873

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Docs-only change.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44262 from amytsai-stripe/fix-migration-docs-loose-upcast.
Authored-by: Amy Tsai
Signed-off-by: Max Gekk
(cherry picked from commit bab884082c0f82e3f9053adac6c7e8a3fcfab11c)
Signed-off-by: Max Gekk
---
 docs/sql-migration-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 88635ee3d1f..2eba9500e90 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -251,6 +251,8 @@ license: |
   - In Spark 3.0, the column metadata will always be propagated in the API `Column.name` and `Column.as`. In Spark version 2.4 and earlier, the metadata of `NamedExpression` is set as the `explicitMetadata` for the new column at the time the API is called, it won't change even if the underlying `NamedExpression` changes metadata. To restore the behavior before Spark 3.0, you can use the API `as(alias: String, metadata: Metadata)` with explicit metadata.

+  - When turning a Dataset to another Dataset, Spark will up cast the fields in the original Dataset to the type of corresponding fields in the target DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throw NPE during execution. In Spark 3.0, the up cast is stricter and turning String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` will fail during analysis. To res [...]
+
 ### DDL Statements

 - In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per ANSI SQL standard. Certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean` are disallowed. A runtime exception is thrown if the value is out-of-range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are valid `Cast`. When inserting an o [...]
@@ -464,8 +466,6 @@ license: |
   need to specify a value with units like "30s" now, to avoid being interpreted as milliseconds; otherwise, the extremely short interval that results will likely cause applications to fail.

-  - When turning a Dataset to another Dataset, Spark will up cast the fields in the original Dataset to the type of corresponding fields in the target DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throw NPE during execution. In Spark 3.0, the up cast is stricter and turning String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` will fail during analysis. To res [...]
-
 ## Upgrading from Spark SQL 2.3 to 2.4

 - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
(spark) branch master updated: [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide
maxgekk pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new bab884082c0 [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide

bab884082c0 is described below

commit bab884082c0f82e3f9053adac6c7e8a3fcfab11c
Author: Amy Tsai
AuthorDate: Mon Dec 11 18:35:31 2023 +0300

    [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide

    ### What changes were proposed in this pull request?

    Fixes an error in the SQL migration guide documentation for `spark.sql.legacy.doLooseUpcast`. I corrected the config name and moved it to the section for migration from Spark 2.4 to 3.0, since the config was not available until Spark 3.0.

    ### Why are the changes needed?

    The config was documented as `spark.sql.legacy.looseUpcast` and was inaccurately included in the Spark 2.4 to Spark 2.4.1 section. I changed the docs to match what is implemented in https://github.com/apache/spark/blob/20df062d85e80422a55afae80ddbf2060f26516c/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3873

    ### Does this PR introduce _any_ user-facing change?

    No.

    ### How was this patch tested?

    Docs-only change.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #44262 from amytsai-stripe/fix-migration-docs-loose-upcast.

Authored-by: Amy Tsai
Signed-off-by: Max Gekk
---
 docs/sql-migration-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 9f9c15521c6..4e8e2422d7e 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -260,6 +260,8 @@ license: |
   - In Spark 3.0, the column metadata will always be propagated in the API `Column.name` and `Column.as`. In Spark version 2.4 and earlier, the metadata of `NamedExpression` is set as the `explicitMetadata` for the new column at the time the API is called, it won't change even if the underlying `NamedExpression` changes metadata. To restore the behavior before Spark 3.0, you can use the API `as(alias: String, metadata: Metadata)` with explicit metadata.

+  - When turning a Dataset to another Dataset, Spark will up cast the fields in the original Dataset to the type of corresponding fields in the target DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throw NPE during execution. In Spark 3.0, the up cast is stricter and turning String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` will fail during analysis. To res [...]
+
 ### DDL Statements

 - In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per ANSI SQL standard. Certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean` are disallowed. A runtime exception is thrown if the value is out-of-range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are valid `Cast`. When inserting an o [...]
@@ -473,8 +475,6 @@ license: |
   need to specify a value with units like "30s" now, to avoid being interpreted as milliseconds; otherwise, the extremely short interval that results will likely cause applications to fail.

-  - When turning a Dataset to another Dataset, Spark will up cast the fields in the original Dataset to the type of corresponding fields in the target DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. `Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and throw NPE during execution. In Spark 3.0, the up cast is stricter and turning String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` will fail during analysis. To res [...]
-
 ## Upgrading from Spark SQL 2.3 to 2.4

 - In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause `array_contains` function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism. This can cause some change in behavior and are illustrated in the table below.
(spark) branch master updated: [SPARK-45649][SQL] Unify the prepare framework for OffsetWindowFunctionFrame
beliefer pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 6a197efeb3c [SPARK-45649][SQL] Unify the prepare framework for OffsetWindowFunctionFrame

6a197efeb3c is described below

commit 6a197efeb3c1cca156cd615e990e35e82ce22ee3
Author: Jiaan Geng
AuthorDate: Mon Dec 11 19:48:14 2023 +0800

    [SPARK-45649][SQL] Unify the prepare framework for OffsetWindowFunctionFrame

    ### What changes were proposed in this pull request?

    Currently, the `prepare` implementations of all the `OffsetWindowFunctionFrame` subclasses share the same code logic, shown below:

    ```
    override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
      if (offset > rows.length) {
        fillDefaultValue(EmptyRow)
      } else {
        resetStates(rows)
        if (ignoreNulls) {
          ...
        } else {
          ...
        }
      }
    }
    ```

    This PR unifies the prepare framework for `OffsetWindowFunctionFrame`.

    **Why did https://github.com/apache/spark/pull/43507 introduce the NPE bug?**

    Consider, for example, a window group with offset 5 that has 4 elements. First, `resetStates` is not called because the offset is greater than 4. After that, the elements of the window group are iterated by visiting `input`, but `input` is null.

    This PR also adds two test cases where the absolute value of the offset is greater than the window group size.

    ### Why are the changes needed?

    Unify the prepare framework for `OffsetWindowFunctionFrame`.

    ### Does this PR introduce _any_ user-facing change?

    No. Internal update only.

    ### How was this patch tested?

    Existing test cases.

    ### Was this patch authored or co-authored using generative AI tooling?

    No.

    Closes #43958 from beliefer/SPARK-45649.
Authored-by: Jiaan Geng
Signed-off-by: Jiaan Geng
---
 .../sql/execution/window/WindowFunctionFrame.scala | 114 ++++++++---------
 .../spark/sql/DataFrameWindowFramesSuite.scala     |  24 +++++
 2 files changed, 76 insertions(+), 62 deletions(-)

diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala
index 6cea838311a..4aa7444c407 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala
@@ -87,7 +87,8 @@ abstract class OffsetWindowFunctionFrameBase(
     expressions: Array[OffsetWindowFunction],
     inputSchema: Seq[Attribute],
     newMutableProjection: (Seq[Expression], Seq[Attribute]) => MutableProjection,
-    offset: Int)
+    offset: Int,
+    ignoreNulls: Boolean)
   extends WindowFunctionFrame {

   /** Rows of the partition currently being processed. */
@@ -141,6 +142,8 @@ abstract class OffsetWindowFunctionFrameBase(
   // is not null.
   protected var skippedNonNullCount = 0

+  protected val absOffset = Math.abs(offset)
+
   // Reset the states by the data of the new partition.
   protected def resetStates(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
     input = rows
@@ -176,6 +179,33 @@ abstract class OffsetWindowFunctionFrameBase(
     }
   }

+  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
+    resetStates(rows)
+    if (absOffset > rows.length) {
+      fillDefaultValue(EmptyRow)
+    } else {
+      if (ignoreNulls) {
+        prepareForIgnoreNulls()
+      } else {
+        prepareForRespectNulls()
+      }
+    }
+  }
+
+  protected def prepareForIgnoreNulls(): Unit = findNextRowWithNonNullInput()
+
+  protected def prepareForRespectNulls(): Unit = {
+    // drain the first few rows if offset is larger than one
+    while (inputIndex < offset) {
+      nextSelectedRow = WindowFunctionFrame.getNextOrNull(inputIterator)
+      inputIndex += 1
+    }
+    // `inputIndex` starts as 0, but the `offset` can be negative and we may not enter the
+    // while loop at all. We need to make sure `inputIndex` ends up as `offset` to meet the
+    // assumption of the write path.
+    inputIndex = offset
+  }
+
   override def currentLowerBound(): Int = throw new UnsupportedOperationException()

   override def currentUpperBound(): Int = throw new UnsupportedOperationException()
@@ -197,25 +227,7 @@ class FrameLessOffsetWindowFunctionFrame(
     offset: Int,
     ignoreNulls: Boolean = false)
   extends OffsetWindowFunctionFrameBase(
-    target, ordinal, expressions, inputSchema, newMutableProjection, offset) {
-
-  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
-    resetStates(rows)
-    if (ignoreNulls) {
-      if (Math.abs(offset) > rows.length) {
-
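The unified `prepare` framework — always reset the per-partition state first, then either fill the default value when the absolute offset exceeds the partition size or run one of the two null-handling preparation paths — can be sketched in Python (the actual code is Scala; class and method names below are illustrative, not Spark's):

```python
# Sketch of the unified prepare() ordering from this patch: resetStates runs
# unconditionally before the offset check, so `input` is always assigned.
# The pre-patch bug skipped resetStates for oversized offsets, leaving the
# partition reference null when rows were later iterated.
class OffsetFrameBase:
    def __init__(self, offset, ignore_nulls=False):
        self.offset = offset
        self.ignore_nulls = ignore_nulls
        self.abs_offset = abs(offset)
        self.input = None            # rows of the current partition
        self.filled_default = False  # whether the default value was emitted

    def prepare(self, rows):
        # Always reset first, so `input` is valid for the write path.
        self._reset_states(rows)
        if self.abs_offset > len(rows):
            # Offset reaches past the partition: produce the default value.
            self._fill_default_value()
        elif self.ignore_nulls:
            self._prepare_for_ignore_nulls()
        else:
            self._prepare_for_respect_nulls()

    def _reset_states(self, rows):
        self.input = rows
        self.filled_default = False

    def _fill_default_value(self):
        self.filled_default = True

    def _prepare_for_ignore_nulls(self):
        pass  # real code: skip forward to the next non-null input row

    def _prepare_for_respect_nulls(self):
        pass  # real code: drain the first `offset` rows of the partition
```

The two hook methods are the subclass extension points, mirroring how `prepareForIgnoreNulls` and `prepareForRespectNulls` let `FrameLessOffsetWindowFunctionFrame` and friends drop their duplicated `prepare` overrides.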