(spark) branch master updated: [SPARK-46360][PYTHON] Enhance error message debugging with new `getMessage` API

2023-12-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8d64cb4144d1 [SPARK-46360][PYTHON] Enhance error message debugging 
with new `getMessage` API
8d64cb4144d1 is described below

commit 8d64cb4144d17107fd3758d2a46430439203c7ad
Author: Haejoon Lee 
AuthorDate: Mon Dec 11 21:03:11 2023 -0800

[SPARK-46360][PYTHON] Enhance error message debugging with new `getMessage` 
API

### What changes were proposed in this pull request?

This PR proposes to introduce `getMessage` to provide a standardized way 
for users to obtain a concise and clear error message.

### Why are the changes needed?

Previously, extracting a simple and informative error message in PySpark 
was not straightforward. The internal `ErrorClassesReader.get_error_message` 
method was often used, but for JVM-originated errors not defined in 
`error_classes.py`, obtaining a succinct error message was challenging.

The new `getMessage` API harmonizes error message retrieval across PySpark, 
leveraging existing JVM implementations to ensure consistency and clarity in 
the messages presented to the users.

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces a `getMessage` for directly accessing simplified 
error messages in PySpark.

- **Before**: No official API for simplified error messages; excessive 
details in the error output:
```python
from pyspark.sql.utils import AnalysisException

try:
    spark.sql("""SELECT a""")
except AnalysisException as e:
    str(e)
    # "[UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name `a` cannot be resolved.  SQLSTATE: 42703; line 1 pos 7;\n'Project ['a]\n+- OneRowRelation\n"
```

- **After**: The `getMessage` API provides streamlined, user-friendly error 
messages:
```python
from pyspark.sql.utils import AnalysisException

try:
    spark.sql("""SELECT a""")
except AnalysisException as e:
    e.getMessage()
    # '[UNRESOLVED_COLUMN.WITHOUT_SUGGESTION] A column, variable, or function parameter with name `a` cannot be resolved.  SQLSTATE: 42703'
```
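
For context, `getMessage` complements the existing error APIs (`getErrorClass`, `getSqlState`, `getMessageParameters`) referenced in the docstrings below. A minimal sketch combining them; the output comments are illustrative:
```python
from pyspark.sql.utils import AnalysisException

try:
    spark.sql("SELECT a")
except AnalysisException as e:
    print(e.getErrorClass())         # e.g. 'UNRESOLVED_COLUMN.WITHOUT_SUGGESTION'
    print(e.getSqlState())           # e.g. '42703'
    print(e.getMessageParameters())  # dict of message parameters for the error class
    print(e.getMessage())            # the concise message shown above
```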

### How was this patch tested?

Added UTs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44292 from itholic/getMessage.

Authored-by: Haejoon Lee 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/errors/exceptions/base.py | 19 ++-
 python/pyspark/errors/exceptions/captured.py | 18 ++
 python/pyspark/sql/tests/test_utils.py   |  8 
 3 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/python/pyspark/errors/exceptions/base.py 
b/python/pyspark/errors/exceptions/base.py
index b60800da3ff8..e40e1b2e93cb 100644
--- a/python/pyspark/errors/exceptions/base.py
+++ b/python/pyspark/errors/exceptions/base.py
@@ -60,6 +60,7 @@ class PySparkException(Exception):
 
 See Also
 
+:meth:`PySparkException.getMessage`
 :meth:`PySparkException.getMessageParameters`
 :meth:`PySparkException.getSqlState`
 """
@@ -74,6 +75,7 @@ class PySparkException(Exception):
 See Also
 
 :meth:`PySparkException.getErrorClass`
+:meth:`PySparkException.getMessage`
 :meth:`PySparkException.getSqlState`
 """
 return self._message_parameters
@@ -89,13 +91,28 @@ class PySparkException(Exception):
 See Also
 
 :meth:`PySparkException.getErrorClass`
+:meth:`PySparkException.getMessage`
 :meth:`PySparkException.getMessageParameters`
 """
 return None
 
+def getMessage(self) -> str:
+"""
+Returns full error message.
+
+.. versionadded:: 4.0.0
+
+See Also
+
+:meth:`PySparkException.getErrorClass`
+:meth:`PySparkException.getMessageParameters`
+:meth:`PySparkException.getSqlState`
+"""
+return f"[{self.getErrorClass()}] {self._message}"
+
 def __str__(self) -> str:
 if self.getErrorClass() is not None:
-return f"[{self.getErrorClass()}] {self._message}"
+return self.getMessage()
 else:
 return self._message
 
diff --git a/python/pyspark/errors/exceptions/captured.py 
b/python/pyspark/errors/exceptions/captured.py
index ec987e0854ea..4164bb7b428d 100644
--- a/python/pyspark/errors/exceptions/captured.py
+++ b/python/pyspark/errors/exceptions/captured.py
@@ -118,6 +118,24 @@ class CapturedException(PySparkException):
 else:
 return None
 
+def 

(spark) branch master updated: [SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer types

2023-12-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2a49feeb5d7 [SPARK-43427][PROTOBUF] spark protobuf: allow upcasting 
unsigned integer types
2a49feeb5d7 is described below

commit 2a49feeb5d727552758a75fdcfbc49e8f6eed72f
Author: Parth Upadhyay 
AuthorDate: Mon Dec 11 16:37:04 2023 -0800

[SPARK-43427][PROTOBUF] spark protobuf: allow upcasting unsigned integer 
types

### What changes were proposed in this pull request?

JIRA: https://issues.apache.org/jira/browse/SPARK-43427

Protobuf supports unsigned integer types, including uint32 and uint64. When 
deserializing protobuf values with fields of these types, `from_protobuf` 
currently maps them to the following Spark types:

```
uint32 => IntegerType
uint64 => LongType
```

IntegerType and LongType are 
[signed](https://spark.apache.org/docs/latest/sql-ref-datatypes.html) integer 
types, so this can lead to confusing results. Namely, if a uint32 value in a 
stored proto is above 2^31, or a uint64 value is above 2^63, its binary 
representation will have a 1 in the highest bit, which, when interpreted as a 
signed integer, is negative (i.e. it overflows). No information is lost, as 
`IntegerType` and `LongType` contain 32 and 64 bits respectively, however [...]
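
To make the overflow concrete, here is a small plain-Python sketch (independent of Spark) showing the same 64-bit pattern read as unsigned versus signed:
```python
import struct

v = 9223372036854775809  # 2**63 + 1: fits in uint64, but not in a signed 64-bit integer

raw = struct.pack(">Q", v)            # encode as unsigned 64-bit big-endian
signed = struct.unpack(">q", raw)[0]  # reinterpret the same bytes as signed 64-bit
print(signed)                         # -9223372036854775807
```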

In this PR, we add an option (`upcast.unsigned.ints`) to allow upcasting 
unsigned integer types into a larger integer type that can represent them 
natively, i.e.

```
uint32 => LongType
uint64 => Decimal(20, 0)
```

I added an option so that it doesn't break any existing clients.

**Example of current behavior**

Consider a protobuf message like:

```
syntax = "proto3";

message Test {
  uint64 val = 1;
}
```

If we compile the above and then generate a message with a value for `val` 
above 2^63:

```
import test_pb2

s = test_pb2.Test()
s.val = 9223372036854775809 # 2**63 + 1
serialized = s.SerializeToString()
print(serialized)
```

This generates the binary representation:

b'\x08\x81\x80\x80\x80\x80\x80\x80\x80\x80\x01'

Then, deserializing this using `from_protobuf`, we can see that it is 
represented as a negative number. I did this in a notebook so it's easier to 
see, but it could be reproduced in a Scala test as well:


![image](https://github.com/apache/spark/assets/1002986/7144e6a9-3f43-455e-94c3-9065ae88206e)

**Precedent**
I believe that unsigned integer types in Parquet are deserialized in a 
similar manner, i.e. put into a larger type so that the unsigned representation 
fits natively; see https://issues.apache.org/jira/browse/SPARK-34817 and 
https://github.com/apache/spark/pull/31921. So an option to get similar 
behavior would be useful.
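
With the new option, a caller opts in explicitly. A minimal PySpark sketch, where the descriptor path and the binary column `value` are hypothetical (the option name `upcast.unsigned.ints` comes from this PR, and it is off by default so existing clients are unaffected):
```python
from pyspark.sql.protobuf.functions import from_protobuf

# df is assumed to have a binary column `value` holding serialized Test messages
parsed = df.select(
    from_protobuf(
        df.value,
        "Test",
        descFilePath="/path/to/test.desc",         # hypothetical descriptor set file
        options={"upcast.unsigned.ints": "true"},  # opt in to the wider types
    ).alias("msg")
)
parsed.select("msg.val").printSchema()  # uint64 `val` surfaces as decimal(20,0) with the option
```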

### Why are the changes needed?
Improve unsigned integer deserialization behavior.

### Does this PR introduce any user-facing change?
Yes, adds a new option.

### How was this patch tested?
Unit Testing

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43773 from 
justaparth/parth/43427-add-option-to-expand-unsigned-integers.

Authored-by: Parth Upadhyay 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/protobuf/ProtobufDeserializer.scala  | 12 ++
 .../spark/sql/protobuf/ProtobufSerializer.scala| 11 +-
 .../spark/sql/protobuf/utils/ProtobufOptions.scala | 12 ++
 .../sql/protobuf/utils/SchemaConverters.scala  | 18 -
 .../sql/protobuf/ProtobufFunctionsSuite.scala  | 46 ++
 5 files changed, 96 insertions(+), 3 deletions(-)

diff --git 
a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala
 
b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala
index a46baf51379..45f3419edf9 100644
--- 
a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala
+++ 
b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDeserializer.scala
@@ -193,6 +193,11 @@ private[sql] class ProtobufDeserializer(
   case (INT, ShortType) =>
 (updater, ordinal, value) => updater.setShort(ordinal, 
value.asInstanceOf[Short])
 
+  case (INT, LongType) =>
+(updater, ordinal, value) =>
+  updater.setLong(
+ordinal,
+Integer.toUnsignedLong(value.asInstanceOf[Int]))
   case  (
 MESSAGE | BOOLEAN | INT | FLOAT | DOUBLE | LONG | STRING | ENUM | 
BYTE_STRING,
 ArrayType(dataType: DataType, containsNull)) if protoType.isRepeated =>
@@ -201,6 +206,13 @@ private[sql] class ProtobufDeserializer(
   case (LONG, 

(spark) branch branch-3.4 updated: [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`

2023-12-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new b813e2e100f [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` 
drivers in `MasterPage`
b813e2e100f is described below

commit b813e2e100faf7bab88c23ba9bba6e3197b169aa
Author: Dongjoon Hyun 
AuthorDate: Mon Dec 11 15:05:21 2023 -0800

[SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in 
`MasterPage`

### What changes were proposed in this pull request?

This PR aims to remove `kill` hyperlink from `RELAUNCHING` drivers in 
`MasterPage`.

### Why are the changes needed?

Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` 
hyperlinks in the `Completed Drivers` table.

![Screenshot 2023-12-11 at 1 02 29 
PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012)

However, this is a bug because the driver was already terminated by 
definition. A newly relaunched driver has an independent ID and has no 
relationship with the previously terminated one.


https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27

If we click the `kill` link, `Master` always complains like the following:
```
23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-0
23/12/11 21:25:50 WARN Master: Driver 202312112113-0 has already 
finished or does not exist
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44301 from dongjoon-hyun/SPARK-46369.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2)
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit ac031d68a01f14cc73f05e83a790a6787aa6453d)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index a71eb33a2fe..e7e90aa0a37 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -322,8 +322,7 @@ private[ui] class MasterPage(parent: MasterWebUI) extends 
WebUIPage("") {
   private def driverRow(driver: DriverInfo, showDuration: Boolean): Seq[Node] 
= {
 val killLink = if (parent.killEnabled &&
   (driver.state == DriverState.RUNNING ||
-driver.state == DriverState.SUBMITTED ||
-driver.state == DriverState.RELAUNCHING)) {
+driver.state == DriverState.SUBMITTED)) {
   val confirm =
 s"if (window.confirm('Are you sure you want to kill driver 
${driver.id} ?')) " +
   "{ this.parentNode.submit(); return true; } else { return false; }"





(spark) branch branch-3.5 updated: [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in `MasterPage`

2023-12-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new ac031d68a01 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` 
drivers in `MasterPage`
ac031d68a01 is described below

commit ac031d68a01f14cc73f05e83a790a6787aa6453d
Author: Dongjoon Hyun 
AuthorDate: Mon Dec 11 15:05:21 2023 -0800

[SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` drivers in 
`MasterPage`

### What changes were proposed in this pull request?

This PR aims to remove `kill` hyperlink from `RELAUNCHING` drivers in 
`MasterPage`.

### Why are the changes needed?

Since Apache Spark 1.4.0 (SPARK-5495), `RELAUNCHING` drivers have `kill` 
hyperlinks in the `Completed Drivers` table.

![Screenshot 2023-12-11 at 1 02 29 
PM](https://github.com/apache/spark/assets/9700541/38f4bf08-efb9-47e5-8a7a-f7d127429012)

However, this is a bug because the driver was already terminated by 
definition. A newly relaunched driver has an independent ID and has no 
relationship with the previously terminated one.


https://github.com/apache/spark/blob/7db85642600b1e3b39ca11e41d4e3e0bf1c8962b/core/src/main/scala/org/apache/spark/deploy/master/DriverState.scala#L27

If we click the `kill` link, `Master` always complains like the following:
```
23/12/11 21:25:50 INFO Master: Asked to kill driver 202312112113-0
23/12/11 21:25:50 WARN Master: Driver 202312112113-0 has already 
finished or does not exist
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44301 from dongjoon-hyun/SPARK-46369.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit e434c9f0d5792b7af43c87dd6145fd8a6a04d8e2)
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala 
b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
index a71eb33a2fe..e7e90aa0a37 100644
--- a/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala
@@ -322,8 +322,7 @@ private[ui] class MasterPage(parent: MasterWebUI) extends 
WebUIPage("") {
   private def driverRow(driver: DriverInfo, showDuration: Boolean): Seq[Node] 
= {
 val killLink = if (parent.killEnabled &&
   (driver.state == DriverState.RUNNING ||
-driver.state == DriverState.SUBMITTED ||
-driver.state == DriverState.RELAUNCHING)) {
+driver.state == DriverState.SUBMITTED)) {
   val confirm =
 s"if (window.confirm('Are you sure you want to kill driver 
${driver.id} ?')) " +
   "{ this.parentNode.submit(); return true; } else { return false; }"





(spark) branch master updated (3e0808c33f1 -> e434c9f0d57)

2023-12-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3e0808c33f1 [SPARK-46351][SQL] Require an error class in 
`AnalysisException`
 add e434c9f0d57 [SPARK-46369][CORE] Remove `kill` link from `RELAUNCHING` 
drivers in `MasterPage`

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/deploy/master/ui/MasterPage.scala | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)





(spark) branch master updated: [SPARK-46351][SQL] Require an error class in `AnalysisException`

2023-12-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3e0808c33f1 [SPARK-46351][SQL] Require an error class in 
`AnalysisException`
3e0808c33f1 is described below

commit 3e0808c33f185c13808ce2d547ce9ba0057d31a6
Author: Max Gekk 
AuthorDate: Tue Dec 12 01:29:26 2023 +0300

[SPARK-46351][SQL] Require an error class in `AnalysisException`

### What changes were proposed in this pull request?
In this PR, I propose to allow creating `AnalysisException` only with an error 
class, by making the constructor that takes a `message` protected. This way, 
only sub-classes can create an `AnalysisException` by passing a `message`, 
while all other callers must provide an error class.

### Why are the changes needed?
To improve the user experience with Spark SQL by unifying error exceptions: the 
final goal is that every Spark exception should contain an error class.

### Does this PR introduce _any_ user-facing change?
No, since user code shouldn't throw `AnalysisException` itself; however, code 
that depends on error message formats may be affected.

### How was this patch tested?
By existing test suites like:
```
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite"
```
and the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveDDLSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44277 from MaxGekk/protected-AnalysisException.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../src/main/resources/error/error-classes.json| 253 +
 .../apache/spark/sql/avro/AvroDataToCatalyst.scala |  19 +-
 .../org/apache/spark/sql/test/SQLHelper.scala  |   4 +-
 .../connect/client/GrpcExceptionConverter.scala|  22 +-
 .../spark/sql/kafka010/KafkaSourceProvider.scala   |   9 +-
 .../apache/spark/sql/kafka010/KafkaWriter.scala|   5 +-
 .../org/apache/spark/sql/AnalysisException.scala   |  14 +-
 .../apache/spark/sql/catalyst/SQLConfHelper.scala  |   4 +-
 .../catalyst/analysis/ColumnResolutionHelper.scala |   8 +-
 .../ResolveRowLevelCommandAssignments.scala|   4 +-
 .../catalyst/analysis/RewriteMergeIntoTable.scala  |  12 +-
 .../catalyst/expressions/V2ExpressionUtils.scala   |  12 +-
 .../spark/sql/catalyst/planning/patterns.scala |   5 +-
 .../sql/catalyst/plans/logical/v2Commands.scala|   8 +-
 .../org/apache/spark/sql/util/SchemaUtils.scala|  25 +-
 .../catalyst/catalog/ExternalCatalogSuite.scala|   4 +-
 .../spark/sql/RelationalGroupedDataset.scala   |   4 +-
 .../spark/sql/execution/SparkStrategies.scala  |   3 +-
 .../spark/sql/execution/aggregate/AggUtils.scala   |   5 +-
 .../execution/datasources/FileSourceStrategy.scala |  13 +-
 .../parquet/ParquetSchemaConverter.scala   |   4 +-
 .../spark/sql/execution/datasources/rules.scala|   5 +-
 .../execution/datasources/v2/MergeRowsExec.scala   |   4 +-
 .../datasources/v2/state/utils/SchemaUtil.scala|   6 +-
 .../RowLevelOperationRuntimeGroupFiltering.scala   |   5 +-
 .../execution/streaming/WatermarkPropagator.scala  |   6 +-
 .../execution/streaming/statefulOperators.scala|   6 +-
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala   |   5 +-
 .../sql/connector/DataSourceV2FunctionSuite.scala  |  10 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  79 +--
 .../connector/FileDataSourceV2FallBackSuite.scala  |  28 ++-
 .../spark/sql/execution/QueryExecutionSuite.scala  |   2 +-
 .../execution/datasources/orc/OrcFilterSuite.scala |   3 +-
 .../sql/execution/datasources/orc/OrcTest.scala|   3 +-
 .../datasources/parquet/ParquetFilterSuite.scala   |   3 +-
 .../spark/sql/hive/HiveExternalCatalog.scala   |  32 ++-
 .../org/apache/spark/sql/hive/HiveInspectors.scala |  15 +-
 .../spark/sql/hive/HiveMetastoreCatalog.scala  |  17 +-
 .../spark/sql/hive/HiveSessionStateBuilder.scala   |   7 +-
 .../org/apache/spark/sql/hive/HiveStrategies.scala |   8 +-
 .../sql/hive/execution/V1WritesHiveUtils.scala |   4 +-
 .../spark/sql/hive/HiveMetastoreCatalogSuite.scala |  11 +-
 .../spark/sql/hive/execution/HiveDDLSuite.scala|  72 +++---
 .../spark/sql/hive/execution/HiveQuerySuite.scala  |   8 +-
 .../sql/hive/execution/Hive_2_1_DDLSuite.scala |   8 +-
 45 files changed, 611 insertions(+), 173 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index 62d10c0d34c..d52ffc011b7 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -6705,6 +6705,259 @@
   "Failed to get block , which is not a shuffle block"
 ]
   },
+  

(spark) branch master updated: [SPARK-46273][SQL] Support INSERT INTO/OVERWRITE using DSv2 sources

2023-12-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7db85642600 [SPARK-46273][SQL] Support INSERT INTO/OVERWRITE using 
DSv2 sources
7db85642600 is described below

commit 7db85642600b1e3b39ca11e41d4e3e0bf1c8962b
Author: allisonwang-db 
AuthorDate: Mon Dec 11 10:32:48 2023 -0800

[SPARK-46273][SQL] Support INSERT INTO/OVERWRITE using DSv2 sources

### What changes were proposed in this pull request?

This PR adds test cases for INSERT INTO and INSERT OVERWRITE queries with 
DSv2 sources.
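
For reference, a minimal PySpark sketch of the statement shapes the new tests exercise; `parquet` stands in for the custom DSv2 source used in `DataSourceV2Suite`, and the table name is illustrative:
```python
spark.sql("CREATE TABLE ins_demo (x INT, y INT) USING parquet")
spark.sql("INSERT INTO ins_demo VALUES (1, 2)")
spark.sql("INSERT INTO ins_demo (y, x) VALUES (3, 2)")           # insert by column list
spark.sql("INSERT INTO ins_demo BY NAME VALUES (5, 4) t(y, x)")  # insert by name
spark.sql("INSERT OVERWRITE TABLE ins_demo VALUES (0, 0)")       # overwrite
spark.sql("SELECT * FROM ins_demo").show()
```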

### Why are the changes needed?

To improve test coverage

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44213 from allisonwang-db/spark-46273-dsv2-insert.

Authored-by: allisonwang-db 
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/errors/QueryCompilationErrors.scala  |   4 +-
 .../datasources/v2/TableCapabilityCheck.scala  |   2 +-
 .../spark/sql/connector/DataSourceV2Suite.scala| 127 -
 3 files changed, 129 insertions(+), 4 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index b7e10dc194a..1195e9dd78d 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -926,8 +926,8 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase with Compilat
 unsupportedTableOperationError(table.name(), "truncate in batch mode")
   }
 
-  def unsupportedOverwriteByFilterInBatchModeError(table: Table): Throwable = {
-unsupportedTableOperationError(table.name(), "overwrite by filter in batch 
mode")
+  def unsupportedOverwriteByFilterInBatchModeError(name: String): Throwable = {
+unsupportedTableOperationError(name, "overwrite by filter in batch mode")
   }
 
   def catalogOperationNotSupported(catalog: CatalogPlugin, operation: String): 
Throwable = {
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala
index a3afaa36ab9..b1a93addc80 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/TableCapabilityCheck.scala
@@ -63,7 +63,7 @@ object TableCapabilityCheck extends (LogicalPlan => Unit) {
   case _ =>
 if (!supportsBatchWrite(r.table) || 
!r.table.supports(OVERWRITE_BY_FILTER)) {
   throw 
QueryCompilationErrors.unsupportedOverwriteByFilterInBatchModeError(
-r.table)
+   r.name)
 }
 }
 
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
index f2e518e8acc..6e365e1d605 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/connector/DataSourceV2Suite.scala
@@ -18,6 +18,7 @@
 package org.apache.spark.sql.connector
 
 import java.io.File
+import java.util
 import java.util.OptionalLong
 
 import test.org.apache.spark.sql.connector._
@@ -723,8 +724,116 @@ class DataSourceV2Suite extends QueryTest with 
SharedSparkSession with AdaptiveS
   }
 }
   }
-}
 
+  test("SPARK-46273: insert into") {
+val cls = classOf[SupportsExternalMetadataDataSource]
+withTable("test") {
+  sql(
+s"""
+   |CREATE TABLE test (x INT, y INT) USING ${cls.getName}
+   |""".stripMargin)
+  sql("INSERT INTO test VALUES (1, 2)")
+  checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2)))
+  // Insert by name
+  sql("INSERT INTO test(y, x) VALUES (3, 2)")
+  checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2), Row(2, 3)))
+  // Can be casted automatically
+  sql("INSERT INTO test(y, x) VALUES (4L, 3L)")
+  checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2), Row(2, 3), Row(3, 
4)))
+  // Insert values by name
+  sql("INSERT INTO test BY NAME VALUES (5, 4) t(y, x)")
+  checkAnswer(sql("SELECT * FROM test"), Seq(Row(1, 2), Row(2, 3), Row(3, 
4), Row(4, 5)))
+  // Missing columns
+  checkError(
+exception = intercept[AnalysisException] {
+  sql(s"INSERT INTO test VALUES (4)")
+},
+errorClass 

(spark) branch master updated: [SPARK-46341][PS][TESTS][FOLLOWUP] Sort indices before dataframe comparison

2023-12-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 35bc0b4ecd2 [SPARK-46341][PS][TESTS][FOLLOWUP] Sort indices before 
dataframe comparison
35bc0b4ecd2 is described below

commit 35bc0b4ecd266de2eb5f53f92491a53efe9c3eef
Author: Ruifeng Zheng 
AuthorDate: Mon Dec 11 10:32:14 2023 -0800

[SPARK-46341][PS][TESTS][FOLLOWUP] Sort indices before dataframe comparison

### What changes were proposed in this pull request?

### Why are the changes needed?
Sort indices before dataframe comparison
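
pandas-on-Spark does not guarantee a deterministic row order in its results, so both sides are sorted on the index before comparison. A minimal sketch of the pattern, with illustrative values:
```python
import pandas as pd
import pyspark.pandas as ps

pser = pd.Series([1.0, None, 3.0])
psser = ps.from_pandas(pser)

# sort on the index so the comparison does not depend on row order
assert psser.interpolate().sort_index().to_list() == pser.interpolate().sort_index().to_list()
```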

### Does this PR introduce _any_ user-facing change?
no, test-only

### How was this patch tested?
ci and manually check

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44295 from zhengruifeng/ps_test_interpolate_sort_index.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/pandas/tests/series/test_interpolate.py | 9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/pandas/tests/series/test_interpolate.py 
b/python/pyspark/pandas/tests/series/test_interpolate.py
index 0008dd2ee91..fad684822a6 100644
--- a/python/pyspark/pandas/tests/series/test_interpolate.py
+++ b/python/pyspark/pandas/tests/series/test_interpolate.py
@@ -24,7 +24,10 @@ from pyspark.testing.pandasutils import PandasOnSparkTestCase
 class SeriesInterpolateMixin:
 def _test_interpolate(self, pobj):
 psobj = ps.from_pandas(pobj)
-self.assert_eq(psobj.interpolate(), pobj.interpolate())
+self.assert_eq(
+psobj.interpolate().sort_index(),
+pobj.interpolate().sort_index(),
+)
 for limit, limit_direction, limit_area in [
 (1, None, None),
 (2, "forward", "inside"),
@@ -35,10 +38,10 @@ class SeriesInterpolateMixin:
 self.assert_eq(
 psobj.interpolate(
 limit=limit, limit_direction=limit_direction, 
limit_area=limit_area
-),
+).sort_index(),
 pobj.interpolate(
 limit=limit, limit_direction=limit_direction, 
limit_area=limit_area
-),
+).sort_index(),
 )
 
 def test_interpolate(self):





(spark) branch master updated: [SPARK-46347][PS][TESTS] Reorganize `RollingTests `

2023-12-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d5241ff2689 [SPARK-46347][PS][TESTS] Reorganize `RollingTests `
d5241ff2689 is described below

commit d5241ff26892fa615b27ae39b0be1b8907f59f29
Author: Ruifeng Zheng 
AuthorDate: Mon Dec 11 10:29:49 2023 -0800

[SPARK-46347][PS][TESTS] Reorganize `RollingTests `

### What changes were proposed in this pull request?
Reorganize `RollingTests`, break it into multiple small files

### Why are the changes needed?
to be consistent with Pandas's tests

### Does this PR introduce _any_ user-facing change?
no, test only

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44281 from zhengruifeng/ps_test_rolling.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 dev/sparktestsupport/modules.py|  16 +-
 .../test_parity_groupby_rolling.py}|  12 +-
 .../test_parity_groupby_rolling_adv.py}|  12 +-
 .../test_parity_groupby_rolling_count.py}  |  12 +-
 .../connect/{ => window}/test_parity_rolling.py|  10 +-
 .../test_parity_rolling_adv.py}|  12 +-
 .../test_parity_rolling_count.py}  |  12 +-
 .../test_parity_rolling_error.py}  |  12 +-
 python/pyspark/pandas/tests/test_rolling.py| 317 -
 .../pandas/tests/window/test_groupby_rolling.py| 132 +
 .../tests/window/test_groupby_rolling_adv.py   |  60 
 .../tests/window/test_groupby_rolling_count.py | 113 
 python/pyspark/pandas/tests/window/test_rolling.py |  91 ++
 .../test_rolling_adv.py}   |  33 ++-
 .../pandas/tests/window/test_rolling_count.py  |  72 +
 .../test_rolling_error.py} |  31 +-
 16 files changed, 578 insertions(+), 369 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index 68e9ed8101d..c77a34f1d22 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -750,7 +750,13 @@ pyspark_pandas = Module(
 "pyspark.pandas.tests.resample.test_series",
 "pyspark.pandas.tests.resample.test_timezone",
 "pyspark.pandas.tests.test_reshape",
-"pyspark.pandas.tests.test_rolling",
+"pyspark.pandas.tests.window.test_rolling",
+"pyspark.pandas.tests.window.test_rolling_adv",
+"pyspark.pandas.tests.window.test_rolling_count",
+"pyspark.pandas.tests.window.test_rolling_error",
+"pyspark.pandas.tests.window.test_groupby_rolling",
+"pyspark.pandas.tests.window.test_groupby_rolling_adv",
+"pyspark.pandas.tests.window.test_groupby_rolling_count",
 "pyspark.pandas.tests.test_scalars",
 "pyspark.pandas.tests.test_series_conversion",
 "pyspark.pandas.tests.test_series_datetime",
@@ -1120,7 +1126,13 @@ pyspark_pandas_connect_part2 = Module(
 "pyspark.pandas.tests.connect.window.test_parity_ewm_error",
 "pyspark.pandas.tests.connect.window.test_parity_ewm_mean",
 "pyspark.pandas.tests.connect.window.test_parity_groupby_ewm_mean",
-"pyspark.pandas.tests.connect.test_parity_rolling",
+"pyspark.pandas.tests.connect.window.test_parity_rolling",
+"pyspark.pandas.tests.connect.window.test_parity_rolling_adv",
+"pyspark.pandas.tests.connect.window.test_parity_rolling_count",
+"pyspark.pandas.tests.connect.window.test_parity_rolling_error",
+"pyspark.pandas.tests.connect.window.test_parity_groupby_rolling",
+"pyspark.pandas.tests.connect.window.test_parity_groupby_rolling_adv",
+
"pyspark.pandas.tests.connect.window.test_parity_groupby_rolling_count",
 "pyspark.pandas.tests.connect.test_parity_expanding",
 
"pyspark.pandas.tests.connect.test_parity_ops_on_diff_frames_groupby_rolling",
 "pyspark.pandas.tests.connect.computation.test_parity_missing_data",
diff --git a/python/pyspark/pandas/tests/connect/test_parity_rolling.py 
b/python/pyspark/pandas/tests/connect/window/test_parity_groupby_rolling.py
similarity index 76%
copy from python/pyspark/pandas/tests/connect/test_parity_rolling.py
copy to 
python/pyspark/pandas/tests/connect/window/test_parity_groupby_rolling.py
index 8318bed24f0..0a3e0b1358f 100644
--- a/python/pyspark/pandas/tests/connect/test_parity_rolling.py
+++ b/python/pyspark/pandas/tests/connect/window/test_parity_groupby_rolling.py
@@ -16,19 +16,21 @@
 #
 import unittest
 
-from pyspark.pandas.tests.test_rolling import RollingTestsMixin
+from pyspark.pandas.tests.window.test_groupby_rolling import 
GroupByRollingMixin
 from pyspark.testing.connectutils 

(spark) branch master updated: [SPARK-46355][SQL] XML: Close InputStreamReader on read completion

2023-12-11 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fc5868791a5 [SPARK-46355][SQL] XML: Close InputStreamReader on read 
completion
fc5868791a5 is described below

commit fc5868791a5d261fb416b376cc6eb39e8e29b90a
Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
AuthorDate: Mon Dec 11 09:41:03 2023 -0800

[SPARK-46355][SQL] XML: Close InputStreamReader on read completion

### What changes were proposed in this pull request?
The XML InputStreamReader needs to be closed on read completion to release the 
underlying InputStream resources in a timely manner.

### Why are the changes needed?
Not closing the reader may delay recycling of the underlying InputStream 
connection resources for cloud objects.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test and manual testing

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44287 from sandip-db/xml-reader-close.

Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 .../spark/sql/catalyst/xml/StaxXmlParser.scala  | 21 +++--
 .../spark/sql/catalyst/xml/XmlInferSchema.scala |  4 +++-
 2 files changed, 18 insertions(+), 7 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
index 567074bbf12..373ef410f05 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/StaxXmlParser.scala
@@ -133,7 +133,9 @@ class StaxXmlParser(
   val isRootAttributesOnly = schema.fields.forall { f =>
 f.name == options.valueTag || 
f.name.startsWith(options.attributePrefix)
   }
-  Some(convertObject(parser, schema, rootAttributes, isRootAttributesOnly))
+  val result = Some(convertObject(parser, schema, rootAttributes, 
isRootAttributesOnly))
+  parser.close()
+  result
 } catch {
   case e: SparkUpgradeException => throw e
   case e@(_: RuntimeException | _: XMLStreamException | _: 
MalformedInputException
@@ -576,7 +578,7 @@ class StaxXmlParser(
 class XmlTokenizer(
   inputStream: InputStream,
   options: XmlOptions) {
-  private val reader = new InputStreamReader(inputStream, 
Charset.forName(options.charset))
+  private var reader = new InputStreamReader(inputStream, 
Charset.forName(options.charset))
   private var currentStartTag: String = _
   private var buffer = new StringBuilder()
   private val startTag = s"<${options.rowTag}>"
@@ -591,17 +593,24 @@ class XmlTokenizer(
* @return whether it reads successfully
*/
   def next(): Option[String] = {
-if (readUntilStartElement()) {
-  try {
+try {
+  if (readUntilStartElement()) {
 buffer.append(currentStartTag)
 // Don't check whether the end element was found. Even if not, return 
everything
 // that was read, which will invariably cause a parse error later
 readUntilEndElement(currentStartTag.endsWith(">"))
-return Some(buffer.toString())
-  } finally {
+val str = buffer.toString()
 buffer = new StringBuilder()
+return Some(str)
   }
+} catch {
+  case e: Throwable =>
+reader.close()
+reader = null
+throw e
 }
+reader.close()
+reader = null
 None
   }
 
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
index f81c476cd38..52c98bc43eb 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/xml/XmlInferSchema.scala
@@ -118,7 +118,9 @@ class XmlInferSchema(options: XmlOptions, caseSensitive: 
Boolean)
   }
   val parser = StaxXmlParserUtils.filteredReader(xml)
   val rootAttributes = StaxXmlParserUtils.gatherRootAttributes(parser)
-  Some(inferObject(parser, rootAttributes))
+  val schema = Some(inferObject(parser, rootAttributes))
+  parser.close()
+  schema
 } catch {
   case NonFatal(_) if options.parseMode == PermissiveMode =>
 Some(StructType(Seq(StructField(options.columnNameOfCorruptRecord, 
StringType





(spark) branch master updated: [SPARK-46358][CONNECT] Simplify the condition check in the `ResponseValidator#verifyResponse`

2023-12-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 051b1781827 [SPARK-46358][CONNECT] Simplify the condition check in the 
`ResponseValidator#verifyResponse`
051b1781827 is described below

commit 051b1781827dd3a4e1e95a5354caa747ff41ae1a
Author: yangjie01 
AuthorDate: Mon Dec 11 08:53:10 2023 -0800

[SPARK-46358][CONNECT] Simplify the condition check in the 
`ResponseValidator#verifyResponse`

### What changes were proposed in this pull request?
This PR makes the following refactorings to the `verifyResponse` function 
in `ResponseValidator`:
1. The check `response.hasField(field)` is moved before getting `value`, so 
`value` is obtained only when `response.hasField(field)` is true, 
which is more in line with the existing comments.
2. Removed the `value != ""` condition check in the case match, because 
only when `value.nonEmpty` is true will it enter the `if` branch, and the 
condition `value.nonEmpty` has already covered the check for `value != ""`.
3. The condition check `value != id` is moved inside `case Some(id)`. After 
the modification, an `IllegalStateException` will still be thrown when the id 
exists and `value != id`, but `serverSideSessionId` will no longer be 
reassigned when the id exists and `value == id`.
4. Removed the redundant `toString` operation when reassigning 
`serverSideSessionId`, because `value` is already a String.
5. Removed the No-op `case _` match, because it is unreachable code after 
the above modifications.

### Why are the changes needed?
Simplify the condition check in the `verifyResponse` function

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44291 from LuciferYang/Simplify-ResponseValidator-verifyResponse.

Lead-authored-by: yangjie01 
Co-authored-by: YangJie 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/connect/client/ResponseValidator.scala | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git 
a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
 
b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
index 67f29c727ef..22c5505e7d4 100644
--- 
a/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
+++ 
b/connector/connect/common/src/main/scala/org/apache/spark/sql/connect/client/ResponseValidator.scala
@@ -35,18 +35,20 @@ class ResponseValidator extends Logging {
 val field = 
response.getDescriptorForType.findFieldByName("server_side_session_id")
 // If the field does not exist, we ignore it. New / Old message might not 
contain it and this
 // behavior allows us to be compatible.
-if (field != null) {
+if (field != null && response.hasField(field)) {
   val value = response.getField(field).asInstanceOf[String]
   // Ignore, if the value is unset.
-  if (response.hasField(field) && value != null && value.nonEmpty) {
+  if (value != null && value.nonEmpty) {
 serverSideSessionId match {
-  case Some(id) if value != id && value != "" =>
-throw new IllegalStateException(s"Server side session ID changed 
from $id to $value")
-  case _ if value != "" =>
+  case Some(id) =>
+if (value != id) {
+  throw new IllegalStateException(
+s"Server side session ID changed from $id to $value")
+}
+  case _ =>
 synchronized {
-  serverSideSessionId = Some(value.toString)
+  serverSideSessionId = Some(value)
 }
-  case _ => // No-op
 }
   }
 } else {





(spark) branch master updated (bab884082c0 -> bb886abcc4a)

2023-12-11 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from bab884082c0 [MINOR][DOCS] Fix documentation for 
`spark.sql.legacy.doLooseUpcast` in SQL migration guide
 add bb886abcc4a [SPARK-46356][BUILD] Upgrade `sbt-assembly` to 2.1.5, 
`sbt-checkstyle-plugin` to 4.0.1

No new revisions were added by this update.

Summary of changes:
 project/plugins.sbt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)





(spark) branch branch-3.5 updated: [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide

2023-12-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 9c83bf501cc [MINOR][DOCS] Fix documentation for 
`spark.sql.legacy.doLooseUpcast` in SQL migration guide
9c83bf501cc is described below

commit 9c83bf501ccefa7c6c0ba071f69e2528f3504854
Author: Amy Tsai 
AuthorDate: Mon Dec 11 18:35:31 2023 +0300

[MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL 
migration guide

### What changes were proposed in this pull request?

Fixes an error in the SQL migration guide documentation for 
`spark.sql.legacy.doLooseUpcast`. I corrected the config name and moved it to 
the section for migration from Spark 2.4 to 3.0 since it was not made available 
until Spark 3.0.

### Why are the changes needed?

The config was documented as `spark.sql.legacy.looseUpcast` and is 
inaccurately included in the Spark 2.4 to Spark 2.4.1 section.

I changed the docs to match what is implemented in 
https://github.com/apache/spark/blob/20df062d85e80422a55afae80ddbf2060f26516c/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3873

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Docs only change

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44262 from amytsai-stripe/fix-migration-docs-loose-upcast.

Authored-by: Amy Tsai 
Signed-off-by: Max Gekk 
(cherry picked from commit bab884082c0f82e3f9053adac6c7e8a3fcfab11c)
Signed-off-by: Max Gekk 
---
 docs/sql-migration-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 88635ee3d1f..2eba9500e90 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -251,6 +251,8 @@ license: |
 
   - In Spark 3.0, the column metadata will always be propagated in the API 
`Column.name` and `Column.as`. In Spark version 2.4 and earlier, the metadata 
of `NamedExpression` is set as the `explicitMetadata` for the new column at the 
time the API is called, it won't change even if the underlying 
`NamedExpression` changes metadata. To restore the behavior before Spark 3.0, 
you can use the API `as(alias: String, metadata: Metadata)` with explicit 
metadata.
 
+  - When turning a Dataset to another Dataset, Spark will up cast the fields 
in the original Dataset to the type of corresponding fields in the target 
DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. 
`Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and 
throw NPE during execution. In Spark 3.0, the up cast is stricter and turning 
String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` 
will fail during analysis. To res [...]
+
 ### DDL Statements
 
   - In Spark 3.0, when inserting a value into a table column with a different 
data type, the type coercion is performed as per ANSI SQL standard. Certain 
unreasonable type conversions such as converting `string` to `int` and `double` 
to `boolean` are disallowed. A runtime exception is thrown if the value is 
out-of-range for the data type of the column. In Spark version 2.4 and below, 
type conversions during table insertion are allowed as long as they are valid 
`Cast`. When inserting an o [...]
@@ -464,8 +466,6 @@ license: |
 need to specify a value with units like "30s" now, to avoid being 
interpreted as milliseconds; otherwise,
 the extremely short interval that results will likely cause applications 
to fail.
 
-  - When turning a Dataset to another Dataset, Spark will up cast the fields 
in the original Dataset to the type of corresponding fields in the target 
DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. 
`Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and 
throw NPE during execution. In Spark 3.0, the up cast is stricter and turning 
String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` 
will fail during analysis. To res [...]
-
 ## Upgrading from Spark SQL 2.3 to 2.4
 
   - In Spark version 2.3 and earlier, the second parameter to array_contains 
function is implicitly promoted to the element type of first array type 
parameter. This type promotion can be lossy and may cause `array_contains` 
function to return wrong result. This problem has been addressed in 2.4 by 
employing a safer type promotion mechanism. This can cause some change in 
behavior and are illustrated in the table below.





(spark) branch master updated: [MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL migration guide

2023-12-11 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new bab884082c0 [MINOR][DOCS] Fix documentation for 
`spark.sql.legacy.doLooseUpcast` in SQL migration guide
bab884082c0 is described below

commit bab884082c0f82e3f9053adac6c7e8a3fcfab11c
Author: Amy Tsai 
AuthorDate: Mon Dec 11 18:35:31 2023 +0300

[MINOR][DOCS] Fix documentation for `spark.sql.legacy.doLooseUpcast` in SQL 
migration guide

### What changes were proposed in this pull request?

Fixes an error in the SQL migration guide documentation for 
`spark.sql.legacy.doLooseUpcast`. I corrected the config name and moved it to 
the section for migration from Spark 2.4 to 3.0 since it was not made available 
until Spark 3.0.

### Why are the changes needed?

The config was documented as `spark.sql.legacy.looseUpcast` and is 
inaccurately included in the Spark 2.4 to Spark 2.4.1 section.

I changed the docs to match what is implemented in 
https://github.com/apache/spark/blob/20df062d85e80422a55afae80ddbf2060f26516c/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L3873

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Docs only change

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44262 from amytsai-stripe/fix-migration-docs-loose-upcast.

Authored-by: Amy Tsai 
Signed-off-by: Max Gekk 
---
 docs/sql-migration-guide.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 9f9c15521c6..4e8e2422d7e 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -260,6 +260,8 @@ license: |
 
   - In Spark 3.0, the column metadata will always be propagated in the API 
`Column.name` and `Column.as`. In Spark version 2.4 and earlier, the metadata 
of `NamedExpression` is set as the `explicitMetadata` for the new column at the 
time the API is called, it won't change even if the underlying 
`NamedExpression` changes metadata. To restore the behavior before Spark 3.0, 
you can use the API `as(alias: String, metadata: Metadata)` with explicit 
metadata.
 
+  - When turning a Dataset to another Dataset, Spark will up cast the fields 
in the original Dataset to the type of corresponding fields in the target 
DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. 
`Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and 
throw NPE during execution. In Spark 3.0, the up cast is stricter and turning 
String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` 
will fail during analysis. To res [...]
+
 ### DDL Statements
 
   - In Spark 3.0, when inserting a value into a table column with a different 
data type, the type coercion is performed as per ANSI SQL standard. Certain 
unreasonable type conversions such as converting `string` to `int` and `double` 
to `boolean` are disallowed. A runtime exception is thrown if the value is 
out-of-range for the data type of the column. In Spark version 2.4 and below, 
type conversions during table insertion are allowed as long as they are valid 
`Cast`. When inserting an o [...]
@@ -473,8 +475,6 @@ license: |
 need to specify a value with units like "30s" now, to avoid being 
interpreted as milliseconds; otherwise,
 the extremely short interval that results will likely cause applications 
to fail.
 
-  - When turning a Dataset to another Dataset, Spark will up cast the fields 
in the original Dataset to the type of corresponding fields in the target 
DataSet. In version 2.4 and earlier, this up cast is not very strict, e.g. 
`Seq("str").toDS.as[Int]` fails, but `Seq("str").toDS.as[Boolean]` works and 
throw NPE during execution. In Spark 3.0, the up cast is stricter and turning 
String into something else is not allowed, i.e. `Seq("str").toDS.as[Boolean]` 
will fail during analysis. To res [...]
-
 ## Upgrading from Spark SQL 2.3 to 2.4
 
   - In Spark version 2.3 and earlier, the second parameter to array_contains 
function is implicitly promoted to the element type of first array type 
parameter. This type promotion can be lossy and may cause `array_contains` 
function to return wrong result. This problem has been addressed in 2.4 by 
employing a safer type promotion mechanism. This can cause some change in 
behavior and are illustrated in the table below.





(spark) branch master updated: [SPARK-45649][SQL] Unify the prepare framework for OffsetWindowFunctionFrame

2023-12-11 Thread beliefer
This is an automated email from the ASF dual-hosted git repository.

beliefer pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 6a197efeb3c [SPARK-45649][SQL] Unify the prepare framework for 
OffsetWindowFunctionFrame
6a197efeb3c is described below

commit 6a197efeb3c1cca156cd615e990e35e82ce22ee3
Author: Jiaan Geng 
AuthorDate: Mon Dec 11 19:48:14 2023 +0800

[SPARK-45649][SQL] Unify the prepare framework for OffsetWindowFunctionFrame

### What changes were proposed in this pull request?
Currently, the `prepare` implementations of all the 
`OffsetWindowFunctionFrame` subclasses share the same code logic, shown below.
```
  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
if (offset > rows.length) {
  fillDefaultValue(EmptyRow)
} else {
  resetStates(rows)
  if (ignoreNulls) {
...
  } else {
...
  }
}
  }
```
This PR unifies the prepare framework for `OffsetWindowFunctionFrame`.

**Why did https://github.com/apache/spark/pull/43507 introduce the NPE bug?**
For example, consider a window group with offset 5 but only 4 elements.
First, we don't call `resetStates` because the offset is greater than 4.
After that, we iterate over the elements of the window group by visiting 
`input`, but `input` is null.

This PR also adds two test cases where the absolute value of the offset is 
greater than the window group size.
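
The user-visible behavior exercised here is `lead`/`lag` with an offset whose absolute value exceeds the window group size: the default value should be returned instead of failing. A minimal PySpark sketch with illustrative data:
```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame([("a", 1), ("a", 2), ("a", 3), ("a", 4)], ["g", "v"])
w = Window.partitionBy("g").orderBy("v")

# offset 5 exceeds the 4-row window group, so every row gets the default value
df.select(
    "v",
    F.lag("v", 5).over(w).alias("lag5"),        # default is null
    F.lead("v", 5, -1).over(w).alias("lead5"),  # explicit default of -1
).show()
```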

### Why are the changes needed?
Unify the prepare framework for `OffsetWindowFunctionFrame`

### Does this PR introduce _any_ user-facing change?
'No'.
Internal change only.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43958 from beliefer/SPARK-45649.

Authored-by: Jiaan Geng 
Signed-off-by: Jiaan Geng 
---
 .../sql/execution/window/WindowFunctionFrame.scala | 114 ++---
 .../spark/sql/DataFrameWindowFramesSuite.scala |  24 +
 2 files changed, 76 insertions(+), 62 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala
index 6cea838311a..4aa7444c407 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowFunctionFrame.scala
@@ -87,7 +87,8 @@ abstract class OffsetWindowFunctionFrameBase(
 expressions: Array[OffsetWindowFunction],
 inputSchema: Seq[Attribute],
 newMutableProjection: (Seq[Expression], Seq[Attribute]) => 
MutableProjection,
-offset: Int)
+offset: Int,
+ignoreNulls: Boolean)
   extends WindowFunctionFrame {
 
   /** Rows of the partition currently being processed. */
@@ -141,6 +142,8 @@ abstract class OffsetWindowFunctionFrameBase(
   // is not null.
   protected var skippedNonNullCount = 0
 
+  protected val absOffset = Math.abs(offset)
+
   // Reset the states by the data of the new partition.
   protected def resetStates(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
 input = rows
@@ -176,6 +179,33 @@ abstract class OffsetWindowFunctionFrameBase(
 }
   }
 
+  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
+resetStates(rows)
+if (absOffset > rows.length) {
+  fillDefaultValue(EmptyRow)
+} else {
+  if (ignoreNulls) {
+prepareForIgnoreNulls()
+  } else {
+prepareForRespectNulls()
+  }
+}
+  }
+
+  protected def prepareForIgnoreNulls(): Unit = findNextRowWithNonNullInput()
+
+  protected def prepareForRespectNulls(): Unit = {
+// drain the first few rows if offset is larger than one
+while (inputIndex < offset) {
+  nextSelectedRow = WindowFunctionFrame.getNextOrNull(inputIterator)
+  inputIndex += 1
+}
+// `inputIndex` starts as 0, but the `offset` can be negative and we may 
not enter the
+// while loop at all. We need to make sure `inputIndex` ends up as 
`offset` to meet the
+// assumption of the write path.
+inputIndex = offset
+  }
+
   override def currentLowerBound(): Int = throw new 
UnsupportedOperationException()
 
   override def currentUpperBound(): Int = throw new 
UnsupportedOperationException()
@@ -197,25 +227,7 @@ class FrameLessOffsetWindowFunctionFrame(
 offset: Int,
 ignoreNulls: Boolean = false)
   extends OffsetWindowFunctionFrameBase(
-target, ordinal, expressions, inputSchema, newMutableProjection, offset) {
-
-  override def prepare(rows: ExternalAppendOnlyUnsafeRowArray): Unit = {
-resetStates(rows)
-if (ignoreNulls) {
-  if (Math.abs(offset) > rows.length) {
-