[spark] branch master updated (10d7acc81c5 -> d5f758ea359)

2023-05-31 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 10d7acc81c5 [SPARK-43889][PYTHON] add check for column name for 
`__dir__()` to filter out illegal column name
 add d5f758ea359 [SPARK-43792][SQL][PYTHON][CONNECT] Add optional pattern 
for Catalog.listCatalogs

No new revisions were added by this update.

Summary of changes:
 .../org/apache/spark/sql/catalog/Catalog.scala  |  8 
 .../org/apache/spark/sql/internal/CatalogImpl.scala | 12 
 .../scala/org/apache/spark/sql/CatalogSuite.scala   |  5 +
 .../src/main/protobuf/spark/connect/catalog.proto   |  5 -
 .../sql/connect/planner/SparkConnectPlanner.scala   |  6 +-
 project/MimaExcludes.scala  |  4 +++-
 python/pyspark/sql/catalog.py   | 21 +++--
 python/pyspark/sql/connect/catalog.py   |  4 ++--
 python/pyspark/sql/connect/plan.py  |  8 ++--
 python/pyspark/sql/connect/proto/catalog_pb2.py |  4 ++--
 python/pyspark/sql/connect/proto/catalog_pb2.pyi| 14 ++
 .../org/apache/spark/sql/catalog/Catalog.scala  |  7 +++
 .../org/apache/spark/sql/catalog/interface.scala|  7 ++-
 .../org/apache/spark/sql/internal/CatalogImpl.scala | 10 ++
 .../apache/spark/sql/internal/CatalogSuite.scala|  4 
 15 files changed, 103 insertions(+), 16 deletions(-)
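
For orientation, a minimal PySpark sketch of the new optional pattern for listing catalogs. This is a hedged illustration: the `pattern` keyword and wildcard style are assumed from the change summary above, not taken verbatim from the patch.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# List every catalog registered in the current session.
for cat in spark.catalog.listCatalogs():
    print(cat.name)

# With this change, an optional pattern can narrow the result, e.g. only
# catalogs whose names start with "spark" (assumed keyword name and wildcard style).
for cat in spark.catalog.listCatalogs(pattern="spark*"):
    print(cat.name)
```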


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43882][SQL] Assign name to _LEGACY_ERROR_TEMP_2122

2023-05-31 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2687d784fe4 [SPARK-43882][SQL] Assign name to _LEGACY_ERROR_TEMP_2122
2687d784fe4 is described below

commit 2687d784fe4d20af321f11074139c0ce382bbaef
Author: Jia Fan 
AuthorDate: Wed May 31 10:26:15 2023 +0300

[SPARK-43882][SQL] Assign name to _LEGACY_ERROR_TEMP_2122

### What changes were proposed in this pull request?
This PR proposes to assign the name "FAILED_PARSE_STRUCT_TYPE" to _LEGACY_ERROR_TEMP_2122.

### Why are the changes needed?
Assign a proper name to _LEGACY_ERROR_TEMP_2122.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new test

Closes #41381 from Hisoka-X/SPARK-43882_LEGACY_ERROR_TEMP_2122.

Authored-by: Jia Fan 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json  | 11 ++-
 .../org/apache/spark/sql/errors/QueryExecutionErrors.scala|  4 ++--
 .../apache/spark/sql/errors/QueryExecutionErrorsSuite.scala   | 10 ++
 3 files changed, 18 insertions(+), 7 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 8c3ba1e190d..7f2b1975855 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -634,6 +634,12 @@
 ],
 "sqlState" : "38000"
   },
+  "FAILED_PARSE_STRUCT_TYPE" : {
+"message" : [
+  "Failed parsing struct: ."
+],
+"sqlState" : "22018"
+  },
   "FAILED_RENAME_PATH" : {
 "message" : [
   "Failed to rename  to  as destination already 
exists."
@@ -4563,11 +4569,6 @@
   "Do not support type ."
 ]
   },
-  "_LEGACY_ERROR_TEMP_2122" : {
-"message" : [
-  "Failed parsing : ."
-]
-  },
   "_LEGACY_ERROR_TEMP_2124" : {
 "message" : [
   "Failed to merge decimal types with incompatible scale  and 
."
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
index 5daa8ed3b7f..7ce3e7a9e7e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryExecutionErrors.scala
@@ -1305,8 +1305,8 @@ private[sql] object QueryExecutionErrors extends 
QueryErrorsBase {
 
   def failedParsingStructTypeError(raw: String): SparkRuntimeException = {
 new SparkRuntimeException(
-  errorClass = "_LEGACY_ERROR_TEMP_2122",
-  messageParameters = Map("simpleString" -> StructType.simpleString, "raw" -> raw))
+  errorClass = "FAILED_PARSE_STRUCT_TYPE",
+  messageParameters = Map("raw" -> toSQLValue(raw, StringType)))
   }
 
   def cannotMergeDecimalTypesWithIncompatibleScaleError(
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
index 4bcb1d115b7..6d2c2600cbb 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/errors/QueryExecutionErrorsSuite.scala
@@ -633,6 +633,16 @@ class QueryExecutionErrorsSuite
 "config" -> s${SQLConf.ANSI_ENABLED.key}))
   }
 
+  test("FAILED_PARSE_STRUCT_TYPE: parsing invalid struct type") {
+val raw = """{"type":"array","elementType":"integer","containsNull":false}"""
+checkError(
+  exception = intercept[SparkRuntimeException] {
+StructType.fromString(raw)
+  },
+  errorClass = "FAILED_PARSE_STRUCT_TYPE",
+  parameters = Map("raw" -> s"'$raw'"))
+  }
+
   test("CAST_OVERFLOW: from long to ANSI intervals") {
 Seq(
   LongType -> "9223372036854775807L",


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43081][ML][FOLLOW-UP] Improve torch distributor data loader code

2023-05-31 Thread weichenxu123
This is an automated email from the ASF dual-hosted git repository.

weichenxu123 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c2060e7c0a3 [SPARK-43081][ML][FOLLOW-UP] Improve torch distributor 
data loader code
c2060e7c0a3 is described below

commit c2060e7c0a332c20f527adeb34a52042237430e4
Author: Weichen Xu 
AuthorDate: Wed May 31 16:34:19 2023 +0800

[SPARK-43081][ML][FOLLOW-UP] Improve torch distributor data loader code

### What changes were proposed in this pull request?

### Why are the changes needed?

Improve torch distributor data loader code:

* Add a verification that num_processes matches the number of input Spark DataFrame partitions. This makes debugging easier when users pass a mismatched input DataFrame; otherwise the torch package raises an obscure error.
* Improve column value conversion in the torch dataloader: resolve the conversion function once per column instead of comparing the field type for every column value (see the sketch below).
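
A generic, hedged sketch of the converter-factory pattern described in the second bullet (not the actual `_SparkPartitionTorchDataset` code from the diff below): pick the conversion function once per column, then apply it per row without re-checking types.

```python
from typing import Any, Callable, List


def make_converter(field_type: str) -> Callable[[Any], Any]:
    """Resolve the conversion function for a column once, based on its type."""
    if field_type in ("float", "double", "int", "bigint", "smallint"):
        return lambda value: value
    if field_type == "vector":
        # Hypothetical dense-vector handling, mirroring the idea in the diff.
        return lambda value: value["values"]
    raise ValueError(f"Unsupported field type: {field_type}")


# Converters are resolved once per column...
field_types: List[str] = ["int", "double"]
converters = [make_converter(t) for t in field_types]

# ...and then applied per row without any further type comparisons.
rows = [(1, 2.0), (3, 4.0)]
converted = [[fn(v) for fn, v in zip(converters, row)] for row in rows]
print(converted)  # [[1, 2.0], [3, 4.0]]
```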

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

UT.

Closes #41382 from WeichenXu123/improve-torch-dataloader.

Authored-by: Weichen Xu 
Signed-off-by: Weichen Xu 
---
 python/pyspark/ml/torch/data.py| 50 --
 python/pyspark/ml/torch/distributor.py |  6 
 2 files changed, 36 insertions(+), 20 deletions(-)

diff --git a/python/pyspark/ml/torch/data.py b/python/pyspark/ml/torch/data.py
index a52b96a9392..d421683c16d 100644
--- a/python/pyspark/ml/torch/data.py
+++ b/python/pyspark/ml/torch/data.py
@@ -17,7 +17,7 @@
 
 import torch
 import numpy as np
-from typing import Any, Iterator
+from typing import Any, Callable, Iterator
 from pyspark.sql.types import StructType
 
 
@@ -26,27 +26,37 @@ class 
_SparkPartitionTorchDataset(torch.utils.data.IterableDataset):
 self.arrow_file_path = arrow_file_path
 self.num_samples = num_samples
 self.field_types = [field.dataType.simpleString() for field in schema]
+self.field_converters = [
+_SparkPartitionTorchDataset._get_field_converter(field_type)
+for field_type in self.field_types
+]
 
 @staticmethod
-def _extract_field_value(value: Any, field_type: str) -> Any:
-# TODO: avoid checking field type for every row.
+def _get_field_converter(field_type: str) -> Callable[[Any], Any]:
 if field_type == "vector":
-if value["type"] == 1:
-# dense vector
-return value["values"]
-if value["type"] == 0:
-# sparse vector
-size = int(value["size"])
-sparse_array = np.zeros(size, dtype=np.float64)
-sparse_array[value["indices"]] = value["values"]
-return sparse_array
-if field_type in ["float", "double", "int", "bigint", "smallint"]:
-return value
 
-raise ValueError(
-"SparkPartitionTorchDataset does not support loading data from 
field of "
-f"type {field_type}."
-)
+def converter(value: Any) -> Any:
+if value["type"] == 1:
+# dense vector
+return value["values"]
+if value["type"] == 0:
+# sparse vector
+size = int(value["size"])
+sparse_array = np.zeros(size, dtype=np.float64)
+sparse_array[value["indices"]] = value["values"]
+return sparse_array
+
+elif field_type in ["float", "double", "int", "bigint", "smallint"]:
+
+def converter(value: Any) -> Any:
+return value
+
+else:
+raise ValueError(
+"SparkPartitionTorchDataset does not support loading data from 
field of "
+f"type {field_type}."
+)
+return converter
 
 def __iter__(self) -> Iterator[Any]:
 from pyspark.sql.pandas.serializers import ArrowStreamSerializer
@@ -71,8 +81,8 @@ class 
_SparkPartitionTorchDataset(torch.utils.data.IterableDataset):
 batch_pdf = batch.to_pandas()
 for row in batch_pdf.itertuples(index=False):
 yield [
-
_SparkPartitionTorchDataset._extract_field_value(value, field_type)
-for value, field_type in zip(row, self.field_types)
+field_converter(value)
+for value, field_converter in zip(row, 
self.field_converters)
 ]
 count += 1
 if count == self.num_samples:
diff --git a/python/pyspark/ml/torch/distributor.py 
b/python/pyspark/ml/torch/distributor.py
index 0249e6b4b2c..711f76db09b 100644
--- 

[spark] branch master updated (c2060e7c0a3 -> 3457b4be356)

2023-05-31 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c2060e7c0a3 [SPARK-43081][ML][FOLLOW-UP] Improve torch distributor 
data loader code
 add 3457b4be356 
[SPARK-43852][SPARK-43853][SPARK-43854][SPARK-43855][SPARK-43856] Assign names 
to the error class _LEGACY_ERROR_TEMP_2418-2425

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   | 57 --
 .../sql/tests/pandas/test_pandas_udf_scalar.py |  4 +-
 python/pyspark/sql/tests/test_udf.py   |  4 +-
 .../sql/catalyst/analysis/CheckAnalysis.scala  | 18 +++
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala | 28 ---
 .../apache/spark/sql/DataFrameAsOfJoinSuite.scala  | 29 ++-
 .../apache/spark/sql/LateralColumnAliasSuite.scala | 32 
 .../sql/hive/execution/AggregationQuerySuite.scala | 25 ++
 .../spark/sql/hive/execution/HiveUDAFSuite.scala   | 15 --
 .../spark/sql/hive/execution/UDAQuerySuite.scala   | 25 ++
 10 files changed, 145 insertions(+), 92 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (3457b4be356 -> fead25ac4d6)

2023-05-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3457b4be356 
[SPARK-43852][SPARK-43853][SPARK-43854][SPARK-43855][SPARK-43856] Assign names 
to the error class _LEGACY_ERROR_TEMP_2418-2425
 add fead25ac4d6 [SPARK-43775][SQL] DataSource V2: Allow representing 
updates as deletes and inserts

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/connector/write/SupportsDelta.java   | 10 
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  1 +
 .../catalyst/analysis/RewriteRowLevelCommand.scala | 31 -
 .../sql/catalyst/analysis/RewriteUpdateTable.scala | 54 ++
 .../catalog/InMemoryRowLevelOperationTable.scala   |  5 ++
 ...taBasedUpdateAsDeleteAndInsertTableSuite.scala} |  3 +-
 .../spark/sql/connector/UpdateTableSuiteBase.scala | 15 ++
 7 files changed, 108 insertions(+), 11 deletions(-)
 copy 
sql/core/src/test/scala/org/apache/spark/sql/connector/{DeltaBasedUpdateTableSuite.scala
 => DeltaBasedUpdateAsDeleteAndInsertTableSuite.scala} (94%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (fead25ac4d6 -> d3f76c6ca07)

2023-05-31 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from fead25ac4d6 [SPARK-43775][SQL] DataSource V2: Allow representing 
updates as deletes and inserts
 add d3f76c6ca07 [SPARK-43894][PYTHON] Fix bug in df.cache()

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py| 4 +---
 python/pyspark/sql/tests/connect/test_connect_basic.py | 6 ++
 2 files changed, 7 insertions(+), 3 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.4 updated: [SPARK-43894][PYTHON] Fix bug in df.cache()

2023-05-31 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 0e1401dc71b [SPARK-43894][PYTHON] Fix bug in df.cache()
0e1401dc71b is described below

commit 0e1401dc71b5aee540a54fc6a36a1857b13390b4
Author: Martin Grund 
AuthorDate: Wed May 31 11:55:19 2023 -0400

[SPARK-43894][PYTHON] Fix bug in df.cache()

### What changes were proposed in this pull request?
Previously, calling `df.cache()` would result in an invalid plan input exception because we did not invoke `persist()` with the right arguments. This patch simplifies the logic and makes it compatible with the behavior in Spark itself.
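
For illustration, a hedged Spark Connect usage sketch of the fixed behavior, mirroring the new unit test in the diff; the `sc://localhost` endpoint is an assumption for the example.

```python
from pyspark.sql import SparkSession

# Assumes a Spark Connect server is reachable at this (hypothetical) endpoint.
spark = SparkSession.builder.remote("sc://localhost").getOrCreate()

df = spark.range(10)
df.cache()            # now simply delegates to persist() with its default storage level
print(df.count())     # 10
print(df.is_cached)   # True
```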

### Why are the changes needed?
Bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #41399 from grundprinzip/df_cache.

Authored-by: Martin Grund 
Signed-off-by: Herman van Hovell 
(cherry picked from commit d3f76c6ca07a7a11fd228dde770186c0fbc3f03f)
Signed-off-by: Herman van Hovell 
---
 python/pyspark/sql/connect/dataframe.py| 4 +---
 python/pyspark/sql/tests/connect/test_connect_basic.py | 6 ++
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index ca2e1b7a0dc..03049109061 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1544,9 +1544,7 @@ class DataFrame:
 def cache(self) -> "DataFrame":
 if self._plan is None:
 raise Exception("Cannot cache on empty plan.")
-relation = self._plan.plan(self._session.client)
-self._session.client._analyze(method="persist", relation=relation)
-return self
+return self.persist()
 
 cache.__doc__ = PySparkDataFrame.cache.__doc__
 
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 008b95d6f14..b051b9233c8 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -3032,6 +3032,12 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 message_parameters={"attr_name": "_jreader"},
 )
 
+def test_df_caache(self):
+df = self.connect.range(10)
+df.cache()
+self.assert_eq(10, df.count())
+self.assertTrue(df.is_cached)
+
 
 class SparkConnectSessionTests(SparkConnectSQLTestCase):
 def _check_no_active_session_error(self, e: PySparkException):


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43888][CORE] Relocate Logging to common/utils

2023-05-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 51359b3 [SPARK-43888][CORE] Relocate Logging to common/utils
51359b3 is described below

commit 51359b3e57de8421b1929045de9ff48890b8
Author: Rui Wang 
AuthorDate: Wed May 31 10:02:14 2023 -0700

[SPARK-43888][CORE] Relocate Logging to common/utils

### What changes were proposed in this pull request?

We can relocate Logging to common/utils so that it can be shared between the Spark Connect client and Spark core.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

Towards the goal that the Spark Connect client does not need to depend on Spark core.

### How was this patch tested?

Existing UT

Closes #41391 from amaliujia/refactor_logging.

Authored-by: Rui Wang 
Signed-off-by: Dongjoon Hyun 
---
 common/utils/pom.xml   | 29 ++
 .../scala/org/apache/spark/internal/Logging.scala  |  6 ++---
 core/pom.xml   | 29 --
 3 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/common/utils/pom.xml b/common/utils/pom.xml
index 53df20b646d..ee10a606182 100644
--- a/common/utils/pom.xml
+++ b/common/utils/pom.xml
@@ -51,6 +51,35 @@
   org.apache.commons
   commons-text
 
+
+  org.slf4j
+  slf4j-api
+
+
+
+  org.slf4j
+  jul-to-slf4j
+
+
+  org.slf4j
+  jcl-over-slf4j
+
+
+  org.apache.logging.log4j
+  log4j-slf4j2-impl
+
+
+  org.apache.logging.log4j
+  log4j-api
+
+
+  org.apache.logging.log4j
+  log4j-core
+
+
+  org.apache.logging.log4j
+  log4j-1.2-api
+
   
   
 
target/scala-${scala.binary.version}/classes
diff --git a/core/src/main/scala/org/apache/spark/internal/Logging.scala 
b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala
similarity index 98%
rename from core/src/main/scala/org/apache/spark/internal/Logging.scala
rename to common/utils/src/main/scala/org/apache/spark/internal/Logging.scala
index b6e3914622a..83e01330ce3 100644
--- a/core/src/main/scala/org/apache/spark/internal/Logging.scala
+++ b/common/utils/src/main/scala/org/apache/spark/internal/Logging.scala
@@ -27,7 +27,7 @@ import org.apache.logging.log4j.core.filter.AbstractFilter
 import org.slf4j.{Logger, LoggerFactory}
 
 import org.apache.spark.internal.Logging.SparkShellLoggingFilter
-import org.apache.spark.util.Utils
+import org.apache.spark.util.SparkClassUtils
 
 /**
  * Utility trait for classes that want to log data. Creates a SLF4J logger for 
the class and allows
@@ -133,7 +133,7 @@ trait Logging {
   if (Logging.islog4j2DefaultConfigured()) {
 Logging.defaultSparkLog4jConfig = true
 val defaultLogProps = "org/apache/spark/log4j2-defaults.properties"
-Option(Utils.getSparkClassLoader.getResource(defaultLogProps)) match {
+
Option(SparkClassUtils.getSparkClassLoader.getResource(defaultLogProps)) match {
   case Some(url) =>
 val context = 
LogManager.getContext(false).asInstanceOf[LoggerContext]
 context.setConfigLocation(url.toURI)
@@ -197,7 +197,7 @@ private[spark] object Logging {
   try {
 // We use reflection here to handle the case where users remove the
 // slf4j-to-jul bridge order to route their logs to JUL.
-val bridgeClass = Utils.classForName("org.slf4j.bridge.SLF4JBridgeHandler")
+val bridgeClass = 
SparkClassUtils.classForName("org.slf4j.bridge.SLF4JBridgeHandler")
 bridgeClass.getMethod("removeHandlersForRootLogger").invoke(null)
 val installed = 
bridgeClass.getMethod("isInstalled").invoke(null).asInstanceOf[Boolean]
 if (!installed) {
diff --git a/core/pom.xml b/core/pom.xml
index 4413ba6b634..79bf8a21635 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -218,35 +218,6 @@
   com.google.code.findbugs
   jsr305
 
-
-  org.slf4j
-  slf4j-api
-
-
-
-  org.slf4j
-  jul-to-slf4j
-
-
-  org.slf4j
-  jcl-over-slf4j
-
-
-  org.apache.logging.log4j
-  log4j-slf4j2-impl
-
-
-  org.apache.logging.log4j
-  log4j-api
-
-
-  org.apache.logging.log4j
-  log4j-core
-
-
-  org.apache.logging.log4j
-  log4j-1.2-api
-
 
 
   com.ning


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43867][SQL] Improve suggested candidates for unresolved attribute

2023-05-31 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a8893422752 [SPARK-43867][SQL] Improve suggested candidates for 
unresolved attribute
a8893422752 is described below

commit a88934227523334550e451e437ce013772001079
Author: Max Gekk 
AuthorDate: Wed May 31 21:04:44 2023 +0300

[SPARK-43867][SQL] Improve suggested candidates for unresolved attribute

### What changes were proposed in this pull request?
In the PR, I propose to change the approach of stripping the common part of 
candidate qualifiers in `StringUtils.orderSuggestedIdentifiersBySimilarity`:
1. If all candidates have the same qualifier including namespace and table 
name, drop it. It should be dropped if the base string (unresolved attribute) 
doesn't include a namespace and table name. For example:
- `[ns1.table1.col1, ns1.table1.col2] -> [col1, col2]` for unresolved 
attribute `col0`
- `[ns1.table1.col1, ns1.table1.col2] -> [table1.col1, table1.col2]` 
for unresolved attribute `table1.col0`
2. If all candidates belong to the same namespace, just drop it. It should 
be dropped for any non-fully qualified unresolved attribute. For example:
- `[ns1.table1.col1, ns1.table2.col2] -> [table1.col1, table2.col2]` 
for unresolved attribute `col0` or `table0.col0`
- `[ns1.table1.col1, ns1.table1.col2] -> [ns1.table1.col1, 
ns1.table1.col2]` for unresolved attribute `ns0.table0.col0`
3. Otherwise take the suggested candidates AS IS.
4. Sort the candidate list using the Levenshtein distance.

### Why are the changes needed?
This should improve user experience with Spark SQL by simplifying the error 
message about an unresolved attribute.
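
A small PySpark sketch of where these suggestions surface. This is a hedged illustration; the exact error text depends on the Spark version.

```python
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2)], ["col1", "col2"])

try:
    df.select("col0").show()
except AnalysisException as e:
    # UNRESOLVED_COLUMN lists suggested candidates; with this change the common
    # qualifier is stripped, so the suggestions read like [`col1`, `col2`]
    # rather than fully qualified names.
    print(e)
```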

### Does this PR introduce _any_ user-facing change?
Yes, it changes the error message.

### How was this patch tested?
By running the existing test suites:
```
$ build/sbt "test:testOnly *AnalysisErrorSuite"
$ build/sbt "test:testOnly *QueryCompilationErrorsSuite"
$ PYSPARK_PYTHON=python3 build/sbt "sql/testOnly 
org.apache.spark.sql.SQLQueryTestSuite"
$ build/sbt "test:testOnly *DatasetUnpivotSuite"
$ build/sbt "test:testOnly *DatasetSuite"

```

Closes #41368 from MaxGekk/fix-suggested-column-list.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  3 +-
 .../plans/logical/basicLogicalOperators.scala  |  2 +-
 .../spark/sql/catalyst/util/StringUtils.scala  | 46 +-
 .../sql/catalyst/analysis/AnalysisErrorSuite.scala |  4 +-
 .../spark/sql/catalyst/util/StringUtilsSuite.scala |  5 ++-
 .../columnresolution-negative.sql.out  |  2 +-
 .../analyzer-results/group-by-all.sql.out  |  2 +-
 .../analyzer-results/join-lateral.sql.out  |  2 +-
 .../postgreSQL/aggregates_part1.sql.out|  2 +-
 .../analyzer-results/postgreSQL/join.sql.out   |  6 +--
 .../udf/postgreSQL/udf-aggregates_part1.sql.out|  2 +-
 .../udf/postgreSQL/udf-join.sql.out|  6 +--
 .../results/columnresolution-negative.sql.out  |  2 +-
 .../sql-tests/results/group-by-all.sql.out |  2 +-
 .../sql-tests/results/join-lateral.sql.out |  2 +-
 .../results/postgreSQL/aggregates_part1.sql.out|  2 +-
 .../sql-tests/results/postgreSQL/join.sql.out  |  6 +--
 .../udf/postgreSQL/udf-aggregates_part1.sql.out|  2 +-
 .../results/udf/postgreSQL/udf-join.sql.out|  6 +--
 .../org/apache/spark/sql/DatasetUnpivotSuite.scala |  2 +-
 .../spark/sql/connector/DataSourceV2SQLSuite.scala |  4 +-
 .../sql/errors/QueryCompilationErrorsSuite.scala   |  3 +-
 22 files changed, 53 insertions(+), 60 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index c46dff1c4bf..594c0b666e8 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -139,7 +139,8 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
   a: Attribute,
   errorClass: String): Nothing = {
 val missingCol = a.sql
-val candidates = operator.inputSet.toSeq.map(_.qualifiedName)
+val candidates = operator.inputSet.toSeq
+  .map(attr => attr.qualifier :+ attr.name)
 val orderedCandidates =
   StringUtils.orderSuggestedIdentifiersBySimilarity(missingCol, candidates)
 throw QueryCompilationErrors.unresolvedAttributeError(
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
 
b/sql/catalys

[spark] branch master updated (a8893422752 -> 47ddac13144)

2023-05-31 Thread ueshin
This is an automated email from the ASF dual-hosted git repository.

ueshin pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a8893422752 [SPARK-43867][SQL] Improve suggested candidates for 
unresolved attribute
 add 47ddac13144 [SPARK-43804][PYTHON][TESTS] Test on nested structs 
support in Pandas UDF

No new revisions were added by this update.

Summary of changes:
 .../sql/tests/connect/test_parity_pandas_udf.py|   3 -
 .../sql/tests/pandas/test_pandas_udf_scalar.py | 107 +
 2 files changed, 89 insertions(+), 21 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43895][CONNECT][GO] Prepare the go package path

2023-05-31 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 419f46ab621 [SPARK-43895][CONNECT][GO] Prepare the go package path
419f46ab621 is described below

commit 419f46ab621e72e64dc9b897e416e56ea348cf1e
Author: Martin Grund 
AuthorDate: Wed May 31 13:17:09 2023 -0700

[SPARK-43895][CONNECT][GO] Prepare the go package path

### What changes were proposed in this pull request?

This patch adds the golang package path to the proto files to be consumed 
from the golang repo.

### Why are the changes needed?
Preparation for the go client.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No functional change.

Closes #41403 from grundprinzip/SPARK-43895.

Authored-by: Martin Grund 
Signed-off-by: Gengliang Wang 
---
 connector/connect/common/src/main/protobuf/buf.yaml | 2 ++
 connector/connect/common/src/main/protobuf/spark/connect/base.proto | 1 +
 .../connect/common/src/main/protobuf/spark/connect/catalog.proto| 1 +
 .../connect/common/src/main/protobuf/spark/connect/commands.proto   | 1 +
 .../connect/common/src/main/protobuf/spark/connect/common.proto | 1 +
 .../common/src/main/protobuf/spark/connect/example_plugins.proto| 1 +
 .../common/src/main/protobuf/spark/connect/expressions.proto| 1 +
 .../connect/common/src/main/protobuf/spark/connect/relations.proto  | 1 +
 .../connect/common/src/main/protobuf/spark/connect/types.proto  | 1 +
 python/pyspark/sql/connect/proto/base_pb2.py| 6 --
 python/pyspark/sql/connect/proto/catalog_pb2.py | 6 --
 python/pyspark/sql/connect/proto/commands_pb2.py| 6 --
 python/pyspark/sql/connect/proto/common_pb2.py  | 6 --
 python/pyspark/sql/connect/proto/example_plugins_pb2.py | 6 --
 python/pyspark/sql/connect/proto/expressions_pb2.py | 6 --
 python/pyspark/sql/connect/proto/relations_pb2.py   | 6 --
 python/pyspark/sql/connect/proto/types_pb2.py   | 6 --
 17 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/connector/connect/common/src/main/protobuf/buf.yaml 
b/connector/connect/common/src/main/protobuf/buf.yaml
index 496e97af3fa..f17614a8dc4 100644
--- a/connector/connect/common/src/main/protobuf/buf.yaml
+++ b/connector/connect/common/src/main/protobuf/buf.yaml
@@ -18,6 +18,8 @@ version: v1
 breaking:
   use:
 - FILE
+  except:
+- FILE_SAME_GO_PACKAGE
 lint:
   use:
 - DEFAULT
diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/base.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/base.proto
index f54e28e3b61..e869712858a 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/base.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/base.proto
@@ -28,6 +28,7 @@ import "spark/connect/types.proto";
 
 option java_multiple_files = true;
 option java_package = "org.apache.spark.connect.proto";
+option go_package = "internal/generated";
 
 // A [[Plan]] is the structure that carries the runtime information for the 
execution from the
 // client to the server. A [[Plan]] can either be of the type [[Relation]] 
which is a reference
diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/catalog.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/catalog.proto
index 9729c102269..f048dbc7f25 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/catalog.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/catalog.proto
@@ -24,6 +24,7 @@ import "spark/connect/types.proto";
 
 option java_multiple_files = true;
 option java_package = "org.apache.spark.connect.proto";
+option go_package = "internal/generated";
 
 // Catalog messages are marked as unstable.
 message Catalog {
diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/commands.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/commands.proto
index 87d76c5d63f..e716364f69b 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/commands.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/commands.proto
@@ -26,6 +26,7 @@ package spark.connect;
 
 option java_multiple_files = true;
 option java_package = "org.apache.spark.connect.proto";
+option go_package = "internal/generated";
 
 // A [[Command]] is an operation that is executed by the server that does not 
directly consume or
 // produce a relational result.
diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/common.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/common.proto
index 42cac88ea3f..5c538cf108

[spark] branch master updated: [SPARK-43333][SQL] Allow Avro to convert union type to SQL with field name stable with type

2023-05-31 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 8f2afb88d42 [SPARK-43333][SQL] Allow Avro to convert union type to SQL with field name stable with type
8f2afb88d42 is described below

commit 8f2afb88d42af04f36c84972d9ebcb5dabc91260
Author: Siying Dong 
AuthorDate: Wed May 31 13:32:02 2023 -0700

[SPARK-43333][SQL] Allow Avro to convert union type to SQL with field name stable with type

### What changes were proposed in this pull request?
Introduce the Avro option "enableStableIdentifiersForUnionType". If it is set to true (the default remains false), an Avro union is converted to a SQL schema whose fields are named "member_" + the member type name. This keeps field names stable with respect to their types.

### Why are the changes needed?
The purpose of this is twofold:

1. To allow adding or removing types in the union without affecting the record names of other member types. If the new or removed type is not ordered last, then existing queries referencing "member2" may need to be rewritten to reference "member1" or "member3".
2. Referencing the type name in the query is more readable than referencing "member0".
For example, our system produces an avro schema from a Java type structure 
where subtyping maps to union types whose members are ordered 
lexicographically. Adding a subtype can therefore easily result in all 
references to "member2" needing to be updated to "member3".
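
A hedged PySpark sketch of reading Avro with the new option; it assumes the spark-avro package is on the classpath and uses a hypothetical file path.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read.format("avro")
    # Option name taken from the description above; the default is false.
    .option("enableStableIdentifiersForUnionType", "true")
    .load("/tmp/union_data.avro")  # hypothetical path to data with a union schema
)

# With the option enabled, union members map to fields such as `member_int` or
# `member_string` instead of positional names like `member0`, `member1`.
df.printSchema()
```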

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Add a unit test that covers all types supported in union, as well as some 
potential name conflict cases.

Closes #41263 from siying/avro_stable_union.

Authored-by: Siying Dong 
Signed-off-by: Gengliang Wang 
---
 .../apache/spark/sql/avro/AvroDataToCatalyst.scala |   2 +-
 .../org/apache/spark/sql/avro/AvroOptions.scala|  10 +
 .../org/apache/spark/sql/avro/AvroUtils.scala  |   2 +-
 .../apache/spark/sql/avro/SchemaConverters.scala   |  62 --
 .../apache/spark/sql/avro/AvroFunctionsSuite.scala |   9 +-
 .../org/apache/spark/sql/avro/AvroSuite.scala  | 222 ++---
 docs/sql-data-sources-avro.md  |   8 +-
 7 files changed, 274 insertions(+), 41 deletions(-)

diff --git 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
index f8718edd97f..59f2999bdd3 100644
--- 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
+++ 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroDataToCatalyst.scala
@@ -39,7 +39,7 @@ private[sql] case class AvroDataToCatalyst(
   override def inputTypes: Seq[AbstractDataType] = Seq(BinaryType)
 
   override lazy val dataType: DataType = {
-val dt = SchemaConverters.toSqlType(expectedSchema).dataType
+val dt = SchemaConverters.toSqlType(expectedSchema, options).dataType
 parseMode match {
   // With PermissiveMode, the output Catalyst row might contain columns of 
null values for
   // corrupt records, even if some of the columns are not nullable in the 
user-provided schema.
diff --git 
a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala 
b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
index 95001bb8150..c8057ca5879 100644
--- a/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
+++ b/connector/avro/src/main/scala/org/apache/spark/sql/avro/AvroOptions.scala
@@ -130,6 +130,9 @@ private[sql] class AvroOptions(
   val datetimeRebaseModeInRead: String = parameters
 .get(DATETIME_REBASE_MODE)
 .getOrElse(SQLConf.get.getConf(SQLConf.AVRO_REBASE_MODE_IN_READ))
+
+  val useStableIdForUnionType: Boolean =
+parameters.get(STABLE_ID_FOR_UNION_TYPE).map(_.toBoolean).getOrElse(false)
 }
 
 private[sql] object AvroOptions extends DataSourceOptions {
@@ -154,4 +157,11 @@ private[sql] object AvroOptions extends DataSourceOptions {
   // datasource similarly to the SQL config 
`spark.sql.avro.datetimeRebaseModeInRead`,
   // and can be set to the same values: `EXCEPTION`, `LEGACY` or `CORRECTED`.
   val DATETIME_REBASE_MODE = newOption("datetimeRebaseMode")
+  // If it is set to true, Avro schema is deserialized into Spark SQL schema, 
and the Avro Union
+  // type is transformed into a structure where the field names remain 
consistent with their
+  // respective types. The resulting field names are converted to lowercase, 
e.g. member_int or
+  // member_string. If two user-defined type names or a user-defined type name 
and a built-in
+  // type name are identical regardless of case, an exception will be raised. 
However, in other
+  // cases, the field 

[spark-connect-go] branch master updated: [SPARK-43895] Basic Repository Layout

2023-05-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


The following commit(s) were added to refs/heads/master by this push:
 new aa4cef7  [SPARK-43895] Basic Repository Layout
aa4cef7 is described below

commit aa4cef71a43192b70120ce075a8e2e9076207f09
Author: Martin Grund 
AuthorDate: Thu Jun 1 09:10:21 2023 +0900

[SPARK-43895] Basic Repository Layout

## Proposed Changes

In order to prepare the implementation of the Go client, this patch prepares the repository layout.

* Adds the git submodule for the Spark repository, currently pointing at 
`master` for the reference to the proto files.
* Adds the Makefile and the necessary outline for the build system
* Adds the `buf` generation logic for the protos.

## Testing

`make gen && make check && make test`

Closes #3 from grundprinzip/SPARK-43895.

Lead-authored-by: Martin Grund 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .gitignore|  23 +
 .gitmodules   |   3 ++
 Makefile  | 105 ++
 README.md |   9 +
 buf.gen.yaml  |  23 +
 buf.work.yaml |  19 +++
 go.mod|  53 +
 go.sum|  61 ++
 spark |   1 +
 9 files changed, 297 insertions(+)

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 000..0bca2cf
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,23 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# All generated files
+internal/generated
+internal/generated.out
+
+# Ignore Coverage Files
+coverage*
diff --git a/.gitmodules b/.gitmodules
new file mode 100644
index 000..59db423
--- /dev/null
+++ b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "spark"]
+   path = spark
+   url = https://github.com/apache/spark.git
diff --git a/Makefile b/Makefile
new file mode 100644
index 000..a9829da
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,105 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+FIRST_GOPATH  := $(firstword $(subst :, ,$(GOPATH)))
+PKGS  := $(shell go list ./... | grep -v /tests | grep -v 
/xcpb | grep -v /gpb)
+GOFILES_NOVENDOR  := $(shell find . -name vendor -prune -o -type f 
-name '*.go' -not -name '*.pb.go' -print)
+GOFILES_BUILD := $(shell find . -type f -name '*.go' -not -name 
'*_test.go')
+PROTOFILES:= $(shell find . -name vendor -prune -o -type f 
-name '*.proto' -print)
+
+ALLGOFILES := $(shell find 
. -type f -name '*.go')
+DATE  := $(shell date -u -d "@$(SOURCE_DATE_EPOCH)" 
'+%FT%T%z' 2>/dev/null || date -u '+%FT%T%z')
+
+BUILDFLAGS_NOPIE :=
+#BUILDFLAGS_NOPIE  := -trimpath -ldflags="-s -w -X 
main.version=$(GOPASS_VERSION) -X main.commit=$(GOPASS_REVISION) -X 
main.date=$(DATE)" -gcflags="-trimpath=$(GOPATH)" 
-asmflags="-trimpath=$(GOPATH)"
+BUILDFLAGS?= $(BUILDFLAGS_NOPIE) -buildmode=pie
+TESTFLAGS ?=
+PWD   := $(shell pwd)
+PREFIX?= $(GOPATH)
+BINDIR?= $(PREFIX)/bin
+GO:= GO111MODULE=on go
+GOOS  ?= $(shell go version | cut -d' ' -f4 | c

[spark-connect-go] branch master updated: [MINOR] Fix the merge script to work

2023-05-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


The following commit(s) were added to refs/heads/master by this push:
 new ce56131  [MINOR] Fix the merge script to work
ce56131 is described below

commit ce56131597dfd555aa24573ca122887757d50a98
Author: Hyukjin Kwon 
AuthorDate: Thu Jun 1 09:12:07 2023 +0900

[MINOR] Fix the merge script to work
---
 merge_connect_go_pr.py | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/merge_connect_go_pr.py b/merge_connect_go_pr.py
index 417c0ff..bb514a8 100755
--- a/merge_connect_go_pr.py
+++ b/merge_connect_go_pr.py
@@ -478,7 +478,11 @@ def main():
 branches = get_json("%s/branches" % GITHUB_API_BASE)
 branch_names = list(filter(lambda x: x.startswith("branch-"), [x["name"] 
for x in branches]))
 # Assumes branch names can be sorted lexicographically
-latest_branch = sorted(branch_names, reverse=True)[0]
+if len(branch_names) == 0:
+# Remove this when we have a branch. It fails now because we don't 
have branch-*.
+latest_branch = "master"
+else:
+latest_branch = sorted(branch_names, reverse=True)[0]
 
 pr_num = input("Which pull request would you like to merge? (e.g. 34): ")
 pr = get_json("%s/pulls/%s" % (GITHUB_API_BASE, pr_num))


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (8f2afb88d42 -> e3eba0b8877)

2023-05-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 8f2afb88d42 [SPARK-43333][SQL] Allow Avro to convert union type to SQL with field name stable with type
 add e3eba0b8877 [SPARK-43896][TESTS][PS][CONNECT] Enable `test_iterrows` 
and `test_itertuples` on Connect

No new revisions were added by this update.

Summary of changes:
 .../pyspark/pandas/tests/connect/indexes/test_parity_indexing.py  | 8 
 1 file changed, 8 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-43516][ML][FOLLOW-UP] Make `pyspark.mlv2` module supports python < 3.9

2023-05-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 33a76811b23 [SPARK-43516][ML][FOLLOW-UP] Make `pyspark.mlv2` module 
supports python < 3.9
33a76811b23 is described below

commit 33a76811b23f2249cf9343fdc4ef654d12bd23b5
Author: Weichen Xu 
AuthorDate: Thu Jun 1 10:24:39 2023 +0900

[SPARK-43516][ML][FOLLOW-UP] Make `pyspark.mlv2` module supports python < 
3.9

### What changes were proposed in this pull request?

Make the `pyspark.mlv2` module support Python < 3.9.
We need to change some type hint definitions to make them compatible with Python < 3.9.
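
A small, generic illustration of the kind of change involved (not the actual `pyspark.mlv2` code): PEP 585 built-in generics such as `list[tuple[str, str]]` fail at function-definition time on Python 3.7/3.8, while the `typing` equivalents work everywhere.

```python
from typing import List, Tuple

# Python >= 3.9 only (evaluating `list[tuple[str, str]]` raises TypeError on 3.7/3.8):
#     def _output_columns(self) -> list[tuple[str, str]]: ...

# Portable form that also works on Python 3.7 / 3.8:
def output_columns() -> List[Tuple[str, str]]:
    # Hypothetical column name and Spark type, for illustration only.
    return [("scaled_features", "array<double>")]

print(output_columns())
```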

### Why are the changes needed?

PySpark master still needs to support Python 3.7 and Python 3.8.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually run `pyspark.mlv2` tests against python 3.7 or python 3.8

Closes #41405 from WeichenXu123/fix-tpye-hints.

Authored-by: Weichen Xu 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/mlv2/base.py   | 5 +++--
 python/pyspark/mlv2/feature.py| 7 +++
 python/pyspark/mlv2/summarizer.py | 8 
 python/pyspark/mlv2/util.py   | 7 +++
 4 files changed, 13 insertions(+), 14 deletions(-)

diff --git a/python/pyspark/mlv2/base.py b/python/pyspark/mlv2/base.py
index 4c0d4652928..dc503db71c0 100644
--- a/python/pyspark/mlv2/base.py
+++ b/python/pyspark/mlv2/base.py
@@ -16,7 +16,6 @@
 #
 
 from abc import ABCMeta, abstractmethod
-from collections.abc import Callable
 
 import pandas as pd
 
@@ -28,6 +27,8 @@ from typing import (
 TypeVar,
 Union,
 TYPE_CHECKING,
+Tuple,
+Callable,
 )
 
 from pyspark import since
@@ -123,7 +124,7 @@ class Transformer(Params, metaclass=ABCMeta):
 """
 raise NotImplementedError()
 
-def _output_columns(self) -> list[tuple[str, str]]:
+def _output_columns(self) -> List[Tuple[str, str]]:
 """
 Return a list of output transformed columns, each elements in the list
 is a tuple of (column_name, column_spark_type)
diff --git a/python/pyspark/mlv2/feature.py b/python/pyspark/mlv2/feature.py
index 6bbcdf7eaac..cecff362823 100644
--- a/python/pyspark/mlv2/feature.py
+++ b/python/pyspark/mlv2/feature.py
@@ -15,10 +15,9 @@
 # limitations under the License.
 #
 
-from collections.abc import Callable
 import numpy as np
 import pandas as pd
-from typing import Any, Union
+from typing import Any, Union, List, Tuple, Callable
 
 from pyspark.sql import DataFrame
 from pyspark.mlv2.base import Estimator, Model
@@ -60,7 +59,7 @@ class MaxAbsScalerModel(Model, HasInputCol, HasOutputCol):
 def _input_column_name(self) -> str:
 return self.getInputCol()
 
-def _output_columns(self) -> list[tuple[str, str]]:
+def _output_columns(self) -> List[Tuple[str, str]]:
 return [(self.getOutputCol(), "array")]
 
 def _get_transform_fn(self) -> Callable[["pd.Series"], Any]:
@@ -108,7 +107,7 @@ class StandardScalerModel(Model, HasInputCol, HasOutputCol):
 def _input_column_name(self) -> str:
 return self.getInputCol()
 
-def _output_columns(self) -> list[tuple[str, str]]:
+def _output_columns(self) -> List[Tuple[str, str]]:
 return [(self.getOutputCol(), "array")]
 
 def _get_transform_fn(self) -> Callable[["pd.Series"], Any]:
diff --git a/python/pyspark/mlv2/summarizer.py 
b/python/pyspark/mlv2/summarizer.py
index 6bf03c26e19..18776b71eb2 100644
--- a/python/pyspark/mlv2/summarizer.py
+++ b/python/pyspark/mlv2/summarizer.py
@@ -17,7 +17,7 @@
 
 import numpy as np
 import pandas as pd
-from typing import Any, Union
+from typing import Any, Union, List, Dict
 
 from pyspark.sql import DataFrame
 from pyspark.mlv2.util import aggregate_dataframe
@@ -46,7 +46,7 @@ class SummarizerAggState:
 self.max_values = np.maximum(self.max_values, state.max_values)
 return self
 
-def to_result(self, metrics: list[str]) -> dict[str, Any]:
+def to_result(self, metrics: List[str]) -> Dict[str, Any]:
 result = {}
 
 for metric in metrics:
@@ -75,8 +75,8 @@ class SummarizerAggState:
 
 
 def summarize_dataframe(
-dataframe: Union["DataFrame", "pd.DataFrame"], column: str, metrics: 
list[str]
-) -> dict[str, Any]:
+dataframe: Union["DataFrame", "pd.DataFrame"], column: str, metrics: 
List[str]
+) -> Dict[str, Any]:
 """
 Summarize an array type column over a spark dataframe or a pandas dataframe
 
diff --git a/python/pyspark/mlv2/util.py b/python/pyspark/mlv2/util.py
index de2ffb3d7c1..9aebb3fa9a3 100644
--- a/python/pyspark/mlv2/util.py
+++ b/python/pyspark/mlv2/util.py
@@ -16,8 +16,7 @@
 #
 
 import pandas as pd
-from collections.abc import Callable, Iterable
-from typing import Any, Union
+from typing im

[spark] branch master updated: [SPARK-43898][CORE] Automatically register `immutable.ArraySeq$ofRef` to `KryoSerializer` for Scala 2.13

2023-05-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 398435fcd6e [SPARK-43898][CORE] Automatically register 
`immutable.ArraySeq$ofRef` to `KryoSerializer` for Scala 2.13
398435fcd6e is described below

commit 398435fcd6eb8623219c8d6c3e5966463ebe4429
Author: yangjie01 
AuthorDate: Wed May 31 19:23:30 2023 -0700

[SPARK-43898][CORE] Automatically register `immutable.ArraySeq$ofRef` to 
`KryoSerializer` for Scala 2.13

SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt clean "sql/Test/runMain 
org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location 
/Users/yangjie01/Tools/tpcds-sf-1 --query-filter q17" -Pscala-2.13

### What changes were proposed in this pull request?
This PR aims to automatically register `immutable.ArraySeq$ofRef` with `KryoSerializer` for Scala 2.13, so that `TPCDSQueryBenchmark` can run successfully using Scala 2.13.
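
Before this fix, a workaround was to register the class manually. Below is a hedged PySpark sketch using the standard Kryo settings (`spark.kryo.registrationRequired`, `spark.kryo.classesToRegister`); with this patch the manual registration is no longer needed.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.kryo.registrationRequired", "true")
    # Manual workaround prior to this patch: register the Scala 2.13 class explicitly.
    .config("spark.kryo.classesToRegister", "scala.collection.immutable.ArraySeq$ofRef")
    .getOrCreate()
)
```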

### Why are the changes needed?
Scala 2.13 introduced `scala.collection.immutable.ArraySeq$ofRef`, but it has not been registered with `KryoSerializer`, so when running `TPCDSQueryBenchmark` with `KryoSerializer` and `spark.kryo.registrationRequired` enabled (the default behavior after SPARK-42074), the following error occurs:

```
Error: Exception in thread "main" org.apache.spark.SparkException: Job 
aborted due to stage failure: Failed to serialize task 741, not attempting to 
retry it. Exception during serialization: java.io.IOException: 
java.lang.IllegalArgumentException: Class is not registered: 
scala.collection.immutable.ArraySeq$ofRef
Note: To register this class use: 
kryo.register(scala.collection.immutable.ArraySeq$ofRef.class);
at 
org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2815)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2751)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2750)
at scala.collection.immutable.List.foreach(List.scala:333)
at 
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2750)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1218)
at 
org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1218)
at scala.Option.foreach(Option.scala:437)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1218)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3014)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2953)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2942)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:983)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2285)
at 
org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:385)
at 
org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2$(WriteToDataSourceV2Exec.scala:359)
at 
org.apache.spark.sql.execution.datasources.v2.OverwriteByExpressionExec.writeWithV2(WriteToDataSourceV2Exec.scala:243)
at 
org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run(WriteToDataSourceV2Exec.scala:337)
at 
org.apache.spark.sql.execution.datasources.v2.V2ExistingTableWriteExec.run$(WriteToDataSourceV2Exec.scala:336)
at 
org.apache.spark.sql.execution.datasources.v2.OverwriteByExpressionExec.run(WriteToDataSourceV2Exec.scala:243)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result$lzycompute(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.result(V2CommandExec.scala:43)
at 
org.apache.spark.sql.execution.datasources.v2.V2CommandExec.executeCollect(V2CommandExec.scala:49)
at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:98)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:118)
at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:195)
at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:103)
at 
org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:825)
  

[spark] branch master updated: [SPARK-43504][K8S] Mounts the hadoop config map on the executor pod

2023-05-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 10ee643c674 [SPARK-43504][K8S] Mounts the hadoop config map on the 
executor pod
10ee643c674 is described below

commit 10ee643c6746bbf110be6d680fe1c3559e522720
Author: fwang12 
AuthorDate: Wed May 31 21:30:05 2023 -0700

[SPARK-43504][K8S] Mounts the hadoop config map on the executor pod

### What changes were proposed in this pull request?

In this PR, for Spark on K8s, the Hadoop config map is mounted on the executor side as well.
Before, the Hadoop config map was only mounted on the driver side.
### Why are the changes needed?

Since [SPARK-25815](https://issues.apache.org/jira/browse/SPARK-25815) (https://github.com/apache/spark/pull/22911), the Hadoop config map is not mounted on the executor side.

Per the  https://github.com/apache/spark/pull/22911 description:

> The main two things that don't need to happen in executors anymore are:
> 1. adding the Hadoop config to the executor pods: this is not needed
> since the Spark driver will serialize the Hadoop config and send
> it to executors when running tasks.

But in fact, the executors still need the Hadoop configuration.


![image](https://github.com/apache/spark/assets/6757692/ff6374c9-7ebd-4472-a85c-99c75a737e2a)

As shown in the picture above, the driver can resolve `hdfs://zeus`, but the executor cannot.

So we still need to mount the Hadoop config map on the executor side.
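
For context, a hedged sketch of pointing Spark at an existing Hadoop ConfigMap from PySpark; the ConfigMap name and API-server URL are hypothetical values, and the config key shown is the existing Kubernetes Hadoop ConfigMap setting.

```python
from pyspark.sql import SparkSession

# Sketch only: "my-hadoop-conf" and the API-server URL are hypothetical values.
spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.default.svc")
    .config("spark.kubernetes.hadoop.configMapName", "my-hadoop-conf")
    .getOrCreate()
)

# With this change the ConfigMap is mounted on executor pods too, so executors can
# resolve HDFS nameservices such as hdfs://zeus without extra workarounds.
```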

### Does this PR introduce _any_ user-facing change?

Yes, users no longer need to take workarounds to make executors load the Hadoop configuration, such as:
- including the Hadoop conf in the executor image
- placing Hadoop conf files under `SPARK_CONF_DIR`.
### How was this patch tested?

UT.

Closes #41181 from turboFei/exec_hadoop_conf.

Authored-by: fwang12 
Signed-off-by: Dongjoon Hyun 
---
 .../k8s/features/HadoopConfDriverFeatureStep.scala |  8 +++
 .../features/HadoopConfExecutorFeatureStep.scala   | 63 +
 .../cluster/k8s/KubernetesExecutorBuilder.scala|  1 +
 .../HadoopConfExecutorFeatureStepSuite.scala   | 82 ++
 4 files changed, 154 insertions(+)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStep.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStep.scala
index 069a57d3dc4..45a5b8d7dae 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStep.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfDriverFeatureStep.scala
@@ -104,6 +104,14 @@ private[spark] class HadoopConfDriverFeatureStep(conf: 
KubernetesConf)
 }
   }
 
+  override def getAdditionalPodSystemProperties(): Map[String, String] = {
+if (hasHadoopConf) {
+  Map(HADOOP_CONFIG_MAP_NAME -> 
existingConfMap.getOrElse(newConfigMapName))
+} else {
+  Map.empty
+}
+  }
+
   override def getAdditionalKubernetesResources(): Seq[HasMetadata] = {
 if (confDir.isDefined) {
   val fileMap = confFiles.map { file =>
diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfExecutorFeatureStep.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfExecutorFeatureStep.scala
new file mode 100644
index 000..8a2773c1ac3
--- /dev/null
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/features/HadoopConfExecutorFeatureStep.scala
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.deploy.k8s.features
+
+import io.fabric8.kubernetes.api.model.{ContainerBuilder, PodBuilder, 
VolumeBuilder}
+
+import org.apache.spark.deploy.k8s.{KubernetesConf, SparkPod}
+impo

[spark] branch master updated: [SPARK-43421][SS] Implement Changelog based Checkpointing for RocksDB State Store Provider

2023-05-31 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c02c8be43c6 [SPARK-43421][SS] Implement Changelog based Checkpointing 
for RocksDB State Store Provider
c02c8be43c6 is described below

commit c02c8be43c64dbd6bffceb53f8b18cd19a0d2f2e
Author: Chaoqin Li 
AuthorDate: Thu Jun 1 14:34:24 2023 +0900

[SPARK-43421][SS] Implement Changelog based Checkpointing for RocksDB State 
Store Provider

### What changes were proposed in this pull request?
In order to reduce the checkpoint duration and end-to-end latency, we propose
Changelog Based Checkpointing for the RocksDB State Store Provider. Below is the
mechanism; a usage sketch follows the list.
1. Changelog checkpoint: upon each put()/delete() call to the local RocksDB
instance, log the operation to a changelog file. During the state change
commit, sync the compressed changelog of the current batch to DFS as
checkpointDir/{version}.delta.
2. Version reconstruction: for version j, find the latest snapshot i.zip such
that i <= j, load snapshot i, and replay i+1.delta through j.delta. This is used
both when loading the initial state and when creating the latest version snapshot.
Note: if a query is shut down without an exception, there is no changelog replay
during query restart, because a maintenance task is executed before the state
store instance is unloaded.
3. Background snapshot: a maintenance thread in the executors launches
maintenance tasks periodically. Inside the maintenance task, the latest local
RocksDB snapshot is synced to DFS as checkpointDir/{version}.zip. Snapshots enable
faster failure recovery and allow old versions to be purged.
4. Garbage collection: inside the maintenance task, delete snapshot and
delta files from DFS for versions that are outside the retention range (the
default number of retained versions is 100).
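
As a usage illustration (not part of the patch itself), a stateful query can opt in by selecting the RocksDB provider and enabling the new flag; the rate source, console sink, and checkpoint path below are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Minimal sketch: enable changelog checkpointing for the RocksDB state store.
val spark = SparkSession.builder()
  .appName("changelog-checkpointing-sketch")
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
  .config("spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled", "true")
  .getOrCreate()

// Any stateful operator (here a streaming aggregation) exercises the state store.
val query = spark.readStream
  .format("rate")
  .load()
  .groupBy(expr("value % 10").as("bucket"))
  .count()
  .writeStream
  .outputMode("update")
  .format("console")
  .option("checkpointLocation", "/tmp/changelog-ckpt") // placeholder path
  .start()

query.awaitTermination()
```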

### Why are the changes needed?
We have identified state checkpointing latency as one of the major
performance bottlenecks for stateful streaming queries. Currently, the RocksDB
state store pauses the RocksDB instances to upload a snapshot to the cloud when
committing a batch, which is heavyweight and has unpredictable performance.
With changelog-based checkpointing, we allow the RocksDB instance to run
without interruption, which improves RocksDB operation performance. It also
dramatically reduces the commit time and batch duration because a much smaller
amount of data is uploaded during the state commit. With this change,
stateful queries with the RocksDB state store will have lower and more predictable
latency.

### How was this patch tested?
Add a unit test for the changelog checkpointing utility.
Add unit and integration tests that check backward compatibility with
existing checkpoints.
Enable the RocksDB state store unit tests and the stateful streaming query
integration tests to run with changelog checkpointing enabled.

Closes #41099 from chaoqin-li1123/changelog.

Authored-by: Chaoqin Li 
Signed-off-by: Jungtaek Lim 
---
 docs/structured-streaming-programming-guide.md |  18 +
 .../sql/execution/streaming/state/RocksDB.scala| 187 --
 .../streaming/state/RocksDBFileManager.scala   | 119 --
 .../state/RocksDBStateStoreProvider.scala  |   8 +-
 .../streaming/state/StateStoreChangelog.scala  | 167 +
 .../state/RocksDBStateStoreIntegrationSuite.scala  |  60 ++-
 .../streaming/state/RocksDBStateStoreSuite.scala   |  80 +++-
 .../execution/streaming/state/RocksDBSuite.scala   | 402 +
 .../state/StateStoreCompatibilitySuite.scala   |   2 +-
 .../streaming/state/StateStoreSuite.scala  | 109 +++---
 .../streaming/FlatMapGroupsWithStateSuite.scala|   3 +
 .../sql/streaming/RocksDBStateStoreTest.scala  |  52 +++
 .../sql/streaming/StreamingAggregationSuite.scala  |   3 +
 .../streaming/StreamingDeduplicationSuite.scala|   3 +
 14 files changed, 1012 insertions(+), 201 deletions(-)

diff --git a/docs/structured-streaming-programming-guide.md 
b/docs/structured-streaming-programming-guide.md
index 267596a3899..53d5919d4dc 100644
--- a/docs/structured-streaming-programming-guide.md
+++ b/docs/structured-streaming-programming-guide.md
@@ -2320,6 +2320,11 @@ Here are the configs regarding to RocksDB instance of 
the state store provider:
 Whether we perform a range compaction of RocksDB instance for commit 
operation
 False
   
+  
+
spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled
+Whether to upload changelog instead of snapshot during RocksDB 
StateStore commit
+False
+  
   
 spark.sql.streaming.stateStore.rocksdb.blockSizeKB
 Approximate size in KB of user data packed per block for a RocksDB 
BlockBasedTable, which is a RocksDB's default SST file format.
@@ -2389,6 +2394,19 @@ If you want to cap R

[spark] branch master updated (c02c8be43c6 -> c8c6b0c3308)

2023-05-31 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c02c8be43c6 [SPARK-43421][SS] Implement Changelog based Checkpointing 
for RocksDB State Store Provider
 add c8c6b0c3308 Revert 
[SPARK-43836][SPARK-43845][SPARK-43858][SPARK-43858] to keep Scala 2.12 for 
Spark 3.x

No new revisions were added by this update.

Summary of changes:
 .github/workflows/benchmark.yml|   4 +-
 .../{build_scala212.yml => build_scala213.yml} |   4 +-
 assembly/pom.xml   |   4 +-
 common/kvstore/pom.xml |   4 +-
 common/network-common/pom.xml  |   4 +-
 common/network-shuffle/pom.xml |   4 +-
 common/network-yarn/pom.xml|   4 +-
 common/sketch/pom.xml  |   4 +-
 common/tags/pom.xml|   4 +-
 common/unsafe/pom.xml  |   4 +-
 common/utils/pom.xml   |   4 +-
 connector/avro/pom.xml |   8 +-
 connector/connect/client/jvm/pom.xml   |   4 +-
 connector/connect/common/pom.xml   |   4 +-
 connector/connect/server/pom.xml   |   8 +-
 connector/docker-integration-tests/pom.xml |   4 +-
 connector/kafka-0-10-assembly/pom.xml  |   4 +-
 connector/kafka-0-10-sql/pom.xml   |   8 +-
 connector/kafka-0-10-token-provider/pom.xml|   4 +-
 connector/kafka-0-10/pom.xml   |   8 +-
 connector/kinesis-asl-assembly/pom.xml |   4 +-
 connector/kinesis-asl/pom.xml  |   4 +-
 connector/protobuf/pom.xml |   8 +-
 connector/spark-ganglia-lgpl/pom.xml   |   4 +-
 core/pom.xml   |   8 +-
 dev/deps/spark-deps-hadoop-3-hive-2.3  |  43 +++
 dev/mima   |   8 +-
 dev/test-dependencies.sh   |   2 +-
 docs/_plugins/copy_api_dirs.rb |  14 +--
 examples/pom.xml   |   4 +-
 graphx/pom.xml |   4 +-
 hadoop-cloud/pom.xml   |   4 +-
 launcher/pom.xml   |   4 +-
 mllib-local/pom.xml|   4 +-
 mllib/pom.xml  |   8 +-
 pom.xml| 137 +++--
 repl/pom.xml   |   4 +-
 resource-managers/kubernetes/core/pom.xml  |   4 +-
 .../kubernetes/integration-tests/pom.xml   |   4 +-
 resource-managers/mesos/pom.xml|   4 +-
 resource-managers/yarn/pom.xml |   4 +-
 sql/catalyst/pom.xml   |   8 +-
 sql/core/pom.xml   |   8 +-
 sql/hive-thriftserver/pom.xml  |   8 +-
 sql/hive/pom.xml   |   8 +-
 streaming/pom.xml  |   8 +-
 tools/pom.xml  |   4 +-
 47 files changed, 213 insertions(+), 207 deletions(-)
 rename .github/workflows/{build_scala212.yml => build_scala213.yml} (93%)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark-docker] branch master updated: [SPARK-43370] Switch spark user only when run driver and executor

2023-05-31 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new 2dc12d9  [SPARK-43370] Switch spark user only when run driver and 
executor
2dc12d9 is described below

commit 2dc12d96910710aa6ee2d717c4c723ddd75127a1
Author: Yikun Jiang 
AuthorDate: Thu Jun 1 14:36:17 2023 +0800

[SPARK-43370] Switch spark user only when run driver and executor

### What changes were proposed in this pull request?
Switch to the spark user only when running the driver and executor.

### Why are the changes needed?
Address Docker Official Images (DOI) review comments: question 7 [1]

[1] 
https://github.com/docker-library/official-images/pull/13089#issuecomment-1533540388
[2] 
https://github.com/docker-library/official-images/pull/13089#issuecomment-1561793792

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
1. Tested manually:
```
cd ~/spark-docker/3.4.0/scala2.12-java11-ubuntu
$ docker build . -t spark-test

$ docker run -ti spark-test bash
spark@afa78af05cf8:/opt/spark/work-dir$

$ docker run  --user root  -ti spark-test bash
root@095e0d7651fd:/opt/spark/work-dir#
```
2. CI passed

Closes: https://github.com/apache/spark-docker/pull/44

Closes #43 from Yikun/SPARK-43370.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile |  4 
 3.4.0/scala2.12-java11-python3-ubuntu/Dockerfile   |  4 
 3.4.0/scala2.12-java11-r-ubuntu/Dockerfile |  4 
 3.4.0/scala2.12-java11-ubuntu/Dockerfile   |  2 ++
 3.4.0/scala2.12-java11-ubuntu/entrypoint.sh| 23 +++---
 Dockerfile.template|  2 ++
 entrypoint.sh.template | 23 +++---
 r-python.template  |  4 
 8 files changed, 44 insertions(+), 22 deletions(-)

diff --git a/3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile 
b/3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
index 7734100..0f1962f 100644
--- a/3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
+++ b/3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -16,6 +16,8 @@
 #
 FROM spark:3.4.0-scala2.12-java11-ubuntu
 
+USER root
+
 RUN set -ex; \
 apt-get update; \
 apt install -y python3 python3-pip; \
@@ -24,3 +26,5 @@ RUN set -ex; \
 rm -rf /var/lib/apt/lists/*
 
 ENV R_HOME /usr/lib/R
+
+USER spark
diff --git a/3.4.0/scala2.12-java11-python3-ubuntu/Dockerfile 
b/3.4.0/scala2.12-java11-python3-ubuntu/Dockerfile
index 6c12c30..258d806 100644
--- a/3.4.0/scala2.12-java11-python3-ubuntu/Dockerfile
+++ b/3.4.0/scala2.12-java11-python3-ubuntu/Dockerfile
@@ -16,8 +16,12 @@
 #
 FROM spark:3.4.0-scala2.12-java11-ubuntu
 
+USER root
+
 RUN set -ex; \
 apt-get update; \
 apt install -y python3 python3-pip; \
 rm -rf /var/cache/apt/*; \
 rm -rf /var/lib/apt/lists/*
+
+USER spark
diff --git a/3.4.0/scala2.12-java11-r-ubuntu/Dockerfile 
b/3.4.0/scala2.12-java11-r-ubuntu/Dockerfile
index 24cd41a..4c928c6 100644
--- a/3.4.0/scala2.12-java11-r-ubuntu/Dockerfile
+++ b/3.4.0/scala2.12-java11-r-ubuntu/Dockerfile
@@ -16,6 +16,8 @@
 #
 FROM spark:3.4.0-scala2.12-java11-ubuntu
 
+USER root
+
 RUN set -ex; \
 apt-get update; \
 apt install -y r-base r-base-dev; \
@@ -23,3 +25,5 @@ RUN set -ex; \
 rm -rf /var/lib/apt/lists/*
 
 ENV R_HOME /usr/lib/R
+
+USER spark
diff --git a/3.4.0/scala2.12-java11-ubuntu/Dockerfile 
b/3.4.0/scala2.12-java11-ubuntu/Dockerfile
index 205b399..a680106 100644
--- a/3.4.0/scala2.12-java11-ubuntu/Dockerfile
+++ b/3.4.0/scala2.12-java11-ubuntu/Dockerfile
@@ -77,4 +77,6 @@ ENV SPARK_HOME /opt/spark
 
 WORKDIR /opt/spark/work-dir
 
+USER spark
+
 ENTRYPOINT [ "/opt/entrypoint.sh" ]
diff --git a/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh 
b/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh
index 716f1af..6def3f9 100755
--- a/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh
+++ b/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh
@@ -69,6 +69,13 @@ elif ! [ -z ${SPARK_HOME+x} ]; then
   SPARK_CLASSPATH="$SPARK_HOME/conf:$SPARK_CLASSPATH";
 fi
 
+# Switch to spark if no USER specified (root by default) otherwise use USER 
directly
+switch_spark_if_root() {
+  if [ $(id -u) -eq 0 ]; then
+echo gosu spark
+  fi
+}
+
 case "$1" in
   driver)
 shift 1
@@ -78,6 +85,8 @@ case "$1" in
   --deploy-mode client
   "$@"
 )
+# Execute the container CMD under tini for better hygiene
+exec $(switch_spark_if_root) /usr/bin/tini -s -- "${CMD[@]}"
 ;;
   executor)
 shift 1
@@ -96,20 +105,12 @@ case "$1" in
   --resourceProfileId $SPARK_RESOURCE_PROFILE_ID
   --podName $SPARK_EXECUTOR_POD_NAME
 )
+# Execute the container CMD under tini for better hygiene
+

[spark-connect-go] branch master updated: [SPARK-43909] Adding PR Template

2023-05-31 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-connect-go.git


The following commit(s) were added to refs/heads/master by this push:
 new ec893d2  [SPARK-43909] Adding PR Template
ec893d2 is described below

commit ec893d29544a38da43298f72bc66a06e054bb147
Author: Martin Grund 
AuthorDate: Thu Jun 1 15:37:34 2023 +0900

[SPARK-43909] Adding PR Template

This patch adds the PR template to the repository.

Closes #4 from grundprinzip/SPARK-43909-p1.

Authored-by: Martin Grund 
Signed-off-by: Hyukjin Kwon 
---
 .github/PULL_REQUEST_TEMPLATE | 49 +++
 1 file changed, 49 insertions(+)

diff --git a/.github/PULL_REQUEST_TEMPLATE b/.github/PULL_REQUEST_TEMPLATE
new file mode 100644
index 000..241e3c1
--- /dev/null
+++ b/.github/PULL_REQUEST_TEMPLATE
@@ -0,0 +1,49 @@
+
+
+### What changes were proposed in this pull request?
+
+
+
+### Why are the changes needed?
+
+
+
+### Does this PR introduce _any_ user-facing change?
+
+
+
+### How was this patch tested?
+


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org