(spark) branch master updated: [SPARK-46587][SQL] XML: Fix XSD big integer conversion

2024-01-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c63e0641f2f3 [SPARK-46587][SQL] XML: Fix XSD big integer conversion
c63e0641f2f3 is described below

commit c63e0641f2f39c9812b58165d1f78daa120a990b
Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
AuthorDate: Thu Jan 4 16:42:36 2024 +0900

[SPARK-46587][SQL] XML: Fix XSD big integer conversion

### What changes were proposed in this pull request?
Fix XSD type conversion for some big integer types in XSDToSchema helper 
utility.

NOTE: This is a deviation from spark-xml.

### Why are the changes needed?
To correctly map XSD data types to Spark data types.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44587 from sandip-db/xml-xsd-datatype.

Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 .../execution/datasources/xml/XSDToSchema.scala|  13 +--
 .../datasources/xml/util/XSDToSchemaSuite.scala| 113 -
 2 files changed, 119 insertions(+), 7 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala
index 356ffd57698c..87082299615c 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/xml/XSDToSchema.scala
@@ -96,17 +96,18 @@ object XSDToSchema extends Logging{
   case facet: XmlSchemaTotalDigitsFacet => 
facet.getValue.toString.toInt
 }.getOrElse(38)
 DecimalType(totalDigits, math.min(totalDigits, fracDigits))
-  case Constants.XSD_UNSIGNEDLONG => DecimalType(38, 0)
+  case Constants.XSD_UNSIGNEDLONG |
+   Constants.XSD_INTEGER |
+   Constants.XSD_NEGATIVEINTEGER |
+   Constants.XSD_NONNEGATIVEINTEGER |
+   Constants.XSD_NONPOSITIVEINTEGER |
+   Constants.XSD_POSITIVEINTEGER => DecimalType(38, 0)
   case Constants.XSD_DOUBLE => DoubleType
   case Constants.XSD_FLOAT => FloatType
   case Constants.XSD_BYTE => ByteType
   case Constants.XSD_SHORT |
Constants.XSD_UNSIGNEDBYTE => ShortType
-  case Constants.XSD_INTEGER |
-   Constants.XSD_NEGATIVEINTEGER |
-   Constants.XSD_NONNEGATIVEINTEGER |
-   Constants.XSD_NONPOSITIVEINTEGER |
-   Constants.XSD_POSITIVEINTEGER |
+  case Constants.XSD_INT |
Constants.XSD_UNSIGNEDSHORT => IntegerType
   case Constants.XSD_LONG |
Constants.XSD_UNSIGNEDINT => LongType
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/util/XSDToSchemaSuite.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/util/XSDToSchemaSuite.scala
index 434b4655d408..1b8059340067 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/util/XSDToSchemaSuite.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/xml/util/XSDToSchemaSuite.scala
@@ -23,7 +23,7 @@ import org.apache.hadoop.fs.Path
 import org.apache.spark.sql.execution.datasources.xml.TestUtils._
 import org.apache.spark.sql.execution.datasources.xml.XSDToSchema
 import org.apache.spark.sql.test.SharedSparkSession
-import org.apache.spark.sql.types.{ArrayType, DecimalType, FloatType, 
LongType, StringType}
+import org.apache.spark.sql.types._
 
 class XSDToSchemaSuite extends SharedSparkSession {
 
@@ -183,4 +183,115 @@ class XSDToSchemaSuite extends SharedSparkSession {
   XSDToSchema.read(new Path("/path/not/found"))
 }
   }
+
+  test("Basic DataTypes parsing") {
+val xsdString =
+  """
[XSD string literal elided: the XML element declarations for the basic XSD data types were stripped of their tags by the list archive (only the http://www.w3.org/2001/XMLSchema namespace reference survived), and the email is truncated at this point.]
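
As a hedged usage sketch of the change above (the XSD path and its element declarations are assumptions for illustration, not taken from the patch), the remapping is visible to any caller of the `XSDToSchema` helper:

```
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.execution.datasources.xml.XSDToSchema

// Hypothetical XSD declaring one xs:integer element and one xs:int element.
val schema = XSDToSchema.read(new Path("/tmp/example.xsd"))
schema.printTreeString()
// Expected mapping after this change (per the XSDToSchema.scala hunk above):
//   xs:integer / xs:negativeInteger / xs:nonNegativeInteger /
//   xs:nonPositiveInteger / xs:positiveInteger / xs:unsignedLong -> decimal(38,0)
//   xs:int / xs:unsignedShort                                    -> integer
```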

(spark) branch master updated: [SPARK-46530][PYTHON][SQL][FOLLOW-UP] Uses path separator instead of file separator to correctly check PySpark library existence

2024-01-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b303eced7f86 [SPARK-46530][PYTHON][SQL][FOLLOW-UP] Uses path separator 
instead of file separator to correctly check PySpark library existence
b303eced7f86 is described below

commit b303eced7f8639887278db34e0080ffa0c19bd0c
Author: Hyukjin Kwon 
AuthorDate: Thu Jan 4 15:49:45 2024 +0900

[SPARK-46530][PYTHON][SQL][FOLLOW-UP] Uses path separator instead of file 
separator to correctly check PySpark library existence

### What changes were proposed in this pull request?

This PR is a follow-up of https://github.com/apache/spark/pull/44519 that 
fixes a mistake in separating the paths. It should use `File.pathSeparator`.

### Why are the changes needed?

It works in testing mode, but it does not work in production mode.

### Does this PR introduce _any_ user-facing change?

No, because the main change has not been released.

### How was this patch tested?

Manually as described in "How was this patch tested?" at 
https://github.com/apache/spark/pull/44504.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44590 from HyukjinKwon/SPARK-46530-followup.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala   | 6 --
 .../apache/spark/sql/execution/datasources/DataSourceManager.scala  | 4 +---
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala 
b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
index 26c790a12447..929058fb7185 100644
--- a/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
+++ b/core/src/main/scala/org/apache/spark/api/python/PythonUtils.scala
@@ -36,7 +36,7 @@ private[spark] object PythonUtils extends Logging {
   val PY4J_ZIP_NAME = "py4j-0.10.9.7-src.zip"
 
   /** Get the PYTHONPATH for PySpark, either from SPARK_HOME, if it is set, or 
from our JAR */
-  def sparkPythonPath: String = {
+  def sparkPythonPaths: Seq[String] = {
 val pythonPath = new ArrayBuffer[String]
 for (sparkHome <- sys.env.get("SPARK_HOME")) {
   pythonPath += Seq(sparkHome, "python", "lib", 
"pyspark.zip").mkString(File.separator)
@@ -44,9 +44,11 @@ private[spark] object PythonUtils extends Logging {
 Seq(sparkHome, "python", "lib", PY4J_ZIP_NAME).mkString(File.separator)
 }
 pythonPath ++= SparkContext.jarOfObject(this)
-pythonPath.mkString(File.pathSeparator)
+pythonPath.toSeq
   }
 
+  def sparkPythonPath: String = sparkPythonPaths.mkString(File.pathSeparator)
+
   /** Merge PYTHONPATHS with the appropriate separator. Ignores blank strings. 
*/
   def mergePythonPaths(paths: String*): String = {
 paths.filter(_ != "").mkString(File.pathSeparator)
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala
index 4fc636a59e5a..236ab98969e5 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceManager.scala
@@ -20,7 +20,6 @@ package org.apache.spark.sql.execution.datasources
 import java.io.File
 import java.util.Locale
 import java.util.concurrent.ConcurrentHashMap
-import java.util.regex.Pattern
 
 import scala.jdk.CollectionConverters._
 
@@ -91,8 +90,7 @@ object DataSourceManager extends Logging {
   private lazy val shouldLoadPythonDataSources: Boolean = {
 Utils.checkCommandAvailable(PythonUtils.defaultPythonExec) &&
   // Make sure PySpark zipped files also exist.
-  PythonUtils.sparkPythonPath
-.split(Pattern.quote(File.separator)).forall(new File(_).exists())
+  PythonUtils.sparkPythonPaths.forall(new File(_).exists())
   }
 
   private def initialDataSourceBuilders: Map[String, 
UserDefinedPythonDataSource] = {
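
A small sketch of the separator distinction this follow-up relies on; the install paths and the Linux separator values shown in comments are assumptions for illustration:

```
import java.io.File

// File.separator separates directory names *within* one path ("/" on Linux), while
// File.pathSeparator separates *entries* of a PYTHONPATH-style list (":" on Linux).
val zips = Seq(
  "/opt/spark/python/lib/pyspark.zip",
  "/opt/spark/python/lib/py4j-0.10.9.7-src.zip")
val pythonPath = zips.mkString(File.pathSeparator)
// => "/opt/spark/python/lib/pyspark.zip:/opt/spark/python/lib/py4j-0.10.9.7-src.zip"

// The old check split this string on File.separator, producing fragments such as
// "opt", "spark" and "python" that are not existing files, so the existence check
// broke outside the test environment; iterating sparkPythonPaths directly avoids it.
```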





(spark) branch master updated: [SPARK-46576][SQL] Improve error messages for unsupported data source save mode

2024-01-03 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 69c46876b5a7 [SPARK-46576][SQL] Improve error messages for unsupported 
data source save mode
69c46876b5a7 is described below

commit 69c46876b5a76c2de6a149ea7663fad18027e387
Author: allisonwang-db 
AuthorDate: Thu Jan 4 09:40:40 2024 +0300

[SPARK-46576][SQL] Improve error messages for unsupported data source save 
mode

### What changes were proposed in this pull request?

This PR renames the error class `_LEGACY_ERROR_TEMP_1308` to 
`UNSUPPORTED_DATA_SOURCE_SAVE_MODE` and improves its error messages.

### Why are the changes needed?

To make the error more user-friendly.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44576 from allisonwang-db/spark-46576-unsupported-save-mode.

Authored-by: allisonwang-db 
Signed-off-by: Max Gekk 
---
 .../src/main/resources/error/error-classes.json | 11 ++-
 .../apache/spark/sql/kafka010/KafkaSinkSuite.scala  |  2 +-
 docs/sql-error-conditions.md|  6 ++
 .../spark/sql/errors/QueryCompilationErrors.scala   |  4 ++--
 .../spark/sql/connector/DataSourceV2Suite.scala |  8 
 .../execution/python/PythonDataSourceSuite.scala| 21 +++--
 6 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index bcaf8a74c08d..9cade1197dca 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -3588,6 +3588,12 @@
 ],
 "sqlState" : "0A000"
   },
+  "UNSUPPORTED_DATA_SOURCE_SAVE_MODE" : {
+"message" : [
+  "The data source '' cannot be written in the  mode. 
Please use either the \"Append\" or \"Overwrite\" mode instead."
+],
+"sqlState" : "0A000"
+  },
   "UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE" : {
 "message" : [
   "The  datasource doesn't support the column  of the 
type ."
@@ -5403,11 +5409,6 @@
   "There is a 'path' option set and save() is called with a path 
parameter. Either remove the path option, or call save() without the parameter. 
To ignore this check, set '' to 'true'."
 ]
   },
-  "_LEGACY_ERROR_TEMP_1308" : {
-"message" : [
-  "TableProvider implementation  cannot be written with 
 mode, please use Append or Overwrite modes instead."
-]
-  },
   "_LEGACY_ERROR_TEMP_1309" : {
 "message" : [
   "insertInto() can't be used together with partitionBy(). Partition 
columns have already been defined for the table. It is not necessary to use 
partitionBy()."
diff --git 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
index 6753f8be54bf..5566785c4d56 100644
--- 
a/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
+++ 
b/connector/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
@@ -557,7 +557,7 @@ class KafkaSinkBatchSuiteV2 extends KafkaSinkBatchSuiteBase 
{
 
   test("batch - unsupported save modes") {
 testUnsupportedSaveModes((mode) =>
-  Seq(s"cannot be written with ${mode.name} mode", "does not support 
truncate"))
+  Seq(s"cannot be written in the \"${mode.name}\" mode", "does not support 
truncate"))
   }
 
   test("generic - write big data with small producer buffer") {
diff --git a/docs/sql-error-conditions.md b/docs/sql-error-conditions.md
index c6108e97b4c5..89de607b0f22 100644
--- a/docs/sql-error-conditions.md
+++ b/docs/sql-error-conditions.md
@@ -2332,6 +2332,12 @@ Unsupported data source type for direct query on files: 
``
 
 Unsupported data type ``.
 
+### UNSUPPORTED_DATA_SOURCE_SAVE_MODE
+
+[SQLSTATE: 
0A000](sql-error-conditions-sqlstates.html#class-0A-feature-not-supported)
+
+The data source '`<format>`' cannot be written in the `<createMode>` mode. Please use either the "Append" or "Overwrite" mode instead.
+
 ### UNSUPPORTED_DATA_TYPE_FOR_DATASOURCE
 
 [SQLSTATE: 
0A000](sql-error-conditions-sqlstates.html#class-0A-feature-not-supported)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index b844ee2bdc45..90e7ab610f7a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.s
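
A hedged sketch of how the renamed error condition might be asserted in a `QueryTest`-style suite; the data source name, the save mode, and the exact parameter keys are assumptions for illustration rather than an excerpt from the (truncated) patch:

```
import org.apache.spark.sql.AnalysisException

// Hypothetical V2 source that only supports Append/Overwrite.
val e = intercept[AnalysisException] {
  spark.range(3).write.format("com.example.AppendOnlySource").mode("ignore").save()
}
checkError(
  exception = e,
  errorClass = "UNSUPPORTED_DATA_SOURCE_SAVE_MODE",
  parameters = Map("format" -> "com.example.AppendOnlySource", "createMode" -> "\"Ignore\""))
```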

(spark) branch master updated: [SPARK-46504][PS][TESTS][FOLLOWUP] Break the remaining part of `IndexesTests` into small test files

2024-01-03 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 59d147a4f48f [SPARK-46504][PS][TESTS][FOLLOWUP] Break the remaining 
part of `IndexesTests` into small test files
59d147a4f48f is described below

commit 59d147a4f48ff6112c682e9797dbd982022bfc10
Author: Ruifeng Zheng 
AuthorDate: Thu Jan 4 14:33:42 2024 +0800

[SPARK-46504][PS][TESTS][FOLLOWUP] Break the remaining part of 
`IndexesTests` into small test files

### What changes were proposed in this pull request?
Break the remaining part of `IndexesTests` into small test files

### Why are the changes needed?
testing parallelism

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ci

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44588 from zhengruifeng/ps_test_idx_base_lastlast.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 dev/sparktestsupport/modules.py|   8 +-
 .../{test_parity_base.py => test_parity_basic.py}  |  17 +-
 ...{test_parity_base.py => test_parity_getattr.py} |  17 +-
 .../{test_parity_base.py => test_parity_name.py}   |  17 +-
 .../tests/indexes/{test_base.py => test_basic.py}  | 155 +
 .../pyspark/pandas/tests/indexes/test_getattr.py   |  79 +
 python/pyspark/pandas/tests/indexes/test_name.py   | 183 +
 7 files changed, 296 insertions(+), 180 deletions(-)

diff --git a/dev/sparktestsupport/modules.py b/dev/sparktestsupport/modules.py
index a97e6afdc356..699a9d07452d 100644
--- a/dev/sparktestsupport/modules.py
+++ b/dev/sparktestsupport/modules.py
@@ -795,7 +795,9 @@ pyspark_pandas_slow = Module(
 "pyspark.pandas.generic",
 "pyspark.pandas.series",
 # unittests
-"pyspark.pandas.tests.indexes.test_base",
+"pyspark.pandas.tests.indexes.test_basic",
+"pyspark.pandas.tests.indexes.test_getattr",
+"pyspark.pandas.tests.indexes.test_name",
 "pyspark.pandas.tests.indexes.test_conversion",
 "pyspark.pandas.tests.indexes.test_drop",
 "pyspark.pandas.tests.indexes.test_level",
@@ -1095,7 +1097,9 @@ pyspark_pandas_connect_part0 = Module(
 "pyspark.pandas.tests.connect.test_parity_sql",
 "pyspark.pandas.tests.connect.test_parity_typedef",
 "pyspark.pandas.tests.connect.test_parity_utils",
-"pyspark.pandas.tests.connect.indexes.test_parity_base",
+"pyspark.pandas.tests.connect.indexes.test_parity_basic",
+"pyspark.pandas.tests.connect.indexes.test_parity_getattr",
+"pyspark.pandas.tests.connect.indexes.test_parity_name",
 "pyspark.pandas.tests.connect.indexes.test_parity_conversion",
 "pyspark.pandas.tests.connect.indexes.test_parity_drop",
 "pyspark.pandas.tests.connect.indexes.test_parity_level",
diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_base.py 
b/python/pyspark/pandas/tests/connect/indexes/test_parity_basic.py
similarity index 72%
copy from python/pyspark/pandas/tests/connect/indexes/test_parity_base.py
copy to python/pyspark/pandas/tests/connect/indexes/test_parity_basic.py
index 83ce92eb34b2..94651552ea8d 100644
--- a/python/pyspark/pandas/tests/connect/indexes/test_parity_base.py
+++ b/python/pyspark/pandas/tests/connect/indexes/test_parity_basic.py
@@ -16,22 +16,21 @@
 #
 import unittest
 
-from pyspark import pandas as ps
-from pyspark.pandas.tests.indexes.test_base import IndexesTestsMixin
+from pyspark.pandas.tests.indexes.test_basic import IndexBasicMixin
 from pyspark.testing.connectutils import ReusedConnectTestCase
-from pyspark.testing.pandasutils import PandasOnSparkTestUtils, TestUtils
+from pyspark.testing.pandasutils import PandasOnSparkTestUtils
 
 
-class IndexesParityTests(
-IndexesTestsMixin, PandasOnSparkTestUtils, TestUtils, ReusedConnectTestCase
+class IndexBasicParityTests(
+IndexBasicMixin,
+PandasOnSparkTestUtils,
+ReusedConnectTestCase,
 ):
-@property
-def psdf(self):
-return ps.from_pandas(self.pdf)
+pass
 
 
 if __name__ == "__main__":
-from pyspark.pandas.tests.connect.indexes.test_parity_base import *  # 
noqa: F401
+from pyspark.pandas.tests.connect.indexes.test_parity_basic import *  # 
noqa: F401
 
 try:
 import xmlrunner  # type: ignore[import]
diff --git a/python/pyspark/pandas/tests/connect/indexes/test_parity_base.py 
b/python/pyspark/pandas/tests/connect/indexes/test_parity_getattr.py
similarity index 72%
copy from python/pyspark/pandas/tests/connect/indexes/test_parity_base.py
copy to python/pyspark/pandas/tests/connect/indexes/test_parity_getattr.py
index 83ce92eb34b2..47d893bda3be 100644
--- a/python/pyspark/pandas/tests/connect/indexes

(spark) branch master updated (56023635ab8 -> 1cd3a1b0e1c)

2024-01-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 56023635ab8 [SPARK-46412][K8S][DOCS] Update Java and JDK info in K8S 
testing
 add 1cd3a1b0e1c Revert "[SPARK-46582][R][INFRA] Upgrade R Tools version 
from 4.0.2 to 4.3.2 in AppVeyor"

No new revisions were added by this update.

Summary of changes:
 dev/appveyor-install-dependencies.ps1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)





(spark) branch master updated (f3e454a8323 -> 56023635ab8)

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from f3e454a8323 [SPARK-45292][SQL][HIVE] Remove Guava from shared classes 
from IsolatedClientLoader
 add 56023635ab8 [SPARK-46412][K8S][DOCS] Update Java and JDK info in K8S 
testing

No new revisions were added by this update.

Summary of changes:
 resource-managers/kubernetes/integration-tests/README.md | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)





(spark) branch master updated: [SPARK-45292][SQL][HIVE] Remove Guava from shared classes from IsolatedClientLoader

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f3e454a8323a [SPARK-45292][SQL][HIVE] Remove Guava from shared classes 
from IsolatedClientLoader
f3e454a8323a is described below

commit f3e454a8323aa1f1948b0fe7981ac43aa674a32a
Author: Cheng Pan 
AuthorDate: Wed Jan 3 21:28:24 2024 -0800

[SPARK-45292][SQL][HIVE] Remove Guava from shared classes from 
IsolatedClientLoader

### What changes were proposed in this pull request?

Try removing Guava from `sharedClasses` as suggested by JoshRosen in 
https://github.com/apache/spark/pull/33989#issuecomment-928616327 and 
https://github.com/apache/spark/pull/42493#issuecomment-1687092403

### Why are the changes needed?

Unblock Guava upgrading.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed (embedded HMS) and verified in an internal YARN cluster (remote 
HMS with Kerberos enabled).

```
# already setup hive-site.xml stuff properly to make sure to use remote HMS
bin/spark-shell --conf spark.sql.hive.metastore.jars=maven

...

scala> spark.sql("show databases").show
warning: 1 deprecation (since 2.13.3); for details, enable `:setting 
-deprecation` or `:replay -deprecation`
https://maven-central.storage-download.googleapis.com/maven2/ added as a 
remote repository with the name: repo-1
Ivy Default Cache set to: /home/hadoop/.ivy2/cache
The jars for the packages stored in: /home/hadoop/.ivy2/jars
org.apache.hive#hive-metastore added as a dependency
org.apache.hive#hive-exec added as a dependency
org.apache.hive#hive-common added as a dependency
org.apache.hive#hive-serde added as a dependency
org.apache.hadoop#hadoop-client-api added as a dependency
org.apache.hadoop#hadoop-client-runtime added as a dependency
:: resolving dependencies :: 
org.apache.spark#spark-submit-parent-d0d2962d-ae27-4526-a0c7-040a542e1e54;1.0
confs: [default]
found org.apache.hive#hive-metastore;2.3.9 in central
found org.apache.hive#hive-serde;2.3.9 in central
found org.apache.hive#hive-common;2.3.9 in central
found org.apache.hive#hive-shims;2.3.9 in central
found org.apache.hive.shims#hive-shims-common;2.3.9 in central
found org.apache.logging.log4j#log4j-slf4j-impl;2.6.2 in central
found org.slf4j#slf4j-api;1.7.10 in central
found com.google.guava#guava;14.0.1 in central
found commons-lang#commons-lang;2.6 in central
found org.apache.thrift#libthrift;0.9.3 in central
found org.apache.httpcomponents#httpclient;4.4 in central
found org.apache.httpcomponents#httpcore;4.4 in central
found commons-logging#commons-logging;1.2 in central
found commons-codec#commons-codec;1.4 in central
found org.apache.zookeeper#zookeeper;3.4.6 in central
found org.slf4j#slf4j-log4j12;1.6.1 in central
found log4j#log4j;1.2.16 in central
found jline#jline;2.12 in central
found io.netty#netty;3.7.0.Final in central
found org.apache.hive.shims#hive-shims-0.23;2.3.9 in central
found org.apache.hadoop#hadoop-yarn-server-resourcemanager;2.7.2 in 
central
found org.apache.hadoop#hadoop-annotations;2.7.2 in central
found com.google.inject.extensions#guice-servlet;3.0 in central
found com.google.inject#guice;3.0 in central
found javax.inject#javax.inject;1 in central
found aopalliance#aopalliance;1.0 in central
found org.sonatype.sisu.inject#cglib;2.2.1-v20090111 in central
found asm#asm;3.2 in central
found com.google.protobuf#protobuf-java;2.5.0 in central
found commons-io#commons-io;2.4 in central
found com.sun.jersey#jersey-json;1.14 in central
found org.codehaus.jettison#jettison;1.1 in central
found com.sun.xml.bind#jaxb-impl;2.2.3-1 in central
found javax.xml.bind#jaxb-api;2.2.2 in central
found javax.xml.stream#stax-api;1.0-2 in central
found javax.activation#activation;1.1 in central
found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
found org.codehaus.jackson#jackson-mapper-asl;1.9.13 in central
found org.codehaus.jackson#jackson-jaxrs;1.9.13 in central
found org.codehaus.jackson#jackson-xc;1.9.13 in central
found com.sun.jersey#jersey-core;1.14 in central
found com.sun.jersey.contribs#jersey-guice;1.9 in central
found com.sun.jersey#jersey-server;1.14 in central
found org.ap

(spark) branch master updated (7b6077a02fc3 -> 733be49a8078)

2024-01-03 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 7b6077a02fc3 [SPARK-46584][SQL][TESTS] Remove invalid 
attachCleanupResourceChecker in JoinSuite
 add 733be49a8078 [SPARK-46539][SQL][FOLLOWUP] fix golden files

No new revisions were added by this update.

Summary of changes:
 .../src/test/resources/sql-tests/analyzer-results/selectExcept.sql.out  | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)





(spark) branch master updated: [SPARK-46584][SQL][TESTS] Remove invalid attachCleanupResourceChecker in JoinSuite

2024-01-03 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7b6077a02fc3 [SPARK-46584][SQL][TESTS] Remove invalid 
attachCleanupResourceChecker in JoinSuite
7b6077a02fc3 is described below

commit 7b6077a02fc3e619465fb21511ea16e71e6d4c7e
Author: zml1206 
AuthorDate: Thu Jan 4 10:32:46 2024 +0800

[SPARK-46584][SQL][TESTS] Remove invalid attachCleanupResourceChecker in 
JoinSuite

### What changes were proposed in this pull request?
Remove `attachCleanupResourceChecker` in `JoinSuite`.

### Why are the changes needed?
`attachCleanupResourceChecker` is ineffective:
1. The matching of `SortExec` needs to happen on `QueryExecution.executedPlan`,
not `QueryExecution.sparkPlan`; the correct form is
`df.queryExecution.executedPlan.foreachUp { f() }` (see the sketch after this list).
2. `Mockito` only counts calls made on the object returned by `spy`; calls on the
original object are not counted. For example:
```
test("spy only counts calls on the spied instance") {
  val data = new java.util.ArrayList[String]()
  val _data = spy(data)
  data.add("a")
  data.add("b")
  data.add("b")
  _data.add("b")
  verify(_data, times(0)).add("a")
  verify(_data, times(1)).add("b")
}
```
Therefore, even when matching on `df.queryExecution.executedPlan` correctly, the
count is always 0.
3. Not all `SortMergeJoin` join types trigger `cleanupResources()`, for example
'full outer join'.
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Local test: after changing `attachCleanupResourceChecker` from `atLeastOnce` to `never`, the unit test still succeeds.
```
verify(sortExec, atLeastOnce).cleanupResources()
verify(sortExec.rowSorter, atLeastOnce).cleanupResources()
```
to
```
verify(sortExec, never).cleanupResources()
verify(sortExec.rowSorter, never).cleanupResources()
```
### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44573 from zml1206/SPARK-21492.

Authored-by: zml1206 
Signed-off-by: Kent Yao 
---
 .../test/scala/org/apache/spark/sql/JoinSuite.scala   | 19 ---
 1 file changed, 19 deletions(-)

diff --git a/sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
index 909a05ce26f7..f31f60e8df56 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala
@@ -22,8 +22,6 @@ import java.util.Locale
 import scala.collection.mutable.ListBuffer
 import scala.jdk.CollectionConverters._
 
-import org.mockito.Mockito._
-
 import org.apache.spark.TestUtils.{assertNotSpilled, assertSpilled}
 import org.apache.spark.sql.catalyst.TableIdentifier
 import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
@@ -44,23 +42,6 @@ import org.apache.spark.tags.SlowSQLTest
 class JoinSuite extends QueryTest with SharedSparkSession with 
AdaptiveSparkPlanHelper {
   import testImplicits._
 
-  private def attachCleanupResourceChecker(plan: SparkPlan): Unit = {
-// SPARK-21492: Check cleanupResources are finally triggered in SortExec 
node for every
-// test case
-plan.foreachUp {
-  case s: SortExec =>
-val sortExec = spy[SortExec](s)
-verify(sortExec, atLeastOnce).cleanupResources()
-verify(sortExec.rowSorter, atLeastOnce).cleanupResources()
-  case _ =>
-}
-  }
-
-  override protected def checkAnswer(df: => DataFrame, rows: Seq[Row]): Unit = 
{
-attachCleanupResourceChecker(df.queryExecution.sparkPlan)
-super.checkAnswer(df, rows)
-  }
-
   setupTestData()
 
   def statisticSizeInByte(df: DataFrame): BigInt = {





(spark) branch master updated: [SPARK-46582][R][INFRA] Upgrade R Tools version from 4.0.2 to 4.3.2 in AppVeyor

2024-01-03 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new dbdeecbc3ffc [SPARK-46582][R][INFRA] Upgrade R Tools version from 
4.0.2 to 4.3.2 in AppVeyor
dbdeecbc3ffc is described below

commit dbdeecbc3ffc2a048ba720a688e1e6bfff4e8b4b
Author: Hyukjin Kwon 
AuthorDate: Thu Jan 4 11:09:33 2024 +0900

[SPARK-46582][R][INFRA] Upgrade R Tools version from 4.0.2 to 4.3.2 in 
AppVeyor

### What changes were proposed in this pull request?

This PR proposes to upgrade R Tools version from 4.0.2 to 4.3.2 in AppVeyor

### Why are the changes needed?

R Tools 4.3.x is meant for R 4.3.x. We had not upgraded previously because of a test failure.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Checking the CI in this PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #44584 from HyukjinKwon/r-tools-ver.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 dev/appveyor-install-dependencies.ps1 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/dev/appveyor-install-dependencies.ps1 
b/dev/appveyor-install-dependencies.ps1
index b37f1ee45f30..a3a440ef83f2 100644
--- a/dev/appveyor-install-dependencies.ps1
+++ b/dev/appveyor-install-dependencies.ps1
@@ -141,7 +141,7 @@ Pop-Location
 
 # == R
 $rVer = "4.3.2"
-$rToolsVer = "4.0.2"
+$rToolsVer = "4.3.2"
 
 InstallR
 InstallRtools





(spark) branch master updated (5c10fb3e509a -> 893e69172560)

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 5c10fb3e509a [SPARK-44556][SQL] Reuse `OrcTail` when enable 
vectorizedReader
 add 893e69172560 [SPARK-46580][TESTS] Regenerate benchmark results

No new revisions were added by this update.

Summary of changes:
 .../benchmarks/AvroReadBenchmark-jdk21-results.txt | 110 +--
 .../avro/benchmarks/AvroReadBenchmark-results.txt  | 110 +--
 .../AvroWriteBenchmark-jdk21-results.txt   |  20 +-
 .../avro/benchmarks/AvroWriteBenchmark-results.txt |  20 +-
 .../CoalescedRDDBenchmark-jdk21-results.txt|  64 +-
 core/benchmarks/CoalescedRDDBenchmark-results.txt  |  64 +-
 core/benchmarks/KryoBenchmark-jdk21-results.txt|  40 +-
 core/benchmarks/KryoBenchmark-results.txt  |  40 +-
 .../KryoIteratorBenchmark-jdk21-results.txt|  40 +-
 core/benchmarks/KryoIteratorBenchmark-results.txt  |  40 +-
 .../KryoSerializerBenchmark-jdk21-results.txt  |   8 +-
 .../benchmarks/KryoSerializerBenchmark-results.txt |   8 +-
 .../MapStatusesConvertBenchmark-jdk21-results.txt  |   6 +-
 .../MapStatusesConvertBenchmark-results.txt|   6 +-
 .../MapStatusesSerDeserBenchmark-jdk21-results.txt |  52 +-
 .../MapStatusesSerDeserBenchmark-results.txt   |  50 +-
 .../PersistenceEngineBenchmark-jdk21-results.txt   |  32 +-
 .../PersistenceEngineBenchmark-results.txt |  32 +-
 .../PropertiesCloneBenchmark-jdk21-results.txt |  40 +-
 .../PropertiesCloneBenchmark-results.txt   |  40 +-
 .../XORShiftRandomBenchmark-jdk21-results.txt  |  38 +-
 .../benchmarks/XORShiftRandomBenchmark-results.txt |  38 +-
 .../ZStandardBenchmark-jdk21-results.txt   |  48 +-
 core/benchmarks/ZStandardBenchmark-results.txt |  48 +-
 .../benchmarks/BLASBenchmark-jdk21-results.txt | 208 ++---
 mllib-local/benchmarks/BLASBenchmark-results.txt   | 208 ++---
 .../UDTSerializationBenchmark-jdk21-results.txt|   8 +-
 .../UDTSerializationBenchmark-results.txt  |   8 +-
 .../CalendarIntervalBenchmark-jdk21-results.txt|   6 +-
 .../CalendarIntervalBenchmark-results.txt  |   6 +-
 .../EnumTypeSetBenchmark-jdk21-results.txt | 120 +--
 .../benchmarks/EnumTypeSetBenchmark-results.txt| 120 +--
 .../GenericArrayDataBenchmark-jdk21-results.txt|  14 +-
 .../GenericArrayDataBenchmark-results.txt  |  14 +-
 .../benchmarks/HashBenchmark-jdk21-results.txt |  60 +-
 sql/catalyst/benchmarks/HashBenchmark-results.txt  |  60 +-
 .../HashByteArrayBenchmark-jdk21-results.txt   |  90 +-
 .../benchmarks/HashByteArrayBenchmark-results.txt  |  90 +-
 .../UnsafeProjectionBenchmark-jdk21-results.txt|  12 +-
 .../UnsafeProjectionBenchmark-results.txt  |  12 +-
 .../AggregateBenchmark-jdk21-results.txt   | 130 +--
 sql/core/benchmarks/AggregateBenchmark-results.txt | 130 +--
 .../AnsiIntervalSortBenchmark-jdk21-results.txt|  32 +-
 .../AnsiIntervalSortBenchmark-results.txt  |  32 +-
 .../benchmarks/Base64Benchmark-jdk21-results.txt   |  64 +-
 sql/core/benchmarks/Base64Benchmark-results.txt|  64 +-
 .../BloomFilterBenchmark-jdk21-results.txt | 128 +--
 .../benchmarks/BloomFilterBenchmark-results.txt| 128 +--
 ...iltInDataSourceWriteBenchmark-jdk21-results.txt |  70 +-
 .../BuiltInDataSourceWriteBenchmark-results.txt|  70 +-
 .../ByteArrayBenchmark-jdk21-results.txt   |  22 +-
 sql/core/benchmarks/ByteArrayBenchmark-results.txt |  22 +-
 sql/core/benchmarks/CSVBenchmark-jdk21-results.txt |  94 +--
 sql/core/benchmarks/CSVBenchmark-results.txt   |  94 +--
 .../CharVarcharBenchmark-jdk21-results.txt | 140 ++--
 .../benchmarks/CharVarcharBenchmark-results.txt| 140 ++--
 .../ColumnarBatchBenchmark-jdk21-results.txt   |  54 +-
 .../benchmarks/ColumnarBatchBenchmark-results.txt  |  54 +-
 .../CompressionSchemeBenchmark-jdk21-results.txt   | 168 ++--
 .../CompressionSchemeBenchmark-results.txt | 168 ++--
 ...ConstantColumnVectorBenchmark-jdk21-results.txt | 350 
 .../ConstantColumnVectorBenchmark-results.txt  | 350 
 .../DataSourceReadBenchmark-jdk21-results.txt  | 634 +++---
 .../benchmarks/DataSourceReadBenchmark-results.txt | 634 +++---
 .../benchmarks/DatasetBenchmark-jdk21-results.txt  |  52 +-
 sql/core/benchmarks/DatasetBenchmark-results.txt   |  52 +-
 .../benchmarks/DateTimeBenchmark-jdk21-results.txt | 482 +--
 sql/core/benchmarks/DateTimeBenchmark-results.txt  | 482 +--
 .../DateTimeRebaseBenchmark-jdk21-results.txt  | 230 +++---
 .../benchmarks/DateTimeRebaseBenchmark-results.txt | 230 +++---
 ...ndOnlyUnsafeRowArrayBenchmark-jdk21-results.txt |  40 +-
 ...alAppendOnlyUnsafeRowArrayBenchmark-results.txt |  40 +-
 .../benchmarks/ExtractBenchmark-jdk21-results.txt  | 172 ++--
 sql/core/benchmarks/ExtractBenchmark-results.txt   | 172 ++

(spark) branch master updated (85b44ccef4c4 -> 5c10fb3e509a)

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 85b44ccef4c4 [SPARK-46546][DOCS] Fix the formatting of tables in 
`running-on-yarn` pages
 add 5c10fb3e509a [SPARK-44556][SQL] Reuse `OrcTail` when enable 
vectorizedReader

No new revisions were added by this update.

Summary of changes:
 .../datasources/orc/OrcColumnarBatchReader.java| 11 ++-
 .../sql/execution/datasources/orc/OrcFileFormat.scala  |  2 +-
 .../datasources/v2/orc/OrcPartitionReaderFactory.scala | 18 ++
 3 files changed, 21 insertions(+), 10 deletions(-)





(spark) branch branch-3.5 updated: [SPARK-46546][DOCS] Fix the formatting of tables in `running-on-yarn` pages

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new fb90ade2c739 [SPARK-46546][DOCS] Fix the formatting of tables in 
`running-on-yarn` pages
fb90ade2c739 is described below

commit fb90ade2c7390077d2755fc43b73e63f5cf44f21
Author: panbingkun 
AuthorDate: Wed Jan 3 12:07:15 2024 -0800

[SPARK-46546][DOCS] Fix the formatting of tables in `running-on-yarn` pages

### What changes were proposed in this pull request?
The PR aims to fix the formatting of tables in the `running-on-yarn` pages.

### Why are the changes needed?
Make the tables on the page display normally.
Before:
https://github.com/apache/spark/assets/15246973/26facec4-d805-4549-a640-120c499bd7fd

After:
https://github.com/apache/spark/assets/15246973/cf6c20ef-a4ce-4532-9acd-ab9cec41881a

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually check.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44540 from panbingkun/SPARK-46546.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 85b44ccef4c4aeec302c12e03833590c7d8d6b9e)
Signed-off-by: Dongjoon Hyun 
---
 docs/running-on-yarn.md | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 9b4e59a119ee..ce7121b806cb 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -866,7 +866,7 @@ to avoid garbage collection issues during shuffle.
 The following extra configuration options are available when the shuffle 
service is running on YARN:
 
 
-Property NameDefaultMeaning
+Property NameDefaultMeaningSince 
Version
 
   spark.yarn.shuffle.stopOnFailure
   false
@@ -875,6 +875,7 @@ The following extra configuration options are available 
when the shuffle service
 initialization. This prevents application failures caused by running 
containers on
 NodeManagers where the Spark Shuffle Service is not running.
   
+  2.1.0
 
 
   spark.yarn.shuffle.service.metrics.namespace
@@ -883,6 +884,7 @@ The following extra configuration options are available 
when the shuffle service
 The namespace to use when emitting shuffle service metrics into Hadoop 
metrics2 system of the
 NodeManager.
   
+  3.2.0
 
 
   spark.yarn.shuffle.service.logs.namespace
@@ -894,6 +896,7 @@ The following extra configuration options are available 
when the shuffle service
 may expect the logger name to look like a class name, it's generally 
recommended to provide a value which
 would be a valid Java package or class name and not include spaces.
   
+  3.3.0
 
 
   spark.shuffle.service.db.backend





(spark) branch master updated: [SPARK-46546][DOCS] Fix the formatting of tables in `running-on-yarn` pages

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 85b44ccef4c4 [SPARK-46546][DOCS] Fix the formatting of tables in 
`running-on-yarn` pages
85b44ccef4c4 is described below

commit 85b44ccef4c4aeec302c12e03833590c7d8d6b9e
Author: panbingkun 
AuthorDate: Wed Jan 3 12:07:15 2024 -0800

[SPARK-46546][DOCS] Fix the formatting of tables in `running-on-yarn` pages

### What changes were proposed in this pull request?
The PR aims to fix the formatting of tables in the `running-on-yarn` pages.

### Why are the changes needed?
Make the tables on the page display normally.
Before:
https://github.com/apache/spark/assets/15246973/26facec4-d805-4549-a640-120c499bd7fd

After:
https://github.com/apache/spark/assets/15246973/cf6c20ef-a4ce-4532-9acd-ab9cec41881a

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually check.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44540 from panbingkun/SPARK-46546.

Authored-by: panbingkun 
Signed-off-by: Dongjoon Hyun 
---
 docs/running-on-yarn.md | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 3dfa63e1cb2e..02547b30d2e5 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -866,7 +866,7 @@ to avoid garbage collection issues during shuffle.
 The following extra configuration options are available when the shuffle 
service is running on YARN:
 
 
-Property NameDefaultMeaning
+Property NameDefaultMeaningSince 
Version
 
   spark.yarn.shuffle.stopOnFailure
   false
@@ -875,6 +875,7 @@ The following extra configuration options are available 
when the shuffle service
 initialization. This prevents application failures caused by running 
containers on
 NodeManagers where the Spark Shuffle Service is not running.
   
+  2.1.0
 
 
   spark.yarn.shuffle.service.metrics.namespace
@@ -883,6 +884,7 @@ The following extra configuration options are available 
when the shuffle service
 The namespace to use when emitting shuffle service metrics into Hadoop 
metrics2 system of the
 NodeManager.
   
+  3.2.0
 
 
   spark.yarn.shuffle.service.logs.namespace
@@ -894,6 +896,7 @@ The following extra configuration options are available 
when the shuffle service
 may expect the logger name to look like a class name, it's generally 
recommended to provide a value which
 would be a valid Java package or class name and not include spaces.
   
+  3.3.0
 
 
   spark.shuffle.service.db.backend





(spark) branch master updated: [SPARK-46579][SQL] Redact JDBC url in errors and logs

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 49f94eb6e88a [SPARK-46579][SQL] Redact JDBC url in errors and logs
49f94eb6e88a is described below

commit 49f94eb6e88a9e5aaff675fb53125ce6091529fa
Author: Max Gekk 
AuthorDate: Wed Jan 3 12:02:02 2024 -0800

[SPARK-46579][SQL] Redact JDBC url in errors and logs

### What changes were proposed in this pull request?
In the PR, I propose to redact the JDBC url in error message parameters and 
logs.

### Why are the changes needed?
To avoid leaking of user's secrets.

### Does this PR introduce _any_ user-facing change?
Yes, it can.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "test:testOnly *JDBCTableCatalogSuite"
```

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44574 from MaxGekk/redact-jdbc-url.

Authored-by: Max Gekk 
Signed-off-by: Dongjoon Hyun 
---
 .../execution/datasources/jdbc/JDBCOptions.scala   |  3 +++
 .../jdbc/connection/BasicConnectionProvider.scala  |  3 ++-
 .../execution/datasources/v2/jdbc/JDBCTable.scala  |  4 ++--
 .../datasources/v2/jdbc/JDBCTableCatalog.scala | 22 +++---
 .../org/apache/spark/sql/jdbc/JdbcDialects.scala   |  2 +-
 .../v2/jdbc/JDBCTableCatalogSuite.scala| 15 +++
 6 files changed, 30 insertions(+), 19 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
index 28fa7b8bf561..43db0c6eef11 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala
@@ -28,6 +28,7 @@ import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
 import org.apache.spark.sql.errors.QueryExecutionErrors
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types.TimestampNTZType
+import org.apache.spark.util.Utils
 
 /**
  * Options for the JDBC data source.
@@ -248,6 +249,8 @@ class JDBCOptions(
   otherOption.parameters.equals(this.parameters)
 case _ => false
   }
+
+  def getRedactUrl(): String = 
Utils.redact(SQLConf.get.stringRedactionPattern, url)
 }
 
 class JdbcOptionsInWrite(
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/BasicConnectionProvider.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/BasicConnectionProvider.scala
index 369cf59e0599..57902336ebf2 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/BasicConnectionProvider.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/connection/BasicConnectionProvider.scala
@@ -45,7 +45,8 @@ private[jdbc] class BasicConnectionProvider extends 
JdbcConnectionProvider with
 jdbcOptions.asConnectionProperties.asScala.foreach { case(k, v) =>
   properties.put(k, v)
 }
-logDebug(s"JDBC connection initiated with URL: ${jdbcOptions.url} and 
properties: $properties")
+logDebug(s"JDBC connection initiated with URL: 
${jdbcOptions.getRedactUrl()} " +
+  s"and properties: $properties")
 driver.connect(jdbcOptions.url, properties)
   }
 
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
index c251010881f3..120a68075a8f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTable.scala
@@ -66,7 +66,7 @@ case class JDBCTable(ident: Identifier, schema: StructType, 
jdbcOptions: JDBCOpt
   JdbcUtils.classifyException(
 errorClass = "FAILED_JDBC.CREATE_INDEX",
 messageParameters = Map(
-  "url" -> jdbcOptions.url,
+  "url" -> jdbcOptions.getRedactUrl(),
   "indexName" -> toSQLId(indexName),
   "tableName" -> toSQLId(name)),
 dialect = JdbcDialects.get(jdbcOptions.url)) {
@@ -87,7 +87,7 @@ case class JDBCTable(ident: Identifier, schema: StructType, 
jdbcOptions: JDBCOpt
   JdbcUtils.classifyException(
 errorClass = "FAILED_JDBC.DROP_INDEX",
 messageParameters = Map(
-  "url" -> jdbcOptions.url,
+  "url" -> jdbcOptions.getRedactUrl(),
   "indexName" -> toSQLId(indexName),
   "tableName" -> toSQLId(name)),
 dialect = JdbcDialects.get(jdbcOptions.url)) {
diff --git 
a/sql/core/s
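
A hedged illustration of the redaction helper added to `JDBCOptions` above; the regex value stands in for `spark.sql.redaction.string.regex` and is an assumption for the example, as is the JDBC URL:

```
import org.apache.spark.util.Utils

// Assume a redaction pattern that masks password key-value pairs in strings.
val pattern = Some("(?i)password=[^&;]*".r)
val url = "jdbc:postgresql://db.example.com:5432/sales?user=app&password=s3cr3t"

// Utils.redact replaces every match with Spark's redaction placeholder, so the
// secret never reaches error-message parameters or debug logs.
println(Utils.redact(pattern, url))
// e.g. jdbc:postgresql://db.example.com:5432/sales?user=app&*********(redacted)
```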

(spark) branch master updated: [SPARK-46539][SQL] SELECT * EXCEPT(all fields from a struct) results in an assertion failure

2024-01-03 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9c46d9dcd195 [SPARK-46539][SQL] SELECT * EXCEPT(all fields from a 
struct) results in an assertion failure
9c46d9dcd195 is described below

commit 9c46d9dcd19551dbdef546adec73d5799364ab0b
Author: Stefan Kandic 
AuthorDate: Wed Jan 3 21:52:37 2024 +0300

[SPARK-46539][SQL] SELECT * EXCEPT(all fields from a struct) results in an 
assertion failure

### What changes were proposed in this pull request?

Fixing the assertion error which occurs when we do SELECT .. EXCEPT(every 
field from a struct) by adding a check for an empty struct

### Why are the changes needed?

Because this is a valid query that should just return an empty struct 
rather than fail during serialization.

### Does this PR introduce _any_ user-facing change?

Yes, users should no longer see this error and instead get an empty struct 
'{ }'

### How was this patch tested?

By adding new UT to existing selectExcept tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44527 from stefankandic/select-except-err.

Authored-by: Stefan Kandic 
Signed-off-by: Max Gekk 
---
 .../spark/sql/catalyst/encoders/ExpressionEncoder.scala| 12 ++--
 .../sql-tests/analyzer-results/selectExcept.sql.out| 12 
 .../src/test/resources/sql-tests/inputs/selectExcept.sql   |  1 +
 .../test/resources/sql-tests/results/selectExcept.sql.out  | 14 ++
 4 files changed, 37 insertions(+), 2 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
index 74d7a5e7a675..654f39393636 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala
@@ -325,11 +325,19 @@ case class ExpressionEncoder[T](
   assert(serializer.forall(_.references.isEmpty), "serializer cannot reference 
any attributes.")
   assert(serializer.flatMap { ser =>
 val boundRefs = ser.collect { case b: BoundReference => b }
-assert(boundRefs.nonEmpty,
-  "each serializer expression should contain at least one 
`BoundReference`")
+assert(boundRefs.nonEmpty || isEmptyStruct(ser),
+  "each serializer expression should contain at least one `BoundReference` 
or it " +
+  "should be an empty struct. This is required to ensure that there is a 
reference point " +
+  "for the serialized object or that the serialized object is 
intentionally left empty."
+)
 boundRefs
   }.distinct.length <= 1, "all serializer expressions must use the same 
BoundReference.")
 
+  private def isEmptyStruct(expr: NamedExpression): Boolean = expr.dataType 
match {
+case struct: StructType => struct.isEmpty
+case _ => false
+  }
+
   /**
* Returns a new copy of this encoder, where the `deserializer` is resolved 
and bound to the
* given schema.
diff --git 
a/sql/core/src/test/resources/sql-tests/analyzer-results/selectExcept.sql.out 
b/sql/core/src/test/resources/sql-tests/analyzer-results/selectExcept.sql.out
index 3b8594d832c6..49ea7ed4edcf 100644
--- 
a/sql/core/src/test/resources/sql-tests/analyzer-results/selectExcept.sql.out
+++ 
b/sql/core/src/test/resources/sql-tests/analyzer-results/selectExcept.sql.out
@@ -121,6 +121,18 @@ Project [id#x, name#x, named_struct(f1, data#x.f1, s2, 
named_struct(f3, data#x.s
+- LocalRelation [id#x, name#x, data#x]
 
 
+-- !query
+SELECT * EXCEPT (data.f1, data.s2) FROM tbl_view
+-- !query analysis
+Project [id#x, name#x, named_struct() AS data#x]
++- SubqueryAlias tbl_view
+   +- View (`tbl_view`, [id#x,name#x,data#x])
+  +- Project [cast(id#x as int) AS id#x, cast(name#x as string) AS name#x, 
cast(data#x as struct>) AS data#x]
+ +- Project [id#x, name#x, data#x]
++- SubqueryAlias tbl_view
+   +- LocalRelation [id#x, name#x, data#x]
+
+
 -- !query
 SELECT * EXCEPT (id, name, data) FROM tbl_view
 -- !query analysis
diff --git a/sql/core/src/test/resources/sql-tests/inputs/selectExcept.sql 
b/sql/core/src/test/resources/sql-tests/inputs/selectExcept.sql
index e07e4f1117c2..08d56aeda0a8 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/selectExcept.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/selectExcept.sql
@@ -20,6 +20,7 @@ SELECT * EXCEPT (data) FROM tbl_view;
 SELECT * EXCEPT (data.f1) FROM tbl_view;
 SELECT * EXCEPT (data.s2) FROM tbl_view;
 SELECT * EXCEPT (data.s2.f2) FROM tbl_view;
+SELECT * EXCEPT (data.f1, data.s2) FROM tbl_view;
 -- EXCEPT all columns
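
A hedged, self-contained way to reproduce the fixed case in a Spark shell; the view definition below is modeled on the test data and is an assumption, not copied from the patch:

```
// A view whose 'data' column is a struct with exactly the fields f1 and s2.
spark.sql(
  """CREATE OR REPLACE TEMP VIEW tbl_view AS
    |SELECT 1 AS id, 'a' AS name,
    |       named_struct('f1', 1, 's2', named_struct('f2', 2, 'f3', 3)) AS data""".stripMargin)

// Excluding every field of the struct used to trip the ExpressionEncoder assertion;
// with this fix, 'data' comes back as an empty struct instead.
spark.sql("SELECT * EXCEPT (data.f1, data.s2) FROM tbl_view").show()
```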

(spark) branch master updated (605fecd22cc1 -> 06f9e7419966)

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 605fecd22cc1 [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite 
leaks hive's SessionState
 add 06f9e7419966 [SPARK-46550][BUILD][SQL] Upgrade `datasketches-java` to 
5.0.1

No new revisions were added by this update.

Summary of changes:
 dev/deps/spark-deps-hadoop-3-hive-2.3  | 4 ++--
 pom.xml| 2 +-
 .../catalyst/expressions/aggregate/datasketchesAggregates.scala| 3 ++-
 .../spark/sql/catalyst/expressions/datasketchesExpressions.scala   | 7 ---
 4 files changed, 9 insertions(+), 7 deletions(-)





(spark) branch branch-3.4 updated: [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite leaks hive's SessionState

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new 2eb603c09fb5 [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite 
leaks hive's SessionState
2eb603c09fb5 is described below

commit 2eb603c09fb5e81ae24f4e43a17fa45fb071c358
Author: Kent Yao 
AuthorDate: Wed Jan 3 05:54:57 2024 -0800

[SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite leaks hive's 
SessionState

### What changes were proposed in this pull request?

The upcoming tests with the new hive configurations will have no effect due 
to the leaked SessionState.

```
06:21:12.848 pool-1-thread-1 INFO ThriftServerWithSparkContextInHttpSuite: 
Trying to start HiveThriftServer2: mode=http, attempt=0

06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager 
is inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager 
is inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is 
inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: 
Service:ThriftBinaryCLIService is inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: HiveServer2 is 
inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager 
is started.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager 
is started.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is 
started.
06:21:12.852 pool-1-thread-1 INFO AbstractService: 
Service:ThriftBinaryCLIService is started.
06:21:12.852 pool-1-thread-1 INFO ThriftCLIService: Starting 
ThriftBinaryCLIService on port 1 with 5...500 worker threads
06:21:12.852 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is 
started.
```

As the logs above reveal, ThriftServerWithSparkContextInHttpSuite started 
the ThriftBinaryCLIService instead of the ThriftHttpCLIService. This is because, 
in HiveClientImpl, the new configurations are only applied to the Hive conf during 
initialization, not to an existing one.

This causes ThriftServerWithSparkContextInHttpSuite to retry or even abort.

### Why are the changes needed?

Fix flakiness in tests

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

Ran tests locally with the hive-thriftserver module.
### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44578 from yaooqinn/SPARK-46577.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 605fecd22cc18fc9b93fb26d4aa6088f5a314f92)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala   | 6 ++
 1 file changed, 6 insertions(+)

diff --git 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
index b8739ce56e41..cb85993e5e09 100644
--- 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
+++ 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
@@ -17,6 +17,8 @@
 
 package org.apache.spark.sql.hive
 
+import org.apache.hadoop.hive.ql.metadata.Hive
+import org.apache.hadoop.hive.ql.session.SessionState
 import org.apache.logging.log4j.LogManager
 import org.apache.logging.log4j.core.Logger
 
@@ -69,6 +71,10 @@ class HiveMetastoreLazyInitializationSuite extends 
SparkFunSuite {
 } finally {
   Thread.currentThread().setContextClassLoader(originalClassLoader)
   spark.sparkContext.setLogLevel(originalLevel.toString)
+  SparkSession.clearActiveSession()
+  SparkSession.clearDefaultSession()
+  SessionState.detachSession()
+  Hive.closeCurrent()
   spark.stop()
 }
   }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch branch-3.5 updated: [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite leaks hive's SessionState

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 2891d92e9d8a [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite 
leaks hive's SessionState
2891d92e9d8a is described below

commit 2891d92e9d8a5050f457bb116530d46de3babf97
Author: Kent Yao 
AuthorDate: Wed Jan 3 05:54:57 2024 -0800

[SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite leaks hive's 
SessionState

### What changes were proposed in this pull request?

The upcoming tests that set new Hive configurations will have no effect due 
to the leaked SessionState.

```
06:21:12.848 pool-1-thread-1 INFO ThriftServerWithSparkContextInHttpSuite: 
Trying to start HiveThriftServer2: mode=http, attempt=0

06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager 
is inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager 
is inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is 
inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: 
Service:ThriftBinaryCLIService is inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: HiveServer2 is 
inited.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:OperationManager 
is started.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service:SessionManager 
is started.
06:21:12.851 pool-1-thread-1 INFO AbstractService: Service: CLIService is 
started.
06:21:12.852 pool-1-thread-1 INFO AbstractService: 
Service:ThriftBinaryCLIService is started.
06:21:12.852 pool-1-thread-1 INFO ThriftCLIService: Starting 
ThriftBinaryCLIService on port 1 with 5...500 worker threads
06:21:12.852 pool-1-thread-1 INFO AbstractService: Service:HiveServer2 is 
started.
```

As the logs above reveal, ThriftServerWithSparkContextInHttpSuite started 
the ThriftBinaryCLIService instead of the ThriftHttpCLIService. This is because, 
in HiveClientImpl, new configurations are only applied to the Hive conf during 
initialization, not to an already existing SessionState.

This causes ThriftServerWithSparkContextInHttpSuite to retry or even 
abort.

### Why are the changes needed?

Fix flakiness in tests

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Ran tests locally with the hive-thriftserver module.

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #44578 from yaooqinn/SPARK-46577.

Authored-by: Kent Yao 
Signed-off-by: Dongjoon Hyun 
(cherry picked from commit 605fecd22cc18fc9b93fb26d4aa6088f5a314f92)
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala   | 6 ++
 1 file changed, 6 insertions(+)

diff --git 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
index b8739ce56e41..cb85993e5e09 100644
--- 
a/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
+++ 
b/sql/hive-thriftserver/src/test/scala/org/apache/spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala
@@ -17,6 +17,8 @@
 
 package org.apache.spark.sql.hive
 
+import org.apache.hadoop.hive.ql.metadata.Hive
+import org.apache.hadoop.hive.ql.session.SessionState
 import org.apache.logging.log4j.LogManager
 import org.apache.logging.log4j.core.Logger
 
@@ -69,6 +71,10 @@ class HiveMetastoreLazyInitializationSuite extends 
SparkFunSuite {
 } finally {
   Thread.currentThread().setContextClassLoader(originalClassLoader)
   spark.sparkContext.setLogLevel(originalLevel.toString)
+  SparkSession.clearActiveSession()
+  SparkSession.clearDefaultSession()
+  SessionState.detachSession()
+  Hive.closeCurrent()
   spark.stop()
 }
   }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated (3b1d843da2de -> 605fecd22cc1)

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3b1d843da2de [SPARK-46567][CORE] Remove ThreadLocal for 
ReadAheadInputStream
 add 605fecd22cc1 [SPARK-46577][SQL] HiveMetastoreLazyInitializationSuite 
leaks hive's SessionState

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/hive/HiveMetastoreLazyInitializationSuite.scala   | 6 ++
 1 file changed, 6 insertions(+)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46567][CORE] Remove ThreadLocal for ReadAheadInputStream

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3b1d843da2de [SPARK-46567][CORE] Remove ThreadLocal for 
ReadAheadInputStream
3b1d843da2de is described below

commit 3b1d843da2de524c781757b4823cc8b8e7d2f5f7
Author: beliefer 
AuthorDate: Wed Jan 3 05:50:44 2024 -0800

[SPARK-46567][CORE] Remove ThreadLocal for ReadAheadInputStream

### What changes were proposed in this pull request?
This PR propose to remove `ThreadLocal` for `ReadAheadInputStream`.

### Why are the changes needed?
`ReadAheadInputStream` has a field `oneByte` declared as a `ThreadLocal`.
In fact, `oneByte` is only used in `read()`.
We can remove it because a local variable allocated inside the instance method 
already provides the same thread-safety guarantee.

On the other hand, the `ThreadLocal` occupies a certain amount of space in 
the heap and incurs allocation and GC costs.
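
A minimal sketch of the pattern, using a hypothetical `SingleByteReader` rather 
than the actual `ReadAheadInputStream` code:

```
import java.io.InputStream

// Before (sketch): a shared per-thread buffer that every reading thread pays for.
// private val oneByte = ThreadLocal.withInitial[Array[Byte]](() => new Array[Byte](1))

// After (sketch): allocate the one-element buffer inside the method. The local
// variable is confined to the calling thread, so no ThreadLocal is needed.
class SingleByteReader(in: InputStream) {
  def readOne(): Int = {
    val oneByteArray = new Array[Byte](1)
    if (in.read(oneByteArray, 0, 1) == -1) -1 else oneByteArray(0) & 0xFF
  }
}
```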

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Exists test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44563 from beliefer/SPARK-46567.

Authored-by: beliefer 
Signed-off-by: Dongjoon Hyun 
---
 core/src/main/java/org/apache/spark/io/ReadAheadInputStream.java | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/core/src/main/java/org/apache/spark/io/ReadAheadInputStream.java 
b/core/src/main/java/org/apache/spark/io/ReadAheadInputStream.java
index 1b76aae8dd22..33dfa4422906 100644
--- a/core/src/main/java/org/apache/spark/io/ReadAheadInputStream.java
+++ b/core/src/main/java/org/apache/spark/io/ReadAheadInputStream.java
@@ -89,8 +89,6 @@ public class ReadAheadInputStream extends InputStream {
 
   private final Condition asyncReadComplete = stateChangeLock.newCondition();
 
-  private static final ThreadLocal<byte[]> oneByte = 
ThreadLocal.withInitial(() -> new byte[1]);
-
   /**
* Creates a ReadAheadInputStream with the specified buffer 
size and read-ahead
* threshold
@@ -247,7 +245,7 @@ public class ReadAheadInputStream extends InputStream {
   // short path - just get one byte.
   return activeBuffer.get() & 0xFF;
 } else {
-  byte[] oneByteArray = oneByte.get();
+  byte[] oneByteArray = new byte[1];
   return read(oneByteArray, 0, 1) == -1 ? -1 : oneByteArray[0] & 0xFF;
 }
   }


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46524][SQL] Improve error messages for invalid save mode

2024-01-03 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a3d999292f8e [SPARK-46524][SQL] Improve error messages for invalid 
save mode
a3d999292f8e is described below

commit a3d999292f8e99269dfd0289e2f5aca7e5ea4fae
Author: allisonwang-db 
AuthorDate: Wed Jan 3 15:43:53 2024 +0300

[SPARK-46524][SQL] Improve error messages for invalid save mode

### What changes were proposed in this pull request?

This PR improves the error messages when writing a data frame with an 
invalid save mode.

### Why are the changes needed?

To improve the error messages.
Before this PR, Spark throws a java.lang.IllegalArgumentException:
`java.lang.IllegalArgumentException: Unknown save mode: foo. Accepted save 
modes are 'overwrite', 'append', 'ignore', 'error', 'errorifexists', 'default'.`

After this PR, the error will have a proper error class:
`[INVALID_SAVE_MODE] The specified save mode "foo" is invalid. Valid save 
modes include "append", "overwrite", "ignore", "error", "errorifexists", and 
"default".`
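
A minimal sketch of how a user would hit this path, assuming a local 
SparkSession and an example output path; `"foo"` is the same invalid mode used 
in the message above:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").getOrCreate()
val df = spark.range(3).toDF("id")

// "foo" is not a valid save mode. After this change the failure surfaces as an
// AnalysisException with error class INVALID_SAVE_MODE instead of a plain
// IllegalArgumentException.
df.write.mode("foo").parquet("/tmp/invalid-save-mode-example")
```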

### Does this PR introduce _any_ user-facing change?

Yes. The error messages will be changed.

### How was this patch tested?

New unit test

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #44508 from allisonwang-db/spark-46524-invalid-save-mode.

Authored-by: allisonwang-db 
Signed-off-by: Max Gekk 
---
 R/pkg/tests/fulltests/test_sparkSQL.R| 2 +-
 common/utils/src/main/resources/error/error-classes.json | 6 ++
 docs/sql-error-conditions.md | 6 ++
 .../org/apache/spark/sql/errors/QueryCompilationErrors.scala | 7 +++
 .../src/main/scala/org/apache/spark/sql/DataFrameWriter.scala| 3 +--
 .../spark/sql/execution/python/PythonDataSourceSuite.scala   | 9 +
 6 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/R/pkg/tests/fulltests/test_sparkSQL.R 
b/R/pkg/tests/fulltests/test_sparkSQL.R
index 0d96f708a544..c1a5292195af 100644
--- a/R/pkg/tests/fulltests/test_sparkSQL.R
+++ b/R/pkg/tests/fulltests/test_sparkSQL.R
@@ -1414,7 +1414,7 @@ test_that("test HiveContext", {
 
 # Invalid mode
 expect_error(saveAsTable(df, "parquetest", "parquet", mode = "abc", path = 
parquetDataPath),
- "illegal argument - Unknown save mode: abc")
+ "Error in mode : analysis error - \\[INVALID_SAVE_MODE\\].*")
 unsetHiveContext()
   }
 })
diff --git a/common/utils/src/main/resources/error/error-classes.json 
b/common/utils/src/main/resources/error/error-classes.json
index 87e43fe0e38c..bcaf8a74c08d 100644
--- a/common/utils/src/main/resources/error/error-classes.json
+++ b/common/utils/src/main/resources/error/error-classes.json
@@ -2239,6 +2239,12 @@
 ],
 "sqlState" : "42613"
   },
+  "INVALID_SAVE_MODE" : {
+"message" : [
+  "The specified save mode  is invalid. Valid save modes include 
\"append\", \"overwrite\", \"ignore\", \"error\", \"errorifexists\", and 
\"default\"."
+],
+"sqlState" : "42000"
+  },
   "INVALID_SCHEMA" : {
 "message" : [
   "The input schema  is not a valid schema string."
diff --git a/docs/sql-error-conditions.md b/docs/sql-error-conditions.md
index 3f4074af9b78..c6108e97b4c5 100644
--- a/docs/sql-error-conditions.md
+++ b/docs/sql-error-conditions.md
@@ -1271,6 +1271,12 @@ For more details see 
[INVALID_PARTITION_OPERATION](sql-error-conditions-invalid-
 
 Parameterized query must either use positional, or named parameters, but not 
both.
 
+### INVALID_SAVE_MODE
+
+[SQLSTATE: 
42000](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation)
+
+The specified save mode `` is invalid. Valid save modes include 
"append", "overwrite", "ignore", "error", "errorifexists", and "default".
+
 ### [INVALID_SCHEMA](sql-error-conditions-invalid-schema-error-class.html)
 
 [SQLSTATE: 
42K07](sql-error-conditions-sqlstates.html#class-42-syntax-error-or-access-rule-violation)
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
index bc847d1c0069..b844ee2bdc45 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/errors/QueryCompilationErrors.scala
@@ -3184,6 +3184,13 @@ private[sql] object QueryCompilationErrors extends 
QueryErrorsBase with Compilat
 "config" -> SQLConf.LEGACY_PATH_OPTION_BEHAVIOR.key))
   }
 
+  def invalidSaveModeError(saveMode: String): Throwable = {
+new AnalysisException(
+  errorClass = "INVALID_SA

(spark-website) branch asf-site updated: docs: update examples page (#494)

2024-01-03 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new 67c90a03a7 docs: update examples page (#494)
67c90a03a7 is described below

commit 67c90a03a706fec13d6356d009ea19270391c4b1
Author: Matthew Powers 
AuthorDate: Wed Jan 3 04:49:55 2024 -0500

docs: update examples page (#494)

* docs: update examples page

* add examples html
---
 examples.md| 745 +
 site/examples.html | 576 +++--
 2 files changed, 617 insertions(+), 704 deletions(-)

diff --git a/examples.md b/examples.md
index d9362784d9..d29cd40bba 100644
--- a/examples.md
+++ b/examples.md
@@ -6,397 +6,364 @@ navigation:
   weight: 4
   show: true
 ---
-Apache Spark™ examples
-
-These examples give a quick overview of the Spark API.
-Spark is built on the concept of distributed datasets, which contain 
arbitrary Java or
-Python objects. You create a dataset from external data, then apply parallel 
operations
-to it. The building block of the Spark API is its [RDD 
API](https://spark.apache.org/docs/latest/rdd-programming-guide.html#resilient-distributed-datasets-rdds).
-In the RDD API,
-there are two types of operations: transformations, which define a 
new dataset based on previous ones,
-and actions, which kick off a job to execute on a cluster.
-On top of Spark’s RDD API, high level APIs are provided, e.g.
-[DataFrame 
API](https://spark.apache.org/docs/latest/sql-programming-guide.html#datasets-and-dataframes)
 and
-[Machine Learning API](https://spark.apache.org/docs/latest/mllib-guide.html).
-These high level APIs provide a concise way to conduct certain data operations.
-In this page, we will show examples using RDD API as well as examples using 
high level APIs.
-
-RDD API examples
-
-Word count
-In this example, we use a few transformations to build a dataset of 
(String, Int) pairs called counts and then save it to a file.
-
-
-  Python
-  Scala
-  Java
-
-
-
-
-
-{% highlight python %}
-text_file = sc.textFile("hdfs://...")
-counts = text_file.flatMap(lambda line: line.split(" ")) \
- .map(lambda word: (word, 1)) \
- .reduceByKey(lambda a, b: a + b)
-counts.saveAsTextFile("hdfs://...")
-{% endhighlight %}
-
-
-
-
-
-{% highlight scala %}
-val textFile = sc.textFile("hdfs://...")
-val counts = textFile.flatMap(line => line.split(" "))
- .map(word => (word, 1))
- .reduceByKey(_ + _)
-counts.saveAsTextFile("hdfs://...")
-{% endhighlight %}
-
-
-
-
-
-{% highlight java %}
-JavaRDD<String> textFile = sc.textFile("hdfs://...");
-JavaPairRDD<String, Integer> counts = textFile
-.flatMap(s -> Arrays.asList(s.split(" ")).iterator())
-.mapToPair(word -> new Tuple2<>(word, 1))
-.reduceByKey((a, b) -> a + b);
-counts.saveAsTextFile("hdfs://...");
-{% endhighlight %}
-
-
-
-
-Pi estimation
-Spark can also be used for compute-intensive tasks. This code estimates 
π by "throwing darts" 
at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see 
how many fall in the unit circle. The fraction should be π / 4, so we use this to 
get our estimate.
-
-
-  Python
-  Scala
-  Java
-
-
-
-
-
-{% highlight python %}
-def inside(p):
-x, y = random.random(), random.random()
-return x*x + y*y < 1
-
-count = sc.parallelize(range(0, NUM_SAMPLES)) \
- .filter(inside).count()
-print("Pi is roughly %f" % (4.0 * count / NUM_SAMPLES))
-{% endhighlight %}
-
-
-
-
-
-{% highlight scala %}
-val count = sc.parallelize(1 to NUM_SAMPLES).filter { _ =>
-  val x = math.random
-  val y = math.random
-  x*x + y*y < 1
-}.count()
-println(s"Pi is roughly ${4.0 * count / NUM_SAMPLES}")
-{% endhighlight %}
-
-
-
-
-
-{% highlight java %}
-List<Integer> l = new ArrayList<>(NUM_SAMPLES);
-for (int i = 0; i < NUM_SAMPLES; i++) {
-  l.add(i);
-}
-
-long count = sc.parallelize(l).filter(i -> {
-  double x = Math.random();
-  double y = Math.random();
-  return x*x + y*y < 1;
-}).count();
-System.out.println("Pi is roughly " + 4.0 * count / NUM_SAMPLES);
-{% endhighlight %}
-
-
-
-
-DataFrame API examples
-
-In Spark, a https://spark.apache.org/docs/latest/sql-programming-guide.html#dataframes";>DataFrame
-is a distributed collection of data organized into named columns.
-Users can use DataFrame API to perform various relational operations on both 
external
-data sources and Spark’s built-in distributed collections without providing 
specific procedures for processing data.
-Also, programs based on DataFrame API will be automatically optimized by 
Spark’s built-in optimizer, Catalyst.
-
-
-Text search
-In this example, we search through the error messages in a log file.
-
-
-  Python
-  Scala
-  Java
-
-
-
-
-
-{% highlight python %}
-textFile = sc.textFile("hdfs://...")
-
-# Creates a DataFrame hav

Re: [PR] docs: update examples page [spark-website]

2024-01-03 Thread via GitHub


zhengruifeng merged PR #494:
URL: https://github.com/apache/spark-website/pull/494


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46573][K8S] Use `appId` instead of `conf.appId` in `LoggingPodStatusWatcherImpl`

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 347c955fe723 [SPARK-46573][K8S] Use `appId` instead of `conf.appId` in 
`LoggingPodStatusWatcherImpl`
347c955fe723 is described below

commit 347c955fe7231eb2912c6678ea7769024f6dc5df
Author: yangjie01 
AuthorDate: Wed Jan 3 01:00:22 2024 -0800

[SPARK-46573][K8S] Use `appId` instead of `conf.appId` in 
`LoggingPodStatusWatcherImpl`

### What changes were proposed in this pull request?
This PR replaces the call to `conf.appId` with direct use of `appId` in 
`LoggingPodStatusWatcherImpl`, as it is already defined in 
`LoggingPodStatusWatcherImpl`:


https://github.com/apache/spark/blob/b74b1592c9ec07b3d29b6d4d900b1d3ba1417cd1/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala#L42

### Why are the changes needed?
We should use the already defined `val appId`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #44569 from LuciferYang/SPARK-46573.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala  | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
index bc8b023b5ecd..3227a72a8371 100644
--- 
a/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
+++ 
b/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/submit/LoggingPodStatusWatcher.scala
@@ -96,7 +96,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(conf: 
KubernetesDriverConf)
   }
 
   override def watchOrStop(sId: String): Boolean = {
-logInfo(s"Waiting for application ${conf.appName} with application ID 
${conf.appId} " +
+logInfo(s"Waiting for application ${conf.appName} with application ID 
$appId " +
   s"and submission ID $sId to finish...")
 val interval = conf.get(REPORT_INTERVAL)
 synchronized {
@@ -110,7 +110,7 @@ private[k8s] class LoggingPodStatusWatcherImpl(conf: 
KubernetesDriverConf)
   logInfo(
 pod.map { p => s"Container final 
statuses:\n\n${containersDescription(p)}" }
   .getOrElse("No containers were found in the driver pod."))
-  logInfo(s"Application ${conf.appName} with application ID ${conf.appId} 
" +
+  logInfo(s"Application ${conf.appName} with application ID $appId " +
 s"and submission ID $sId finished")
 }
 podCompleted


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



(spark) branch master updated: [SPARK-46525][DOCKER][TESTS] Fix docker-integration-tests on Apple Sillicon

2024-01-03 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9208c42b3a11 [SPARK-46525][DOCKER][TESTS] Fix docker-integration-tests 
on Apple Sillicon
9208c42b3a11 is described below

commit 9208c42b3a110099d1cc0249b6be364aacff0f2a
Author: Kent Yao 
AuthorDate: Wed Jan 3 00:58:11 2024 -0800

[SPARK-46525][DOCKER][TESTS] Fix docker-integration-tests on Apple Sillicon

### What changes were proposed in this pull request?

`com.spotify.docker.client` is not going to support Apple Silicon, as it 
has already been archived, and the 
[jnr-unixsocket](https://mvnrepository.com/artifact/com.github.jnr/jnr-unixsocket)
 0.18 it uses is not compatible with Apple Silicon.

If we run our docker IT tests on Apple Silicon, they will fail like

```java
[info] org.apache.spark.sql.jdbc.MariaDBKrbIntegrationSuite *** ABORTED *** 
(2 seconds, 264 milliseconds)
[info]   com.spotify.docker.client.exceptions.DockerException: 
java.util.concurrent.ExecutionException: 
com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.lang.UnsatisfiedLinkError: could not load FFI provider 
jnr.ffi.provider.jffi.Provider
[info]   at 
com.spotify.docker.client.DefaultDockerClient.propagate(DefaultDockerClient.java:2828)
[info]   at 
com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:2692)
[info]   at 
com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:574)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:124)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.beforeAll(DockerJDBCIntegrationSuite.scala:118)
[info]   at 
org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.super$beforeAll(DockerKrbJDBCIntegrationSuite.scala:65)
[info]   at 
org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerKrbJDBCIntegrationSuite.scala:65)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrationFunSuite.scala:49)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled$(DockerIntegrationFunSuite.scala:47)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.runIfTestsEnabled(DockerJDBCIntegrationSuite.scala:95)
[info]   at 
org.apache.spark.sql.jdbc.DockerKrbJDBCIntegrationSuite.beforeAll(DockerKrbJDBCIntegrationSuite.scala:44)
[info]   at 
org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at 
org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:69)
[info]   at 
org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321)
[info]   at 
org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:414)
[info]   at 
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
[info]   at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
[info]   at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
[info]   at java.base/java.lang.Thread.run(Thread.java:840)
[info]   Cause: java.util.concurrent.ExecutionException: 
com.spotify.docker.client.shaded.javax.ws.rs.ProcessingException: 
java.lang.UnsatisfiedLinkError: could not load FFI provider 
jnr.ffi.provider.jffi.Provider
[info]   at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
[info]   at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
[info]   at 
jersey.repackaged.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
[info]   at 
com.spotify.docker.client.DefaultDockerClient.request(DefaultDockerClient.java:2690)
[info]   at 
com.spotify.docker.client.DefaultDockerClient.ping(DefaultDockerClient.java:574)
[info]   at 
org.apache.spark.sql.jdbc.DockerJDBCIntegrationSuite.$anonfun$beforeAll$1(DockerJDBCIntegrationSuite.scala:124)
[info]   at 
org.apache.spark.sql.jdbc.DockerIntegrationFunSuite.runIfTestsEnabled(DockerIntegrat