(spark) branch master updated: [SPARK-45685][CORE][SQL] Use `LazyList` instead of `Stream`

2023-10-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b92265a98f2 [SPARK-45685][CORE][SQL] Use `LazyList` instead of `Stream`
b92265a98f2 is described below

commit b92265a98f241b333467a02f4fffc9889ad3e7da
Author: yangjie01 
AuthorDate: Sun Oct 29 22:43:54 2023 -0700

[SPARK-45685][CORE][SQL] Use `LazyList` instead of `Stream`

### What changes were proposed in this pull request?
This PR changes usages of `Stream` to `LazyList`, since `Stream` has been deprecated since Scala 2.13.0. A minimal before/after sketch follows the list of deprecated members below.

- `class Stream`

```scala
deprecated("Use LazyList (which is fully lazy) instead of Stream (which has 
a lazy tail only)", "2.13.0")
SerialVersionUID(3L)
sealed abstract class Stream[+A] extends AbstractSeq[A]
  with LinearSeq[A]
  with LinearSeqOps[A, Stream, Stream[A]]
  with IterableFactoryDefaults[A, Stream]
  with Serializable {
  ...
  deprecated("The `append` operation has been renamed `lazyAppendedAll`", 
"2.13.0")
  inline final def append[B >: A](rest: => IterableOnce[B]): Stream[B] = 
lazyAppendedAll(rest)
```
- `object Stream`

```scala
deprecated("Use LazyList (which is fully lazy) instead of Stream (which has 
a lazy tail only)", "2.13.0")
SerialVersionUID(3L)
object Stream extends SeqFactory[Stream] {
```
- `type Stream` and value Stream

```scala
deprecated("Use LazyList instead of Stream", "2.13.0")
type Stream[+A] = scala.collection.immutable.Stream[A]
deprecated("Use LazyList instead of Stream", "2.13.0")
val Stream = scala.collection.immutable.Stream
```
- method `toStream` in trait `IterableOnceOps`

```scala
deprecated("Use .to(LazyList) instead of .toStream", "2.13.0")
inline final def toStream: immutable.Stream[A] = to(immutable.Stream)
```
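
For illustration, a minimal before/after sketch of the migration pattern applied in this PR (the `words` and `n` values are made up; the real changes are in the diff below):

```scala
val words = Seq("hello", "world")
val n = 5

// Before (deprecated since Scala 2.13.0): Stream is lazy only in its tail.
// val data = Stream.continually(words.toStream).flatten.take(n).toArray

// After: LazyList is fully lazy, and `.to(LazyList)` replaces `.toStream`.
val data = LazyList.continually(words.to(LazyList)).flatten.take(n).toArray
```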

### Why are the changes needed?
Clean up deprecated Scala API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43563 from LuciferYang/stream-2-lazylist.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../input/WholeTextFileInputFormatSuite.scala  |  2 +-
 .../input/WholeTextFileRecordReaderSuite.scala |  2 +-
 .../spark/storage/FlatmapIteratorSuite.scala   |  6 ++--
 .../expressions/collectionOperations.scala |  2 +-
 .../plans/AliasAwareOutputExpression.scala |  4 +--
 .../spark/sql/catalyst/plans/QueryPlan.scala   |  2 +-
 .../apache/spark/sql/catalyst/trees/TreeNode.scala | 42 +++---
 .../sql/catalyst/plans/LogicalPlanSuite.scala  |  4 +--
 .../spark/sql/catalyst/trees/TreeNodeSuite.scala   | 36 +--
 .../spark/sql/RelationalGroupedDataset.scala   |  4 +--
 .../sql/execution/AliasAwareOutputExpression.scala |  2 +-
 .../sql/execution/WholeStageCodegenExec.scala  |  2 +-
 .../execution/joins/BroadcastHashJoinExec.scala|  2 +-
 .../apache/spark/sql/DataFrameAggregateSuite.scala |  2 +-
 .../spark/sql/DataFrameWindowFunctionsSuite.scala  |  2 +-
 .../scala/org/apache/spark/sql/GenTPCDSData.scala  | 13 ---
 .../apache/spark/sql/GeneratorFunctionSuite.scala  |  2 +-
 .../scala/org/apache/spark/sql/JoinSuite.scala |  4 +--
 .../apache/spark/sql/execution/PlannerSuite.scala  |  2 +-
 .../org/apache/spark/sql/execution/SortSuite.scala |  2 +-
 .../sql/execution/WholeStageCodegenSuite.scala |  4 +--
 21 files changed, 70 insertions(+), 71 deletions(-)

diff --git a/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala b/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala
index 417e711e9c0..26c1be259ad 100644
--- a/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala
+++ b/core/src/test/scala/org/apache/spark/input/WholeTextFileInputFormatSuite.scala
@@ -83,6 +83,6 @@ object WholeTextFileInputFormatSuite {
   private val fileLengths = Array(10, 100, 1000)
 
   private val files = fileLengths.zip(fileNames).map { case (upperBound, filename) =>
-    filename -> Stream.continually(testWords.toList.toStream).flatten.take(upperBound).toArray
+    filename -> LazyList.continually(testWords.toList.to(LazyList)).flatten.take(upperBound).toArray
   }.toMap
 }
diff --git a/core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala b/core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala
index c833d22b3be..e64ebe2a551 100644
--- a/core/src/test/scala/org/apache/spark/input/WholeTextFileRecordReaderSuite.scala
+++ 

(spark) branch master updated: [SPARK-45707][SQL] Simplify `DataFrameStatFunctions.countMinSketch` with `CountMinSketchAgg`

2023-10-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4339e0c0e4d [SPARK-45707][SQL] Simplify 
`DataFrameStatFunctions.countMinSketch` with `CountMinSketchAgg`
4339e0c0e4d is described below

commit 4339e0c0e4d7e502ae6cafa90444cd153017cb1a
Author: Ruifeng Zheng 
AuthorDate: Sun Oct 29 22:42:01 2023 -0700

[SPARK-45707][SQL] Simplify `DataFrameStatFunctions.countMinSketch` with 
`CountMinSketchAgg`

### What changes were proposed in this pull request?
Simplify `DataFrameStatFunctions.countMinSketch` with `CountMinSketchAgg`
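
As a rough caller-side sketch (not part of this patch; the session, column, and parameter values are made up), the public API that is being reimplemented on top of `CountMinSketchAgg` is used like this:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val df = spark.range(0, 1000).toDF("id")

// Build a Count-Min sketch over the `id` column; after this patch the work is
// routed through the CountMinSketchAgg aggregate expression instead of a raw RDD aggregate.
val sketch = df.stat.countMinSketch(col("id"), eps = 0.01, confidence = 0.95, seed = 42)

// Approximate frequency of a single value.
val approxCount = sketch.estimateCount(1L)
```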

### Why are the changes needed?
To make it consistent with the SQL functions.

### Does this PR introduce _any_ user-facing change?

better error messages: `IllegalArgumentException` -> `AnalysisException`

### How was this patch tested?
CI

### Was this patch authored or co-authored using generative AI tooling?
no

Closes #43560 from zhengruifeng/sql_reimpl_stat_countMinSketch.

Authored-by: Ruifeng Zheng 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/DataFrameStatFunctions.scala  | 44 +++---
 .../org/apache/spark/sql/DataFrameStatSuite.scala  |  2 +-
 2 files changed, 14 insertions(+), 32 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala 
b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
index de3b100cd6a..f3690773f6d 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala
@@ -22,9 +22,8 @@ import java.{lang => jl, util => ju}
 import scala.jdk.CollectionConverters._
 
 import org.apache.spark.annotation.Stable
-import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.Literal
-import org.apache.spark.sql.catalyst.expressions.aggregate.BloomFilterAggregate
+import 
org.apache.spark.sql.catalyst.expressions.aggregate.{BloomFilterAggregate, 
CountMinSketchAgg}
 import org.apache.spark.sql.execution.stat._
 import org.apache.spark.sql.functions.col
 import org.apache.spark.sql.types._
@@ -483,7 +482,9 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
* @since 2.0.0
*/
   def countMinSketch(col: Column, depth: Int, width: Int, seed: Int): 
CountMinSketch = {
-countMinSketch(col, CountMinSketch.create(depth, width, seed))
+val eps = 2.0 / width
+val confidence = 1 - 1 / Math.pow(2, depth)
+countMinSketch(col, eps, confidence, seed)
   }
 
   /**
@@ -497,35 +498,16 @@ final class DataFrameStatFunctions private[sql](df: 
DataFrame) {
* @since 2.0.0
*/
   def countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): 
CountMinSketch = {
-countMinSketch(col, CountMinSketch.create(eps, confidence, seed))
-  }
-
-  private def countMinSketch(col: Column, zero: CountMinSketch): 
CountMinSketch = {
-val singleCol = df.select(col)
-val colType = singleCol.schema.head.dataType
-
-val updater: (CountMinSketch, InternalRow) => Unit = colType match {
-  // For string type, we can get bytes of our `UTF8String` directly, and 
call the `addBinary`
-  // instead of `addString` to avoid unnecessary conversion.
-  case StringType => (sketch, row) => 
sketch.addBinary(row.getUTF8String(0).getBytes)
-  case ByteType => (sketch, row) => sketch.addLong(row.getByte(0))
-  case ShortType => (sketch, row) => sketch.addLong(row.getShort(0))
-  case IntegerType => (sketch, row) => sketch.addLong(row.getInt(0))
-  case LongType => (sketch, row) => sketch.addLong(row.getLong(0))
-  case _ =>
-throw new IllegalArgumentException(
-  s"Count-min Sketch only supports string type and integral types, " +
-s"and does not support type $colType."
-)
-}
-
-singleCol.queryExecution.toRdd.aggregate(zero)(
-  (sketch: CountMinSketch, row: InternalRow) => {
-updater(sketch, row)
-sketch
-  },
-  (sketch1, sketch2) => sketch1.mergeInPlace(sketch2)
+val countMinSketchAgg = new CountMinSketchAgg(
+  col.expr,
+  Literal(eps, DoubleType),
+  Literal(confidence, DoubleType),
+  Literal(seed, IntegerType)
 )
+val bytes = df.select(
+  Column(countMinSketchAgg.toAggregateExpression(false))
+).head().getAs[Array[Byte]](0)
+countMinSketchAgg.deserialize(bytes)
   }
 
   /**
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala
index 1dece5c8285..430e3622102 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala
+++ 

(spark) branch master updated: [SPARK-45682][CORE][SQL][ML][MLLIB][GRAPHX][YARN][DSTREAM][UI][EXAMPLES] Fix "method + in class Byte/Short/Char/Long/Double/Int is deprecated"

2023-10-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 59a9ee77657 
[SPARK-45682][CORE][SQL][ML][MLLIB][GRAPHX][YARN][DSTREAM][UI][EXAMPLES] Fix 
"method + in class Byte/Short/Char/Long/Double/Int is deprecated"
59a9ee77657 is described below

commit 59a9ee776570de987c36e3e8e995d067017064b5
Author: yangjie01 
AuthorDate: Sun Oct 29 22:37:49 2023 -0700

[SPARK-45682][CORE][SQL][ML][MLLIB][GRAPHX][YARN][DSTREAM][UI][EXAMPLES] 
Fix "method + in class Byte/Short/Char/Long/Double/Int is deprecated"

### What changes were proposed in this pull request?
There are some compilation warnings similar to the following:

```
[error] 
/Users/yangjie01/SourceCode/git/spark-mine-sbt/mllib/src/test/scala/org/apache/spark/mllib/regression/LassoSuite.scala:65:58:
 method + in class Double is deprecated (since 2.13.0): Adding a number and a 
String is deprecated. Use the string interpolation `s"$num$str"`
[error] Applicable -Wconf / nowarn filters for this fatal warning: 
msg=, cat=deprecation, 
site=org.apache.spark.mllib.regression.LassoSuite, origin=scala.Double.+, 
version=2.13.0
[error] assert(weight1 >= -1.60 && weight1 <= -1.40, weight1 + " not in 
[-1.6, -1.4]")
[error]  ^
[error] 
/Users/yangjie01/SourceCode/git/spark-mine-sbt/sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala:1314:50:
 method + in class Int is deprecated (since 2.13.0): Adding a number and a 
String is deprecated. Use the string interpolation `s"$num$str"`
[error] Applicable -Wconf / nowarn filters for this fatal warning: 
msg=, cat=deprecation, 
site=org.apache.spark.sql.hive.execution.SQLQuerySuiteBase, origin=scala.Int.+, 
version=2.13.0
[error]   checkAnswer(df, (0 until 5).map(i => Row(i + "#", i + "#")))
[error]  ^
```

This PR fixes them as suggested by the deprecation message: "Adding a number and a String is deprecated. Use the string interpolation `s"$num$str"`". A minimal before/after sketch is shown below.
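
A minimal sketch of the pattern applied across the touched suites (the value is made up; it mirrors the `LassoSuite` assertion quoted above):

```scala
val weight1 = -1.52

// Before: Double + String, deprecated since Scala 2.13.0.
// assert(weight1 >= -1.60 && weight1 <= -1.40, weight1 + " not in [-1.6, -1.4]")

// After: string interpolation, as the deprecation message suggests.
assert(weight1 >= -1.60 && weight1 <= -1.40, s"$weight1 not in [-1.6, -1.4]")
```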

### Why are the changes needed?
Clean up deprecated Scala API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43573 from LuciferYang/SPARK-45682.

Authored-by: yangjie01 
Signed-off-by: Dongjoon Hyun 
---
 .../main/scala/org/apache/spark/SparkConf.scala|  2 +-
 .../spark/internal/config/ConfigBuilder.scala  |  4 ++--
 .../scala/org/apache/spark/scheduler/TaskSet.scala |  2 +-
 .../org/apache/spark/ui/ConsoleProgressBar.scala   |  4 ++--
 .../main/scala/org/apache/spark/ui/UIUtils.scala   |  2 +-
 .../scala/org/apache/spark/ui/jobs/StagePage.scala |  6 +++---
 .../scala/org/apache/spark/util/Distribution.scala |  4 ++--
 .../scala/org/apache/spark/rdd/PipedRDDSuite.scala |  2 +-
 .../spark/util/collection/AppendOnlyMapSuite.scala |  6 +++---
 .../spark/util/collection/OpenHashMapSuite.scala   |  6 +++---
 .../collection/PrimitiveKeyOpenHashMapSuite.scala  |  6 +++---
 .../apache/spark/examples/graphx/Analytics.scala   |  2 +-
 .../apache/spark/graphx/util/GraphGenerators.scala |  2 +-
 .../apache/spark/mllib/util/MFDataGenerator.scala  |  4 ++--
 .../spark/ml/classification/LinearSVCSuite.scala   |  2 +-
 .../classification/LogisticRegressionSuite.scala   | 10 +-
 .../GeneralizedLinearRegressionSuite.scala | 22 +++---
 .../ml/regression/LinearRegressionSuite.scala  |  8 
 .../apache/spark/mllib/regression/LassoSuite.scala | 12 ++--
 .../spark/deploy/yarn/ExecutorRunnable.scala   |  2 +-
 .../yarn/YarnShuffleServiceMetricsSuite.scala  |  2 +-
 .../spark/sql/catalyst/expressions/literals.scala  | 10 +-
 .../expressions/IntervalExpressionsSuite.scala |  4 ++--
 .../sql/catalyst/util/IntervalUtilsSuite.scala |  2 +-
 .../test/scala/org/apache/spark/sql/UDFSuite.scala |  2 +-
 .../sql/execution/datasources/FileIndexSuite.scala |  2 +-
 .../parquet/ParquetColumnIndexSuite.scala  |  6 +++---
 .../datasources/parquet/ParquetFilterSuite.scala   |  2 +-
 .../datasources/v2/V2PredicateSuite.scala  |  2 +-
 .../spark/sql/hive/execution/SQLQuerySuite.scala   |  4 ++--
 .../apache/spark/streaming/CheckpointSuite.scala   |  2 +-
 .../apache/spark/streaming/InputStreamsSuite.scala |  4 ++--
 .../spark/streaming/StreamingContextSuite.scala|  2 +-
 33 files changed, 76 insertions(+), 76 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/SparkConf.scala 
b/core/src/main/scala/org/apache/spark/SparkConf.scala
index b688604beea..b8fd2700771 100644
--- 

(spark) branch master updated: [SPARK-45664][SQL] Introduce a mapper for orc compression codecs

2023-10-29 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f12bc05e440 [SPARK-45664][SQL] Introduce a mapper for orc compression 
codecs
f12bc05e440 is described below

commit f12bc05e44099f24c470466bf777473744ab893d
Author: Jiaan Geng 
AuthorDate: Sun Oct 29 22:22:07 2023 -0700

[SPARK-45664][SQL] Introduce a mapper for orc compression codecs

### What changes were proposed in this pull request?
Currently, Spark supports all the ORC compression codecs, but the codecs supported by ORC and those supported by Spark do not map one-to-one, because Spark introduces two additional codec names, `NONE` and `UNCOMPRESSED`.

On the other hand, there are many magic strings copied from the ORC compression codecs, so developers have to maintain the consistency manually. That is error-prone and reduces development efficiency.
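
As a rough usage sketch of the new mapper (based on the enum definition added in the diff below; the assertions are illustrative only):

```scala
import org.apache.orc.CompressionKind
import org.apache.spark.sql.execution.datasources.orc.OrcCompressionCodec

// Resolve a Spark-facing codec name to the underlying ORC CompressionKind.
val codec = OrcCompressionCodec.valueOf("UNCOMPRESSED")
assert(codec.getCompressionKind == CompressionKind.NONE)

// Lower-cased names replace the previous hand-written magic strings.
val snappyName = OrcCompressionCodec.SNAPPY.lowerCaseName() // "snappy"
```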

### Why are the changes needed?
Make it easier for developers to use ORC compression codecs.

### Does this PR introduce _any_ user-facing change?
'No'.
Introduce a new class.

### How was this patch tested?
Existing test cases.

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #43528 from beliefer/SPARK-45664.

Authored-by: Jiaan Geng 
Signed-off-by: Dongjoon Hyun 
---
 .../datasources/orc/OrcCompressionCodec.java   | 56 ++
 .../sql/execution/datasources/orc/OrcOptions.scala | 11 ++---
 .../sql/execution/datasources/orc/OrcUtils.scala   | 12 ++---
 .../BuiltInDataSourceWriteBenchmark.scala  |  4 +-
 .../benchmark/DataSourceReadBenchmark.scala|  4 +-
 .../benchmark/FilterPushdownBenchmark.scala|  3 +-
 .../datasources/FileSourceCodecSuite.scala |  5 +-
 .../execution/datasources/orc/OrcQuerySuite.scala  | 26 +-
 .../execution/datasources/orc/OrcSourceSuite.scala | 29 +++
 .../spark/sql/hive/CompressionCodecSuite.scala | 23 ++---
 .../spark/sql/hive/execution/HiveDDLSuite.scala|  7 +--
 .../sql/hive/orc/OrcHadoopFsRelationSuite.scala|  6 +--
 .../spark/sql/hive/orc/OrcReadBenchmark.scala  |  5 +-
 13 files changed, 134 insertions(+), 57 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcCompressionCodec.java
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcCompressionCodec.java
new file mode 100644
index 000..c8e57969068
--- /dev/null
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcCompressionCodec.java
@@ -0,0 +1,56 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.execution.datasources.orc;
+
+import java.util.Arrays;
+import java.util.Locale;
+import java.util.Map;
+import java.util.stream.Collectors;
+
+import org.apache.orc.CompressionKind;
+
+/**
+ * A mapper class from Spark supported orc compression codecs to orc 
compression codecs.
+ */
+public enum OrcCompressionCodec {
+  NONE(CompressionKind.NONE),
+  UNCOMPRESSED(CompressionKind.NONE),
+  ZLIB(CompressionKind.ZLIB),
+  SNAPPY(CompressionKind.SNAPPY),
+  LZO(CompressionKind.LZO),
+  LZ4(CompressionKind.LZ4),
+  ZSTD(CompressionKind.ZSTD);
+
+  private final CompressionKind compressionKind;
+
+  OrcCompressionCodec(CompressionKind compressionKind) {
+this.compressionKind = compressionKind;
+  }
+
+  public CompressionKind getCompressionKind() {
+return this.compressionKind;
+  }
+
+  public static final Map<String, String> codecNameMap =
+    Arrays.stream(OrcCompressionCodec.values()).collect(
+      Collectors.toMap(codec -> codec.name(), codec -> codec.name().toLowerCase(Locale.ROOT)));
+
+  public String lowerCaseName() {
+return codecNameMap.get(this.name());
+  }
+}
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcOptions.scala
index 1c819f07038..4bed600fa4e 100644
--- 

(spark) branch master updated: [SPARK-44790][SQL] XML: to_xml implementation and bindings for python, connect and SQL

2023-10-29 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 00d2b4fa2de [SPARK-44790][SQL] XML: to_xml implementation and bindings 
for python, connect and SQL
00d2b4fa2de is described below

commit 00d2b4fa2def948e7517bacfce7c75be6a37eb20
Author: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
AuthorDate: Mon Oct 30 14:00:31 2023 +0900

[SPARK-44790][SQL] XML: to_xml implementation and bindings for python, 
connect and SQL

### What changes were proposed in this pull request?
to_xml: converts a `StructType` column to an XML output string.
Bindings for Python, Connect, and SQL.

### Why are the changes needed?
to_xml: converts a `StructType` column to an XML output string.

### Does this PR introduce _any_ user-facing change?
Yes, new to_xml API.
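
A minimal caller-side sketch of the new Scala API (the DataFrame and column names are made up; the exact signatures are in the diff below):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{struct, to_xml}

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")

// Convert each struct row to an XML string; an overload that takes an options map is also added.
df.select(to_xml(struct($"id", $"name")).as("xml")).show(truncate = false)
```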

### How was this patch tested?
New unit tests.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43503 from sandip-db/to_xml.

Authored-by: Sandip Agarwala <131817656+sandip...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 .../scala/org/apache/spark/sql/functions.scala |  31 +++
 .../org/apache/spark/sql/FunctionTestSuite.scala   |   6 +
 .../source/reference/pyspark.sql/functions.rst |   1 +
 python/pyspark/sql/connect/functions.py|  10 +
 python/pyspark/sql/functions.py|  36 +++
 .../sql/tests/connect/test_connect_function.py |  14 +
 sql/catalyst/pom.xml   |   4 +
 .../sql/catalyst/analysis/FunctionRegistry.scala   |   3 +-
 .../sql/catalyst/expressions/xmlExpressions.scala  |  90 ++-
 .../spark/sql/catalyst/xml/StaxXmlGenerator.scala  | 295 -
 .../apache/spark/sql/catalyst/xml/XmlOptions.scala |   5 +
 sql/core/pom.xml   |   4 -
 .../datasources/xml/XmlOutputWriter.scala  |  53 +---
 .../scala/org/apache/spark/sql/functions.scala |  30 +++
 .../sql-functions/sql-expression-schema.md |   1 +
 .../analyzer-results/xml-functions.sql.out | 122 +
 .../resources/sql-tests/inputs/xml-functions.sql   |  18 +-
 .../sql-tests/results/xml-functions.sql.out| 134 ++
 .../sql/execution/datasources/xml/XmlSuite.scala   |  25 ++
 19 files changed, 696 insertions(+), 186 deletions(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index 9c5adca7e28..1c8f5993d29 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -7470,6 +7470,37 @@ object functions {
 fnWithOptions("schema_of_xml", options.asScala.iterator, xml)
   }
 
+  // scalastyle:off line.size.limit
+
+  /**
+   * (Java-specific) Converts a column containing a `StructType` into a XML 
string with the
+   * specified schema. Throws an exception, in the case of an unsupported type.
+   *
+   * @param e
+   *   a column containing a struct.
+   * @param options
+   *   options to control how the struct column is converted into a XML string. It accepts the
+   *   same options as the XML data source. See <a href="https://spark.apache.org/docs/latest/sql-data-sources-xml.html#data-source-option"> Data
+   *   Source Option</a> in the version you use.
+   * @group xml_funcs
+   * @since 4.0.0
+   */
+  // scalastyle:on line.size.limit
+  def to_xml(e: Column, options: java.util.Map[String, String]): Column =
+fnWithOptions("to_xml", options.asScala.iterator, e)
+
+  /**
+   * Converts a column containing a `StructType` into a XML string with the 
specified schema.
+   * Throws an exception, in the case of an unsupported type.
+   *
+   * @param e
+   *   a column containing a struct.
+   * @group xml_funcs
+   * @since 4.0.0
+   */
+  def to_xml(e: Column): Column = to_xml(e, Collections.emptyMap())
+
   /**
* Returns the total number of elements in the array. The function returns 
null for null input.
*
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala
index e350bde9946..748843ec991 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/FunctionTestSuite.scala
@@ -237,6 +237,12 @@ class FunctionTestSuite extends ConnectFunSuite {
 from_xml(a, schema, Map.empty[String, String].asJava),
 from_xml(a, schema, Collections.emptyMap[String, String]),
 from_xml(a, lit(schema.json), 

Re: [PR] Add Matomo analytics to the published documents [spark-website]

2023-10-29 Thread via GitHub


allisonwang-db commented on PR #485:
URL: https://github.com/apache/spark-website/pull/485#issuecomment-1784451642

   This is awesome! Thanks for adding it to ALL API docs! Let's see if it works 
for 3.1.1.





Re: [PR] Add Matomo analytics to the published documents [spark-website]

2023-10-29 Thread via GitHub


panbingkun commented on PR #485:
URL: https://github.com/apache/spark-website/pull/485#issuecomment-1784382189

   cc @zhengruifeng @itholic @allisonwang-db @HyukjinKwon @srowen





Re: [PR] Add Matomo analytics to the published documents [spark-website]

2023-10-29 Thread via GitHub


panbingkun commented on PR #485:
URL: https://github.com/apache/spark-website/pull/485#issuecomment-1784379925

   At present, this PR has only made changes to the historical document of 
version 3.1.1. If there are no issues with similar modifications, I will use 
tools to make similar modifications to other versions.
   
   PS: This PR will involve a lot of files.





[PR] Add Matomo analytics to the published documents [spark-website]

2023-10-29 Thread via GitHub


panbingkun opened a new pull request, #485:
URL: https://github.com/apache/spark-website/pull/485

   The pr aims to add Matomo analytics to the published documents.
   include: 
   





(spark) branch master updated (4af4ddea116 -> 0245b842ccc)

2023-10-29 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 4af4ddea116 [SPARK-45552][PS] Introduce flexible parameters to 
`assertDataFrameEqual`
 add 0245b842ccc [SPARK-45554][PYTHON] Introduce flexible parameter to 
`assertSchemaEqual`

No new revisions were added by this update.

Summary of changes:
 python/pyspark/testing/utils.py | 90 ++---
 1 file changed, 85 insertions(+), 5 deletions(-)





(spark) branch master updated: [SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual`

2023-10-29 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4af4ddea116 [SPARK-45552][PS] Introduce flexible parameters to 
`assertDataFrameEqual`
4af4ddea116 is described below

commit 4af4ddea116d26086550596693ce09674e75bfa3
Author: Haejoon Lee 
AuthorDate: Mon Oct 30 11:07:01 2023 +0900

[SPARK-45552][PS] Introduce flexible parameters to `assertDataFrameEqual`

### What changes were proposed in this pull request?

This PR proposes to add six new parameters to the `assertDataFrameEqual`: 
`ignoreNullable`, `ignoreColumnOrder`, `ignoreColumnName`, `ignoreColumnType`, 
`maxErrors`, and `showOnlyDiff` to provide users with more flexibility in 
DataFrame testing.

### Why are the changes needed?

To enhance the utility of `assertDataFrameEqual` by accommodating various 
common DataFrame comparison scenarios that users might encounter, without 
necessitating manual adjustments or workarounds.

### Does this PR introduce _any_ user-facing change?

Yes. `assertDataFrameEqual` now have the option to use the six new 
parameters:


Parameter | Type | Comment
-- | -- | --
ignoreNullable | Boolean [optional] | Specifies whether a column's nullable property is included when checking for schema equality. When set to True (default), the nullable property of the columns being compared is not taken into account and the columns will be considered equal even if they have different nullable settings. When set to False, columns are considered equal only if they have the same nullable setting.
ignoreColumnOrder | Boolean [optional] | Specifies whether to compare columns in the order they appear in the DataFrames or by column name. When set to False (default), columns are compared in the order they appear in the DataFrames. When set to True, a column in the expected DataFrame is compared to the column with the same name in the actual DataFrame. ignoreColumnOrder cannot be set to True if ignoreColumnNames is also set to True.
ignoreColumnName | Boolean [optional] | Specifies whether to fail the initial schema equality check if the column names in the two DataFrames are different. When set to False (default), column names are checked and the function fails if they are different. When set to True, the function will succeed even if column names are different. Column data types are compared for columns in the order they appear in the DataFrames. ignoreColumnNames cannot be set to [...]
ignoreColumnType | Boolean [optional] | Specifies whether to ignore the data type of the columns when comparing. When set to False (default), column data types are checked and the function fails if they are different. When set to True, the schema equality check will succeed even if column data types are different and the function will attempt to compare rows.
maxErrors | Integer [optional] | The maximum number of row comparison failures to encounter before returning. When this number of row comparisons have failed, the function returns independent of how many rows have been compared. Set to None by default which means compare all rows independent of number of failures.
showOnlyDiff | Boolean [optional] | If set to True, the error message will only include rows that are different. If set to False (default), the error message will include all rows (when there is at least one row that is different).

### How was this patch tested?

Added usage examples into doctest for each parameter.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43433 from itholic/SPARK-45552.

Authored-by: Haejoon Lee 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/tests/test_utils.py |  68 +++
 python/pyspark/testing/utils.py| 215 +++--
 2 files changed, 274 insertions(+), 9 deletions(-)

diff --git a/python/pyspark/sql/tests/test_utils.py 
b/python/pyspark/sql/tests/test_utils.py
index a2cad4e83bd..421043a41bb 100644
--- a/python/pyspark/sql/tests/test_utils.py
+++ b/python/pyspark/sql/tests/test_utils.py
@@ -1238,6 +1238,9 @@ class UtilsTestsMixin:
 
 assertDataFrameEqual(df1, df2)
 
+with self.assertRaises(PySparkAssertionError):
+assertDataFrameEqual(df1, df2, ignoreNullable=False)
+
 def test_schema_ignore_nullable_array_equal(self):
 s1 = StructType([StructField("names", ArrayType(DoubleType(), True), 
True)])
 s2 = StructType([StructField("names", ArrayType(DoubleType(), False), 
False)])
@@ -1611,6 +1614,71 @@ class UtilsTestsMixin:
 message_parameters={"error_msg": error_msg},
 )
 
+def test_dataframe_ignore_column_order(self):
+df1 = 

(spark) branch master updated: [SPARK-45544][CORE] Integrate SSL support into TransportContext

2023-10-29 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 884f6f71172 [SPARK-45544][CORE] Integrate SSL support into 
TransportContext
884f6f71172 is described below

commit 884f6f71172156ccc7d95ed022c8fb8baadc3c0a
Author: Hasnain Lakhani 
AuthorDate: Sun Oct 29 20:58:18 2023 -0500

[SPARK-45544][CORE] Integrate SSL support into TransportContext

### What changes were proposed in this pull request?

This integrates SSL support into TransportContext and related modules so 
that the RPC SSL functionality can work when properly configured.

### Why are the changes needed?

This is needed in order to support SSL for RPC connections.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI

Ran the following tests:

```
build/sbt -P yarn
> project network-common
> testOnly
> project network-shuffle
> testOnly
> project core
> testOnly *Ssl*
> project yarn
> testOnly 
org.apache.spark.network.yarn.SslYarnShuffleServiceWithRocksDBBackendSuite
```

I verified traffic was encrypted using TLS using two mechanisms:

* Enabled trace level logging for Netty and JDK SSL and saw logs confirming 
TLS handshakes were happening
* I ran wireshark on my machine and snooped on traffic while sending 
queries shuffling a fixed string. Without any encryption, I could find that 
string in the network traffic. With this encryption enabled, that string did 
not show up, and wireshark logs confirmed a TLS handshake was happening.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43541 from hasnain-db/spark-tls-final.

Authored-by: Hasnain Lakhani 
Signed-off-by: Mridul Muralidharan gmail.com>
---
 .../org/apache/spark/network/TransportContext.java | 70 --
 .../network/client/TransportClientFactory.java | 26 +++-
 .../spark/network/server/TransportServer.java  |  2 +-
 .../apache/spark/network/util/TransportConf.java   |  8 ---
 .../spark/network/ChunkFetchIntegrationSuite.java  |  6 +-
 .../network/SslChunkFetchIntegrationSuite.java | 22 ---
 .../client/SslTransportClientFactorySuite.java | 29 +
 .../client/TransportClientFactorySuite.java|  8 +--
 .../network/shuffle/ShuffleTransportContext.java   | 10 ++--
 .../shuffle/ExternalShuffleIntegrationSuite.java   | 29 +
 .../shuffle/ExternalShuffleSecuritySuite.java  | 14 -
 .../shuffle/ShuffleTransportContextSuite.java  | 33 +-
 .../SslExternalShuffleIntegrationSuite.java| 44 ++
 .../shuffle/SslExternalShuffleSecuritySuite.java   | 35 +++
 .../shuffle/SslShuffleTransportContextSuite.java   | 28 +
 .../network/yarn/SslYarnShuffleServiceSuite.scala  |  2 +-
 16 files changed, 265 insertions(+), 101 deletions(-)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
index 51d074a4ddb..90ca4f4c46a 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java
@@ -23,13 +23,17 @@ import io.netty.handler.codec.MessageToMessageDecoder;
 import java.io.Closeable;
 import java.util.ArrayList;
 import java.util.List;
+import javax.annotation.Nullable;
 
 import com.codahale.metrics.Counter;
 import io.netty.channel.Channel;
 import io.netty.channel.ChannelPipeline;
 import io.netty.channel.EventLoopGroup;
 import io.netty.channel.socket.SocketChannel;
+import io.netty.handler.ssl.SslHandler;
+import io.netty.handler.stream.ChunkedWriteHandler;
 import io.netty.handler.timeout.IdleStateHandler;
+import io.netty.handler.codec.MessageToMessageEncoder;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
@@ -37,6 +41,8 @@ import org.apache.spark.network.client.TransportClient;
 import org.apache.spark.network.client.TransportClientBootstrap;
 import org.apache.spark.network.client.TransportClientFactory;
 import org.apache.spark.network.client.TransportResponseHandler;
+import org.apache.spark.network.protocol.Message;
+import org.apache.spark.network.protocol.SslMessageEncoder;
 import org.apache.spark.network.protocol.MessageDecoder;
 import org.apache.spark.network.protocol.MessageEncoder;
 import org.apache.spark.network.server.ChunkFetchRequestHandler;
@@ -45,6 +51,7 @@ import 
org.apache.spark.network.server.TransportChannelHandler;
 import org.apache.spark.network.server.TransportRequestHandler;
 import org.apache.spark.network.server.TransportServer;
 import 

(spark) branch master updated: [SPARK-45605][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES] Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`

2023-10-29 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 89ca8b6065e 
[SPARK-45605][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES]
 Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`
89ca8b6065e is described below

commit 89ca8b6065e9f690a492c778262080741d50d94d
Author: yangjie01 
AuthorDate: Sun Oct 29 09:19:30 2023 -0500


[SPARK-45605][CORE][SQL][SS][CONNECT][MLLIB][GRAPHX][DSTREAM][PROTOBUF][EXAMPLES]
 Replace `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`

### What changes were proposed in this pull request?
This PR replaces `s.c.MapOps.mapValues` with `s.c.MapOps.view.mapValues`, since `s.c.MapOps.mapValues` has been deprecated since Scala 2.13.0:


https://github.com/scala/scala/blob/bf45e199e96383b96a6955520d7d2524c78e6e12/src/library/scala/collection/Map.scala#L256-L262

```scala
  deprecated("Use .view.mapValues(f). A future version will include a 
strict version of this method (for now, .view.mapValues(f).toMap).", "2.13.0")
  def mapValues[W](f: V => W): MapView[K, W] = new MapView.MapValues(this, 
f)
```
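
A minimal before/after sketch of the replacement pattern (the map contents are made up):

```scala
val wordCounts = Map("spark" -> 3, "scala" -> 2)

// Before (deprecated since Scala 2.13.0): returns a lazy MapView.
// val doubled = wordCounts.mapValues(_ * 2)

// After: go through .view explicitly, adding .toMap where a strict Map is required.
val doubled = wordCounts.view.mapValues(_ * 2).toMap
```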

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass GitHub Actions
- Packaged the client, manually tested 
`DFSReadWriteTest/MiniReadWriteTest/PowerIterationClusteringExample`.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43448 from LuciferYang/SPARK-45605.

Lead-authored-by: yangjie01 
Co-authored-by: YangJie 
Signed-off-by: Sean Owen 
---
 .../spark/util/sketch/CountMinSketchSuite.scala|  2 +-
 .../org/apache/spark/sql/avro/AvroUtils.scala  |  1 +
 .../scala/org/apache/spark/sql/SparkSession.scala  |  2 +-
 .../spark/sql/ClientDataFrameStatSuite.scala   |  2 +-
 .../org/apache/spark/sql/connect/dsl/package.scala |  2 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  | 13 ++
 .../sql/kafka010/KafkaMicroBatchSourceSuite.scala  |  3 ++-
 .../apache/spark/sql/kafka010/KafkaTestUtils.scala |  2 +-
 .../streaming/kafka010/ConsumerStrategy.scala  |  6 ++---
 .../kafka010/DirectKafkaInputDStream.scala |  2 +-
 .../kafka010/DirectKafkaStreamSuite.scala  |  2 +-
 .../spark/streaming/kafka010/KafkaTestUtils.scala  |  2 +-
 .../spark/streaming/kinesis/KinesisTestUtils.scala |  2 +-
 .../kinesis/KPLBasedKinesisTestUtils.scala |  2 +-
 .../kinesis/KinesisBackedBlockRDDSuite.scala   |  4 +--
 .../spark/sql/protobuf/utils/ProtobufUtils.scala   |  1 +
 .../org/apache/spark/api/java/JavaPairRDD.scala|  4 +--
 .../apache/spark/api/java/JavaSparkContext.scala   |  2 +-
 .../spark/api/python/PythonWorkerFactory.scala |  2 +-
 .../apache/spark/scheduler/InputFormatInfo.scala   |  2 +-
 .../apache/spark/scheduler/TaskSchedulerImpl.scala |  2 +-
 .../cluster/CoarseGrainedSchedulerBackend.scala|  2 +-
 ...plicationEnvironmentInfoWrapperSerializer.scala |  5 ++--
 .../ExecutorSummaryWrapperSerializer.scala |  3 ++-
 .../status/protobuf/JobDataWrapperSerializer.scala |  2 +-
 .../protobuf/StageDataWrapperSerializer.scala  |  6 ++---
 .../org/apache/spark/SparkThrowableSuite.scala |  2 +-
 .../apache/spark/rdd/PairRDDFunctionsSuite.scala   |  2 +-
 .../test/scala/org/apache/spark/rdd/RDDSuite.scala |  1 +
 .../scheduler/ExecutorResourceInfoSuite.scala  |  1 +
 .../BlockManagerDecommissionIntegrationSuite.scala |  2 +-
 .../storage/ShuffleBlockFetcherIteratorSuite.scala |  2 +-
 .../util/collection/ExternalSorterSuite.scala  |  2 +-
 .../apache/spark/examples/DFSReadWriteTest.scala   |  1 +
 .../apache/spark/examples/MiniReadWriteTest.scala  |  1 +
 .../mllib/PowerIterationClusteringExample.scala|  2 +-
 .../spark/graphx/lib/ShortestPathsSuite.scala  |  2 +-
 .../spark/ml/evaluation/ClusteringMetrics.scala|  1 +
 .../apache/spark/ml/feature/VectorIndexer.scala|  2 +-
 .../org/apache/spark/ml/feature/Word2Vec.scala |  2 +-
 .../apache/spark/ml/tree/impl/RandomForest.scala   |  4 +--
 .../spark/mllib/clustering/BisectingKMeans.scala   |  2 +-
 .../mllib/linalg/distributed/BlockMatrix.scala |  4 +--
 .../apache/spark/mllib/stat/test/ChiSqTest.scala   |  1 +
 .../apache/spark/ml/recommendation/ALSSuite.scala  |  8 +++---
 .../apache/spark/mllib/feature/Word2VecSuite.scala | 12 -
 .../org/apache/spark/sql/types/Metadata.scala  |  2 +-
 .../spark/sql/catalyst/analysis/Analyzer.scala |  3 ++-
 .../catalyst/catalog/ExternalCatalogUtils.scala|  2 +-
 .../sql/catalyst/catalog/SessionCatalog.scala  |  2 +-
 .../spark/sql/catalyst/expressions/package.scala   |  2 +-
 

(spark) branch master updated: [SPARK-45636][BUILD] Upgrade jersey to 2.41

2023-10-29 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 4ae99f9320c [SPARK-45636][BUILD] Upgrade jersey to 2.41
4ae99f9320c is described below

commit 4ae99f9320ca29193f7c0d6d54d61e5d3fd0b323
Author: YangJie 
AuthorDate: Sun Oct 29 09:18:07 2023 -0500

[SPARK-45636][BUILD] Upgrade jersey to 2.41

### What changes were proposed in this pull request?
This PR aims to upgrade Jersey from 2.40 to 2.41.

### Why are the changes needed?
The new version brings some improvements, like:
- https://github.com/eclipse-ee4j/jersey/pull/5350
- https://github.com/eclipse-ee4j/jersey/pull/5365
- https://github.com/eclipse-ee4j/jersey/pull/5436
- https://github.com/eclipse-ee4j/jersey/pull/5296

and some bug fixes, like:
- https://github.com/eclipse-ee4j/jersey/pull/5359
- https://github.com/eclipse-ee4j/jersey/pull/5405
- https://github.com/eclipse-ee4j/jersey/pull/5423
- https://github.com/eclipse-ee4j/jersey/pull/5435
- https://github.com/eclipse-ee4j/jersey/pull/5445

The full release notes are as follows:
- https://github.com/eclipse-ee4j/jersey/releases/tag/2.41

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43490 from LuciferYang/SPARK-45636.

Lead-authored-by: YangJie 
Co-authored-by: yangjie01 
Signed-off-by: Sean Owen 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 12 ++--
 pom.xml   |  2 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index c6fa77c84ca..2bfd94b9d46 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -122,12 +122,12 @@ jaxb-runtime/2.3.2//jaxb-runtime-2.3.2.jar
 jcl-over-slf4j/2.0.9//jcl-over-slf4j-2.0.9.jar
 jdo-api/3.0.1//jdo-api-3.0.1.jar
 jdom2/2.0.6//jdom2-2.0.6.jar
-jersey-client/2.40//jersey-client-2.40.jar
-jersey-common/2.40//jersey-common-2.40.jar
-jersey-container-servlet-core/2.40//jersey-container-servlet-core-2.40.jar
-jersey-container-servlet/2.40//jersey-container-servlet-2.40.jar
-jersey-hk2/2.40//jersey-hk2-2.40.jar
-jersey-server/2.40//jersey-server-2.40.jar
+jersey-client/2.41//jersey-client-2.41.jar
+jersey-common/2.41//jersey-common-2.41.jar
+jersey-container-servlet-core/2.41//jersey-container-servlet-core-2.41.jar
+jersey-container-servlet/2.41//jersey-container-servlet-2.41.jar
+jersey-hk2/2.41//jersey-hk2-2.41.jar
+jersey-server/2.41//jersey-server-2.41.jar
 jettison/1.5.4//jettison-1.5.4.jar
 jetty-util-ajax/9.4.53.v20231009//jetty-util-ajax-9.4.53.v20231009.jar
 jetty-util/9.4.53.v20231009//jetty-util-9.4.53.v20231009.jar
diff --git a/pom.xml b/pom.xml
index 6488918326f..71c3044dd42 100644
--- a/pom.xml
+++ b/pom.xml
@@ -206,7 +206,7 @@
   Please don't upgrade the version to 3.0.0+,
   Because it transitions Jakarta REST API from javax to jakarta package.
 -->
-2.40
+2.41
 2.12.5
 3.5.2
 3.0.0





(spark) branch master updated: [SPARK-45674][CONNECT][PYTHON] Improve error message for JVM-dependent attributes on Spark Connect

2023-10-29 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1f8422b77ae [SPARK-45674][CONNECT][PYTHON] Improve error message for 
JVM-dependent attributes on Spark Connect
1f8422b77ae is described below

commit 1f8422b77ae9fd920a30c55947712a5fb388c480
Author: Haejoon Lee 
AuthorDate: Sun Oct 29 19:26:50 2023 +0900

[SPARK-45674][CONNECT][PYTHON] Improve error message for JVM-dependent 
attributes on Spark Connect

### What changes were proposed in this pull request?

This PR proposes to improve the error message for JVM-dependent attributes on Spark Connect.

### Why are the changes needed?

To improve the usability of the error messages by using a proper error class.

### Does this PR introduce _any_ user-facing change?

No API change, but only user-facing error messages are improved as below:

**Before**
```
>>> spark.sparkContext
PySparkNotImplementedError: [NOT_IMPLEMENTED] sparkContext() is not 
implemented.
```
**After**
```
>>> spark.sparkContext
PySparkAttributeError: [JVM_ATTRIBUTE_NOT_SUPPORTED] Attribute 
`sparkContext` is not supported in Spark Connect as it depends on the JVM. If 
you need to use this attribute, do not use Spark Connect when creating your 
session.
```

### How was this patch tested?

The existing CI should pass

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43537 from itholic/SPARK-45674.

Authored-by: Haejoon Lee 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/errors/error_classes.py |  2 +-
 python/pyspark/sql/connect/dataframe.py|  4 +---
 python/pyspark/sql/connect/session.py  |  6 +-
 python/pyspark/sql/tests/connect/test_connect_basic.py | 14 --
 4 files changed, 3 insertions(+), 23 deletions(-)

diff --git a/python/pyspark/errors/error_classes.py 
b/python/pyspark/errors/error_classes.py
index 53a7279fde2..70e88c18f9d 100644
--- a/python/pyspark/errors/error_classes.py
+++ b/python/pyspark/errors/error_classes.py
@@ -354,7 +354,7 @@ ERROR_CLASSES_JSON = """
   },
   "JVM_ATTRIBUTE_NOT_SUPPORTED" : {
 "message" : [
-  "Attribute `` is not supported in Spark Connect as it depends 
on the JVM. If you need to use this attribute, do not use Spark Connect when 
creating your session."
+  "Attribute `` is not supported in Spark Connect as it depends 
on the JVM. If you need to use this attribute, do not use Spark Connect when 
creating your session. Visit 
https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession
 for creating regular Spark Session in detail."
 ]
   },
   "KEY_VALUE_PAIR_REQUIRED" : {
diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index b322ded84a4..75cecd5f610 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1639,13 +1639,11 @@ class DataFrame:
 sampleBy.__doc__ = PySparkDataFrame.sampleBy.__doc__
 
 def __getattr__(self, name: str) -> "Column":
-if name in ["_jseq", "_jdf", "_jmap", "_jcols"]:
+if name in ["_jseq", "_jdf", "_jmap", "_jcols", "rdd", "toJSON"]:
 raise PySparkAttributeError(
 error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", 
message_parameters={"attr_name": name}
 )
 elif name in [
-"rdd",
-"toJSON",
 "checkpoint",
 "localCheckpoint",
 ]:
diff --git a/python/pyspark/sql/connect/session.py 
b/python/pyspark/sql/connect/session.py
index 53bf19b78c8..09bd60606c7 100644
--- a/python/pyspark/sql/connect/session.py
+++ b/python/pyspark/sql/connect/session.py
@@ -694,14 +694,10 @@ class SparkSession:
 streams.__doc__ = PySparkSession.streams.__doc__
 
 def __getattr__(self, name: str) -> Any:
-if name in ["_jsc", "_jconf", "_jvm", "_jsparkSession"]:
+if name in ["_jsc", "_jconf", "_jvm", "_jsparkSession", 
"sparkContext", "newSession"]:
 raise PySparkAttributeError(
 error_class="JVM_ATTRIBUTE_NOT_SUPPORTED", 
message_parameters={"attr_name": name}
 )
-elif name in ["newSession", "sparkContext"]:
-raise PySparkNotImplementedError(
-error_class="NOT_IMPLEMENTED", message_parameters={"feature": 
f"{name}()"}
-)
 return object.__getattribute__(self, name)
 
 @property
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index c96d08b5bbe..34bd314c76f 100755
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++