[spark] branch master updated: [MINOR][SQL] Remove Scalac 2.12-specific code in `InMemoryFileIndex`

2023-10-05 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 78863ad5870a [MINOR][SQL] Remove Scalac 2.12-specific code in 
`InMemoryFileIndex`
78863ad5870a is described below

commit 78863ad5870a8c40d251ed515181ab2d452afa3c
Author: Dongjoon Hyun 
AuthorDate: Fri Oct 6 13:20:41 2023 +0800

[MINOR][SQL] Remove Scalac 2.12-specific code in `InMemoryFileIndex`

### What changes were proposed in this pull request?

This PR aims to remove `scalac 2.12`-specific code from the `InMemoryFileIndex` class.

### Why are the changes needed?

As the comment mentions, we no longer need this because the `master` branch uses Scala 2.13 only.

https://github.com/apache/spark/blob/c0d9ca3be14cb0ec8d8f9920d3ecc4aac3cf5adc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L129
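
For context, a minimal sketch of the code shape in question (a simplified stand-in, not the actual Spark code): the `foreach` body ends in a `match` whose branches update different mutable collections, and on Scala 2.13 no trailing `()` is needed because `foreach` discards the lambda's result anyway.

```scala
import scala.collection.mutable

// Hypothetical stand-ins for the leaf-file cache and the fetch queue.
val cache = mutable.HashMap[String, Long]("a" -> 1L)
val cachedSizes = mutable.ArrayBuffer[Long]()
val pathsToFetch = mutable.ArrayBuffer[String]()

Seq("a", "b", "c").foreach { path =>
  cache.get(path) match {
    case Some(size) => cachedSizes += size
    case None       => pathsToFetch += path
  }
  // The removed `()` used to sit here; `foreach` ignores the result, so it
  // is unnecessary on Scala 2.13.
}
```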

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43233 from dongjoon-hyun/minor_scalac_2.12.

Authored-by: Dongjoon Hyun 
Signed-off-by: yangjie01 
---
 .../org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala   | 1 -
 1 file changed, 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
index 44d31131e9c6..4d917994123f 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala
@@ -126,7 +126,6 @@ class InMemoryFileIndex(
 case None =>
   pathsToFetch += path
   }
-  () // for some reasons scalac 2.12 needs this; return type doesn't matter
 }
 val filter = FileInputFormat.getInputPathFilter(new JobConf(hadoopConf, 
this.getClass))
 val discovered = InMemoryFileIndex.bulkListLeafFiles(





[spark] branch master updated: [SPARK-44219][SQL] Adds extra per-rule validations for optimization rewrites

2023-10-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2ce1a8769846 [SPARK-44219][SQL] Adds extra per-rule validations for 
optimization rewrites
2ce1a8769846 is described below

commit 2ce1a8769846477cfc2c596885f8005d8fc972b5
Author: Yannis Sismanis 
AuthorDate: Fri Oct 6 11:20:35 2023 +0800

[SPARK-44219][SQL] Adds extra per-rule validations for optimization rewrites

### What changes were proposed in this pull request?

Adds per-rule validation checks for the following:

1. Aggregate expressions in Aggregate plans are valid.
2. Grouping key types in Aggregate plans cannot be of type Map.
3. No dangling references have been generated.

This validation is enabled by default for all tests, or selectively by setting the `spark.sql.planChangeValidation=true` flag.
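
As a minimal sketch (not part of this commit), the flag can be set like any other SQL conf when building a session; the session settings below are placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Build a session with per-rule plan-change validation turned on.
val spark = SparkSession.builder()
  .master("local[*]")                                // hypothetical deployment
  .appName("plan-change-validation-demo")            // hypothetical app name
  .config("spark.sql.planChangeValidation", "true")  // flag mentioned above
  .getOrCreate()

// Every optimizer rewrite is now validated; an invalid rewrite (for example,
// one producing a dangling attribute reference) fails fast in the optimizer.
spark.range(10).groupBy((col("id") % 2).as("k")).count().show()
```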

### Why are the changes needed?
Extra validation for optimizer rewrites.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #41763 from YannisSismanis/SC-130139_followup.

Authored-by: Yannis Sismanis 
Signed-off-by: Wenchen Fan 
---
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  72 +-
 .../spark/sql/catalyst/expressions/ExprUtils.scala |  77 ++
 .../sql/catalyst/plans/logical/LogicalPlan.scala   |  68 +-
 .../sql/catalyst/optimizer/OptimizerSuite.scala| 270 -
 4 files changed, 412 insertions(+), 75 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 81ca59c0976e..e140625f47ab 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -431,77 +431,7 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 messageParameters = Map.empty)
 }
 
-  case Aggregate(groupingExprs, aggregateExprs, _) =>
-def checkValidAggregateExpression(expr: Expression): Unit = expr 
match {
-  case expr: AggregateExpression =>
-val aggFunction = expr.aggregateFunction
-aggFunction.children.foreach { child =>
-  child.foreach {
-case expr: AggregateExpression =>
-  expr.failAnalysis(
-errorClass = "NESTED_AGGREGATE_FUNCTION",
-messageParameters = Map.empty)
-case other => // OK
-  }
-
-  if (!child.deterministic) {
-child.failAnalysis(
-  errorClass = 
"AGGREGATE_FUNCTION_WITH_NONDETERMINISTIC_EXPRESSION",
-  messageParameters = Map("sqlExpr" -> toSQLExpr(expr)))
-  }
-}
-  case _: Attribute if groupingExprs.isEmpty =>
-operator.failAnalysis(
-  errorClass = "MISSING_GROUP_BY",
-  messageParameters = Map.empty)
-  case e: Attribute if !groupingExprs.exists(_.semanticEquals(e)) 
=>
-throw QueryCompilationErrors.columnNotInGroupByClauseError(e)
-  case s: ScalarSubquery
-  if s.children.nonEmpty && 
!groupingExprs.exists(_.semanticEquals(s)) =>
-s.failAnalysis(
-  errorClass = 
"SCALAR_SUBQUERY_IS_IN_GROUP_BY_OR_AGGREGATE_FUNCTION",
-  messageParameters = Map("sqlExpr" -> toSQLExpr(s)))
-  case e if groupingExprs.exists(_.semanticEquals(e)) => // OK
-  // There should be no Window in Aggregate - this case will fail 
later check anyway.
-  // Perform this check for special case of lateral column alias, 
when the window
-  // expression is not eligible to propagate to upper plan because 
it is not valid,
-  // containing non-group-by or non-aggregate-expressions.
-  case WindowExpression(function, spec) =>
-function.children.foreach(checkValidAggregateExpression)
-checkValidAggregateExpression(spec)
-  case e => e.children.foreach(checkValidAggregateExpression)
-}
-
-def checkValidGroupingExprs(expr: Expression): Unit = {
-  if (expr.exists(_.isInstanceOf[AggregateExpression])) {
-expr.failAnalysis(
-  errorClass = "GROUP_BY_AGGREGATE",
-  messageParameters = Map("sqlExpr" -> expr.sql))
-  }
-
-  // Check if the data type of expr is orderable.
-   

[spark] branch master updated: [SPARK-45421][SQL] Catch AnalysisException over InlineCTE

2023-10-05 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 94666e9cd63a [SPARK-45421][SQL] Catch AnalysisException over InlineCTE
94666e9cd63a is described below

commit 94666e9cd63a9471f50ab2fce6c8439c8fd74154
Author: Rui Wang 
AuthorDate: Fri Oct 6 10:48:24 2023 +0800

[SPARK-45421][SQL] Catch AnalysisException over InlineCTE

### What changes were proposed in this pull request?

We catch exceptions from `CheckAnalysis` and attach the inlined plan. However, if `InlineCTE` itself throws, we can still catch the exception but should attach the original plan instead.

### Why are the changes needed?

Attach original plan if `InlineCTE` throws.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #43227 from amaliujia/inline_cte_plan.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../apache/spark/sql/catalyst/analysis/CheckAnalysis.scala   | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 64f1cafd03fe..81ca59c0976e 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -164,12 +164,18 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 
 }
 // Inline all CTEs in the plan to help check query plan structures in 
subqueries.
-val inlinedPlan = inlineCTE(plan)
+var inlinedPlan: Option[LogicalPlan] = None
 try {
-  checkAnalysis0(inlinedPlan)
+  inlinedPlan = Some(inlineCTE(plan))
 } catch {
   case e: AnalysisException =>
-throw new ExtendedAnalysisException(e, inlinedPlan)
+throw new ExtendedAnalysisException(e, plan)
+}
+try {
+  checkAnalysis0(inlinedPlan.get)
+} catch {
+  case e: AnalysisException =>
+throw new ExtendedAnalysisException(e, inlinedPlan.get)
 }
 plan.setAnalyzed()
   }





[spark] branch master updated: [SPARK-45394][SPARK-45093][PYTHON][CONNECT] Add retries for artifact API. Improve error handling (follow-up to [])

2023-10-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 58863dfa1b4b [SPARK-45394][SPARK-45093][PYTHON][CONNECT] Add retries 
for artifact API. Improve error handling (follow-up to [])
58863dfa1b4b is described below

commit 58863dfa1b4b84dee5a0d6323265f6f3bb71a763
Author: Alice Sayutina 
AuthorDate: Fri Oct 6 11:21:17 2023 +0900

[SPARK-45394][SPARK-45093][PYTHON][CONNECT] Add retries for artifact API. 
Improve error handling (follow-up to [])

### What changes were proposed in this pull request?

1. Add retries to the `add_artifact` API in the client.

2. Slightly change the control flow within `artifact.py` so that client-side errors (e.g. FileNotFound) are properly raised. (Previously we attempted to add logs in https://github.com/apache/spark/pull/42949, but that was an imperfect solution; this should be much better.)

3. Take proper ownership of the files in `LocalData`, and close those descriptors.

### Why are the changes needed?

Improves user experience

### Does this PR introduce _any_ user-facing change?

Improves error handling and adds retries.

### How was this patch tested?

Added test coverage for `add_artifact` when the artifact does not exist.

### Was this patch authored or co-authored using generative AI tooling?

NO

Closes #43216 from cdkrot/SPARK-45394.

Authored-by: Alice Sayutina 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/client/artifact.py  | 94 --
 python/pyspark/sql/connect/client/core.py  | 18 -
 .../sql/tests/connect/client/test_artifact.py  |  7 ++
 3 files changed, 72 insertions(+), 47 deletions(-)

diff --git a/python/pyspark/sql/connect/client/artifact.py 
b/python/pyspark/sql/connect/client/artifact.py
index fb31a57e0f62..5829ec9a8d4d 100644
--- a/python/pyspark/sql/connect/client/artifact.py
+++ b/python/pyspark/sql/connect/client/artifact.py
@@ -52,7 +52,6 @@ class LocalData(metaclass=abc.ABCMeta):
 Payload stored on this machine.
 """
 
-@cached_property
 @abc.abstractmethod
 def stream(self) -> BinaryIO:
 pass
@@ -70,14 +69,18 @@ class LocalFile(LocalData):
 
 def __init__(self, path: str):
 self.path = path
-self._size: int
-self._stream: int
+
+# Check that the file can be read
+# so that incorrect references can be discovered during Artifact 
creation,
+# and not at the point of consumption.
+
+with self.stream():
+pass
 
 @cached_property
 def size(self) -> int:
 return os.path.getsize(self.path)
 
-@cached_property
 def stream(self) -> BinaryIO:
 return open(self.path, "rb")
 
@@ -89,14 +92,11 @@ class InMemory(LocalData):
 
 def __init__(self, blob: bytes):
 self.blob = blob
-self._size: int
-self._stream: int
 
 @cached_property
 def size(self) -> int:
 return len(self.blob)
 
-@cached_property
 def stream(self) -> BinaryIO:
 return io.BytesIO(self.blob)
 
@@ -244,18 +244,23 @@ class ArtifactManager:
 self, *path: str, pyfile: bool, archive: bool, file: bool
 ) -> Iterator[proto.AddArtifactsRequest]:
 """Separated for the testing purpose."""
-try:
-yield from self._add_artifacts(
-chain(
-*(
-self._parse_artifacts(p, pyfile=pyfile, 
archive=archive, file=file)
-for p in path
-)
-)
-)
-except Exception as e:
-logger.error(f"Failed to submit addArtifacts request: {e}")
-raise
+
+# It's crucial that this function is not generator, but only returns 
generator.
+# This way we are doing artifact parsing within the original caller 
thread
+# And not during grpc consuming iterator, allowing for much better 
error reporting.
+
+artifacts: Iterator[Artifact] = chain(
+*(self._parse_artifacts(p, pyfile=pyfile, archive=archive, 
file=file) for p in path)
+)
+
+def generator() -> Iterator[proto.AddArtifactsRequest]:
+try:
+yield from self._add_artifacts(artifacts)
+except Exception as e:
+logger.error(f"Failed to submit addArtifacts request: {e}")
+raise
+
+return generator()
 
 def _retrieve_responses(
 self, requests: Iterator[proto.AddArtifactsRequest]
@@ -279,6 +284,7 @@ class ArtifactManager:
 requests: Iterator[proto.AddArtifactsRequest] = self._create_requests(
 *path, pyfile=pyfile, archive=archive, file=file
 )
+
 

[spark] branch branch-3.5 updated: [SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`

2023-10-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new 64e2b22f6b40 [SPARK-45396][PYTHON] Add doc entry for 
`pyspark.ml.connect` module, and adds `Evaluator` to `__all__` at `ml.connect`
64e2b22f6b40 is described below

commit 64e2b22f6b4023197871a60eb08b055688e9fdd2
Author: Weichen Xu 
AuthorDate: Thu Oct 5 08:38:54 2023 +0900

[SPARK-45396][PYTHON] Add doc entry for `pyspark.ml.connect` module, and 
adds `Evaluator` to `__all__` at `ml.connect`

This PR documents MLlib's Spark Connect support in the API reference.

This PR also piggybacks a fix to `__all__` at `python/pyspark/ml/connect/__init__.py` so that `from pyspark.ml.connect import Evaluator` works.

Without this, users cannot see the `pyspark.ml.connect` Python APIs on the documentation website.

Yes, it adds a new page to the user-facing documentation ([PySpark API reference](https://spark.apache.org/docs/latest/api/python/reference/index.html)).

Manually tested via:

```bash
cd python/docs
make clean html
```

No.

Closes #43210 from HyukjinKwon/SPARK-45396-followup.

Lead-authored-by: Weichen Xu 
Co-authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
(cherry picked from commit 35b627a934b1ab28be7d6ba88fdad63dc129525a)
Signed-off-by: Hyukjin Kwon 
---
 python/docs/source/reference/index.rst |   1 +
 .../docs/source/reference/pyspark.ml.connect.rst   | 122 +
 python/pyspark/ml/connect/__init__.py  |   3 +-
 3 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/python/docs/source/reference/index.rst 
b/python/docs/source/reference/index.rst
index ed3eb4d07dac..6330636839cd 100644
--- a/python/docs/source/reference/index.rst
+++ b/python/docs/source/reference/index.rst
@@ -31,6 +31,7 @@ Pandas API on Spark follows the API specifications of latest 
pandas release.
pyspark.pandas/index
pyspark.ss/index
pyspark.ml
+   pyspark.ml.connect
pyspark.streaming
pyspark.mllib
pyspark
diff --git a/python/docs/source/reference/pyspark.ml.connect.rst 
b/python/docs/source/reference/pyspark.ml.connect.rst
new file mode 100644
index ..1a3e6a593980
--- /dev/null
+++ b/python/docs/source/reference/pyspark.ml.connect.rst
@@ -0,0 +1,122 @@
+..  Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+..http://www.apache.org/licenses/LICENSE-2.0
+
+..  Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+
+
+MLlib (DataFrame-based) for Spark Connect
+=
+
+.. warning::
+The namespace for this package can change in the future Spark version.
+
+
+Pipeline APIs
+-
+
+.. currentmodule:: pyspark.ml.connect
+
+.. autosummary::
+:template: autosummary/class_with_docs.rst
+:toctree: api/
+
+Transformer
+Estimator
+Model
+Evaluator
+Pipeline
+PipelineModel
+
+
+Feature
+---
+
+.. currentmodule:: pyspark.ml.connect.feature
+
+.. autosummary::
+:template: autosummary/class_with_docs.rst
+:toctree: api/
+
+MaxAbsScaler
+MaxAbsScalerModel
+StandardScaler
+StandardScalerModel
+
+
+Classification
+--
+
+.. currentmodule:: pyspark.ml.connect.classification
+
+.. autosummary::
+:template: autosummary/class_with_docs.rst
+:toctree: api/
+
+LogisticRegression
+LogisticRegressionModel
+
+
+Functions
+-
+
+.. currentmodule:: pyspark.ml.connect.functions
+
+.. autosummary::
+:toctree: api/
+
+array_to_vector
+vector_to_array
+
+
+Tuning
+--
+
+.. currentmodule:: pyspark.ml.connect.tuning
+
+.. autosummary::
+:template: autosummary/class_with_docs.rst
+:toctree: api/
+
+CrossValidator
+CrossValidatorModel
+
+
+Evaluation
+--
+
+.. currentmodule:: pyspark.ml.connect.evaluation
+
+.. autosummary::
+:template: autosummary/class_with_docs.rst
+:toctree: api/
+
+RegressionEvaluator
+BinaryClassificationEvaluator
+MulticlassClassificationEvaluator
+
+
+Utilities
+-
+
+.. currentmodule:: pyspark.ml.connect.io_utils
+
+.. autosummary::
+:template: 

[spark] branch master updated: [SPARK-45400][SQL][DOCS] Refer to the unescaping rules from expression descriptions

2023-10-05 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c0d9ca3be14c [SPARK-45400][SQL][DOCS] Refer to the unescaping rules 
from expression descriptions
c0d9ca3be14c is described below

commit c0d9ca3be14cb0ec8d8f9920d3ecc4aac3cf5adc
Author: Max Gekk 
AuthorDate: Thu Oct 5 22:22:29 2023 +0300

[SPARK-45400][SQL][DOCS] Refer to the unescaping rules from expression 
descriptions

### What changes were proposed in this pull request?
In this PR, I propose to refer to the unescaping rules added by https://github.com/apache/spark/pull/43152 from expression descriptions such as `Like`'s; see the screenshot: https://github.com/apache/spark/assets/1580697/6a332b50-f2c8-4549-848a-61519c9f964e
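
For context, a minimal sketch of how the referenced unescaping rules play out with `LIKE` (assumes a running `SparkSession` named `spark`; not part of this commit):

```scala
// Default parser behavior: string literals are unescaped, so matching the
// string \abc requires doubling every backslash in both operands.
spark.sql("""SELECT '\\abc' LIKE '\\\\abc'""").show()  // true

// With raw string literals (r prefix, Spark 3.0+) no unescaping happens,
// so only the LIKE-level escape of the backslash is needed in the pattern.
spark.sql("""SELECT r'\abc' LIKE r'\\abc'""").show()   // true
```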

### Why are the changes needed?
To improve the user experience with Spark SQL.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually generated the docs and checked them visually.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #43203 from MaxGekk/link-to-escape-doc.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 docs/sql-ref-literals.md   |  2 +
 .../catalyst/expressions/regexpExpressions.scala   | 70 ++
 2 files changed, 47 insertions(+), 25 deletions(-)

diff --git a/docs/sql-ref-literals.md b/docs/sql-ref-literals.md
index e9447af71c54..2a02a22bd6f0 100644
--- a/docs/sql-ref-literals.md
+++ b/docs/sql-ref-literals.md
@@ -62,6 +62,8 @@ The following escape sequences are recognized in regular 
string literals (withou
 - `\_` -> `\_`;
 - `\` -> ``, skip the slash and leave the character as 
is.
 
+The unescaping rules above can be turned off by setting the SQL config 
`spark.sql.parser.escapedStringLiterals` to `true`.
+
  Examples
 
 ```sql
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index 69d90296d7ff..87ea8b5a102a 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -77,7 +77,7 @@ abstract class StringRegexExpression extends BinaryExpression
   }
 }
 
-// scalastyle:off line.contains.tab
+// scalastyle:off line.contains.tab line.size.limit
 /**
  * Simple RegEx pattern matching function
  */
@@ -92,11 +92,14 @@ abstract class StringRegexExpression extends 
BinaryExpression
   _ matches any one character in the input (similar to . in posix 
regular expressions)\
   % matches zero or more characters in the input (similar to .* in 
posix regular
   expressions)
-  Since Spark 2.0, string literals are unescaped in our SQL parser. 
For example, in order
-  to match "\abc", the pattern should be "\\abc".
+  Since Spark 2.0, string literals are unescaped in our SQL parser, 
see the unescaping
+  rules at https://spark.apache.org/docs/latest/sql-ref-literals.html#string-literal;>String
 Literal.
+  For example, in order to match "\abc", the pattern should be 
"\\abc".
   When SQL config 'spark.sql.parser.escapedStringLiterals' is enabled, 
it falls back
   to Spark 1.6 behavior regarding string literal parsing. For example, 
if the config is
-  enabled, the pattern to match "\abc" should be "\abc".
+  enabled, the pattern to match "\abc" should be "\abc".
+  It's recommended to use a raw string literal (with the `r` prefix) 
to avoid escaping
+  special characters in the pattern string if exists.
   * escape - an character added since Spark 3.0. The default escape 
character is the '\'.
   If an escape character precedes a special symbol or another escape 
character, the
   following character is matched literally. It is invalid to escape 
any other character.
@@ -121,7 +124,7 @@ abstract class StringRegexExpression extends 
BinaryExpression
   """,
   since = "1.0.0",
   group = "predicate_funcs")
-// scalastyle:on line.contains.tab
+// scalastyle:on line.contains.tab line.size.limit
 case class Like(left: Expression, right: Expression, escapeChar: Char)
   extends StringRegexExpression {
 
@@ -207,11 +210,14 @@ case class Like(left: Expression, right: Expression, 
escapeChar: Char)
   _ matches any one character in the input (similar to . in posix 
regular expressions)
   % matches zero or more characters in the input (similar to .* in 
posix regular
   expressions)
-  Since Spark 2.0, string literals are unescaped in our SQL parser. 
For example, in order
-  to 

[spark] branch master updated: [SPARK-45423][SQL] Lower `ParquetWriteSupport` log level to debug

2023-10-05 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 51973aa4c7b7 [SPARK-45423][SQL] Lower `ParquetWriteSupport` log level 
to debug
51973aa4c7b7 is described below

commit 51973aa4c7b7d6b1b0b98b943f4cf78e27475519
Author: Dongjoon Hyun 
AuthorDate: Thu Oct 5 10:21:53 2023 -0700

[SPARK-45423][SQL] Lower `ParquetWriteSupport` log level to debug

### What changes were proposed in this pull request?

This PR aims to lower `ParquetWriteSupport` log level from INFO to DEBUG

### Why are the changes needed?

Currently, `ParquetWriteSupport` is too verbose at the INFO level because it dumps the Parquet file schema for every file. Since this is the only log message in `ParquetWriteSupport`, users can still see it via a proper `log4j2.properties` setting when they want to debug jobs.
 ```
23/10/05 16:29:43 INFO ParquetOutputFormat: ParquetRecordWriter [block 
size: 134217728b, row group padding size: 8388608b, validating: false]
23/10/05 16:29:43 INFO ParquetWriteSupport: Initialized Parquet 
WriteSupport with Catalyst schema:
{
  "type" : "struct",
  "fields" : [ {
"name" : "id",
"type" : "long",
"nullable" : false,
"metadata" : { }
  } ]
}
and corresponding Parquet message type:
message spark_schema {
  required int64 id;
}

23/10/05 16:29:43 INFO MagicCommitTracker: ...
```
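
For reference, a hypothetical `log4j2.properties` fragment that re-enables this schema dump after the change; the logger name comes from the class's package, while the logger id `parquetWrite` is arbitrary:

```properties
# Re-enable the per-file schema dump from ParquetWriteSupport only.
logger.parquetWrite.name = org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport
logger.parquetWrite.level = debug
```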

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43230 from dongjoon-hyun/SPARK-45423.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 .../spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala   | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
index f6248d43c48e..9535bbd585bc 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetWriteSupport.scala
@@ -132,7 +132,7 @@ class ParquetWriteSupport extends WriteSupport[InternalRow] 
with Logging {
   }
 }
 
-logInfo(
+logDebug(
   s"""Initialized Parquet WriteSupport with Catalyst schema:
  |${schema.prettyJson}
  |and corresponding Parquet message type:





[spark] branch master updated: [SPARK-45231][INFRA] Remove unrecognized and meaningless command about `Ammonite` from the GA testing workflow

2023-10-05 Thread hvanhovell
This is an automated email from the ASF dual-hosted git repository.

hvanhovell pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 5446f548bbc8 [SPARK-45231][INFRA] Remove unrecognized and meaningless 
command about `Ammonite` from the GA testing workflow
5446f548bbc8 is described below

commit 5446f548bbc8a93414f1c773a8daf714b57b7d1a
Author: panbingkun 
AuthorDate: Thu Oct 5 13:13:43 2023 -0400

[SPARK-45231][INFRA] Remove unrecognized and meaningless command about 
`Ammonite` from the GA testing workflow

### What changes were proposed in this pull request?
This PR aims to remove an unrecognized and meaningless command involving `amm` from the GA testing workflow.

### Why are the changes needed?
- When observing GA's logs, I found the following:
```
Run # Fix for TTY related issues when launching the Ammonite REPL in tests.
sh: 1: amm: not found
```
e.g.:
   
https://github.com/apache/spark/actions/runs/6243934856/job/16950117287#step:10:21
   
https://github.com/panbingkun/spark/actions/runs/6232228999/job/16924063382#step:10:22

   Obviously, the `amm` command was not found. Through trial and error, it was found that the above command does not need to be executed in our GA workflow.
- Enhance maintainability and reduce misunderstandings.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #42993 from panbingkun/remove_amm_test.

Authored-by: panbingkun 
Signed-off-by: Herman van Hovell 
---
 .github/workflows/build_and_test.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 669cd76bd72f..5dce503a1799 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -261,7 +261,7 @@ jobs:
   shell: 'script -q -e -c "bash {0}"'
   run: |
 # Fix for TTY related issues when launching the Ammonite REPL in tests.
-export TERM=vt100 && script -qfc 'echo exit | amm -s' && rm typescript
+export TERM=vt100
 # Hive "other tests" test needs larger metaspace size based on 
experiment.
 if [[ "$MODULES_TO_TEST" == "hive" ]] && [[ "$EXCLUDED_TAGS" == 
"org.apache.spark.tags.SlowHiveTest" ]]; then export METASPACE_SIZE=2g; fi
 export SERIAL_SBT_TESTS=1





[spark] branch master updated: [SPARK-45408][CORE] Add RPC SSL settings to TransportConf

2023-10-05 Thread mridulm80
This is an automated email from the ASF dual-hosted git repository.

mridulm80 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f1e9dc2a4b3 [SPARK-45408][CORE] Add RPC SSL settings to TransportConf
f1e9dc2a4b3 is described below

commit f1e9dc2a4b31f597f7b72e6eda137e990c7b3980
Author: Hasnain Lakhani 
AuthorDate: Thu Oct 5 11:40:38 2023 -0500

[SPARK-45408][CORE] Add RPC SSL settings to TransportConf

### What changes were proposed in this pull request?

This change adds new settings to `TransportConf` which are needed for the RPC SSL functionality to work. Additionally, it adds some sample configurations which are used by tests in follow-up PRs (see https://github.com/apache/spark/pull/42685 for the full context).
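
As a hedged sketch (not part of this commit), the new keys could be supplied through `SparkConf`, assuming they are forwarded to `TransportConf` the same way other `spark.ssl.*` settings are; all paths and passwords below are placeholders:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ssl.rpc.enabled", "true")
  .set("spark.ssl.rpc.keyStore", "/etc/spark/ssl/keystore.jks")      // hypothetical path
  .set("spark.ssl.rpc.keyStorePassword", "keystore-secret")          // placeholder
  .set("spark.ssl.rpc.trustStore", "/etc/spark/ssl/truststore.jks")  // hypothetical path
  .set("spark.ssl.rpc.trustStorePassword", "truststore-secret")      // placeholder
  .set("spark.ssl.rpc.enabledAlgorithms", "TLS_AES_128_GCM_SHA256")  // optional cipher list
```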

### Why are the changes needed?

These changes are needed so that other modules can easily access these configurations, and so that the sample configurations are easily accessible for tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test, then ran:

```
./build/sbt
> project network-common
> testOnly org.apache.spark.network.TransportConfSuite
```

There are more follow-up tests coming (see https://github.com/apache/spark/pull/42685).

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #43220 from hasnain-db/spark-tls-configs-low.

Authored-by: Hasnain Lakhani 
Signed-off-by: Mridul Muralidharan
---
 .../apache/spark/network/util/TransportConf.java   | 152 +
 .../src/test/java/TransportConfSuite.java  |  88 
 .../apache/spark/network/ssl/SslSampleConfigs.java | 235 +
 3 files changed, 475 insertions(+)

diff --git 
a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
 
b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
index b8d8f6b85a4..3ebb38e310f 100644
--- 
a/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
+++ 
b/common/network-common/src/main/java/org/apache/spark/network/util/TransportConf.java
@@ -17,6 +17,7 @@
 
 package org.apache.spark.network.util;
 
+import java.io.File;
 import java.util.Locale;
 import java.util.Properties;
 import java.util.concurrent.TimeUnit;
@@ -257,6 +258,157 @@ public class TransportConf {
   conf.get("spark.network.ssl.maxEncryptedBlockSize", "64k")));
   }
 
+  /**
+   * Whether Secure (SSL/TLS) RPC (including Block Transfer Service) is enabled
+   */
+  public boolean sslRpcEnabled() {
+return conf.getBoolean("spark.ssl.rpc.enabled", false);
+  }
+
+  /**
+   * SSL protocol (remember that SSLv3 was compromised) supported by Java
+   */
+  public String sslRpcProtocol() {
+return conf.get("spark.ssl.rpc.protocol", null);
+  }
+
+  /**
+   * A comma separated list of ciphers
+   */
+  public String[] sslRpcRequestedCiphers() {
+String ciphers = conf.get("spark.ssl.rpc.enabledAlgorithms", null);
+return (ciphers != null ? ciphers.split(",") : null);
+  }
+
+  /**
+   * The key-store file; can be relative to the current directory
+   */
+  public File sslRpcKeyStore() {
+String keyStore = conf.get("spark.ssl.rpc.keyStore", null);
+if (keyStore != null) {
+  return new File(keyStore);
+} else {
+  return null;
+}
+  }
+
+  /**
+   * The password to the key-store file
+   */
+  public String sslRpcKeyStorePassword() {
+return conf.get("spark.ssl.rpc.keyStorePassword", null);
+  }
+
+  /**
+   * A PKCS#8 private key file in PEM format; can be relative to the current 
directory
+   */
+  public File sslRpcPrivateKey() {
+String privateKey = conf.get("spark.ssl.rpc.privateKey", null);
+if (privateKey != null) {
+  return new File(privateKey);
+} else {
+  return null;
+}
+  }
+
+  /**
+   * The password to the private key
+   */
+  public String sslRpcKeyPassword() {
+return conf.get("spark.ssl.rpc.keyPassword", null);
+  }
+
+  /**
+   * A X.509 certificate chain file in PEM format; can be relative to the 
current directory
+   */
+  public File sslRpcCertChain() {
+String certChain = conf.get("spark.ssl.rpc.certChain", null);
+if (certChain != null) {
+  return new File(certChain);
+} else {
+  return null;
+}
+  }
+
+  /**
+   * The trust-store file; can be relative to the current directory
+   */
+  public File sslRpcTrustStore() {
+String trustStore = conf.get("spark.ssl.rpc.trustStore", null);
+if (trustStore != null) {
+  return new File(trustStore);
+} else {
+  return null;
+}
+  }
+
+  /**
+   * The password to the trust-store file
+   */
+  public String sslRpcTrustStorePassword() {
+return 

[spark] branch branch-3.5 updated: [SPARK-45250][CORE] Support stage level task resource profile for yarn cluster when dynamic allocation disabled

2023-10-05 Thread tgraves
This is an automated email from the ASF dual-hosted git repository.

tgraves pushed a commit to branch branch-3.5
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.5 by this push:
 new c50e371c2d3 [SPARK-45250][CORE] Support stage level task resource 
profile for yarn cluster when dynamic allocation disabled
c50e371c2d3 is described below

commit c50e371c2d3ccba9340bc8980add0753f2d7a86b
Author: Bobby Wang 
AuthorDate: Mon Oct 2 23:00:56 2023 -0500

[SPARK-45250][CORE] Support stage level task resource profile for yarn 
cluster when dynamic allocation disabled

### What changes were proposed in this pull request?
This PR is a follow-up of https://github.com/apache/spark/pull/37268, which supported stage-level task resource profiles for standalone clusters when dynamic allocation is disabled. This PR enables stage-level task resource profiles for YARN clusters.
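
A minimal sketch of the usage this enables on YARN with dynamic allocation off, mirroring the profile built in `ResourceProfileManagerSuite` below; the `SparkContext` `sc`, the RDD, and the GPU amount are placeholders:

```scala
import org.apache.spark.resource.{TaskResourceProfile, TaskResourceRequests}

// Ask for 1 GPU per task in this stage only; executor resources are unchanged.
val gpuTaskReqs = new TaskResourceRequests().resource("gpu", 1)
val taskProfile = new TaskResourceProfile(gpuTaskReqs.requests)

sc.parallelize(1 to 100, 4)
  .withResources(taskProfile)
  .map(_ * 2)
  .collect()
```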

### Why are the changes needed?

Users who run Spark ML/DL workloads on YARN would expect the stage-level task resource profile feature.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

The current tests of https://github.com/apache/spark/pull/37268 also cover this PR, since both YARN and standalone clusters share the same TaskSchedulerImpl class, which implements this feature. Beyond that, the existing tests were modified to cover the YARN cluster case, and I also performed some manual tests, which are documented in the PR comments.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #43030 from wbo4958/yarn-task-resoure-profile.

Authored-by: Bobby Wang 
Signed-off-by: Mridul Muralidharan
(cherry picked from commit 5b80639e643b6dd09dd64c3f43ec039b2ef2f9fd)
Signed-off-by: Thomas Graves 
---
 .../apache/spark/resource/ResourceProfileManager.scala|  6 +++---
 .../spark/resource/ResourceProfileManagerSuite.scala  | 15 +--
 docs/configuration.md |  2 +-
 docs/running-on-yarn.md   |  6 +-
 4 files changed, 22 insertions(+), 7 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/resource/ResourceProfileManager.scala 
b/core/src/main/scala/org/apache/spark/resource/ResourceProfileManager.scala
index 9f98d4d9c9c..cd7124a5724 100644
--- a/core/src/main/scala/org/apache/spark/resource/ResourceProfileManager.scala
+++ b/core/src/main/scala/org/apache/spark/resource/ResourceProfileManager.scala
@@ -67,9 +67,9 @@ private[spark] class ResourceProfileManager(sparkConf: 
SparkConf,
*/
   private[spark] def isSupported(rp: ResourceProfile): Boolean = {
 if (rp.isInstanceOf[TaskResourceProfile] && !dynamicEnabled) {
-  if ((notRunningUnitTests || testExceptionThrown) && 
!isStandaloneOrLocalCluster) {
-throw new SparkException("TaskResourceProfiles are only supported for 
Standalone " +
-  "cluster for now when dynamic allocation is disabled.")
+  if ((notRunningUnitTests || testExceptionThrown) && 
!(isStandaloneOrLocalCluster || isYarn)) {
+throw new SparkException("TaskResourceProfiles are only supported for 
Standalone and " +
+  "Yarn cluster for now when dynamic allocation is disabled.")
   }
 } else {
   val isNotDefaultProfile = rp.id != 
ResourceProfile.DEFAULT_RESOURCE_PROFILE_ID
diff --git 
a/core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala
 
b/core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala
index e97d5c7883a..77dc7bcb4c5 100644
--- 
a/core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala
+++ 
b/core/src/test/scala/org/apache/spark/resource/ResourceProfileManagerSuite.scala
@@ -126,18 +126,29 @@ class ResourceProfileManagerSuite extends SparkFunSuite {
 val defaultProf = rpmanager.defaultResourceProfile
 assert(rpmanager.isSupported(defaultProf))
 
-// task resource profile.
+// Standalone: supports task resource profile.
 val gpuTaskReq = new TaskResourceRequests().resource("gpu", 1)
 val taskProf = new TaskResourceProfile(gpuTaskReq.requests)
 assert(rpmanager.isSupported(taskProf))
 
+// Local: doesn't support task resource profile.
 conf.setMaster("local")
 rpmanager = new ResourceProfileManager(conf, listenerBus)
 val error = intercept[SparkException] {
   rpmanager.isSupported(taskProf)
 }.getMessage
 assert(error === "TaskResourceProfiles are only supported for Standalone " 
+
-  "cluster for now when dynamic allocation is disabled.")
+  "and Yarn cluster for now when dynamic allocation is disabled.")
+
+// Local cluster: supports task resource profile.
+conf.setMaster("local-cluster[1, 1, 1024]")
+rpmanager = new ResourceProfileManager(conf, listenerBus)
+

[spark] branch master updated: [SPARK-45420][SQL][PYTHON][CONNECT] Add DataType.fromDDL into PySpark

2023-10-05 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7984b1a4b2d [SPARK-45420][SQL][PYTHON][CONNECT] Add DataType.fromDDL 
into PySpark
7984b1a4b2d is described below

commit 7984b1a4b2da6586358397e90d1c28ae73aca6ce
Author: Hyukjin Kwon 
AuthorDate: Thu Oct 5 18:07:21 2023 +0900

[SPARK-45420][SQL][PYTHON][CONNECT] Add DataType.fromDDL into PySpark

### What changes were proposed in this pull request?

This PR implements `DataType.fromDDL` as the parity to Scala API:


https://github.com/apache/spark/blob/350b8d8388c9ad15303d39f22b249b8c73785695/sql/api/src/main/scala/org/apache/spark/sql/types/DataType.scala#L121-L126

One difference is that the Python API also supports the legacy format that omits the top-level `struct<...>` wrapper, e.g., `a: int, b: int`.

### Why are the changes needed?

In order for the end users to parse the DDL formatted type easily.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new user-facing API: `DataType.fromDDL`.

### How was this patch tested?

Unittests were added, and manually tested them too.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43226 from HyukjinKwon/addfromDDL.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/tests/test_types.py | 12 +
 python/pyspark/sql/types.py| 48 ++
 2 files changed, 60 insertions(+)

diff --git a/python/pyspark/sql/tests/test_types.py 
b/python/pyspark/sql/tests/test_types.py
index fb752b93a33..c6e70da1f8d 100644
--- a/python/pyspark/sql/tests/test_types.py
+++ b/python/pyspark/sql/tests/test_types.py
@@ -28,6 +28,7 @@ from pyspark.sql import Row
 from pyspark.sql import functions as F
 from pyspark.errors import AnalysisException, PySparkTypeError, 
PySparkValueError
 from pyspark.sql.types import (
+DataType,
 ByteType,
 ShortType,
 IntegerType,
@@ -1288,6 +1289,17 @@ class TypesTestsMixin:
 schema1 = self.spark.range(1).select(F.make_interval(F.lit(1))).schema
 self.assertEqual(schema1.fields[0].dataType, CalendarIntervalType())
 
+def test_from_ddl(self):
+self.assertEqual(DataType.fromDDL("long"), LongType())
+self.assertEqual(
+DataType.fromDDL("a: int, b: string"),
+StructType([StructField("a", IntegerType()), StructField("b", 
StringType())]),
+)
+self.assertEqual(
+DataType.fromDDL("a int, b string"),
+StructType([StructField("a", IntegerType()), StructField("b", 
StringType())]),
+)
+
 
 class DataTypeTests(unittest.TestCase):
 # regression test for SPARK-6055
diff --git a/python/pyspark/sql/types.py b/python/pyspark/sql/types.py
index bf63bea69ad..01db75b2500 100644
--- a/python/pyspark/sql/types.py
+++ b/python/pyspark/sql/types.py
@@ -139,6 +139,54 @@ class DataType:
 """
 return obj
 
+@classmethod
+def fromDDL(cls, ddl: str) -> "DataType":
+"""
+Creates :class:`DataType` for a given DDL-formatted string.
+
+.. versionadded:: 4.0.0
+
+Parameters
+--
+ddl : str
+DDL-formatted string representation of types, e.g.
+:class:`pyspark.sql.types.DataType.simpleString`, except that top 
level struct
+type can omit the ``struct<>`` for the compatibility reason with
+``spark.createDataFrame`` and Python UDFs.
+
+Returns
+---
+:class:`DataType`
+
+Examples
+
+Create a StructType by the corresponding DDL formatted string.
+
+>>> from pyspark.sql.types import DataType
+>>> DataType.fromDDL("b string, a int")
+StructType([StructField('b', StringType(), True), StructField('a', 
IntegerType(), True)])
+
+Create a single DataType by the corresponding DDL formatted string.
+
+>>> DataType.fromDDL("decimal(10,10)")
+DecimalType(10,10)
+
+Create a StructType by the legacy string format.
+
+>>> DataType.fromDDL("b: string, a: int")
+StructType([StructField('b', StringType(), True), StructField('a', 
IntegerType(), True)])
+"""
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import udf
+
+# Intentionally uses SparkSession so one implementation can be shared 
with/without
+# Spark Connect.
+schema = (
+SparkSession.active().range(0).select(udf(lambda x: x, 
returnType=ddl)("id")).schema
+)
+assert len(schema) == 1
+return schema[0].dataType
+
 
 # This singleton pattern does not work with pickle, you will get
 # another object 

[spark] branch master updated (350b8d8388c -> fa71fc3c25b)

2023-10-05 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 350b8d8388c [PYTHON][DOCS] Fix incorrect doc in 
pyspark.sql.DataFrame.repartition
 add fa71fc3c25b [SPARK-43656][CONNECT][PS][TESTS] Enable numpy compat 
tests for Spark Connect

No new revisions were added by this update.

Summary of changes:
 .../pyspark/pandas/tests/connect/test_parity_numpy_compat.py | 12 
 python/pyspark/pandas/tests/test_numpy_compat.py |  5 -
 2 files changed, 4 insertions(+), 13 deletions(-)

