[spark] branch master updated: [SPARK-41780][SQL] Should throw INVALID_PARAMETER_VALUE.PATTERN when the parameters `regexp` is invalid

2023-01-09 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 15a0f55246b [SPARK-41780][SQL] Should throw 
INVALID_PARAMETER_VALUE.PATTERN when the parameters `regexp` is invalid
15a0f55246b is described below

commit 15a0f55246bee7b043bd6081f53744fbf74403eb
Author: panbingkun 
AuthorDate: Mon Jan 9 11:37:54 2023 +0300

[SPARK-41780][SQL] Should throw INVALID_PARAMETER_VALUE.PATTERN when the 
parameters `regexp` is invalid

### What changes were proposed in this pull request?
In this PR, I propose to throw the error class `INVALID_PARAMETER_VALUE.PATTERN` 
when the `regexp` parameter of regexp_replace, regexp_extract, and rlike is invalid.

### Why are the changes needed?
A clear error message should improve the user experience with Spark SQL.
The original error message is shown here:
https://user-images.githubusercontent.com/15246973/210493673-c1de9927-9a18-4f9d-a94c-48735b6c5e5a.png
Valid: [a\\d]{0,2}
Invalid: [a\\d]{0, 2}

![image](https://user-images.githubusercontent.com/15246973/210494925-cb6c8043-de02-4c8e-9b40-225350422340.png)
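
A hedged illustration of the behavior change (assuming a spark-shell session named `spark` on a build that contains this patch; the two patterns are the ones quoted above, and the exact message text is whatever the error class renders):

```scala
// Minimal sketch: only the second pattern (with a space inside {0, 2}) is invalid.

// Valid quantifier: compiles and evaluates normally.
spark.sql("""SELECT regexp_replace('abc', '[a\\d]{0,2}', 'x')""").show()

// Invalid quantifier: expected to surface INVALID_PARAMETER_VALUE.PATTERN
// instead of a raw java.util.regex.PatternSyntaxException.
try {
  spark.sql("""SELECT regexp_replace('abc', '[a\\d]{0, 2}', 'x')""").show()
} catch {
  case e: Exception => println(e.getMessage)
}
```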

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Add new UT.
Pass GA.

Closes #39383 from panbingkun/SPARK-41780.

Authored-by: panbingkun 
Signed-off-by: Max Gekk 
---
 .../catalyst/expressions/regexpExpressions.scala   | 20 ++--
 .../expressions/RegexpExpressionsSuite.scala   | 29 -
 .../apache/spark/sql/StringFunctionsSuite.scala| 38 ++
 3 files changed, 76 insertions(+), 11 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
index c86dcfb3b96..29510bc3852 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
@@ -57,7 +57,12 @@ abstract class StringRegexExpression extends BinaryExpression
 null
   } else {
 // Let it raise exception if couldn't compile the regex string
-Pattern.compile(escape(str))
+try {
+  Pattern.compile(escape(str))
+} catch {
+  case e: PatternSyntaxException =>
+throw QueryExecutionErrors.invalidPatternError(prettyName, 
e.getPattern, e)
+}
   }
 
   protected def pattern(str: String) = if (cache == null) compile(str) else 
cache
@@ -634,7 +639,12 @@ case class RegExpReplace(subject: Expression, regexp: 
Expression, rep: Expressio
 if (!p.equals(lastRegex)) {
   // regex value changed
   lastRegex = p.asInstanceOf[UTF8String].clone()
-  pattern = Pattern.compile(lastRegex.toString)
+  try {
+pattern = Pattern.compile(lastRegex.toString)
+  } catch {
+case e: PatternSyntaxException =>
+  throw QueryExecutionErrors.invalidPatternError(prettyName, 
e.getPattern, e)
+  }
 }
 if (!r.equals(lastReplacementInUTF8)) {
   // replacement string changed
@@ -688,7 +698,11 @@ case class RegExpReplace(subject: Expression, regexp: 
Expression, rep: Expressio
   if (!$regexp.equals($termLastRegex)) {
 // regex value changed
 $termLastRegex = $regexp.clone();
-$termPattern = $classNamePattern.compile($termLastRegex.toString());
+try {
+  $termPattern = $classNamePattern.compile($termLastRegex.toString());
+} catch (java.util.regex.PatternSyntaxException e) {
+  throw QueryExecutionErrors.invalidPatternError("$prettyName", 
e.getPattern(), e);
+}
   }
   if (!$rep.equals($termLastReplacementInUTF8)) {
 // replacement string changed
diff --git 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala
 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala
index 8b5e303849c..af051a1a9bc 100644
--- 
a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala
+++ 
b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/RegexpExpressionsSuite.scala
@@ -279,14 +279,27 @@ class RegexpExpressionsSuite extends SparkFunSuite with 
ExpressionEvalHelper {
 checkLiteralRow("abc"  rlike _, "^bc", false)
 checkLiteralRow("abc"  rlike _, "^ab", true)
 checkLiteralRow("abc"  rlike _, "^bc", false)
-
-intercept[java.util.regex.PatternSyntaxException] {
-  evaluateWithoutCodegen("ac" rlike "**")
-}
-intercept[java.util.regex.PatternSyntaxException] {
-  val regex = $"a".string.at(0)
-  evaluateWithoutCodegen("ac" rlike regex, create_row("**"

[spark] branch master updated: [MINOR] Fix wrong file name

2023-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c2a5cbf2088 [MINOR] Fix wrong file name
c2a5cbf2088 is described below

commit c2a5cbf2088b33d9a518ffe1409db05d373a3b25
Author: dengziming 
AuthorDate: Mon Jan 9 19:17:09 2023 +0900

[MINOR] Fix wrong file name

### What changes were proposed in this pull request?
Fix the incorrect file names in the usage documentation.

### Why are the changes needed?
Improve the docs and usage output.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #39465 from dengziming/filename-typo.

Authored-by: dengziming 
Signed-off-by: Hyukjin Kwon 
---
 dev/connect-check-protos.py | 2 +-
 dev/connect-gen-protos.sh   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/dev/connect-check-protos.py b/dev/connect-check-protos.py
index b902274b1f4..a38a399d877 100755
--- a/dev/connect-check-protos.py
+++ b/dev/connect-check-protos.py
@@ -18,7 +18,7 @@
 #
 
 # Utility for checking whether generated codes in PySpark are out of sync.
-#   usage: ./dev/check-codegen-python.py
+#   usage: ./dev/connect-check-protos.py
 
 import os
 import sys
diff --git a/dev/connect-gen-protos.sh b/dev/connect-gen-protos.sh
index cb5b66379b2..c4fc8d385f2 100755
--- a/dev/connect-gen-protos.sh
+++ b/dev/connect-gen-protos.sh
@@ -20,7 +20,7 @@ set -ex
 
 if [[ $# -gt 1 ]]; then
   echo "Illegal number of parameters."
-  echo "Usage: ./dev/generate_protos.sh [path]"
+  echo "Usage: ./dev/connect-gen-protos.sh [path]"
   exit -1
 fi
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c2a5cbf2088 -> e08af6d753e)

2023-01-09 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c2a5cbf2088 [MINOR] Fix wrong file name
 add e08af6d753e [SPARK-41945][CONNECT][PYTHON] Python: connect client lost 
column data with pyarrow.Table.to_pylist

No new revisions were added by this update.

Summary of changes:
 python/pyspark/sql/connect/dataframe.py| 24 +++---
 .../sql/tests/connect/test_connect_basic.py| 18 ++--
 .../sql/tests/connect/test_parity_functions.py |  5 -
 3 files changed, 33 insertions(+), 14 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41944][CONNECT] Pass configurations when local remote mode is on

2023-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 55fe522bbb4 [SPARK-41944][CONNECT] Pass configurations when local 
remote mode is on
55fe522bbb4 is described below

commit 55fe522bbb45ffb4ca59ec664ec0b59d41fa35da
Author: Hyukjin Kwon 
AuthorDate: Mon Jan 9 21:42:32 2023 +0900

[SPARK-41944][CONNECT] Pass configurations when local remote mode is on

### What changes were proposed in this pull request?

This PR mainly proposes to pass the user-specified configurations to local 
remote mode.

Previously, all user-specified configurations were ignored in the PySpark shell 
(e.g., `./bin/pyspark`) and in a plain Python interpreter; the PySpark 
application submission case was fine.

Now, configurations are properly passed to the server side; e.g., with 
`./bin/pyspark --remote local --conf aaa=bbb`, `aaa=bbb` reaches the server.

For `spark.master` and `spark.plugins`, user-specified configurations are 
respected. If they are unset, they are set automatically, e.g., to 
`org.apache.spark.sql.connect.SparkConnectPlugin`. If they are set, users have 
to provide the proper values to overwrite them, meaning either:

```bash
./bin/pyspark --remote local --conf 
spark.plugins="other.Plugin,org.apache.spark.sql.connect.SparkConnectPlugin"
```

or

```bash
./bin/pyspark --remote local
```

In addition, this PR fixes the related code as follows:
- Adds the `spark.local.connect` internal configuration to be used in Spark 
Submit (so we don't have to parse and manipulate user-specified arguments in 
Python in order to remove the `--remote` or `spark.remote` configuration).
- Adds more validation of arguments in `SparkSubmitCommandBuilder` so that 
invalid combinations fail fast (e.g., setting both remote and master, like 
`--master ...` together with `--conf spark.remote=...`).
- In dev mode, no longer sets `spark.jars`, since the jars are added to the 
class path of the JVM through `addJarToCurrentClassLoader`.

### Why are the changes needed?

To correctly pass the configurations specified by users.

### Does this PR introduce _any_ user-facing change?

No, Spark Connect has not been released yet.
This is a follow-up of https://github.com/apache/spark/pull/39441 that 
completes its support.

### How was this patch tested?

Manually tested all combinations such as:

```bash
./bin/pyspark --conf spark.remote=local
./bin/pyspark --conf spark.remote=local --conf spark.jars=a
./bin/pyspark --conf spark.remote=local --jars 
/.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar
./bin/spark-submit --conf spark.remote=local --jars 
/.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar
 app.py
./bin/pyspark --conf spark.remote=local --conf 
spark.jars=/.../spark/connector/connect/server/target/scala-2.12/spark-connect-assembly-3.4.0-SNAPSHOT.jar
./bin/pyspark --master "local[*]" --remote "local"
./bin/spark-submit --conf spark.remote=local app.py
./bin/spark-submit --master="local[*]" --conf spark.remote=local app.py
./bin/spark-submit --master="local[*]" --remote=local app.py
./bin/pyspark --master "local[*]" --remote "local"
./bin/pyspark --master "local[*]" --remote "local"
./bin/pyspark --master "local[*]" --conf spark.remote="local"
./bin/spark-submit --master="local[*]" --remote=local app.py
./bin/pyspark --remote local
```

Closes #39463 from HyukjinKwon/SPARK-41933-conf.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/deploy/SparkSubmit.scala  |  12 ++-
 .../apache/spark/deploy/SparkSubmitArguments.scala |   3 +-
 .../org/apache/spark/launcher/SparkLauncher.java   |   4 +
 .../spark/launcher/SparkSubmitCommandBuilder.java  |  20 ++--
 python/pyspark/sql/connect/session.py  | 118 ++---
 5 files changed, 82 insertions(+), 75 deletions(-)

diff --git a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala 
b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
index 745836dfbef..65e4367b33a 100644
--- a/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
+++ b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
@@ -226,12 +226,13 @@ private[spark] class SparkSubmit extends Logging {
 val childArgs = new ArrayBuffer[String]()
 val childClasspath = new ArrayBuffer[String]()
 val sparkConf = args.toSparkConf()
+if (sparkConf.contains("spark.local.connect")) 
sparkConf.remove("spark.remote")
 var childMainClass = ""
 
 // Set the cluster manager
 val clusterManager: Int = ar

[spark] branch master updated: [SPARK-41860][SQL] Make AvroScanBuilder and JsonScanBuilder case classes

2023-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 1f1c0016fad [SPARK-41860][SQL] Make AvroScanBuilder and 
JsonScanBuilder case classes
1f1c0016fad is described below

commit 1f1c0016fad234ba9d939a8269adc3059842687a
Author: Lorenzo Martini 
AuthorDate: Mon Jan 9 09:53:22 2023 -0800

[SPARK-41860][SQL] Make AvroScanBuilder and JsonScanBuilder case classes

### What changes were proposed in this pull request?
Convert `AvroScanBuilder` and `JsonScanBuilder` from plain classes to case 
classes.

All the other existing `ScanBuilder`s inheriting from `FileScanBuilder` are 
case classes, which is very convenient. However, the JSON and Avro ones are not, 
which makes working with them quite inconvenient. There is no particular reason 
these two should not be case classes as well: they are very similar to the other 
ones, simple, and stateless. Moreover, `AvroScan` and `JsonScan` are already 
case classes like the other scans.

### Why are the changes needed?
When working with the different `FileScanBuilder` implementations, it is really 
problematic that these two scan builders are not case classes, especially given 
all the others are. This is a very simple change that helps development greatly 
and allows for more flexibility / feature development when working with 
ScanBuilders (see the sketch below).
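
A rough illustration of what the case-class form buys, pasted into a Scala REPL; the class and fields below are stand-ins, not the real builder constructors:

```scala
// Illustrative sketch only: value equality, copy() and pattern matching come
// for free with case classes, which plain classes extending FileScanBuilder
// did not offer without extra boilerplate.
case class ExampleScanBuilder(table: String, pushedFilters: Seq[String])

val base = ExampleScanBuilder("events", Seq("day = '2023-01-09'"))
val updated = base.copy(pushedFilters = base.pushedFilters :+ "country = 'US'")

assert(base != updated)  // compared by value, not reference
updated match {
  case ExampleScanBuilder(table, filters) =>
    println(s"$table with ${filters.size} pushed filters")
}
```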

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need for testing, as it is a simple change from `class` to `case class`.

Closes #39366 from LorenzoMartini/patch-1.

Authored-by: Lorenzo Martini 
Signed-off-by: Dongjoon Hyun 
---
 .../src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala   | 2 +-
 .../spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git 
a/connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala
 
b/connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala
index 08dcaa7d087..754c58e65b0 100644
--- 
a/connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala
+++ 
b/connector/avro/src/main/scala/org/apache/spark/sql/v2/avro/AvroScanBuilder.scala
@@ -24,7 +24,7 @@ import org.apache.spark.sql.sources.Filter
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.util.CaseInsensitiveStringMap
 
-class AvroScanBuilder (
+case class AvroScanBuilder (
 sparkSession: SparkSession,
 fileIndex: PartitioningAwareFileIndex,
 schema: StructType,
diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala
index 788ccfac6e7..dcae6bd3fd0 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/json/JsonScanBuilder.scala
@@ -24,7 +24,7 @@ import org.apache.spark.sql.sources.Filter
 import org.apache.spark.sql.types.StructType
 import org.apache.spark.sql.util.CaseInsensitiveStringMap
 
-class JsonScanBuilder (
+case class JsonScanBuilder (
 sparkSession: SparkSession,
 fileIndex: PartitioningAwareFileIndex,
 schema: StructType,


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (1f1c0016fad -> c292bd7a345)

2023-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 1f1c0016fad [SPARK-41860][SQL] Make AvroScanBuilder and 
JsonScanBuilder case classes
 add c292bd7a345 [SPARK-41848][CORE] Fixing task over-scheduled with 
TaskResourceProfile

No new revisions were added by this update.

Summary of changes:
 .../executor/CoarseGrainedExecutorBackend.scala|  3 +-
 .../scala/org/apache/spark/executor/Executor.scala |  2 +-
 .../cluster/CoarseGrainedClusterMessage.scala  | 10 ++-
 .../cluster/CoarseGrainedSchedulerBackend.scala| 16 ++--
 .../CoarseGrainedExecutorBackendSuite.scala| 25 +-
 .../CoarseGrainedSchedulerBackendSuite.scala   | 96 +-
 6 files changed, 135 insertions(+), 17 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41947][CORE][DOCS] Update the contents of error class guidelines

2023-01-09 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 786594734bd [SPARK-41947][CORE][DOCS] Update the contents of error 
class guidelines
786594734bd is described below

commit 786594734bd79017ebd42eb117b62958afad07bb
Author: itholic 
AuthorDate: Mon Jan 9 23:24:09 2023 +0300

[SPARK-41947][CORE][DOCS] Update the contents of error class guidelines

### What changes were proposed in this pull request?

This PR proposes to update the error class guidelines in 
`core/src/main/resources/error/README.md`.
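
For context, the convention the updated README documents (named message parameters as a `Map` instead of a positional `Seq`) looks roughly like the sketch below; the class and error names mirror the illustrative example in the diff further down, and the message rendering here is simplified:

```scala
// Sketch only: SparkTestException / PROBLEM_BECAUSE are the README's
// illustrative names, not production classes.
import scala.collection.JavaConverters._
import org.apache.spark.SparkThrowable

class SparkTestException(
    errorClass: String,
    messageParameters: Map[String, String])
  extends Exception(s"[$errorClass] " + messageParameters.mkString(", "))
  with SparkThrowable {

  override def getMessageParameters: java.util.Map[String, String] = messageParameters.asJava
  override def getErrorClass: String = errorClass
}

// Throw with an error class and named message parameters:
throw new SparkTestException("PROBLEM_BECAUSE", Map("problem" -> "A", "cause" -> "B"))
```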

### Why are the changes needed?

Because some of the contents are out of date and no longer describe the current 
behavior.

### Does this PR introduce _any_ user-facing change?

No. It only updates the developer guidelines for error classes.

### How was this patch tested?

The existing CI should pass.

Closes #39464 from itholic/SPARK-41947.

Authored-by: itholic 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/README.md | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/core/src/main/resources/error/README.md 
b/core/src/main/resources/error/README.md
index 23e62cd25fb..8ea9e37c27f 100644
--- a/core/src/main/resources/error/README.md
+++ b/core/src/main/resources/error/README.md
@@ -8,9 +8,9 @@ and message parameters rather than an arbitrary error message.
 1. Check if the error is an internal error.
Internal errors are bugs in the code that we do not expect users to 
encounter; this does not include unsupported operations.
If true, use the error class `INTERNAL_ERROR` and skip to step 4.
-2. Check if an appropriate error class already exists in `error-class.json`.
+2. Check if an appropriate error class already exists in `error-classes.json`.
If true, use the error class and skip to step 4.
-3. Add a new class to `error-class.json`; keep in mind the invariants below.
+3. Add a new class to `error-classes.json`; keep in mind the invariants below.
 4. Check if the exception type already extends `SparkThrowable`.
If true, skip to step 6.
 5. Mix `SparkThrowable` into the exception.
@@ -24,10 +24,10 @@ Throw with arbitrary error message:
 
 ### After
 
-`error-class.json`
+`error-classes.json`
 
 "PROBLEM_BECAUSE": {
-  "message": ["Problem %s because %s"],
+  "message": ["Problem  because "],
   "sqlState": "X"
 }
 
@@ -35,16 +35,18 @@ Throw with arbitrary error message:
 
 class SparkTestException(
 errorClass: String,
-messageParameters: Seq[String])
+messageParameters: Map[String, String])
   extends TestException(SparkThrowableHelper.getMessage(errorClass, 
messageParameters))
 with SparkThrowable {
 
-  def getErrorClass: String = errorClass
+  override def getMessageParameters: java.util.Map[String, String] = 
messageParameters.asJava
+
+  override def getErrorClass: String = errorClass
 }
 
 Throw with error class and message parameters:
 
-throw new SparkTestException("PROBLEM_BECAUSE", Seq("A", "B"))
+throw new SparkTestException("PROBLEM_BECAUSE", Map("problem" -> "A", 
"cause" -> "B"))
 
 ## Access fields
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41951][DOCS] Update SQL migration guide and documentations

2023-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c32ba2cb992 [SPARK-41951][DOCS] Update SQL migration guide and 
documentations
c32ba2cb992 is described below

commit c32ba2cb9925386498afb63cf42542e283391ace
Author: Dongjoon Hyun 
AuthorDate: Mon Jan 9 15:16:41 2023 -0800

[SPARK-41951][DOCS] Update SQL migration guide and documentations

### What changes were proposed in this pull request?

This PR aims to bring the `SQL migration guide` and documentation up to date 
for Apache Spark 3.4.0.

### Why are the changes needed?

- We changed the default values of some configurations in Apache Spark 3.4.0.
- In addition, I added the missing non-internal configurations because they 
are already exposed to users in previous Spark releases (see the configuration 
sketch below).
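
A hedged sketch of how the new 3.4.0 defaults can be pinned back to the legacy behavior; the configuration names come from the migration-guide entry in the diff below, while the master and app name are illustrative:

```scala
// Sketch only: disable the nested-column vectorized readers, restoring
// the pre-3.4 behavior documented in the migration guide.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")                               // illustrative
  .appName("legacy-nested-vectorized-readers")      // illustrative
  .config("spark.sql.orc.enableNestedColumnVectorizedReader", "false")
  .config("spark.sql.parquet.enableNestedColumnVectorizedReader", "false")
  .getOrCreate()
```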

### Does this PR introduce _any_ user-facing change?

No, this is a documentation-only change.

### How was this patch tested?

Manual review.

Closes #39470 from dongjoon-hyun/SPARK-41951.

Authored-by: Dongjoon Hyun 
Signed-off-by: Dongjoon Hyun 
---
 docs/sql-data-sources-orc.md   |  3 +--
 docs/sql-migration-guide.md|  1 +
 docs/sql-performance-tuning.md | 58 ++
 3 files changed, 60 insertions(+), 2 deletions(-)

diff --git a/docs/sql-data-sources-orc.md b/docs/sql-data-sources-orc.md
index d4a50c9fab9..2b00b771b09 100644
--- a/docs/sql-data-sources-orc.md
+++ b/docs/sql-data-sources-orc.md
@@ -37,7 +37,6 @@ For example, historically, `native` implementation handles 
`CHAR/VARCHAR` with S
 
 `native` implementation supports a vectorized ORC reader and has been the 
default ORC implementation since Spark 2.3.
 The vectorized reader is used for the native ORC tables (e.g., the ones 
created using the clause `USING ORC`) when `spark.sql.orc.impl` is set to 
`native` and `spark.sql.orc.enableVectorizedReader` is set to `true`.
-For nested data types (array, map and struct), vectorized reader is disabled 
by default. Set `spark.sql.orc.enableNestedColumnVectorizedReader` to `true` to 
enable vectorized reader for these types.
 
 For the Hive ORC serde tables (e.g., the ones created using the clause `USING 
HIVE OPTIONS (fileFormat 'ORC')`),
 the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is 
also set to `true`, and is turned on by default.
@@ -173,7 +172,7 @@ When reading from Hive metastore ORC tables and inserting 
to Hive metastore ORC
   
   
 spark.sql.orc.enableNestedColumnVectorizedReader
-false
+true
 
   Enables vectorized orc decoding in native implementation 
for nested data types
   (array, map and struct). If 
spark.sql.orc.enableVectorizedReader is set to
diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index df48bf45b7d..0f80bb176f5 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -35,6 +35,7 @@ license: |
 - Valid values for `fmt` are case-insensitive `hex`, `base64`, `utf-8`, 
`utf8`.
   - Since Spark 3.4, Spark throws only `PartitionsAlreadyExistException` when 
it creates partitions but some of them exist already. In Spark 3.3 or earlier, 
Spark can throw either `PartitionsAlreadyExistException` or 
`PartitionAlreadyExistsException`.
   - Since Spark 3.4, Spark will do validation for partition spec in ALTER 
PARTITION to follow the behavior of `spark.sql.storeAssignmentPolicy` which may 
cause an exception if type conversion fails, e.g. `ALTER TABLE .. ADD 
PARTITION(p='a')` if column `p` is int type. To restore the legacy behavior, 
set `spark.sql.legacy.skipTypeValidationOnAlterPartition` to `true`.
+  - Since Spark 3.4, vectorized readers are enabled by default for the nested 
data types (array, map and struct). To restore the legacy behavior, set 
`spark.sql.orc.enableNestedColumnVectorizedReader` and 
`spark.sql.parquet.enableNestedColumnVectorizedReader` to `false`.
 
 ## Upgrading from Spark SQL 3.2 to 3.3
 
diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 6ac39d90527..d5503a2e62d 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -286,6 +286,27 @@ This feature coalesces the post shuffle partitions based 
on the map output stati

  
  
+### Spliting skewed shuffle partitions
+ 
+   Property NameDefaultMeaningSince 
Version
+   
+ 
spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled
+ true
+ 
+   When true and spark.sql.adaptive.enabled is true, Spark 
will optimize the skewed shuffle partitions in RebalancePartitions and split 
them to smaller ones according to the target size (specified by 
spark.sql.adaptive.advisoryPartitionSizeInBytes), to avoid data 
skew.
+ 
+ 3.2.0
+   
+   
+ 
spark.sql.adaptive.rebal

[spark] branch master updated (c32ba2cb992 -> 9974b956ac0)

2023-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c32ba2cb992 [SPARK-41951][DOCS] Update SQL migration guide and 
documentations
 add 9974b956ac0 [SPARK-41589][PYTHON][ML] PyTorch Distributor Baseline API 
Changes

No new revisions were added by this update.

Summary of changes:
 dev/sparktestsupport/modules.py|   1 +
 python/docs/source/reference/pyspark.ml.rst|  13 +
 python/pyspark/ml/__init__.py  |   2 +
 .../python => python/pyspark/ml/torch}/__init__.py |   0
 python/pyspark/ml/torch/distributor.py | 281 +
 .../pyspark/ml/torch/tests}/__init__.py|   0
 python/pyspark/ml/torch/tests/test_distributor.py  | 190 ++
 7 files changed, 487 insertions(+)
 copy {examples/src/main/python => python/pyspark/ml/torch}/__init__.py (100%)
 create mode 100644 python/pyspark/ml/torch/distributor.py
 copy {examples/src/main/python => python/pyspark/ml/torch/tests}/__init__.py 
(100%)
 create mode 100644 python/pyspark/ml/torch/tests/test_distributor.py


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41232][SQL][PYTHON] Adding array_append function

2023-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c13fea90595 [SPARK-41232][SQL][PYTHON] Adding array_append function
c13fea90595 is described below

commit c13fea90595cb1489046135c45c92d3bacb85818
Author: Ankit Prakash Gupta 
AuthorDate: Tue Jan 10 10:31:57 2023 +0900

[SPARK-41232][SQL][PYTHON] Adding array_append function

### What changes were proposed in this pull request?

[SPARK-41232] Adds an `array_append` function to Spark SQL and PySpark.
**Syntax**: array_append(arr, element)

**Arguments:**

arr: an array of elements of any type, to which the element is appended.
element: the element to append to the end of `arr`; its type has to match the 
element type of the array.

`select array_append(array(1, 2, 3), 4);`
| array_append(array(1, 2, 3), 4) |
| --|
| [1, 2, 3, 4] |

Mainstream databases that support array_append are shown below:

**Snowflake**

https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.functions.array_append.html

**PostgreSQL**
https://www.postgresql.org/docs/9.1/functions-array.html

**MySQL**

https://dev.mysql.com/doc/refman/5.7/en/json-modification-functions.html#function_json-array-append

### Why are the changes needed?

New API: `array_append`, which appends an element (a column value or a literal) 
to the end of another array column (see the sketch below).
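
A hedged sketch of exercising the new function on an assumed Spark 3.4 session named `spark`; `selectExpr` is used so no exact Scala helper signature has to be assumed:

```scala
// Sketch only: array_append via SQL expressions.
import spark.implicits._

val df = Seq((Seq("b", "a", "c"), "c")).toDF("c1", "c2")

// Append one column's value to an array column.
df.selectExpr("array_append(c1, c2)").show(truncate = false)
// expected: [b, a, c, c]

// Append a literal, as in the SQL example above.
spark.sql("SELECT array_append(array(1, 2, 3), 4)").show()
// expected: [1, 2, 3, 4]
```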

### Does this PR introduce _any_ user-facing change?
 No

### How was this patch tested?
 Unit-tests have been added

Closes #38865 from infoankitp/SPARK-41232.

Authored-by: Ankit Prakash Gupta 
Signed-off-by: Hyukjin Kwon 
---
 .../source/reference/pyspark.sql/functions.rst |   1 +
 python/pyspark/sql/functions.py|  30 +
 .../sql/catalyst/analysis/FunctionRegistry.scala   |   1 +
 .../expressions/collectionOperations.scala | 147 +
 .../expressions/CollectionExpressionsSuite.scala   |  75 +++
 .../scala/org/apache/spark/sql/functions.scala |  12 ++
 .../sql-functions/sql-expression-schema.md |   1 +
 .../src/test/resources/sql-tests/inputs/array.sql  |  11 ++
 .../resources/sql-tests/results/ansi/array.sql.out |  72 ++
 .../test/resources/sql-tests/results/array.sql.out |  72 ++
 .../apache/spark/sql/DataFrameFunctionsSuite.scala | 101 ++
 11 files changed, 523 insertions(+)

diff --git a/python/docs/source/reference/pyspark.sql/functions.rst 
b/python/docs/source/reference/pyspark.sql/functions.rst
index 2dc09cf2762..ddc8eab90f7 100644
--- a/python/docs/source/reference/pyspark.sql/functions.rst
+++ b/python/docs/source/reference/pyspark.sql/functions.rst
@@ -155,6 +155,7 @@ Collection Functions
 concat
 array_position
 element_at
+array_append
 array_sort
 array_remove
 array_distinct
diff --git a/python/pyspark/sql/functions.py b/python/pyspark/sql/functions.py
index c533ca7be6e..60276b2a0b1 100644
--- a/python/pyspark/sql/functions.py
+++ b/python/pyspark/sql/functions.py
@@ -7775,6 +7775,36 @@ def array_compact(col: "ColumnOrName") -> Column:
 return _invoke_function_over_columns("array_compact", col)
 
 
+@try_remote_functions
+def array_append(col1: "ColumnOrName", col2: "ColumnOrName") -> Column:
+"""
+Collection function: returns an array of the elements in col1 along
+with the added element in col2 at the last of the array.
+
+.. versionadded:: 3.4.0
+
+Parameters
+--
+col1 : :class:`~pyspark.sql.Column` or str
+name of column containing array
+col2 : :class:`~pyspark.sql.Column` or str
+name of column containing element
+
+Returns
+---
+:class:`~pyspark.sql.Column`
+an array of values from first array along with the element.
+
+Examples
+
+>>> from pyspark.sql import Row
+>>> df = spark.createDataFrame([Row(c1=["b", "a", "c"], c2="c")])
+>>> df.select(array_append(df.c1, df.c2)).collect()
+[Row(array_append(c1, c2)=['b', 'a', 'c', 'c'])]
+"""
+return _invoke_function_over_columns("array_append", col1, col2)
+
+
 @try_remote_functions
 def explode(col: "ColumnOrName") -> Column:
 """
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
index e28c77a10a4..01d672e20c6 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -

[spark] branch master updated: [SPARK-41943][CORE] Use java api to create files and grant permissions

2023-01-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b22946ed8b5 [SPARK-41943][CORE] Use java api to create files and grant 
permissions
b22946ed8b5 is described below

commit b22946ed8b5f41648a32a4e0c4c40226141a06a0
Author: smallzhongfeng 
AuthorDate: Tue Jan 10 10:40:44 2023 +0800

[SPARK-41943][CORE] Use java api to create files and grant permissions

### What changes were proposed in this pull request?

For the method `createDirWithPermission770`, use the Java API to create 
directories and grant permissions instead of calling shell commands.

### Why are the changes needed?

Safer and more efficient.
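
A minimal sketch of the java.nio approach the patch switches to, mirroring the diff below; the directory path is illustrative, and the real code lives in `DiskBlockManager.createDirWithPermission770`:

```scala
// Sketch only: create a directory and grant 770 ("rwxrwx---") without
// shelling out to `mkdir -p -m770`.
import java.io.File
import java.nio.file.Files
import java.nio.file.attribute.PosixFilePermissions

val dirToCreate = new File("/tmp/spark-merge-dir-example")  // illustrative path
dirToCreate.mkdirs()
Files.setPosixFilePermissions(
  dirToCreate.toPath, PosixFilePermissions.fromString("rwxrwx---"))
```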

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #39448 from smallzhongfeng/java-api-mkdir.

Authored-by: smallzhongfeng 
Signed-off-by: Wenchen Fan 
---
 .../org/apache/spark/storage/DiskBlockManager.scala  | 16 +---
 1 file changed, 5 insertions(+), 11 deletions(-)

diff --git 
a/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala 
b/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala
index e29f3fc1b80..a7ed9226c57 100644
--- a/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala
+++ b/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala
@@ -19,7 +19,7 @@ package org.apache.spark.storage
 
 import java.io.{File, IOException}
 import java.nio.file.Files
-import java.nio.file.attribute.PosixFilePermission
+import java.nio.file.attribute.{PosixFilePermission, PosixFilePermissions}
 import java.util.UUID
 
 import scala.collection.mutable.HashMap
@@ -301,9 +301,6 @@ private[spark] class DiskBlockManager(
* Create a directory that is writable by the group.
* Grant the permission 770 "rwxrwx---" to the directory so the shuffle 
server can
* create subdirs/files within the merge folder.
-   * TODO: Find out why can't we create a dir using java api with permission 
770
-   *  Files.createDirectories(mergeDir.toPath, 
PosixFilePermissions.asFileAttribute(
-   *  PosixFilePermissions.fromString("rwxrwx---")))
*/
   def createDirWithPermission770(dirToCreate: File): Unit = {
 var attempts = 0
@@ -315,16 +312,13 @@ private[spark] class DiskBlockManager(
 throw 
SparkCoreErrors.failToCreateDirectoryError(dirToCreate.getAbsolutePath, 
maxAttempts)
   }
   try {
-val builder = new ProcessBuilder().command(
-  "mkdir", "-p", "-m770", dirToCreate.getAbsolutePath)
-val proc = builder.start()
-val exitCode = proc.waitFor()
+dirToCreate.mkdirs()
+Files.setPosixFilePermissions(
+  dirToCreate.toPath, PosixFilePermissions.fromString("rwxrwx---"))
 if (dirToCreate.exists()) {
   created = dirToCreate
 }
-logDebug(
-  s"Created directory at ${dirToCreate.getAbsolutePath} with 
permission " +
-s"770 and exitCode $exitCode")
+logDebug(s"Created directory at ${dirToCreate.getAbsolutePath} with 
permission 770")
   } catch {
 case e: SecurityException =>
   logWarning(s"Failed to create directory 
${dirToCreate.getAbsolutePath} " +


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41950][PS][BUILD] Pin scikit-learn to 1.1.X

2023-01-09 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c4a33ecb161 [SPARK-41950][PS][BUILD] Pin scikit-learn to 1.1.X
c4a33ecb161 is described below

commit c4a33ecb1617bf7abda462662134f147104cbd5a
Author: Hyukjin Kwon 
AuthorDate: Tue Jan 10 12:16:52 2023 +0900

[SPARK-41950][PS][BUILD] Pin scikit-learn to 1.1.X

### What changes were proposed in this pull request?

After the scikit-learn 1.2.0 release 
(https://github.com/scikit-learn/scikit-learn/releases), the unit tests in pandas 
API on Spark are broken.

```python
# Imports needed for this snippet: numpy, pandas and pandas API on Spark.
import numpy as np
import pandas as pd
import pyspark.pandas as ps

from mlflow.tracking import MlflowClient, set_tracking_uri
from sklearn.linear_model import LinearRegression
import mlflow.sklearn
from tempfile import mkdtemp
d = mkdtemp("pandas_on_spark_mlflow")
set_tracking_uri("file:%s"%d)
client = MlflowClient()
exp_id = mlflow.create_experiment("my_experiment")
exp = mlflow.set_experiment("my_experiment")
train = pd.DataFrame({"x1": np.arange(8), "x2": np.arange(8)**2,
  "y": np.log(2 + np.arange(8))})
train_x = train[["x1", "x2"]]
train_y = train[["y"]]
with mlflow.start_run():
lr = LinearRegression()
lr.fit(train_x, train_y)
mlflow.sklearn.log_model(lr, "model")
from pyspark.pandas.mlflow import load_model
run_info = client.search_runs(exp_id)[-1].info
model = load_model("runs:/{run_id}/model".format(run_id=run_info.run_id))
prediction_df = ps.DataFrame({"x1": [2.0], "x2": [4.0]})
prediction_df["prediction"] = model.predict(prediction_df)
print(prediction_df)
```


https://github.com/apache/spark/blob/06ec98b0d6a51e0c3ffec70e78d86d577b0e7a72/python/pyspark/pandas/mlflow.py#L134-L202

```
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 172, in 
pyspark.pandas.mlflow.load_model
Failed example:
prediction_df
Exception raised:
Traceback (most recent call last):
  File "/usr/lib/python3.9/doctest.py", line 1336, in __run
exec(compile(example.source, filename, "single",
  File "", line 1, in 

prediction_df
  File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13322, 
in __repr__
pdf = cast("DataFrame", 
self._get_or_create_repr_pandas_cache(max_display_count))
  File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13313, 
in _get_or_create_repr_pandas_cache
self, "_repr_pandas_cache", {n: self.head(n + 
1)._to_internal_pandas()}
  File "/__w/spark/spark/python/pyspark/pandas/frame.py", line 13308, 
in _to_internal_pandas
return self._internal.to_pandas_frame
  File "/__w/spark/spark/python/pyspark/pandas/utils.py", line 588, in 
wrapped_lazy_property
setattr(self, attr_name, fn(self))
  File "/__w/spark/spark/python/pyspark/pandas/internal.py", line 1056, 
in to_pandas_frame
pdf = sdf.toPandas()
  File "/__w/spark/spark/python/pyspark/sql/pandas/conversion.py", line 
208, in toPandas
pdf = pd.DataFrame.from_records(self.collect(), 
columns=self.columns)
  File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 1197, 
in collect
sock_info = self._jdf.collectToPython()
  File 
"/__w/spark/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 
1322, in __call__
return_value = get_return_value(
  File "/__w/spark/spark/python/pyspark/sql/utils.py", line 209, in deco
raise converted from None
pyspark.sql.utils.PythonException:
  An exception was thrown from the Python worker. Please see the stack 
trace below.
Traceback (most recent call last):
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 829, in main
process()
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 821, in process
serializer.dump_stream(out_iter, outfile)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 345, in dump_stream
return ArrowStreamSerializer.dump_stream(self, 
init_stream_yield_batches(), stream)
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 86, in dump_stream
for batch in iterator:
  File 
"/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", 
line 338, in init_stream_yield_batches
for series in iterator:
  File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", 
line 519, in func
for result_batch, result_type in result_iter:
  File 
"/usr/local/lib/python3.9/dist-packages/mlflow/pyfunc/__init__.py", line 1253, 
in ud

[GitHub] [spark-website] drgnchan opened a new pull request, #431: Add language tab for word count code example

2023-01-09 Thread GitBox


drgnchan opened a new pull request, #431:
URL: https://github.com/apache/spark-website/pull/431

   
   
   Before

   https://user-images.githubusercontent.com/40224023/211459050-6c824af6-4ed2-46ab-8f94-25e5fd02d979.png

   After

   https://user-images.githubusercontent.com/40224023/211459062-76816b76-7a31-4211-a024-8aecd3f81fa8.png
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41914][SQL] FileFormatWriter materializes AQE plan before accessing outputOrdering

2023-01-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c0911981887 [SPARK-41914][SQL] FileFormatWriter materializes AQE plan 
before accessing outputOrdering
c0911981887 is described below

commit c091198188789afb1282bc76419cf6e1397b0bc9
Author: Enrico Minack 
AuthorDate: Tue Jan 10 13:10:07 2023 +0800

[SPARK-41914][SQL] FileFormatWriter materializes AQE plan before accessing 
outputOrdering

### What changes were proposed in this pull request?
The `FileFormatWriter` materializes an `AdaptiveQueryPlan` before accessing 
the plan's `outputOrdering`. This is required when planned writing is disabled 
(`spark.sql.optimizer.plannedWrite.enabled` is `true` by default). With planned 
writing enabled, `FileFormatWriter` already gets the final plan.

### Why are the changes needed?
`FileFormatWriter` enforces an ordering if the written plan does not 
provide that ordering. An `AdaptiveQueryPlan` does not know its final ordering, 
in which case `FileFormatWriter` enforces the ordering (e.g. by column `"a"`) 
even if the plan provides a compatible ordering (e.g. by columns `"a", "b"`). 
In case of spilling, that order (e.g. by columns `"a", "b"`) gets broken (see 
SPARK-40588).
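
A hedged sketch of the kind of write this affects (columns and output path are illustrative): the input already provides a compatible in-partition order by `("a", "b")`, so the writer should not need to inject its own sort by `"a"` alone.

```scala
// Sketch only: before this fix, with AQE enabled the writer could not see the
// final ordering and added an extra sort by "a" on top, which can break the
// ("a", "b") order when spilling (SPARK-40588).
import spark.implicits._

val df = spark.range(0, 1000)
  .selectExpr("CAST(id % 10 AS INT) AS a", "id AS b")
  .repartition($"a")
  .sortWithinPartitions("a", "b")

df.write
  .mode("overwrite")
  .partitionBy("a")                       // illustrative layout
  .parquet("/tmp/spark-41914-example")    // illustrative path
```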

### Does this PR introduce _any_ user-facing change?
This fixes SPARK-40588 for 3.4, which was introduced in 3.0. This restores 
behaviour from Spark 2.4.

### How was this patch tested?
The final plan that is written to files is now stored in 
`FileFormatWriter.executedPlan` (similar to existing 
`FileFormatWriter.outputOrderingMatched`). Unit tests assert the outermost sort 
order written to files.

The actual plan written into the files changed from (taken from 
`"SPARK-41914: v1 write with AQE and in-partition sorted - non-string partition 
column"`):

```
Sort [input[2, int, false] ASC NULLS FIRST], false, 0
+- *(3) Sort [key#13 ASC NULLS FIRST, value#14 ASC NULLS FIRST], false, 0
   +- *(3) Project [b#24, value#14, key#13]
  +- *(3) BroadcastHashJoin [key#13], [a#23], Inner, BuildLeft, false
 :- BroadcastQueryStage 2
 :  +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), 
[plan_id=376]
 : +- AQEShuffleRead local
 :+- ShuffleQueryStage 0
 :   +- Exchange hashpartitioning(key#13, 5), 
ENSURE_REQUIREMENTS, [plan_id=328]
 :  +- *(1) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, 
true) AS value#14]
 : +- Scan[obj#12]
 +- AQEShuffleRead local
+- ShuffleQueryStage 1
   +- Exchange hashpartitioning(a#23, 5), ENSURE_REQUIREMENTS, 
[plan_id=345]
  +- *(2) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, 
knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).b AS b#24]
 +- Scan[obj#22]
```

where `FileFormatWriter` enforces order with `Sort [input[2, int, false] 
ASC NULLS FIRST], false, 0`, to

```
*(3) Sort [key#13 ASC NULLS FIRST, value#14 ASC NULLS FIRST], false, 0
+- *(3) Project [b#24, value#14, key#13]
   +- *(3) BroadcastHashJoin [key#13], [a#23], Inner, BuildLeft, false
  :- BroadcastQueryStage 2
  :  +- BroadcastExchange 
HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)),false), 
[plan_id=375]
  : +- AQEShuffleRead local
  :+- ShuffleQueryStage 0
  :   +- Exchange hashpartitioning(key#13, 5), 
ENSURE_REQUIREMENTS, [plan_id=327]
  :  +- *(1) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData, true])).key AS key#13, 
staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, 
fromString, knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData, true])).value, true, false, 
true) AS value#14]
  : +- Scan[obj#12]
  +- AQEShuffleRead local
 +- ShuffleQueryStage 1
+- Exchange hashpartitioning(a#23, 5), ENSURE_REQUIREMENTS, 
[plan_id=344]
   +- *(2) SerializeFromObject 
[knownnotnull(assertnotnull(input[0, 
org.apache.spark.sql.test.SQLTestData$TestData2, true])).a AS a#23, 
knownnotnull(assertnotnull(input

[spark] branch master updated: [SPARK-41957][CONNECT][PYTHON] Enable the doctest for `DataFrame.hint`

2023-01-09 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 7c1de057dd5 [SPARK-41957][CONNECT][PYTHON] Enable the doctest for 
`DataFrame.hint`
7c1de057dd5 is described below

commit 7c1de057dd59dbef62f153c75005bb67ea809ac3
Author: Ruifeng Zheng 
AuthorDate: Tue Jan 10 13:57:50 2023 +0800

[SPARK-41957][CONNECT][PYTHON] Enable the doctest for `DataFrame.hint`

### What changes were proposed in this pull request?
1. Enable the doctest for `DataFrame.hint`.
2. Add an additional test for JOIN hints (see the Scala sketch below).
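
A hedged Scala analogue of the join-hint check added on the Python side; the DataFrames are illustrative and the hint names come from the test in the diff below:

```scala
// Sketch only: exercise the same BROADCAST / MERGE / SHUFFLE_HASH hints.
import spark.implicits._

val df1 = Seq((2, "Alice"), (5, "Bob")).toDF("age", "name")
val df2 = Seq((80, "Tom"), (85, "Bob")).toDF("height", "name")

df1.join(df2.hint("BROADCAST"), "name").explain()     // expect BroadcastHashJoin
df1.join(df2.hint("MERGE"), "name").explain()         // expect SortMergeJoin
df1.join(df2.hint("SHUFFLE_HASH"), "name").explain()  // expect ShuffledHashJoin
```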

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated the doctest and added a unit test.

Closes #39472 from zhengruifeng/connect_fix_hint.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 python/pyspark/sql/connect/dataframe.py|  2 --
 python/pyspark/sql/dataframe.py|  2 +-
 python/pyspark/sql/tests/connect/test_connect_basic.py | 15 +++
 3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 7f36272f538..5ff3d59ddd6 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -1627,8 +1627,6 @@ def _test() -> None:
 del pyspark.sql.connect.dataframe.DataFrame.drop.__doc__
 del pyspark.sql.connect.dataframe.DataFrame.join.__doc__
 
-del pyspark.sql.connect.dataframe.DataFrame.hint.__doc__
-
 # TODO(SPARK-41886): The doctest output has different order
 del pyspark.sql.connect.dataframe.DataFrame.intersect.__doc__
 
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 79809cbea53..a437400fb90 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -1116,7 +1116,7 @@ class DataFrame(PandasMapOpsMixin, PandasConversionMixin):
 
 >>> df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], 
schema=["age", "name"])
 >>> df2 = spark.createDataFrame([Row(height=80, name="Tom"), 
Row(height=85, name="Bob")])
->>> df.join(df2, "name").explain()
+>>> df.join(df2, "name").explain()  # doctest: +SKIP
 == Physical Plan ==
 ...
 ... +- SortMergeJoin ...
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index 322594127e7..c946c735a1d 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -1302,6 +1302,21 @@ class SparkConnectBasicTests(SparkConnectSQLTestCase):
 with self.assertRaises(SparkConnectException):
 self.connect.read.table(self.tbl_name).hint("REPARTITION", "id", 
3).toPandas()
 
+def test_join_hint(self):
+
+cdf1 = self.connect.createDataFrame([(2, "Alice"), (5, "Bob")], 
schema=["age", "name"])
+cdf2 = self.connect.createDataFrame(
+[Row(height=80, name="Tom"), Row(height=85, name="Bob")]
+)
+
+self.assertTrue(
+"BroadcastHashJoin" in cdf1.join(cdf2.hint("BROADCAST"), 
"name")._explain_string()
+)
+self.assertTrue("SortMergeJoin" in cdf1.join(cdf2.hint("MERGE"), 
"name")._explain_string())
+self.assertTrue(
+"ShuffledHashJoin" in cdf1.join(cdf2.hint("SHUFFLE_HASH"), 
"name")._explain_string()
+)
+
 def test_empty_dataset(self):
 # SPARK-41005: Test arrow based collection with empty dataset.
 self.assertTrue(


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41595][SQL] Support generator function explode/explode_outer in the FROM clause

2023-01-09 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new eca21c7ad58 [SPARK-41595][SQL] Support generator function 
explode/explode_outer in the FROM clause
eca21c7ad58 is described below

commit eca21c7ad582c9c374108518842ad816843e1224
Author: allisonwang-db 
AuthorDate: Tue Jan 10 14:09:06 2023 +0800

[SPARK-41595][SQL] Support generator function explode/explode_outer in the 
FROM clause

### What changes were proposed in this pull request?
This PR supports using table-valued generator functions in the FROM clause 
of a query. A generator function can be registered in the table function 
registry and resolved as a table function during analysis.

Note this PR only adds support for two built-in generator functions: 
`explode` and `explode_outer` with literal input values. We will support more 
generator functions and LATERAL references in separate PRs.

### Why are the changes needed?
To make table-valued generator functions more user-friendly and consistent 
with Spark's built-in table function Range.

### Does this PR introduce _any_ user-facing change?
Yes. Before this PR, the built-in generator functions explode/explode_outer 
could not be used in the FROM clause:
```
select * from explode(array(1, 2))

AnalysisException: could not resolve `explode` to a table-valued function;
```
After this PR, we can support this usage:
```
select * from explode(array(1, 2))

+---+
|col|
+---+
|  1|
|  2|
+---+
```
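
The same queries can be issued from Scala via `spark.sql`; a hedged sketch, assuming a session on a build with this change (per the note above, only literal input values are supported so far):

```scala
// Sketch only: table-valued generator functions in the FROM clause.
spark.sql("SELECT * FROM explode(array(1, 2))").show()
// expected:
// +---+
// |col|
// +---+
// |  1|
// |  2|
// +---+

// explode_outer is also supported; it differs from explode only for empty or
// NULL inputs, where it still produces a single row with NULL.
spark.sql("SELECT * FROM explode_outer(array(1, 2))").show()
```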

### How was this patch tested?

New SQL query tests.

Closes #39133 from allisonwang-db/spark-41595-explode-in-from.

Authored-by: allisonwang-db 
Signed-off-by: Wenchen Fan 
---
 .../spark/sql/catalyst/analysis/Analyzer.scala |  52 +++--
 .../sql/catalyst/analysis/FunctionRegistry.scala   |  23 +-
 .../spark/sql/catalyst/analysis/unresolved.scala   |  65 --
 .../sql/catalyst/catalog/SessionCatalog.scala  |   4 +-
 .../spark/sql/catalyst/parser/AstBuilder.scala |   8 +-
 .../spark/sql/catalyst/trees/TreePatterns.scala|   1 +
 .../sql/catalyst/analysis/AnalysisSuite.scala  |   6 +-
 .../sql/catalyst/parser/PlanParserSuite.scala  |  12 +-
 .../sql-tests/inputs/table-valued-functions.sql|  33 +++
 .../results/table-valued-functions.sql.out | 257 +
 10 files changed, 410 insertions(+), 51 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
index 00a24357226..3b3a011db97 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
@@ -2431,7 +2431,7 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 
 def apply(plan: LogicalPlan): LogicalPlan = 
plan.resolveOperatorsUpWithPruning(
   _.containsAnyPattern(UNRESOLVED_FUNC, UNRESOLVED_FUNCTION, GENERATOR,
-UNRESOLVED_TABLE_VALUED_FUNCTION), ruleId) {
+UNRESOLVED_TABLE_VALUED_FUNCTION, UNRESOLVED_TVF_ALIASES), ruleId) {
   // Resolve functions with concrete relations from v2 catalog.
   case u @ UnresolvedFunctionName(nameParts, cmd, requirePersistentFunc, 
mismatchHint, _) =>
 lookupBuiltinOrTempFunction(nameParts)
@@ -2453,7 +2453,7 @@ class Analyzer(override val catalogManager: 
CatalogManager)
   // Resolve table-valued function references.
   case u: UnresolvedTableValuedFunction if 
u.functionArgs.forall(_.resolved) =>
 withPosition(u) {
-  val resolvedFunc = try {
+  try {
 resolveBuiltinOrTempTableFunction(u.name, 
u.functionArgs).getOrElse {
   val CatalogAndIdentifier(catalog, ident) = 
expandIdentifier(u.name)
   if (CatalogV2Util.isSessionCatalog(catalog)) {
@@ -2470,27 +2470,26 @@ class Analyzer(override val catalogManager: 
CatalogManager)
 errorClass = "_LEGACY_ERROR_TEMP_2308",
 messageParameters = Map("name" -> u.name.quoted))
   }
-  // If alias names assigned, add `Project` with the aliases
-  if (u.outputNames.nonEmpty) {
-val outputAttrs = resolvedFunc.output
-// Checks if the number of the aliases is equal to expected one
-if (u.outputNames.size != outputAttrs.size) {
-  u.failAnalysis(
-errorClass = "_LEGACY_ERROR_TEMP_2307",
-messageParameters = Map(
-  "funcName" -> u.name.quoted,
-  "aliasesNum" -> u.outputNames.size.toString,
-  "outColsNum" -> outputAttrs.size.toString))
-  

[GitHub] [spark-website] dongjoon-hyun closed pull request #431: Add language tab for word count code example

2023-01-09 Thread GitBox


dongjoon-hyun closed pull request #431: Add language tab for word count code 
example
URL: https://github.com/apache/spark-website/pull/431


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark-website] branch asf-site updated: Add language tab for word count code example

2023-01-09 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/spark-website.git


The following commit(s) were added to refs/heads/asf-site by this push:
 new c94c6ff3d Add language tab for word count code example
c94c6ff3d is described below

commit c94c6ff3dba01748958aafff39e408c8c001972f
Author: drgnchan <40224023+drgnc...@users.noreply.github.com>
AuthorDate: Mon Jan 9 22:45:03 2023 -0800

Add language tab for word count code example



Before

https://user-images.githubusercontent.com/40224023/211459050-6c824af6-4ed2-46ab-8f94-25e5fd02d979.png

After

https://user-images.githubusercontent.com/40224023/211459062-76816b76-7a31-4211-a024-8aecd3f81fa8.png

Author: drgnchan <40224023+drgnc...@users.noreply.github.com>

Closes #431 from drgnchan/asf-site.
---
 examples.md| 5 +
 site/examples.html | 6 ++
 2 files changed, 11 insertions(+)

diff --git a/examples.md b/examples.md
index 8061ac430..d9362784d 100644
--- a/examples.md
+++ b/examples.md
@@ -26,6 +26,11 @@ In this page, we will show examples using RDD API as well as 
examples using high
 Word count
 In this example, we use a few transformations to build a dataset of 
(String, Int) pairs called counts and then save it to a file.
 
+
+  Python
+  Scala
+  Java
+
 
 
 
diff --git a/site/examples.html b/site/examples.html
index b41f9a095..71cf0c60a 100644
--- a/site/examples.html
+++ b/site/examples.html
@@ -142,6 +142,12 @@ In this page, we will show examples using RDD API as well 
as examples using high
 Word count
 In this example, we use a few transformations to build a dataset of 
(String, Int) pairs called counts and then save it to a file.
 
+
+  Python
+  Scala
+  Java
+
+
 
 
 


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org