[spark] branch branch-3.3 updated: [SPARK-43751][SQL][DOC] Document `unbase64` behavior change

2023-05-25 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new 11d2f7316d6 [SPARK-43751][SQL][DOC] Document `unbase64` behavior change
11d2f7316d6 is described below

commit 11d2f7316d6ddd4a00853deb74b4a65f6f4c899c
Author: Cheng Pan 
AuthorDate: Fri May 26 11:33:38 2023 +0800

[SPARK-43751][SQL][DOC] Document `unbase64` behavior change

### What changes were proposed in this pull request?

After SPARK-37820, `select unbase64("abcs==")` (malformed input) always 
throws an exception. That PR does not help in this case; it only improves the 
error message for `to_binary()`.

So, `unbase64()`'s behavior for malformed input changed silently after 
SPARK-37820:
- before: returned a best-effort result, because it used the 
[LENIENT](https://github.com/apache/commons-codec/blob/rel/commons-codec-1.15/src/main/java/org/apache/commons/codec/binary/Base64InputStream.java#L46)
 policy: any trailing bits are composed into 8-bit bytes where possible; the 
remainder is discarded.
- after: throws an exception

And there is no way to restore the previous behavior. To tolerate malformed 
input, users should migrate `unbase64()` to `try_to_binary(<str>, 'base64')`, 
which returns NULL instead of failing with an exception.
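
For illustration, a minimal spark-shell sketch of the migration described above 
(not part of this patch; the malformed input mirrors the example above):

```scala
// Since Spark 3.3 (after SPARK-37820), malformed base64 input fails at runtime:
spark.sql("SELECT unbase64('abcs==')").show()                 // throws an exception

// Migration path: try_to_binary returns NULL for malformed input instead of failing.
spark.sql("SELECT try_to_binary('abcs==', 'base64')").show()  // single NULL row
```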

### Why are the changes needed?

Add the behavior change to migration guide.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Manually reviewed.

Closes #41280 from pan3793/SPARK-43751.

Authored-by: Cheng Pan 
Signed-off-by: Kent Yao 
(cherry picked from commit af6c1ec7c795584c28e15e4963eed83917e2f06a)
Signed-off-by: Kent Yao 
---
 docs/sql-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 5c46343d994..02648a8d7e6 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -68,6 +68,8 @@ license: |
   
   - Since Spark 3.3, the precision of the return type of round-like functions 
has been fixed. This may cause Spark throw `AnalysisException` of the 
`CANNOT_UP_CAST_DATATYPE` error class when using views created by prior 
versions. In such cases, you need to recreate the views using ALTER VIEW AS or 
CREATE OR REPLACE VIEW AS with newer Spark versions.
 
+  - Since Spark 3.3, the `unbase64` function throws error for a malformed 
`str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input 
and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function 
returns a best-efforts result for a malformed `str` input.
+
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS 
(b)`-style SQL statements, `grouping__id` returns different values from Apache 
Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by 
expressions plus grouping set columns. To restore the behavior before 3.3.1 and 
3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For 
details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) 
and [SPARK-40562](https:/ [...]
 
 ## Upgrading from Spark SQL 3.1 to 3.2





[spark] branch branch-3.4 updated: [SPARK-43751][SQL][DOC] Document `unbase64` behavior change

2023-05-25 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch branch-3.4
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.4 by this push:
 new f8a2498868c [SPARK-43751][SQL][DOC] Document `unbase64` behavior change
f8a2498868c is described below

commit f8a2498868c130c0723d8e871e48132e76c8a263
Author: Cheng Pan 
AuthorDate: Fri May 26 11:33:38 2023 +0800

[SPARK-43751][SQL][DOC] Document `unbase64` behavior change

### What changes were proposed in this pull request?

After SPARK-37820, `select unbase64("abcs==")` (malformed input) always 
throws an exception. That PR does not help in this case; it only improves the 
error message for `to_binary()`.

So, `unbase64()`'s behavior for malformed input changed silently after 
SPARK-37820:
- before: returned a best-effort result, because it used the 
[LENIENT](https://github.com/apache/commons-codec/blob/rel/commons-codec-1.15/src/main/java/org/apache/commons/codec/binary/Base64InputStream.java#L46)
 policy: any trailing bits are composed into 8-bit bytes where possible; the 
remainder is discarded.
- after: throws an exception

And there is no way to restore the previous behavior. To tolerate malformed 
input, users should migrate `unbase64()` to `try_to_binary(<str>, 'base64')`, 
which returns NULL instead of failing with an exception.

### Why are the changes needed?

Add the behavior change to migration guide.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Manually reviewed.

Closes #41280 from pan3793/SPARK-43751.

Authored-by: Cheng Pan 
Signed-off-by: Kent Yao 
(cherry picked from commit af6c1ec7c795584c28e15e4963eed83917e2f06a)
Signed-off-by: Kent Yao 
---
 docs/sql-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 0181ff95cd6..d9192d36a3b 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -86,6 +86,8 @@ license: |
   
   - Since Spark 3.3, the precision of the return type of round-like functions 
has been fixed. This may cause Spark throw `AnalysisException` of the 
`CANNOT_UP_CAST_DATATYPE` error class when using views created by prior 
versions. In such cases, you need to recreate the views using ALTER VIEW AS or 
CREATE OR REPLACE VIEW AS with newer Spark versions.
 
+  - Since Spark 3.3, the `unbase64` function throws error for a malformed 
`str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input 
and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function 
returns a best-efforts result for a malformed `str` input.
+
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS 
(b)`-style SQL statements, `grouping__id` returns different values from Apache 
Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by 
expressions plus grouping set columns. To restore the behavior before 3.3.1 and 
3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For 
details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) 
and [SPARK-40562](https:/ [...]
 
 ## Upgrading from Spark SQL 3.1 to 3.2





[spark] branch master updated: [SPARK-43751][SQL][DOC] Document `unbase64` behavior change

2023-05-25 Thread yao
This is an automated email from the ASF dual-hosted git repository.

yao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new af6c1ec7c79 [SPARK-43751][SQL][DOC] Document `unbase64` behavior change
af6c1ec7c79 is described below

commit af6c1ec7c795584c28e15e4963eed83917e2f06a
Author: Cheng Pan 
AuthorDate: Fri May 26 11:33:38 2023 +0800

[SPARK-43751][SQL][DOC] Document `unbase64` behavior change

### What changes were proposed in this pull request?

After SPARK-37820, `select unbase64("abcs==")` (malformed input) always 
throws an exception. That PR does not help in this case; it only improves the 
error message for `to_binary()`.

So, `unbase64()`'s behavior for malformed input changed silently after 
SPARK-37820:
- before: returned a best-effort result, because it used the 
[LENIENT](https://github.com/apache/commons-codec/blob/rel/commons-codec-1.15/src/main/java/org/apache/commons/codec/binary/Base64InputStream.java#L46)
 policy: any trailing bits are composed into 8-bit bytes where possible; the 
remainder is discarded.
- after: throws an exception

And there is no way to restore the previous behavior. To tolerate malformed 
input, users should migrate `unbase64()` to `try_to_binary(<str>, 'base64')`, 
which returns NULL instead of failing with an exception.

### Why are the changes needed?

Add the behavior change to migration guide.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Manually reviewed.

Closes #41280 from pan3793/SPARK-43751.

Authored-by: Cheng Pan 
Signed-off-by: Kent Yao 
---
 docs/sql-migration-guide.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index 80df50273a1..58627801fc7 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -91,6 +91,8 @@ license: |
 
   - Since Spark 3.3, the precision of the return type of round-like functions 
has been fixed. This may cause Spark throw `AnalysisException` of the 
`CANNOT_UP_CAST_DATATYPE` error class when using views created by prior 
versions. In such cases, you need to recreate the views using ALTER VIEW AS or 
CREATE OR REPLACE VIEW AS with newer Spark versions.
 
+  - Since Spark 3.3, the `unbase64` function throws error for a malformed 
`str` input. Use `try_to_binary(<str>, 'base64')` to tolerate malformed input 
and return NULL instead. In Spark 3.2 and earlier, the `unbase64` function 
returns a best-efforts result for a malformed `str` input.
+
   - Since Spark 3.3.1 and 3.2.3, for `SELECT ... GROUP BY a GROUPING SETS 
(b)`-style SQL statements, `grouping__id` returns different values from Apache 
Spark 3.2.0, 3.2.1, 3.2.2, and 3.3.0. It computes based on user-given group-by 
expressions plus grouping set columns. To restore the behavior before 3.3.1 and 
3.2.3, you can set `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`. For 
details, see [SPARK-40218](https://issues.apache.org/jira/browse/SPARK-40218) 
and [SPARK-40562](https:/ [...]
 
 ## Upgrading from Spark SQL 3.1 to 3.2





[spark-docker] branch master updated: [SPARK-43806] Add awesome-spark-docker.md

2023-05-25 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new 9d4c98c  [SPARK-43806] Add awesome-spark-docker.md
9d4c98c is described below

commit 9d4c98c62c4ce517e69e65d1f6f7bf412d775b75
Author: Yikun Jiang 
AuthorDate: Fri May 26 09:53:20 2023 +0800

[SPARK-43806] Add awesome-spark-docker.md

### What changes were proposed in this pull request?
Add links to more related images and dockerfile reference.

### Why are the changes needed?
Something we talked about in the "Spark on Kube Coffee Chats" [1]: add links to 
more related images and a Dockerfile reference. Initialized with [2].
[1] https://lists.apache.org/thread/26gpmlhqhk5cp2fhtzrpl5f61p8jc551
[2] 
https://github.com/awesome-spark/awesome-spark/blob/main/README.md#docker-images

### Does this PR introduce _any_ user-facing change?
Doc only

### How was this patch tested?
No

Closes #28 from Yikun/awesome-spark-docker.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 awesome-spark-docker.md | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/awesome-spark-docker.md b/awesome-spark-docker.md
new file mode 100644
index 000..c7bb840
--- /dev/null
+++ b/awesome-spark-docker.md
@@ -0,0 +1,7 @@
+A curated list of awesome Apache Spark Docker resources.
+
+- 
[jupyter/docker-stacks/pyspark-notebook](https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook)
 - PySpark with Jupyter Notebook.
+- 
[big-data-europe/docker-spark](https://github.com/big-data-europe/docker-spark) 
- The standalone cluster and spark applications related Dockerfiles.
+- 
[openeuler/spark](https://github.com/openeuler-mirror/openeuler-docker-images/tree/master/spark)
 - Dockerfile reference for dnf/yum based OS.
+- 
[GoogleCloudPlatform/spark-on-k8s-operator](https://github.com/GoogleCloudPlatform/spark-on-k8s-operator)
 - Kubernetes operator for managing the lifecycle of Apache Spark applications 
on Kubernetes.
+





[spark] branch master updated: [SPARK-43769][CONNECT] Implement 'levenshtein(str1, str2[, threshold])' functions

2023-05-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2496523900a [SPARK-43769][CONNECT] Implement 'levenshtein(str1, str2[, 
threshold])' functions
2496523900a is described below

commit 2496523900a662d0f63430cb758b91e002bd520e
Author: panbingkun 
AuthorDate: Fri May 26 09:40:32 2023 +0800

[SPARK-43769][CONNECT] Implement 'levenshtein(str1, str2[, threshold])' 
functions

### What changes were proposed in this pull request?
The pr aims to implement the 'levenshtein(str1, str2[, threshold])' function 
for the `connect` module.

### Why are the changes needed?
After [Add a max distance argument to the levenshtein() 
function](https://issues.apache.org/jira/browse/SPARK-43493), it has already 
been implemented on the Scala side, so we need to align the Connect client.
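
As a usage sketch of the new overload (assuming a running `spark` session, e.g. 
spark-shell; the data and thresholds below are made up and mirror the existing 
string-function tests):

```scala
import spark.implicits._
import org.apache.spark.sql.functions.levenshtein

val df = Seq(("kitten", "sitting"), ("frog", "fog")).toDF("l", "r")
// Without a threshold: the plain edit distance.
df.select(levenshtein($"l", $"r")).show()      // 3 and 1
// With a threshold: the distance when it is <= threshold, otherwise -1.
df.select(levenshtein($"l", $"r", 2)).show()   // -1 and 1
```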

### Does this PR introduce _any_ user-facing change?
Yes, new API for Connect.

### How was this patch tested?
- Pass GA.
- Manual testing
1. ./build/sbt "connect-client-jvm/testOnly *ClientE2ETestSuite*"
2. sh dev/connect-jvm-client-mima-check

Closes #41293 from panbingkun/SPARK-43769.

Authored-by: panbingkun 
Signed-off-by: Ruifeng Zheng 
---
 .../scala/org/apache/spark/sql/functions.scala |  11 +++
 .../apache/spark/sql/PlanGenerationTestSuite.scala |   4 +++
 .../CheckConnectJvmClientCompatibility.scala   |   1 -
 .../function_levenshtein_with_threshold.explain|   2 ++
 .../function_levenshtein_with_threshold.json   |  33 +
 .../function_levenshtein_with_threshold.proto.bin  | Bin 0 -> 195 bytes
 6 files changed, 50 insertions(+), 1 deletion(-)

diff --git 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
index f92216f49bb..526f6904d68 100644
--- 
a/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
+++ 
b/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/functions.scala
@@ -2896,6 +2896,17 @@ object functions {
*/
   def levenshtein(l: Column, r: Column): Column = Column.fn("levenshtein", l, 
r)
 
+  /**
+   * Computes the Levenshtein distance of the two given string columns if it's 
less than or equal
+   * to a given threshold.
+   * @return
+   *   result distance, or -1
+   * @group string_funcs
+   * @since 3.5.0
+   */
+  def levenshtein(l: Column, r: Column, threshold: Int): Column =
+Column.fn("levenshtein", l, r, lit(threshold))
+
   /**
* Locate the position of the first occurrence of substr.
*
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
index 7ece54d0439..94b9adda655 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/PlanGenerationTestSuite.scala
@@ -1361,6 +1361,10 @@ class PlanGenerationTestSuite
 fn.levenshtein(fn.col("g"), lit("bob"))
   }
 
+  functionTest("levenshtein with threshold") {
+fn.levenshtein(fn.col("g"), lit("bob"), 2)
+  }
+
   functionTest("locate") {
 fn.locate("jar", fn.col("g"))
   }
diff --git 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala
 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala
index ed3660b791a..429e27827e8 100644
--- 
a/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala
+++ 
b/connector/connect/client/jvm/src/test/scala/org/apache/spark/sql/connect/client/CheckConnectJvmClientCompatibility.scala
@@ -199,7 +199,6 @@ object CheckConnectJvmClientCompatibility {
   
ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.callUDF"),
   
ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.unwrap_udt"),
   ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.udaf"),
-  
ProblemFilters.exclude[Problem]("org.apache.spark.sql.functions.levenshtein"),
 
   // KeyValueGroupedDataset
   ProblemFilters.exclude[Problem](
diff --git 
a/connector/connect/common/src/test/resources/query-tests/explain-results/function_levenshtein_with_threshold.explain
 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_levenshtein_with_threshold.explain
new file mode 100644
index 000..5bd1d89ae06
--- /dev/null
+++ 
b/connector/connect/common/src/test/resources/query-tests/explain-results/function_levenshtein_with_threshold.explain
@@ -0,0 +1,2 

[spark] branch master updated: [SPARK-43671][SPARK-43672][SPARK-43673][SPARK-43674][PS] Fix `CategoricalOps` for Spark Connect

2023-05-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c686313bb7f [SPARK-43671][SPARK-43672][SPARK-43673][SPARK-43674][PS] 
Fix `CategoricalOps` for Spark Connect
c686313bb7f is described below

commit c686313bb7f2288cdda5b85b33aa4f3ebfea7760
Author: itholic 
AuthorDate: Fri May 26 09:36:30 2023 +0800

[SPARK-43671][SPARK-43672][SPARK-43673][SPARK-43674][PS] Fix 
`CategoricalOps` for Spark Connect

### What changes were proposed in this pull request?

This PR proposes to fix `CategoricalOps` test for pandas API on Spark with 
Spark Connect.

This includes SPARK-43671, SPARK-43672, SPARK-43673, and SPARK-43674 at once, 
because they all involve similar modifications in a single file.

### Why are the changes needed?

To support all features for pandas API on Spark with Spark Connect.

### Does this PR introduce _any_ user-facing change?

Yes, `CategoricalOps.lt`,  `CategoricalOps.le`, `CategoricalOps.ge`, 
`CategoricalOps.gt` are now working as expected on Spark Connect.

### How was this patch tested?

Uncommented the UTs and tested manually.

Closes #41310 from itholic/SPARK-43671-4.

Authored-by: itholic 
Signed-off-by: Ruifeng Zheng 
---
 .../pandas/data_type_ops/categorical_ops.py| 57 +++---
 .../data_type_ops/test_parity_categorical_ops.py   | 16 --
 2 files changed, 49 insertions(+), 24 deletions(-)

diff --git a/python/pyspark/pandas/data_type_ops/categorical_ops.py 
b/python/pyspark/pandas/data_type_ops/categorical_ops.py
index ad7e46192bf..9f14a4b1ee7 100644
--- a/python/pyspark/pandas/data_type_ops/categorical_ops.py
+++ b/python/pyspark/pandas/data_type_ops/categorical_ops.py
@@ -27,7 +27,8 @@ from pyspark.pandas.base import column_op, IndexOpsMixin
 from pyspark.pandas.data_type_ops.base import _sanitize_list_like, DataTypeOps
 from pyspark.pandas.typedef import pandas_on_spark_type
 from pyspark.sql import functions as F
-from pyspark.sql.column import Column
+from pyspark.sql.column import Column as PySparkColumn
+from pyspark.sql.utils import is_remote
 
 
 class CategoricalOps(DataTypeOps):
@@ -65,33 +66,73 @@ class CategoricalOps(DataTypeOps):
 
 def eq(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-return _compare(left, right, Column.__eq__, 
is_equality_comparison=True)
+if is_remote():
+from pyspark.sql.connect.column import Column as ConnectColumn
+
+Column = ConnectColumn
+else:
+Column = PySparkColumn  # type: ignore[assignment]
+return _compare(
+left, right, Column.__eq__, is_equality_comparison=True  # type: 
ignore[arg-type]
+)
 
 def ne(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-return _compare(left, right, Column.__ne__, 
is_equality_comparison=True)
+if is_remote():
+from pyspark.sql.connect.column import Column as ConnectColumn
+
+Column = ConnectColumn
+else:
+Column = PySparkColumn  # type: ignore[assignment]
+return _compare(
+left, right, Column.__ne__, is_equality_comparison=True  # type: 
ignore[arg-type]
+)
 
 def lt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-return _compare(left, right, Column.__lt__)
+if is_remote():
+from pyspark.sql.connect.column import Column as ConnectColumn
+
+Column = ConnectColumn
+else:
+Column = PySparkColumn  # type: ignore[assignment]
+return _compare(left, right, Column.__lt__)  # type: ignore[arg-type]
 
 def le(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-return _compare(left, right, Column.__le__)
+if is_remote():
+from pyspark.sql.connect.column import Column as ConnectColumn
+
+Column = ConnectColumn
+else:
+Column = PySparkColumn  # type: ignore[assignment]
+return _compare(left, right, Column.__le__)  # type: ignore[arg-type]
 
 def gt(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-return _compare(left, right, Column.__gt__)
+if is_remote():
+from pyspark.sql.connect.column import Column as ConnectColumn
+
+Column = ConnectColumn
+else:
+Column = PySparkColumn  # type: ignore[assignment]
+return _compare(left, right, Column.__gt__)  # type: ignore[arg-type]
 
 def ge(self, left: IndexOpsLike, right: Any) -> SeriesOrIndex:
 _sanitize_list_like(right)
-return 

[spark] branch master updated: [SPARK-43647][CONNECT][TESTS] Clean up hive classes dir when test `connect-client-jvm` without -Phive

2023-05-25 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 9d1e67cc200 [SPARK-43647][CONNECT][TESTS] Clean up hive classes dir 
when test `connect-client-jvm` without -Phive
9d1e67cc200 is described below

commit 9d1e67cc200b9315dccdc2f081549dbfe5d1ecd9
Author: yangjie01 
AuthorDate: Fri May 26 09:31:11 2023 +0800

[SPARK-43647][CONNECT][TESTS] Clean up hive classes dir when test 
`connect-client-jvm` without -Phive

### What changes were proposed in this pull request?
This pr aims to add a cleaning action for the 
`$sparkHome/sql/hive/target/$scalaDir/classes` and 
`$sparkHome/sql/hive/target/$scalaDir/test-classes` directories before 
`SimpleSparkConnectService` starts when running test cases that inherit 
`RemoteSparkSession` without `-Phive`, to avoid unexpected loading of 
`sql/hive/target/scala-2.12/classes/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`
 by `ServiceLoader`.

### Why are the changes needed?
When we run the test cases that inherit `RemoteSparkSession`, the classpath 
used to launch `SimpleSparkConnectService` will at least include the following 
directories, for both Maven and SBT:

```
$sparkHome/conf/
$sparkHome/common/kvstore/target/scala-2.12/classes/
$sparkHome/common/network-common/target/scala-2.12/classes/
$sparkHome/common/network-shuffle/target/scala-2.12/classes/
$sparkHome/common/network-yarn/target/scala-2.12/classes
$sparkHome/common/sketch/target/scala-2.12/classes/
$sparkHome/common/tags/target/scala-2.12/classes/
$sparkHome/common/unsafe/target/scala-2.12/classes/
$sparkHome/core/target/scala-2.12/classes/
$sparkHome/examples/target/scala-2.12/classes/
$sparkHome/graphx/target/scala-2.12/classes/
$sparkHome/launcher/target/scala-2.12/classes/
$sparkHome/mllib/target/scala-2.12/classes/
$sparkHome/repl/target/scala-2.12/classes/
$sparkHome/resource-managers/mesos/target/scala-2.12/classes
$sparkHome/resource-managers/yarn/target/scala-2.12/classes
$sparkHome/sql/catalyst/target/scala-2.12/classes/
$sparkHome/sql/core/target/scala-2.12/classes/
$sparkHome/sql/hive/target/scala-2.12/classes/
$sparkHome/sql/hive-thriftserver/target/scala-2.12/classes/
$sparkHome/streaming/target/scala-2.12/classes/
$sparkHome/common/kvstore/target/scala-2.12/test-classes
$sparkHome/common/network-common/target/scala-2.12/test-classes/
$sparkHome/common/network-shuffle/target/scala-2.12/test-classes/
$sparkHome/common/network-yarn/target/scala-2.12/test-classes
$sparkHome/common/sketch/target/scala-2.12/test-classes
$sparkHome/common/tags/target/scala-2.12/test-classes/
$sparkHome/common/unsafe/target/scala-2.12/test-classes
$sparkHome/core/target/scala-2.12/test-classes/
$sparkHome/examples/target/scala-2.12/test-classes
$sparkHome/graphx/target/scala-2.12/test-classes
$sparkHome/launcher/target/scala-2.12/test-classes/
$sparkHome/mllib/target/scala-2.12/test-classes
$sparkHome/repl/target/scala-2.12/test-classes
$sparkHome/resource-managers/mesos/target/scala-2.12/test-classes
$sparkHome/resource-managers/yarn/target/scala-2.12/test-classes
$sparkHome/sql/catalyst/target/scala-2.12/test-classes/
$sparkHome/sql/core/target/scala-2.12/test-classes
$sparkHome/sql/hive/target/scala-2.12/test-classes
$sparkHome/sql/hive-thriftserver/target/scala-2.12/test-classes
$sparkHome/streaming/target/scala-2.12/test-classes
$sparkHome/connector/connect/client/jvm/target/scala-2.12/test-classes/
$sparkHome/connector/connect/common/target/scala-2.12/test-classes/

...

```

So if the test case needs to call `DataSource#lookupDataSource` and the `hive` 
module is compiled, 
`sql/hive/target/scala-2.12/classes/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister`
 will be loaded by `ServiceLoader`.

After SPARK-43186 | https://github.com/apache/spark/pull/40848 was merged, 
`org.apache.spark.sql.hive.execution.HiveFileFormat` changed to use 
`org.apache.hadoop.hive.ql.plan.FileSinkDesc` instead of 
`org.apache.spark.sql.hive.HiveShim.ShimFileSinkDesc`, so it has a strong 
dependency on `hive-exec`. But when there are no hive-related jars under 
`assembly/target/$scalaDir/jars/`, initialization of 
`org.apache.spark.sql.hive.execution.HiveFileFormat` fails and the test fails.
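
To illustrate the mechanism (a sketch, not part of the patch), the lookup that 
ends up instantiating the Hive providers is an ordinary `ServiceLoader` scan:

```scala
import java.util.ServiceLoader
import org.apache.spark.sql.sources.DataSourceRegister

// ServiceLoader reads every META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// entry on the classpath. If sql/hive/target/scala-2.12/classes is still present, the Hive
// implementations are instantiated even though no hive jars were copied to
// assembly/target/$scalaDir/jars, which triggers the initialization failure described above.
val loader = Thread.currentThread().getContextClassLoader
ServiceLoader.load(classOf[DataSourceRegister], loader)
  .forEach(p => println(p.getClass.getName))
```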

For example, when we run the following commands to test 
`connect-client-jvm` without `-Phive`:

```
build/mvn clean install -DskipTests
build/mvn test -pl connector/connect/client/jvm
```

Then hive-related jars will not be copied to 
`assembly/target/$scalaDir/jars/`, and there will be a test error such as:

**Client side**

```
- read 

[spark] branch master updated: [SPARK-42859][PS][TESTS][FOLLOW-UPS] Delete unused file `test_parity_template.py`

2023-05-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c2384cebc3d [SPARK-42859][PS][TESTS][FOLLOW-UPS] Delete unused file 
`test_parity_template.py`
c2384cebc3d is described below

commit c2384cebc3dc49753dc2031a30e2b11e944bc637
Author: Ruifeng Zheng 
AuthorDate: Fri May 26 09:30:30 2023 +0800

[SPARK-42859][PS][TESTS][FOLLOW-UPS] Delete unused file 
`test_parity_template.py`

### What changes were proposed in this pull request?
Delete unused file `test_parity_template.py`

### Why are the changes needed?
it is not used in CI (not listed in `modules.py`)

### Does this PR introduce _any_ user-facing change?
No, test-only

### How was this patch tested?
CI

Closes #41322 from zhengruifeng/del_unused_test.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .../pandas/tests/connect/test_parity_template.py   | 37 --
 1 file changed, 37 deletions(-)

diff --git a/python/pyspark/pandas/tests/connect/test_parity_template.py 
b/python/pyspark/pandas/tests/connect/test_parity_template.py
deleted file mode 100644
index 6f8c98e26e2..000
--- a/python/pyspark/pandas/tests/connect/test_parity_template.py
+++ /dev/null
@@ -1,37 +0,0 @@
-#
-# Licensed to the Apache Software Foundation (ASF) under one or more
-# contributor license agreements.  See the NOTICE file distributed with
-# this work for additional information regarding copyright ownership.
-# The ASF licenses this file to You under the Apache License, Version 2.0
-# (the "License"); you may not use this file except in compliance with
-# the License.  You may obtain a copy of the License at
-#
-#http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-#
-import unittest
-
-from pyspark.pandas.tests.test_dataframe import DataFrameTestsMixin
-from pyspark.testing.connectutils import ReusedConnectTestCase
-from pyspark.testing.pandasutils import PandasOnSparkTestUtils
-
-
-class DataFrameParityTests(DataFrameTestsMixin, PandasOnSparkTestUtils, 
ReusedConnectTestCase):
-pass
-
-
-if __name__ == "__main__":
-from pyspark.pandas.tests.connect.test_parity_dataframe import *  # noqa: 
F401
-
-try:
-import xmlrunner  # type: ignore[import]
-
-testRunner = xmlrunner.XMLTestRunner(output="target/test-reports", 
verbosity=2)
-except ImportError:
-testRunner = None
-unittest.main(testRunner=testRunner, verbosity=2)





[spark] branch master updated (3a6d2153b93 -> cf415cb6625)

2023-05-25 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 3a6d2153b93 [SPARK-43749][SPARK-43750][SQL] Assign names to the error 
class _LEGACY_ERROR_TEMP_240[4-5]
 add cf415cb6625 [MINOR][DOCS][PS] Move a few `Frame` functions to correct 
categories

No new revisions were added by this update.

Summary of changes:
 python/docs/source/reference/pyspark.pandas/frame.rst | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)





[spark] branch master updated: [SPARK-43749][SPARK-43750][SQL] Assign names to the error class _LEGACY_ERROR_TEMP_240[4-5]

2023-05-25 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3a6d2153b93 [SPARK-43749][SPARK-43750][SQL] Assign names to the error 
class _LEGACY_ERROR_TEMP_240[4-5]
3a6d2153b93 is described below

commit 3a6d2153b93c759b68e5827905d1867ba93ec9cf
Author: Jiaan Geng 
AuthorDate: Thu May 25 20:14:00 2023 +0300

[SPARK-43749][SPARK-43750][SQL] Assign names to the error class 
_LEGACY_ERROR_TEMP_240[4-5]

### What changes were proposed in this pull request?
The pr aims to assign a name to the error class _LEGACY_ERROR_TEMP_240[4-5].
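
For context, a hedged sketch of how the rename surfaces to users (the catalog 
and table names are made up; it assumes a V2 catalog registered as `testcat`):

```scala
import org.apache.spark.sql.AnalysisException

try {
  // SHOW PARTITIONS on a non-partitioned V2 table, or on a table whose
  // catalog does not implement partition management.
  spark.sql("SHOW PARTITIONS testcat.ns.tbl")
} catch {
  case e: AnalysisException =>
    // Previously _LEGACY_ERROR_TEMP_2404 / _LEGACY_ERROR_TEMP_2405; now:
    //   INVALID_PARTITION_OPERATION.PARTITION_SCHEMA_IS_EMPTY
    //   INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED
    println(e.getErrorClass)
}
```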

### Why are the changes needed?
Improve the error framework.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
N/A

Closes #41279 from beliefer/INVALID_PARTITION_OPERATION.

Authored-by: Jiaan Geng 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   | 29 +--
 .../sql/catalyst/analysis/CheckAnalysis.scala  |  8 ++---
 .../command/ShowPartitionsSuiteBase.scala  | 12 ---
 .../execution/command/v1/ShowPartitionsSuite.scala | 18 ++
 .../command/v2/AlterTableAddPartitionSuite.scala   | 20 ---
 .../command/v2/AlterTableDropPartitionSuite.scala  | 19 +++---
 .../execution/command/v2/ShowPartitionsSuite.scala | 41 +++---
 .../execution/command/v2/TruncateTableSuite.scala  | 20 ---
 8 files changed, 122 insertions(+), 45 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 1ccbdfdc6eb..7683e7b8650 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -1156,6 +1156,23 @@
 },
 "sqlState" : "22023"
   },
+  "INVALID_PARTITION_OPERATION" : {
+"message" : [
+  "The partition command is invalid."
+],
+"subClass" : {
+  "PARTITION_MANAGEMENT_IS_UNSUPPORTED" : {
+"message" : [
+  "Table  does not support partition management."
+]
+  },
+  "PARTITION_SCHEMA_IS_EMPTY" : {
+"message" : [
+  "Table  is not partitioned."
+]
+  }
+}
+  },
   "INVALID_PROPERTY_KEY" : {
 "message" : [
   " is an invalid property key, please use quotes, e.g. SET 
=."
@@ -5374,16 +5391,6 @@
   "failed to evaluate expression : "
 ]
   },
-  "_LEGACY_ERROR_TEMP_2404" : {
-"message" : [
-  "Table  is not partitioned."
-]
-  },
-  "_LEGACY_ERROR_TEMP_2405" : {
-"message" : [
-  "Table  does not support partition management."
-]
-  },
   "_LEGACY_ERROR_TEMP_2406" : {
 "message" : [
   "invalid cast from  to ."
@@ -5772,4 +5779,4 @@
   "Failed to get block , which is not a shuffle block"
 ]
   }
-}
\ No newline at end of file
+}
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
index 407a9d363f4..fac3f491200 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala
@@ -211,13 +211,13 @@ trait CheckAnalysis extends PredicateHelper with 
LookupCatalog with QueryErrorsB
 case t: SupportsPartitionManagement =>
   if (t.partitionSchema.isEmpty) {
 r.failAnalysis(
-  errorClass = "_LEGACY_ERROR_TEMP_2404",
-  messageParameters = Map("name" -> r.name))
+  errorClass = 
"INVALID_PARTITION_OPERATION.PARTITION_SCHEMA_IS_EMPTY",
+  messageParameters = Map("name" -> toSQLId(r.name)))
   }
 case _ =>
   r.failAnalysis(
-errorClass = "_LEGACY_ERROR_TEMP_2405",
-messageParameters = Map("name" -> r.name))
+errorClass = 
"INVALID_PARTITION_OPERATION.PARTITION_MANAGEMENT_IS_UNSUPPORTED",
+messageParameters = Map("name" -> toSQLId(r.name)))
   }
   case _ =>
 }
diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala
 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala
index 27d2eb98543..462b967a759 100644
--- 
a/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala
+++ 
b/sql/core/src/test/scala/org/apache/spark/sql/execution/command/ShowPartitionsSuiteBase.scala
@@ -63,18 +63,6 @@ trait ShowPartitionsSuiteBase extends QueryTest with 
DDLCommandTestUtils {
   .saveAsTable(table)
   }
 
-  test("show partitions of non-partitioned 

[spark] branch master updated: [SPARK-43786][SQL][TESTS] Add a test for nullability about 'levenshtein' function

2023-05-25 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 295f540a92f [SPARK-43786][SQL][TESTS] Add a test for nullability about 
'levenshtein' function
295f540a92f is described below

commit 295f540a92f9a4bde1da1244901b844223777a78
Author: panbingkun 
AuthorDate: Thu May 25 15:34:25 2023 +0300

[SPARK-43786][SQL][TESTS] Add a test for nullability about 'levenshtein' 
function

### What changes were proposed in this pull request?
The pr aims to add a test for nullability of the 'levenshtein' function.

### Why are the changes needed?
Make the tests for the 'levenshtein' function more robust.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Pass GA.
- Manual testing

Closes #41303 from panbingkun/SPARK-43786.

Authored-by: panbingkun 
Signed-off-by: Max Gekk 
---
 .../src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala  | 6 ++
 1 file changed, 6 insertions(+)

diff --git 
a/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala 
b/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
index e887c570944..f612c5903dc 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
@@ -129,12 +129,18 @@ class StringFunctionsSuite extends QueryTest with 
SharedSparkSession {
 val df = Seq(("kitten", "sitting"), ("frog", "fog")).toDF("l", "r")
 checkAnswer(df.select(levenshtein($"l", $"r")), Seq(Row(3), Row(1)))
 checkAnswer(df.selectExpr("levenshtein(l, r)"), Seq(Row(3), Row(1)))
+checkAnswer(df.select(levenshtein($"l", lit(null))), Seq(Row(null), 
Row(null)))
+checkAnswer(df.selectExpr("levenshtein(l, null)"), Seq(Row(null), 
Row(null)))
 
 checkAnswer(df.select(levenshtein($"l", $"r", 3)), Seq(Row(3), Row(1)))
 checkAnswer(df.selectExpr("levenshtein(l, r, 3)"), Seq(Row(3), Row(1)))
+checkAnswer(df.select(levenshtein(lit(null), $"r", 3)), Seq(Row(null), 
Row(null)))
+checkAnswer(df.selectExpr("levenshtein(null, r, 3)"), Seq(Row(null), 
Row(null)))
 
 checkAnswer(df.select(levenshtein($"l", $"r", 0)), Seq(Row(-1), Row(-1)))
 checkAnswer(df.selectExpr("levenshtein(l, r, 0)"), Seq(Row(-1), Row(-1)))
+checkAnswer(df.select(levenshtein($"l", lit(null), 0)), Seq(Row(null), 
Row(null)))
+checkAnswer(df.selectExpr("levenshtein(l, null, 0)"), Seq(Row(null), 
Row(null)))
   }
 
   test("string regex_replace / regex_extract") {





[spark] branch master updated: [SPARK-43774][BUILD] Upgrade FasterXML jackson to 2.15.1

2023-05-25 Thread yangjie01
This is an automated email from the ASF dual-hosted git repository.

yangjie01 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 59f5fde8f5e [SPARK-43774][BUILD] Upgrade FasterXML jackson to 2.15.1
59f5fde8f5e is described below

commit 59f5fde8f5e4de11ab9778115db4f0a48a78295e
Author: Doolan_R 
AuthorDate: Thu May 25 20:28:13 2023 +0800

[SPARK-43774][BUILD] Upgrade FasterXML jackson to 2.15.1

### What changes were proposed in this pull request?
Upgrade FasterXML jackson from 2.15.0 to 2.15.1

### Why are the changes needed?
New version that fixes some bugs; the full release notes are as follows:

- https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.15.1

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GA

Closes #41281 from ronandoolan2/master.

Lead-authored-by: Doolan_R 
Co-authored-by: Ronan Doolan 
Signed-off-by: yangjie01 
---
 dev/deps/spark-deps-hadoop-3-hive-2.3 | 14 +++---
 pom.xml   |  4 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/dev/deps/spark-deps-hadoop-3-hive-2.3 
b/dev/deps/spark-deps-hadoop-3-hive-2.3
index fa870c7240f..9f6a8f2573b 100644
--- a/dev/deps/spark-deps-hadoop-3-hive-2.3
+++ b/dev/deps/spark-deps-hadoop-3-hive-2.3
@@ -98,13 +98,13 @@ httpcore/4.4.16//httpcore-4.4.16.jar
 ini4j/0.5.4//ini4j-0.5.4.jar
 istack-commons-runtime/3.0.8//istack-commons-runtime-3.0.8.jar
 ivy/2.5.1//ivy-2.5.1.jar
-jackson-annotations/2.15.0//jackson-annotations-2.15.0.jar
-jackson-core/2.15.0//jackson-core-2.15.0.jar
-jackson-databind/2.15.0//jackson-databind-2.15.0.jar
-jackson-dataformat-cbor/2.15.0//jackson-dataformat-cbor-2.15.0.jar
-jackson-dataformat-yaml/2.15.0//jackson-dataformat-yaml-2.15.0.jar
-jackson-datatype-jsr310/2.15.0//jackson-datatype-jsr310-2.15.0.jar
-jackson-module-scala_2.12/2.15.0//jackson-module-scala_2.12-2.15.0.jar
+jackson-annotations/2.15.1//jackson-annotations-2.15.1.jar
+jackson-core/2.15.1//jackson-core-2.15.1.jar
+jackson-databind/2.15.1//jackson-databind-2.15.1.jar
+jackson-dataformat-cbor/2.15.1//jackson-dataformat-cbor-2.15.1.jar
+jackson-dataformat-yaml/2.15.1//jackson-dataformat-yaml-2.15.1.jar
+jackson-datatype-jsr310/2.15.1//jackson-datatype-jsr310-2.15.1.jar
+jackson-module-scala_2.12/2.15.1//jackson-module-scala_2.12-2.15.1.jar
 jakarta.annotation-api/1.3.5//jakarta.annotation-api-1.3.5.jar
 jakarta.inject/2.6.1//jakarta.inject-2.6.1.jar
 jakarta.servlet-api/4.0.3//jakarta.servlet-api-4.0.3.jar
diff --git a/pom.xml b/pom.xml
index 6fe9b7b8701..bc6a49c44c2 100644
--- a/pom.xml
+++ b/pom.xml
@@ -179,8 +179,8 @@
 true
 true
 1.9.13
-2.15.0
-
2.15.0
+2.15.1
+
2.15.1
 1.1.10.0
 3.0.3
 1.15





[spark] branch master updated (71d59a85081 -> 6cc69ba6579)

2023-05-25 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 71d59a85081 [SPARK-43768][PYTHON][CONNECT] Python dependency 
management support in Python Spark Connect
 add 6cc69ba6579 [SPARK-43785][SQL][DOC] Improve the document of 
GenTPCDSData, so that developers could easily generate TPCDS table data

No new revisions were added by this update.

Summary of changes:
 sql/core/src/test/scala/org/apache/spark/sql/GenTPCDSData.scala | 4 
 1 file changed, 4 insertions(+)





[spark-docker] branch master updated: [SPARK-43367] Recover sh in dockerfile

2023-05-25 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new ce3e122  [SPARK-43367] Recover sh in dockerfile
ce3e122 is described below

commit ce3e12266ef82264b814f6f7823165f7c7ae215a
Author: Yikun Jiang 
AuthorDate: Thu May 25 19:07:55 2023 +0800

[SPARK-43367] Recover sh in dockerfile

### What changes were proposed in this pull request?
Recover `sh`. We removed `sh` due to 
https://github.com/apache-spark-on-k8s/spark/pull/444/files#r134075892 ; now the 
`SPARK_DRIVER_JAVA_OPTS`-related code has already moved to `entrypoint.sh` with 
`#!/bin/bash`, so we no longer need this hack.

See also:
[1] 
https://github.com/docker-library/official-images/pull/13089#issuecomment-1533540388
[2] 
https://github.com/docker-library/official-images/pull/13089#issuecomment-1561793792

### Why are the changes needed?
Recover sh

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed

Closes #41 from Yikun/SPARK-43367.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 3.4.0/scala2.12-java11-ubuntu/Dockerfile | 2 --
 Dockerfile.template  | 2 --
 2 files changed, 4 deletions(-)

diff --git a/3.4.0/scala2.12-java11-ubuntu/Dockerfile 
b/3.4.0/scala2.12-java11-ubuntu/Dockerfile
index 11f997f..205b399 100644
--- a/3.4.0/scala2.12-java11-ubuntu/Dockerfile
+++ b/3.4.0/scala2.12-java11-ubuntu/Dockerfile
@@ -32,8 +32,6 @@ RUN set -ex; \
 chmod g+w /opt/spark/work-dir; \
 touch /opt/spark/RELEASE; \
 chown -R spark:spark /opt/spark; \
-rm /bin/sh; \
-ln -sv /bin/bash /bin/sh; \
 echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su; \
 chgrp root /etc/passwd && chmod ug+rw /etc/passwd; \
 rm -rf /var/cache/apt/*; \
diff --git a/Dockerfile.template b/Dockerfile.template
index 6e85cd3..8b13e4a 100644
--- a/Dockerfile.template
+++ b/Dockerfile.template
@@ -32,8 +32,6 @@ RUN set -ex; \
 chmod g+w /opt/spark/work-dir; \
 touch /opt/spark/RELEASE; \
 chown -R spark:spark /opt/spark; \
-rm /bin/sh; \
-ln -sv /bin/bash /bin/sh; \
 echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su; \
 chgrp root /etc/passwd && chmod ug+rw /etc/passwd; \
 rm -rf /var/cache/apt/*; \





[spark-docker] branch master updated: [SPARK-43793] Fix SPARK_EXECUTOR_JAVA_OPTS assignment bug

2023-05-25 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new 006e8fa  [SPARK-43793] Fix SPARK_EXECUTOR_JAVA_OPTS assignment bug
006e8fa is described below

commit 006e8fade69f148a05fc73f591f52c7678e48f04
Author: Yikun Jiang 
AuthorDate: Thu May 25 19:05:26 2023 +0800

[SPARK-43793] Fix SPARK_EXECUTOR_JAVA_OPTS assignment bug

### What changes were proposed in this pull request?
The previous code is susceptible to a few bugs, particularly around 
newlines in values.
```
env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > 
/tmp/java_opts.txt
readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
```

### Why are the changes needed?
To address DOI comments: 
https://github.com/docker-library/official-images/pull/13089#issuecomment-1533540388
 , question 6.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. Tested manually
```
export SPARK_JAVA_OPT_0="foo=bar"
export SPARK_JAVA_OPT_1="foo1=bar1"

for v in "${!SPARK_JAVA_OPT_}"; do
SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
done

for v in "${SPARK_EXECUTOR_JAVA_OPTS[@]}"; do
echo $v
done

# foo=bar
# foo1=bar1
```
2. CI passed

Closes #42 from Yikun/SPARK-43793.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 3.4.0/scala2.12-java11-ubuntu/entrypoint.sh | 5 +++--
 entrypoint.sh.template  | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh 
b/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh
index 4bb1557..716f1af 100755
--- a/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh
+++ b/3.4.0/scala2.12-java11-ubuntu/entrypoint.sh
@@ -38,8 +38,9 @@ if [ -z "$JAVA_HOME" ]; then
 fi
 
 SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
-env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > 
/tmp/java_opts.txt
-readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+for v in "${!SPARK_JAVA_OPT_@}"; do
+SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
+done
 
 if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
   SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"
diff --git a/entrypoint.sh.template b/entrypoint.sh.template
index 4bb1557..716f1af 100644
--- a/entrypoint.sh.template
+++ b/entrypoint.sh.template
@@ -38,8 +38,9 @@ if [ -z "$JAVA_HOME" ]; then
 fi
 
 SPARK_CLASSPATH="$SPARK_CLASSPATH:${SPARK_HOME}/jars/*"
-env | grep SPARK_JAVA_OPT_ | sort -t_ -k4 -n | sed 's/[^=]*=\(.*\)/\1/g' > 
/tmp/java_opts.txt
-readarray -t SPARK_EXECUTOR_JAVA_OPTS < /tmp/java_opts.txt
+for v in "${!SPARK_JAVA_OPT_@}"; do
+SPARK_EXECUTOR_JAVA_OPTS+=( "${!v}" )
+done
 
 if [ -n "$SPARK_EXTRA_CLASSPATH" ]; then
   SPARK_CLASSPATH="$SPARK_CLASSPATH:$SPARK_EXTRA_CLASSPATH"





[spark] branch master updated: [SPARK-43768][PYTHON][CONNECT] Python dependency management support in Python Spark Connect

2023-05-25 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 71d59a85081 [SPARK-43768][PYTHON][CONNECT] Python dependency 
management support in Python Spark Connect
71d59a85081 is described below

commit 71d59a85081f20cd179f5282e19aebcefa59174b
Author: Hyukjin Kwon 
AuthorDate: Thu May 25 20:03:37 2023 +0900

[SPARK-43768][PYTHON][CONNECT] Python dependency management support in 
Python Spark Connect

### What changes were proposed in this pull request?

This PR proposes to add the support of archive (`.zip`, `.jar`, `.tar.gz`, 
`.tgz`, or `.tar` files) in `SparkSession.addArtifacts` so we can support 
Python dependency management in Python Spark Connect.

### Why are the changes needed?

In order for end users to add the dependencies and archive files in Python 
Spark Connect client.

This PR enables the Python dependency management 
(https://www.databricks.com/blog/2020/12/22/how-to-manage-python-dependencies-in-pyspark.html)
 usecase in Spark Connect.

See below how to do this with Spark Connect Python client:

 Precondition

Assume that we have a Spark Connect server already running, e.g., by:

```bash
./sbin/start-connect-server.sh --jars `ls 
connector/connect/server/target/**/spark-connect*SNAPSHOT.jar` --master 
"local-cluster[2,2,1024]"
```

and assume that you already have a dev env:

```bash
# Notice that you should install `conda-pack`.
conda create -y -n pyspark_conda_env -c conda-forge conda-pack python=3.9
conda activate pyspark_conda_env
pip install --upgrade -r dev/requirements.txt
```

 Dependency management

```python
./bin/pyspark --remote "sc://localhost:15002"
```

```python
import conda_pack
import os
# Pack the current environment ('pyspark_conda_env') to 
'pyspark_conda_env.tar.gz'.
# Or you can run 'conda pack' in your shell.
conda_pack.pack()

spark.addArtifact(f"{os.environ.get('CONDA_DEFAULT_ENV')}.tar.gz#environment", 
archive=True)
spark.conf.set("spark.sql.execution.pyspark.python", 
"environment/bin/python")
# From now on, Python workers on executors use `pyspark_conda_env` Conda 
environment.
```

Run your Python UDFs

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

pandas_udf("long")
def plug_one(s: pd.Series) -> pd.Series:
return s + 1

spark.range(10).select(plug_one("id")).show()
```

### Does this PR introduce _any_ user-facing change?

Yes, it adds the support of archive (`.zip`, `.jar`, `.tar.gz`, `.tgz`, or 
`.tar` files) in `SparkSession.addArtifacts`.

### How was this patch tested?

Manually tested as described above, and added a unittest.

Also, manually tested with `local-cluster` mode with the code below:

Also verified via:

```python
import sys
from pyspark.sql.functions import udf

spark.range(1).select(udf(lambda x: 
sys.executable)("id")).show(truncate=False)
```
```
++
|<lambda>(id)|
++
|/.../spark/work/app-20230524132024-/1/environment/bin/python|
++
```

Closes #41292 from HyukjinKwon/python-addArchive.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../artifact/SparkConnectArtifactManager.scala | 22 ++---
 .../service/SparkConnectAddArtifactsHandler.scala  | 19 +++-
 .../connect/artifact/ArtifactManagerSuite.scala| 12 ++---
 python/pyspark/sql/connect/client/artifact.py  | 52 +-
 python/pyspark/sql/connect/client/core.py  |  4 +-
 python/pyspark/sql/connect/session.py  | 11 -
 .../sql/tests/connect/client/test_artifact.py  | 44 +++---
 7 files changed, 130 insertions(+), 34 deletions(-)

diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
index 7a36c46c672..604108f68d2 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/artifact/SparkConnectArtifactManager.scala
@@ -17,9 +17,11 @@
 
 package org.apache.spark.sql.connect.artifact
 
+import java.io.File
 import java.net.{URL, URLClassLoader}
 import 

[spark] branch master updated (6fc5f4eeaf8 -> eca0bef9329)

2023-05-25 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 6fc5f4eeaf8 [SPARK-43596][SQL] Handle IsNull predicate in 
rewriteDomainJoins
 add eca0bef9329 [SPARK-43789][R] Uses 
'spark.sql.execution.arrow.maxRecordsPerBatch' in R createDataFrame with Arrow 
by default

No new revisions were added by this update.

Summary of changes:
 R/pkg/R/SQLContext.R|  4 +++-
 R/pkg/tests/fulltests/test_sparkSQL_arrow.R | 15 +++
 2 files changed, 18 insertions(+), 1 deletion(-)





[spark] branch master updated (0db1f002c09 -> 6fc5f4eeaf8)

2023-05-25 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0db1f002c09 [SPARK-43549][SQL] Convert _LEGACY_ERROR_TEMP_0036 to 
INVALID_SQL_SYNTAX.ANALYZE_TABLE_UNEXPECTED_NOSCAN
 add 6fc5f4eeaf8 [SPARK-43596][SQL] Handle IsNull predicate in 
rewriteDomainJoins

No new revisions were added by this update.

Summary of changes:
 .../catalyst/optimizer/DecorrelateInnerQuery.scala |  4 +++
 .../scalar-subquery/scalar-subquery-select.sql.out | 32 ++
 .../scalar-subquery/scalar-subquery-select.sql |  7 +
 .../scalar-subquery/scalar-subquery-select.sql.out | 26 ++
 4 files changed, 69 insertions(+)





[spark-docker] branch master updated: [SPARK-43365][FOLLWUP] Refactor publish workflow based on base image

2023-05-25 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new f2d2b2d  [SPARK-43365][FOLLWUP] Refactor publish workflow based on 
base image
f2d2b2d is described below

commit f2d2b2d1ffbb951aed29221a040861327c09441e
Author: Yikun Jiang 
AuthorDate: Thu May 25 16:13:44 2023 +0800

[SPARK-43365][FOLLWUP] Refactor publish workflow based on base image

### What changes were proposed in this pull request?
- This patch changes the `build-args` approach to `patch in test` in the build 
and publish workflows, because docker official images do not support 
**parameterized FROM** values. 
https://github.com/docker-library/official-images/pull/13089#issuecomment-1555352902
- It also refactors the publish workflow:

![image](https://user-images.githubusercontent.com/1736354/236613626-96f8fbf6-7df7-4d10-b4fb-be4d57c56dce.png)
### Why are the changes needed?
Same change as the build workflow refactor, to avoid publish issues like:
```
#5 [linux/amd64 internal] load metadata for 
docker.io/library/spark:3.4.0-scala2.12-java11-ubuntu
#5 ERROR: pull access denied, repository does not exist or may require 
authorization: server message: insufficient_scope: authorization failed
--
 > [linux/amd64 internal] load metadata for 
docker.io/library/spark:3.4.0-scala2.12-java11-ubuntu:
--
Dockerfile:18

  16 | #
  17 | ARG BASE_IMAGE=spark:3.4.0-scala2.12-java11-ubuntu
  18 | >>> FROM $BASE_IMAGE
  19 |
  20 | RUN set -ex && \

ERROR: failed to solve: spark:3.4.0-scala2.12-java11-ubuntu: pull access 
denied, repository does not exist or may require authorization: server message: 
insufficient_scope: authorization failed
Error: buildx failed with: ERROR: failed to solve: 
spark:3.4.0-scala2.12-java11-ubuntu: pull access denied, repository does not 
exist or may require authorization: server message: insufficient_scope: 
authorization failed
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Publish test in my local fork:
- 
https://github.com/Yikun/spark-docker/actions/runs/5076986823/jobs/9120029759: 
skips the local base build and uses the [published 
base](https://github.com/Yikun/spark-docker/actions/runs/5076986823/jobs/9120029759#step:11:135)
 image:


![image](https://user-images.githubusercontent.com/1736354/236612540-2b454c14-e194-4d73-b859-0df001570d27.png)

```
#3 [linux/amd64 internal] load metadata for 
ghcr.io/yikun/spark-docker/spark:3.4.0-scala2.12-java11-ubuntu
#3 DONE 0.9s

#4 [linux/arm64 internal] load metadata for 
ghcr.io/yikun/spark-docker/spark:3.4.0-scala2.12-java11-ubuntu
#4 DONE 0.9s
```

- CI passed: does the local base build first and then builds based on the local build

Closes #39 from Yikun/publish-build.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/main.yml | 21 --
 .github/workflows/publish.yml  | 25 +-
 3.4.0/scala2.12-java11-python3-r-ubuntu/Dockerfile |  3 +--
 3.4.0/scala2.12-java11-python3-ubuntu/Dockerfile   |  3 +--
 3.4.0/scala2.12-java11-r-ubuntu/Dockerfile |  3 +--
 r-python.template  |  3 +--
 6 files changed, 47 insertions(+), 11 deletions(-)

diff --git a/.github/workflows/main.yml b/.github/workflows/main.yml
index c1d0c56..870c8c7 100644
--- a/.github/workflows/main.yml
+++ b/.github/workflows/main.yml
@@ -107,6 +107,9 @@ jobs:
 TEST_REPO=${{ inputs.repository }}
 UNIQUE_IMAGE_TAG=${{ inputs.image-tag }}
   fi
+
+  # We can't use the real image for build because we haven't publish 
the image yet.
+  # The base image for build, it's something like 
localhost:5000/$REPO_OWNER/spark-docker/spark:3.3.0-scala2.12-java11-ubuntu
   BASE_IMAGE_URL=$TEST_REPO/$IMAGE_NAME:$BASE_IMGAE_TAG
   IMAGE_URL=$TEST_REPO/$IMAGE_NAME:$UNIQUE_IMAGE_TAG
 
@@ -157,7 +160,8 @@ jobs:
   driver-opts: network=host
 
   - name: Build - Build the base image
-if: ${{ inputs.build }}
+# Don't need to build the base image when publish
+if: ${{ inputs.build && !inputs.publish }}
 uses: docker/build-push-action@v3
 with:
   context: ${{ env.BASE_IMAGE_PATH }}
@@ -165,11 +169,24 @@ jobs:
   platforms: linux/amd64,linux/arm64
   push: true
 
+  - name: Build - Use the test image repo when build
+# Don't need to build the base image when publish
+if: ${{ inputs.build && !inputs.publish }}
+working-directory: ${{ env.IMAGE_PATH }}
+run: |
+ 

[spark] branch master updated (46949e692e8 -> 0db1f002c09)

2023-05-25 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 46949e692e8 [SPARK-43545][SQL][PYTHON] Support nested timestamp type
 add 0db1f002c09 [SPARK-43549][SQL] Convert _LEGACY_ERROR_TEMP_0036 to 
INVALID_SQL_SYNTAX.ANALYZE_TABLE_UNEXPECTED_NOSCAN

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json |  5 +
 .../org/apache/spark/sql/errors/QueryParsingErrors.scala |  4 ++--
 .../apache/spark/sql/catalyst/parser/DDLParserSuite.scala| 12 ++--
 3 files changed, 13 insertions(+), 8 deletions(-)

