[spark] branch master updated (88fc48f5e7e -> 5a762687f41)

2022-12-23 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 88fc48f5e7e [SPARK-41431][CORE][SQL][UI] Protobuf serializer for 
`SQLExecutionUIData`
 add 5a762687f41 [SPARK-41430][UI] Protobuf serializer for 
ProcessSummaryWrapper

No new revisions were added by this update.

Summary of changes:
 .../apache/spark/status/protobuf/store_types.proto | 14 
 .../org.apache.spark.status.protobuf.ProtobufSerDe |  1 +
 .../protobuf/ProcessSummaryWrapperSerializer.scala | 77 ++
 .../protobuf/KVStoreProtobufSerializerSuite.scala  | 27 
 4 files changed, 119 insertions(+)
 create mode 100644 
core/src/main/scala/org/apache/spark/status/protobuf/ProcessSummaryWrapperSerializer.scala


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41431][CORE][SQL][UI] Protobuf serializer for `SQLExecutionUIData`

2022-12-23 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 88fc48f5e7e [SPARK-41431][CORE][SQL][UI] Protobuf serializer for 
`SQLExecutionUIData`
88fc48f5e7e is described below

commit 88fc48f5e7e907c25d082a7b35231744ccef2c7e
Author: yangjie01 
AuthorDate: Fri Dec 23 15:53:40 2022 -0800

[SPARK-41431][CORE][SQL][UI] Protobuf serializer for `SQLExecutionUIData`

### What changes were proposed in this pull request?
Add Protobuf serializer for `SQLExecutionUIData`

### Why are the changes needed?
Support fast and compact serialization/deserialization for 
`SQLExecutionUIData` over RocksDB.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new UT

Closes #39139 from LuciferYang/SPARK-41431.

Authored-by: yangjie01 
Signed-off-by: Gengliang Wang 
---
 .../apache/spark/status/protobuf/store_types.proto | 21 +
 sql/core/pom.xml   |  5 ++
 .../org.apache.spark.status.protobuf.ProtobufSerDe | 18 +
 .../sql/SQLExecutionUIDataSerializer.scala | 90 ++
 .../protobuf/sql/SQLPlanMetricSerializer.scala | 36 +
 .../sql/KVStoreProtobufSerializerSuite.scala   | 88 +
 6 files changed, 258 insertions(+)
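
For readers unfamiliar with the pattern, the sketch below shows the kind of round trip such a serializer performs. It is illustrative only: it assumes the classes generated from store_types.proto are exposed as `StoreTypes.*` (standard protobuf-java builder/parser API) and touches only a few of the fields from the message added below.

```scala
import org.apache.spark.status.protobuf.StoreTypes

// Encode a handful of SQLExecutionUIData fields into the compact binary payload stored in RocksDB.
def toBytes(executionId: Long, description: String, submissionTime: Long): Array[Byte] =
  StoreTypes.SQLExecutionUIData.newBuilder()
    .setExecutionId(executionId)
    .setDescription(description)
    .setSubmissionTime(submissionTime)
    .addMetrics(StoreTypes.SQLPlanMetric.newBuilder()
      .setName("number of output rows")
      .setAccumulatorId(1L)
      .setMetricType("sum"))
    .build()
    .toByteArray

// Decode the payload and read the same fields back.
def fromBytes(bytes: Array[Byte]): (Long, String, Long) = {
  val data = StoreTypes.SQLExecutionUIData.parseFrom(bytes)
  (data.getExecutionId, data.getDescription, data.getSubmissionTime)
}
```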

diff --git 
a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto 
b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
index 7cf5c2921cb..cb0dea540bd 100644
--- a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
+++ b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
@@ -355,3 +355,24 @@ message ExecutorSummary {
 message ExecutorSummaryWrapper {
   ExecutorSummary info = 1;
 }
+
+message SQLPlanMetric {
+  string name = 1;
+  int64 accumulator_id = 2;
+  string metric_type = 3;
+}
+
+message SQLExecutionUIData {
+  int64 execution_id = 1;
+  string description = 2;
+  string details = 3;
+  string physical_plan_description = 4;
+  map<string, string> modified_configs = 5;
+  repeated SQLPlanMetric metrics = 6;
+  int64 submission_time = 7;
+  optional int64 completion_time = 8;
+  optional string error_message = 9;
+  map<int64, JobExecutionStatus> jobs = 10;
+  repeated int64 stages = 11;
+  map<int64, string> metric_values = 12;
+}
diff --git a/sql/core/pom.xml b/sql/core/pom.xml
index cfcf7455ad0..71c57f8a7f7 100644
--- a/sql/core/pom.xml
+++ b/sql/core/pom.xml
@@ -147,6 +147,11 @@
   org.apache.xbean
   xbean-asm9-shaded
 
+
+  com.google.protobuf
+  protobuf-java
+  ${protobuf.version}
+
 
   org.scalacheck
   scalacheck_${scala.binary.version}
diff --git 
a/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
 
b/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
new file mode 100644
index 000..de5f2c2d05c
--- /dev/null
+++ 
b/sql/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
@@ -0,0 +1,18 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+org.apache.spark.status.protobuf.sql.SQLExecutionUIDataSerializer
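
The single entry above is a standard `java.util.ServiceLoader` registration: implementations of `ProtobufSerDe` listed under META-INF/services are discovered at runtime. A minimal sketch of that discovery mechanism (the actual loading site inside Spark's status store is assumed here, not shown in this diff):

```scala
import java.util.ServiceLoader
import org.apache.spark.status.protobuf.ProtobufSerDe

// Iterate over every implementation registered via META-INF/services on the classpath.
val loader = ServiceLoader.load(classOf[ProtobufSerDe])
loader.forEach(serde => println(serde.getClass.getName))
// With sql/core on the classpath this now includes SQLExecutionUIDataSerializer.
```
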
diff --git 
a/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
 
b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
new file mode 100644
index 000..8dc28517ff0
--- /dev/null
+++ 
b/sql/core/src/main/scala/org/apache/spark/status/protobuf/sql/SQLExecutionUIDataSerializer.scala
@@ -0,0 +1,90 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *

[spark] branch master updated: [MINOR][SQL] Remove redundant comment in CTESubstitution

2022-12-23 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new f8ad5e307b4 [MINOR][SQL] Remove redundant comment in CTESubstitution
f8ad5e307b4 is described below

commit f8ad5e307b4686f4c22dc47ba8dc2866fe55476b
Author: Reynold Xin 
AuthorDate: Sat Dec 24 08:37:49 2022 +0900

[MINOR][SQL] Remove redundant comment in CTESubstitution

### What changes were proposed in this pull request?
I think the proposed removal is just a duplicate of the next point. I 
actually spent a couple of minutes trying to understand whether there was a difference.

### Why are the changes needed?
See above.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This is a code comment change.

Closes #39197 from rxin/cte_substitution_comment.

Authored-by: Reynold Xin 
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala  | 8 
 1 file changed, 8 deletions(-)

diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
index 6a4562450b9..77c687843c3 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CTESubstitution.scala
@@ -157,14 +157,6 @@ object CTESubstitution extends Rule[LogicalPlan] {
*   SELECT * FROM t
* )
*   SELECT * FROM t2
-   * - If a CTE definition contains a subquery that contains an inner WITH 
node then substitution
-   *   of inner should take precedence because it can shadow an outer CTE 
definition.
-   *   For example the following query should return 2:
-   *   WITH t AS (SELECT 1 AS c)
-   *   SELECT max(c) FROM (
-   * WITH t AS (SELECT 2 AS c)
-   * SELECT * FROM t
-   *   )
* - If a CTE definition contains a subquery expression that contains an 
inner WITH node then
*   substitution of inner should take precedence because it can shadow an 
outer CTE
*   definition.


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch branch-3.3 updated: [MINOR][TEST][SQL] Add a CTE subquery scope test case

2022-12-23 Thread rxin
This is an automated email from the ASF dual-hosted git repository.

rxin pushed a commit to branch branch-3.3
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/branch-3.3 by this push:
 new aa39b06462a [MINOR][TEST][SQL] Add a CTE subquery scope test case
aa39b06462a is described below

commit aa39b06462a98f37be59e239d12edd9f09a25b88
Author: Reynold Xin 
AuthorDate: Fri Dec 23 14:55:14 2022 -0800

[MINOR][TEST][SQL] Add a CTE subquery scope test case

### What changes were proposed in this pull request?
I noticed we were missing a test case for this in SQL tests, so I added one.

### Why are the changes needed?
To ensure we scope CTEs properly in subqueries.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This is a test case change.

Closes #39189 from rxin/cte_test.

Authored-by: Reynold Xin 
Signed-off-by: Reynold Xin 
(cherry picked from commit 24edf8ecb5e47af294f89552dfd9957a2d9f193b)
Signed-off-by: Reynold Xin 
---
 .../test/resources/sql-tests/inputs/cte-nested.sql | 10 
 .../resources/sql-tests/results/cte-legacy.sql.out | 28 ++
 .../resources/sql-tests/results/cte-nested.sql.out | 28 ++
 .../sql-tests/results/cte-nonlegacy.sql.out| 28 ++
 4 files changed, 94 insertions(+)
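
As a quick illustration of the scoping rule this test pins down (a sketch against a plain `spark` session, not part of the patch): a CTE defined inside a subquery is visible only within that subquery, so referencing it from the outer query fails analysis.

```scala
// Resolves: `cte` is referenced inside the subquery that defines it.
spark.sql(
  """SELECT * FROM (
    |  WITH cte AS (SELECT * FROM range(10))
    |  SELECT * FROM cte WHERE id = 8
    |) a""".stripMargin).show()

// Fails with TABLE_OR_VIEW_NOT_FOUND: `cte` is out of scope in the outer query,
// which is exactly what the added .sql.out files assert.
spark.sql("SELECT * FROM cte").show()
```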

diff --git a/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql 
b/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql
index 5f12388b9cb..e5ef2443417 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql
@@ -17,6 +17,16 @@ SELECT (
   SELECT * FROM t
 );
 
+-- Make sure CTE in subquery is scoped to that subquery rather than global
+-- the 2nd half of the union should fail because the cte is scoped to the 
first half
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte;
+
 -- CTE in CTE definition shadows outer
 WITH
   t AS (SELECT 1),
diff --git a/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out
index 264b64ffe96..ebdd64c3ac8 100644
--- a/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out
@@ -36,6 +36,34 @@ struct
 1
 
 
+-- !query
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "TABLE_OR_VIEW_NOT_FOUND",
+  "sqlState" : "42000",
+  "messageParameters" : {
+"relationName" : "`cte`"
+  },
+  "queryContext" : [ {
+"objectType" : "",
+"objectName" : "",
+"startIndex" : 120,
+"stopIndex" : 122,
+"fragment" : "cte"
+  } ]
+}
+
+
 -- !query
 WITH
   t AS (SELECT 1),
diff --git a/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out
index 2c622de3f36..b6e1793f7d7 100644
--- a/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out
@@ -36,6 +36,34 @@ struct
 1
 
 
+-- !query
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "TABLE_OR_VIEW_NOT_FOUND",
+  "sqlState" : "42000",
+  "messageParameters" : {
+"relationName" : "`cte`"
+  },
+  "queryContext" : [ {
+"objectType" : "",
+"objectName" : "",
+"startIndex" : 120,
+"stopIndex" : 122,
+"fragment" : "cte"
+  } ]
+}
+
+
 -- !query
 WITH
   t AS (SELECT 1),
diff --git 
a/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
index 283f5a54a42..546ab7ecb95 100644
--- a/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
@@ -36,6 +36,34 @@ struct
 1
 
 
+-- !query
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "TABLE_OR_VIEW_NOT_FOUND",
+  "sqlState" : "42000",
+  "messageParameters" : {
+"relationName" : "`cte`"
+  },
+  "queryContext" : [ {
+"objectType" : "",
+"objectName" : "",
+"startIndex" : 120,
+"stopIndex" : 122,
+"fragment" : "cte"
+  } ]
+}
+
+
 -- !query
 WITH
   t AS (SELECT 1),



[spark] branch master updated: [MINOR][TEST][SQL] Add a CTE subquery scope test case

2022-12-23 Thread rxin
This is an automated email from the ASF dual-hosted git repository.

rxin pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 24edf8ecb5e [MINOR][TEST][SQL] Add a CTE subquery scope test case
24edf8ecb5e is described below

commit 24edf8ecb5e47af294f89552dfd9957a2d9f193b
Author: Reynold Xin 
AuthorDate: Fri Dec 23 14:55:14 2022 -0800

[MINOR][TEST][SQL] Add a CTE subquery scope test case

### What changes were proposed in this pull request?
I noticed we were missing a test case for this in SQL tests, so I added one.

### Why are the changes needed?
To ensure we scope CTEs properly in subqueries.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
This is a test case change.

Closes #39189 from rxin/cte_test.

Authored-by: Reynold Xin 
Signed-off-by: Reynold Xin 
---
 .../test/resources/sql-tests/inputs/cte-nested.sql | 10 
 .../resources/sql-tests/results/cte-legacy.sql.out | 28 ++
 .../resources/sql-tests/results/cte-nested.sql.out | 28 ++
 .../sql-tests/results/cte-nonlegacy.sql.out| 28 ++
 4 files changed, 94 insertions(+)

diff --git a/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql 
b/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql
index 5f12388b9cb..e5ef2443417 100644
--- a/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/cte-nested.sql
@@ -17,6 +17,16 @@ SELECT (
   SELECT * FROM t
 );
 
+-- Make sure CTE in subquery is scoped to that subquery rather than global
+-- the 2nd half of the union should fail because the cte is scoped to the 
first half
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte;
+
 -- CTE in CTE definition shadows outer
 WITH
   t AS (SELECT 1),
diff --git a/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out
index 013c5f27b50..65000471c75 100644
--- a/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/cte-legacy.sql.out
@@ -33,6 +33,34 @@ struct
 1
 
 
+-- !query
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "TABLE_OR_VIEW_NOT_FOUND",
+  "sqlState" : "42000",
+  "messageParameters" : {
+"relationName" : "`cte`"
+  },
+  "queryContext" : [ {
+"objectType" : "",
+"objectName" : "",
+"startIndex" : 120,
+"stopIndex" : 122,
+"fragment" : "cte"
+  } ]
+}
+
+
 -- !query
 WITH
   t AS (SELECT 1),
diff --git a/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out
index ed6d69b233e..2c67f2db56a 100644
--- a/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/cte-nested.sql.out
@@ -33,6 +33,34 @@ struct
 1
 
 
+-- !query
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "TABLE_OR_VIEW_NOT_FOUND",
+  "sqlState" : "42000",
+  "messageParameters" : {
+"relationName" : "`cte`"
+  },
+  "queryContext" : [ {
+"objectType" : "",
+"objectName" : "",
+"startIndex" : 120,
+"stopIndex" : 122,
+"fragment" : "cte"
+  } ]
+}
+
+
 -- !query
 WITH
   t AS (SELECT 1),
diff --git 
a/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out 
b/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
index 6a48e1bec43..154ebd20223 100644
--- a/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/cte-nonlegacy.sql.out
@@ -33,6 +33,34 @@ struct
 1
 
 
+-- !query
+SELECT * FROM
+  (
+   WITH cte AS (SELECT * FROM range(10))
+   SELECT * FROM cte WHERE id = 8
+  ) a
+UNION
+SELECT * FROM cte
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+{
+  "errorClass" : "TABLE_OR_VIEW_NOT_FOUND",
+  "sqlState" : "42000",
+  "messageParameters" : {
+"relationName" : "`cte`"
+  },
+  "queryContext" : [ {
+"objectType" : "",
+"objectName" : "",
+"startIndex" : 120,
+"stopIndex" : 122,
+"fragment" : "cte"
+  } ]
+}
+
+
 -- !query
 WITH
   t AS (SELECT 1),


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

[spark] branch master updated: [SPARK-41697][CONNECT][TESTS] Enable test_df_show, test_drop, test_dropna, test_toDF_with_schema_string and test_with_columns_renamed

2022-12-23 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 82a22606c59 [SPARK-41697][CONNECT][TESTS] Enable test_df_show, 
test_drop, test_dropna, test_toDF_with_schema_string and 
test_with_columns_renamed
82a22606c59 is described below

commit 82a22606c5951e8d0a9c270595d63d6836f2d51b
Author: Hyukjin Kwon 
AuthorDate: Fri Dec 23 21:52:58 2022 +0900

[SPARK-41697][CONNECT][TESTS] Enable test_df_show, test_drop, test_dropna, 
test_toDF_with_schema_string and test_with_columns_renamed

### What changes were proposed in this pull request?

This PR enables the reused PySpark tests in Spark Connect that pass now.

### Why are the changes needed?

To make sure of the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually ran it locally.

Closes #39193 from HyukjinKwon/SPARK-41697.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../sql/tests/connect/test_parity_dataframe.py | 24 --
 1 file changed, 24 deletions(-)

diff --git a/python/pyspark/sql/tests/connect/test_parity_dataframe.py 
b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
index ccb5dd45b54..7dfdc8de751 100644
--- a/python/pyspark/sql/tests/connect/test_parity_dataframe.py
+++ b/python/pyspark/sql/tests/connect/test_parity_dataframe.py
@@ -71,22 +71,10 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedSQLTestCase):
 def test_create_nan_decimal_dataframe(self):
 super().test_create_nan_decimal_dataframe()
 
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_df_show(self):
-super().test_df_show()
-
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_drop(self):
-super().test_drop()
-
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_drop_duplicates(self):
 super().test_drop_duplicates()
 
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_dropna(self):
-super().test_dropna()
-
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_duplicated_column_names(self):
 super().test_duplicated_column_names()
@@ -99,10 +87,6 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedSQLTestCase):
 def test_fillna(self):
 super().test_fillna()
 
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_freqItems(self):
-super().test_freqItems()
-
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_generic_hints(self):
 super().test_generic_hints()
@@ -163,10 +147,6 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedSQLTestCase):
 def test_to(self):
 super().test_to()
 
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_toDF_with_schema_string(self):
-super().test_toDF_with_schema_string()
-
 @unittest.skip("Fails in Spark Connect, should enable.")
 def test_to_local_iterator(self):
 super().test_to_local_iterator()
@@ -219,10 +199,6 @@ class DataFrameParityTests(DataFrameTestsMixin, 
ReusedSQLTestCase):
 def test_unpivot(self):
 super().test_unpivot()
 
-@unittest.skip("Fails in Spark Connect, should enable.")
-def test_with_columns_renamed(self):
-super().test_with_columns_renamed()
-
 
 if __name__ == "__main__":
 import unittest


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (827ca9b8247 -> 391a268843e)

2022-12-23 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 827ca9b8247 [SPARK-41498] Propagate metadata through Union
 add 391a268843e [SPARK-41698][CONNECT][TESTS] Enable 16 tests in 
pyspark.sql.tests.connect.test_parity_functions

No new revisions were added by this update.

Summary of changes:
 .../sql/tests/connect/test_parity_functions.py | 64 --
 1 file changed, 64 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated (c526741ed09 -> 827ca9b8247)

2022-12-23 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from c526741ed09 [MINOR][DOCS] Fix grammatical error in streaming 
programming guide
 add 827ca9b8247 [SPARK-41498] Propagate metadata through Union

No new revisions were added by this update.

Summary of changes:
 .../spark/sql/catalyst/analysis/Analyzer.scala |  20 +-
 .../plans/logical/basicLogicalOperators.scala  |  58 --
 .../sql/connector/catalog/InMemoryBaseTable.scala  |   2 +-
 .../spark/sql/connector/MetadataColumnSuite.scala  | 231 +
 4 files changed, 291 insertions(+), 20 deletions(-)


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [MINOR][DOCS] Fix grammatical error in streaming programming guide

2022-12-23 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new c526741ed09 [MINOR][DOCS] Fix grammatical error in streaming 
programming guide
c526741ed09 is described below

commit c526741ed095ad88fdfb3f3c5233dcf483d3fe71
Author: Austin Wang <62589566+wangaus...@users.noreply.github.com>
AuthorDate: Fri Dec 23 20:24:48 2022 +0900

[MINOR][DOCS] Fix grammatical error in streaming programming guide

### What changes were proposed in this pull request?

This change fixes a grammatical error in the documentation.

### Why are the changes needed?

There is a grammatical error in the documentation.

### Does this PR introduce _any_ user-facing change?

Yes.
Previously the sentence in question reads "This leads to two kinds of data 
in the system that
need to recovered in the event of failures," but it should instead read 
"This leads to two kinds of data in the system that
need to be recovered in the event of failures."

### How was this patch tested?

Tests were not added. This was a simple text change that can be previewed 
immediately.

Closes #39145 from wangaustin/patch-1.

Authored-by: Austin Wang <62589566+wangaus...@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon 
---
 docs/streaming-programming-guide.md | 38 ++---
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/docs/streaming-programming-guide.md 
b/docs/streaming-programming-guide.md
index 4a104238a6d..0b8e55d84e6 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -725,7 +725,7 @@ of its creation, the new data will be picked up.
 
 In contrast, Object Stores such as Amazon S3 and Azure Storage usually have 
slow rename operations, as the
 data is actually copied.
-Furthermore, renamed object may have the time of the `rename()` operation as 
its modification time, so
+Furthermore, a renamed object may have the time of the `rename()` operation as 
its modification time, so
 may not be considered part of the window which the original create time 
implied they were.
 
 Careful testing is needed against the target object store to verify that the 
timestamp behavior
@@ -1140,7 +1140,7 @@ said two parameters - windowLength and 
slideInterval.
 
  Join Operations
 {:.no_toc}
-Finally, its worth highlighting how easily you can perform different kinds of 
joins in Spark Streaming.
+Finally, it's worth highlighting how easily you can perform different kinds of 
joins in Spark Streaming.
 
 
 # Stream-stream joins
@@ -1236,7 +1236,7 @@ For the Python API, see 
[DStream](api/python/reference/api/pyspark.streaming.DSt
 ***
 
 ## Output Operations on DStreams
-Output operations allow DStream's data to be pushed out to external systems 
like a database or a file systems.
+Output operations allow DStream's data to be pushed out to external systems 
like a database or a file system.
 Since the output operations actually allow the transformed data to be consumed 
by external systems,
 they trigger the actual execution of all the DStream transformations (similar 
to actions for RDDs).
 Currently, the following output operations are defined:
@@ -1293,7 +1293,7 @@ Currently, the following output operations are defined:
 However, it is important to understand how to use this primitive correctly and 
efficiently.
 Some of the common mistakes to avoid are as follows.
 
-Often writing data to external system requires creating a connection object
+Often writing data to external systems requires creating a connection object
 (e.g. TCP connection to a remote server) and using it to send data to a remote 
system.
 For this purpose, a developer may inadvertently try creating a connection 
object at
 the Spark driver, and then try to use it in a Spark worker to save records in 
the RDDs.
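
The remedy the guide goes on to recommend is to open the connection on the executor, once per partition, rather than on the driver. A hedged sketch (the `dstream` and the TCP sink address are assumptions, not part of this doc change):

```scala
import java.io.PrintWriter
import java.net.Socket

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Created on the worker for each partition, so no connection object is shipped from the driver.
    val socket = new Socket("collector.example.com", 9999)
    val out = new PrintWriter(socket.getOutputStream, true)
    records.foreach(record => out.println(record.toString))
    out.close()
    socket.close()
  }
}
```
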
@@ -1481,7 +1481,7 @@ Note that the connections in the pool should be lazily 
created on demand and tim
 ***
 
 ## DataFrame and SQL Operations
-You can easily use [DataFrames and SQL](sql-programming-guide.html) operations 
on streaming data. You have to create a SparkSession using the SparkContext 
that the StreamingContext is using. Furthermore, this has to done such that it 
can be restarted on driver failures. This is done by creating a lazily 
instantiated singleton instance of SparkSession. This is shown in the following 
example. It modifies the earlier [word count example](#a-quick-example) to 
generate word counts using DataF [...]
+You can easily use [DataFrames and SQL](sql-programming-guide.html) operations 
on streaming data. You have to create a SparkSession using the SparkContext 
that the StreamingContext is using. Furthermore, this has to be done such that 
it can be restarted on driver 
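
A minimal sketch of the lazily instantiated singleton SparkSession that paragraph describes (illustrative; the `SparkConf` is assumed to come from the streaming context):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Lazily instantiated singleton so the session can be re-created after a driver restart.
object SparkSessionSingleton {
  @transient private var instance: SparkSession = _

  def getInstance(sparkConf: SparkConf): SparkSession = {
    if (instance == null) {
      instance = SparkSession.builder.config(sparkConf).getOrCreate()
    }
    instance
  }
}
```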

[spark] branch master updated: [SPARK-41407][SQL][FOLLOW-UP] Use string jobTrackerID for FileFormatWriter.executeTask

2022-12-23 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 3d5af030707 [SPARK-41407][SQL][FOLLOW-UP] Use string jobTrackerID for 
FileFormatWriter.executeTask
3d5af030707 is described below

commit 3d5af030707e1dd8bd14a1bee244350329303943
Author: Hyukjin Kwon 
AuthorDate: Fri Dec 23 20:22:33 2022 +0900

[SPARK-41407][SQL][FOLLOW-UP] Use string jobTrackerID for 
FileFormatWriter.executeTask

### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/38939 that 
fixes a logical conflict during merging PRs, see 
https://github.com/apache/spark/pull/38980 and 
https://github.com/apache/spark/pull/38939.

### Why are the changes needed?

To recover the broken build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested:

```
 ./build/sbt -Phive clean package
```

Closes #39194 from HyukjinKwon/SPARK-41407.

Authored-by: Hyukjin Kwon 
Signed-off-by: Hyukjin Kwon 
---
 .../org/apache/spark/sql/execution/datasources/WriteFiles.scala | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala
 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala
index 39b7b252f6e..5bc8f9db32b 100644
--- 
a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala
+++ 
b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala
@@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.datasources
 import java.util.Date
 
 import org.apache.spark.{SparkException, TaskContext}
-import org.apache.spark.internal.io.FileCommitProtocol
+import org.apache.spark.internal.io.{FileCommitProtocol, 
SparkHadoopWriterUtils}
 import org.apache.spark.rdd.RDD
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.expressions.Attribute
@@ -72,7 +72,7 @@ case class WriteFilesExec(child: SparkPlan) extends 
UnaryExecNode {
 val concurrentOutputWriterSpec = 
writeFilesSpec.concurrentOutputWriterSpecFunc(child)
 val description = writeFilesSpec.description
 val committer = writeFilesSpec.committer
-val jobIdInstant = new Date().getTime
+val jobTrackerID = SparkHadoopWriterUtils.createJobTrackerID(new Date())
 rddWithNonEmptyPartitions.mapPartitionsInternal { iterator =>
   val sparkStageId = TaskContext.get().stageId()
   val sparkPartitionId = TaskContext.get().partitionId()
@@ -80,7 +80,7 @@ case class WriteFilesExec(child: SparkPlan) extends 
UnaryExecNode {
 
   val ret = FileFormatWriter.executeTask(
 description,
-jobIdInstant,
+jobTrackerID,
 sparkStageId,
 sparkPartitionId,
 sparkAttemptNumber,


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41383][SPARK-41692][SPARK-41693] Implement `rollup`, `cube` and `pivot`

2022-12-23 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new e1af3a992e0 [SPARK-41383][SPARK-41692][SPARK-41693] Implement 
`rollup`, `cube` and `pivot`
e1af3a992e0 is described below

commit e1af3a992e06aeb5185501db908dc272b449c62b
Author: Ruifeng Zheng 
AuthorDate: Fri Dec 23 19:51:44 2022 +0900

[SPARK-41383][SPARK-41692][SPARK-41693] Implement `rollup`, `cube` and 
`pivot`

### What changes were proposed in this pull request?
Implement `rollup`, `cube` and `pivot`:

1. `DataFrame.rollup`
2. `DataFrame.cube`
3. `DataFrame.groupBy.pivot`

### Why are the changes needed?
for API coverage

### Does this PR introduce _any_ user-facing change?
yes

### How was this patch tested?
added UT

Closes #39191 from zhengruifeng/connect_groupby_refactor.

Authored-by: Ruifeng Zheng 
Signed-off-by: Hyukjin Kwon 
---
 .../main/protobuf/spark/connect/relations.proto|  34 +-
 .../org/apache/spark/sql/connect/dsl/package.scala |  50 +++-
 .../planner/LiteralValueProtoConverter.scala   |   2 +-
 .../sql/connect/planner/SparkConnectPlanner.scala  |  82 +
 .../connect/planner/SparkConnectPlannerSuite.scala |   3 +-
 .../connect/planner/SparkConnectProtoSuite.scala   |  77 
 python/pyspark/sql/connect/dataframe.py|  48 +++-
 python/pyspark/sql/connect/group.py|  82 +++--
 python/pyspark/sql/connect/plan.py |  63 +++---
 python/pyspark/sql/connect/proto/relations_pb2.py  | 112 +
 python/pyspark/sql/connect/proto/relations_pb2.pyi |  90 --
 .../sql/tests/connect/test_connect_basic.py| 136 +
 12 files changed, 667 insertions(+), 112 deletions(-)
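
For reference, the three grouping flavors modeled by the new `GroupType` enum in the proto diff below correspond to the existing DataFrame API roughly as follows (a sketch; the column names are made up):

```scala
val df = spark.range(8).selectExpr("id % 2 AS key", "CAST(id % 4 AS STRING) AS tag", "id AS value")

df.rollup("key", "tag").sum("value").show()          // GROUP_TYPE_ROLLUP
df.cube("key", "tag").sum("value").show()            // GROUP_TYPE_CUBE
df.groupBy("key").pivot("tag").sum("value").show()   // GROUP_TYPE_PIVOT: pivot column plus optional values
```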

diff --git 
a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto 
b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
index c4f040c03d6..912ee1fdc63 100644
--- a/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
+++ b/connector/connect/common/src/main/protobuf/spark/connect/relations.proto
@@ -235,11 +235,39 @@ message Tail {
 
 // Relation of type [[Aggregate]].
 message Aggregate {
-  // (Required) Input relation for a Aggregate.
+  // (Required) Input relation for a RelationalGroupedDataset.
   Relation input = 1;
 
-  repeated Expression grouping_expressions = 2;
-  repeated Expression result_expressions = 3;
+  // (Required) How the RelationalGroupedDataset was built.
+  GroupType group_type = 2;
+
+  // (Required) Expressions for grouping keys
+  repeated Expression grouping_expressions = 3;
+
+  // (Required) List of values that will be translated to columns in the 
output DataFrame.
+  repeated Expression aggregate_expressions = 4;
+
+  // (Optional) Pivots a column of the current `DataFrame` and performs the 
specified aggregation.
+  Pivot pivot = 5;
+
+  enum GroupType {
+GROUP_TYPE_UNSPECIFIED = 0;
+GROUP_TYPE_GROUPBY = 1;
+GROUP_TYPE_ROLLUP = 2;
+GROUP_TYPE_CUBE = 3;
+GROUP_TYPE_PIVOT = 4;
+  }
+
+  message Pivot {
+// (Required) The column to pivot
+Expression col = 1;
+
+// (Optional) List of values that will be translated to columns in the 
output DataFrame.
+//
+// Note that if it is empty, the server side will immediately trigger a 
job to collect
+// the distinct values of the column.
+repeated Expression.Literal values = 2;
+  }
 }
 
 // Relation of type [[Sort]].
diff --git 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
index b15e46293ab..e6d230d9eef 100644
--- 
a/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
+++ 
b/connector/connect/server/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
@@ -601,16 +601,64 @@ package object dsl {
   def groupBy(groupingExprs: Expression*)(aggregateExprs: Expression*): 
Relation = {
 val agg = Aggregate.newBuilder()
 agg.setInput(logicalPlan)
+agg.setGroupType(proto.Aggregate.GroupType.GROUP_TYPE_GROUPBY)
 
 for (groupingExpr <- groupingExprs) {
   agg.addGroupingExpressions(groupingExpr)
 }
 for (aggregateExpr <- aggregateExprs) {
-  agg.addResultExpressions(aggregateExpr)
+  agg.addAggregateExpressions(aggregateExpr)
 }
 Relation.newBuilder().setAggregate(agg.build()).build()
   }
 
+  def rollup(groupingExprs: Expression*)(aggregateExprs: Expression*): 
Relation = {
+val agg = Aggregate.newBuilder()
+agg.setInput(logicalPlan)
+

[spark] branch master updated (a1c727f3867 -> 2ffa8178df1)

2022-12-23 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from a1c727f3867 [SPARK-41666][PYTHON] Support parameterized SQL by `sql()`
 add 2ffa8178df1 [SPARK-41407][SQL] Pull out v1 write to WriteFiles

No new revisions were added by this update.

Summary of changes:
 core/src/main/resources/error/error-classes.json   |   2 +-
 .../spark/sql/errors/QueryExecutionErrors.scala|   2 +-
 .../org/apache/spark/sql/internal/WriteSpec.java}  |  19 ++-
 .../org/apache/spark/sql/execution/SparkPlan.scala |  26 +++-
 .../spark/sql/execution/SparkStrategies.scala  |   3 +
 .../execution/command/createDataSourceTables.scala |   6 +
 .../sql/execution/datasources/DataSource.scala |  29 ++--
 .../execution/datasources/FileFormatWriter.scala   | 154 -
 .../spark/sql/execution/datasources/V1Writes.scala |  21 ++-
 .../sql/execution/datasources/WriteFiles.scala | 104 ++
 .../datasources/V1WriteCommandSuite.scala  |   7 +-
 11 files changed, 310 insertions(+), 63 deletions(-)
 copy 
sql/{catalyst/src/main/java/org/apache/spark/sql/connector/read/streaming/PartitionOffset.java
 => core/src/main/java/org/apache/spark/sql/internal/WriteSpec.java} (68%)
 create mode 100644 
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriteFiles.scala


-
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org



[spark] branch master updated: [SPARK-41666][PYTHON] Support parameterized SQL by `sql()`

2022-12-23 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a1c727f3867 [SPARK-41666][PYTHON] Support parameterized SQL by `sql()`
a1c727f3867 is described below

commit a1c727f386724156f680953fa34ec51bb35348a4
Author: Max Gekk 
AuthorDate: Fri Dec 23 12:30:30 2022 +0300

[SPARK-41666][PYTHON] Support parameterized SQL by `sql()`

### What changes were proposed in this pull request?
In the PR, I propose to extend the `sql()` method in PySpark to support 
parameterized SQL queries, see https://github.com/apache/spark/pull/38864, and 
add a new parameter, `args`, of type `Dict[str, str]`. This parameter maps 
named parameters that can occur in the input SQL query to SQL literals like 1, 
INTERVAL '1-1' YEAR TO MONTH, DATE'2022-12-22' (see [the 
doc](https://spark.apache.org/docs/latest/sql-ref-literals.html) of supported 
literals).

For example:
```python
>>> spark.sql("SELECT * FROM range(10) WHERE id > :minId", args = 
{"minId" : "7"})
   id
0   8
1   9
```
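
For comparison, the Scala side this brings parity with (added in #38864) is used along these lines; this is a sketch that assumes the overload takes the parameter map as SQL literal text:

```scala
spark.sql(
  "SELECT * FROM range(10) WHERE id > :minId",
  Map("minId" -> "7")).show()
// +---+
// | id|
// +---+
// |  8|
// |  9|
// +---+
```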

Closes #39159

### Why are the changes needed?
To achieve feature parity with the Scala/Java API, and provide PySpark users 
with the same feature.

### Does this PR introduce _any_ user-facing change?
No, it shouldn't.

### How was this patch tested?
Checked the examples locally, and ran the tests:
```
$ python/run-tests --modules=pyspark-sql --parallelism=1
```

Closes #39183 from MaxGekk/parameterized-sql-pyspark-dict.

Authored-by: Max Gekk 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   |  6 +++---
 .../source/migration_guide/pyspark_3.3_to_3.4.rst  |  2 ++
 python/pyspark/pandas/sql_formatter.py | 20 +--
 python/pyspark/sql/session.py  | 23 ++
 4 files changed, 42 insertions(+), 9 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index ff235e80dbb..95db9005d02 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -813,7 +813,7 @@
   },
   "INVALID_SQL_ARG" : {
 "message" : [
-  "The argument  of `sql()` is invalid. Consider to replace it by a 
SQL literal statement."
+  "The argument  of `sql()` is invalid. Consider to replace it by a 
SQL literal."
 ]
   },
   "INVALID_SQL_SYNTAX" : {
@@ -1164,7 +1164,7 @@
   },
   "UNBOUND_SQL_PARAMETER" : {
 "message" : [
-  "Found the unbound parameter: . Please, fix `args` and provide a 
mapping of the parameter to a SQL literal statement."
+  "Found the unbound parameter: . Please, fix `args` and provide a 
mapping of the parameter to a SQL literal."
 ]
   },
   "UNCLOSED_BRACKETED_COMMENT" : {
@@ -5225,4 +5225,4 @@
   "grouping() can only be used with GroupingSets/Cube/Rollup"
 ]
   }
-}
\ No newline at end of file
+}
diff --git a/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst 
b/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
index b3baa8345aa..ca942c54979 100644
--- a/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
+++ b/python/docs/source/migration_guide/pyspark_3.3_to_3.4.rst
@@ -39,3 +39,5 @@ Upgrading from PySpark 3.3 to 3.4
 * In Spark 3.4, the ``Series.concat`` sort parameter will be respected to 
follow pandas 1.4 behaviors.
 
 * In Spark 3.4, the ``DataFrame.__setitem__`` will make a copy and replace 
pre-existing arrays, which will NOT be over-written to follow pandas 1.4 
behaviors.
+
+* In Spark 3.4, the ``SparkSession.sql`` and the Pandas on Spark API ``sql`` 
have got new parameter ``args`` which provides binding of named parameters to 
their SQL literals.
diff --git a/python/pyspark/pandas/sql_formatter.py 
b/python/pyspark/pandas/sql_formatter.py
index 45c615161d9..9103366c192 100644
--- a/python/pyspark/pandas/sql_formatter.py
+++ b/python/pyspark/pandas/sql_formatter.py
@@ -17,7 +17,7 @@
 
 import os
 import string
-from typing import Any, Optional, Union, List, Sequence, Mapping, Tuple
+from typing import Any, Dict, Optional, Union, List, Sequence, Mapping, Tuple
 import uuid
 import warnings
 
@@ -43,6 +43,7 @@ _CAPTURE_SCOPES = 3
 def sql(
 query: str,
 index_col: Optional[Union[str, List[str]]] = None,
+args: Dict[str, str] = {},
 **kwargs: Any,
 ) -> DataFrame:
 """
@@ -57,6 +58,8 @@ def sql(
 * pandas Series
 * string
 
+Also the method can bind named parameters to SQL literals from `args`.
+
 Parameters
 --
 query : str
@@ -99,6 +102,12 @@ def sql(
 e  f   3  6
 
 Also note that the index name(s) should be matched to the existing 
name.
+args : dict
+  

[spark] branch master updated: [SPARK-41422][UI] Protobuf serializer for ExecutorSummaryWrapper

2022-12-23 Thread gengliang
This is an automated email from the ASF dual-hosted git repository.

gengliang pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new d0ec53cf953 [SPARK-41422][UI] Protobuf serializer for 
ExecutorSummaryWrapper
d0ec53cf953 is described below

commit d0ec53cf9531cba6a154b98d4593a0be50e712d8
Author: Sandeep Singh 
AuthorDate: Fri Dec 23 00:30:33 2022 -0800

[SPARK-41422][UI] Protobuf serializer for ExecutorSummaryWrapper

### What changes were proposed in this pull request?
Add Protobuf serializer for ExecutorSummaryWrapper

### Why are the changes needed?
Support fast and compact serialization/deserialization for 
ExecutorSummaryWrapper over RocksDB.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT

Closes #39141 from techaddict/SPARK-41422-ExecutorSummaryWrapper.

Authored-by: Sandeep Singh 
Signed-off-by: Gengliang Wang 
---
 .../apache/spark/status/protobuf/store_types.proto |  50 ++
 .../org.apache.spark.status.protobuf.ProtobufSerDe |   1 +
 .../ExecutorSummaryWrapperSerializer.scala | 183 +
 .../protobuf/KVStoreProtobufSerializerSuite.scala  | 131 ++-
 4 files changed, 364 insertions(+), 1 deletion(-)
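
One detail worth noting in the message definition below: `optional` scalar fields and map fields get has-/put-style accessors in the generated protobuf-java API, which is how absent values such as `remove_time` survive the round trip. A small illustrative sketch (as before, assuming the generated classes are exposed as `StoreTypes.*`; the field values are made up):

```scala
import org.apache.spark.status.protobuf.StoreTypes

val summary = StoreTypes.ExecutorSummary.newBuilder()
  .setId("1")
  .setHostPort("worker-1:36741")
  .setIsActive(true)
  .putExecutorLogs("stdout", "http://worker-1:8081/logPage/?appId=app-1&executorId=1&logType=stdout")
  .build()                     // remove_time deliberately left unset

val removeTime: Option[Long] =
  if (summary.hasRemoveTime) Some(summary.getRemoveTime) else None   // None here
```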

diff --git 
a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto 
b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
index 5949ad63c84..7cf5c2921cb 100644
--- a/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
+++ b/core/src/main/protobuf/org/apache/spark/status/protobuf/store_types.proto
@@ -305,3 +305,53 @@ message SpeculationStageSummaryWrapper {
   int32 stage_attempt_id = 2;
   SpeculationStageSummary info = 3;
 }
+
+message MemoryMetrics {
+  int64 used_on_heap_storage_memory = 1;
+  int64 used_off_heap_storage_memory = 2;
+  int64 total_on_heap_storage_memory = 3;
+  int64 total_off_heap_storage_memory = 4;
+}
+
+message ResourceInformation {
+  string name = 1;
+  repeated string addresses = 2;
+}
+
+message ExecutorSummary {
+  string id = 1;
+  string host_port = 2;
+  bool is_active = 3;
+  int32 rdd_blocks = 4;
+  int64 memory_used = 5;
+  int64 disk_used = 6;
+  int32 total_cores = 7;
+  int32 max_tasks = 8;
+  int32 active_tasks = 9;
+  int32 failed_tasks = 10;
+  int32 completed_tasks = 11;
+  int32 total_tasks = 12;
+  int64 total_duration = 13;
+  int64 total_gc_time = 14;
+  int64 total_input_bytes = 15;
+  int64 total_shuffle_read = 16;
+  int64 total_shuffle_write = 17;
+  bool is_blacklisted = 18;
+  int64 max_memory = 19;
+  int64 add_time = 20;
+  optional int64 remove_time = 21;
+  optional string remove_reason = 22;
+  map<string, string> executor_logs = 23;
+  optional MemoryMetrics memory_metrics = 24;
+  repeated int64 blacklisted_in_stages = 25;
+  optional ExecutorMetrics peak_memory_metrics = 26;
+  map<string, string> attributes = 27;
+  map<string, ResourceInformation> resources = 28;
+  int32 resource_profile_id = 29;
+  bool is_excluded = 30;
+  repeated int64 excluded_in_stages = 31;
+}
+
+message ExecutorSummaryWrapper {
+  ExecutorSummary info = 1;
+}
diff --git 
a/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
 
b/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
index 97c186206bf..b714ea73b36 100644
--- 
a/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
+++ 
b/core/src/main/resources/META-INF/services/org.apache.spark.status.protobuf.ProtobufSerDe
@@ -25,3 +25,4 @@ org.apache.spark.status.protobuf.TaskDataWrapperSerializer
 org.apache.spark.status.protobuf.JobDataWrapperSerializer
 org.apache.spark.status.protobuf.ResourceProfileWrapperSerializer
 org.apache.spark.status.protobuf.SpeculationStageSummaryWrapperSerializer
+org.apache.spark.status.protobuf.ExecutorSummaryWrapperSerializer
diff --git 
a/core/src/main/scala/org/apache/spark/status/protobuf/ExecutorSummaryWrapperSerializer.scala
 
b/core/src/main/scala/org/apache/spark/status/protobuf/ExecutorSummaryWrapperSerializer.scala
new file mode 100644
index 000..03a810157d7
--- /dev/null
+++ 
b/core/src/main/scala/org/apache/spark/status/protobuf/ExecutorSummaryWrapperSerializer.scala
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is