[spark] branch master updated: [SPARK-40657] Add support for Java classes in Protobuf functions

2022-10-20 Thread kabhwan
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new fd9e5760bae [SPARK-40657] Add support for Java classes in Protobuf 
functions
fd9e5760bae is described below

commit fd9e5760bae847f47c9c108f0e58814748e0d9b1
Author: Raghu Angadi 
AuthorDate: Fri Oct 21 15:46:50 2022 +0900

[SPARK-40657] Add support for Java classes in Protobuf functions

### What changes were proposed in this pull request?

Adds support for compiled Java classes to Protobuf functions. This is 
tested with Protobuf v3 classes. V2 vs V3 issues will be handled in a separate 
PR. The main changes in this PR:

 - Changes to the top-level API (see the hedged usage sketch after this list):
   - Adds a new variant that takes just the Java class name.
   - Changes the argument order of the existing descriptor-file API (`messageName` and `descFilePath` are swapped).
 - Protobuf utils methods to create a descriptor from a Java class name.
 - Many unit tests are updated to check both versions: (1) with a descriptor file and (2) with a Java class name.
 - Maven build updates to generate Java classes used in tests.
 - Miscellaneous changes:
   - Adds `proto` to the package name in the `proto` files used in tests.
   - A few TODO comments about improvements.
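
For reference only (not part of this commit), a minimal Scala usage sketch of the two variants after this change; the table name `events`, message name `AppEvent`, descriptor file path, and shaded class name `com.example.protos.AppEvent` are hypothetical placeholders:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.protobuf.functions.from_protobuf

object ProtobufFunctionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("protobuf-functions-sketch").master("local[*]").getOrCreate()
    // Assumes a table whose binary `value` column holds serialized protobuf messages.
    val df = spark.read.table("events")

    // Existing variant, with the swapped argument order: message name first, descriptor file second.
    val fromDescriptorFile = df.select(
      from_protobuf(col("value"), "AppEvent", "/tmp/app_event.desc").as("event"))

    // New variant: only the compiled Java class name is needed, no descriptor file.
    val fromJavaClass = df.select(
      from_protobuf(col("value"), "com.example.protos.AppEvent").as("event"))

    fromDescriptorFile.show()
    fromJavaClass.show()
    spark.stop()
  }
}
```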

### Why are the changes needed?
Compiled Java classes are a common way for users to provide Protobuf definitions.

### Does this PR introduce _any_ user-facing change?
No.
This updates the interface, but only for a new feature that is still in active development.

### How was this patch tested?
 - Unit tests

Closes #38286 from rangadi/protobuf-java.

Authored-by: Raghu Angadi 
Signed-off-by: Jungtaek Lim 
---
 connector/protobuf/pom.xml |  23 +-
 .../sql/protobuf/CatalystDataToProtobuf.scala  |  10 +-
 .../sql/protobuf/ProtobufDataToCatalyst.scala  |  34 ++-
 .../org/apache/spark/sql/protobuf/functions.scala  |  58 +++-
 .../spark/sql/protobuf/utils/ProtobufUtils.scala   |  65 -
 .../sql/protobuf/utils/SchemaConverters.scala  |   4 +
 .../test/resources/protobuf/catalyst_types.proto   |   4 +-
 .../test/resources/protobuf/functions_suite.proto  |   4 +-
 .../src/test/resources/protobuf/serde_suite.proto  |   6 +-
 .../ProtobufCatalystDataConversionSuite.scala  |  97 +--
 .../sql/protobuf/ProtobufFunctionsSuite.scala  | 318 +
 .../spark/sql/protobuf/ProtobufSerdeSuite.scala|   9 +-
 project/SparkBuild.scala   |   6 +-
 python/pyspark/sql/protobuf/functions.py   |  22 +-
 14 files changed, 437 insertions(+), 223 deletions(-)

diff --git a/connector/protobuf/pom.xml b/connector/protobuf/pom.xml
index 0515f128b8d..b934c7f831a 100644
--- a/connector/protobuf/pom.xml
+++ b/connector/protobuf/pom.xml
@@ -83,7 +83,6 @@
   ${protobuf.version}
   compile
 
-
   
   
 
target/scala-${scala.binary.version}/classes
@@ -110,6 +109,28 @@
   
 
   
+  
+com.github.os72
+protoc-jar-maven-plugin
+3.11.4
+
+
+  
+generate-test-sources
+
+  run
+
+
+  
com.google.protobuf:protoc:${protobuf.version}
+  ${protobuf.version}
+  
+src/test/resources/protobuf
+  
+  test
+
+  
+
+  
 
   
 
diff --git 
a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala
 
b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala
index 145100268c2..b9f7907ea8c 100644
--- 
a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala
+++ 
b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala
@@ -25,17 +25,17 @@ import org.apache.spark.sql.types.{BinaryType, DataType}
 
 private[protobuf] case class CatalystDataToProtobuf(
 child: Expression,
-descFilePath: String,
-messageName: String)
+messageName: String,
+descFilePath: Option[String] = None)
 extends UnaryExpression {
 
   override def dataType: DataType = BinaryType
 
-  @transient private lazy val protoType =
-ProtobufUtils.buildDescriptor(descFilePath, messageName)
+  @transient private lazy val protoDescriptor =
+ProtobufUtils.buildDescriptor(messageName, descFilePathOpt = descFilePath)
 
   @transient private lazy val serializer =
-new ProtobufSerializer(child.dataType, protoType, child.nullable)
+new ProtobufSerializer(child.dataType, protoDescriptor, child.nullable)
 
   override def nullSafeEval(input: Any): Any = {
 val dynamicMessage = 
serializer.serialize(input).asInstanceOf[DynamicMes

[spark] branch master updated: [SPARK-40812][CONNECT] Add Deduplicate to Connect proto and DSL

2022-10-20 Thread wenchen
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new b14da8b1b65 [SPARK-40812][CONNECT] Add Deduplicate to Connect proto 
and DSL
b14da8b1b65 is described below

commit b14da8b1b65d9f00f49fab87f738715089bc43e8
Author: Rui Wang 
AuthorDate: Fri Oct 21 13:14:43 2022 +0800

[SPARK-40812][CONNECT] Add Deduplicate to Connect proto and DSL

### What changes were proposed in this pull request?

This PR adds `Deduplicate` to the Connect proto and DSL.

Note that `Deduplicate` cannot be replaced by SQL's `SELECT DISTINCT col_list`. The difference is that `Deduplicate` removes duplicate rows based on a subset of columns while returning all columns, whereas SQL's `SELECT DISTINCT col_list` can only return `col_list`.
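
As an illustration only (not part of this commit), a minimal sketch using the existing DataFrame API that `Deduplicate` is meant to mirror: `dropDuplicates` on a subset of columns keeps every column, while `select(...).distinct()` returns only the selected ones.

```scala
import org.apache.spark.sql.SparkSession

object DeduplicateVsDistinctSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dedup-vs-distinct").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1, "x"), ("a", 2, "x"), ("b", 3, "y")).toDF("key", "value", "tag")

    // Deduplicate semantics: drop duplicates based on ("key", "tag") but return all columns,
    // including "value".
    df.dropDuplicates(Seq("key", "tag")).show()

    // SELECT DISTINCT key, tag: only the listed columns come back.
    df.select("key", "tag").distinct().show()

    spark.stop()
  }
}
```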

### Why are the changes needed?

1. To improve proto API coverage.
2. `Deduplicate` blocks https://github.com/apache/spark/pull/38166 because we want to support `Union(isAll=false)`, but that will return `Union().Distinct()` to match the existing DataFrame API. `Deduplicate` is needed to write test cases for `Union(isAll=false)`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #38276 from amaliujia/supportDropDuplicates.

Authored-by: Rui Wang 
Signed-off-by: Wenchen Fan 
---
 .../main/protobuf/spark/connect/relations.proto|  9 ++
 .../org/apache/spark/sql/connect/dsl/package.scala | 20 +
 .../sql/connect/planner/SparkConnectPlanner.scala  | 35 +++-
 .../planner/SparkConnectDeduplicateSuite.scala | 68 +++
 .../connect/planner/SparkConnectPlannerSuite.scala | 29 ++-
 python/pyspark/sql/connect/proto/relations_pb2.py  | 98 +++---
 python/pyspark/sql/connect/proto/relations_pb2.pyi | 50 +++
 7 files changed, 257 insertions(+), 52 deletions(-)

diff --git a/connector/connect/src/main/protobuf/spark/connect/relations.proto 
b/connector/connect/src/main/protobuf/spark/connect/relations.proto
index eadedf495d3..6adf0831ea2 100644
--- a/connector/connect/src/main/protobuf/spark/connect/relations.proto
+++ b/connector/connect/src/main/protobuf/spark/connect/relations.proto
@@ -43,6 +43,7 @@ message Relation {
 LocalRelation local_relation = 11;
 Sample sample = 12;
 Offset offset = 13;
+Deduplicate deduplicate = 14;
 
 Unknown unknown = 999;
   }
@@ -181,6 +182,14 @@ message Sort {
   }
 }
 
+// Relation of type [[Deduplicate]] which have duplicate rows removed, could 
consider either only
+// the subset of columns or all the columns.
+message Deduplicate {
+  Relation input = 1;
+  repeated string column_names = 2;
+  bool all_columns_as_keys = 3;
+}
+
 message LocalRelation {
   repeated Expression.QualifiedAttribute attributes = 1;
   // TODO: support local data.
diff --git 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
index 8a267dff7d7..68bbc0487f9 100644
--- 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
+++ 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
@@ -215,6 +215,26 @@ package object dsl {
   .build()
   }
 
+  def deduplicate(colNames: Seq[String]): proto.Relation =
+proto.Relation
+  .newBuilder()
+  .setDeduplicate(
+proto.Deduplicate
+  .newBuilder()
+  .setInput(logicalPlan)
+  .addAllColumnNames(colNames.asJava))
+  .build()
+
+  def distinct(): proto.Relation =
+proto.Relation
+  .newBuilder()
+  .setDeduplicate(
+proto.Deduplicate
+  .newBuilder()
+  .setInput(logicalPlan)
+  .setAllColumnsAsKeys(true))
+  .build()
+
   def join(
   otherPlan: proto.Relation,
   joinType: JoinType = JoinType.JOIN_TYPE_INNER,
diff --git 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index 450283a9b81..92c8bf01cba 100644
--- 
a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ 
b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -27,8 +27,9 @@ import org.apache.spark.sql.catalyst.expressions
 import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, 
AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.catalyst.plans.{logical, FullOuter, Inner, 
JoinType, LeftAnti, Left

[spark] branch master updated: [SPARK-40768][SQL] Migrate type check failures of bloom_filter_agg() onto error classes

2022-10-20 Thread maxgekk
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 11cce7e7338 [SPARK-40768][SQL] Migrate type check failures of 
bloom_filter_agg() onto error classes
11cce7e7338 is described below

commit 11cce7e73380231ec7c94096655e3d98ce7e635d
Author: lvshaokang 
AuthorDate: Fri Oct 21 10:06:44 2022 +0500

[SPARK-40768][SQL] Migrate type check failures of bloom_filter_agg() onto 
error classes

### What changes were proposed in this pull request?

In this PR, I propose to use error classes for type check failures in the Bloom Filter aggregate expressions.
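
For context, a standalone sketch (hypothetical helper, not part of this commit) of the error-class pattern the checks are migrated to: instead of a free-form `TypeCheckFailure` string, the check returns a `DataTypeMismatch` with an error sub-class and message parameters, mirroring the `UNEXPECTED_NULL` case in the diff below.

```scala
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
import org.apache.spark.sql.catalyst.analysis.TypeCheckResult.{DataTypeMismatch, TypeCheckSuccess}
import org.apache.spark.sql.types.{DataType, LongType, NullType}

// Hypothetical standalone version of the size-argument check in BloomFilterAggregate.
object BloomFilterTypeCheckSketch {
  def checkSizeArguments(estimatedNumItems: DataType, numBits: DataType): TypeCheckResult =
    (estimatedNumItems, numBits) match {
      case (NullType, _) | (_, NullType) =>
        // The error framework renders the sub-class and parameters into a unified message.
        DataTypeMismatch(
          errorSubClass = "UNEXPECTED_NULL",
          messageParameters = Map("exprName" -> "estimatedNumItems or numBits"))
      case (LongType, LongType) =>
        TypeCheckSuccess
      case _ =>
        // The real expression distinguishes further sub-classes here,
        // e.g. NON_FOLDABLE_INPUT for non-constant size arguments.
        DataTypeMismatch(
          errorSubClass = "BLOOM_FILTER_WRONG_TYPE",
          messageParameters = Map("functionName" -> "bloom_filter_agg"))
    }
}
```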

### Why are the changes needed?

Migration onto error classes unifies Spark SQL error messages.

### Does this PR introduce _any_ user-facing change?

Yes. The PR changes user-facing error messages.

### How was this patch tested?

```
build/sbt "sql/testOnly *SQLQueryTestSuite"
build/sbt "test:testOnly org.apache.spark.SparkThrowableSuite"
build/sbt "test:testOnly *BloomFilterAggregateQuerySuite"
```

Closes #38315 from lvshaokang/SPARK-40768.

Authored-by: lvshaokang 
Signed-off-by: Max Gekk 
---
 core/src/main/resources/error/error-classes.json   |   2 +-
 .../expressions/BloomFilterMightContain.scala  |   3 +-
 .../aggregate/BloomFilterAggregate.scala   |  61 +++--
 .../spark/sql/BloomFilterAggregateQuerySuite.scala | 144 ++---
 4 files changed, 179 insertions(+), 31 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json 
b/core/src/main/resources/error/error-classes.json
index 0cfb6861c77..1e9519dd89a 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -108,7 +108,7 @@
   },
   "BLOOM_FILTER_WRONG_TYPE" : {
 "message" : [
-  "Input to function  should have been  
followed by a value with , but it's [, 
]."
+  "Input to function  should have been  
followed by value with , but it's []."
 ]
   },
   "CANNOT_CONVERT_TO_JSON" : {
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala
index 5cb19d36b80..b2273b6a6d1 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala
@@ -76,8 +76,7 @@ case class BloomFilterMightContain(
 "functionName" -> toSQLId(prettyName),
 "expectedLeft" -> toSQLType(BinaryType),
 "expectedRight" -> toSQLType(LongType),
-"actualLeft" -> toSQLType(left.dataType),
-"actualRight" -> toSQLType(right.dataType)
+"actual" -> Seq(left.dataType, 
right.dataType).map(toSQLType).mkString(", ")
   )
 )
 }
diff --git 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala
 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala
index c734bca3ef8..5b78c5b5228 100644
--- 
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala
+++ 
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala
@@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult._
 import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.Cast.{toSQLExpr, toSQLId, 
toSQLType, toSQLValue}
 import org.apache.spark.sql.catalyst.trees.TernaryLike
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
@@ -63,28 +64,66 @@ case class BloomFilterAggregate(
   override def checkInputDataTypes(): TypeCheckResult = {
 (first.dataType, second.dataType, third.dataType) match {
   case (_, NullType, _) | (_, _, NullType) =>
-TypeCheckResult.TypeCheckFailure("Null typed values cannot be used as 
size arguments")
+DataTypeMismatch(
+  errorSubClass = "UNEXPECTED_NULL",
+  messageParameters = Map(
+"exprName" -> "estimatedNumItems or numBits"
+  )
+)
   case (LongType, LongType, LongType) =>
 if (!estimatedNumItemsExpression.foldable) {
-  TypeCheckFailure("The estimated number of items provided must be a 
constant literal")
+  DataTypeMismatch(
+errorSubClass = "NON_FOLDABLE_INPUT",
+messageParameters =

[spark] branch master updated (40086cb9b21 -> 670bc1dfc0e)

2022-10-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 40086cb9b21 [SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`
 add 670bc1dfc0e [SPARK-40615][SQL][TESTS][FOLLOW-UP] Make the test pass 
with ANSI enabled

No new revisions were added by this update.

Summary of changes:
 sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)


[spark] branch master updated: [SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`

2022-10-20 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 40086cb9b21 [SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`
40086cb9b21 is described below

commit 40086cb9b21fe207242c4928d8e2cc3e756d61da
Author: Yikun Jiang 
AuthorDate: Fri Oct 21 11:06:33 2022 +0800

[SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`

### What changes were proposed in this pull request?
Change `set-output` to `GITHUB_OUTPUT`.

### Why are the changes needed?
The `set-output` command is deprecated and will be disabled soon. Please 
upgrade to using Environment Files. For more information see: 
https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/

### Does this PR introduce _any_ user-facing change?
No, dev only

### How was this patch tested?
- CI passed
- Also did a local test of the benchmark workflow: https://github.com/Yikun/spark/actions/runs/3294384181/jobs/5431945626

Closes #38323 from Yikun/set-output.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/benchmark.yml  |  2 +-
 .github/workflows/build_and_test.yml | 13 ++---
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 5508227b8b2..f73267a95fa 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -54,7 +54,7 @@ jobs:
 steps:
 - name: Generate matrix
   id: set-matrix
-  run: echo "::set-output name=matrix::["`seq -s, 1 
$SPARK_BENCHMARK_NUM_SPLITS`"]"
+  run: echo "matrix=["`seq -s, 1 $SPARK_BENCHMARK_NUM_SPLITS`"]" >> 
$GITHUB_OUTPUT
 
   # Any TPC-DS related updates on this job need to be applied to tpcds-1g job 
of build_and_test.yml as well
   tpcds-1g-gen:
diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index e0adad54aed..f9b445e9bbd 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -103,16 +103,15 @@ jobs:
   \"k8s-integration-tests\" : \"true\",
 }"
   echo $precondition # For debugging
-  # GitHub Actions set-output doesn't take newlines
-  # 
https://github.community/t/set-output-truncates-multiline-strings/16852/3
-  precondition="${precondition//$'\n'/'%0A'}"
-  echo "::set-output name=required::$precondition"
+  # Remove `\n` to avoid "Invalid format" error
+  precondition="${precondition//$'\n'/}}"
+  echo "required=$precondition" >> $GITHUB_OUTPUT
 else
   # This is usually set by scheduled jobs.
   precondition='${{ inputs.jobs }}'
   echo $precondition # For debugging
-  precondition="${precondition//$'\n'/'%0A'}"
-  echo "::set-output name=required::$precondition"
+  precondition="${precondition//$'\n'/}"
+  echo "required=$precondition" >> $GITHUB_OUTPUT
 fi
 - name: Generate infra image URL
   id: infra-image-outputs
@@ -121,7 +120,7 @@ jobs:
 REPO_OWNER=$(echo "${{ github.repository_owner }}" | tr '[:upper:]' 
'[:lower:]')
 IMG_NAME="apache-spark-ci-image:${{ inputs.branch }}-${{ github.run_id 
}}"
 IMG_URL="ghcr.io/$REPO_OWNER/$IMG_NAME"
-echo ::set-output name=image_url::$IMG_URL
+echo "image_url=$IMG_URL" >> $GITHUB_OUTPUT
 
   # Build: build Spark and run the tests for specified modules.
   build:


[spark] branch master updated: [SPARK-40859][INFRA] Upgrade action/checkout to v3 to cleanup warning

2022-10-20 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 17efe044fa7 [SPARK-40859][INFRA] Upgrade action/checkout to v3 to 
cleanup warning
17efe044fa7 is described below

commit 17efe044fa7d366fa0beafe71c5e76d46f942b7e
Author: Yikun Jiang 
AuthorDate: Fri Oct 21 10:36:00 2022 +0800

[SPARK-40859][INFRA] Upgrade action/checkout to v3 to cleanup warning

### What changes were proposed in this pull request?
Upgrade action/checkout to v3 (pointing to v3.1 now).

### Why are the changes needed?
- https://github.com/actions/checkout/releases/tag/v3.1.0 cleans up the "[The 'set-output' command is deprecated and will be disabled soon.](https://github.com/actions/checkout/issues/959#issuecomment-1282107197)" warning.
- https://github.com/actions/checkout/releases/tag/v3.0.0: since v3 the action runs on Node 16, which cleans up the "[Node.js 12 actions are deprecated](https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/)" warning.

According to https://github.com/actions/checkout/issues/959#issuecomment-1282107197, v2.5 also addresses the 'set-output' warning, but only v3 supports Node 16, so we upgrade to v3.1 rather than v2.5.

### Does this PR introduce _any_ user-facing change?
No, dev only

### How was this patch tested?
CI passed

Closes #38322 from Yikun/checkout-v3.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 .github/workflows/benchmark.yml|  6 +++---
 .github/workflows/build_and_test.yml   | 24 
 .github/workflows/build_infra_images_cache.yml |  2 +-
 .github/workflows/publish_snapshot.yml |  2 +-
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 52adec20e5c..5508227b8b2 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -65,7 +65,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
 steps:
   - name: Checkout Spark repository
-uses: actions/checkout@v2
+uses: actions/checkout@v3
 # In order to get diff files
 with:
   fetch-depth: 0
@@ -95,7 +95,7 @@ jobs:
   key: tpcds-${{ hashFiles('.github/workflows/benchmark.yml', 
'sql/core/src/test/scala/org/apache/spark/sql/TPCDSSchema.scala') }}
   - name: Checkout tpcds-kit repository
 if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
-uses: actions/checkout@v2
+uses: actions/checkout@v3
 with:
   repository: databricks/tpcds-kit
   ref: 2a5078a782192ddb6efbcead8de9973d6ab4f069
@@ -133,7 +133,7 @@ jobs:
   SPARK_TPCDS_DATA: ${{ github.workspace }}/tpcds-sf-1
 steps:
 - name: Checkout Spark repository
-  uses: actions/checkout@v2
+  uses: actions/checkout@v3
   # In order to get diff files
   with:
 fetch-depth: 0
diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index 12a1ad0e71e..e0adad54aed 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -63,7 +63,7 @@ jobs:
 }}
 steps:
 - name: Checkout Spark repository
-  uses: actions/checkout@v2
+  uses: actions/checkout@v3
   with:
 fetch-depth: 0
 repository: apache/spark
@@ -195,7 +195,7 @@ jobs:
   SPARK_LOCAL_IP: localhost
 steps:
 - name: Checkout Spark repository
-  uses: actions/checkout@v2
+  uses: actions/checkout@v3
   # In order to fetch changed files
   with:
 fetch-depth: 0
@@ -286,7 +286,7 @@ jobs:
   username: ${{ github.actor }}
   password: ${{ secrets.GITHUB_TOKEN }}
   - name: Checkout Spark repository
-uses: actions/checkout@v2
+uses: actions/checkout@v3
 # In order to fetch changed files
 with:
   fetch-depth: 0
@@ -349,7 +349,7 @@ jobs:
   METASPACE_SIZE: 1g
 steps:
 - name: Checkout Spark repository
-  uses: actions/checkout@v2
+  uses: actions/checkout@v3
   # In order to fetch changed files
   with:
 fetch-depth: 0
@@ -438,7 +438,7 @@ jobs:
   SKIP_MIMA: true
 steps:
 - name: Checkout Spark repository
-  uses: actions/checkout@v2
+  uses: actions/checkout@v3
   # In order to fetch changed files
   with:
 fetch-depth: 0
@@ -508,7 +508,7 @@ jobs:
   image: ${{ needs.precondition.outputs.image_url }}
 steps:
 - name: Checkout Spark repository
-  uses: actions/checkout@v2
+  uses: actions/checkout@v3
   with:
 fetch-depth: 0
 repository: apache/spark
@@ -635,7 +635,7 @@ jobs:
 runs-on: ubuntu-20.04
 steps:
 - name: Checkout Spark repository
-  

[spark] branch master updated: [SPARK-40813][CONNECT][PYTHON][FOLLOW-UP] Improve limit and offset in Python client

2022-10-20 Thread gurwls223
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 45bb9578a0f [SPARK-40813][CONNECT][PYTHON][FOLLOW-UP] Improve limit 
and offset in Python client
45bb9578a0f is described below

commit 45bb9578a0f6b40b472588a407d842f293e9e323
Author: Rui Wang 
AuthorDate: Fri Oct 21 10:33:18 2022 +0900

[SPARK-40813][CONNECT][PYTHON][FOLLOW-UP] Improve limit and offset in 
Python client

### What changes were proposed in this pull request?

Following up on https://github.com/apache/spark/pull/38275, this improves limit and offset in the Python client.

### Why are the changes needed?

Improve API coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #38314 from amaliujia/python_test_limit_offset.

Authored-by: Rui Wang 
Signed-off-by: Hyukjin Kwon 
---
 python/pyspark/sql/connect/dataframe.py|  3 ++
 python/pyspark/sql/connect/plan.py | 32 --
 .../sql/tests/connect/test_connect_basic.py|  7 +
 .../sql/tests/connect/test_connect_plan_only.py| 10 +++
 4 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/python/pyspark/sql/connect/dataframe.py 
b/python/pyspark/sql/connect/dataframe.py
index 1f7e789818f..5ca747fdd6a 100644
--- a/python/pyspark/sql/connect/dataframe.py
+++ b/python/pyspark/sql/connect/dataframe.py
@@ -199,6 +199,9 @@ class DataFrame(object):
 def limit(self, n: int) -> "DataFrame":
 return DataFrame.withPlan(plan.Limit(child=self._plan, limit=n), 
session=self._session)
 
+def offset(self, n: int) -> "DataFrame":
+return DataFrame.withPlan(plan.Offset(child=self._plan, offset=n), 
session=self._session)
+
 def sort(self, *cols: "ColumnOrString") -> "DataFrame":
 """Sort by a specific column"""
 return DataFrame.withPlan(plan.Sort(self._plan, *cols), 
session=self._session)
diff --git a/python/pyspark/sql/connect/plan.py 
b/python/pyspark/sql/connect/plan.py
index c564b71cdba..5b8b7c71866 100644
--- a/python/pyspark/sql/connect/plan.py
+++ b/python/pyspark/sql/connect/plan.py
@@ -272,10 +272,9 @@ class Filter(LogicalPlan):
 
 
 class Limit(LogicalPlan):
-def __init__(self, child: Optional["LogicalPlan"], limit: int, offset: int 
= 0) -> None:
+def __init__(self, child: Optional["LogicalPlan"], limit: int) -> None:
 super().__init__(child)
 self.limit = limit
-self.offset = offset
 
 def plan(self, session: Optional["RemoteSparkSession"]) -> proto.Relation:
 assert self._child is not None
@@ -286,7 +285,7 @@ class Limit(LogicalPlan):
 
 def print(self, indent: int = 0) -> str:
 c_buf = self._child.print(indent + LogicalPlan.INDENT) if self._child 
else ""
-return f"{' ' * indent}\n{c_buf}"
+return f"{' ' * indent}\n{c_buf}"
 
 def _repr_html_(self) -> str:
 return f"""
@@ -294,6 +293,33 @@ class Limit(LogicalPlan):
 
 Limit
 Limit: {self.limit} 
+{self._child_repr_()}
+
+
+"""
+
+
+class Offset(LogicalPlan):
+def __init__(self, child: Optional["LogicalPlan"], offset: int = 0) -> 
None:
+super().__init__(child)
+self.offset = offset
+
+def plan(self, session: Optional["RemoteSparkSession"]) -> proto.Relation:
+assert self._child is not None
+plan = proto.Relation()
+plan.offset.input.CopyFrom(self._child.plan(session))
+plan.offset.offset = self.offset
+return plan
+
+def print(self, indent: int = 0) -> str:
+c_buf = self._child.print(indent + LogicalPlan.INDENT) if self._child 
else ""
+return f"{' ' * indent}\n{c_buf}"
+
+def _repr_html_(self) -> str:
+return f"""
+
+
+Limit
 Offset: {self.offset} 
 {self._child_repr_()}
 
diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py 
b/python/pyspark/sql/tests/connect/test_connect_basic.py
index de300946932..f6988a1d120 100644
--- a/python/pyspark/sql/tests/connect/test_connect_basic.py
+++ b/python/pyspark/sql/tests/connect/test_connect_basic.py
@@ -106,6 +106,13 @@ class SparkConnectTests(SparkConnectSQLTestCase):
 res = pandas.DataFrame(data={"id": [0, 30, 60, 90]})
 self.assert_(pd.equals(res), f"{pd.to_string()} != {res.to_string()}")
 
+def test_limit_offset(self):
+df = self.connect.read.table(self.tbl_name)
+pd = df.limit(10).offset(1).toPandas()
+self.assertEqual(9, len(pd.index))
+pd2 = df.offset(98).limit(10).toPandas()
+self.assertEqual(2, len(pd2.index))
+
 def test_simple_dataso

[spark] branch master updated: [SPARK-39203][SQL][FOLLOWUP] Do not qualify view location

2022-10-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new a51dd1820da [SPARK-39203][SQL][FOLLOWUP] Do not qualify view location
a51dd1820da is described below

commit a51dd1820dad8f24f99a979d5998b999ae4a3c25
Author: Wenchen Fan 
AuthorDate: Thu Oct 20 15:16:04 2022 -0700

[SPARK-39203][SQL][FOLLOWUP] Do not qualify view location

### What changes were proposed in this pull request?

This fixes a corner-case regression caused by https://github.com/apache/spark/pull/36625. Users may have existing views with invalid locations for historical reasons. The location is actually useless for a view, but after https://github.com/apache/spark/pull/36625 they start to fail to read such views because qualifying the location fails. We should just skip qualifying view locations.

### Why are the changes needed?

avoid regression

### Does this PR introduce _any_ user-facing change?

Spark can read view with invalid location again.

### How was this patch tested?

Manually tested. A view with an invalid location is kind of "broken" and can't be dropped (HMS fails to drop it), so we can't write a UT for it.

Closes #38321 from cloud-fan/follow.

Authored-by: Wenchen Fan 
Signed-off-by: Dongjoon Hyun 
---
 .../apache/spark/sql/hive/client/HiveClientImpl.scala  | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
index f6b06b08cbc..213d930653d 100644
--- 
a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
+++ 
b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
@@ -537,12 +537,18 @@ private[hive] class HiveClientImpl(
   storage = CatalogStorageFormat(
 locationUri = shim.getDataLocation(h).map { loc =>
   val tableUri = stringToURI(loc)
-  // Before SPARK-19257, created data source table does not use 
absolute uri.
-  // This makes Spark can't read these tables across HDFS clusters.
-  // Rewrite table location to absolute uri based on database uri to 
fix this issue.
-  val absoluteUri = Option(tableUri).filterNot(_.isAbsolute)
-.map(_ => 
stringToURI(client.getDatabase(h.getDbName).getLocationUri))
-  HiveExternalCatalog.toAbsoluteURI(tableUri, absoluteUri)
+  if (h.getTableType == HiveTableType.VIRTUAL_VIEW) {
+// Data location of SQL view is useless. Do not qualify it even if 
it's present, as
+// it can be an invalid path.
+tableUri
+  } else {
+// Before SPARK-19257, created data source table does not use 
absolute uri.
+// This makes Spark can't read these tables across HDFS clusters.
+// Rewrite table location to absolute uri based on database uri to 
fix this issue.
+val absoluteUri = Option(tableUri).filterNot(_.isAbsolute)
+  .map(_ => 
stringToURI(client.getDatabase(h.getDbName).getLocationUri))
+HiveExternalCatalog.toAbsoluteURI(tableUri, absoluteUri)
+  }
 },
 // To avoid ClassNotFound exception, we try our best to not get the 
format class, but get
 // the class name directly. However, for non-native tables, there is 
no interface to get


[spark] branch master updated (0643d02e4f0 -> 3b60637d91b)

2022-10-20 Thread srowen
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 0643d02e4f0 [SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`
 add 3b60637d91b [SPARK-40843][CORE][TESTS] Clean up deprecated api usage 
in SparkThrowableSuite

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/SparkThrowableSuite.scala   | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)


[spark] branch master updated: [SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`

2022-10-20 Thread ruifengz
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 0643d02e4f0 [SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`
0643d02e4f0 is described below

commit 0643d02e4f03cdadb53efc05af0b6533d22db297
Author: Ruifeng Zheng 
AuthorDate: Thu Oct 20 17:41:16 2022 +0800

[SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`

### What changes were proposed in this pull request?
[`mypy-protobuf` 3.4.0](https://pypi.org/project/mypy-protobuf/#history) was just released and will break master shortly; pin `mypy-protobuf==3.3.0` to avoid this.

### Why are the changes needed?
to make CI happy

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually checked.

Closes #38316 from zhengruifeng/infra_ping_mypy_protobuf.

Authored-by: Ruifeng Zheng 
Signed-off-by: Ruifeng Zheng 
---
 .github/workflows/build_and_test.yml | 2 +-
 dev/requirements.txt | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml 
b/.github/workflows/build_and_test.yml
index a3998b330c1..12a1ad0e71e 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -587,7 +587,7 @@ jobs:
 #   See also https://issues.apache.org/jira/browse/SPARK-38279.
 python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme 
ipython nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
 python3.9 -m pip install ipython_genutils # See SPARK-38517
-python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
pyarrow pandas 'plotly>=4.8' grpcio protobuf mypy-protobuf
+python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' 
pyarrow pandas 'plotly>=4.8' grpcio protobuf 'mypy-protobuf==3.3.0'
 python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
 apt-get update -y
 apt-get install -y ruby ruby-dev
diff --git a/dev/requirements.txt b/dev/requirements.txt
index 651fc280627..fa4b6752f14 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -51,4 +51,4 @@ grpcio==1.48.1
 protobuf==4.21.6
 
 # Spark Connect python proto generation plugin (optional)
-mypy-protobuf
+mypy-protobuf==3.3.0


[spark] branch master updated (2698d6bf10b -> 89a31296819)

2022-10-20 Thread dongjoon
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


from 2698d6bf10b [SPARK-40838][INFRA][TESTS] Upgrade infra base image to 
focal-20220922 and fix ps.mlflow doctest
 add 89a31296819 [SPARK-40778][CORE] Make HeartbeatReceiver as an 
IsolatedRpcEndpoint

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)


[spark] branch master updated: [SPARK-40838][INFRA][TESTS] Upgrade infra base image to focal-20220922 and fix ps.mlflow doctest

2022-10-20 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git


The following commit(s) were added to refs/heads/master by this push:
 new 2698d6bf10b [SPARK-40838][INFRA][TESTS] Upgrade infra base image to 
focal-20220922 and fix ps.mlflow doctest
2698d6bf10b is described below

commit 2698d6bf10b92e71e8af88fedb4e7c9e0f304416
Author: Yikun Jiang 
AuthorDate: Thu Oct 20 15:54:18 2022 +0800

[SPARK-40838][INFRA][TESTS] Upgrade infra base image to focal-20220922 and 
fix ps.mlflow doctest

### What changes were proposed in this pull request?
Upgrade infra base image to focal-20220922 and fix ps.mlflow doctest

### Why are the changes needed?
- Upgrade the infra base image to `focal-20220922` (currently the latest Ubuntu 20.04 tag).
- Infra image Python package versions updated:
  - numpy 1.23.3 --> 1.23.4
  - mlflow 1.28.0 --> 1.29.0
  - matplotlib 3.5.3 --> 3.6.1
  - pip 22.2.2 --> 22.3
  - scipy 1.9.1 --> 1.9.3

  Full list: https://www.diffchecker.com/e6eZZaYn
- Fix the ps.mlflow doctest (due to the mlflow upgrade):
```
**
File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 158, in 
pyspark.pandas.mlflow.load_model
Failed example:
with mlflow.start_run():
lr = LinearRegression()
lr.fit(train_x, train_y)
mlflow.sklearn.log_model(lr, "model")
Expected:
LinearRegression(...)
Got:
LinearRegression()

```

### Does this PR introduce _any_ user-facing change?
No, dev only

### How was this patch tested?
All CI passed

Closes #38304 from Yikun/SPARK-40838.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 dev/infra/Dockerfile| 4 ++--
 python/pyspark/pandas/mlflow.py | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index ccf0c932b0e..2a70bd3f98f 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -17,9 +17,9 @@
 
 # Image for building and testing Spark branches. Based on Ubuntu 20.04.
 # See also in https://hub.docker.com/_/ubuntu
-FROM ubuntu:focal-20220801
+FROM ubuntu:focal-20220922
 
-ENV FULL_REFRESH_DATE 20220706
+ENV FULL_REFRESH_DATE 20221019
 
 ENV DEBIAN_FRONTEND noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN true
diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 094215743e2..469349b37ee 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -159,7 +159,7 @@ def load_model(
 ... lr = LinearRegression()
 ... lr.fit(train_x, train_y)
 ... mlflow.sklearn.log_model(lr, "model")
-LinearRegression(...)
+LinearRegression...
 
 Now that our model is logged using MLflow, we load it back and apply it on 
a pandas-on-Spark
 dataframe:


[spark-docker] branch master updated: [SPARK-40845] Add template support for SPARK_GPG_KEY and fix GPG verify

2022-10-20 Thread yikun
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git


The following commit(s) were added to refs/heads/master by this push:
 new 896a36e  [SPARK-40845] Add template support for SPARK_GPG_KEY and fix 
GPG verify
896a36e is described below

commit 896a36e36c094bf1480f4819005e2982ea8af417
Author: Yikun Jiang 
AuthorDate: Thu Oct 20 15:38:03 2022 +0800

[SPARK-40845] Add template support for SPARK_GPG_KEY and fix GPG verify

### What changes were proposed in this pull request?
This patch:
- Add template support for `SPARK_GPG_KEY`.
- Fix a bug in the GPG verification (change `||` to `;`).
- Use keys.openpgp.org instead of keyserver.pgp.com, because the release key is uploaded to keys.openpgp.org as part of the [spark release process](https://spark.apache.org/release-process.html).

### Why are the changes needed?
Each version has a specific GPG key to verify against, so we need to set the GPG key per version.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed
Ran `./add-dockerfiles.sh 3.3.0` and verified that the GPG key is set correctly.

Closes #16 from Yikun/GPG.

Authored-by: Yikun Jiang 
Signed-off-by: Yikun Jiang 
---
 3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile | 6 +++---
 3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile   | 6 +++---
 3.3.0/scala2.12-java11-r-ubuntu/Dockerfile | 6 +++---
 3.3.0/scala2.12-java11-ubuntu/Dockerfile   | 6 +++---
 Dockerfile.template| 6 +++---
 tools/template.py  | 6 ++
 6 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile 
b/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
index be9cbb0..ac16bdd 100644
--- a/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
+++ b/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -45,7 +45,7 @@ RUN set -ex && \
 # https://downloads.apache.org/spark/KEYS
 ENV 
SPARK_TGZ_URL=https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
 \
 
SPARK_TGZ_ASC_URL=https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc
 \
-GPG_KEY=E298A3A825C0D65DFD57CBB651716619E084DAB9
+GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3
 
 RUN set -ex; \
 export SPARK_TMP="$(mktemp -d)"; \
@@ -53,8 +53,8 @@ RUN set -ex; \
 wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
 wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
 export GNUPGHOME="$(mktemp -d)"; \
-gpg --keyserver hkps://keyserver.pgp.com --recv-key "$GPG_KEY" || \
-gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY" || \
+gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
 gpg --batch --verify spark.tgz.asc spark.tgz; \
 gpgconf --kill all; \
 rm -rf "$GNUPGHOME" spark.tgz.asc; \
diff --git a/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile 
b/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile
index 096c7eb..c6e433d 100644
--- a/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile
+++ b/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile
@@ -44,7 +44,7 @@ RUN set -ex && \
 # https://downloads.apache.org/spark/KEYS
 ENV 
SPARK_TGZ_URL=https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
 \
 
SPARK_TGZ_ASC_URL=https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc
 \
-GPG_KEY=E298A3A825C0D65DFD57CBB651716619E084DAB9
+GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3
 
 RUN set -ex; \
 export SPARK_TMP="$(mktemp -d)"; \
@@ -52,8 +52,8 @@ RUN set -ex; \
 wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
 wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
 export GNUPGHOME="$(mktemp -d)"; \
-gpg --keyserver hkps://keyserver.pgp.com --recv-key "$GPG_KEY" || \
-gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY" || \
+gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
 gpg --batch --verify spark.tgz.asc spark.tgz; \
 gpgconf --kill all; \
 rm -rf "$GNUPGHOME" spark.tgz.asc; \
diff --git a/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile 
b/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile
index 2e085a2..975e444 100644
--- a/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile
+++ b/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile
@@ -42,7 +42,7 @@ RUN set -ex && \
 # https://downloads.apache.org/spark/KEYS
 ENV 
SPARK_TGZ_URL=https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
 \
 
SPARK_TGZ_ASC_URL=https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc
 \
-GPG_KEY=E298A3A825C0D65DFD57CBB651716619E084DAB9
+GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3
 
 RUN set -ex; \
 export SPARK_TMP="$(mktemp