[spark] branch master updated: [SPARK-40657] Add support for Java classes in Protobuf functions
This is an automated email from the ASF dual-hosted git repository.

kabhwan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new fd9e5760bae [SPARK-40657] Add support for Java classes in Protobuf functions
fd9e5760bae is described below

commit fd9e5760bae847f47c9c108f0e58814748e0d9b1
Author: Raghu Angadi
AuthorDate: Fri Oct 21 15:46:50 2022 +0900

    [SPARK-40657] Add support for Java classes in Protobuf functions

    ### What changes were proposed in this pull request?
    Adds support for compiled Java classes to Protobuf functions. This is tested with Protobuf v3 classes. V2 vs V3 issues will be handled in a separate PR.

    The main changes in this PR:
    - Changes to the top-level API:
      - Adds a new version that takes just the class name.
      - Changes the order of arguments for the existing API with descriptor files (`messageName` and `descFilePath` are swapped).
    - Protobuf utils methods to create a descriptor from a Java class name.
    - Many unit tests are updated to check both versions: (1) with a descriptor file and (2) with a Java class name.
    - Maven build updates to generate Java classes for use in tests.
    - Miscellaneous changes:
      - Adds `proto` to the package name in `proto` files used in tests.
      - A few TODO comments about improvements.

    ### Why are the changes needed?
    Compiled Java classes are a common way for users to provide Protobuf definitions.

    ### Does this PR introduce _any_ user-facing change?
    No. This updates the interface, but only for a new feature in active development.

    ### How was this patch tested?
    - Unit tests

    Closes #38286 from rangadi/protobuf-java.

    Authored-by: Raghu Angadi
    Signed-off-by: Jungtaek Lim
---
 connector/protobuf/pom.xml                         |  23 +-
 .../sql/protobuf/CatalystDataToProtobuf.scala      |  10 +-
 .../sql/protobuf/ProtobufDataToCatalyst.scala      |  34 ++-
 .../org/apache/spark/sql/protobuf/functions.scala  |  58 +++-
 .../spark/sql/protobuf/utils/ProtobufUtils.scala   |  65 -
 .../sql/protobuf/utils/SchemaConverters.scala      |   4 +
 .../test/resources/protobuf/catalyst_types.proto   |   4 +-
 .../test/resources/protobuf/functions_suite.proto  |   4 +-
 .../src/test/resources/protobuf/serde_suite.proto  |   6 +-
 .../ProtobufCatalystDataConversionSuite.scala      |  97 +--
 .../sql/protobuf/ProtobufFunctionsSuite.scala      | 318 +
 .../spark/sql/protobuf/ProtobufSerdeSuite.scala    |   9 +-
 project/SparkBuild.scala                           |   6 +-
 python/pyspark/sql/protobuf/functions.py           |  22 +-
 14 files changed, 437 insertions(+), 223 deletions(-)

diff --git a/connector/protobuf/pom.xml b/connector/protobuf/pom.xml
index 0515f128b8d..b934c7f831a 100644
--- a/connector/protobuf/pom.xml
+++ b/connector/protobuf/pom.xml
@@ -83,7 +83,6 @@
       ${protobuf.version}
       compile
-
       target/scala-${scala.binary.version}/classes
@@ -110,6 +109,28 @@
+
+com.github.os72
+protoc-jar-maven-plugin
+3.11.4
+
+
+
+generate-test-sources
+
+run
+
+
+com.google.protobuf:protoc:${protobuf.version}
+${protobuf.version}
+
+src/test/resources/protobuf
+
+test
+
+
+
+
diff --git a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala
index 145100268c2..b9f7907ea8c 100644
--- a/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala
+++ b/connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/CatalystDataToProtobuf.scala
@@ -25,17 +25,17 @@ import org.apache.spark.sql.types.{BinaryType, DataType}

 private[protobuf] case class CatalystDataToProtobuf(
     child: Expression,
-    descFilePath: String,
-    messageName: String)
+    messageName: String,
+    descFilePath: Option[String] = None)
   extends UnaryExpression {

   override def dataType: DataType = BinaryType

-  @transient private lazy val protoType =
-    ProtobufUtils.buildDescriptor(descFilePath, messageName)
+  @transient private lazy val protoDescriptor =
+    ProtobufUtils.buildDescriptor(messageName, descFilePathOpt = descFilePath)

   @transient private lazy val serializer =
-    new ProtobufSerializer(child.dataType, protoType, child.nullable)
+    new ProtobufSerializer(child.dataType, protoDescriptor, child.nullable)

   override def nullSafeEval(input: Any): Any = {
     val dynamicMessage = serializer.serialize(input).asInstanceOf[DynamicMes
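For readers of this digest, a minimal PySpark sketch of the two call styles the commit describes, assuming the Python wrapper in `python/pyspark/sql/protobuf/functions.py` mirrors the Scala change; the class name, message name, and descriptor path below are hypothetical:

```python
from pyspark.sql.protobuf.functions import from_protobuf

# (1) New style: pass only the compiled Java class name; the class must be
#     available on the classpath (hypothetical class name).
events = df.select(from_protobuf(df.value, "org.example.proto.AppEvent").alias("event"))

# (2) Descriptor-file style: note `messageName` now comes BEFORE `descFilePath`,
#     per the argument swap described in the commit message.
events = df.select(from_protobuf(df.value, "AppEvent", "/tmp/app_event.desc").alias("event"))
```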
[spark] branch master updated: [SPARK-40812][CONNECT] Add Deduplicate to Connect proto and DSL
This is an automated email from the ASF dual-hosted git repository.

wenchen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new b14da8b1b65 [SPARK-40812][CONNECT] Add Deduplicate to Connect proto and DSL
b14da8b1b65 is described below

commit b14da8b1b65d9f00f49fab87f738715089bc43e8
Author: Rui Wang
AuthorDate: Fri Oct 21 13:14:43 2022 +0800

    [SPARK-40812][CONNECT] Add Deduplicate to Connect proto and DSL

    ### What changes were proposed in this pull request?
    This PR adds `Deduplicate` to the Connect proto and DSL. Note that `Deduplicate` cannot be replaced by SQL's `SELECT DISTINCT col_list`. The difference is that `Deduplicate` allows removing duplicated rows based on a set of columns while returning all the columns. SQL's `SELECT DISTINCT col_list`, instead, can only return the `col_list`.

    ### Why are the changes needed?
    1. To improve proto API coverage.
    2. `Deduplicate` blocks https://github.com/apache/spark/pull/38166 because we want to support `Union(isAll=false)`, but that will return `Union().Distinct()` to match the existing DataFrame API. `Deduplicate` is needed to write test cases for `Union(isAll=false)`.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    UT

    Closes #38276 from amaliujia/supportDropDuplicates.

    Authored-by: Rui Wang
    Signed-off-by: Wenchen Fan
---
 .../main/protobuf/spark/connect/relations.proto    |  9 ++
 .../org/apache/spark/sql/connect/dsl/package.scala | 20 +
 .../sql/connect/planner/SparkConnectPlanner.scala  | 35 +++-
 .../planner/SparkConnectDeduplicateSuite.scala     | 68 +++
 .../connect/planner/SparkConnectPlannerSuite.scala | 29 ++-
 python/pyspark/sql/connect/proto/relations_pb2.py  | 98 +++---
 python/pyspark/sql/connect/proto/relations_pb2.pyi | 50 +++
 7 files changed, 257 insertions(+), 52 deletions(-)

diff --git a/connector/connect/src/main/protobuf/spark/connect/relations.proto b/connector/connect/src/main/protobuf/spark/connect/relations.proto
index eadedf495d3..6adf0831ea2 100644
--- a/connector/connect/src/main/protobuf/spark/connect/relations.proto
+++ b/connector/connect/src/main/protobuf/spark/connect/relations.proto
@@ -43,6 +43,7 @@ message Relation {
     LocalRelation local_relation = 11;
     Sample sample = 12;
     Offset offset = 13;
+    Deduplicate deduplicate = 14;

     Unknown unknown = 999;
   }
@@ -181,6 +182,14 @@ message Sort {
   }
 }

+// Relation of type [[Deduplicate]] which have duplicate rows removed, could consider either only
+// the subset of columns or all the columns.
+message Deduplicate {
+  Relation input = 1;
+  repeated string column_names = 2;
+  bool all_columns_as_keys = 3;
+}
+
 message LocalRelation {
   repeated Expression.QualifiedAttribute attributes = 1;
   // TODO: support local data.
diff --git a/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
index 8a267dff7d7..68bbc0487f9 100644
--- a/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
+++ b/connector/connect/src/main/scala/org/apache/spark/sql/connect/dsl/package.scala
@@ -215,6 +215,26 @@ package object dsl {
         .build()
     }

+    def deduplicate(colNames: Seq[String]): proto.Relation =
+      proto.Relation
+        .newBuilder()
+        .setDeduplicate(
+          proto.Deduplicate
+            .newBuilder()
+            .setInput(logicalPlan)
+            .addAllColumnNames(colNames.asJava))
+        .build()
+
+    def distinct(): proto.Relation =
+      proto.Relation
+        .newBuilder()
+        .setDeduplicate(
+          proto.Deduplicate
+            .newBuilder()
+            .setInput(logicalPlan)
+            .setAllColumnsAsKeys(true))
+        .build()
+
     def join(
         otherPlan: proto.Relation,
         joinType: JoinType = JoinType.JOIN_TYPE_INNER,
diff --git a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
index 450283a9b81..92c8bf01cba 100644
--- a/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
+++ b/connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala
@@ -27,8 +27,9 @@ import org.apache.spark.sql.catalyst.expressions
 import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeReference, Expression}
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.catalyst.plans.{logical, FullOuter, Inner, JoinType, LeftAnti, Left
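To make the `Deduplicate`-vs-`SELECT DISTINCT` distinction above concrete, a small sketch using the existing DataFrame API that this proto message models (assumes a running SparkSession `spark`; the data is hypothetical):

```python
df = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["id", "value"])

# Deduplicate on a subset of columns but keep ALL columns in the result.
df.dropDuplicates(["id"]).show()               # two rows; `value` is still present

# SELECT DISTINCT can only return the columns it projects.
df.createOrReplaceTempView("t")
spark.sql("SELECT DISTINCT id FROM t").show()  # only the `id` column survives
```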
[spark] branch master updated: [SPARK-40768][SQL] Migrate type check failures of bloom_filter_agg() onto error classes
This is an automated email from the ASF dual-hosted git repository.

maxgekk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 11cce7e7338 [SPARK-40768][SQL] Migrate type check failures of bloom_filter_agg() onto error classes
11cce7e7338 is described below

commit 11cce7e73380231ec7c94096655e3d98ce7e635d
Author: lvshaokang
AuthorDate: Fri Oct 21 10:06:44 2022 +0500

    [SPARK-40768][SQL] Migrate type check failures of bloom_filter_agg() onto error classes

    ### What changes were proposed in this pull request?
    In the PR, I propose to use error classes in the case of type check failures in Bloom Filter Agg expressions.

    ### Why are the changes needed?
    Migration onto error classes unifies Spark SQL error messages.

    ### Does this PR introduce _any_ user-facing change?
    Yes. The PR changes user-facing error messages.

    ### How was this patch tested?
    ```
    build/sbt "sql/testOnly *SQLQueryTestSuite"
    build/sbt "test:testOnly org.apache.spark.SparkThrowableSuite"
    build/sbt "test:testOnly *BloomFilterAggregateQuerySuite"
    ```

    Closes #38315 from lvshaokang/SPARK-40768.

    Authored-by: lvshaokang
    Signed-off-by: Max Gekk
---
 core/src/main/resources/error/error-classes.json   |   2 +-
 .../expressions/BloomFilterMightContain.scala      |   3 +-
 .../aggregate/BloomFilterAggregate.scala           |  61 +++--
 .../spark/sql/BloomFilterAggregateQuerySuite.scala | 144 ++---
 4 files changed, 179 insertions(+), 31 deletions(-)

diff --git a/core/src/main/resources/error/error-classes.json b/core/src/main/resources/error/error-classes.json
index 0cfb6861c77..1e9519dd89a 100644
--- a/core/src/main/resources/error/error-classes.json
+++ b/core/src/main/resources/error/error-classes.json
@@ -108,7 +108,7 @@
   },
   "BLOOM_FILTER_WRONG_TYPE" : {
     "message" : [
-      "Input to function <functionName> should have been <expectedLeft> followed by a value with <expectedRight>, but it's [<actualLeft>, <actualRight>]."
+      "Input to function <functionName> should have been <expectedLeft> followed by value with <expectedRight>, but it's [<actual>]."
    ]
  },
  "CANNOT_CONVERT_TO_JSON" : {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala
index 5cb19d36b80..b2273b6a6d1 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/BloomFilterMightContain.scala
@@ -76,8 +76,7 @@ case class BloomFilterMightContain(
             "functionName" -> toSQLId(prettyName),
             "expectedLeft" -> toSQLType(BinaryType),
             "expectedRight" -> toSQLType(LongType),
-            "actualLeft" -> toSQLType(left.dataType),
-            "actualRight" -> toSQLType(right.dataType)
+            "actual" -> Seq(left.dataType, right.dataType).map(toSQLType).mkString(", ")
           )
         )
     }
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala
index c734bca3ef8..5b78c5b5228 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/BloomFilterAggregate.scala
@@ -24,6 +24,7 @@ import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult
 import org.apache.spark.sql.catalyst.analysis.TypeCheckResult._
 import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.Cast.{toSQLExpr, toSQLId, toSQLType, toSQLValue}
 import org.apache.spark.sql.catalyst.trees.TernaryLike
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.types._
@@ -63,28 +64,66 @@ case class BloomFilterAggregate(
   override def checkInputDataTypes(): TypeCheckResult = {
     (first.dataType, second.dataType, third.dataType) match {
       case (_, NullType, _) | (_, _, NullType) =>
-        TypeCheckResult.TypeCheckFailure("Null typed values cannot be used as size arguments")
+        DataTypeMismatch(
+          errorSubClass = "UNEXPECTED_NULL",
+          messageParameters = Map(
+            "exprName" -> "estimatedNumItems or numBits"
+          )
+        )
       case (LongType, LongType, LongType) =>
         if (!estimatedNumItemsExpression.foldable) {
-          TypeCheckFailure("The estimated number of items provided must be a constant literal")
+          DataTypeMismatch(
+            errorSubClass = "NON_FOLDABLE_INPUT",
+            messageParameters =
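As a toy illustration (not Spark's internal implementation) of how such an error-class template is rendered from `messageParameters`, using the `BLOOM_FILTER_WRONG_TYPE` template above with hypothetical parameter values:

```python
# Template text from the BLOOM_FILTER_WRONG_TYPE entry above.
template = ("Input to function <functionName> should have been <expectedLeft> "
            "followed by value with <expectedRight>, but it's [<actual>].")

params = {
    "functionName": "`might_contain`",
    "expectedLeft": '"BINARY"',
    "expectedRight": '"BIGINT"',
    "actual": '"INT", "INT"',
}

# Substitute each <name> placeholder with its parameter value.
message = template
for name, value in params.items():
    message = message.replace(f"<{name}>", value)
print(message)
```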
[spark] branch master updated (40086cb9b21 -> 670bc1dfc0e)
This is an automated email from the ASF dual-hosted git repository.

gurwls223 pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from 40086cb9b21 [SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`
 add 670bc1dfc0e [SPARK-40615][SQL][TESTS][FOLLOW-UP] Make the test pass with ANSI enabled

No new revisions were added by this update.

Summary of changes:
 sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)
[spark] branch master updated: [SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 40086cb9b21 [SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`
40086cb9b21 is described below

commit 40086cb9b21fe207242c4928d8e2cc3e756d61da
Author: Yikun Jiang
AuthorDate: Fri Oct 21 11:06:33 2022 +0800

    [SPARK-40860][INFRA] Change `set-output` to `GITHUB_EVENT`

    ### What changes were proposed in this pull request?
    Change `set-output` to `GITHUB_OUTPUT`.

    ### Why are the changes needed?
    The `set-output` command is deprecated and will be disabled soon. Please upgrade to using Environment Files. For more information see: https://github.blog/changelog/2022-10-11-github-actions-deprecating-save-state-and-set-output-commands/

    ### Does this PR introduce _any_ user-facing change?
    No, dev only

    ### How was this patch tested?
    - CI passed
    - Also did a local test on benchmark: https://github.com/Yikun/spark/actions/runs/3294384181/jobs/5431945626

    Closes #38323 from Yikun/set-output.

    Authored-by: Yikun Jiang
    Signed-off-by: Yikun Jiang
---
 .github/workflows/benchmark.yml      |  2 +-
 .github/workflows/build_and_test.yml | 13 ++---
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 5508227b8b2..f73267a95fa 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -54,7 +54,7 @@ jobs:
     steps:
     - name: Generate matrix
       id: set-matrix
-      run: echo "::set-output name=matrix::["`seq -s, 1 $SPARK_BENCHMARK_NUM_SPLITS`"]"
+      run: echo "matrix=["`seq -s, 1 $SPARK_BENCHMARK_NUM_SPLITS`"]" >> $GITHUB_OUTPUT

   # Any TPC-DS related updates on this job need to be applied to tpcds-1g job of build_and_test.yml as well
   tpcds-1g-gen:
diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index e0adad54aed..f9b445e9bbd 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -103,16 +103,15 @@ jobs:
           \"k8s-integration-tests\" : \"true\",
           }"
           echo $precondition # For debugging
-          # GitHub Actions set-output doesn't take newlines
-          # https://github.community/t/set-output-truncates-multiline-strings/16852/3
-          precondition="${precondition//$'\n'/'%0A'}"
-          echo "::set-output name=required::$precondition"
+          # Remove `\n` to avoid "Invalid format" error
+          precondition="${precondition//$'\n'/}}"
+          echo "required=$precondition" >> $GITHUB_OUTPUT
        else
          # This is usually set by scheduled jobs.
          precondition='${{ inputs.jobs }}'
          echo $precondition # For debugging
-          precondition="${precondition//$'\n'/}"
-          echo "::set-output name=required::$precondition"
+          precondition="${precondition//$'\n'/}"
+          echo "required=$precondition" >> $GITHUB_OUTPUT
        fi
    - name: Generate infra image URL
      id: infra-image-outputs
@@ -121,7 +120,7 @@ jobs:
        REPO_OWNER=$(echo "${{ github.repository_owner }}" | tr '[:upper:]' '[:lower:]')
        IMG_NAME="apache-spark-ci-image:${{ inputs.branch }}-${{ github.run_id }}"
        IMG_URL="ghcr.io/$REPO_OWNER/$IMG_NAME"
-        echo ::set-output name=image_url::$IMG_URL
+        echo "image_url=$IMG_URL" >> $GITHUB_OUTPUT

  # Build: build Spark and run the tests for specified modules.
  build:
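The Environment Files mechanism the workflow moves to is not shell-specific: any process in a step may append `key=value` lines to the file named by the `GITHUB_OUTPUT` environment variable. A minimal sketch in Python (runnable only inside a GitHub Actions step, where that variable is set):

```python
import os

# Equivalent of `echo "matrix=[1,2,3]" >> $GITHUB_OUTPUT` in the diff above.
with open(os.environ["GITHUB_OUTPUT"], "a", encoding="utf-8") as out:
    out.write("matrix=[1,2,3]\n")
```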
[spark] branch master updated: [SPARK-40859][INFRA] Upgrade action/checkout to v3 to cleanup warning
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 17efe044fa7 [SPARK-40859][INFRA] Upgrade action/checkout to v3 to cleanup warning
17efe044fa7 is described below

commit 17efe044fa7d366fa0beafe71c5e76d46f942b7e
Author: Yikun Jiang
AuthorDate: Fri Oct 21 10:36:00 2022 +0800

    [SPARK-40859][INFRA] Upgrade action/checkout to v3 to cleanup warning

    ### What changes were proposed in this pull request?
    Upgrade action/checkout to v3 (points to v3.1 now).

    ### Why are the changes needed?
    - https://github.com/actions/checkout/releases/tag/v3.1.0 cleans up "[The 'set-output' command is deprecated and will be disabled soon.](https://github.com/actions/checkout/issues/959#issuecomment-1282107197)"
    - https://github.com/actions/checkout/releases/tag/v3.0.0: since v3, the action runs on Node 16, cleaning up "[Node.js 12 actions are deprecated](https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/)"

    According to https://github.com/actions/checkout/issues/959#issuecomment-1282107197, v2.5 also addresses the 'set-output' warning, but only v3 supports Node 16, so we upgrade to v3.1 rather than v2.5.

    ### Does this PR introduce _any_ user-facing change?
    No, dev only

    ### How was this patch tested?
    CI passed

    Closes #38322 from Yikun/checkout-v3.

    Authored-by: Yikun Jiang
    Signed-off-by: Yikun Jiang
---
 .github/workflows/benchmark.yml                |  6 +++---
 .github/workflows/build_and_test.yml           | 24
 .github/workflows/build_infra_images_cache.yml |  2 +-
 .github/workflows/publish_snapshot.yml         |  2 +-
 4 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/.github/workflows/benchmark.yml b/.github/workflows/benchmark.yml
index 52adec20e5c..5508227b8b2 100644
--- a/.github/workflows/benchmark.yml
+++ b/.github/workflows/benchmark.yml
@@ -65,7 +65,7 @@ jobs:
       SPARK_LOCAL_IP: localhost
     steps:
     - name: Checkout Spark repository
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       # In order to get diff files
       with:
         fetch-depth: 0
@@ -95,7 +95,7 @@ jobs:
         key: tpcds-${{ hashFiles('.github/workflows/benchmark.yml', 'sql/core/src/test/scala/org/apache/spark/sql/TPCDSSchema.scala') }}
     - name: Checkout tpcds-kit repository
       if: steps.cache-tpcds-sf-1.outputs.cache-hit != 'true'
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       with:
         repository: databricks/tpcds-kit
         ref: 2a5078a782192ddb6efbcead8de9973d6ab4f069
@@ -133,7 +133,7 @@ jobs:
       SPARK_TPCDS_DATA: ${{ github.workspace }}/tpcds-sf-1
     steps:
     - name: Checkout Spark repository
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       # In order to get diff files
       with:
         fetch-depth: 0
diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index 12a1ad0e71e..e0adad54aed 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -63,7 +63,7 @@ jobs:
       }}
     steps:
     - name: Checkout Spark repository
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       with:
         fetch-depth: 0
         repository: apache/spark
@@ -195,7 +195,7 @@ jobs:
       SPARK_LOCAL_IP: localhost
     steps:
     - name: Checkout Spark repository
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       # In order to fetch changed files
       with:
         fetch-depth: 0
@@ -286,7 +286,7 @@ jobs:
         username: ${{ github.actor }}
         password: ${{ secrets.GITHUB_TOKEN }}
     - name: Checkout Spark repository
-      uses: actions/checkout@v2
+      uses: actions/checkout@v3
       # In order to fetch changed
files with: fetch-depth: 0 @@ -349,7 +349,7 @@ jobs: METASPACE_SIZE: 1g steps: - name: Checkout Spark repository - uses: actions/checkout@v2 + uses: actions/checkout@v3 # In order to fetch changed files with: fetch-depth: 0 @@ -438,7 +438,7 @@ jobs: SKIP_MIMA: true steps: - name: Checkout Spark repository - uses: actions/checkout@v2 + uses: actions/checkout@v3 # In order to fetch changed files with: fetch-depth: 0 @@ -508,7 +508,7 @@ jobs: image: ${{ needs.precondition.outputs.image_url }} steps: - name: Checkout Spark repository - uses: actions/checkout@v2 + uses: actions/checkout@v3 with: fetch-depth: 0 repository: apache/spark @@ -635,7 +635,7 @@ jobs: runs-on: ubuntu-20.04 steps: - name: Checkout Spark repository -
[spark] branch master updated: [SPARK-40813][CONNECT][PYTHON][FOLLOW-UP] Improve limit and offset in Python client
This is an automated email from the ASF dual-hosted git repository. gurwls223 pushed a commit to branch master in repository https://gitbox.apache.org/repos/asf/spark.git The following commit(s) were added to refs/heads/master by this push: new 45bb9578a0f [SPARK-40813][CONNECT][PYTHON][FOLLOW-UP] Improve limit and offset in Python client 45bb9578a0f is described below commit 45bb9578a0f6b40b472588a407d842f293e9e323 Author: Rui Wang AuthorDate: Fri Oct 21 10:33:18 2022 +0900 [SPARK-40813][CONNECT][PYTHON][FOLLOW-UP] Improve limit and offset in Python client ### What changes were proposed in this pull request? Following up after https://github.com/apache/spark/pull/38275, improve limit and offset in Python client. ### Why are the changes needed? Improve API coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #38314 from amaliujia/python_test_limit_offset. Authored-by: Rui Wang Signed-off-by: Hyukjin Kwon --- python/pyspark/sql/connect/dataframe.py| 3 ++ python/pyspark/sql/connect/plan.py | 32 -- .../sql/tests/connect/test_connect_basic.py| 7 + .../sql/tests/connect/test_connect_plan_only.py| 10 +++ 4 files changed, 49 insertions(+), 3 deletions(-) diff --git a/python/pyspark/sql/connect/dataframe.py b/python/pyspark/sql/connect/dataframe.py index 1f7e789818f..5ca747fdd6a 100644 --- a/python/pyspark/sql/connect/dataframe.py +++ b/python/pyspark/sql/connect/dataframe.py @@ -199,6 +199,9 @@ class DataFrame(object): def limit(self, n: int) -> "DataFrame": return DataFrame.withPlan(plan.Limit(child=self._plan, limit=n), session=self._session) +def offset(self, n: int) -> "DataFrame": +return DataFrame.withPlan(plan.Offset(child=self._plan, offset=n), session=self._session) + def sort(self, *cols: "ColumnOrString") -> "DataFrame": """Sort by a specific column""" return DataFrame.withPlan(plan.Sort(self._plan, *cols), session=self._session) diff --git a/python/pyspark/sql/connect/plan.py b/python/pyspark/sql/connect/plan.py index c564b71cdba..5b8b7c71866 100644 --- a/python/pyspark/sql/connect/plan.py +++ b/python/pyspark/sql/connect/plan.py @@ -272,10 +272,9 @@ class Filter(LogicalPlan): class Limit(LogicalPlan): -def __init__(self, child: Optional["LogicalPlan"], limit: int, offset: int = 0) -> None: +def __init__(self, child: Optional["LogicalPlan"], limit: int) -> None: super().__init__(child) self.limit = limit -self.offset = offset def plan(self, session: Optional["RemoteSparkSession"]) -> proto.Relation: assert self._child is not None @@ -286,7 +285,7 @@ class Limit(LogicalPlan): def print(self, indent: int = 0) -> str: c_buf = self._child.print(indent + LogicalPlan.INDENT) if self._child else "" -return f"{' ' * indent}\n{c_buf}" +return f"{' ' * indent}\n{c_buf}" def _repr_html_(self) -> str: return f""" @@ -294,6 +293,33 @@ class Limit(LogicalPlan): Limit Limit: {self.limit} +{self._child_repr_()} + + +""" + + +class Offset(LogicalPlan): +def __init__(self, child: Optional["LogicalPlan"], offset: int = 0) -> None: +super().__init__(child) +self.offset = offset + +def plan(self, session: Optional["RemoteSparkSession"]) -> proto.Relation: +assert self._child is not None +plan = proto.Relation() +plan.offset.input.CopyFrom(self._child.plan(session)) +plan.offset.offset = self.offset +return plan + +def print(self, indent: int = 0) -> str: +c_buf = self._child.print(indent + LogicalPlan.INDENT) if self._child else "" +return f"{' ' * indent}\n{c_buf}" + +def _repr_html_(self) -> str: +return f""" + + +Limit Offset: {self.offset} 
{self._child_repr_()} diff --git a/python/pyspark/sql/tests/connect/test_connect_basic.py b/python/pyspark/sql/tests/connect/test_connect_basic.py index de300946932..f6988a1d120 100644 --- a/python/pyspark/sql/tests/connect/test_connect_basic.py +++ b/python/pyspark/sql/tests/connect/test_connect_basic.py @@ -106,6 +106,13 @@ class SparkConnectTests(SparkConnectSQLTestCase): res = pandas.DataFrame(data={"id": [0, 30, 60, 90]}) self.assert_(pd.equals(res), f"{pd.to_string()} != {res.to_string()}") +def test_limit_offset(self): +df = self.connect.read.table(self.tbl_name) +pd = df.limit(10).offset(1).toPandas() +self.assertEqual(9, len(pd.index)) +pd2 = df.offset(98).limit(10).toPandas() +self.assertEqual(2, len(pd2.index)) + def test_simple_dataso
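Mirroring `test_limit_offset` above, the user-facing chaining this commit enables in the Connect Python client looks like the following sketch, where `df` is a Connect DataFrame over the 100-row test table used in the suite:

```python
df.limit(10).offset(1).toPandas()    # limit to 10 rows, then skip 1 -> 9 rows
df.offset(98).limit(10).toPandas()   # skip 98 rows, then take up to 10 -> 2 rows
```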
[spark] branch master updated: [SPARK-39203][SQL][FOLLOWUP] Do not qualify view location
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new a51dd1820da [SPARK-39203][SQL][FOLLOWUP] Do not qualify view location
a51dd1820da is described below

commit a51dd1820dad8f24f99a979d5998b999ae4a3c25
Author: Wenchen Fan
AuthorDate: Thu Oct 20 15:16:04 2022 -0700

    [SPARK-39203][SQL][FOLLOWUP] Do not qualify view location

    ### What changes were proposed in this pull request?
    This fixes a corner-case regression caused by https://github.com/apache/spark/pull/36625. Users may have existing views with invalid locations for historical reasons. The location is actually useless for a view, but after https://github.com/apache/spark/pull/36625, reading such a view starts to fail because qualifying the location fails. We should just skip qualifying view locations.

    ### Why are the changes needed?
    Avoid a regression.

    ### Does this PR introduce _any_ user-facing change?
    Spark can read views with invalid locations again.

    ### How was this patch tested?
    Manually tested. A view with an invalid location is kind of "broken" and can't be dropped (HMS fails to drop it), so we can't write a UT for it.

    Closes #38321 from cloud-fan/follow.

    Authored-by: Wenchen Fan
    Signed-off-by: Dongjoon Hyun
---
 .../apache/spark/sql/hive/client/HiveClientImpl.scala | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
index f6b06b08cbc..213d930653d 100644
--- a/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
+++ b/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala
@@ -537,12 +537,18 @@ private[hive] class HiveClientImpl(
       storage = CatalogStorageFormat(
         locationUri = shim.getDataLocation(h).map { loc =>
           val tableUri = stringToURI(loc)
-          // Before SPARK-19257, created data source table does not use absolute uri.
-          // This makes Spark can't read these tables across HDFS clusters.
-          // Rewrite table location to absolute uri based on database uri to fix this issue.
-          val absoluteUri = Option(tableUri).filterNot(_.isAbsolute)
-            .map(_ => stringToURI(client.getDatabase(h.getDbName).getLocationUri))
-          HiveExternalCatalog.toAbsoluteURI(tableUri, absoluteUri)
+          if (h.getTableType == HiveTableType.VIRTUAL_VIEW) {
+            // Data location of SQL view is useless. Do not qualify it even if it's present, as
+            // it can be an invalid path.
+            tableUri
+          } else {
+            // Before SPARK-19257, created data source table does not use absolute uri.
+            // This makes Spark can't read these tables across HDFS clusters.
+            // Rewrite table location to absolute uri based on database uri to fix this issue.
+            val absoluteUri = Option(tableUri).filterNot(_.isAbsolute)
+              .map(_ => stringToURI(client.getDatabase(h.getDbName).getLocationUri))
+            HiveExternalCatalog.toAbsoluteURI(tableUri, absoluteUri)
+          }
        },
        // To avoid ClassNotFound exception, we try our best to not get the format class, but get
        // the class name directly. However, for non-native tables, there is no interface to get
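A toy sketch (not Spark's actual code, and a simplified notion of "absolute") of the guard this commit adds: qualify a relative table location against its database URI, but leave a view's location untouched because it may be an invalid path:

```python
from urllib.parse import urlparse

def resolve_location(loc: str, db_location_uri: str, is_view: bool) -> str:
    if is_view:
        return loc  # views: skip qualification entirely, even if a location is present
    if urlparse(loc).scheme:
        return loc  # already an absolute URI
    # Relative location: qualify it against the database's URI.
    return db_location_uri.rstrip("/") + "/" + loc.lstrip("/")
```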
[spark] branch master updated (0643d02e4f0 -> 3b60637d91b)
This is an automated email from the ASF dual-hosted git repository.

srowen pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from 0643d02e4f0 [SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`
 add 3b60637d91b [SPARK-40843][CORE][TESTS] Clean up deprecated api usage in SparkThrowableSuite

No new revisions were added by this update.

Summary of changes:
 .../scala/org/apache/spark/SparkThrowableSuite.scala | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)
[spark] branch master updated: [SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`
This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 0643d02e4f0 [SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`
0643d02e4f0 is described below

commit 0643d02e4f03cdadb53efc05af0b6533d22db297
Author: Ruifeng Zheng
AuthorDate: Thu Oct 20 17:41:16 2022 +0800

    [SPARK-40853][INFRA] Pin `mypy-protobuf==3.3.0`

    ### What changes were proposed in this pull request?
    [`mypy-protobuf` 3.4.0](https://pypi.org/project/mypy-protobuf/#history) has just been released and will break master shortly.

    ### Why are the changes needed?
    To make CI happy.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    Manually checked.

    Closes #38316 from zhengruifeng/infra_ping_mypy_protobuf.

    Authored-by: Ruifeng Zheng
    Signed-off-by: Ruifeng Zheng
---
 .github/workflows/build_and_test.yml | 2 +-
 dev/requirements.txt                 | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/build_and_test.yml b/.github/workflows/build_and_test.yml
index a3998b330c1..12a1ad0e71e 100644
--- a/.github/workflows/build_and_test.yml
+++ b/.github/workflows/build_and_test.yml
@@ -587,7 +587,7 @@ jobs:
         # See also https://issues.apache.org/jira/browse/SPARK-38279.
         python3.9 -m pip install 'sphinx<3.1.0' mkdocs pydata_sphinx_theme ipython nbsphinx numpydoc 'jinja2<3.0.0' 'markupsafe==2.0.1' 'pyzmq<24.0.0'
         python3.9 -m pip install ipython_genutils # See SPARK-38517
-        python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8' grpcio protobuf mypy-protobuf
+        python3.9 -m pip install sphinx_plotly_directive 'numpy>=1.20.0' pyarrow pandas 'plotly>=4.8' grpcio protobuf 'mypy-protobuf==3.3.0'
         python3.9 -m pip install 'docutils<0.18.0' # See SPARK-39421
         apt-get update -y
         apt-get install -y ruby ruby-dev
diff --git a/dev/requirements.txt b/dev/requirements.txt
index 651fc280627..fa4b6752f14 100644
--- a/dev/requirements.txt
+++ b/dev/requirements.txt
@@ -51,4 +51,4 @@ grpcio==1.48.1
 protobuf==4.21.6

 # Spark Connect python proto generation plugin (optional)
-mypy-protobuf
+mypy-protobuf==3.3.0
[spark] branch master updated (2698d6bf10b -> 89a31296819)
This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

from 2698d6bf10b [SPARK-40838][INFRA][TESTS] Upgrade infra base image to focal-20220922 and fix ps.mlflow doctest
 add 89a31296819 [SPARK-40778][CORE] Make HeartbeatReceiver as an IsolatedRpcEndpoint

No new revisions were added by this update.

Summary of changes:
 core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)
[spark] branch master updated: [SPARK-40838][INFRA][TESTS] Upgrade infra base image to focal-20220922 and fix ps.mlflow doctest
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git

The following commit(s) were added to refs/heads/master by this push:
     new 2698d6bf10b [SPARK-40838][INFRA][TESTS] Upgrade infra base image to focal-20220922 and fix ps.mlflow doctest
2698d6bf10b is described below

commit 2698d6bf10b92e71e8af88fedb4e7c9e0f304416
Author: Yikun Jiang
AuthorDate: Thu Oct 20 15:54:18 2022 +0800

    [SPARK-40838][INFRA][TESTS] Upgrade infra base image to focal-20220922 and fix ps.mlflow doctest

    ### What changes were proposed in this pull request?
    Upgrade the infra base image to focal-20220922 and fix the ps.mlflow doctest.

    ### Why are the changes needed?
    - Upgrade infra base image to `focal-20220922` (the currently latest Ubuntu 20.04 tag).
    - Infra image Python packages updated:
      - numpy 1.23.3 --> 1.23.4
      - mlflow 1.28.0 --> 1.29.0
      - matplotlib 3.5.3 --> 3.6.1
      - pip 22.2.2 --> 22.3
      - scipy 1.9.1 --> 1.9.3

      Full list: https://www.diffchecker.com/e6eZZaYn
    - Fix ps.mlflow doctest (due to the mlflow upgrade):
      ```
      **
      File "/__w/spark/spark/python/pyspark/pandas/mlflow.py", line 158, in pyspark.pandas.mlflow.load_model
      Failed example:
          with mlflow.start_run():
              lr = LinearRegression()
              lr.fit(train_x, train_y)
              mlflow.sklearn.log_model(lr, "model")
      Expected:
          LinearRegression(...)
      Got:
          LinearRegression()
      ```

    ### Does this PR introduce _any_ user-facing change?
    No, dev only

    ### How was this patch tested?
    All CI passed

    Closes #38304 from Yikun/SPARK-40838.

    Authored-by: Yikun Jiang
    Signed-off-by: Yikun Jiang
---
 dev/infra/Dockerfile            | 4 ++--
 python/pyspark/pandas/mlflow.py | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/dev/infra/Dockerfile b/dev/infra/Dockerfile
index ccf0c932b0e..2a70bd3f98f 100644
--- a/dev/infra/Dockerfile
+++ b/dev/infra/Dockerfile
@@ -17,9 +17,9 @@
 # Image for building and testing Spark branches. Based on Ubuntu 20.04.
 # See also in https://hub.docker.com/_/ubuntu
-FROM ubuntu:focal-20220801
+FROM ubuntu:focal-20220922

-ENV FULL_REFRESH_DATE 20220706
+ENV FULL_REFRESH_DATE 20221019

 ENV DEBIAN_FRONTEND noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN true
diff --git a/python/pyspark/pandas/mlflow.py b/python/pyspark/pandas/mlflow.py
index 094215743e2..469349b37ee 100644
--- a/python/pyspark/pandas/mlflow.py
+++ b/python/pyspark/pandas/mlflow.py
@@ -159,7 +159,7 @@ def load_model(
 ...     lr = LinearRegression()
 ...     lr.fit(train_x, train_y)
 ...     mlflow.sklearn.log_model(lr, "model")
-LinearRegression(...)
+LinearRegression...

 Now that our model is logged using MLflow, we load it back and apply it
 on a pandas-on-Spark dataframe:
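Why the relaxed doctest expectation works: these doctests run with doctest's ELLIPSIS option (which the original `LinearRegression(...)` expectation already relied on), and a trailing `...` then matches any remaining output, including the differing reprs across mlflow versions. A self-contained standard-library check:

```python
import doctest

checker = doctest.OutputChecker()
# The new expectation "LinearRegression..." matches either repr, and even
# tolerates extra trailing output lines.
assert checker.check_output("LinearRegression...\n", "LinearRegression()\n", doctest.ELLIPSIS)
assert checker.check_output("LinearRegression...\n", "LinearRegression(...)\n", doctest.ELLIPSIS)
assert checker.check_output("LinearRegression...\n", "LinearRegression()\nextra\n", doctest.ELLIPSIS)
```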
[spark-docker] branch master updated: [SPARK-40845] Add template support for SPARK_GPG_KEY and fix GPG verify
This is an automated email from the ASF dual-hosted git repository.

yikun pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark-docker.git

The following commit(s) were added to refs/heads/master by this push:
     new 896a36e [SPARK-40845] Add template support for SPARK_GPG_KEY and fix GPG verify
896a36e is described below

commit 896a36e36c094bf1480f4819005e2982ea8af417
Author: Yikun Jiang
AuthorDate: Thu Oct 20 15:38:03 2022 +0800

    [SPARK-40845] Add template support for SPARK_GPG_KEY and fix GPG verify

    ### What changes were proposed in this pull request?
    This patch:
    - Adds template support for `SPARK_GPG_KEY`.
    - Fixes a bug in GPG verification (changes `||` to `;`): with the old `||` chain, the final `gpg --batch --verify` only ran when both `--recv-key` attempts failed, so a successful key fetch silently skipped the signature check.
    - Uses keys.openpgp.org instead of keyserver.pgp.com, because the release keys are uploaded there as part of the [Spark release process](https://spark.apache.org/release-process.html).

    ### Why are the changes needed?
    Each version has a specific GPG key to verify against, so we need to set the GPG key separately.

    ### Does this PR introduce _any_ user-facing change?
    No

    ### How was this patch tested?
    CI passed.
    Ran `./add-dockerfiles.sh 3.3.0` and saw the GPG key set correctly.

    Closes #16 from Yikun/GPG.

    Authored-by: Yikun Jiang
    Signed-off-by: Yikun Jiang
---
 3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile | 6 +++---
 3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile   | 6 +++---
 3.3.0/scala2.12-java11-r-ubuntu/Dockerfile         | 6 +++---
 3.3.0/scala2.12-java11-ubuntu/Dockerfile           | 6 +++---
 Dockerfile.template                                | 6 +++---
 tools/template.py                                  | 6 ++
 6 files changed, 21 insertions(+), 15 deletions(-)

diff --git a/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile b/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
index be9cbb0..ac16bdd 100644
--- a/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
+++ b/3.3.0/scala2.12-java11-python3-r-ubuntu/Dockerfile
@@ -45,7 +45,7 @@ RUN set -ex && \
 # https://downloads.apache.org/spark/KEYS
 ENV SPARK_TGZ_URL=https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \
     SPARK_TGZ_ASC_URL=https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc \
-    GPG_KEY=E298A3A825C0D65DFD57CBB651716619E084DAB9
+    GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3

 RUN set -ex; \
     export SPARK_TMP="$(mktemp -d)"; \
@@ -53,8 +53,8 @@ RUN set -ex; \
     wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
     wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
     export GNUPGHOME="$(mktemp -d)"; \
-    gpg --keyserver hkps://keyserver.pgp.com --recv-key "$GPG_KEY" || \
-    gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY" || \
+    gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \
+    gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \
     gpg --batch --verify spark.tgz.asc spark.tgz; \
     gpgconf --kill all; \
     rm -rf "$GNUPGHOME" spark.tgz.asc; \
diff --git a/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile b/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile
index 096c7eb..c6e433d 100644
--- a/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile
+++ b/3.3.0/scala2.12-java11-python3-ubuntu/Dockerfile
@@ -44,7 +44,7 @@ RUN set -ex && \
 # https://downloads.apache.org/spark/KEYS
 ENV SPARK_TGZ_URL=https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \
     SPARK_TGZ_ASC_URL=https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc \
-    GPG_KEY=E298A3A825C0D65DFD57CBB651716619E084DAB9
+    GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3

 RUN set -ex; \
     export SPARK_TMP="$(mktemp -d)"; \
@@ -52,8 +52,8 @@ RUN set -ex; \
     wget -nv -O spark.tgz "$SPARK_TGZ_URL"; \
     wget -nv -O spark.tgz.asc "$SPARK_TGZ_ASC_URL"; \
export GNUPGHOME="$(mktemp -d)"; \ -gpg --keyserver hkps://keyserver.pgp.com --recv-key "$GPG_KEY" || \ -gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY" || \ +gpg --keyserver hkps://keys.openpgp.org --recv-key "$GPG_KEY" || \ +gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys "$GPG_KEY"; \ gpg --batch --verify spark.tgz.asc spark.tgz; \ gpgconf --kill all; \ rm -rf "$GNUPGHOME" spark.tgz.asc; \ diff --git a/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile b/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile index 2e085a2..975e444 100644 --- a/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile +++ b/3.3.0/scala2.12-java11-r-ubuntu/Dockerfile @@ -42,7 +42,7 @@ RUN set -ex && \ # https://downloads.apache.org/spark/KEYS ENV SPARK_TGZ_URL=https://dlcdn.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz \ SPARK_TGZ_ASC_URL=https://downloads.apache.org/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz.asc \ -GPG_KEY=E298A3A825C0D65DFD57CBB651716619E084DAB9 +GPG_KEY=80FB8EBE8EBA68504989703491B5DC815DBF10D3 RUN set -ex; \ export SPARK_TMP="$(mktemp
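A toy Python rendering of the shell short-circuit bug fixed above (the functions are hypothetical stand-ins for the `gpg` calls in the Dockerfiles):

```python
def fetch_key_openpgp():   # stands in for: gpg --keyserver hkps://keys.openpgp.org --recv-key
    return True            # key fetch succeeds

def fetch_key_ubuntu():    # stands in for: gpg --keyserver hkps://keyserver.ubuntu.com --recv-keys
    return True

def verify_signature():    # stands in for: gpg --batch --verify spark.tgz.asc spark.tgz
    print("verifying signature")
    return True

# Before (a || b || verify): verify runs only if BOTH key fetches fail.
fetch_key_openpgp() or fetch_key_ubuntu() or verify_signature()   # prints nothing!

# After ((a || b); verify): verification always runs.
fetch_key_openpgp() or fetch_key_ubuntu()
verify_signature()                                                # always prints
```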