[GitHub] [spark] cloud-fan commented on a change in pull request #33200: [SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands to use UnresolvedTable to resolve the identifier
cloud-fan commented on a change in pull request #33200:
URL: https://github.com/apache/spark/pull/33200#discussion_r672827401

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

```diff
@@ -3574,15 +3568,64 @@ class Analyzer(override val catalogManager: CatalogManager)
   /**
    * Rule to mostly resolve, normalize and rewrite column names based on case sensitivity
-   * for alter table commands.
+   * for alter table column commands.
    */
-  object ResolveAlterTableCommands extends Rule[LogicalPlan] {
+  object ResolveAlterTableColumnCommands extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
-      case a: AlterTableCommand if a.table.resolved && hasUnresolvedFieldName(a) =>
+      case a: AlterTableColumnCommand if a.table.resolved && hasUnresolvedFieldName(a) =>
         val table = a.table.asInstanceOf[ResolvedTable]
         a.transformExpressions {
-          case u: UnresolvedFieldName => resolveFieldNames(table, u.name, u)
+          case u: UnresolvedFieldName => resolveFieldNames(table, u.name, u.origin)
+        }
+
+      case a @ AlterTableAddColumns(r: ResolvedTable, cols) if hasUnresolvedColumns(cols) =>
```

Review comment: This looks fine. We can remove the `if hasUnresolvedColumns(cols)` guard, which is not very useful here.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #33430: [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ
SparkQA commented on pull request #33430:
URL: https://github.com/apache/spark/pull/33430#issuecomment-883090499

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45810/
[GitHub] [spark] SparkQA commented on pull request #33431: [SPARK-36221][SQL] Make sure CustomShuffleReaderExec has at least one partition
SparkQA commented on pull request #33431:
URL: https://github.com/apache/spark/pull/33431#issuecomment-883090213

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45809/
[GitHub] [spark] SparkQA removed a comment on pull request #33424: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
SparkQA removed a comment on pull request #33424:
URL: https://github.com/apache/spark/pull/33424#issuecomment-882968356

**[Test build #141288 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141288/testReport)** for PR 33424 at commit [`afa5539`](https://github.com/apache/spark/commit/afa55393f8d0c6884ed1d47ac3ced1112a87e7b6).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
AmplabJenkins removed a comment on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-883087045

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45807/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
AmplabJenkins removed a comment on pull request #33239:
URL: https://github.com/apache/spark/pull/33239#issuecomment-883087051

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45808/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
AmplabJenkins removed a comment on pull request #33428:
URL: https://github.com/apache/spark/pull/33428#issuecomment-883011701

Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33424: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
AmplabJenkins removed a comment on pull request #33424:
URL: https://github.com/apache/spark/pull/33424#issuecomment-883087048

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141288/
[GitHub] [spark] cloud-fan commented on a change in pull request #33200: [SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands to use UnresolvedTable to resolve the identifier
cloud-fan commented on a change in pull request #33200:
URL: https://github.com/apache/spark/pull/33200#discussion_r672821019

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

```diff
@@ -3574,15 +3568,64 @@ class Analyzer(override val catalogManager: CatalogManager)
   /**
    * Rule to mostly resolve, normalize and rewrite column names based on case sensitivity
-   * for alter table commands.
+   * for alter table column commands.
    */
-  object ResolveAlterTableCommands extends Rule[LogicalPlan] {
+  object ResolveAlterTableColumnCommands extends Rule[LogicalPlan] {
     def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
-      case a: AlterTableCommand if a.table.resolved && hasUnresolvedFieldName(a) =>
+      case a: AlterTableColumnCommand if a.table.resolved && hasUnresolvedFieldName(a) =>
         val table = a.table.asInstanceOf[ResolvedTable]
         a.transformExpressions {
-          case u: UnresolvedFieldName => resolveFieldNames(table, u.name, u)
+          case u: UnresolvedFieldName => resolveFieldNames(table, u.name, u.origin)
+        }
+
+      case a @ AlterTableAddColumns(r: ResolvedTable, cols) if hasUnresolvedColumns(cols) =>
```

Review comment: maybe we should do

```
case class QualifiedColType(
  path: Option[FieldName], // None for top-level columns
  colName: String,
  ...
)
```
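For reference, the shape suggested in the comment above can be written as the following compilable sketch. The `FieldName` definition and the `fullName` helper are illustrative assumptions, not the actual Spark definitions:

```scala
// Hypothetical stand-in for Spark's field-name reference.
case class FieldName(name: Seq[String])

// The suggested shape: an optional parent path plus the column's own name.
case class QualifiedColType(
    path: Option[FieldName], // None for top-level columns
    colName: String) {
  // Full multi-part name: the parent path (if any) followed by the column name.
  def fullName: Seq[String] = path.map(_.name).getOrElse(Nil) :+ colName
}

val topLevel = QualifiedColType(None, "id")
val nested = QualifiedColType(Some(FieldName(Seq("point"))), "x")
assert(topLevel.fullName == Seq("id"))
assert(nested.fullName == Seq("point", "x"))
```

Splitting the path from the column name makes it explicit which part must be resolved against the existing schema and which part is the new column.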
[GitHub] [spark] SparkQA commented on pull request #33410: [WIP][SPARK-36204][INFRA][BUILD] Deduplicate Scala 2.13 daily build
SparkQA commented on pull request #33410:
URL: https://github.com/apache/spark/pull/33410#issuecomment-883088778

**[Test build #141303 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141303/testReport)** for PR 33410 at commit [`48bfb39`](https://github.com/apache/spark/commit/48bfb39f0d3a8d15614998297ba56addf3b756b5).
[GitHub] [spark] SparkQA commented on pull request #33416: [SPARK-36207][PYTHON] Expose databaseExists in pyspark.sql.catalog
SparkQA commented on pull request #33416:
URL: https://github.com/apache/spark/pull/33416#issuecomment-883088693

**[Test build #141302 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141302/testReport)** for PR 33416 at commit [`ac88451`](https://github.com/apache/spark/commit/ac88451fa14154ad111c2fe2399c8576b133a03f).
[GitHub] [spark] SparkQA commented on pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
SparkQA commented on pull request #33429:
URL: https://github.com/apache/spark/pull/33429#issuecomment-883088620

**[Test build #141300 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141300/testReport)** for PR 33429 at commit [`2840b47`](https://github.com/apache/spark/commit/2840b475c6de6e3bd5bd3cfce7e981a289ab1e39).
[GitHub] [spark] SparkQA commented on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
SparkQA commented on pull request #33428:
URL: https://github.com/apache/spark/pull/33428#issuecomment-883088661

**[Test build #141301 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141301/testReport)** for PR 33428 at commit [`f4330a5`](https://github.com/apache/spark/commit/f4330a5abbd87c19191764b59bb5d55bf6472432).
[GitHub] [spark] SparkQA commented on pull request #33432: [SPARK-32709][SQL] Support writing Hive bucketed table (Parquet/ORC format with Hive hash)
SparkQA commented on pull request #33432:
URL: https://github.com/apache/spark/pull/33432#issuecomment-883088569

**[Test build #141299 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141299/testReport)** for PR 33432 at commit [`dfabd0f`](https://github.com/apache/spark/commit/dfabd0fce7e9079cd66e75be0eb02a1c814c8b0b).
[GitHub] [spark] AmplabJenkins commented on pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
AmplabJenkins commented on pull request #33239:
URL: https://github.com/apache/spark/pull/33239#issuecomment-883087051

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45808/
[GitHub] [spark] AmplabJenkins commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
AmplabJenkins commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-883087045

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45807/
[GitHub] [spark] AmplabJenkins commented on pull request #33424: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
AmplabJenkins commented on pull request #33424:
URL: https://github.com/apache/spark/pull/33424#issuecomment-883087048

Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141288/
[GitHub] [spark] SparkQA commented on pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
SparkQA commented on pull request #33429:
URL: https://github.com/apache/spark/pull/33429#issuecomment-883084988

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45811/
[GitHub] [spark] SparkQA commented on pull request #33424: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
SparkQA commented on pull request #33424:
URL: https://github.com/apache/spark/pull/33424#issuecomment-883082141

**[Test build #141288 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141288/testReport)** for PR 33424 at commit [`afa5539`](https://github.com/apache/spark/commit/afa55393f8d0c6884ed1d47ac3ced1112a87e7b6).

* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] otterc commented on a change in pull request #33425: [SPARK-32919][FOLLOW-UP] Filter out driver in the merger locations and fix the return type of RemoveShufflePushMergerLocations
otterc commented on a change in pull request #33425:
URL: https://github.com/apache/spark/pull/33425#discussion_r672821698

## File path: core/src/test/scala/org/apache/spark/storage/BlockManagerSuite.scala

```diff
@@ -2093,6 +2093,9 @@ class BlockManagerSuite extends SparkFunSuite with Matchers with BeforeAndAfterE
       Seq("hostC", "hostB", "hostD").sorted)
     assert(master.getShufflePushMergerLocations(4, Set.empty).map(_.host).sorted ===
       Seq("hostB", "hostA", "hostC", "hostD").sorted)
+    master.removeShufflePushMergerLocation("hostA")
+    assert(master.getShufflePushMergerLocations(4, Set.empty).map(_.host).sorted ===
+      Seq("hostB", "hostC", "hostD").sorted)
```

Review comment: Can we extend this UT to verify that the driver host is excluded? It will ensure that future changes do not break this behavior.
[GitHub] [spark] HyukjinKwon commented on pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
HyukjinKwon commented on pull request #33429:
URL: https://github.com/apache/spark/pull/33429#issuecomment-883075964

let me rebase. seems like it couldn't detect my GitHub actions job.
[GitHub] [spark] c21 commented on pull request #33432: [SPARK-32709][SQL] Support writing Hive bucketed table (Parquet/ORC format with Hive hash)
c21 commented on pull request #33432:
URL: https://github.com/apache/spark/pull/33432#issuecomment-883074400

cc @cloud-fan could you help take a look when you have time? Thanks.
[GitHub] [spark] c21 opened a new pull request #33432: [SPARK-32709][SQL] Support writing Hive bucketed table (Parquet/ORC format with Hive hash)
c21 opened a new pull request #33432:
URL: https://github.com/apache/spark/pull/33432

### What changes were proposed in this pull request?

This is a re-work of https://github.com/apache/spark/pull/30003. Here we add support for writing Hive bucketed tables with the Parquet/ORC file formats (data source v1 write path and Hive hash as the hash function). Support for Hive's other file formats will be added in follow-up PRs.

The changes are mostly on:

* `HiveMetastoreCatalog.scala`: When converting a Hive table relation to a data source relation, pass the bucket info (`BucketSpec`) and other Hive-related info as options into `HadoopFsRelation` and `LogicalRelation`, which can later be accessed by `FileFormatWriter` to customize the bucket id and file name.
* `FileFormatWriter.scala`: Use `HiveHash` for `bucketIdExpression` when writing to a Hive bucketed table. In addition, the Spark output file name should follow the Hive/Presto/Trino bucketed-file naming convention. Introduce another parameter, `bucketFileNamePrefix`, which brings a subsequent change in `FileFormatDataWriter`.
* `HadoopMapReduceCommitProtocol`: Implement the new file name APIs introduced in https://github.com/apache/spark/pull/33012, and change its subclass `PathOutputCommitProtocol` to make Hive bucketed table writing work with all commit protocols (including the S3A commit protocol).

### Why are the changes needed?

To make Spark write bucketed tables that are compatible with other SQL engines. Currently a Spark bucketed table cannot be leveraged by other SQL engines like Hive and Presto, because Spark uses a different hash function (Spark murmur3hash). With this PR, a Spark-written Hive bucketed table can be efficiently read by Presto and Hive to do bucket filter pruning, join, group-by, etc. This was and is blocking several companies (confirmed from Facebook, Lyft, etc.) from migrating bucketing workloads from Hive to Spark.

### Does this PR introduce _any_ user-facing change?

Yes. Any Hive bucketed table (with Parquet/ORC format) written by Spark is properly bucketed and can be efficiently processed by Hive and Presto/Trino.

### How was this patch tested?

* Added a unit test in `BucketedWriteWithHiveSupportSuite.scala` to verify bucket file names and that each row is written to the proper bucket.
* WIP test in production. Will update later.
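The interoperability problem described in the PR comes down to the bucket-id computation. The following is only an illustrative sketch, not Spark's or Hive's actual implementation: `hiveStyleHash` assumes Hive's Java-hashCode-style hashing of an int (the value itself), and `murmurStyleHash` is a generic Murmur3 stand-in (Spark's real bucketing hash differs in detail). The point is that two engines hashing the same value differently will place the same row in different buckets:

```scala
import scala.util.hashing.MurmurHash3

// Bucket id from a hash: non-negative remainder modulo the bucket count.
def bucketId(hash: Int, numBuckets: Int): Int =
  (hash & Integer.MAX_VALUE) % numBuckets

// Illustrative simplification: a Java Integer hashes to the value itself.
def hiveStyleHash(v: Int): Int = v

// Generic Murmur3 stand-in (not Spark's exact variant).
def murmurStyleHash(v: Int): Int =
  MurmurHash3.bytesHash(java.nio.ByteBuffer.allocate(4).putInt(v).array())

// The two schemes almost never agree on bucket assignment, so a table written
// with one hash function cannot be bucket-pruned correctly by an engine
// expecting the other.
val disagree = (0 until 100).exists(v =>
  bucketId(hiveStyleHash(v), 8) != bucketId(murmurStyleHash(v), 8))
assert(disagree)
```

This is why the PR switches the write path to `HiveHash` for Hive bucketed tables rather than re-using Spark's default bucketing hash.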
[GitHub] [spark] SparkQA commented on pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
SparkQA commented on pull request #33239:
URL: https://github.com/apache/spark/pull/33239#issuecomment-883072761

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45808/
[GitHub] [spark] SparkQA commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
SparkQA commented on pull request #33350:
URL: https://github.com/apache/spark/pull/33350#issuecomment-883071887

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/
[GitHub] [spark] dongjoon-hyun commented on pull request #33409: [SPARK-36201][SQL][FOLLOWUP] Schema check should check inner field too
dongjoon-hyun commented on pull request #33409:
URL: https://github.com/apache/spark/pull/33409#issuecomment-883069750

It seems that the master branch's Java 17 job is failing for the same reason.
[GitHub] [spark] dongjoon-hyun edited a comment on pull request #33409: [SPARK-36201][SQL][FOLLOWUP] Schema check should check inner field too
dongjoon-hyun edited a comment on pull request #33409:
URL: https://github.com/apache/spark/pull/33409#issuecomment-883069206

The error code is the following. It looks like OOM is happening again.
- https://github.com/AngersZh/spark/runs/3110053861?check_suite_focus=true

```
./build/mvn: line 178: 1699 Killed "${MVN_BIN}" "$@"
2021-07-20T04:37:47.6486105Z ##[error]Process completed with exit code 137.
```
[GitHub] [spark] dongjoon-hyun commented on pull request #33409: [SPARK-36201][SQL][FOLLOWUP] Schema check should check inner field too
dongjoon-hyun commented on pull request #33409:
URL: https://github.com/apache/spark/pull/33409#issuecomment-883069206

The error code is the following. It looks like OOM is happening again.

```
./build/mvn: line 178: 1699 Killed "${MVN_BIN}" "$@"
2021-07-20T04:37:47.6486105Z ##[error]Process completed with exit code 137.
```
[GitHub] [spark] dongjoon-hyun commented on pull request #33409: [SPARK-36201][SQL][FOLLOWUP] Schema check should check inner field too
dongjoon-hyun commented on pull request #33409:
URL: https://github.com/apache/spark/pull/33409#issuecomment-883067722

Wow, the GitHub Action failures look really weird.
[GitHub] [spark] jhu-chang commented on a change in pull request #33263: [SPARK-35027][CORE] Close the inputStream in FileAppender when writin…
jhu-chang commented on a change in pull request #33263:
URL: https://github.com/apache/spark/pull/33263#discussion_r672813847

## File path: core/src/main/scala/org/apache/spark/util/logging/FileAppender.scala

```diff
@@ -76,7 +80,13 @@ private[spark] class FileAppender(inputStream: InputStream, file: File, bufferSi
         }
       }
     } {
-      closeFile()
+      try {
+        if (closeStreams) {
+          inputStream.close()
+        }
```

Review comment: @Ngone51 @srowen It's for both normal exit and exception. Sorry, I don't quite understand the last comment: do you mean handling the error from `inputStream.close()`?
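The pattern under discussion can be sketched as follows. This is a minimal, hedged illustration (the `appendAll` helper and its signature are assumptions, not the actual `FileAppender` code): the cleanup runs on both normal completion and exception, and a failure inside `close()` is caught so it cannot mask the original error:

```scala
import java.io.{ByteArrayInputStream, InputStream}

// Copy all bytes from `in`, optionally closing it when done -- whether the
// read loop finishes normally or throws.
def appendAll(in: InputStream, closeStreams: Boolean)(write: Int => Unit): Unit = {
  try {
    var b = in.read()
    while (b != -1) {
      write(b)
      b = in.read()
    }
  } finally {
    if (closeStreams) {
      try in.close()
      catch {
        // In real code, log and continue; catching here ensures a close()
        // failure never masks an exception thrown by the read loop.
        case _: java.io.IOException => ()
      }
    }
  }
}

val in = new ByteArrayInputStream("abc".getBytes("UTF-8"))
val out = new StringBuilder
appendAll(in, closeStreams = true)(b => out.append(b.toChar))
assert(out.toString == "abc")
```

Putting the close in the `finally` (rather than after the loop) is what gives the "both normal exit and exception" behavior the comment describes.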
[GitHub] [spark] viirya commented on a change in pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
viirya commented on a change in pull request #33239: URL: https://github.com/apache/spark/pull/33239#discussion_r672812727 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala ## @@ -51,7 +51,7 @@ object CustomMetrics { currentMetricsValues: Seq[CustomTaskMetric], customMetrics: Map[String, SQLMetric]): Unit = { currentMetricsValues.foreach { metric => - customMetrics(metric.name()).set(metric.value()) + customMetrics.get(metric.name()).map(_.set(metric.value())) Review comment: Ok.
[GitHub] [spark] viirya commented on a change in pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
viirya commented on a change in pull request #33239: URL: https://github.com/apache/spark/pull/33239#discussion_r672812642 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources + +import java.util.Collections + +import org.scalatest.BeforeAndAfter + +import org.scalatest.time.SpanSugar._ + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.connector.catalog.{Identifier, InMemoryTableCatalog} +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.test.SharedSparkSession +import org.apache.spark.sql.types.StructType + +class FileFormatDataWriterMetricSuite Review comment: Yea, I think so.
[GitHub] [spark] HyukjinKwon commented on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
HyukjinKwon commented on pull request #33428: URL: https://github.com/apache/spark/pull/33428#issuecomment-883060656 Jenkins, ok to test
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
dongjoon-hyun commented on a change in pull request #33239: URL: https://github.com/apache/spark/pull/33239#discussion_r672811174 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.execution.datasources + +import java.util.Collections + +import org.scalatest.BeforeAndAfter + +import org.scalatest.time.SpanSugar._ + +import org.apache.spark.sql.QueryTest +import org.apache.spark.sql.connector.catalog.{Identifier, InMemoryTableCatalog} +import org.apache.spark.sql.functions.lit +import org.apache.spark.sql.test.SharedSparkSession +import org.apache.spark.sql.types.StructType + +class FileFormatDataWriterMetricSuite Review comment: For this one, I guess we need @gengliangwang 's review since he requested it?
[GitHub] [spark] AmplabJenkins commented on pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe
AmplabJenkins commented on pull request #33422: URL: https://github.com/apache/spark/pull/33422#issuecomment-883059530 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141285/
[GitHub] [spark] SparkQA removed a comment on pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe
SparkQA removed a comment on pull request #33422: URL: https://github.com/apache/spark/pull/33422#issuecomment-882962473 **[Test build #141285 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141285/testReport)** for PR 33422 at commit [`2a21bb3`](https://github.com/apache/spark/commit/2a21bb3017643410a81305d000af5e591b8ba3bb).
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
dongjoon-hyun commented on a change in pull request #33239: URL: https://github.com/apache/spark/pull/33239#discussion_r672810180 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala ## @@ -51,7 +51,7 @@ object CustomMetrics { currentMetricsValues: Seq[CustomTaskMetric], customMetrics: Map[String, SQLMetric]): Unit = { currentMetricsValues.foreach { metric => - customMetrics(metric.name()).set(metric.value()) + customMetrics.get(metric.name()).map(_.set(metric.value())) Review comment: Also, it would be great if you put the explanation at line 48.
[GitHub] [spark] SparkQA commented on pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe
SparkQA commented on pull request #33422: URL: https://github.com/apache/spark/pull/33422#issuecomment-883058475 **[Test build #141285 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141285/testReport)** for PR 33422 at commit [`2a21bb3`](https://github.com/apache/spark/commit/2a21bb3017643410a81305d000af5e591b8ba3bb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
dongjoon-hyun commented on a change in pull request #33239: URL: https://github.com/apache/spark/pull/33239#discussion_r672809311 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/metric/CustomMetrics.scala ## @@ -51,7 +51,7 @@ object CustomMetrics { currentMetricsValues: Seq[CustomTaskMetric], customMetrics: Map[String, SQLMetric]): Unit = { currentMetricsValues.foreach { metric => - customMetrics(metric.name()).set(metric.value()) + customMetrics.get(metric.name()).map(_.set(metric.value())) Review comment: This looks more robust. Could you add a test case for this no-op too?
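The robustness point in this thread is that indexing a Scala `Map` with `apply()` throws `NoSuchElementException` for an unknown key, while `get()` returns an `Option`, turning an unknown metric name into a harmless no-op. A small illustration (hedged sketch with made-up names, not the `CustomMetrics` API; note that `foreach` would be the idiomatic call here, since `map` on an `Option` produces a result the PR's code discards):

```scala
import scala.collection.mutable

// Toy stand-in for the metrics map discussed in the review.
val metrics = mutable.Map("bytesWritten" -> 0L)

def update(name: String, value: Long): Unit =
  // apply(): metrics(name) would throw for an unknown key.
  // get():   Option-based lookup makes the miss a silent no-op.
  metrics.get(name).foreach(old => metrics(name) = old + value)

update("bytesWritten", 10L)   // known metric: updated
update("unknownMetric", 99L)  // unknown metric: silently ignored
```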
[GitHub] [spark] mridulm commented on pull request #32401: [SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file
mridulm commented on pull request #32401: URL: https://github.com/apache/spark/pull/32401#issuecomment-883057074 Thanks for the clarifications! This sounds good.
[GitHub] [spark] mridulm commented on pull request #33078: [SPARK-35546][Shuffle] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better
mridulm commented on pull request #33078: URL: https://github.com/apache/spark/pull/33078#issuecomment-883056728 +CC @gengliangwang
[GitHub] [spark] mridulm commented on pull request #33078: [SPARK-35546][Shuffle] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better
mridulm commented on pull request #33078: URL: https://github.com/apache/spark/pull/33078#issuecomment-883056445 Merged to master and branch-3.2. Thanks for working on this, @zhouyejoe! Thanks for all the reviews, @Ngone51, @otterc, @venkata91 :-)
[GitHub] [spark] asfgit closed pull request #33078: [SPARK-35546][Shuffle] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better way
asfgit closed pull request #33078: URL: https://github.com/apache/spark/pull/33078
[GitHub] [spark] SparkQA commented on pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
SparkQA commented on pull request #33239: URL: https://github.com/apache/spark/pull/33239#issuecomment-883053190 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45808/
[GitHub] [spark] tobiasedwards commented on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
tobiasedwards commented on pull request #33428: URL: https://github.com/apache/spark/pull/33428#issuecomment-883053001 There we go, that should be better. When I botched the rebase, the bot added some incorrect labels, though; are you able to remove them, @HyukjinKwon? Thanks again for your help!
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
dongjoon-hyun commented on a change in pull request #33429: URL: https://github.com/apache/spark/pull/33429#discussion_r672805734 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/CoalesceShufflePartitions.scala ## @@ -88,23 +88,23 @@ case class CoalesceShufflePartitions(session: SparkSession) extends CustomShuffl val specsMap = shuffleStageInfos.zip(newPartitionSpecs).map { case (stageInfo, partSpecs) => (stageInfo.shuffleStage.id, partSpecs) }.toMap -updateShuffleReaders(plan, specsMap) +updateShuffleRead(plan, specsMap) } else { plan } } } - private def updateShuffleReaders( + private def updateShuffleRead( Review comment: Like the other places, `updateShuffleRead` -> `updateShuffleReads`?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33409: [SPARK-36201][SQL] Schema check should check inner field too
AmplabJenkins removed a comment on pull request #33409: URL: https://github.com/apache/spark/pull/33409#issuecomment-883051150 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45805/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
AmplabJenkins removed a comment on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883051152 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45804/
[GitHub] [spark] SparkQA commented on pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
SparkQA commented on pull request #33239: URL: https://github.com/apache/spark/pull/33239#issuecomment-883052445 **[Test build #141298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141298/testReport)** for PR 33239 at commit [`bccc98b`](https://github.com/apache/spark/commit/bccc98b7f3afad110ac450c183e341938dd20bc9).
[GitHub] [spark] SparkQA commented on pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
SparkQA commented on pull request #33429: URL: https://github.com/apache/spark/pull/33429#issuecomment-883052316 **[Test build #141297 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141297/testReport)** for PR 33429 at commit [`f2b0bab`](https://github.com/apache/spark/commit/f2b0babd5835d10ee894943b49def6a9cb01fcad).
[GitHub] [spark] SparkQA commented on pull request #33430: [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ
SparkQA commented on pull request #33430: URL: https://github.com/apache/spark/pull/33430#issuecomment-883052246 **[Test build #141296 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141296/testReport)** for PR 33430 at commit [`955ecf6`](https://github.com/apache/spark/commit/955ecf6421b25244ad647a61a53998948064b451).
[GitHub] [spark] SparkQA commented on pull request #33431: [SPARK-36221][SQL] Make sure CustomShuffleReaderExec has at least one partition
SparkQA commented on pull request #33431: URL: https://github.com/apache/spark/pull/33431#issuecomment-883052250 **[Test build #141295 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141295/testReport)** for PR 33431 at commit [`26bb39a`](https://github.com/apache/spark/commit/26bb39aea4d606fbe52d09ce51cc6b62fa775e6f).
[GitHub] [spark] AmplabJenkins commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
AmplabJenkins commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883051152 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45804/
[GitHub] [spark] AmplabJenkins commented on pull request #33409: [SPARK-36201][SQL] Schema check should check inner field too
AmplabJenkins commented on pull request #33409: URL: https://github.com/apache/spark/pull/33409#issuecomment-883051150 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45805/
[GitHub] [spark] SparkQA commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
SparkQA commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-883050392 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45807/
[GitHub] [spark] tobiasedwards commented on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
tobiasedwards commented on pull request #33428: URL: https://github.com/apache/spark/pull/33428#issuecomment-883049975 Whoops, I think I've messed up my rebase; give me a minute.
[GitHub] [spark] HyukjinKwon commented on pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
HyukjinKwon commented on pull request #33429: URL: https://github.com/apache/spark/pull/33429#issuecomment-883048266 cc @ulysses-you too FYI
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe
HyukjinKwon commented on a change in pull request #33422: URL: https://github.com/apache/spark/pull/33422#discussion_r672800804 ## File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala ## @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.util.UUID + +import org.apache.spark.sql.execution.QueryExecution +import org.apache.spark.sql.util.QueryExecutionListener + + +/** + * Helper class to simplify usage of `Dataset.observe(String, Column, Column*)`: + * + * {{{ + * // Observe row count (rows) and highest id (maxid) in the Dataset while writing it + * val observation = Observation("my metrics") + * val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid")) + * observed_ds.write.parquet("ds.parquet") + * val metrics = observation.get + * }}} + * + * This collects the metrics while the first action is executed on the observed dataset. Subsequent + * actions do not modify the metrics returned by [[get]]. Retrieval of the metric via [[get]] + * blocks until the first action has finished and metrics become available. + * + * This class does not support streaming datasets. 
+ * + * @param name name of the metric + * @since 3.3.0 + */ +class Observation(name: String) { + + private val listener: ObservationListener = ObservationListener(this) + + @volatile private var sparkSession: Option[SparkSession] = None + + @volatile private var row: Option[Row] = None + + /** + * Attach this observation to the given [[Dataset]] to observe aggregation expressions. + * + * @param ds dataset + * @param expr first aggregation expression + * @param exprs more aggregation expressions + * @tparam T dataset type + * @return observed dataset + * @throws IllegalArgumentException If this is a streaming Dataset (ds.isStreaming == true) + */ + private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): Dataset[T] = { +if (ds.isStreaming) { + throw new IllegalArgumentException("Observation does not support streaming Datasets") +} +register(ds.sparkSession) +ds.observe(name, expr, exprs: _*) + } + + /** + * Get the observed metrics. This waits for the observed dataset to finish its first action. + * Only the result of the first action is available. Subsequent actions do not modify the result. 
+ * + * @return the observed metrics as a [[Row]] + * @throws InterruptedException interrupted while waiting + */ + @throws[InterruptedException] + def get: Row = { +synchronized { + // we need to loop as wait might return without us calling notify + // https://en.wikipedia.org/w/index.php?title=Spurious_wakeup=992601610 + while (this.row.isEmpty) { +wait() + } +} + +this.row.get + } + + private def register(sparkSession: SparkSession): Unit = { +// makes this class thread-safe: +// only the first thread entering this block can set sparkSession +// all other threads will see the exception, as it is only allowed to do this once +synchronized { + if (this.sparkSession.isDefined) { +throw new IllegalArgumentException("An Observation can be used with a Dataset only once") + } + this.sparkSession = Some(sparkSession) +} + +sparkSession.listenerManager.register(this.listener) + } + + private def unregister(): Unit = { +this.sparkSession.foreach(_.listenerManager.unregister(this.listener)) + } + + private[spark] def onFinish(qe: QueryExecution): Unit = { +synchronized { + if (this.row.isEmpty) { +this.row = qe.observedMetrics.get(name) +if (this.row.isDefined) { + notifyAll() + unregister() +} + } +} + } + +} + +private[sql] case class ObservationListener(observation: Observation) + extends QueryExecutionListener { + + override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = +observation.onFinish(qe) + + override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = +observation.onFinish(qe) + +} + +/** + * (Scala-specific) Create a named or anonymous
[GitHub] [spark] HyukjinKwon commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe
HyukjinKwon commented on a change in pull request #33422: URL: https://github.com/apache/spark/pull/33422#discussion_r672800470 ## File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala ## @@ -0,0 +1,150 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql + +import java.util.UUID + +import org.apache.spark.sql.execution.QueryExecution +import org.apache.spark.sql.util.QueryExecutionListener + + +/** + * Helper class to simplify usage of `Dataset.observe(String, Column, Column*)`: + * + * {{{ + * // Observe row count (rows) and highest id (maxid) in the Dataset while writing it + * val observation = Observation("my metrics") + * val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid")) + * observed_ds.write.parquet("ds.parquet") + * val metrics = observation.get + * }}} + * + * This collects the metrics while the first action is executed on the observed dataset. Subsequent + * actions do not modify the metrics returned by [[get]]. Retrieval of the metric via [[get]] + * blocks until the first action has finished and metrics become available. + * + * This class does not support streaming datasets. 
+ * + * @param name name of the metric + * @since 3.3.0 + */ +class Observation(name: String) { + + private val listener: ObservationListener = ObservationListener(this) + + @volatile private var sparkSession: Option[SparkSession] = None + + @volatile private var row: Option[Row] = None + + /** + * Attach this observation to the given [[Dataset]] to observe aggregation expressions. + * + * @param ds dataset + * @param expr first aggregation expression + * @param exprs more aggregation expressions + * @tparam T dataset type + * @return observed dataset + * @throws IllegalArgumentException If this is a streaming Dataset (ds.isStreaming == true) + */ + private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): Dataset[T] = { +if (ds.isStreaming) { + throw new IllegalArgumentException("Observation does not support streaming Datasets") +} +register(ds.sparkSession) +ds.observe(name, expr, exprs: _*) + } + + /** + * Get the observed metrics. This waits for the observed dataset to finish its first action. + * Only the result of the first action is available. Subsequent actions do not modify the result. 
+ * + * @return the observed metrics as a [[Row]] + * @throws InterruptedException interrupted while waiting + */ + @throws[InterruptedException] + def get: Row = { +synchronized { + // we need to loop as wait might return without us calling notify + // https://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=992601610 + while (this.row.isEmpty) { +wait() + } +} + +this.row.get + } + + private def register(sparkSession: SparkSession): Unit = { +// makes this class thread-safe: +// only the first thread entering this block can set sparkSession +// all other threads will see the exception, as it is only allowed to do this once +synchronized { + if (this.sparkSession.isDefined) { +throw new IllegalArgumentException("An Observation can be used with a Dataset only once") + } + this.sparkSession = Some(sparkSession) +} + +sparkSession.listenerManager.register(this.listener) + } + + private def unregister(): Unit = { +this.sparkSession.foreach(_.listenerManager.unregister(this.listener)) + } + + private[spark] def onFinish(qe: QueryExecution): Unit = { +synchronized { + if (this.row.isEmpty) { +this.row = qe.observedMetrics.get(name) +if (this.row.isDefined) { + notifyAll() + unregister() +} + } +} + } + +} + +private[sql] case class ObservationListener(observation: Observation) + extends QueryExecutionListener { + + override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit = +observation.onFinish(qe) + + override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit = +observation.onFinish(qe) + +} + +/** + * (Scala-specific) Create a named or anonymous
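The `get` method quoted above relies on the standard guarded-wait idiom: `wait()` can return spuriously, so the condition must be re-checked in a loop before the result is read. A minimal, self-contained sketch of that idiom follows; the `OneShotResult` class and its names are illustrative, not part of the Spark patch.

```scala
// Illustrative sketch of the guarded-wait pattern used by Observation.get.
// wait() may return without a matching notifyAll() (a "spurious wakeup"),
// so the waiting thread loops until the condition actually holds.
class OneShotResult[T] {
  private var result: Option[T] = None

  def get: T = synchronized {
    while (result.isEmpty) { // re-check the condition after every wakeup
      wait()
    }
    result.get
  }

  def put(value: T): Unit = synchronized {
    if (result.isEmpty) { // only the first result is kept
      result = Some(value)
      notifyAll() // wake all threads blocked in get
    }
  }
}
```

As in `Observation`, `put` ignores every result after the first, mirroring how only the first action's metrics are retained.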
[GitHub] [spark] HyukjinKwon commented on pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
HyukjinKwon commented on pull request #33429: URL: https://github.com/apache/spark/pull/33429#issuecomment-883046619 cc @cloud-fan and @maryannxue can you take a look please? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
viirya commented on pull request #33239: URL: https://github.com/apache/spark/pull/33239#issuecomment-883045610 @dongjoon-hyun and @gengliangwang Thanks for reviewing. Please take another look at the suggested change/new tests. Thanks!
[GitHub] [spark] viirya commented on pull request #30565: [WIP][SPARK-33625][SQL] Subexpression elimination for whole-stage codegen in Filter
viirya commented on pull request #30565: URL: https://github.com/apache/spark/pull/30565#issuecomment-883045221 I think it is much easier to solve this at query optimization time (i.e. in the optimizer) instead of at codegen. It also looks like a query optimization problem rather than a codegen one.
[GitHub] [spark] beliefer commented on pull request #33430: [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ
beliefer commented on pull request #33430: URL: https://github.com/apache/spark/pull/33430#issuecomment-883043550 > > This PR fix the incorrect alias usecase. > > @beliefer I wouldn't say that is incorrect..implementing `prettyName` is more reliable. OK. I updated the description.
[GitHub] [spark] gengliangwang commented on pull request #33430: [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ
gengliangwang commented on pull request #33430: URL: https://github.com/apache/spark/pull/33430#issuecomment-883042925 > This PR fix the incorrect alias usecase. @beliefer I wouldn't say that is incorrect..implementing `prettyName` is more reliable.
[GitHub] [spark] ulysses-you opened a new pull request #33431: [SPARK-36221][SQL] Make sure CustomShuffleReaderExec has at least one partition
ulysses-you opened a new pull request #33431: URL: https://github.com/apache/spark/pull/33431 ### What changes were proposed in this pull request? * Add a non-empty partition check in `CustomShuffleReaderExec` * Make sure `OptimizeLocalShuffleReader` doesn't return empty partitions ### Why are the changes needed? Since SPARK-32083, AQE coalesce always returns at least one partition, so it is more robust to add a non-empty check in `CustomShuffleReaderExec`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not needed.
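A minimal sketch of the kind of fail-fast invariant the PR proposes; `ShuffleReadNode` and `partitionSpecs` are made-up names for illustration, not Spark's actual classes.

```scala
// Illustrative non-empty partition check: fail at construction time rather
// than letting a zero-partition plan node propagate through execution.
case class ShuffleReadNode(partitionSpecs: Seq[Int]) {
  require(partitionSpecs.nonEmpty,
    "ShuffleReadNode requires at least one partition")
}
```

`require` throws `IllegalArgumentException` when the precondition fails, so an empty node can never be constructed silently.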
[GitHub] [spark] HyukjinKwon edited a comment on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
HyukjinKwon edited a comment on pull request #33428: URL: https://github.com/apache/spark/pull/33428#issuecomment-883039820 ah actually this is the limitation .. I can't retrigger the test because it belongs to your repo :-) .. can you rebase and push it again? e.g.) `git checkout python-sql-row-type-annotation && git fetch upstream && git rebase upstream master && git push origin python-sql-row-type-annotation`
[GitHub] [spark] HyukjinKwon commented on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
HyukjinKwon commented on pull request #33428: URL: https://github.com/apache/spark/pull/33428#issuecomment-883039820 ah actually this is the limitation .. I can't retrigger the test because it belongs to your repo :-) .. can you rebase and push it again? e.g.) `git checkout python-sql-row-type-annotation git fetch upstream && git rebase upstream master && git push origin python-sql-row-type-annotation`
[GitHub] [spark] HyukjinKwon closed pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
HyukjinKwon closed pull request #33427: URL: https://github.com/apache/spark/pull/33427
[GitHub] [spark] HyukjinKwon commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
HyukjinKwon commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883038497 Merged to master, branch-3.2, branch-3.1 and branch-3.0.
[GitHub] [spark] viirya commented on a change in pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
viirya commented on a change in pull request #33239: URL: https://github.com/apache/spark/pull/33239#discussion_r672793126 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriter.scala ## @@ -41,7 +42,8 @@ import org.apache.spark.util.SerializableConfiguration abstract class FileFormatDataWriter( description: WriteJobDescription, taskAttemptContext: TaskAttemptContext, -committer: FileCommitProtocol) extends DataWriter[InternalRow] { +committer: FileCommitProtocol, +customMetrics: Map[String, SQLMetric]) extends DataWriter[InternalRow] { Review comment: I added a custom metric for writing to the InMemory table for test purposes. The tests are in `FileFormatDataWriterMetricSuite`.
[GitHub] [spark] beliefer opened a new pull request #33430: [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ
beliefer opened a new pull request #33430: URL: https://github.com/apache/spark/pull/33430 ### What changes were proposed in this pull request? This PR fixes the incorrect alias usage for `MakeTimestampNTZ` and `MakeTimestampLTZ` based on the discussion shown below: https://github.com/apache/spark/pull/33299/files#r668423810 ### Why are the changes needed? This PR fixes the incorrect alias usage. ### Does this PR introduce _any_ user-facing change? 'No'. Modifications are transparent to users. ### How was this patch tested? Jenkins test.
[GitHub] [spark] HyukjinKwon commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
HyukjinKwon commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883038320 Thanks!
[GitHub] [spark] SparkQA commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
SparkQA commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883038157 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45804/
[GitHub] [spark] SparkQA commented on pull request #33409: [SPARK-36201][SQL] Schema check should check inner field too
SparkQA commented on pull request #33409: URL: https://github.com/apache/spark/pull/33409#issuecomment-883037465 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45805/
[GitHub] [spark] SparkQA commented on pull request #33239: [SPARK-36030][SQL] Support DS v2 metrics at writing path
SparkQA commented on pull request #33239: URL: https://github.com/apache/spark/pull/33239#issuecomment-883036620 **[Test build #141294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141294/testReport)** for PR 33239 at commit [`fe9cc4e`](https://github.com/apache/spark/commit/fe9cc4e79323a8a089470fcb8b28b346fb96ecdd).
[GitHub] [spark] HyukjinKwon opened a new pull request #33429: [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader
HyukjinKwon opened a new pull request #33429: URL: https://github.com/apache/spark/pull/33429 ### What changes were proposed in this pull request? This PR proposes to rename: - Rename `*Reader`/`*reader` to `*Read`/`*read` for rules and execution plans (user-facing doc/config names remain untouched) - `*ShuffleReaderExec` -> `*ShuffleReadExec` - `isLocalReader` -> `isLocalRead` - ... - Rename the `CustomShuffle*` prefix to `AQEShuffle*` - Rename the `OptimizeLocalShuffleReader` rule to `OptimizeShuffleWithLocalRead` ### Why are the changes needed? There are multiple problems in the current naming: - `CustomShuffle*` -> `AQEShuffle*`: it sounds like a pluggable API, but it is actually only used by AQE. - `OptimizeLocalShuffleReader` -> `OptimizeShuffleWithLocalRead`: it is the name of a rule but it can be misread as a reader, which is counterintuitive. - `*ReaderExec` -> `*ReadExec`: "reader execution" reads a bit oddly; it should rather be a "read" execution (like `ScanExec`, `ProjectExec` and `FilterExec`). There is no reason to name it after something that performs an action. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? Existing unit tests should cover the changes.
[GitHub] [spark] SparkQA commented on pull request #33350: [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
SparkQA commented on pull request #33350: URL: https://github.com/apache/spark/pull/33350#issuecomment-883031312 **[Test build #141293 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141293/testReport)** for PR 33350 at commit [`7473aea`](https://github.com/apache/spark/commit/7473aea9aa91586366206b7a01ed3b6e11f7236a).
[GitHub] [spark] SparkQA removed a comment on pull request #33078: [SPARK-35546][Shuffle] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in
SparkQA removed a comment on pull request #33078: URL: https://github.com/apache/spark/pull/33078#issuecomment-882962666 **[Test build #141286 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141286/testReport)** for PR 33078 at commit [`5310991`](https://github.com/apache/spark/commit/53109918cbdbdba2fe79f38a991c171efec7e85f).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33078: [SPARK-35546][Shuffle] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the sta
AmplabJenkins removed a comment on pull request #33078: URL: https://github.com/apache/spark/pull/33078#issuecomment-883030102 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141286/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33352: [SPARK-34952][SQL] DSv2 Aggregate push down APIs
AmplabJenkins removed a comment on pull request #33352: URL: https://github.com/apache/spark/pull/33352#issuecomment-883030099 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45806/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
AmplabJenkins removed a comment on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883030098 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141290/
[GitHub] [spark] cfmcgrady commented on a change in pull request #33212: [SPARK-35912][SQL] Fix nullability of `spark.read.json`
cfmcgrady commented on a change in pull request #33212: URL: https://github.com/apache/spark/pull/33212#discussion_r672787073 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala ## @@ -405,10 +405,18 @@ class JacksonParser( schema.getFieldIndex(parser.getCurrentName) match { case Some(index) => try { -row.update(index, fieldConverters(index).apply(parser)) +val fieldValue = fieldConverters(index).apply(parser) Review comment: Thank you for your suggestions, I'll raise a new PR.
[GitHub] [spark] AmplabJenkins commented on pull request #33352: [SPARK-34952][SQL] DSv2 Aggregate push down APIs
AmplabJenkins commented on pull request #33352: URL: https://github.com/apache/spark/pull/33352#issuecomment-883030099 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/45806/
[GitHub] [spark] AmplabJenkins commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
AmplabJenkins commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883030098 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141290/
[GitHub] [spark] AmplabJenkins commented on pull request #33078: [SPARK-35546][Shuffle] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a
AmplabJenkins commented on pull request #33078: URL: https://github.com/apache/spark/pull/33078#issuecomment-883030102 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/141286/
[GitHub] [spark] gengliangwang commented on pull request #32401: [SPARK-35276][CORE] Calculate checksum for shuffle data and write as checksum file
gengliangwang commented on pull request #32401: URL: https://github.com/apache/spark/pull/32401#issuecomment-883029244 @Ngone51 Yes, let's see if we can make it before 3.2. Thanks for the work!
[GitHub] [spark] HyukjinKwon commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence
HyukjinKwon commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883028518 The test failures in GA should be unrelated. @dongjoon-hyun, mind taking a quick look, please?
[GitHub] [spark] SparkQA commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence
SparkQA commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883027144 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45804/
[GitHub] [spark] yaooqinn commented on a change in pull request #33424: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
yaooqinn commented on a change in pull request #33424: URL: https://github.com/apache/spark/pull/33424#discussion_r672783259

## File path: sql/core/src/test/resources/sql-tests/results/describe.sql.out
## @@ -324,6 +324,37 @@
 Location [not included in comparison]/{warehouse_dir}/t
 Storage Properties [a=1, b=2]
+-- !query
+DESC EXTENDED t PARTITION (C='Us', D=1)

Review comment:
```
+-- !query
+DESC EXTENDED t PARTITION (C='Us', D=1)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+Partition spec is invalid. The spec (C, D) must match the partition spec (c, d) defined in table '`default`.`t`'
```
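The behavior under discussion, matching a user-supplied partition spec against the table's defined partition columns regardless of case, can be sketched as a small standalone helper. This is an illustrative Python sketch only (the function name and `ValueError`, standing in for `AnalysisException`, are hypothetical, not Spark's actual implementation):

```python
def normalize_partition_spec(user_spec, table_partition_cols, case_sensitive=False):
    """Map user-supplied partition keys onto the table's defined partition
    columns, matching names case-insensitively unless case_sensitive is set.
    Raises ValueError (standing in for AnalysisException) on a mismatch."""
    normalized = {}
    for key, value in user_spec.items():
        if case_sensitive:
            matches = [c for c in table_partition_cols if c == key]
        else:
            matches = [c for c in table_partition_cols if c.lower() == key.lower()]
        if not matches:
            raise ValueError(
                f"Partition spec is invalid. The spec ({', '.join(user_spec)}) must match "
                f"the partition spec ({', '.join(table_partition_cols)}) defined in the table")
        normalized[matches[0]] = value
    return normalized

# With normalization, PARTITION (C='Us', D=1) resolves against columns (c, d):
print(normalize_partition_spec({"C": "Us", "D": 1}, ["c", "d"]))  # → {'c': 'Us', 'd': 1}
```

With case sensitivity enabled, the same spec would fail to resolve, mirroring the `AnalysisException` shown in the expected test output above.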
[GitHub] [spark] SparkQA commented on pull request #33409: [SPARK-36201][SQL] Schema check should check inner field too
SparkQA commented on pull request #33409: URL: https://github.com/apache/spark/pull/33409#issuecomment-883027098 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45805/
[GitHub] [spark] SparkQA commented on pull request #33078: [SPARK-35546][Shuffle] Enable push-based shuffle when multiple app attempts are enabled and manage concurrent access to the state in a better
SparkQA commented on pull request #33078: URL: https://github.com/apache/spark/pull/33078#issuecomment-883026671 **[Test build #141286 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141286/testReport)** for PR 33078 at commit [`5310991`](https://github.com/apache/spark/commit/53109918cbdbdba2fe79f38a991c171efec7e85f).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `public class ShuffleChecksumHelper`
  * `class MutableCheckedOutputStream(out: OutputStream) extends OutputStream`
  * `case class ShuffleChecksumBlockId(shuffleId: Int, mapId: Long, reduceId: Int) extends BlockId`
  * `case class SessionWindow(timeColumn: Expression, gapDuration: Long) extends UnaryExpression`
  * `protected abstract class ConnectionProviderBase extends Logging`
  * `case class SessionWindowStateStoreRestoreExec(`
  * `case class SessionWindowStateStoreSaveExec(`
[GitHub] [spark] SparkQA commented on pull request #33352: [SPARK-34952][SQL] DSv2 Aggregate push down APIs
SparkQA commented on pull request #33352: URL: https://github.com/apache/spark/pull/33352#issuecomment-883026630 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45806/
[GitHub] [spark] tobiasedwards commented on pull request #33428: [SPARK-36220][PYTHON] Fix pyspark.sql.types.Row type annotation
tobiasedwards commented on pull request #33428: URL: https://github.com/apache/spark/pull/33428#issuecomment-883023465 Hey @HyukjinKwon, I've added a Jira ticket here: [SPARK-36220](https://issues.apache.org/jira/browse/SPARK-36220) and enabled GitHub Actions on my forked repo. Is there anything I need to do to kick off the "Build and test" action again?
[GitHub] [spark] ulysses-you commented on a change in pull request #33188: [SPARK-35989][SQL] Only remove redundant shuffle if shuffle origin is REPARTITION_BY_COL in AQE
ulysses-you commented on a change in pull request #33188: URL: https://github.com/apache/spark/pull/33188#discussion_r672779057

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala
## @@ -250,7 +250,12 @@ object EnsureRequirements extends Rule[SparkPlan] {
```
   def apply(plan: SparkPlan): SparkPlan = plan.transformUp {
     // TODO: remove this after we create a physical operator for `RepartitionByExpression`.
-    case operator @ ShuffleExchangeExec(upper: HashPartitioning, child, _) =>
+    // SPARK-35989: AQE will change the partition number so we should retain the REPARTITION_BY_NUM
+    // shuffle which is specified by user. And also we can not remove REBALANCE_PARTITIONS_BY_COL,
+    // it is a special shuffle used to rebalance partitions.
+    // So, here we only remove REPARTITION_BY_COL in AQE.
+    case operator @ ShuffleExchangeExec(upper: HashPartitioning, child, shuffleOrigin)
+        if shuffleOrigin == REPARTITION_BY_COL || !conf.adaptiveExecutionEnabled =>
```

Review comment:
Yeah, we have only skipped applying `CoalesceShufflePartitions` and other custom shuffle readers at the final stage; for the stages that are still in progress we do nothing. That's why I think it's a little bit hacky. Another hacky idea is to re-mark the shuffle that precedes the removed shuffle, changing its `ENSURE_REQUIREMENTS` origin to `REPARTITION_BY_COL`; then AQE could do the optimization safely. IMO, I prefer the idea of skipping shuffle removal for all shuffle origins in AQE: it's simple, and it can be seen as a behavior change since AQE is enabled by default. If users really hit this issue, they can just disable AQE.
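The guard being debated above reduces to a small predicate. The following Python sketch is illustrative only, with the shuffle-origin tags modeled as plain strings; the real rule lives in Spark's `EnsureRequirements`:

```python
# Shuffle-origin tags, mirroring (a subset of) Spark's ShuffleOrigin values.
ENSURE_REQUIREMENTS = "ENSURE_REQUIREMENTS"
REPARTITION_BY_COL = "REPARTITION_BY_COL"
REPARTITION_BY_NUM = "REPARTITION_BY_NUM"
REBALANCE_PARTITIONS_BY_COL = "REBALANCE_PARTITIONS_BY_COL"

def can_remove_redundant_shuffle(shuffle_origin, aqe_enabled):
    """A shuffle whose partitioning is already satisfied by its child may be
    elided -- but under AQE only REPARTITION_BY_COL is safe to drop:
    REPARTITION_BY_NUM pins a user-specified partition count that AQE would
    otherwise change, and REBALANCE_PARTITIONS_BY_COL exists specifically to
    rebalance partitions."""
    return shuffle_origin == REPARTITION_BY_COL or not aqe_enabled

# With AQE on, only the column-based repartition may be removed:
assert can_remove_redundant_shuffle(REPARTITION_BY_COL, aqe_enabled=True)
assert not can_remove_redundant_shuffle(REPARTITION_BY_NUM, aqe_enabled=True)
# With AQE off, partition counts are stable, so removal is always safe:
assert can_remove_redundant_shuffle(REPARTITION_BY_NUM, aqe_enabled=False)
```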
[GitHub] [spark] cloud-fan commented on a change in pull request #33424: [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
cloud-fan commented on a change in pull request #33424: URL: https://github.com/apache/spark/pull/33424#discussion_r672778569

## File path: sql/core/src/test/resources/sql-tests/results/describe.sql.out
## @@ -324,6 +324,37 @@
 Location [not included in comparison]/{warehouse_dir}/t
 Storage Properties [a=1, b=2]
+-- !query
+DESC EXTENDED t PARTITION (C='Us', D=1)

Review comment:
What was the result before this PR?
[GitHub] [spark] cloud-fan commented on a change in pull request #33422: [SPARK-34806][SQL] Add Observation helper for Dataset.observe
cloud-fan commented on a change in pull request #33422: URL: https://github.com/apache/spark/pull/33422#discussion_r672777257

## File path: sql/core/src/main/scala/org/apache/spark/sql/Observation.scala
## @@ -0,0 +1,150 @@
```
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql
+
+import java.util.UUID
+
+import org.apache.spark.sql.execution.QueryExecution
+import org.apache.spark.sql.util.QueryExecutionListener
+
+
+/**
+ * Helper class to simplify usage of `Dataset.observe(String, Column, Column*)`:
+ *
+ * {{{
+ *   // Observe row count (rows) and highest id (maxid) in the Dataset while writing it
+ *   val observation = Observation("my metrics")
+ *   val observed_ds = ds.observe(observation, count(lit(1)).as("rows"), max($"id").as("maxid"))
+ *   observed_ds.write.parquet("ds.parquet")
+ *   val metrics = observation.get
+ * }}}
+ *
+ * This collects the metrics while the first action is executed on the observed dataset. Subsequent
+ * actions do not modify the metrics returned by [[get]]. Retrieval of the metric via [[get]]
+ * blocks until the first action has finished and metrics become available.
+ *
+ * This class does not support streaming datasets.
+ *
+ * @param name name of the metric
+ * @since 3.3.0
+ */
+class Observation(name: String) {
+
+  private val listener: ObservationListener = ObservationListener(this)
+
+  @volatile private var sparkSession: Option[SparkSession] = None
+
+  @volatile private var row: Option[Row] = None
+
+  /**
+   * Attach this observation to the given [[Dataset]] to observe aggregation expressions.
+   *
+   * @param ds dataset
+   * @param expr first aggregation expression
+   * @param exprs more aggregation expressions
+   * @tparam T dataset type
+   * @return observed dataset
+   * @throws IllegalArgumentException If this is a streaming Dataset (ds.isStreaming == true)
+   */
+  private[spark] def on[T](ds: Dataset[T], expr: Column, exprs: Column*): Dataset[T] = {
+    if (ds.isStreaming) {
+      throw new IllegalArgumentException("Observation does not support streaming Datasets")
+    }
+    register(ds.sparkSession)
+    ds.observe(name, expr, exprs: _*)
+  }
+
+  /**
+   * Get the observed metrics. This waits for the observed dataset to finish its first action.
+   * Only the result of the first action is available. Subsequent actions do not modify the result.
+   *
+   * @return the observed metrics as a [[Row]]
+   * @throws InterruptedException interrupted while waiting
+   */
+  @throws[InterruptedException]
+  def get: Row = {
+    synchronized {
+      // we need to loop as wait might return without us calling notify
+      // https://en.wikipedia.org/w/index.php?title=Spurious_wakeup&oldid=992601610
+      while (this.row.isEmpty) {
+        wait()
+      }
+    }
+
+    this.row.get
+  }
+
+  private def register(sparkSession: SparkSession): Unit = {
+    // makes this class thread-safe:
+    // only the first thread entering this block can set sparkSession
+    // all other threads will see the exception, as it is only allowed to do this once
+    synchronized {
+      if (this.sparkSession.isDefined) {
+        throw new IllegalArgumentException("An Observation can be used with a Dataset only once")
+      }
+      this.sparkSession = Some(sparkSession)
+    }
+
+    sparkSession.listenerManager.register(this.listener)
+  }
+
+  private def unregister(): Unit = {
+    this.sparkSession.foreach(_.listenerManager.unregister(this.listener))
+  }
+
+  private[spark] def onFinish(qe: QueryExecution): Unit = {
+    synchronized {
+      if (this.row.isEmpty) {
+        this.row = qe.observedMetrics.get(name)
+        if (this.row.isDefined) {
+          notifyAll()
+          unregister()
+        }
+      }
+    }
+  }
+
+}
+
+private[sql] case class ObservationListener(observation: Observation)
+  extends QueryExecutionListener {
+
+  override def onSuccess(funcName: String, qe: QueryExecution, durationNs: Long): Unit =
+    observation.onFinish(qe)
+
+  override def onFailure(funcName: String, qe: QueryExecution, exception: Exception): Unit =
+    observation.onFinish(qe)
+
+}
+
+/**
+ * (Scala-specific) Create a named or anonymous
```
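The core concurrency pattern in `Observation.get` above, block in a loop until the first completed action publishes a result, guarding against spurious wakeups, and ignore any later results, can be demonstrated outside Spark. A minimal Python sketch (hypothetical class and method names, not Spark's API):

```python
import threading

class Observation:
    """Minimal sketch of the blocking-get pattern: get() waits in a loop
    (guarding against spurious wakeups) until the first finished action
    publishes a result via on_finish(); later results are ignored."""

    def __init__(self, name):
        self.name = name
        self._cond = threading.Condition()
        self._row = None

    def on_finish(self, row):
        with self._cond:
            if self._row is None:  # only the first action's metrics are kept
                self._row = row
                self._cond.notify_all()

    def get(self):
        with self._cond:
            # loop, since wait() may return without a matching notify
            while self._row is None:
                self._cond.wait()
            return self._row

obs = Observation("my metrics")
t = threading.Thread(target=obs.on_finish, args=({"rows": 100, "maxid": 99},))
t.start()
print(obs.get())  # blocks until on_finish has run → {'rows': 100, 'maxid': 99}
t.join()
```

The Scala version uses the JVM's intrinsic `synchronized`/`wait`/`notifyAll` on `this`; `threading.Condition` plays the same role here, and the `while` loop (rather than a single `wait()`) is exactly the spurious-wakeup defense the code comment cites.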
[GitHub] [spark] SparkQA removed a comment on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence
SparkQA removed a comment on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883011237 **[Test build #141290 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141290/testReport)** for PR 33427 at commit [`ad91d63`](https://github.com/apache/spark/commit/ad91d639cb8c3ede32d24db2703c35354c24617d).
[GitHub] [spark] SparkQA commented on pull request #33427: [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests.test_parameter_convergence
SparkQA commented on pull request #33427: URL: https://github.com/apache/spark/pull/33427#issuecomment-883019062 **[Test build #141290 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/141290/testReport)** for PR 33427 at commit [`ad91d63`](https://github.com/apache/spark/commit/ad91d639cb8c3ede32d24db2703c35354c24617d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.