[GitHub] [spark] SparkQA removed a comment on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
SparkQA removed a comment on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846707466 **[Test build #138858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138858/testReport)** for PR 32645 at commit [`0197136`](https://github.com/apache/spark/commit/01971362298c3813123e5ea89383e756701ca5e8). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
SparkQA commented on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846766141 **[Test build #138858 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138858/testReport)** for PR 32645 at commit [`0197136`](https://github.com/apache/spark/commit/01971362298c3813123e5ea89383e756701ca5e8). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `case class MakeYMInterval(years: Expression, months: Expression)`
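The new `MakeYMInterval(years: Expression, months: Expression)` expression reported above builds a year-month interval from integral fields; Spark stores a year-month interval internally as a single month count. A minimal Python sketch of that arithmetic (the function name mirrors the expression for illustration only; this is not Spark's implementation):

```python
def make_ym_interval(years, months):
    """Illustrative sketch: a year-month interval collapses to total months."""
    return years * 12 + months
```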
[GitHub] [spark] SparkQA commented on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
SparkQA commented on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846747586 **[Test build #138860 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138860/testReport)** for PR 32645 at commit [`111535c`](https://github.com/apache/spark/commit/111535c0aa1954a72b21f70679e0573c42850344).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
AmplabJenkins removed a comment on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846746531 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43381/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
AmplabJenkins removed a comment on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846746526 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43380/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
AmplabJenkins removed a comment on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846746525 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43378/
[GitHub] [spark] AmplabJenkins commented on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
AmplabJenkins commented on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846746525 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43378/
[GitHub] [spark] AmplabJenkins commented on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
AmplabJenkins commented on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846746526 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43380/
[GitHub] [spark] AmplabJenkins commented on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
AmplabJenkins commented on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846746531 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43381/
[GitHub] [spark] viirya commented on a change in pull request #32594: [SPARK-35447][SQL] Optimize skew join before coalescing shuffle partitions
viirya commented on a change in pull request #32594: URL: https://github.com/apache/spark/pull/32594#discussion_r637693175

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala

@@ -21,16 +21,160 @@
 import scala.collection.mutable.ArrayBuffer

 import org.apache.spark.MapOutputStatistics
 import org.apache.spark.internal.Logging
-import org.apache.spark.sql.execution.{CoalescedPartitionSpec, ShufflePartitionSpec}
+import org.apache.spark.sql.execution.{CoalescedPartitionSpec, PartialReducerPartitionSpec, ShufflePartitionSpec}

 object ShufflePartitionsUtil extends Logging {
   final val SMALL_PARTITION_FACTOR = 0.2
   final val MERGED_PARTITION_FACTOR = 1.2

   /**
-   * Coalesce the partitions from multiple shuffles. This method assumes that all the shuffles
-   * have the same number of partitions, and the partitions of same index will be read together
-   * by one task.
+   * Coalesce the partitions from multiple shuffles, either in their original states, or applied
+   * with skew handling partition specs. If called on partitions containing skew partition specs,
+   * this method will keep the skew partition specs intact and only coalesce the partitions outside
+   * the skew sections.
+   *
+   * This method will return an empty result if the shuffles have been coalesced already, or if
+   * they do not have the same number of partitions, or if the coalesced result is the same as the
+   * input partition layout.
+   *
+   * @return A sequence of sequence of [[ShufflePartitionSpec]]s, which each inner sequence as the
+   *         new partition specs for its corresponding shuffle after coalescing. If Nil is returned,
+   *         then no coalescing is applied.
+   */
+  def coalescePartitions(
+      mapOutputStatistics: Seq[Option[MapOutputStatistics]],
+      inputPartitionSpecs: Seq[Option[Seq[ShufflePartitionSpec]]],
+      advisoryTargetSize: Long,
+      minNumPartitions: Int): Seq[Seq[ShufflePartitionSpec]] = {
+    assert(mapOutputStatistics.length == inputPartitionSpecs.length)
+
+    if (mapOutputStatistics.isEmpty) {
+      return Seq.empty
+    }
+
+    // If `minNumPartitions` is very large, it is possible that we need to use a value less than
+    // `advisoryTargetSize` as the target size of a coalesced task.
+    val totalPostShuffleInputSize = mapOutputStatistics.flatMap(_.map(_.bytesByPartitionId.sum)).sum
+    // The max at here is to make sure that when we have an empty table, we only have a single
+    // coalesced partition.
+    // There is no particular reason that we pick 16. We just need a number to prevent
+    // `maxTargetSize` from being set to 0.
+    val maxTargetSize = math.max(
+      math.ceil(totalPostShuffleInputSize / minNumPartitions.toDouble).toLong, 16)
+    val targetSize = math.min(maxTargetSize, advisoryTargetSize)
+
+    val shuffleIds = mapOutputStatistics.flatMap(_.map(_.shuffleId)).mkString(", ")
+    logInfo(s"For shuffle($shuffleIds), advisory target size: $advisoryTargetSize, " +
+      s"actual target size $targetSize.")
+
+    val numShuffles = mapOutputStatistics.length
+    // `ShuffleQueryStageExec#mapStats` returns None when the input RDD has 0 partitions,
+    // we should skip it when calculating the `partitionStartIndices`.
+    val validMetrics = mapOutputStatistics.flatten
+
+    if (inputPartitionSpecs.forall(_.isEmpty)) {
+      // If all input RDDs have 0 partition, we create an empty partition for every shuffle reader.
+      if (validMetrics.isEmpty) {
+        return Seq.fill(numShuffles)(Seq(CoalescedPartitionSpec(0, 0)))
+      }
+
+      // We may have different pre-shuffle partition numbers, don't reduce shuffle partition number
+      // in that case. For example when we union fully aggregated data (data is arranged to a single
+      // partition) and a result of a SortMergeJoin (multiple partitions).
+      if (validMetrics.map(_.bytesByPartitionId.length).distinct.length > 1) {
+        return Seq.empty
+      }
+
+      val numPartitions = validMetrics.head.bytesByPartitionId.length
+      val newPartitionSpecs = coalescePartitions(
+        0, numPartitions, validMetrics, targetSize)
+      if (newPartitionSpecs.length < numPartitions) {
+        return Seq.fill(numShuffles)(newPartitionSpecs)
+      } else {
+        return Seq.empty
+      }
+    }
+
+    // Do not coalesce if any of the map output stats are missing or if not all shuffles have
+    // partition specs, which should not happen in practice.
+    if (!mapOutputStatistics.forall(_.isDefined) || !inputPartitionSpecs.forall(_.isDefined)) {
+      logWarning("Could not apply partition coalescing because of missing MapOutputStatistics " +
+        "or shuffle partition specs.")
+      return Seq.empty
+    }
+
+    // Extract the start indices of each partition spec. Give invalid index -1 to unexpected
+    // partition specs. When we reach here, it means skew
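The target-size computation in the diff above can be sketched standalone. This is a hedged Python illustration of that arithmetic only (function and parameter names are mine, not Spark's API): the total post-shuffle size divided by `minNumPartitions` caps the advisory size, and 16 is the floor that keeps the target non-zero for empty inputs.

```python
import math

def compute_target_size(bytes_by_partition_per_shuffle, advisory_target_size, min_num_partitions):
    """Sketch of the coalesced-task target-size logic from the diff above."""
    total = sum(sum(sizes) for sizes in bytes_by_partition_per_shuffle)
    # max(..., 16) prevents a zero target when the input is empty.
    max_target = max(math.ceil(total / min_num_partitions), 16)
    # A very large min_num_partitions can force a target below the advisory size.
    return min(max_target, advisory_target_size)
```

For example, with two shuffles totalling 400 bytes, an advisory size of 64, and a minimum of 4 partitions, the per-partition cap of 100 exceeds the advisory size, so the advisory size 64 wins.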
[GitHub] [spark] SparkQA commented on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
SparkQA commented on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846738097 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43381/
[GitHub] [spark] SparkQA commented on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
SparkQA commented on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846736560 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43378/
[GitHub] [spark] HyukjinKwon commented on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
HyukjinKwon commented on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846730839 Shall we file a JIRA?
[GitHub] [spark] SparkQA commented on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
SparkQA commented on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846730780 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43380/
[GitHub] [spark] SparkQA removed a comment on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
SparkQA removed a comment on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846721626 **[Test build #138859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138859/testReport)** for PR 32646 at commit [`2c46597`](https://github.com/apache/spark/commit/2c4659736935029e84c65c43d16a52fe9d5473b7).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
AmplabJenkins removed a comment on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846725283 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138859/
[GitHub] [spark] AmplabJenkins commented on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
AmplabJenkins commented on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846725283 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138859/
[GitHub] [spark] SparkQA commented on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
SparkQA commented on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846725263 **[Test build #138859 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138859/testReport)** for PR 32646 at commit [`2c46597`](https://github.com/apache/spark/commit/2c4659736935029e84c65c43d16a52fe9d5473b7). * This patch **fails to build**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
SparkQA commented on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846721706 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43378/
[GitHub] [spark] SparkQA commented on pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
SparkQA commented on pull request #32646: URL: https://github.com/apache/spark/pull/32646#issuecomment-846721626 **[Test build #138859 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138859/testReport)** for PR 32646 at commit [`2c46597`](https://github.com/apache/spark/commit/2c4659736935029e84c65c43d16a52fe9d5473b7).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
AmplabJenkins removed a comment on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846721119 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138856/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
AmplabJenkins removed a comment on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846721120 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43377/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32643: [WIP] migrate transformAllExpressions
AmplabJenkins removed a comment on pull request #32643: URL: https://github.com/apache/spark/pull/32643#issuecomment-846721117
[GitHub] [spark] AmplabJenkins commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
AmplabJenkins commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846721120 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43377/
[GitHub] [spark] AmplabJenkins commented on pull request #32643: [WIP] migrate transformAllExpressions
AmplabJenkins commented on pull request #32643: URL: https://github.com/apache/spark/pull/32643#issuecomment-846721117
[GitHub] [spark] AmplabJenkins commented on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
AmplabJenkins commented on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846721119 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138856/
[GitHub] [spark] SparkQA commented on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
SparkQA commented on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846720772 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43380/
[GitHub] [spark] maropu commented on a change in pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In predicate
maropu commented on a change in pull request #32488: URL: https://github.com/apache/spark/pull/32488#discussion_r637672796 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala ## @@ -121,6 +129,49 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] { if canImplicitlyCast(fromExp, toType, literalType) => simplifyNumericComparison(be, fromExp, toType, value) +// As the analyzer makes sure that the list of In is already of the same data type, then the +// rule can simply check the first literal in `in.list` can implicitly cast to `toType` or not, +// and this rule doesn't convert in when `in.list` is empty. +case in @ In(Cast(fromExp, toType: NumericType, _), list @ Seq(firstLit, _*)) +if canImplicitlyCast(fromExp, toType, firstLit.dataType) && in.inSetConvertible => + val (newValueList, exp) = +list.map(lit => unwrapCast(EqualTo(in.value, lit))) + .partition { +case EqualTo(_, _: Literal) => true +case And(IsNull(_), Literal(null, BooleanType)) => false Review comment: `case And(IsNull(_), Literal(null, BooleanType)) => false` => `case _ => false`? ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCastInBinaryComparison.scala ## @@ -121,6 +129,49 @@ object UnwrapCastInBinaryComparison extends Rule[LogicalPlan] { if canImplicitlyCast(fromExp, toType, literalType) => simplifyNumericComparison(be, fromExp, toType, value) +// As the analyzer makes sure that the list of In is already of the same data type, then the +// rule can simply check the first literal in `in.list` can implicitly cast to `toType` or not, +// and this rule doesn't convert in when `in.list` is empty. 
+case in @ In(Cast(fromExp, toType: NumericType, _), list @ Seq(firstLit, _*)) +if canImplicitlyCast(fromExp, toType, firstLit.dataType) && in.inSetConvertible => + val (newValueList, exp) = +list.map(lit => unwrapCast(EqualTo(in.value, lit))) + .partition { +case EqualTo(_, _: Literal) => true +case And(IsNull(_), Literal(null, BooleanType)) => false + } + + val (nonNullValueList, nullValueList) = newValueList.partition { +case EqualTo(_, NonNullLiteral(_, _: NumericType)) => true +case EqualTo(_, Literal(null, _)) => false + } + // make sure the new return list have the same dataType. + val newList = { +if (nonNullValueList.nonEmpty) { + // cast the null value to the dataType of nonNullValueList + // when the nonNullValueList is nonEmpty. + nullValueList.map { +case EqualTo(_, lit) => + Cast(lit, nonNullValueList.head.asInstanceOf[EqualTo].left.dataType) + } ++ nonNullValueList.map {case EqualTo(_, lit) => lit} +} else { + // the new value list only contains null value, + // cast the null value to fromExp.dataType. + nullValueList.map { +case EqualTo(_, lit) => + Cast(lit, fromExp.dataType) + } +} + } + + val unwrapIn = In(fromExp, newList) + // since `exp` are all the same, + // convert to a single value `And(IsNull(_), Literal(null, BooleanType))`. + exp.headOption match { +case Some(falseIfNotNull) => Or(falseIfNotNull, unwrapIn) Review comment: We still need to unwrap casts in this case? IIUC we unwrap casts so that the later optimizer rules can easily push down predicates into data sources. But, predicates having `Or` makes it hard to push them down? 
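The rule being reviewed above can be sketched outside of Catalyst as follows. This is an illustrative standalone example, not the actual optimizer rule: it uses plain values instead of Catalyst expressions, and the names are invented. The core idea is the same — `CAST(col AS INT) IN (literals)` can become `col IN (downcast literals)` only when every literal fits the narrower column type; literals that cannot round-trip must be handled separately (in the real rule they fold to the null-producing branch):

```scala
// Illustrative sketch only: unwrapping a cast inside an In predicate.
// Column type is SHORT, IN-list literals are INT.
object UnwrapCastInSketch {
  // Returns Some(downcast list) when the rewrite is safe, None when some
  // literal overflows SHORT and the cast must be kept (or the literal
  // handled as a never-matching value, as the real rule does).
  def unwrapInShort(literals: Seq[Int]): Option[Seq[Short]] = {
    if (literals.forall(v => v >= Short.MinValue && v <= Short.MaxValue))
      Some(literals.map(_.toShort)) // safe: compare on the narrow type
    else
      None                          // overflow: rewrite not directly safe
  }

  def main(args: Array[String]): Unit = {
    println(unwrapInShort(Seq(1, 2, 3)))    // rewrite applies
    println(unwrapInShort(Seq(1, 100000)))  // 100000 overflows SHORT
  }
}
```

The benefit maropu's comment alludes to is predicate pushdown: an `In` directly on the column (no cast) is much easier for data sources to evaluate, whereas wrapping the result in `Or` can defeat that pushdown.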
[GitHub] [spark] SparkQA commented on pull request #32643: [WIP] migrate transformAllExpressions
SparkQA commented on pull request #32643: URL: https://github.com/apache/spark/pull/32643#issuecomment-846717967 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43379/
[GitHub] [spark] beliefer opened a new pull request #32646: [SPARK-35057][SQL] Group exception messages in hive/thriftserver
beliefer opened a new pull request #32646: URL: https://github.com/apache/spark/pull/32646 ### What changes were proposed in this pull request? This PR groups exception messages in `sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver`. ### Why are the changes needed? It will largely help with standardization of error messages and their maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior.
[GitHub] [spark] Ngone51 commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
Ngone51 commented on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-846713976 > If I understand correctly about stage level scheduling, you still need to specify "all" resources needed for "all" tasks in StateRDD; while that may block Spark to schedule when some resources are missing (like lost executor with PVC), I'm wondering how task level schedule would work as its intention. After this, locality is the only one we can deal with, and it's not an enforcement so we're back to the origin problem. That's true. I think we can extend the `ResourceProfile.defaultProfile` by adding the state store request to it. And we may not need to add the state store request to the executor (but task only) so the executor doesn't need to load the state store at its launch time while using dynamic allocation.
[GitHub] [spark] SparkQA removed a comment on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
SparkQA removed a comment on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846704085 **[Test build #138856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138856/testReport)** for PR 32644 at commit [`af94332`](https://github.com/apache/spark/commit/af94332023c65df92a5b1a781dd3771e115b279b).
[GitHub] [spark] SparkQA commented on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
SparkQA commented on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846713551 **[Test build #138856 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138856/testReport)** for PR 32644 at commit [`af94332`](https://github.com/apache/spark/commit/af94332023c65df92a5b1a781dd3771e115b279b). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846712770 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43377/
[GitHub] [spark] viirya commented on a change in pull request #32594: [SPARK-35447][SQL] Optimize skew join before coalescing shuffle partitions
viirya commented on a change in pull request #32594: URL: https://github.com/apache/spark/pull/32594#discussion_r637669250 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala ## @@ -48,9 +48,6 @@ import org.apache.spark.sql.internal.SQLConf * (L2-1, R2), (L2-2, R2), * (L3, R3-1), (L3, R3-2), * (L4-1, R4-1), (L4-2, R4-1), (L4-1, R4-2), (L4-2, R4-2) - * - * Note that, when this rule is enabled, it also coalesces non-skewed partitions like - * `CoalesceShufflePartitions` does. Review comment: Is this comment not true anymore?
[GitHub] [spark] FatalLin commented on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories
FatalLin commented on pull request #32202: URL: https://github.com/apache/spark/pull/32202#issuecomment-846711680 > I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't it changed in this action? you mean it will hit the same problem if we trigger the action with the hive engine instead of the spark native reader? I thought it could be handled with a hive configuration such as "hive.mapred.supports.subdirectories".
[GitHub] [spark] SparkQA removed a comment on pull request #32643: [WIP] migrate transformAllExpressions
SparkQA removed a comment on pull request #32643: URL: https://github.com/apache/spark/pull/32643#issuecomment-846704117 **[Test build #138857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138857/testReport)** for PR 32643 at commit [`773bb96`](https://github.com/apache/spark/commit/773bb961ddd1824efd7ca37ae462d031ac36edd3).
[GitHub] [spark] SparkQA commented on pull request #32643: [WIP] migrate transformAllExpressions
SparkQA commented on pull request #32643: URL: https://github.com/apache/spark/pull/32643#issuecomment-846711414 **[Test build #138857 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138857/testReport)** for PR 32643 at commit [`773bb96`](https://github.com/apache/spark/commit/773bb961ddd1824efd7ca37ae462d031ac36edd3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * ` case class DeferFetchRequestResult(fetchRequest: FetchRequest) extends FetchResult` * `case class AgeExample(birthday: Expression, child: Expression) extends RuntimeReplaceable ` * `class SessionExtensionsWithLoader extends SparkSessionExtensionsProvider ` * `class SessionExtensionsWithoutLoader extends SparkSessionExtensionsProvider ` * `case class AvroWrite(` * `case class KafkaWrite(` * `class DataTypeOps(object, metaclass=ABCMeta):` * `\"\"\"The base class for binary operations of pandas-on-Spark objects (of different data types).\"\"\"` * `class BooleanOps(DataTypeOps):` * `class CategoricalOps(DataTypeOps):` * `class DateOps(DataTypeOps):` * `class DatetimeOps(DataTypeOps):` * `class NumericOps(DataTypeOps):` * `class IntegralOps(NumericOps):` * `class FractionalOps(NumericOps):` * `class StringOps(DataTypeOps):` * `class CachedAccessor(Generic[T]):` * `class SparkIndexOpsMethods(metaclass=ABCMeta):` * `case class ReferenceEqualPlanWrapper(plan: LogicalPlan) ` * ` class ExpressionContainmentOrdering extends Ordering[Expression] ` * `case class TryEval(child: Expression) extends UnaryExpression with NullIntolerant ` * `case class TryAdd(left: Expression, right: Expression, child: Expression)` * `case class TryDivide(left: Expression, right: Expression, child: Expression)` * `new RuntimeException(s\"Failed to convert value $value (class of $cls) \" +` * `new RuntimeException(s\"class $clsName has unexpected serializer: $objSerializer\")` * `case class UpdatingSessionsExec(` * 
`class UpdatingSessionsIterator(` * `trait FileWrite extends Write ` * `case class CSVWrite(` * `case class JsonWrite(` * `case class OrcWrite(` * `case class ParquetWrite(` * `case class TextWrite(` * `class ConsoleWrite(schema: StructType, options: CaseInsensitiveStringMap)` * `class ForeachWrite[T](` * `class MemoryWrite(sink: MemorySink, schema: StructType, needTruncate: Boolean) extends Write `
[GitHub] [spark] yaooqinn commented on a change in pull request #32563: [SPARK-35415][SQL] Change `information` to map type for SHOW TABLE EXTENDED command
yaooqinn commented on a change in pull request #32563: URL: https://github.com/apache/spark/pull/32563#discussion_r637668321 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala ## @@ -65,6 +65,11 @@ object HiveResult { // database, table name, isTemp. case command @ ExecutedCommandExec(s: ShowTablesCommand) if !s.isExtended => command.executeCollect().map(_.getString(1)) +// SHOW TABLE EXTENDED in Hive do not have isTemp while our v1 command outputs isTemp. +case command @ ExecutedCommandExec(s: ShowTablesCommand) if s.isExtended => + command.executeCollect().map(_.getMap(3)) +.map(m => m.keyArray().array.zip(m.valueArray().array) + .map(kv => s"${kv._1}: ${kv._2}").mkString("\n")) Review comment: https://github.com/apache/spark/pull/32563/files#r635773394, there seems to be no `space` after the colon in hive results?
[GitHub] [spark] viirya commented on a change in pull request #32594: [SPARK-35447][SQL] Optimize skew join before coalescing shuffle partitions
viirya commented on a change in pull request #32594: URL: https://github.com/apache/spark/pull/32594#discussion_r637667408 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala ## @@ -96,10 +96,9 @@ case class AdaptiveSparkPlanExec( @transient private val queryStageOptimizerRules: Seq[Rule[SparkPlan]] = Seq( PlanAdaptiveDynamicPruningFilters(this), ReuseAdaptiveSubquery(context.subqueryCache), -CoalesceShufflePartitions(context.session), -// The following two rules need to make use of 'CustomShuffleReaderExec.partitionSpecs' -// added by `CoalesceShufflePartitions`. So they must be executed after it. +// Skew join does not handle `CustomShuffleReader` so needs to be applied first. OptimizeSkewedJoin, +CoalesceShufflePartitions(context.session), OptimizeLocalShuffleReader Review comment: As `OptimizeLocalShuffleReader` still handles `CustomShuffleReaderExec`, we should keep the comment?
[GitHub] [spark] chong0929 edited a comment on pull request #32202: [SPARK-28098][SQL]Supporting non-partitioned Hive tables with subdirectories
chong0929 edited a comment on pull request #32202: URL: https://github.com/apache/spark/pull/32202#issuecomment-846708661 I found the same problem with partitioned Hive tables if they contain subdirectories, so why wasn't it changed in this action?
[GitHub] [spark] SparkQA commented on pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
SparkQA commented on pull request #32645: URL: https://github.com/apache/spark/pull/32645#issuecomment-846707466 **[Test build #138858 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138858/testReport)** for PR 32645 at commit [`0197136`](https://github.com/apache/spark/commit/01971362298c3813123e5ea89383e756701ca5e8).
[GitHub] [spark] AngersZhuuuu opened a new pull request #32645: [SPARK-35129][SQL] Construct year-month interval column from integral fields
AngersZhuuuu opened a new pull request #32645: URL: https://github.com/apache/spark/pull/32645 ### What changes were proposed in this pull request? Add a new function to support constructing a YearMonthIntervalType column from integral fields. ### Why are the changes needed? Add a new function to support constructing a YearMonthIntervalType column from integral fields. ### Does this PR introduce _any_ user-facing change? Yes, users can use `make_ym_interval` to construct a YearMonthIntervalType column from years/months integral fields. ### How was this patch tested? Added UT
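The semantics proposed in SPARK-35129 can be sketched as follows. This is a hypothetical illustration, not the PR's implementation: the function name `make_ym_interval` comes from the PR description above, and the final surface may differ before merge. The key fact is that a year-month interval is internally a single count of months, so the constructor reduces to overflow-checked arithmetic:

```scala
// Illustrative sketch of make_ym_interval semantics (SPARK-35129).
// In SQL (assumed surface): SELECT make_ym_interval(1, 2)
//   -- INTERVAL '1-2' YEAR TO MONTH
object MakeYMIntervalSketch {
  // YearMonthIntervalType stores the interval as total months; the
  // constructor should fail on overflow rather than wrap around.
  def makeYMInterval(years: Int, months: Int): Int =
    Math.addExact(Math.multiplyExact(years, 12), months)

  def main(args: Array[String]): Unit = {
    println(makeYMInterval(1, 2)) // 14 total months = INTERVAL '1-2'
  }
}
```

Using `addExact`/`multiplyExact` mirrors how Spark's ANSI-mode interval arithmetic surfaces overflow as an error instead of silently truncating.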
[GitHub] [spark] JkSelf commented on a change in pull request #32594: [SPARK-35447][SQL] Optimize skew join before coalescing shuffle partitions
JkSelf commented on a change in pull request #32594: URL: https://github.com/apache/spark/pull/32594#discussion_r637663065 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/OptimizeSkewedJoin.scala ## @@ -158,76 +155,68 @@ object OptimizeSkewedJoin extends CustomShuffleReaderRule { *3 tasks separately. */ private def tryOptimizeJoinChildren( - left: ShuffleStageInfo, - right: ShuffleStageInfo, + left: ShuffleQueryStageExec, + right: ShuffleQueryStageExec, joinType: JoinType): Option[(SparkPlan, SparkPlan)] = { -assert(left.partitionsWithSizes.length == right.partitionsWithSizes.length) -val numPartitions = left.partitionsWithSizes.length +val leftSizes = left.mapStats.get.bytesByPartitionId +val rightSizes = right.mapStats.get.bytesByPartitionId +assert(leftSizes.length == rightSizes.length) +val numPartitions = leftSizes.length // We use the median size of the original shuffle partitions to detect skewed partitions. -val leftMedSize = medianSize(left.mapStats) -val rightMedSize = medianSize(right.mapStats) +val leftMedSize = medianSize(leftSizes) +val rightMedSize = medianSize(rightSizes) logDebug( s""" |Optimizing skewed join. |Left side partitions size info: - |${getSizeInfo(leftMedSize, left.mapStats.bytesByPartitionId)} + |${getSizeInfo(leftMedSize, leftSizes)} |Right side partitions size info: - |${getSizeInfo(rightMedSize, right.mapStats.bytesByPartitionId)} + |${getSizeInfo(rightMedSize, rightSizes)} """.stripMargin) + val canSplitLeft = canSplitLeftSide(joinType) val canSplitRight = canSplitRightSide(joinType) -// We use the actual partition sizes (may be coalesced) to calculate target size, so that -// the final data distribution is even (coalesced partitions + split partitions). 
-val leftActualSizes = left.partitionsWithSizes.map(_._2) -val rightActualSizes = right.partitionsWithSizes.map(_._2) -val leftTargetSize = targetSize(leftActualSizes, leftMedSize) -val rightTargetSize = targetSize(rightActualSizes, rightMedSize) +val leftTargetSize = targetSize(leftSizes, leftMedSize) +val rightTargetSize = targetSize(rightSizes, rightMedSize) val leftSidePartitions = mutable.ArrayBuffer.empty[ShufflePartitionSpec] val rightSidePartitions = mutable.ArrayBuffer.empty[ShufflePartitionSpec] var numSkewedLeft = 0 var numSkewedRight = 0 for (partitionIndex <- 0 until numPartitions) { - val leftActualSize = leftActualSizes(partitionIndex) - val isLeftSkew = isSkewed(leftActualSize, leftMedSize) && canSplitLeft - val leftPartSpec = left.partitionsWithSizes(partitionIndex)._1 - val isLeftCoalesced = leftPartSpec.startReducerIndex + 1 < leftPartSpec.endReducerIndex - - val rightActualSize = rightActualSizes(partitionIndex) - val isRightSkew = isSkewed(rightActualSize, rightMedSize) && canSplitRight - val rightPartSpec = right.partitionsWithSizes(partitionIndex)._1 - val isRightCoalesced = rightPartSpec.startReducerIndex + 1 < rightPartSpec.endReducerIndex + val leftSize = leftSizes(partitionIndex) + val isLeftSkew = isSkewed(leftSize, leftMedSize) && canSplitLeft + val rightSize = rightSizes(partitionIndex) + val isRightSkew = isSkewed(rightSize, rightMedSize) && canSplitRight + val noSkewPartitionSpec = Seq(CoalescedPartitionSpec(partitionIndex, partitionIndex + 1)) - // A skewed partition should never be coalesced, but skip it here just to be safe. 
- val leftParts = if (isLeftSkew && !isLeftCoalesced) { -val reducerId = leftPartSpec.startReducerIndex + val leftParts = if (isLeftSkew) { val skewSpecs = createSkewPartitionSpecs( - left.mapStats.shuffleId, reducerId, leftTargetSize) + left.mapStats.get.shuffleId, partitionIndex, leftTargetSize) if (skewSpecs.isDefined) { logDebug(s"Left side partition $partitionIndex " + -s"(${FileUtils.byteCountToDisplaySize(leftActualSize)}) is skewed, " + +s"(${FileUtils.byteCountToDisplaySize(leftSize)}) is skewed, " + s"split it into ${skewSpecs.get.length} parts.") numSkewedLeft += 1 } -skewSpecs.getOrElse(Seq(leftPartSpec)) +skewSpecs.getOrElse(noSkewPartitionSpec) } else { -Seq(leftPartSpec) +noSkewPartitionSpec } // A skewed partition should never be coalesced, but skip it here just to be safe. Review comment: This comment can be removed. ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/ShufflePartitionsUtil.scala ## @@ -21,16 +21,160 @@ import scala.collection.mutable.ArrayBuffer import org.apache.spark.MapOutputStatistics import org.apache.spark.internal.Logging -import
[GitHub] [spark] wangyum commented on a change in pull request #32563: [SPARK-35415][SQL] Change `information` to map type for SHOW TABLE EXTENDED command
wangyum commented on a change in pull request #32563: URL: https://github.com/apache/spark/pull/32563#discussion_r637663948 ## File path: sql/core/src/test/resources/sql-tests/results/show-tables.sql.out ## @@ -120,19 +120,9 @@ show_t3 -- !query SHOW TABLE EXTENDED LIKE 'show_t*' -- !query schema -struct +struct> -- !query output - show_t3 trueTable: show_t3 Review comment: `beeline` before: ``` 0: jdbc:hive2://localhost:1> set spark.sql.legacy.keepCommandOutputSchema=true; +---++ |key| value | +---++ | spark.sql.legacy.keepCommandOutputSchema | true | +---++ 1 row selected (0.055 seconds) 0: jdbc:hive2://localhost:1> SHOW TABLE EXTENDED LIKE '*'; +---+---+--++ | database | tableName | isTemporary |information | +---+---+--++ | default | test_parquet | false| CatalogTable( Database: default Table: test_parquet Owner: yumwang Created Time: Mon May 24 11:16:33 CST 2021 Last Access: UNKNOWN Created By: Spark 3.2.0-SNAPSHOT Type: MANAGED Provider: hive Table Properties: [transient_lastDdlTime=1621826201] Statistics: 290 bytes Location: file:/Users/yumwang/tmp//spark/spark-warehouse/test_parquet Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties: [serialization.format=1] Partition Provider: Catalog Schema: root |-- id: long (nullable = false) ) | +---+---+--++ 1 row selected (0.086 seconds) ``` `beeline` after: ``` 0: jdbc:hive2://localhost:1> SHOW TABLE EXTENDED LIKE '*'; ++---+--++ | namespace | tableName | isTemporary |information | ++---+--++ | default| test_parquet | false| {"Created By":"Spark 3.2.0-SNAPSHOT","Created Time":"Mon May 24 11:16:33 CST 2021","Database":"default","InputFormat":"org.apache.hadoop.mapred.TextInputFormat","Last 
Access":"UNKNOWN","Location":"file:/Users/yumwang/tmp//spark/spark-warehouse/test_parquet","OutputFormat":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","Owner":"yumwang","Partition Provider":"Catalog","Provider":"hive","Schema":"root |-- id: long (nullable = false) ","Serde Library":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","Statistics":"290 bytes","Storage Properties":"[serialization.format=1]","Table Properties":"[transient_lastDdlTime=1621826201]","Table":"test_parquet","Type":"MANAGED"} | ++---+--++ 1 row selected (0.903 seconds) ```
[GitHub] [spark] wangyum commented on a change in pull request #32563: [SPARK-35415][SQL] Change `information` to map type for SHOW TABLE EXTENDED command
wangyum commented on a change in pull request #32563: URL: https://github.com/apache/spark/pull/32563#discussion_r637662935 ## File path: sql/core/src/test/resources/sql-tests/results/show-tables.sql.out ## @@ -120,19 +120,9 @@ show_t3 -- !query SHOW TABLE EXTENDED LIKE 'show_t*' -- !query schema -struct +struct> -- !query output - show_t3 trueTable: show_t3 Review comment: `spark-sql` before: ``` spark-sql> set spark.sql.legacy.keepCommandOutputSchema=true; spark.sql.legacy.keepCommandOutputSchema true Time taken: 0.012 seconds, Fetched 1 row(s) spark-sql> SHOW TABLE EXTENDED LIKE '*'; default test_parquetfalse CatalogTable( Database: default Table: test_parquet Owner: yumwang Created Time: Mon May 24 11:16:33 CST 2021 Last Access: UNKNOWN Created By: Spark 3.2.0-SNAPSHOT Type: MANAGED Provider: hive Table Properties: [transient_lastDdlTime=1621826201] Statistics: 290 bytes Location: file:/Users/yumwang/tmp//spark/spark-warehouse/test_parquet Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties: [serialization.format=1] Partition Provider: Catalog Schema: root |-- id: long (nullable = false) ) Time taken: 0.031 seconds, Fetched 1 row(s) ``` `spark-sql` after: ``` spark-sql> SHOW TABLE EXTENDED LIKE '*'; default test_parquetfalse {"Created By":"Spark 3.2.0-SNAPSHOT","Created Time":"Mon May 24 11:16:33 CST 2021","Database":"default","InputFormat":"org.apache.hadoop.mapred.TextInputFormat","Last Access":"UNKNOWN","Location":"file:/Users/yumwang/tmp//spark/spark-warehouse/test_parquet","OutputFormat":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","Owner":"yumwang","Partition Provider":"Catalog","Provider":"hive","Schema":"root |-- id: long (nullable = false) ","Serde Library":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","Statistics":"290 bytes","Storage 
Properties":"[serialization.format=1]","Table Properties":"[transient_lastDdlTime=1621826201]","Table":"test_parquet","Type":"MANAGED"} Time taken: 0.043 seconds, Fetched 1 row(s) ```
[GitHub] [spark] SparkQA commented on pull request #32643: [WIP] migrate transformAllExpressions
SparkQA commented on pull request #32643: URL: https://github.com/apache/spark/pull/32643#issuecomment-846704117 **[Test build #138857 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138857/testReport)** for PR 32643 at commit [`773bb96`](https://github.com/apache/spark/commit/773bb961ddd1824efd7ca37ae462d031ac36edd3).
[GitHub] [spark] SparkQA commented on pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
SparkQA commented on pull request #32644: URL: https://github.com/apache/spark/pull/32644#issuecomment-846704085 **[Test build #138856 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138856/testReport)** for PR 32644 at commit [`af94332`](https://github.com/apache/spark/commit/af94332023c65df92a5b1a781dd3771e115b279b).
[GitHub] [spark] wangyum commented on a change in pull request #32563: [SPARK-35415][SQL] Change `information` to map type for SHOW TABLE EXTENDED command
wangyum commented on a change in pull request #32563: URL: https://github.com/apache/spark/pull/32563#discussion_r637662935 ## File path: sql/core/src/test/resources/sql-tests/results/show-tables.sql.out ## @@ -120,19 +120,9 @@ show_t3 -- !query SHOW TABLE EXTENDED LIKE 'show_t*' -- !query schema -struct +struct> -- !query output - show_t3 true Table: show_t3 Review comment: `spark-sql` Before: ``` spark-sql> set spark.sql.legacy.keepCommandOutputSchema=true; spark.sql.legacy.keepCommandOutputSchema true Time taken: 0.012 seconds, Fetched 1 row(s) spark-sql> SHOW TABLE EXTENDED LIKE '*'; default test_parquet false CatalogTable( Database: default Table: test_parquet Owner: yumwang Created Time: Mon May 24 11:16:33 CST 2021 Last Access: UNKNOWN Created By: Spark 3.2.0-SNAPSHOT Type: MANAGED Provider: hive Table Properties: [transient_lastDdlTime=1621826201] Statistics: 290 bytes Location: file:/Users/yumwang/tmp//spark/spark-warehouse/test_parquet Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.TextInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Storage Properties: [serialization.format=1] Partition Provider: Catalog Schema: root |-- id: long (nullable = false) ) Time taken: 0.031 seconds, Fetched 1 row(s) ``` `spark-sql` After: ``` spark-sql> SHOW TABLE EXTENDED LIKE '*'; default test_parquet false {"Created By":"Spark 3.2.0-SNAPSHOT","Created Time":"Mon May 24 11:16:33 CST 2021","Database":"default","InputFormat":"org.apache.hadoop.mapred.TextInputFormat","Last Access":"UNKNOWN","Location":"file:/Users/yumwang/tmp//spark/spark-warehouse/test_parquet","OutputFormat":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat","Owner":"yumwang","Partition Provider":"Catalog","Provider":"hive","Schema":"root |-- id: long (nullable = false) ","Serde Library":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","Statistics":"290 bytes","Storage 
Properties":"[serialization.format=1]","Table Properties":"[transient_lastDdlTime=1621826201]","Table":"test_parquet","Type":"MANAGED"} Time taken: 0.043 seconds, Fetched 1 row(s) ```
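The review comment above shows the new map-typed `information` column rendered as JSON. A minimal Python sketch (not part of the PR; the sample values are trimmed from the output above) of why the map form is easier for programs to consume than the old free-form `CatalogTable(...)` text:

```python
import json

# A trimmed sample of the map-typed `information` value shown above,
# rendered as JSON (values copied from the review comment).
information = (
    '{"Created By":"Spark 3.2.0-SNAPSHOT","Database":"default",'
    '"Provider":"hive","Table":"test_parquet","Type":"MANAGED"}'
)

# With the old string schema, callers had to regex-parse free-form text;
# with a map, fields can be looked up directly by key.
info = json.loads(information)
print(info["Type"])      # MANAGED
print(info["Provider"])  # hive
```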
[GitHub] [spark] AmplabJenkins commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
AmplabJenkins commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846703760 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43376/
[GitHub] [spark] SparkQA commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846703020 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43377/
[GitHub] [spark] WeichenXu123 opened a new pull request #32644: [MINOR] Add thread target wrapper API for pyspark pin thread mode.
WeichenXu123 opened a new pull request #32644: URL: https://github.com/apache/spark/pull/32644 ### What changes were proposed in this pull request? Add a thread target wrapper API for pyspark pin thread mode. ### Why are the changes needed? A helper method which makes it easier for users to write threading code under pin thread mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual.
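The PR body doesn't show the helper itself. As a rough illustration of the idea only (all names here are hypothetical, not pyspark's actual API), a wrapper can capture the parent thread's local properties and re-install them inside the child thread's target, which is the kind of boilerplate such a helper would save under pin thread mode:

```python
import threading

# Hypothetical stand-in for per-thread "local properties"; the real pyspark
# helper and its name are not shown in the PR body above.
_local = threading.local()

def set_local_property(key, value):
    props = getattr(_local, "props", None)
    if props is None:
        props = {}
        _local.props = props
    props[key] = value

def get_local_property(key):
    return getattr(_local, "props", {}).get(key)

def wrap_thread_target(target):
    """Capture the caller's properties and re-install them in the child thread."""
    captured = dict(getattr(_local, "props", {}))

    def wrapped(*args, **kwargs):
        _local.props = dict(captured)
        return target(*args, **kwargs)

    return wrapped

# Without the wrapper, the child thread would see no properties at all.
set_local_property("spark.scheduler.pool", "pool1")
seen = {}

def job():
    seen["pool"] = get_local_property("spark.scheduler.pool")

t = threading.Thread(target=wrap_thread_target(job))
t.start()
t.join()
print(seen["pool"])  # pool1
```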
[GitHub] [spark] sigmod opened a new pull request #32643: [WIP] migrate transformAllExpressions
sigmod opened a new pull request #32643: URL: https://github.com/apache/spark/pull/32643 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] dongjoon-hyun commented on pull request #32642: [SPARK-35495][R] Change SparkR maintainer for CRAN
dongjoon-hyun commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846700488 Thank you, @shivaram !
[GitHub] [spark] viirya commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
viirya commented on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-846699726 > That said, personally, as I commented on above comment thread, I'd like to see the interface to let end users do the hack on Spark and do whatever they want, with taking their own risks. In other words, task level scheduling itself sounds OK to me, though I promise I'll respect the decision of committers on CORE area. > > Btw, IMHO, at least for now, initial state distribution and leveraging PVC are something which should be proved in various use cases / workloads in production, before taking them on the plate of discussions. Before then, I'd like to see Spark be customizable on their needs, so that they are no longer blocked on Spark side. This API is proposed to let other developers change the scheduling behavior in Spark, and yes, taking their own risks. Actually, this might be the least intrusive way to support the improvements on the SS side. We don't need to touch other core pieces, but just let developers work with the API. Maybe we don't even need to add the PVC stuff into Spark upstream. As it is a pretty clean API implementation, we can just implement it on our side if Spark upstream doesn't want to take the complexity. Mentioning PVC is to show the use case we had in mind when we were asked how we are going to use the API. Unfortunately, so far we have been unable to push this forward. And I'm not sure right now how much change is needed in stage-level scheduling for similar functionality. But I also want to respect the decision of committers on the core area, so I'd like to take a look at stage-level scheduling as suggested above.
[GitHub] [spark] shivaram commented on pull request #32642: [SPARK-35495][R] Change SparkR maintainer for CRAN
shivaram commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846698523 Thanks @felixcheung +1 from me as well
[GitHub] [spark] viirya commented on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
viirya commented on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-846696394 > My major point is about the characteristic of the checkpoint location. > > We require checkpoint location to be "fault-tolerant" including hardware failures (local storage doesn't make sense here), and provide "high availability" by itself so that Spark can delegate such complexity to the checkpoint location. For sure, such requirement leads underlying file system to be heavy and non-trivial to maintain, but IMHO that's not an enough reason to take the complexity back to Spark, because: I think one major issue users face is that they don't have a choice. As for "fault-tolerant", since we consider PVC an abstract way to look at storage, it can support that if the storage class supports the feature; there are in fact storage classes supporting it. Again, it is about user choice. Users can choose from different storage classes for PVC. How often can a fault occur, and how serious could a fault be for the streaming app? Not to mention there is also snapshot support for volumes on K8S. From less to more, users can choose different storage classes to meet their requirements. For example, for a streaming app where a fault may not be too serious an issue, maybe local storage + occasional snapshots, or local storage with RAID, may be good enough? For industry usage, it is sometimes not easy to adopt whatever file system the users want, e.g. object stores in Azure or GCS or others. Any backend file system adoption requires organizational change, talent hiring, system engineering team support, policy change, etc. > I'd interpret the reasons as two folds: > > A. Majority of real-world workloads are working well with current technology > B. Some workloads don't work well, but no strong demand on this as possible issues are tolerable I don't want to guess here, but maybe another possibility is that such users moved to another streaming engine which can support their workloads more easily. > I'd be happy to see the overall system design and the result of POC. Let's continue the talk about PVC once we get the details. Sure. This stage-level scheduling is a different direction than my original proposal, so I need to take some time to revise it. I will keep it posted elsewhere, e.g. a new JIRA.
[GitHub] [spark] yaooqinn edited a comment on pull request #32594: [SPARK-35447][SQL] Optimize skew join before coalescing shuffle partitions
yaooqinn edited a comment on pull request #32594: URL: https://github.com/apache/spark/pull/32594#issuecomment-846695031 Looks fine to me
[GitHub] [spark] SparkQA commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846693183 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43376/
[GitHub] [spark] HeartSaVioR edited a comment on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
HeartSaVioR edited a comment on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-846691545 That said, personally, as I commented on the above comment thread, I'd like to see the interface let end users do the hack on Spark and do whatever they want, taking their own risks. In other words, task-level scheduling itself sounds OK to me, though I promise I'll respect the decision of committers on the CORE area. Btw, IMHO, at least for now, initial state distribution and leveraging PVC are something which should be proved in various use cases / workloads in production before taking them on the plate of discussions. Before then, I'd like to see Spark be customizable to their needs, so that they are no longer blocked on the Spark side.
[GitHub] [spark] ankurdave commented on pull request #32625: [SPARK-35486][CORE] TaskMemoryManager: retry if other task takes memory freed by partial self-spill
ankurdave commented on pull request #32625: URL: https://github.com/apache/spark/pull/32625#issuecomment-846689070 Fine with me to consider this for master only.
[GitHub] [spark] SparkQA commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846688411 **[Test build #138855 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138855/testReport)** for PR 32602 at commit [`f7a14cf`](https://github.com/apache/spark/commit/f7a14cf49ac27336195dbcb6285b2b2e5f96df38).
[GitHub] [spark] dongjoon-hyun commented on pull request #32252: [SPARK-35094][SQL]Spark from_json(JsonToStruct) function return wrong value in permissive mode
dongjoon-hyun commented on pull request #32252: URL: https://github.com/apache/spark/pull/32252#issuecomment-846687927 Thank you so much for the confirmation, @HyukjinKwon !
[GitHub] [spark] HeartSaVioR edited a comment on pull request #32136: [SPARK-35022][CORE] Task Scheduling Plugin in Spark
HeartSaVioR edited a comment on pull request #32136: URL: https://github.com/apache/spark/pull/32136#issuecomment-846687139 My major point is about the characteristic of the checkpoint location. We require the checkpoint location to be "fault-tolerant" including hardware failures (local storage doesn't make sense here), and to provide "high availability" by itself, so that Spark can delegate such complexity to the checkpoint location. For sure, such a requirement forces the underlying file system to be heavyweight and non-trivial to maintain, but IMHO that's not enough of a reason to take the complexity back into Spark, because: 1. Spark is already quite complicated Nowadays it's quite hard to bring in major changes without affecting existing behaviors. Personally I'd like to see the mainline of Apache Spark as ensuring the majority's demands, not everyone's demands. 2. The industry on file systems (or alike) also makes improvements Many end users were trying to deal with checkpoint locations in object stores despite the eventual consistency of S3, and it even became strongly consistent. Object stores in Azure had been providing strong consistency if I understand correctly. Not 100% sure of GCS, but I heard they claim strong consistency. HDFS has been exposing several shortcomings so far, but the community is also making improvements like Apache Ozone. 3. The Spark community didn't even try to optimize this path I'd interpret the reasons as two-fold: A. The majority of real-world workloads are working well with current technology B. Some workloads don't work well, but there is no strong demand on this as possible issues are tolerable I think we still haven't struggled with this enough. There are still spots to reduce the latency, like I did in optimizing the WAL commit phase (#31495). I'd like to put my efforts into helping the majority's use cases. > Technically, PVC is kinds of abstract way to look at the volume mounted on container running executor. It could be local storage on nodes on k8s. It depends where the PVC is bound to. PVC is not a kind of abstraction which "guarantees" a fault-tolerant file system, so it still has to depend on the actual file system under the hood, and also on the guarantees of the interface accessing the file system. I imagine the operational costs of making PVC guarantee such requirements would be non-trivial as well. > Using PVC as checkpoint could be huge relief on the loading of HDFS. There are also others like better latency, simplified streaming architecture. I'd be happy to see the overall system design and the result of a POC. Let's continue the talk about PVC once we get the details.
[GitHub] [spark] dongjoon-hyun commented on pull request #32642: [SPARK-35495][R] Change SparkR maintainer for CRAN
dongjoon-hyun commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846686579 Merged to master/3.1/3.0.
[GitHub] [spark] AmplabJenkins commented on pull request #32642: [SPARK-35495][R] Change SparkR maintainer for CRAN
AmplabJenkins commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846686263 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43373/
[GitHub] [spark] AmplabJenkins commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
AmplabJenkins commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846686261 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138854/
[GitHub] [spark] AmplabJenkins commented on pull request #32599: [SPARK-35320][SQL] Fix from_json maps parser: Only StringType is supported for keys.
AmplabJenkins commented on pull request #32599: URL: https://github.com/apache/spark/pull/32599#issuecomment-846686262 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/43375/
[GitHub] [spark] dongjoon-hyun commented on a change in pull request #32625: [SPARK-35486][CORE] TaskMemoryManager: retry if other task takes memory freed by partial self-spill
dongjoon-hyun commented on a change in pull request #32625: URL: https://github.com/apache/spark/pull/32625#discussion_r637647625 ## File path: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ## @@ -202,14 +202,22 @@ public long acquireExecutionMemory(long required, MemoryConsumer consumer) { } } - // call spill() on itself - if (got < required) { + // Attempt to free up memory by self-spilling. + // + // When our spill handler releases memory, `ExecutionMemoryPool#releaseMemory()` will + // immediately notify other tasks that memory has been freed, and they may acquire the + // newly-freed memory before we have a chance to do so (SPARK-35486). In that case, we will + // try again in the next loop iteration. + while (got < required) { Review comment: Thank you for the explanation.
[GitHub] [spark] SparkQA commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846683680 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43376/
[GitHub] [spark] dongjoon-hyun closed pull request #32642: [SPARK-35495][R] Change SparkR maintainer for CRAN
dongjoon-hyun closed pull request #32642: URL: https://github.com/apache/spark/pull/32642
[GitHub] [spark] ankurdave commented on a change in pull request #32625: [SPARK-35486][CORE] TaskMemoryManager: retry if other task takes memory freed by partial self-spill
ankurdave commented on a change in pull request #32625: URL: https://github.com/apache/spark/pull/32625#discussion_r637640349 ## File path: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ## @@ -202,14 +202,22 @@ public long acquireExecutionMemory(long required, MemoryConsumer consumer) { } } - // call spill() on itself - if (got < required) { + // Attempt to free up memory by self-spilling. + // + // When our spill handler releases memory, `ExecutionMemoryPool#releaseMemory()` will + // immediately notify other tasks that memory has been freed, and they may acquire the + // newly-freed memory before we have a chance to do so (SPARK-35486). In that case, we will + // try again in the next loop iteration. + while (got < required) { Review comment: Good question. The worst case occurs when waiting tasks take all the memory released by `consumer.spill()`. The specific number of iterations depends on the behavior of `consumer.spill()`: - If `consumer.spill()` releases memory promptly when requested to, as all current implementations do, then this can iterate up to `consumer.getUsed() / (required - got) + 2` times. In all but the last two iterations, `consumer` spills `required - got` bytes; in the second-to-last iteration, `consumer` spills any remaining bytes; and in the last iteration, `consumer` spills 0 bytes. - If `consumer.spill()` trickles out memory, for example by releasing only 1 byte per call, then there can be up to `consumer.getUsed() + 1` iterations. In all but the last iteration, `consumer` spills 1 byte, and in the last iteration, `consumer` spills 0 bytes. This is the same worst-case behavior as the `while (!sortedConsumers.isEmpty())` loop above.
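The worst-case bound discussed in the review above can be checked with a small standalone simulation. The sketch below is hypothetical and is not Spark's actual `TaskMemoryManager`: it models only the "prompt spill" case, where each `spill()` call releases `min(requested, remaining)` bytes but waiting tasks immediately steal everything released, so the requesting task's `got` never grows and the loop only terminates once the consumer has nothing left to spill.

```java
// Hypothetical standalone model of the retry loop discussed above (SPARK-35486).
// It is NOT Spark's TaskMemoryManager; it just counts loop iterations in the
// worst case where other waiting tasks acquire every byte the consumer spills.
public class SelfSpillRetrySketch {

    /**
     * Counts iterations of `while (got < required)` for a consumer that spills
     * promptly (releases min(requested, remaining) bytes per call) while other
     * tasks steal all released memory, so `got` stays at 0 throughout.
     */
    static int worstCaseIterations(long required, long consumerUsed) {
        long got = 0;                  // bytes acquired by this task
        long remaining = consumerUsed; // bytes the consumer can still spill
        int iterations = 0;
        while (got < required) {
            iterations++;
            long released = Math.min(required - got, remaining);
            remaining -= released;
            if (released == 0) {
                break; // consumer is exhausted: it spilled 0 bytes, so give up
            }
            // Worst case: waiting tasks grab all `released` bytes before we
            // can, so `got` is unchanged and we retry on the next iteration.
        }
        return iterations;
    }

    public static void main(String[] args) {
        // Matches the bound consumerUsed / (required - got) + 2 from the
        // review discussion: 250 / 100 + 2 = 4 iterations.
        System.out.println(worstCaseIterations(100, 250)); // prints 4
    }
}
```

Under this model the iteration count stays within the `consumer.getUsed() / (required - got) + 2` bound quoted in the comment; the "trickle" case (1 byte per spill) would replace the `Math.min` line with a constant 1 and reach the `consumer.getUsed() + 1` bound instead.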
[GitHub] [spark] SparkQA removed a comment on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA removed a comment on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846670009 **[Test build #138854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138854/testReport)** for PR 32602 at commit [`e26df96`](https://github.com/apache/spark/commit/e26df96b72806ace84c72e1f134278e2b20b4457).
[GitHub] [spark] SparkQA commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846679533 **[Test build #138854 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138854/testReport)** for PR 32602 at commit [`e26df96`](https://github.com/apache/spark/commit/e26df96b72806ace84c72e1f134278e2b20b4457). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] maropu commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
maropu commented on a change in pull request #32303: URL: https://github.com/apache/spark/pull/32303#discussion_r637630300 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -2395,6 +2421,70 @@ class Analyzer(override val catalogManager: CatalogManager) } } + /** + * This rule resolves lateral joins. + */ + object ResolveLateralJoin extends Rule[LogicalPlan] { Review comment: Q: Btw, is it difficult to merge this resolution logic into the `ResolveReferences` one?
[GitHub] [spark] maropu commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
maropu commented on a change in pull request #32303: URL: https://github.com/apache/spark/pull/32303#discussion_r637626356 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -1607,33 +1622,30 @@ class Analyzer(override val catalogManager: CatalogManager) } } f1.copy(arguments = f1.arguments.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: Nil }) case c: CreateNamedStruct if containsStar(c.valExprs) => val newChildren = c.children.grouped(2).flatMap { -case Seq(k, s : Star) => CreateStruct(s.expand(child, resolver)).children +case Seq(k, s : Star) => CreateStruct(expand(s)).children case kv => kv } c.copy(children = newChildren.toList ) case c: CreateArray if containsStar(c.children) => c.copy(children = c.children.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: Nil }) case p: Murmur3Hash if containsStar(p.children) => p.copy(children = p.children.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: Nil }) case p: XxHash64 if containsStar(p.children) => p.copy(children = p.children.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: Nil }) -// count(*) has been replaced by count(1) -case o if containsStar(o.children) => - throw QueryCompilationErrors.invalidStarUsageError(s"expression '${o.prettyName}'") Review comment: Was this error handling moved into `CheckAnalysis` https://github.com/apache/spark/pull/32303/files#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR177-R179 ?
[GitHub] [spark] maropu commented on a change in pull request #32303: [SPARK-34382][SQL] Support LATERAL subqueries
maropu commented on a change in pull request #32303: URL: https://github.com/apache/spark/pull/32303#discussion_r637627094 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -1663,6 +1675,27 @@ class Analyzer(override val catalogManager: CatalogManager) } } + /** + * Optionally resolve the name parts using the outer query plan and wrap resolved attributes + * with [[OuterReference]]s. + */ + private def resolveOuterReference( Review comment: nit: `tryResolveOuterReference`? ## File path: sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 ## @@ -643,12 +643,12 @@ setQuantifier ; relation -: relationPrimary joinRelation* +: LATERAL? relationPrimary joinRelation* ; joinRelation -: (joinType) JOIN right=relationPrimary joinCriteria? -| NATURAL joinType JOIN right=relationPrimary +: (joinType) JOIN LATERAL? right=relationPrimary joinCriteria? +| NATURAL joinType JOIN LATERAL? right=relationPrimary Review comment: Oh, much cleaner than the previous one. Nice. ## File path: sql/core/src/test/resources/sql-tests/inputs/join-lateral.sql ## @@ -0,0 +1,97 @@ +-- Test cases for lateral join + +CREATE VIEW t1(c1, c2) AS VALUES (0, 1), (1, 2); +CREATE VIEW t2(c1, c2) AS VALUES (0, 2), (0, 3); + +-- lateral join with single column select +SELECT * FROM t1, LATERAL (SELECT c1); +SELECT * FROM t1, LATERAL (SELECT c1 FROM t2); +SELECT * FROM t1, LATERAL (SELECT t1.c1 FROM t2); +SELECT * FROM t1, LATERAL (SELECT t1.c1 + t2.c1 FROM t2); + +-- lateral join with star expansion Review comment: Could you add tests for `quotedRegexColumnNames`, too? 
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/joins.scala ## @@ -168,6 +168,21 @@ object EliminateOuterJoin extends Rule[LogicalPlan] with PredicateHelper { } } +/** + * Rewrite lateral joins by rewriting all dependent joins (if any) inside the right + * sub-tree of the lateral join and converting the lateral join into a base join type. + */ +object RewriteLateralJoin extends Rule[LogicalPlan] with PredicateHelper { + + def apply(plan: LogicalPlan): LogicalPlan = plan transformUp { +case j @ Join(left, right, LateralJoin(joinType), condition, _) => + val conditions = condition.map(splitConjunctivePredicates).getOrElse(Nil) + val newRight = DecorrelateInnerQuery.rewriteDomainJoins(left, right, conditions) + // TODO: handle the COUNT bug Review comment: Can you add the jira number (SPARK-15370) here? ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -1607,33 +1622,30 @@ class Analyzer(override val catalogManager: CatalogManager) } } f1.copy(arguments = f1.arguments.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: Nil }) case c: CreateNamedStruct if containsStar(c.valExprs) => val newChildren = c.children.grouped(2).flatMap { -case Seq(k, s : Star) => CreateStruct(s.expand(child, resolver)).children +case Seq(k, s : Star) => CreateStruct(expand(s)).children case kv => kv } c.copy(children = newChildren.toList ) case c: CreateArray if containsStar(c.children) => c.copy(children = c.children.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: Nil }) case p: Murmur3Hash if containsStar(p.children) => p.copy(children = p.children.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: Nil }) case p: XxHash64 if containsStar(p.children) => p.copy(children = p.children.flatMap { -case s: Star => s.expand(child, resolver) +case s: Star => expand(s) case o => o :: 
Nil }) -// count(*) has been replaced by count(1) -case o if containsStar(o.children) => - throw QueryCompilationErrors.invalidStarUsageError(s"expression '${o.prettyName}'") Review comment: Was this error handling moved into `CheckAnalysis` https://github.com/apache/spark/pull/32303/files#diff-583171e935b2dc349378063a5841c5b98b30a2d57ac3743a9eccfe7bffcb8f2aR177-R179 ? ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala ## @@ -276,6 +276,7 @@ class Analyzer(override val catalogManager: CatalogManager) ResolveAliases :: ResolveSubquery :: ResolveSubqueryColumnAliases :: + ResolveLateralJoin :: Review comment: Since `ResolveLateralJoin` has a
[GitHub] [spark] ankurdave commented on a change in pull request #32625: [SPARK-35486][CORE] TaskMemoryManager: retry if other task takes memory freed by partial self-spill
ankurdave commented on a change in pull request #32625: URL: https://github.com/apache/spark/pull/32625#discussion_r637640349 ## File path: core/src/main/java/org/apache/spark/memory/TaskMemoryManager.java ## @@ -202,14 +202,22 @@ public long acquireExecutionMemory(long required, MemoryConsumer consumer) { } } - // call spill() on itself - if (got < required) { + // Attempt to free up memory by self-spilling. + // + // When our spill handler releases memory, `ExecutionMemoryPool#releaseMemory()` will + // immediately notify other tasks that memory has been freed, and they may acquire the + // newly-freed memory before we have a chance to do so (SPARK-35486). In that case, we will + // try again in the next loop iteration. + while (got < required) { Review comment: Good question. The worst case occurs when waiting tasks take all the memory released by `consumer.spill()`. The specific number of iterations depends on the behavior of `consumer.spill()`: - If `consumer.spill()` releases memory promptly when requested to, as all current implementations do, then this can iterate up to `consumer.getUsed() / (required - got) + 1` times. In all but the last iteration, `consumer` spills `required - got` bytes, and in the last iteration, `consumer` spills 0 bytes. - If `consumer.spill()` trickles out memory, for example by releasing only 1 byte per call, then there can be up to `consumer.getUsed() + 1` iterations. In all but the last iteration, `consumer` spills 1 byte, and in the last iteration, `consumer` spills 0 bytes. This is the same worst-case behavior as the `while (!sortedConsumers.isEmpty())` loop above.
[GitHub] [spark] SparkQA commented on pull request #32599: [SPARK-35320][SQL] Fix from_json maps parser: Only StringType is supported for keys.
SparkQA commented on pull request #32599: URL: https://github.com/apache/spark/pull/32599#issuecomment-846675354 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43375/
[GitHub] [spark] SparkQA commented on pull request #32642: [R][RELEASE] change SparkR maintainer for CRAN
SparkQA commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846675040 Kubernetes integration test status success URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43373/
[GitHub] [spark] SparkQA commented on pull request #32602: [SPARK-35455][SQL] Enhance EliminateUnnecessaryJoin
SparkQA commented on pull request #32602: URL: https://github.com/apache/spark/pull/32602#issuecomment-846670009 **[Test build #138854 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138854/testReport)** for PR 32602 at commit [`e26df96`](https://github.com/apache/spark/commit/e26df96b72806ace84c72e1f134278e2b20b4457).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32642: [R][RELEASE] change SparkR maintainer for CRAN
AmplabJenkins removed a comment on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846669848 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138853/
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32641: [R][RELEASE] change SparkR maintainer for CRAN
AmplabJenkins removed a comment on pull request #32641: URL: https://github.com/apache/spark/pull/32641#issuecomment-846669847 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138851/
[GitHub] [spark] AmplabJenkins commented on pull request #32641: [R][RELEASE] change SparkR maintainer for CRAN
AmplabJenkins commented on pull request #32641: URL: https://github.com/apache/spark/pull/32641#issuecomment-846669847 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138851/
[GitHub] [spark] AmplabJenkins commented on pull request #32642: [R][RELEASE] change SparkR maintainer for CRAN
AmplabJenkins commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846669848 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138853/
[GitHub] [spark] SparkQA commented on pull request #32599: [SPARK-35320][SQL] Fix from_json maps parser: Only StringType is supported for keys.
SparkQA commented on pull request #32599: URL: https://github.com/apache/spark/pull/32599#issuecomment-84547 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43375/
[GitHub] [spark] SparkQA commented on pull request #32642: [R][RELEASE] change SparkR maintainer for CRAN
SparkQA commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-84410 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/43373/
[GitHub] [spark] SparkQA removed a comment on pull request #32642: [R][RELEASE] change SparkR maintainer for CRAN
SparkQA removed a comment on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846653085 **[Test build #138853 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138853/testReport)** for PR 32642 at commit [`30cdcee`](https://github.com/apache/spark/commit/30cdceec4189b836784b4e5e17e9bb47ecc5aa41).
[GitHub] [spark] SparkQA commented on pull request #32642: [R][RELEASE] change SparkR maintainer for CRAN
SparkQA commented on pull request #32642: URL: https://github.com/apache/spark/pull/32642#issuecomment-846662415 **[Test build #138853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138853/testReport)** for PR 32642 at commit [`30cdcee`](https://github.com/apache/spark/commit/30cdceec4189b836784b4e5e17e9bb47ecc5aa41). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #32641: [R][RELEASE] change SparkR maintainer for CRAN
SparkQA removed a comment on pull request #32641: URL: https://github.com/apache/spark/pull/32641#issuecomment-846652261 **[Test build #138851 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138851/testReport)** for PR 32641 at commit [`5dbe3bb`](https://github.com/apache/spark/commit/5dbe3bbc5d29e2ba6caca022c37eb0b5a2f13202).
[GitHub] [spark] SparkQA commented on pull request #32641: [R][RELEASE] change SparkR maintainer for CRAN
SparkQA commented on pull request #32641: URL: https://github.com/apache/spark/pull/32641#issuecomment-846661384 **[Test build #138851 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138851/testReport)** for PR 32641 at commit [`5dbe3bb`](https://github.com/apache/spark/commit/5dbe3bbc5d29e2ba6caca022c37eb0b5a2f13202). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.