[GitHub] [spark] c21 commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
c21 commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683245934

> did you observe any patterns or heuristics on your workloads where repartition is preferred?

From our side, honestly, we don't currently have any automation for deciding between coalesce and repartition. We provide configs similar to the ones here so that users can control coalesce vs. repartition themselves. I think a rule of thumb is that we don't want to:

(1) coalesce, if the coalesced table is too big and the number of coalesced buckets is too small: each task then has too much data and takes more time.
(2) repartition, if the repartitioned table is too big and the number of repartitioned buckets is too large: too much duplicated data is then read, which costs much more CPU/IO (and might be worse than just shuffling this table).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
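The rule of thumb in this comment could be encoded roughly as follows. This is only a sketch under stated assumptions, not anything from the PR: the function name `choose_bucket_read_strategy`, its size inputs, both thresholds, and the preference for coalescing when both options look acceptable are all hypothetical. It also assumes that repartitioning reads the side with fewer buckets once per target bucket it feeds, so the read amplification equals the bucket ratio.

```python
GIB = 1 << 30


def choose_bucket_read_strategy(
    few_buckets: int,
    many_buckets: int,
    few_side_bytes: int,
    many_side_bytes: int,
    max_bytes_per_task: int = 1 * GIB,      # hypothetical threshold
    max_total_read_bytes: int = 64 * GIB,   # hypothetical threshold
) -> str:
    """Return "COALESCE", "REPARTITION" or "OFF" for a bucketed join.

    few_side_bytes / many_side_bytes are the table sizes of the side with
    fewer / more buckets.
    """
    # Nothing to do unless the bucket counts differ and one divides the other.
    if few_buckets <= 0 or many_buckets <= few_buckets:
        return "OFF"
    if many_buckets % few_buckets != 0:
        return "OFF"

    ratio = many_buckets // few_buckets

    # (1) Coalescing reads the many-bucket side with only `few_buckets` tasks,
    # so each task gets a larger slice of that table.
    bytes_per_task = many_side_bytes / few_buckets
    if bytes_per_task <= max_bytes_per_task:
        return "COALESCE"

    # (2) Repartitioning (assumption) re-reads the few-bucket side `ratio`
    # times, so the total bytes read grow with the bucket ratio.
    if few_side_bytes * ratio <= max_total_read_bytes:
        return "REPARTITION"

    # Both options look too expensive: fall back to a regular shuffle.
    return "OFF"
```

For example, with 4 and 16 buckets, a 2 GiB many-bucket table coalesces comfortably under these thresholds, whereas a 100 GiB one would leave 25 GiB per task and so falls through to the repartition check.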
[GitHub] [spark] c21 commented on a change in pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
c21 commented on a change in pull request #29473:
URL: https://github.com/apache/spark/pull/29473#discussion_r479612474

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceOrRepartitionBucketsInJoin.scala ##
@@ -27,45 +27,48 @@ import org.apache.spark.sql.catalyst.rules.Rule
 import org.apache.spark.sql.execution.{FileSourceScanExec, FilterExec, ProjectExec, SparkPlan}
 import org.apache.spark.sql.execution.joins.{BaseJoinExec, ShuffledHashJoinExec, SortMergeJoinExec}
 import org.apache.spark.sql.internal.SQLConf
+import org.apache.spark.sql.internal.SQLConf.BucketReadStrategyInJoin
 
 /**
- * This rule coalesces one side of the `SortMergeJoin` and `ShuffledHashJoin`
+ * This rule coalesces or repartitions one side of the `SortMergeJoin` and `ShuffledHashJoin`
  * if the following conditions are met:
  *   - Two bucketed tables are joined.
  *   - Join keys match with output partition expressions on their respective sides.
  *   - The larger bucket number is divisible by the smaller bucket number.
- *   - COALESCE_BUCKETS_IN_JOIN_ENABLED is set to true.
  *   - The ratio of the number of buckets is less than the value set in
- *     COALESCE_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO.
+ *     COALESCE_OR_REPARTITION_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO.

Review comment:
nit: shouldn't it be `BUCKET_READ_STRATEGY_IN_JOIN_MAX_BUCKET_RATIO`?

## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ##
@@ -2655,24 +2655,34 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
-  val COALESCE_BUCKETS_IN_JOIN_ENABLED =
-    buildConf("spark.sql.bucketing.coalesceBucketsInJoin.enabled")
-      .doc("When true, if two bucketed tables with the different number of buckets are joined, " +
-        "the side with a bigger number of buckets will be coalesced to have the same number " +
-        "of buckets as the other side. Bigger number of buckets is divisible by the smaller " +
-        "number of buckets. Bucket coalescing is applied to sort-merge joins and " +
-        "shuffled hash join. Note: Coalescing bucketed table can avoid unnecessary shuffling " +
-        "in join, but it also reduces parallelism and could possibly cause OOM for " +
-        "shuffled hash join.")
-      .version("3.1.0")
-      .booleanConf
-      .createWithDefault(false)
+  object BucketReadStrategyInJoin extends Enumeration {
+    val COALESCE, REPARTITION, OFF = Value
+  }
 
-  val COALESCE_BUCKETS_IN_JOIN_MAX_BUCKET_RATIO =
-    buildConf("spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio")
-      .doc("The ratio of the number of two buckets being coalesced should be less than or " +
-        "equal to this value for bucket coalescing to be applied. This configuration only " +
-        s"has an effect when '${COALESCE_BUCKETS_IN_JOIN_ENABLED.key}' is set to true.")
+  val BUCKET_READ_STRATEGY_IN_JOIN =
+    buildConf("spark.sql.bucketing.bucketReadStrategyInJoin")
+      .doc("When set to COALESCE, if two bucketed tables with the different number of buckets " +

Review comment:
nit: shall we first mention that the allowed values are "one of COALESCE, REPARTITION, OFF"? Users might not follow exactly after the long description here. It is also probably worth mentioning that the default is "OFF", in which case we neither coalesce nor repartition.

## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala ##
@@ -548,22 +560,42 @@ case class FileSourceScanExec(
       filesGroupedToBuckets
     }
 
-    val filePartitions = optionalNumCoalescedBuckets.map { numCoalescedBuckets =>
-      logInfo(s"Coalescing to ${numCoalescedBuckets} buckets")
-      val coalescedBuckets = prunedFilesGroupedToBuckets.groupBy(_._1 % numCoalescedBuckets)
-      Seq.tabulate(numCoalescedBuckets) { bucketId =>
-        val partitionedFiles = coalescedBuckets.get(bucketId).map {
-          _.values.flatten.toArray
-        }.getOrElse(Array.empty)
-        FilePartition(bucketId, partitionedFiles)
-      }
-    }.getOrElse {
-      Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
+    if (optionalNewNumBuckets.isEmpty) {
+      val filePartitions = Seq.tabulate(bucketSpec.numBuckets) { bucketId =>
         FilePartition(bucketId, prunedFilesGroupedToBuckets.getOrElse(bucketId, Array.empty))
       }
+      new FileScanRDD(fsRelation.sparkSession, readFile, filePartitions)
+    } else {
+      val newNumBuckets = optionalNewNumBuckets.get
+      if (newNumBuckets < bucketSpec.numBuckets) {
+        assert(bucketSpec.numBuckets % newNumBuckets == 0)
+        logInfo(s"Coalescing to $newNumBuckets buckets from ${bucketSpec.numBuckets} buckets")
+        val coalescedBuckets = prunedFilesGroupedToBuckets.groupBy(_._1 % newNumBuckets)
+        val filePartitions = Seq.tabulate(newNumBuckets) { bucketId =>
+          val partitionedFiles = coalescedBuckets
+            .get(bucketId)
+
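The coalescing branch in the diff above groups bucket files with `groupBy(_._1 % newNumBuckets)`. A minimal Python sketch of that grouping is shown here for illustration; the function name `coalesce_buckets` and the dict-of-lists input are assumptions, not Spark APIs. The modulo works because, when the new bucket count divides the old one, `h % new == (h % old) % new` for any non-negative hash `h`, so rows that would share a bucket under the smaller count end up in the same coalesced partition.

```python
from collections import defaultdict


def coalesce_buckets(files_by_bucket, num_buckets, new_num_buckets):
    """Merge bucket files so that `num_buckets` buckets become `new_num_buckets`.

    files_by_bucket maps an original bucket id to its list of files; buckets
    with no files may be missing (for example after bucket pruning).
    """
    if new_num_buckets <= 0 or num_buckets % new_num_buckets != 0:
        raise ValueError("new_num_buckets must evenly divide num_buckets")

    coalesced = defaultdict(list)
    for bucket_id in sorted(files_by_bucket):
        # Original bucket b contributes to coalesced bucket b % new_num_buckets.
        coalesced[bucket_id % new_num_buckets].extend(files_by_bucket[bucket_id])

    # One (possibly empty) partition per coalesced bucket id, in order.
    return [coalesced.get(i, []) for i in range(new_num_buckets)]
```

As in the Scala code, the result always has one entry per target bucket, with an empty list standing in for buckets that have no files.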
[GitHub] [spark] HyukjinKwon edited a comment on pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
HyukjinKwon edited a comment on pull request #28968:
URL: https://github.com/apache/spark/pull/28968#issuecomment-683241944

Thank you for taking a look @viirya.
[GitHub] [spark] HyukjinKwon commented on a change in pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
HyukjinKwon commented on a change in pull request #28968:
URL: https://github.com/apache/spark/pull/28968#discussion_r479612214

## File path: python/pyspark/util.py ##
@@ -114,6 +117,64 @@ def _parse_memory(s):
         raise ValueError("invalid format: " + s)
     return int(float(s[:-1]) * units[s[-1].lower()])
 
+
+class InheritableThread(threading.Thread):
+    """
+    Thread that is recommended to be used in PySpark instead of :class:`threading.Thread`
+    when the pinned thread mode is enabled. The usage of this class is exactly same as
+    :class:`threading.Thread` but correctly inherits the inheritable properties specific
+    to JVM thread such as ``InheritableThreadLocal``.
+
+    Also, note that pinned thread mode does not close the connection from Python
+    to JVM when the thread is finished in the Python side. With this class, Python
+    garbage-collects the Python thread instance and also closes the connection
+    which finishes JVM thread correctly.
+
+    When the pinned thread mode is off, this works as :class:`threading.Thread`.
+
+    .. note:: Experimental
+
+    .. versionadded:: 3.1.0
+    """
+    def __init__(self, target, *args, **kwargs):
+        from pyspark import SparkContext
+
+        sc = SparkContext._active_spark_context
+
+        if isinstance(sc._gateway, ClientServer):
+            # Here's when the pinned-thread mode (PYSPARK_PIN_THREAD) is on.
+            properties = sc._jsc.sc().getLocalProperties().clone()

Review comment:
Actually, we're mimicking that behaviour here: the JVM thread does not respect the inheritance because it is always created separately via the JVM gateway, whereas on the Scala/Java side we can keep the inheritance by creating a thread within a thread.
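The idea of mimicking inheritance by hand can be illustrated in plain Python, without Spark or Py4J. The sketch below is not the PySpark implementation: `get_local_properties` and `PropertyInheritingThread` are hypothetical names, and a `threading.local` dict stands in for the JVM-side local properties. The parent's properties are copied when the thread object is constructed (which happens in the parent thread) and installed when the child starts running.

```python
import threading

# Stand-in for per-thread local properties (hypothetical, not a Spark API).
_local = threading.local()


def get_local_properties():
    """Return the calling thread's own property dict, creating it on first use."""
    if not hasattr(_local, "props"):
        _local.props = {}
    return _local.props


class PropertyInheritingThread(threading.Thread):
    """Thread whose target starts with a copy of the creator's properties."""

    def __init__(self, target, *args, **kwargs):
        # __init__ runs in the parent thread, so this snapshots the parent's
        # properties; copying keeps later parent changes away from the child.
        inherited = dict(get_local_properties())

        def run_with_inherited_properties(*a, **k):
            # This runs in the child thread: install the snapshot first.
            _local.props = inherited
            return target(*a, **k)

        super().__init__(target=run_with_inherited_properties, *args, **kwargs)
```

A plain `threading.Thread` started from the same parent would see an empty property dict, which is the gap the copy-at-construction step closes.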
[GitHub] [spark] HyukjinKwon commented on pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
HyukjinKwon commented on pull request #28968:
URL: https://github.com/apache/spark/pull/28968#issuecomment-683241584

Oh yeah, we should use `InheritableThread` instead of `Thread` to verify this PR.
[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
SparkQA commented on pull request #29473:
URL: https://github.com/apache/spark/pull/29473#issuecomment-683241185

**[Test build #128011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128011/testReport)** for PR 29473 at commit [`e2374ac`](https://github.com/apache/spark/commit/e2374ac281bbcb23c0dc49786ce7d8148f9761bd).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] jiangxb1987 commented on a change in pull request #29228: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
jiangxb1987 commented on a change in pull request #29228:
URL: https://github.com/apache/spark/pull/29228#discussion_r479608346

## File path: core/src/test/scala/org/apache/spark/LocalSC.scala ##
@@ -0,0 +1,99 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark
+
+import _root_.io.netty.util.internal.logging.{InternalLoggerFactory, Slf4JLoggerFactory}
+import org.scalatest.BeforeAndAfterAll
+import org.scalatest.BeforeAndAfterEach
+import org.scalatest.Suite
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.resource.ResourceProfile
+
+/**
+ * Manages a local `sc` `SparkContext` variable, correctly stopping it after each test.
+ *
+ * Note: this class is a copy of [[LocalSparkContext]]. Why copy it? Reduce conflict. Because
+ * many test suites use [[LocalSparkContext]] and overwrite some variable or function (e.g.
+ * sc of LocalSparkContext), there occurs conflict when we refactor the `sc` as a new function.
+ * After migrating all test suites that use [[LocalSparkContext]] to use [[LocalSC]], we will
+ * delete the original [[LocalSparkContext]] and rename [[LocalSC]] to [[LocalSparkContext]].
+ */
+trait LocalSC extends BeforeAndAfterEach

Review comment:
Since this class is only used temporarily, can we name it `TempLocalSparkContext`? TBH I don't like the `SC` name, which is very vague to me.
[GitHub] [spark] jiangxb1987 commented on pull request #29228: [SPARK-31847][CORE][TESTS] DAGSchedulerSuite: Rewrite the test framework to support apply specified spark configurations.
jiangxb1987 commented on pull request #29228:
URL: https://github.com/apache/spark/pull/29228#issuecomment-683237385

LGTM otherwise
[GitHub] [spark] SparkQA commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property
SparkQA commented on pull request #29575:
URL: https://github.com/apache/spark/pull/29575#issuecomment-683235988

**[Test build #128008 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128008/testReport)** for PR 29575 at commit [`141c8f3`](https://github.com/apache/spark/commit/141c8f3eecea97ecc75ee02806566a9f29f41af8).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] viirya commented on pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
viirya commented on pull request #28968:
URL: https://github.com/apache/spark/pull/28968#issuecomment-683235995

I found I had missed this and have looked at it now. LGTM. I'm just wondering whether we should use `InheritableThread` in the PR description to verify the fix?

```python
>>> from threading import Thread
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([, , , , ])
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([, , , , , ])
```
[GitHub] [spark] viirya commented on a change in pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
viirya commented on a change in pull request #28968:
URL: https://github.com/apache/spark/pull/28968#discussion_r479605991

## File path: python/pyspark/util.py ##
@@ -114,6 +117,64 @@ def _parse_memory(s):
         raise ValueError("invalid format: " + s)
     return int(float(s[:-1]) * units[s[-1].lower()])
 
+
+class InheritableThread(threading.Thread):
+    """
+    Thread that is recommended to be used in PySpark instead of :class:`threading.Thread`
+    when the pinned thread mode is enabled. The usage of this class is exactly same as
+    :class:`threading.Thread` but correctly inherits the inheritable properties specific
+    to JVM thread such as ``InheritableThreadLocal``.
+
+    Also, note that pinned thread mode does not close the connection from Python
+    to JVM when the thread is finished in the Python side. With this class, Python
+    garbage-collects the Python thread instance and also closes the connection
+    which finishes JVM thread correctly.
+
+    When the pinned thread mode is off, this works as :class:`threading.Thread`.
+
+    .. note:: Experimental
+
+    .. versionadded:: 3.1.0
+    """
+    def __init__(self, target, *args, **kwargs):
+        from pyspark import SparkContext
+
+        sc = SparkContext._active_spark_context
+
+        if isinstance(sc._gateway, ClientServer):
+            # Here's when the pinned-thread mode (PYSPARK_PIN_THREAD) is on.
+            properties = sc._jsc.sc().getLocalProperties().clone()

Review comment:
Why do we need to `clone`? Isn't `sc.localProperties` already cloned in `childValue`?
[GitHub] [spark] viirya commented on a change in pull request #28968: [SPARK-32010][PYTHON][CORE] Add InheritableThread for local properties and fixing a thread leak issue in pinned thread mode
viirya commented on a change in pull request #28968:
URL: https://github.com/apache/spark/pull/28968#discussion_r479604890

## File path: docs/job-scheduling.md ##
@@ -297,11 +297,9 @@ via `sc.setJobGroup` in a separate PVM thread, which also disallows to cancel th
 later.
 
 In order to synchronize PVM threads with JVM threads, you should set `PYSPARK_PIN_THREAD` environment variable
-to `true`. This pinned thread mode allows one PVM thread has one corresponding JVM thread.
-
-However, currently it cannot inherit the local properties from the parent thread although it isolates
-each thread with its own local properties. To work around this, you should manually copy and set the
-local properties from the parent thread to the child thread when you create another thread in PVM.
+to `true`. This pinned thread mode allows one PVM thread has one corresponding JVM thread. With this mode,
+`pyspark.InheritableThread` is recommanded to use together for a PVM thread to inherit the interitable attributes

Review comment:
typo: interitable -> inheritable
[GitHub] [spark] manuzhang commented on pull request #29540: [SPARK-32698][SQL] Do not fall back to default parallelism if the minimum number of coalesced partitions is not set in AQE
manuzhang commented on pull request #29540:
URL: https://github.com/apache/spark/pull/29540#issuecomment-683232962

@cloud-fan What I mean is `spark.sql.adaptive.coalescePartitions.initialPartitionNum` > `spark.default.parallelism` > `spark.sql.shuffle.partitions`
[GitHub] [spark] SparkQA commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
SparkQA commented on pull request #29576:
URL: https://github.com/apache/spark/pull/29576#issuecomment-683230650

**[Test build #128009 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128009/testReport)** for PR 29576 at commit [`d015266`](https://github.com/apache/spark/commit/d015266e1d275ff9cc2c1e75a11267079827bd3d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
SparkQA removed a comment on pull request #29576: URL: https://github.com/apache/spark/pull/29576#issuecomment-683201784 **[Test build #128009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128009/testReport)** for PR 29576 at commit [`d015266`](https://github.com/apache/spark/commit/d015266e1d275ff9cc2c1e75a11267079827bd3d). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs
AmplabJenkins removed a comment on pull request #29353: URL: https://github.com/apache/spark/pull/29353#issuecomment-683229295 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs
SparkQA removed a comment on pull request #29353: URL: https://github.com/apache/spark/pull/29353#issuecomment-683181113 **[Test build #128005 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128005/testReport)** for PR 29353 at commit [`d67ceed`](https://github.com/apache/spark/commit/d67ceed965fce5f56f2096032188c6fa9b3cfa5b).
[GitHub] [spark] AmplabJenkins commented on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs
AmplabJenkins commented on pull request #29353: URL: https://github.com/apache/spark/pull/29353#issuecomment-683229295
[GitHub] [spark] SparkQA commented on pull request #29353: [SPARK-32532][SQL] Improve ORC read/write performance on nested structs and array of structs
SparkQA commented on pull request #29353: URL: https://github.com/apache/spark/pull/29353#issuecomment-683229139 **[Test build #128005 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128005/testReport)** for PR 29353 at commit [`d67ceed`](https://github.com/apache/spark/commit/d67ceed965fce5f56f2096032188c6fa9b3cfa5b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles
AmplabJenkins removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683227764
[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles
AmplabJenkins commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683227764
[GitHub] [spark] SparkQA removed a comment on pull request #29574: Fix some R styles
SparkQA removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683203194 **[Test build #128010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128010/testReport)** for PR 29574 at commit [`f1b46f3`](https://github.com/apache/spark/commit/f1b46f30d7c7d237c5c16a71ada3456c44089adc).
[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles
SparkQA commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683227542 **[Test build #128010 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128010/testReport)** for PR 29574 at commit [`f1b46f3`](https://github.com/apache/spark/commit/f1b46f30d7c7d237c5c16a71ada3456c44089adc).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
AmplabJenkins removed a comment on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683227452
[GitHub] [spark] AmplabJenkins commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
AmplabJenkins commented on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683227452
[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
SparkQA commented on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683227251 **[Test build #128012 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128012/testReport)** for PR 29473 at commit [`7481e36`](https://github.com/apache/spark/commit/7481e36d8781e869a0dc558e0af5d358a56ab150).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles
AmplabJenkins removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683225470
[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles
AmplabJenkins commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683225470
[GitHub] [spark] SparkQA removed a comment on pull request #29574: Fix some R styles
SparkQA removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683197133 **[Test build #128007 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128007/testReport)** for PR 29574 at commit [`85fa83a`](https://github.com/apache/spark/commit/85fa83ad978fd0ae2b757fd2d72272dc54e3089b).
[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles
SparkQA commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683225203 **[Test build #128007 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128007/testReport)** for PR 29574 at commit [`85fa83a`](https://github.com/apache/spark/commit/85fa83ad978fd0ae2b757fd2d72272dc54e3089b).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats
AmplabJenkins removed a comment on pull request #29352: URL: https://github.com/apache/spark/pull/29352#issuecomment-683222745
[GitHub] [spark] AmplabJenkins commented on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats
AmplabJenkins commented on pull request #29352: URL: https://github.com/apache/spark/pull/29352#issuecomment-683222745
[GitHub] [spark] SparkQA removed a comment on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats
SparkQA removed a comment on pull request #29352: URL: https://github.com/apache/spark/pull/29352#issuecomment-683160455 **[Test build #128004 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128004/testReport)** for PR 29352 at commit [`bef3d35`](https://github.com/apache/spark/commit/bef3d357ecdbe3be4a468184b4917b540ff7625e).
[GitHub] [spark] SparkQA commented on pull request #29352: [SPARK-32531][SQL][TEST] Add benchmarks for nested structs and arrays for different file formats
SparkQA commented on pull request #29352: URL: https://github.com/apache/spark/pull/29352#issuecomment-683222409 **[Test build #128004 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128004/testReport)** for PR 29352 at commit [`bef3d35`](https://github.com/apache/spark/commit/bef3d357ecdbe3be4a468184b4917b540ff7625e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] sunchao commented on pull request #29567: [SPARK-32721][SQL] Simplify if clauses with null and boolean
sunchao commented on pull request #29567: URL: https://github.com/apache/spark/pull/29567#issuecomment-683219924 cc @dbtsai @viirya
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
AmplabJenkins removed a comment on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683219778
[GitHub] [spark] AmplabJenkins commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
AmplabJenkins commented on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683219778
[GitHub] [spark] sunchao commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on a change in pull request #29542: URL: https://github.com/apache/spark/pull/29542#discussion_r479594051
## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ##
@@ -92,60 +88,24 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
       throws IOException, InterruptedException {
     Configuration configuration = taskAttemptContext.getConfiguration();
-    ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+    FileSplit split = (FileSplit) inputSplit;
     this.file = split.getPath();
-    long[] rowGroupOffsets = split.getRowGroupOffsets();
-    ParquetMetadata footer;
-    List blocks;
-
-    // if task.side.metadata is set, rowGroupOffsets is null
-    if (rowGroupOffsets == null) {
-      // then we need to apply the predicate push down filter
-      footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-      MessageType fileSchema = footer.getFileMetaData().getSchema();
-      FilterCompat.Filter filter = getFilter(configuration);
-      blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
-    } else {
-      // otherwise we find the row groups that were selected on the client
-      footer = readFooter(configuration, file, NO_FILTER);

Review comment:
I think this path is never triggered. You can see below that we always construct `ParquetInputSplit` by initializing the `rowGroupOffsets` with null. The `rowGroupOffsets` is also deprecated along with the `ParquetInputSplit` class.
[GitHub] [spark] SparkQA removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
SparkQA removed a comment on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683158107 **[Test build #128003 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128003/testReport)** for PR 29473 at commit [`5665bc1`](https://github.com/apache/spark/commit/5665bc1107d6f9f06d1663a6f1ba8fa2ef5491e5).
[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
SparkQA commented on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683219393 **[Test build #128003 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128003/testReport)** for PR 29473 at commit [`5665bc1`](https://github.com/apache/spark/commit/5665bc1107d6f9f06d1663a6f1ba8fa2ef5491e5).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] sunchao commented on a change in pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types
sunchao commented on a change in pull request #29565: URL: https://github.com/apache/spark/pull/29565#discussion_r479593866
## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCast.scala ##
@@ -0,0 +1,214 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.catalyst.optimizer
+
+import org.apache.spark.sql.catalyst.expressions._
+import org.apache.spark.sql.catalyst.expressions.Literal.FalseLiteral
+import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
+import org.apache.spark.sql.catalyst.rules.Rule
+import org.apache.spark.sql.types._
+
+/**
+ * Unwrap casts in binary comparison operations with patterns like following:
+ *
+ * `BinaryComparison(Cast(fromExp, toType), Literal(value, toType))`
+ * or
+ * `BinaryComparison(Literal(value, toType), Cast(fromExp, toType))`
+ *
+ * This rule optimizes expressions with the above pattern by either replacing the cast with simpler
+ * constructs, or moving the cast from the expression side to the literal side, which enables them
+ * to be optimized away later and pushed down to data sources.
+ *
+ * Currently this only handles cases where `fromType` (of `fromExp`) and `toType` are of integral
+ * types (i.e., byte, short, int and long). The rule checks to see if the literal `value` is
+ * within range `(min, max)`, where `min` and `max` are the minimum and maximum value of
+ * `fromType`, respectively. If this is true then it means we can safely cast `value` to `fromType`
+ * and thus able to move the cast to the literal side.
+ *
+ * If the `value` is not within range `(min, max)`, the rule breaks the scenario into different
+ * cases and try to replace each with simpler constructs.
+ *
+ * if `value > max`, the cases are of following:
+ * - `cast(exp, ty) > value` ==> if(isnull(exp), null, false)
+ * - `cast(exp, ty) >= value` ==> if(isnull(exp), null, false)
+ * - `cast(exp, ty) === value` ==> if(isnull(exp), null, false)
+ * - `cast(exp, ty) <=> value` ==> false
+ * - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true)
+ * - `cast(exp, ty) < value` ==> if(isnull(exp), null, true)
+ *
+ * if `value == max`, the cases are of following:
+ * - `cast(exp, ty) > value` ==> if(isnull(exp), null, false)
+ * - `cast(exp, ty) >= value` ==> exp == max
+ * - `cast(exp, ty) === value` ==> exp == max
+ * - `cast(exp, ty) <=> value` ==> exp == max
+ * - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true)
+ * - `cast(exp, ty) < value` ==> exp =!= max
+ *
+ * Similarly for the cases when `value == min` and `value < min`.
+ *
+ * Further, the above `if(isnull(exp), null, false)` is represented using conjunction
+ * `and(isnull(exp), null)`, to enable further optimization and filter pushdown to data sources.
+ * Similarly, `if(isnull(exp), null, true)` is represented with `or(isnotnull(exp), null)`.
+ */
+object UnwrapCast extends Rule[LogicalPlan] {
+  override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
+    case l: LogicalPlan => l transformExpressionsUp {
+      case e @ BinaryComparison(_, _) => unwrapCast(e)
+    }
+  }
+
+  private def unwrapCast(exp: Expression): Expression = exp match {
+    case BinaryComparison(Literal(_, _), Cast(_, _, _)) =>
+      // Not a canonical form. In this case we first canonicalize the expression by swapping the
+      // literal and cast side, then process the result and swap the literal and cast again to
+      // restore the original order.
+      def swap(e: Expression): Expression = e match {
+        case GreaterThan(left, right) => LessThan(right, left)
+        case GreaterThanOrEqual(left, right) => LessThanOrEqual(right, left)
+        case EqualTo(left, right) => EqualTo(right, left)
+        case EqualNullSafe(left, right) => EqualNullSafe(right, left)
+        case LessThanOrEqual(left, right) => GreaterThanOrEqual(right, left)
+        case LessThan(left, right) => GreaterThan(right, left)
+        case _ => e
+      }
+
+      swap(unwrapCast(swap(exp)))
+
+    case BinaryComparison(Cast(fromExp, _, _), Literal(value, toType))
+      if canImplicitlyCast(fromExp, toType) =>
+
+      // In case both sides have integral type, o
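The case table in the Scaladoc quoted above can be checked by brute force. What follows is a minimal sketch in plain Java, not Spark code: it assumes `fromType` is `short` and `toType` is `int`, models SQL NULL as Java `null`, and uses hypothetical names (`UnwrapCastSketch`, `original`, `rewritten`, `agreesEverywhere`). The `value == min` and `value < min` branches are mirrored from the `max` cases as an assumption, since the quoted comment only says "similarly". The null-safe `<=>` operator is left out to keep the model small.

```java
import java.util.Objects;

/** Illustrative model only: class and method names are hypothetical, not Spark APIs. */
final class UnwrapCastSketch {
    enum Op { GT, GTE, EQ, LTE, LT }

    /** Plain integer comparison shared by both evaluation paths. */
    static boolean compare(int left, Op op, int right) {
        switch (op) {
            case GT:  return left > right;
            case GTE: return left >= right;
            case EQ:  return left == right;
            case LTE: return left <= right;
            default:  return left < right; // LT
        }
    }

    /** `cast(exp as int) OP value` with SQL null semantics: a null input yields null. */
    static Boolean original(Short exp, Op op, int value) {
        if (exp == null) return null;
        return compare(exp.intValue(), op, value);
    }

    /** The rewritten predicate, following the case table for a short-typed `exp`. */
    static Boolean rewritten(Short exp, Op op, int value) {
        if (exp == null) return null;            // the if(isnull(exp), null, ...) part
        short e = exp.shortValue();
        if (value > Short.MAX_VALUE) {           // value > max: constant for non-null input
            return op == Op.LTE || op == Op.LT;
        }
        if (value == Short.MAX_VALUE) {          // value == max
            switch (op) {
                case GT:  return false;
                case LTE: return true;
                case LT:  return e != Short.MAX_VALUE;
                default:  return e == Short.MAX_VALUE; // GTE, EQ
            }
        }
        if (value < Short.MIN_VALUE) {           // value < min (mirrored, an assumption)
            return op == Op.GT || op == Op.GTE;
        }
        if (value == Short.MIN_VALUE) {          // value == min (mirrored, an assumption)
            switch (op) {
                case LT:  return false;
                case GTE: return true;
                case GT:  return e != Short.MIN_VALUE;
                default:  return e == Short.MIN_VALUE; // LTE, EQ
            }
        }
        // Strictly inside (min, max): narrow the literal instead of widening the column.
        return compare(e, op, (short) value);
    }

    /** True if both forms agree for every operator, every short value, and null. */
    static boolean agreesEverywhere(int value) {
        for (Op op : Op.values()) {
            if (!Objects.equals(original(null, op, value), rewritten(null, op, value))) {
                return false;
            }
            for (int i = Short.MIN_VALUE; i <= Short.MAX_VALUE; i++) {
                Short exp = (short) i;
                if (!Objects.equals(original(exp, op, value), rewritten(exp, op, value))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] literals = {40000, Short.MAX_VALUE, 123, Short.MIN_VALUE, -40000};
        for (int value : literals) {
            System.out.println(value + " -> " + agreesEverywhere(value));
        }
    }
}
```

Exhaustive comparison is practical here only because `short` has 65,536 values; for wider source types the same check would need sampling or a proof.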
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean
AmplabJenkins removed a comment on pull request #29567: URL: https://github.com/apache/spark/pull/29567#issuecomment-683214869
[GitHub] [spark] AmplabJenkins commented on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean
AmplabJenkins commented on pull request #29567: URL: https://github.com/apache/spark/pull/29567#issuecomment-683214869
[GitHub] [spark] SparkQA removed a comment on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean
SparkQA removed a comment on pull request #29567: URL: https://github.com/apache/spark/pull/29567#issuecomment-683147685 **[Test build #128002 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128002/testReport)** for PR 29567 at commit [`cc66198`](https://github.com/apache/spark/commit/cc661984f3ccbf59bbabb91c7d92b17524ab74d3).
[GitHub] [spark] SparkQA commented on pull request #29567: [WIP][SPARK-32721][SQL] Simplify if clauses with null and boolean
SparkQA commented on pull request #29567: URL: https://github.com/apache/spark/pull/29567#issuecomment-683214525 **[Test build #128002 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128002/testReport)** for PR 29567 at commit [`cc66198`](https://github.com/apache/spark/commit/cc661984f3ccbf59bbabb91c7d92b17524ab74d3).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
viirya commented on a change in pull request #29542: URL: https://github.com/apache/spark/pull/29542#discussion_r479591040
## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ##
@@ -92,60 +88,24 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
       throws IOException, InterruptedException {
     Configuration configuration = taskAttemptContext.getConfiguration();
-    ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+    FileSplit split = (FileSplit) inputSplit;
     this.file = split.getPath();
-    long[] rowGroupOffsets = split.getRowGroupOffsets();
-    ParquetMetadata footer;
-    List blocks;
-
-    // if task.side.metadata is set, rowGroupOffsets is null
-    if (rowGroupOffsets == null) {
-      // then we need to apply the predicate push down filter
-      footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-      MessageType fileSchema = footer.getFileMetaData().getSchema();
-      FilterCompat.Filter filter = getFilter(configuration);
-      blocks = filterRowGroups(filter, footer.getBlocks(), fileSchema);
-    } else {
-      // otherwise we find the row groups that were selected on the client
-      footer = readFooter(configuration, file, NO_FILTER);

Review comment:
We don't need row groups selection here too?
[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
viirya commented on a change in pull request #29542: URL: https://github.com/apache/spark/pull/29542#discussion_r479590396
## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ##
@@ -92,60 +88,24 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext)
       throws IOException, InterruptedException {
     Configuration configuration = taskAttemptContext.getConfiguration();
-    ParquetInputSplit split = (ParquetInputSplit)inputSplit;
+    FileSplit split = (FileSplit) inputSplit;
     this.file = split.getPath();
-    long[] rowGroupOffsets = split.getRowGroupOffsets();
-    ParquetMetadata footer;
-    List blocks;
-
-    // if task.side.metadata is set, rowGroupOffsets is null
-    if (rowGroupOffsets == null) {
-      // then we need to apply the predicate push down filter
-      footer = readFooter(configuration, file, range(split.getStart(), split.getEnd()));
-      MessageType fileSchema = footer.getFileMetaData().getSchema();
-      FilterCompat.Filter filter = getFilter(configuration);

Review comment:
OK, I see. Tracing into the Parquet source code: `HadoopReadOptions` reads the filter via `getFilter`, and `ParquetFileReader` then uses it.
[GitHub] [spark] sunchao commented on a change in pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types
sunchao commented on a change in pull request #29565: URL: https://github.com/apache/spark/pull/29565#discussion_r479590420 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/UnwrapCast.scala ## @@ -0,0 +1,214 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.optimizer + +import org.apache.spark.sql.catalyst.expressions._ +import org.apache.spark.sql.catalyst.expressions.Literal.FalseLiteral +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.catalyst.rules.Rule +import org.apache.spark.sql.types._ + +/** + * Unwrap casts in binary comparison operations with patterns like following: + * + * `BinaryComparison(Cast(fromExp, toType), Literal(value, toType))` + * or + * `BinaryComparison(Literal(value, toType), Cast(fromExp, toType))` + * + * This rule optimizes expressions with the above pattern by either replacing the cast with simpler + * constructs, or moving the cast from the expression side to the literal side, which enables them + * to be optimized away later and pushed down to data sources. 
+ * + * Currently this only handles cases where `fromType` (of `fromExp`) and `toType` are of integral + * types (i.e., byte, short, int and long). The rule checks to see if the literal `value` is + * within range `(min, max)`, where `min` and `max` are the minimum and maximum value of + * `fromType`, respectively. If this is true then it means we can safely cast `value` to `fromType` + * and thus able to move the cast to the literal side. + * + * If the `value` is not within range `(min, max)`, the rule breaks the scenario into different + * cases and try to replace each with simpler constructs. + * + * if `value > max`, the cases are of following: + * - `cast(exp, ty) > value` ==> if(isnull(exp), null, false) + * - `cast(exp, ty) >= value` ==> if(isnull(exp), null, false) + * - `cast(exp, ty) === value` ==> if(isnull(exp), null, false) + * - `cast(exp, ty) <=> value` ==> false + * - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true) + * - `cast(exp, ty) < value` ==> if(isnull(exp), null, true) + * + * if `value == max`, the cases are of following: + * - `cast(exp, ty) > value` ==> if(isnull(exp), null, false) + * - `cast(exp, ty) >= value` ==> exp == max + * - `cast(exp, ty) === value` ==> exp == max + * - `cast(exp, ty) <=> value` ==> exp == max + * - `cast(exp, ty) <= value` ==> if(isnull(exp), null, true) + * - `cast(exp, ty) < value` ==> exp =!= max + * + * Similarly for the cases when `value == min` and `value < min`. + * + * Further, the above `if(isnull(exp), null, false)` is represented using conjunction + * `and(isnull(exp), null)`, to enable further optimization and filter pushdown to data sources. + * Similarly, `if(isnull(exp), null, true)` is represented with `or(isnotnull(exp), null)`. 
+ */ +object UnwrapCast extends Rule[LogicalPlan] { + override def apply(plan: LogicalPlan): LogicalPlan = plan transform { +case l: LogicalPlan => l transformExpressionsUp { + case e @ BinaryComparison(_, _) => unwrapCast(e) +} + } + + private def unwrapCast(exp: Expression): Expression = exp match { +case BinaryComparison(Literal(_, _), Cast(_, _, _)) => + // Not a canonical form. In this case we first canonicalize the expression by swapping the + // literal and cast side, then process the result and swap the literal and cast again to + // restore the original order. + def swap(e: Expression): Expression = e match { +case GreaterThan(left, right) => LessThan(right, left) +case GreaterThanOrEqual(left, right) => LessThanOrEqual(right, left) +case EqualTo(left, right) => EqualTo(right, left) +case EqualNullSafe(left, right) => EqualNullSafe(right, left) +case LessThanOrEqual(left, right) => GreaterThanOrEqual(right, left) +case LessThan(left, right) => GreaterThan(right, left) +case _ => e + } + + swap(unwrapCast(swap(exp))) + +case BinaryComparison(Cast(fromExp, _, _), Literal(value, toType)) + if canImplicitlyCast(fromExp, toType) => + + // In case both sides have integral type, o
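The rewrite table in the `UnwrapCast` doc comment quoted above can be checked with a small standalone model. This is only a sketch and not Spark code: it assumes `fromType` is `short` and `toType` is `int`, models a nullable column as `Option[Short]` and a nullable SQL boolean as `Option[Boolean]`, and every name in it (`UnwrapCastSketch`, `original`, `rewritten`, `and3`, `or3`) is hypothetical. One deliberate deviation: for `value == max` the comment lists `exp == max` for `<=>`, while the sketch keeps that case null-safe so that a NULL input still evaluates to false rather than NULL.

```scala
// Standalone model of the rewrite table; none of this is Spark's API.
object UnwrapCastSketch {
  sealed trait Op
  case object Gt extends Op
  case object GtEq extends Op
  case object Eq extends Op
  case object NullSafeEq extends Op
  case object LtEq extends Op
  case object Lt extends Op

  private val Max: Int = Short.MaxValue

  // Three-valued AND / OR, with None standing for SQL NULL.
  def and3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] = (a, b) match {
    case (Some(false), _) | (_, Some(false)) => Some(false)
    case (Some(true), Some(true)) => Some(true)
    case _ => None
  }

  def or3(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] = (a, b) match {
    case (Some(true), _) | (_, Some(true)) => Some(true)
    case (Some(false), Some(false)) => Some(false)
    case _ => None
  }

  // Reference semantics: evaluate `cast(exp as int) op value` directly.
  def original(exp: Option[Short], op: Op, value: Int): Option[Boolean] = op match {
    case Gt => exp.map(_.toInt > value)
    case GtEq => exp.map(_.toInt >= value)
    case Eq => exp.map(_.toInt == value)
    case NullSafeEq => Some(exp.exists(_.toInt == value)) // <=> never yields NULL
    case LtEq => exp.map(_.toInt <= value)
    case Lt => exp.map(_.toInt < value)
  }

  // Cast-free form for literals at or above the maximum of the source type.
  def rewritten(exp: Option[Short], op: Op, value: Int): Option[Boolean] = {
    require(value >= Max, "this sketch only covers value >= max")
    val nullOrFalse = and3(Some(exp.isEmpty), None) // and(isnull(exp), null)
    val nullOrTrue = or3(Some(exp.nonEmpty), None)  // or(isnotnull(exp), null)
    val equalsMax = exp.map(_ == Short.MaxValue)    // exp == max
    if (value > Max) op match {
      case Gt | GtEq | Eq => nullOrFalse
      case NullSafeEq => Some(false)
      case LtEq | Lt => nullOrTrue
    } else op match {
      case Gt => nullOrFalse
      case GtEq | Eq => equalsMax
      case NullSafeEq => Some(exp.contains(Short.MaxValue)) // null-safe on purpose
      case LtEq => nullOrTrue
      case Lt => equalsMax.map(!_) // exp =!= max
    }
  }
}
```

Comparing `original` and `rewritten` over NULL, the minimum, zero, and the maximum of `short` for each operator is a quick way to confirm that the two forms agree, including the NULL-propagating cases that the `and`/`or` encoding is meant to preserve.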
[GitHub] [spark] imback82 commented on a change in pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
imback82 commented on a change in pull request #29473: URL: https://github.com/apache/spark/pull/29473#discussion_r479590003 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceOrRepartitionBucketsInJoin.scala ## @@ -83,23 +82,39 @@ case class CoalesceBucketsInJoin(conf: SQLConf) extends Rule[SparkPlan] { } def apply(plan: SparkPlan): SparkPlan = { -if (!conf.coalesceBucketsInJoinEnabled) { +if (!conf.coalesceBucketsInJoinEnabled && !conf.repartitionBucketsInJoinEnabled) { return plan } +if (conf.coalesceBucketsInJoinEnabled && conf.repartitionBucketsInJoinEnabled) { + throw new AnalysisException("Both 'spark.sql.bucketing.coalesceBucketsInJoin.enabled' and " + +"'spark.sql.bucketing.repartitionBucketsInJoin.enabled' cannot be set to true at the" + +"same time") Review comment: Thanks for the suggestion! I think the new config makes more sense. I renamed a few; let me know if it doesn't make sense. Btw, do you think I can introduce `AUTOMATIC` as a follow-up, since this PR is sizable? Let me know if you want to see it in this PR. Thanks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
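To illustrate why a single strategy config can make more sense than the two boolean flags in the diff above, here is a minimal sketch. It is not the PR's code: `BucketReadStrategySketch`, the value names `Off`/`Coalesce`/`Repartition`, and both helper functions are hypothetical, chosen only to show that two booleans admit an invalid "both true" state that must be rejected at runtime, while one enumerated value cannot express it.

```scala
import java.util.Locale

// Hypothetical model; the real config lives in SQLConf and may differ.
object BucketReadStrategySketch {
  sealed trait BucketReadStrategy
  case object Off extends BucketReadStrategy
  case object Coalesce extends BucketReadStrategy
  case object Repartition extends BucketReadStrategy

  // Old shape: two independent flags, so one of the four states is invalid
  // and has to be detected explicitly.
  def fromFlags(coalesce: Boolean, repartition: Boolean): Either[String, BucketReadStrategy] =
    (coalesce, repartition) match {
      case (true, true) => Left("coalesce and repartition cannot both be enabled")
      case (true, false) => Right(Coalesce)
      case (false, true) => Right(Repartition)
      case (false, false) => Right(Off)
    }

  // New shape: one string-valued setting; unknown values are rejected when
  // parsed, and no combination check is needed afterwards.
  def parse(value: String): Either[String, BucketReadStrategy] =
    value.trim.toUpperCase(Locale.ROOT) match {
      case "OFF" => Right(Off)
      case "COALESCE" => Right(Coalesce)
      case "REPARTITION" => Right(Repartition)
      case other => Left(s"Unknown bucket read strategy: $other")
    }
}
```

Under this shape, a later mode such as the `AUTOMATIC` value mentioned in the comment would be one more case object and one more branch in `parse`, rather than a third flag to cross-check.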
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
AmplabJenkins removed a comment on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683211853
[GitHub] [spark] AmplabJenkins commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
AmplabJenkins commented on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683211853
[GitHub] [spark] SparkQA commented on pull request #29473: [SPARK-32656][SQL] Repartition bucketed tables for sort merge join / shuffled hash join if applicable
SparkQA commented on pull request #29473: URL: https://github.com/apache/spark/pull/29473#issuecomment-683211632 **[Test build #128011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128011/testReport)** for PR 29473 at commit [`e2374ac`](https://github.com/apache/spark/commit/e2374ac281bbcb23c0dc49786ce7d8148f9761bd).
[GitHub] [spark] AmplabJenkins commented on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types
AmplabJenkins commented on pull request #29565: URL: https://github.com/apache/spark/pull/29565#issuecomment-683211038
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types
AmplabJenkins removed a comment on pull request #29565: URL: https://github.com/apache/spark/pull/29565#issuecomment-683211038
[GitHub] [spark] sunchao commented on pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on pull request #29542: URL: https://github.com/apache/spark/pull/29542#issuecomment-683210745 > Parquet reader is performance-wise important component in Spark SQL. We better to make sure no performance regression due to this change. Should we run a benchmark to check it? Sure let me do that. Should I just run `DataSourceReadBenchmark`?
[GitHub] [spark] SparkQA removed a comment on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types
SparkQA removed a comment on pull request #29565: URL: https://github.com/apache/spark/pull/29565#issuecomment-683142261 **[Test build #128001 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128001/testReport)** for PR 29565 at commit [`fc4311b`](https://github.com/apache/spark/commit/fc4311b8f9ef52de1e0ea623ba239a66b03dbd50).
[GitHub] [spark] SparkQA commented on pull request #29565: [SPARK-24994][SQL] Simplify casts for literal types
SparkQA commented on pull request #29565: URL: https://github.com/apache/spark/pull/29565#issuecomment-683210663 **[Test build #128001 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128001/testReport)** for PR 29565 at commit [`fc4311b`](https://github.com/apache/spark/commit/fc4311b8f9ef52de1e0ea623ba239a66b03dbd50).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] [spark] sunchao commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on a change in pull request #29542: URL: https://github.com/apache/spark/pull/29542#discussion_r479588691 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ## @@ -22,22 +22,20 @@ import java.io.IOException; import java.lang.reflect.InvocationTargetException; import java.util.ArrayList; -import java.util.Arrays; import java.util.Collections; import java.util.HashMap; import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.parquet.HadoopReadOptions; +import org.apache.parquet.ParquetReadOptions; +import org.apache.parquet.hadoop.util.HadoopInputFile; +import org.apache.spark.sql.internal.SQLConf; Review comment: Sure. It's annoying that my IntelliJ did this. Do you usually run `dev/scalafmt` before submitting a PR?
[GitHub] [spark] sunchao commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
sunchao commented on a change in pull request #29542: URL: https://github.com/apache/spark/pull/29542#discussion_r479587433 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ## @@ -92,60 +88,24 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException { Configuration configuration = taskAttemptContext.getConfiguration(); -ParquetInputSplit split = (ParquetInputSplit)inputSplit; +FileSplit split = (FileSplit) inputSplit; this.file = split.getPath(); -long[] rowGroupOffsets = split.getRowGroupOffsets(); -ParquetMetadata footer; -List blocks; - -// if task.side.metadata is set, rowGroupOffsets is null -if (rowGroupOffsets == null) { - // then we need to apply the predicate push down filter - footer = readFooter(configuration, file, range(split.getStart(), split.getEnd())); - MessageType fileSchema = footer.getFileMetaData().getSchema(); - FilterCompat.Filter filter = getFilter(configuration); Review comment: No. Currently we push the filter down twice: once here and again in the `ParquetFileReader` ctor. This change removes the first one.
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles
AmplabJenkins removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683203442
[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles
AmplabJenkins commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683203442
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering
AmplabJenkins removed a comment on pull request #29572: URL: https://github.com/apache/spark/pull/29572#issuecomment-683203012
[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles
SparkQA commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683203194 **[Test build #128010 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128010/testReport)** for PR 29574 at commit [`f1b46f3`](https://github.com/apache/spark/commit/f1b46f30d7c7d237c5c16a71ada3456c44089adc).
[GitHub] [spark] AmplabJenkins commented on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering
AmplabJenkins commented on pull request #29572: URL: https://github.com/apache/spark/pull/29572#issuecomment-683203012
[GitHub] [spark] SparkQA removed a comment on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering
SparkQA removed a comment on pull request #29572: URL: https://github.com/apache/spark/pull/29572#issuecomment-683129162 **[Test build #128000 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128000/testReport)** for PR 29572 at commit [`1a49356`](https://github.com/apache/spark/commit/1a49356b8b7e021c1c8f7176e70b3d26ce8fc491).
[GitHub] [spark] SparkQA commented on pull request #29572: [WIP][SPARK-32730][SQL] Improve LeftSemi SortMergeJoin right side buffering
SparkQA commented on pull request #29572: URL: https://github.com/apache/spark/pull/29572#issuecomment-683202640 **[Test build #128000 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128000/testReport)** for PR 29572 at commit [`1a49356`](https://github.com/apache/spark/commit/1a49356b8b7e021c1c8f7176e70b3d26ce8fc491).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `trait AppendOnlyUnsafeRowArray`
[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
viirya commented on a change in pull request #29542: URL: https://github.com/apache/spark/pull/29542#discussion_r479583247 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ## @@ -92,60 +88,24 @@ public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException { Configuration configuration = taskAttemptContext.getConfiguration(); -ParquetInputSplit split = (ParquetInputSplit)inputSplit; +FileSplit split = (FileSplit) inputSplit; this.file = split.getPath(); -long[] rowGroupOffsets = split.getRowGroupOffsets(); -ParquetMetadata footer; -List blocks; - -// if task.side.metadata is set, rowGroupOffsets is null -if (rowGroupOffsets == null) { - // then we need to apply the predicate push down filter - footer = readFooter(configuration, file, range(split.getStart(), split.getEnd())); - MessageType fileSchema = footer.getFileMetaData().getSchema(); - FilterCompat.Filter filter = getFilter(configuration); Review comment: Does this mean we don't have predicate push down anymore?
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
AmplabJenkins removed a comment on pull request #29576: URL: https://github.com/apache/spark/pull/29576#issuecomment-683200845
[GitHub] [spark] SparkQA commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
SparkQA commented on pull request #29576: URL: https://github.com/apache/spark/pull/29576#issuecomment-683201784 **[Test build #128009 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128009/testReport)** for PR 29576 at commit [`d015266`](https://github.com/apache/spark/commit/d015266e1d275ff9cc2c1e75a11267079827bd3d).
[GitHub] [spark] maropu commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
maropu commented on pull request #29576: URL: https://github.com/apache/spark/pull/29576#issuecomment-683201606 Thanks! LGTM
[GitHub] [spark] viirya commented on a change in pull request #29542: [SPARK-32703][SQL] Replace deprecated API calls from SpecificParquetRecordReaderBase
viirya commented on a change in pull request #29542: URL: https://github.com/apache/spark/pull/29542#discussion_r479582689 ## File path: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java ## @@ -22,22 +22,20 @@ import java.io.IOException; import java.lang.reflect.InvocationTargetException; import java.util.ArrayList; -import java.util.Arrays; import java.util.Collections; import java.util.HashMap; import java.util.HashSet; import java.util.List; import java.util.Map; import java.util.Set; +import org.apache.hadoop.mapred.FileSplit; +import org.apache.parquet.HadoopReadOptions; +import org.apache.parquet.ParquetReadOptions; +import org.apache.parquet.hadoop.util.HadoopInputFile; +import org.apache.spark.sql.internal.SQLConf; Review comment: Could you move these imports below `scala.Option;` to the third-party import group?
[GitHub] [spark] AmplabJenkins commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
AmplabJenkins commented on pull request #29576: URL: https://github.com/apache/spark/pull/29576#issuecomment-683200845
[GitHub] [spark] viirya commented on pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
viirya commented on pull request #29576: URL: https://github.com/apache/spark/pull/29576#issuecomment-683200720 cc @maropu
[GitHub] [spark] viirya opened a new pull request #29576: [SPARK-32693][SQL][2.4] Compare two dataframes with same schema except nullable property
viirya opened a new pull request #29576: URL: https://github.com/apache/spark/pull/29576

### What changes were proposed in this pull request?

This PR changes the key data type check in `HashJoin` to use `sameType`. This backports #29555 to branch-2.4.

### Why are the changes needed?

The resolving condition of `SetOperation` requires only that each left data type be `sameType` as the corresponding right one. Logically, the `EqualTo` expression in an equi-join also requires only that the left data type be `sameType` as the right data type. Requiring in `HashJoin` that the left key data types be exactly the same as the right key data types therefore does not look reasonable.

It produces inconsistent results when doing `except` between two dataframes. If the two dataframes don't have nested fields, `HashJoin` passes the key type check even when the nullable property of their fields differs, because it checks the fields individually and so the nullable property is ignored. If the two dataframes have nested fields such as structs, `HashJoin` fails the key type check, because it now compares two struct types and the nullable property affects that comparison.

### Does this PR introduce _any_ user-facing change?

Yes. It makes the `except` operation between dataframes consistent.

### How was this patch tested?

Unit test.
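The difference between exact type equality and a `sameType`-style check can be sketched with a toy type model. This is an illustration under stated assumptions, not Spark's implementation: `SameTypeSketch`, `DType`, `Field`, and `StructT` are hypothetical stand-ins for Spark's data types, and the `sameType` function here simply ignores nullability at every nesting level.

```scala
// Toy type model; the names are hypothetical and only mirror the idea.
object SameTypeSketch {
  sealed trait DType
  case object IntT extends DType
  case object StringT extends DType
  // Nullability is a property of the field, not of the atomic type.
  final case class Field(name: String, dataType: DType, nullable: Boolean)
  final case class StructT(fields: Seq[Field]) extends DType

  // Structural comparison that ignores `nullable` at every nesting level.
  def sameType(a: DType, b: DType): Boolean = (a, b) match {
    case (StructT(fa), StructT(fb)) =>
      fa.length == fb.length && fa.zip(fb).forall { case (x, y) =>
        x.name == y.name && sameType(x.dataType, y.dataType)
      }
    case _ => a == b
  }
}
```

With flat keys the key type is just `IntT` on both sides, so plain equality passes whatever the nullability of the column. With a struct key, `StructT(Seq(Field("a", IntT, nullable = false)))` and the same struct with `nullable = true` are not equal under `==`, yet `sameType` treats them as compatible, which is the behaviour the description argues the key type check should have.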
[GitHub] [spark] github-actions[bot] closed pull request #28106: [SPARK-31335][SQL] Add try function support
github-actions[bot] closed pull request #28106: URL: https://github.com/apache/spark/pull/28106
[GitHub] [spark] github-actions[bot] commented on pull request #28491: [SPARK-30267][SQL] Interoperability tests with Avro records generated by Avro4s
github-actions[bot] commented on pull request #28491: URL: https://github.com/apache/spark/pull/28491#issuecomment-683200263 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #28574: [SPARK-31752][SQL][DOCS] Add sql doc for interval type
github-actions[bot] commented on pull request #28574: URL: https://github.com/apache/spark/pull/28574#issuecomment-683200258 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #27498: [SPARK-30688][SQL] Week based dates not being parsed with TimestampFormatter
github-actions[bot] closed pull request #27498: URL: https://github.com/apache/spark/pull/27498
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property
AmplabJenkins removed a comment on pull request #29575: URL: https://github.com/apache/spark/pull/29575#issuecomment-683199099
[GitHub] [spark] AmplabJenkins commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property
AmplabJenkins commented on pull request #29575: URL: https://github.com/apache/spark/pull/29575#issuecomment-683199099
[GitHub] [spark] SparkQA commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property
SparkQA commented on pull request #29575: URL: https://github.com/apache/spark/pull/29575#issuecomment-683198820 **[Test build #128008 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128008/testReport)** for PR 29575 at commit [`141c8f3`](https://github.com/apache/spark/commit/141c8f3eecea97ecc75ee02806566a9f29f41af8).
[GitHub] [spark] viirya commented on pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property
viirya commented on pull request #29575: URL: https://github.com/apache/spark/pull/29575#issuecomment-683198529 cc @maropu
[GitHub] [spark] viirya opened a new pull request #29575: [SPARK-32693][SQL][3.0] Compare two dataframes with same schema except nullable property
viirya opened a new pull request #29575: URL: https://github.com/apache/spark/pull/29575

### What changes were proposed in this pull request?

This PR changes the key data type check in `HashJoin` to use `sameType`. This backports #29555 to branch-3.0.

### Why are the changes needed?

The resolution condition of `SetOperation` requires only that each left data type be `sameType` as the corresponding right one. Logically, the `EqualTo` expression in an equi-join also requires only that the left data type be `sameType` as the right data type. Requiring the left key data types in `HashJoin` to be exactly the same as the right key data types is therefore not reasonable.

It produces inconsistent results when doing `except` between two dataframes. If the two dataframes don't have nested fields, `HashJoin` passes the key type check even when the nullable property of their fields differs, because it checks each field individually and so field nullability is ignored. If the two dataframes have nested fields such as structs, `HashJoin` fails the key type check because it then compares two struct types, and the nullable property affects that comparison.

### Does this PR introduce _any_ user-facing change?

Yes. It makes the `except` operation between dataframes consistent.

### How was this patch tested?

Unit test.
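To illustrate the difference described above, here is a minimal sketch in plain Python. It is a simplified model and not Spark's code: the classes `AtomicType`, `StructField`, `StructType` and the helpers `same_type` and `keys_match` are hypothetical stand-ins, and `same_type` is assumed to mean "equal when the nullable flag of nested fields is ignored".

```python
import operator
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class AtomicType:
    """Hypothetical stand-in for a non-nested type such as int or string."""
    name: str


@dataclass(frozen=True)
class StructField:
    """Hypothetical stand-in for a struct field; nullable is part of equality."""
    name: str
    data_type: "DataType"
    nullable: bool = True


@dataclass(frozen=True)
class StructType:
    """Hypothetical stand-in for a struct type: an ordered tuple of fields."""
    fields: Tuple[StructField, ...]


DataType = Union[AtomicType, StructType]


def same_type(left: DataType, right: DataType) -> bool:
    """Compare two types while ignoring the nullable flag of nested fields."""
    if isinstance(left, StructType) and isinstance(right, StructType):
        return len(left.fields) == len(right.fields) and all(
            l.name == r.name and same_type(l.data_type, r.data_type)
            for l, r in zip(left.fields, right.fields)
        )
    return left == right


def keys_match(left_keys, right_keys, compare) -> bool:
    """Check join key types pairwise with the given comparison function."""
    return len(left_keys) == len(right_keys) and all(
        compare(l, r) for l, r in zip(left_keys, right_keys)
    )


INT = AtomicType("int")

# Flat keys: in this model nullability is not part of an atomic type,
# so the strict check passes.
flat_left, flat_right = [INT], [INT]

# Nested keys: the same struct, differing only in the field's nullable flag.
nested_left = [StructType((StructField("a", INT, nullable=False),))]
nested_right = [StructType((StructField("a", INT, nullable=True),))]

strict_flat = keys_match(flat_left, flat_right, operator.eq)
strict_nested = keys_match(nested_left, nested_right, operator.eq)
relaxed_nested = keys_match(nested_left, nested_right, same_type)

print(strict_flat, strict_nested, relaxed_nested)
```

In this model the strict check accepts the flat keys but rejects the nested ones, while the relaxed check accepts both, which is the inconsistency the PR description attributes to the exact-equality check.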
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles
AmplabJenkins removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683197456
[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles
AmplabJenkins commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683197456
[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles
SparkQA commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683197133 **[Test build #128007 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128007/testReport)** for PR 29574 at commit [`85fa83a`](https://github.com/apache/spark/commit/85fa83ad978fd0ae2b757fd2d72272dc54e3089b).
[GitHub] [spark] AmplabJenkins removed a comment on pull request #29574: Fix some R styles
AmplabJenkins removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683196705
[GitHub] [spark] AmplabJenkins commented on pull request #29574: Fix some R styles
AmplabJenkins commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683196705
[GitHub] [spark] SparkQA commented on pull request #29574: Fix some R styles
SparkQA commented on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683196636 **[Test build #128006 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128006/testReport)** for PR 29574 at commit [`ff3d769`](https://github.com/apache/spark/commit/ff3d76981285f5f752b46617f7a579e6ce85c7fe). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] [spark] SparkQA removed a comment on pull request #29574: Fix some R styles
SparkQA removed a comment on pull request #29574: URL: https://github.com/apache/spark/pull/29574#issuecomment-683188721 **[Test build #128006 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128006/testReport)** for PR 29574 at commit [`ff3d769`](https://github.com/apache/spark/commit/ff3d76981285f5f752b46617f7a579e6ce85c7fe).