[GitHub] [spark] cloud-fan commented on a change in pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling
cloud-fan commented on a change in pull request #32448: URL: https://github.com/apache/spark/pull/32448#discussion_r629091515 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala ## @@ -707,32 +710,63 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession { val df2 = Seq((1, UnionClass1c(1, 2L, UnionClass4(2, 3L.toDF("id", "a") var unionDf = df1.unionByName(df2, true) - checkAnswer(unionDf, -Row(0, Row(0, 1, Row(null, 1, null, "2"))) :: - Row(1, Row(1, 2, Row(2, null, 3L, null))) :: Nil) assert(unionDf.schema.toDDL == "`id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, " + - "`nested`: STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>") + "`nested`: STRUCT<`a`: INT, `c`: STRING, `A`: INT, `b`: BIGINT>>") Review comment: is this an expected behavior change? and why do we prefer to new behavior? ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala ## @@ -707,32 +710,63 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession { val df2 = Seq((1, UnionClass1c(1, 2L, UnionClass4(2, 3L.toDF("id", "a") var unionDf = df1.unionByName(df2, true) - checkAnswer(unionDf, -Row(0, Row(0, 1, Row(null, 1, null, "2"))) :: - Row(1, Row(1, 2, Row(2, null, 3L, null))) :: Nil) assert(unionDf.schema.toDDL == "`id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, " + - "`nested`: STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>") + "`nested`: STRUCT<`a`: INT, `c`: STRING, `A`: INT, `b`: BIGINT>>") Review comment: is this an expected behavior change? and why do we prefer the new behavior? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still conti
viirya commented on a change in pull request #32399: URL: https://github.com/apache/spark/pull/32399#discussion_r629091240 ## File path: dev/sparktestsupport/modules.py ## @@ -565,6 +565,7 @@ def __hash__(self): "pyspark.ml.tests.test_stat", "pyspark.ml.tests.test_training_summary", "pyspark.ml.tests.test_tuning", +"pyspark.ml.tests.pyspark.ml.tests.test_tuning_on_pin_thread_mode", Review comment: Does `pyspark.ml.tests` duplicate? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still
HyukjinKwon commented on a change in pull request #32399: URL: https://github.com/apache/spark/pull/32399#discussion_r629086511 ## File path: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ## @@ -161,11 +169,26 @@ class TrainValidationSplit @Since("1.5.0") (@Since("1.5.0") override val uid: St } // Wait for all metrics to be calculated -val metrics = metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) - -// Unpersist training & validation set once all metrics have been produced -trainingDataset.unpersist() -validationDataset.unpersist() +val metrics = try { + metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) +} +catch { + case e: Throwable => +subTaskFailed = true +throw e +} +finally { + if (subTaskFailed) { +Thread.sleep(1000) +val sparkContext = dataset.sparkSession.sparkContext +sparkContext.cancelJobGroup( + sparkContext.getLocalProperty(SparkContext.SPARK_JOB_GROUP_ID) +) + } Review comment: i think we could do this as below: without `subTaskFailed` .. ? ```scala catch { case e: Throwable => val sparkContext = dataset.sparkSession.sparkContext sparkContext.cancelJobGroup( sparkContext.getLocalProperty(SparkContext.SPARK_JOB_GROUP_ID) ) ``` ## File path: python/pyspark/ml/tuning.py ## @@ -730,13 +733,40 @@ def _fit(self, dataset): train = datasets[i][0].cache() tasks = _parallelFitTasks(est, train, eva, validation, epm, collectSubModelsParam) -for j, metric, subModel in pool.imap_unordered(lambda f: f(), tasks): -metrics[j] += (metric / nFolds) -if collectSubModelsParam: -subModels[i][j] = subModel -validation.unpersist() -train.unpersist() +sub_task_failed = False + +@inheritable_thread_target +def run_task(task): +if sub_task_failed: +raise RuntimeError("Terminate this task because one of other task failed.") +return task() + +try: +for j, metric, subModel in pool.imap_unordered(run_task, tasks): +metrics[j] += (metric / nFolds) +if collectSubModelsParam: +subModels[i][j] = subModel +except: +sub_task_failed = True +raise +finally: +if sub_task_failed: +if is_pinned_thread_mode(): +try: +time.sleep(1) +sc = dataset._sc + sc.cancelJobGroup(sc.getLocalProperty("spark.jobGroup.id")) +except: +pass +else: +warnings.warn("CrossValidator {} fit call failed but some spark jobs " + "may still running for unfinished trials. Enable pyspark " Review comment: Hm, why is it inconsistent with `TrainValidationSplit`? Seems like `TrainValidationSplit` will always cancel but here only cancel when pinned thread mode is on. ## File path: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ## @@ -161,11 +169,26 @@ class TrainValidationSplit @Since("1.5.0") (@Since("1.5.0") override val uid: St } // Wait for all metrics to be calculated -val metrics = metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) - -// Unpersist training & validation set once all metrics have been produced -trainingDataset.unpersist() -validationDataset.unpersist() +val metrics = try { + metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) +} +catch { + case e: Throwable => +subTaskFailed = true +throw e +} +finally { + if (subTaskFailed) { +Thread.sleep(1000) Review comment: Would you mind elabourating why we should sleep here? I should avoid relying on sleep in the main codes or test codes whenever possible. ## File path: python/pyspark/util.py ## @@ -263,6 +264,69 @@ def _parse_memory(s): return int(float(s[:-1]) * units[s[-1].lower()]) +def is_pinned_thread_mode(): +""" +Return ``True`` when spark run under pinned thread mode. +""" +from pyspark import SparkContext +return isinstance(SparkContext._gateway, ClientServer) + + +def inheritable_thread_target(f): +""" +Return thread target wrapper which is recommended to be used in PySpark when the +pinned thread mode is enabled. The wrapper function, before calling original +thread target, it inherits the inheritable properties specific +to JVM thread such as `
[GitHub] [spark] WeichenXu123 commented on a change in pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still
WeichenXu123 commented on a change in pull request #32399: URL: https://github.com/apache/spark/pull/32399#discussion_r629087709 ## File path: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ## @@ -161,11 +169,26 @@ class TrainValidationSplit @Since("1.5.0") (@Since("1.5.0") override val uid: St } // Wait for all metrics to be calculated -val metrics = metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) - -// Unpersist training & validation set once all metrics have been produced -trainingDataset.unpersist() -validationDataset.unpersist() +val metrics = try { + metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) +} +catch { + case e: Throwable => +subTaskFailed = true +throw e +} +finally { + if (subTaskFailed) { +Thread.sleep(1000) Review comment: This sleep is for: each trial task which thread already running, may took some time running before it launch spark job, if here we cancel job immediately, then we may miss killing the spark job which will be spawned soon pseudocode for this: ``` def trial_thread_target(): if subTaskFailed: raise Error() else: # 1. run some code here # 2. launch a spark job... # 3. run some code here # 4. launch a second spark job... # ``` Suppose `cancelJobGroup` called at the running time of step 1/step 3 , then we may miss killing the spark job spwaned at step 2/step 4 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan edited a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
cloud-fan edited a comment on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-836259432 In general, I think it's better to optimize the cached plan more aggressively for better performance, even though it may cause perf regression due to output partitioning change, which should be rare. About the config name, how about `spark.sql.optimizer.canChangeCachedPlanOutputPartitioning`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
cloud-fan commented on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-836259432 In general, I think it's better to optimize the cached plan more aggressively for better performance, even though it may cause perf regression due to output partitioning change, which should be rare. About the config name, how about `...canChangeCachedPlanOutputPartitioning`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on a change in pull request #32409: [SPARK-35285][SQL] Parse ANSI interval types in SQL schema
MaxGekk commented on a change in pull request #32409: URL: https://github.com/apache/spark/pull/32409#discussion_r629088422 ## File path: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CollectionExpressionsSuite.scala ## @@ -1098,7 +1098,7 @@ class CollectionExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper Literal(Date.valueOf("2018-01-05")), Literal(Period.ofDays(2))), EmptyRow, -"sequence step must be a day year-month interval if start and end values are dates") +"sequence step must be a day interval year to month if start and end values are dates") Review comment: @beliefer The error message confuses me slightly, especially the combination `a day interval year to month`. Could you open a PR to improve the error, please, something like "... sequence step must be an interval of day granularity ...". -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] WeichenXu123 commented on a change in pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still
WeichenXu123 commented on a change in pull request #32399: URL: https://github.com/apache/spark/pull/32399#discussion_r629087709 ## File path: mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala ## @@ -161,11 +169,26 @@ class TrainValidationSplit @Since("1.5.0") (@Since("1.5.0") override val uid: St } // Wait for all metrics to be calculated -val metrics = metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) - -// Unpersist training & validation set once all metrics have been produced -trainingDataset.unpersist() -validationDataset.unpersist() +val metrics = try { + metricFutures.map(ThreadUtils.awaitResult(_, Duration.Inf)) +} +catch { + case e: Throwable => +subTaskFailed = true +throw e +} +finally { + if (subTaskFailed) { +Thread.sleep(1000) Review comment: This sleep is for: each trial task which thread already running, may took some time running before it launch spark job, if here we cancel job immediately, then we may miss killing the spark job which will be spawned soon -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on pull request #31269: [SPARK-33933][SQL] Materialize BroadcastQueryStage first to try to avoid broadcast timeout in AQE
cloud-fan commented on pull request #31269: URL: https://github.com/apache/spark/pull/31269#issuecomment-836252782 @zhongyu09 please open a new JIRA, thanks! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #32454: [SPARK-35327][SQL][TESTS] Filters out the TPC-DS queries that can cause flaky test results
cloud-fan commented on a change in pull request #32454: URL: https://github.com/apache/spark/pull/32454#discussion_r629085064 ## File path: sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala ## @@ -24,7 +24,7 @@ import org.apache.spark.sql.test.SharedSparkSession trait TPCDSBase extends SharedSparkSession with TPCDSSchema { // The TPCDS queries below are based on v1.4 - val tpcdsQueries = Seq( + def tpcdsQueries: Seq[String] = Seq( "q1", "q2", "q3", "q4", "q5", "q6", "q7", "q8", "q9", "q10", "q11", Review comment: shall we remove q6 from here for all the tests, if the only difference is an extra order by column? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu commented on pull request #32489: [SPARK-35360][SQL] RepairTableCommand respect `spark.sql.addPartitionInBatch.size` too
AngersZh commented on pull request #32489: URL: https://github.com/apache/spark/pull/32489#issuecomment-836250813 ping @MaxGekk @wangyum @maropu -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk closed pull request #32444: [SPARK-35111][SPARK-35112][SQL][FOLLOWUP] Rename ANSI interval patterns and regexps
MaxGekk closed pull request #32444: URL: https://github.com/apache/spark/pull/32444 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue
SparkQA removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836152449 **[Test build #138317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138317/testReport)** for PR 32399 at commit [`e8c86db`](https://github.com/apache/spark/commit/e8c86db1753a097e7ed442fd26d064693e0803e8). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836244182 **[Test build #138317 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138317/testReport)** for PR 32399 at commit [`e8c86db`](https://github.com/apache/spark/commit/e8c86db1753a097e7ed442fd26d064693e0803e8). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] MaxGekk commented on pull request #32444: [SPARK-35111][SPARK-35112][SQL][FOLLOWUP] Rename ANSI interval patterns and regexps
MaxGekk commented on pull request #32444: URL: https://github.com/apache/spark/pull/32444#issuecomment-836243414 +1, LGTM. Merging to master. Thank you, @AngersZh and @cloud-fan for your review. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AngersZhuuuu opened a new pull request #32489: [SPARK-35360][SQL] RepairTableCommand respect `spark.sql.addPartitionInBatch.size` too
AngersZh opened a new pull request #32489: URL: https://github.com/apache/spark/pull/32489 ### What changes were proposed in this pull request? RepairTableCommand respect `spark.sql.addPartitionInBatch.size` too ### Why are the changes needed? Make RepairTableCommand add partition batch size configurable. ### Does this PR introduce _any_ user-facing change? User can use `spark.sql.addPartitionInBatch.size` to change batch size when repair table. ### How was this patch tested? Not need -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
SparkQA commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836238497 **[Test build #138323 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138323/testReport)** for PR 32487 at commit [`4098407`](https://github.com/apache/spark/commit/4098407bf6b74f2045ca27c3851da249a2a6ec7e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
SparkQA commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836235893 **[Test build #138322 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138322/testReport)** for PR 32487 at commit [`4098407`](https://github.com/apache/spark/commit/4098407bf6b74f2045ca27c3851da249a2a6ec7e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
viirya commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836235572 retest this please -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
AmplabJenkins removed a comment on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836233460 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138314/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling
HyukjinKwon commented on pull request #32448: URL: https://github.com/apache/spark/pull/32448#issuecomment-836233745 Can you also update https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2078-L2082 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
AmplabJenkins commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836233460 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138314/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
SparkQA removed a comment on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836145492 **[Test build #138314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138314/testReport)** for PR 32487 at commit [`4098407`](https://github.com/apache/spark/commit/4098407bf6b74f2045ca27c3851da249a2a6ec7e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
SparkQA commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836231948 **[Test build #138314 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138314/testReport)** for PR 32487 at commit [`4098407`](https://github.com/apache/spark/commit/4098407bf6b74f2045ca27c3851da249a2a6ec7e). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co
AmplabJenkins removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836230224 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42842/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HyukjinKwon commented on a change in pull request #32448: [SPARK-35290][SQL] Use StructType merging for unionByName with null filling
HyukjinKwon commented on a change in pull request #32448: URL: https://github.com/apache/spark/pull/32448#discussion_r629072673 ## File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSetOperationsSuite.scala ## @@ -707,32 +710,63 @@ class DataFrameSetOperationsSuite extends QueryTest with SharedSparkSession { val df2 = Seq((1, UnionClass1c(1, 2L, UnionClass4(2, 3L.toDF("id", "a") var unionDf = df1.unionByName(df2, true) - checkAnswer(unionDf, -Row(0, Row(0, 1, Row(null, 1, null, "2"))) :: - Row(1, Row(1, 2, Row(2, null, 3L, null))) :: Nil) assert(unionDf.schema.toDDL == "`id` INT,`a` STRUCT<`a`: INT, `b`: BIGINT, " + - "`nested`: STRUCT<`A`: INT, `a`: INT, `b`: BIGINT, `c`: STRING>>") + "`nested`: STRUCT<`a`: INT, `c`: STRING, `A`: INT, `b`: BIGINT>>") Review comment: Can we update migration guide (https://github.com/apache/spark/blob/master/docs/sql-migration-guide.md)? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co
AmplabJenkins removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836229891 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836230158 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r
AmplabJenkins commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836230224 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42842/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes
AmplabJenkins removed a comment on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-836229893 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42840/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r
AmplabJenkins commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836229892 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes
AmplabJenkins commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-836229893 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42840/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836229015 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes
SparkQA commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-836228137 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42840/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836226769 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42843/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes
SparkQA commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-836224236 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42840/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836216634 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836192784 **[Test build #138321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138321/testReport)** for PR 32399 at commit [`a1724ab`](https://github.com/apache/spark/commit/a1724ab3c4bb852dcb227bced236fdcbd3f3b93f). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue
SparkQA removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836190123 **[Test build #138320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138320/testReport)** for PR 32399 at commit [`a6874e5`](https://github.com/apache/spark/commit/a6874e5fc05c1f418500670c59c36bc799977761). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co
AmplabJenkins removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836190600 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138320/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836190579 **[Test build #138320 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138320/testReport)** for PR 32399 at commit [`a6874e5`](https://github.com/apache/spark/commit/a6874e5fc05c1f418500670c59c36bc799977761). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r
AmplabJenkins commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836190600 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138320/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836190123 **[Test build #138320 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138320/testReport)** for PR 32399 at commit [`a6874e5`](https://github.com/apache/spark/commit/a6874e5fc05c1f418500670c59c36bc799977761). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue
SparkQA removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836187589 **[Test build #138319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138319/testReport)** for PR 32399 at commit [`45c64ea`](https://github.com/apache/spark/commit/45c64ead77bfd897b53e383efa67e4ba35c2). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co
AmplabJenkins removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836188026 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138319/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r
AmplabJenkins commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836188026 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138319/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836188006 **[Test build #138319 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138319/testReport)** for PR 32399 at commit [`45c64ea`](https://github.com/apache/spark/commit/45c64ead77bfd897b53e383efa67e4ba35c2). * This patch **fails RAT tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836187589 **[Test build #138319 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138319/testReport)** for PR 32399 at commit [`45c64ea`](https://github.com/apache/spark/commit/45c64ead77bfd897b53e383efa67e4ba35c2). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32031: [WIP] Initial work of Remote Shuffle Service on Kubernetes
SparkQA commented on pull request #32031: URL: https://github.com/apache/spark/pull/32031#issuecomment-836185216 **[Test build #138318 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138318/testReport)** for PR 32031 at commit [`506149f`](https://github.com/apache/spark/commit/506149f3fa92b27bdf09da6748e91516b6dd5aea). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty
AmplabJenkins removed a comment on pull request #32475: URL: https://github.com/apache/spark/pull/32475#issuecomment-836183250 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42837/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
AmplabJenkins removed a comment on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836183252 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42836/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
AmplabJenkins removed a comment on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-836183249 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42838/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
AmplabJenkins commented on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-836183249 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42838/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty
AmplabJenkins commented on pull request #32475: URL: https://github.com/apache/spark/pull/32475#issuecomment-836183250 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42837/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
AmplabJenkins commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836183252 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42836/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
SparkQA commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836177985 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
SparkQA commented on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-836176561 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty
SparkQA commented on pull request #32475: URL: https://github.com/apache/spark/pull/32475#issuecomment-836174973 Kubernetes integration test status failure URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42837/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty
SparkQA commented on pull request #32475: URL: https://github.com/apache/spark/pull/32475#issuecomment-836171886 Kubernetes integration test starting URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42837/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
AmplabJenkins commented on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-836153120 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138312/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836152449 **[Test build #138317 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138317/testReport)** for PR 32399 at commit [`e8c86db`](https://github.com/apache/spark/commit/e8c86db1753a097e7ed442fd26d064693e0803e8). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
SparkQA removed a comment on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-835964738 **[Test build #138312 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138312/testReport)** for PR 32473 at commit [`21cc2ac`](https://github.com/apache/spark/commit/21cc2ac907ffe9256942d818663ce225d1a1b992). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
SparkQA commented on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-836151733 **[Test build #138312 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138312/testReport)** for PR 32473 at commit [`21cc2ac`](https://github.com/apache/spark/commit/21cc2ac907ffe9256942d818663ce225d1a1b992). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
srowen commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836150808 Getting pretty big! but OK if needed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on a change in pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
ulysses-you commented on a change in pull request #32482: URL: https://github.com/apache/spark/pull/32482#discussion_r629039347 ## File path: sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala ## @@ -1175,7 +1175,7 @@ class CachedTableSuite extends QueryTest with SQLTestUtils } test("cache supports for intervals") { -withTable("interval_cache") { +withTable("interval_cache", "t1") { Review comment: not related this pr, but affected the new added test with `t1`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
SparkQA commented on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-836147815 **[Test build #138316 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138316/testReport)** for PR 32482 at commit [`7625677`](https://github.com/apache/spark/commit/76256774c52b78b9f6011f82063004bf18734f01). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ulysses-you commented on pull request #32482: [SPARK-35332][SQL] Make cache plan disable configs configurable
ulysses-you commented on pull request #32482: URL: https://github.com/apache/spark/pull/32482#issuecomment-836147309 Thank you @maropu @c21 @dongjoon-hyun . Agree, the current config seems overkill to user, it's better to just make it as `enabled`. Refactor this PR to address: * make the new config simple and improve the doc. * improve the test for two things, 1) more pattern with AQE test, 2) bucketed test -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32475: [SPARK-34775][SQL] Push down limit through window when partitionSpec is not empty
SparkQA commented on pull request #32475: URL: https://github.com/apache/spark/pull/32475#issuecomment-83614 **[Test build #138315 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138315/testReport)** for PR 32475 at commit [`bf9d041`](https://github.com/apache/spark/commit/bf9d04140d596ba9d4cfe33b0f497a5a9045ba37). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
SparkQA commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836145492 **[Test build #138314 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138314/testReport)** for PR 32487 at commit [`4098407`](https://github.com/apache/spark/commit/4098407bf6b74f2045ca27c3851da249a2a6ec7e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate
AmplabJenkins commented on pull request #32488: URL: https://github.com/apache/spark/pull/32488#issuecomment-836144135 Can one of the admins verify this patch? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cfmcgrady opened a new pull request #32488: [SPARK-35316][SQL] UnwrapCastInBinaryComparison support In/InSet predicate
cfmcgrady opened a new pull request #32488: URL: https://github.com/apache/spark/pull/32488 ### What changes were proposed in this pull request? This pr add in/inset predicate support for `UnwrapCastInBinaryComparison`. Current implement doesn't pushdown filters for `In/InSet` which contains `Cast`. For instance: ```scala spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1") spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain ``` before this pr: ``` == Physical Plan == *(1) Filter cast(id#5 as bigint) IN (1,2,4) +- *(1) ColumnarToRow +- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct ``` after this pr: ``` == Physical Plan == *(1) Filter id#95 IN (1,2,4) +- *(1) ColumnarToRow +- FileScan parquet [id#95] Batched: true, DataFilters: [id#95 IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [In(id, [1,2,4])], ReadSchema: struct ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join
c21 commented on a change in pull request #32476: URL: https://github.com/apache/spark/pull/32476#discussion_r629027318 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ## @@ -418,115 +443,140 @@ case class SortMergeJoinExec( // Inline mutable state since not many join operations in a task val matches = ctx.addMutableState(clsName, "matches", v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", forceInline = true) -// Copy the left keys as class members so they could be used in next function call. -val matchedKeyVars = copyKeys(ctx, leftKeyVars) +// Copy the streamed keys as class members so they could be used in next function call. +val matchedKeyVars = copyKeys(ctx, streamedKeyVars) + +// Handle the case when streamed rows has any NULL keys. +val handleStreamedAnyNull = joinType match { + case _: InnerLike => +// Skip streamed row. +s""" + |$streamedRow = null; + |continue; + """.stripMargin + case LeftOuter | RightOuter => +// Eagerly return streamed row. +s""" + |if (!$matches.isEmpty()) { + | $matches.clear(); + |} + |return false; + """.stripMargin + case x => +throw new IllegalArgumentException( + s"SortMergeJoin.genScanner should not take $x as the JoinType") +} -ctx.addNewFunction("findNextInnerJoinRows", +// Handle the case when streamed keys less than buffered keys. +val handleStreamedLessThanBuffered = joinType match { + case _: InnerLike => +// Skip streamed row. +s"$streamedRow = null;" + case LeftOuter | RightOuter => +// Eagerly return with streamed row. +"return false;" + case x => +throw new IllegalArgumentException( + s"SortMergeJoin.genScanner should not take $x as the JoinType") +} + +ctx.addNewFunction("findNextJoinRows", s""" - |private boolean findNextInnerJoinRows( - |scala.collection.Iterator leftIter, - |scala.collection.Iterator rightIter) { - | $leftRow = null; + |private boolean findNextJoinRows( Review comment: @maropu - No I think we need buffer anyway. The buffered rows has same join keys with current streamed row. But there can be multiple followed streamed rows having same join keys, as the buffered rows. Even though buffered rows cannot match condition with current streamed row, they may match condition with followed streamed rows. I think this is how current sort merge join (code-gen & iterator) is designed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on a change in pull request #32476: [SPARK-35349][SQL] Add code-gen for left/right outer sort merge join
c21 commented on a change in pull request #32476: URL: https://github.com/apache/spark/pull/32476#discussion_r629027318 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/joins/SortMergeJoinExec.scala ## @@ -418,115 +443,140 @@ case class SortMergeJoinExec( // Inline mutable state since not many join operations in a task val matches = ctx.addMutableState(clsName, "matches", v => s"$v = new $clsName($inMemoryThreshold, $spillThreshold);", forceInline = true) -// Copy the left keys as class members so they could be used in next function call. -val matchedKeyVars = copyKeys(ctx, leftKeyVars) +// Copy the streamed keys as class members so they could be used in next function call. +val matchedKeyVars = copyKeys(ctx, streamedKeyVars) + +// Handle the case when streamed rows has any NULL keys. +val handleStreamedAnyNull = joinType match { + case _: InnerLike => +// Skip streamed row. +s""" + |$streamedRow = null; + |continue; + """.stripMargin + case LeftOuter | RightOuter => +// Eagerly return streamed row. +s""" + |if (!$matches.isEmpty()) { + | $matches.clear(); + |} + |return false; + """.stripMargin + case x => +throw new IllegalArgumentException( + s"SortMergeJoin.genScanner should not take $x as the JoinType") +} -ctx.addNewFunction("findNextInnerJoinRows", +// Handle the case when streamed keys less than buffered keys. +val handleStreamedLessThanBuffered = joinType match { + case _: InnerLike => +// Skip streamed row. +s"$streamedRow = null;" + case LeftOuter | RightOuter => +// Eagerly return with streamed row. +"return false;" + case x => +throw new IllegalArgumentException( + s"SortMergeJoin.genScanner should not take $x as the JoinType") +} + +ctx.addNewFunction("findNextJoinRows", s""" - |private boolean findNextInnerJoinRows( - |scala.collection.Iterator leftIter, - |scala.collection.Iterator rightIter) { - | $leftRow = null; + |private boolean findNextJoinRows( Review comment: @maropu - No I think we need buffer anyway. The buffered rows has same join keys with current streamed row. But there can be multiple followed streamed rows having same join keys, as the buffered rows. Even though buffered rows cannot match condition with current streamed rows, they may match condition with followed streamed rows. I think this is how current sort merge join (code-gen & iterator) is designed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue
SparkQA removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836035623 **[Test build #138313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138313/testReport)** for PR 32399 at commit [`c6aa4c4`](https://github.com/apache/spark/commit/c6aa4c4ccc8b9103314d5efea148b71e19a560d4). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] viirya commented on a change in pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
viirya commented on a change in pull request #32487: URL: https://github.com/apache/spark/pull/32487#discussion_r629025675 ## File path: dev/create-release/release-build.sh ## @@ -210,6 +210,8 @@ if [[ "$1" == "package" ]]; then PYSPARK_VERSION=`echo "$SPARK_VERSION" | sed -e "s/-/./" -e "s/SNAPSHOT/dev0/" -e "s/preview/dev/"` echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py +export MAVEN_OPTS="-Xmx12000m" Review comment: ok. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co
AmplabJenkins removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836109653 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138313/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r
AmplabJenkins commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836109653 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138313/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836108663 **[Test build #138313 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138313/testReport)** for PR 32399 at commit [`c6aa4c4`](https://github.com/apache/spark/commit/c6aa4c4ccc8b9103314d5efea148b71e19a560d4). * This patch **fails PySpark unit tests**. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
AmplabJenkins removed a comment on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-836106608 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138311/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
AmplabJenkins commented on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-836106608 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138311/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wangyum commented on a change in pull request #32410: [SPARK-35286][SQL] Replace SessionState.start with SessionState.setCurrentSessionState
wangyum commented on a change in pull request #32410: URL: https://github.com/apache/spark/pull/32410#discussion_r629020979 ## File path: sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java ## @@ -141,7 +141,7 @@ public void open(Map sessionConfMap) throws HiveSQLException { sessionState = new SessionState(hiveConf, username); sessionState.setUserIpAddress(ipAddress); sessionState.setIsHiveServerQuery(true); -SessionState.start(sessionState); +SessionState.setCurrentSessionState(sessionState); Review comment: Yes. It is safe when use `ADD JARS`. We have disabled creating these directories for more than a year with the following changes(`HiveConf.ConfVars.WITHSCRATCHDIR=false`): ![image](https://user-images.githubusercontent.com/5399861/116785447-312cc500-aacc-11eb-8dff-6ae75fbbc4d7.png) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
SparkQA removed a comment on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-835906957 **[Test build #138311 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138311/testReport)** for PR 32473 at commit [`34d0511`](https://github.com/apache/spark/commit/34d05113d307395bd1c1449651e09a8285fd0c6e). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] HeartSaVioR commented on pull request #25911: [SPARK-29223][SQL][SS] Enable global timestamp per topic while specifying offset by timestamp in Kafka source
HeartSaVioR commented on pull request #25911: URL: https://github.com/apache/spark/pull/25911#issuecomment-836089685 I see actual customer's demand on this; "a" topic has 100+ partitions and it's weird to let them craft json which contains 100+ partitions for the same timestamp. Flink already does the thing; Flink uses global value across partitions for earliest/latest/timestamp, while it allows to set exact offset per partition. https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/#kafka-consumers-start-position-configuration ``` final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); FlinkKafkaConsumer myConsumer = new FlinkKafkaConsumer<>(...); myConsumer.setStartFromEarliest(); // start from the earliest record possible myConsumer.setStartFromLatest(); // start from the latest record myConsumer.setStartFromTimestamp(...); // start from specified epoch timestamp (milliseconds) myConsumer.setStartFromGroupOffsets(); // the default behaviour ``` ``` Map specificStartOffsets = new HashMap<>(); specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L); specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L); specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L); myConsumer.setStartFromSpecificOffsets(specificStartOffsets); ``` Given this PR is stale, I'll rebase this with master and raise the PR again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
SparkQA commented on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-836088555 **[Test build #138311 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138311/testReport)** for PR 32473 at commit [`34d0511`](https://github.com/apache/spark/commit/34d05113d307395bd1c1449651e09a8285fd0c6e). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] c21 commented on pull request #32480: [SPARK-35354][SQL] Replace BaseJoinExec with ShuffledJoin in CoalesceBucketsInJoin
c21 commented on pull request #32480: URL: https://github.com/apache/spark/pull/32480#issuecomment-836086921 Thank you @maropu for review! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] beliefer commented on a change in pull request #32442: [SPARK-35283][SQL] Support query some DDL with CTES
beliefer commented on a change in pull request #32442: URL: https://github.com/apache/spark/pull/32442#discussion_r629016341 ## File path: sql/core/src/test/resources/sql-tests/inputs/cte-ddl.sql ## @@ -0,0 +1,65 @@ +-- Test data. +CREATE NAMESPACE IF NOT EXISTS query_ddl_namespace; +USE NAMESPACE query_ddl_namespace; +CREATE TABLE test_show_tables(a INT, b STRING, c INT) using parquet; +CREATE TABLE test_show_table_properties (a INT, b STRING, c INT) USING parquet TBLPROPERTIES('p1'='v1', 'p2'='v2'); +CREATE TABLE test_show_partitions(a String, b Int, c String, d String) USING parquet PARTITIONED BY (c, d); +ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=1); +ALTER TABLE test_show_partitions ADD PARTITION (c='Us', d=2); +ALTER TABLE test_show_partitions ADD PARTITION (c='Cn', d=1); +CREATE VIEW view_1 AS SELECT * FROM test_show_tables; +CREATE VIEW view_2 AS SELECT * FROM test_show_tables WHERE c=1; +CREATE TEMPORARY VIEW test_show_views(e int) USING parquet; +CREATE GLOBAL TEMP VIEW test_global_show_views AS SELECT 1 as col1; + +-- SHOW NAMESPACES +SHOW NAMESPACES; +WITH s AS (SHOW NAMESPACES) SELECT * FROM s; +WITH s AS (SHOW NAMESPACES) SELECT * FROM s WHERE namespace = 'query_ddl_namespace'; +WITH s(n) AS (SHOW NAMESPACES) SELECT * FROM s WHERE n = 'query_ddl_namespace'; + +-- SHOW TABLES +SHOW TABLES; +WITH s AS (SHOW TABLES) SELECT * FROM s; +WITH s AS (SHOW TABLES) SELECT * FROM s WHERE tableName = 'test_show_tables'; +WITH s(ns, tn, t) AS (SHOW TABLES) SELECT * FROM s WHERE tn = 'test_show_tables'; Review comment: OK -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang closed pull request #32374: [WIP][SPARK-35253][BUILD][SQL] Upgrade Janino from 3.0.16 to 3.1.3
LuciferYang closed pull request #32374: URL: https://github.com/apache/spark/pull/32374 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on pull request #32374: [WIP][SPARK-35253][BUILD][SQL] Upgrade Janino from 3.0.16 to 3.1.3
LuciferYang commented on pull request #32374: URL: https://github.com/apache/spark/pull/32374#issuecomment-836082137 close this because SPARK-35253 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] LuciferYang commented on a change in pull request #32455: [SPARK-35253][SQL][BUILD] Bump up the janino version to v3.1.4
LuciferYang commented on a change in pull request #32455: URL: https://github.com/apache/spark/pull/32455#discussion_r629014929 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala ## @@ -1434,9 +1435,10 @@ object CodeGenerator extends Logging { private def updateAndGetCompilationStats(evaluator: ClassBodyEvaluator): ByteCodeStats = { // First retrieve the generated classes. val classes = { - val resultField = classOf[SimpleCompiler].getDeclaredField("result") - resultField.setAccessible(true) - val loader = resultField.get(evaluator).asInstanceOf[ByteArrayClassLoader] + val scField = classOf[ClassBodyEvaluator].getDeclaredField("sc") Review comment: @maropu Can we directly use `evaluator.getBytecodes.asScala` instead of line 1438 ~ line 1445? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still co
AmplabJenkins removed a comment on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836069987 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42835/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue r
AmplabJenkins commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836069987 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/42835/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] zhengruifeng commented on pull request #32350: [SPARK-35231][SQL] logical.Range override maxRowsPerPartition
zhengruifeng commented on pull request #32350: URL: https://github.com/apache/spark/pull/32350#issuecomment-836067509 Thank you so much! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836058502 Kubernetes integration test unable to build dist. exiting with code: 1 URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/42835/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] maropu commented on a change in pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
maropu commented on a change in pull request #32487: URL: https://github.com/apache/spark/pull/32487#discussion_r629004607 ## File path: dev/create-release/release-build.sh ## @@ -210,6 +210,8 @@ if [[ "$1" == "package" ]]; then PYSPARK_VERSION=`echo "$SPARK_VERSION" | sed -e "s/-/./" -e "s/SNAPSHOT/dev0/" -e "s/preview/dev/"` echo "__version__='$PYSPARK_VERSION'" > python/pyspark/version.py +export MAVEN_OPTS="-Xmx12000m" Review comment: nit: we can say `-Xmx12g`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
huaxingao commented on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-836051980 @dongjoon-hyun > Shall we change the grouping in order see the trend according to the block size? Sorry, I just saw your comment. I guess it might be a little better to pair up the results of `Without bloom filter` and `With bloom filter` so it's easier to see the improvement for bloom filter? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao commented on a change in pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
huaxingao commented on a change in pull request #32473: URL: https://github.com/apache/spark/pull/32473#discussion_r629004056 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/BloomFilterBenchmark.scala ## @@ -81,8 +80,57 @@ object BloomFilterBenchmark extends SqlBasedBenchmark { } } + private def writeParquetBenchmark(): Unit = { +withTempPath { dir => + val path = dir.getCanonicalPath + + runBenchmark(s"Parquet Write") { +val benchmark = new Benchmark(s"Write ${scaleFactor}M rows", N, output = output) +benchmark.addCase("Without bloom filter") { _ => + df.write.mode("overwrite").parquet(path + "/withoutBF") +} +benchmark.addCase("With bloom filter") { _ => + df.write.mode("overwrite") +.option(ParquetOutputFormat.BLOOM_FILTER_ENABLED + "#value", true) +.parquet(path + "/withBF") +} +benchmark.run() + } +} + } + + private def readParquetBenchmark(): Unit = { +val blockSizes = Seq(512 * 1024, 1024 * 1024, 2 * 1024 * 1024, 3 * 1024 * 1024, + 4 * 1024 * 1024, 5 * 1024 * 1024, 6 * 1024 * 1024, 7 * 1024 * 1024, + 8 * 1024 * 1024, 9 * 1024 * 1024, 10 * 1024 * 1024) +for (blocksize <- blockSizes) { + withTempPath { dir => +val path = dir.getCanonicalPath + +df.write.option("parquet.block.size", blocksize).parquet(path + "/withoutBF") Review comment: @wangyum Sorry, I am new to parquet. Somehow I didn't see parquet has compression size, seems only ORC has `orc.compress.size`? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on pull request #32399: [SPARK-35271][ML][PYSPARK] Fix: After CrossValidator/TrainValidationSplit fit raised error, some backgroud threads may still continue run or
SparkQA commented on pull request #32399: URL: https://github.com/apache/spark/pull/32399#issuecomment-836035623 **[Test build #138313 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/138313/testReport)** for PR 32399 at commit [`c6aa4c4`](https://github.com/apache/spark/commit/c6aa4c4ccc8b9103314d5efea148b71e19a560d4). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
AmplabJenkins commented on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836035119 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138310/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32473: [SPARK-35345][SQL] Add Parquet tests to BloomFilterBenchmark
AmplabJenkins removed a comment on pull request #32473: URL: https://github.com/apache/spark/pull/32473#issuecomment-836035114 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on pull request #32487: [SPARK-35358][BUILD] Increase maximum Java heap used for release build to avoid OOM
AmplabJenkins removed a comment on pull request #32487: URL: https://github.com/apache/spark/pull/32487#issuecomment-836035119 Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/138310/ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org