[GitHub] [spark] dongjoon-hyun closed pull request #24993: [SPARK-18299][SQL] Allow more aggregations on KeyValueGroupedDataset
dongjoon-hyun closed pull request #24993: [SPARK-18299][SQL] Allow more aggregations on KeyValueGroupedDataset URL: https://github.com/apache/spark/pull/24993 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin closed pull request #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
vanzin closed pull request #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #24993: [SPARK-18299][SQL] Allow more aggregations on KeyValueGroupedDataset
dongjoon-hyun commented on issue #24993: [SPARK-18299][SQL] Allow more aggregations on KeyValueGroupedDataset URL: https://github.com/apache/spark/pull/24993#issuecomment-512041697 @nooberfsh . What is your Apache JIRA ID? I want to add you to the Apache Spark Contributor group and assign SPARK-18299 to you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] vanzin commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
vanzin commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#issuecomment-512041567 Merging to master. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #24993: [SPARK-18299][SQL] Allow more aggregations on KeyValueGroupedDataset
dongjoon-hyun commented on issue #24993: [SPARK-18299][SQL] Allow more aggregations on KeyValueGroupedDataset URL: https://github.com/apache/spark/pull/24993#issuecomment-512041073 Thank you, @nooberfsh and @JoshRosen . This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] dongjoon-hyun commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully
dongjoon-hyun commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully URL: https://github.com/apache/spark/pull/25174#issuecomment-512039299 Thank you so much, @hvanhovell ! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval
AmplabJenkins removed a comment on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval URL: https://github.com/apache/spark/pull/25173#issuecomment-512037229 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107758/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval
AmplabJenkins commented on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval URL: https://github.com/apache/spark/pull/25173#issuecomment-512037229 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107758/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval
AmplabJenkins commented on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval URL: https://github.com/apache/spark/pull/25173#issuecomment-512037222 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval
AmplabJenkins removed a comment on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval URL: https://github.com/apache/spark/pull/25173#issuecomment-512037222 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval
SparkQA removed a comment on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval URL: https://github.com/apache/spark/pull/25173#issuecomment-511964088 **[Test build #107758 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107758/testReport)** for PR 25173 at commit [`9eaffa7`](https://github.com/apache/spark/commit/9eaffa70bad926fb595e8b89f19b1649ae245477). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval
SparkQA commented on issue #25173: [SPARK-28416][SQL] Use java.time API in timestampAddInterval URL: https://github.com/apache/spark/pull/25173#issuecomment-512036844 **[Test build #107758 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107758/testReport)** for PR 25173 at commit [`9eaffa7`](https://github.com/apache/spark/commit/9eaffa70bad926fb595e8b89f19b1649ae245477). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mccheah commented on issue #24796: [SPARK-27900][CORE] Add uncaught exception handler to the driver
mccheah commented on issue #24796: [SPARK-27900][CORE] Add uncaught exception handler to the driver URL: https://github.com/apache/spark/pull/24796#issuecomment-512029989 Yeah adding the uncaught exception hook shouldn't hurt here. As long as this doesn't eliminate the pod from the system (so that one can access the logs if necessary), it should be fine. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] ifilonenko commented on issue #24796: [SPARK-27900][CORE] Add uncaught exception handler to the driver
ifilonenko commented on issue #24796: [SPARK-27900][CORE] Add uncaught exception handler to the driver URL: https://github.com/apache/spark/pull/24796#issuecomment-512028704 I can see no reason why Kubernetes shouldn't include a kill on OOM. I agree that this should be extended for Kubernetes. These OOM issues do cause issues for our operators so this would be useful to report back a failed job. @skonto I'd prefer to see `-XX:+ExitOnOutOfMemoryError` be added to the kube entrypoint. at the very least to have parity with yarn. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512028227 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/12883/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512028221 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512028221 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512028227 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/12883/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
SparkQA commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512026634 **[Test build #107764 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107764/testReport)** for PR 24798 at commit [`581dba2`](https://github.com/apache/spark/commit/581dba2822a8c1d23f9da653e8c26c84b18cfa3e). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512025787 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107756/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512025783 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512025783 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
AmplabJenkins commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512025787 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107756/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
SparkQA removed a comment on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-511950480 **[Test build #107756 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107756/testReport)** for PR 24798 at commit [`0b5c029`](https://github.com/apache/spark/commit/0b5c0290c6aca26030f3f4b3c93069d8c2366969). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
SparkQA commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-512024041 **[Test build #107756 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107756/testReport)** for PR 24798 at commit [`0b5c029`](https://github.com/apache/spark/commit/0b5c0290c6aca26030f3f4b3c93069d8c2366969). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] rdblue commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
rdblue commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#discussion_r304148360 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala ## @@ -78,7 +80,8 @@ case class CreateTableAsSelectExec( } Utils.tryWithSafeFinallyAndFailureCallbacks({ - catalog.createTable(ident, query.schema, partitioning.toArray, properties.asJava) match { + catalog.createTable( +ident, query.schema, partitioning.toArray, properties.asJava) match { Review comment: I agree. The scope of this suggestion is larger than this PR. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials
AmplabJenkins removed a comment on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials URL: https://github.com/apache/spark/pull/25176#issuecomment-512005604 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials
AmplabJenkins commented on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials URL: https://github.com/apache/spark/pull/25176#issuecomment-512006177 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials
AmplabJenkins removed a comment on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials URL: https://github.com/apache/spark/pull/25176#issuecomment-512005450 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
squito commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304138587 ## File path: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ## @@ -1723,4 +1722,48 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg assert(manager.resourceOffer("exec2", "host2", ANY).isEmpty) assert(manager.resourceOffer("exec3", "host3", ANY).isEmpty) } + + test("SPARK-26755 Ensure that a speculative task obeys the original locality preferences") { +sc = new SparkContext("local", "test") +sched = new FakeTaskScheduler(sc, ("exec1", "host1"), + ("exec2", "host2"), ("exec3", "host3"), ("exec4", "host4")) +// Create 3 tasks with locality preferences +val taskSet = FakeTask.createTaskSet(3, + Seq(TaskLocation("host1"), TaskLocation("host3")), + Seq(TaskLocation("host2")), + Seq(TaskLocation("host3"))) +// Set the speculation multiplier to be 0 so speculative tasks are launched immediately +sc.conf.set(config.SPECULATION_MULTIPLIER, 0.0) +sc.conf.set(config.SPECULATION_ENABLED, true) +sc.conf.set(config.SPECULATION_QUANTILE, 0.5) +val clock = new ManualClock() +val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock) +val accumUpdatesByTask: Array[Seq[AccumulatorV2[_, _]]] = taskSet.tasks.map { task => + task.metrics.internalAccums +} +// Offer resources for 3 tasks to start +Seq("exec1" -> "host1", "exec2" -> "host2", "exec3" -> "host3").foreach { case (exec, host) => + val taskOption = manager.resourceOffer(exec, host, NO_PREF) + assert(taskOption.isDefined) + assert(taskOption.get.executorId === exec) +} +assert(sched.startedTasks.toSet === Set(0, 1, 2)) +clock.advance(1) +// Finish one task and mark the others as speculatable +manager.handleSuccessfulTask(2, createTaskResult(2, accumUpdatesByTask(2))) +assert(sched.endedTasks(2) === Success) +clock.advance(1) +assert(manager.checkSpeculatableTasks(0)) +assert(sched.speculativeTasks.toSet === Set(0, 1)) +// Ensure that the speculatable tasks obey the original locality preferences +assert(manager.resourceOffer("exec4", "host4", NODE_LOCAL).isEmpty) +assert(manager.resourceOffer("exec2", "host2", NODE_LOCAL).isEmpty) +assert(manager.resourceOffer("exec3", "host3", NODE_LOCAL).isDefined) +assert(manager.resourceOffer("exec4", "host4", ANY).isDefined) Review comment: any particular reason to pull this out into a separate test case? Seems like it could be combined. Its fine if there is a good reason, but I don't like a proliferation of test cases that are all doing more or less the same thing. It seems the only thing which you aren't doing here, but you are doing above, is checking the taskId etc. of the speculative tasks. also another thing missing from both tests -- there is no check that we do not schedule a speculative task on the same host as the original task, even despite locality preferences. (I realize some of these tests were missing before, but this logic is getting a little trickier now, and maybe those tests always should have been there) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
squito commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304136584 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -1054,6 +1043,19 @@ private[spark] object TaskSetManager { val TASK_SIZE_TO_WARN_KIB = 1000 } +// Set of pending tasks for various levels of locality: executor, host, rack, +// noPrefs and anyPrefs. These collections are actually +// treated as stacks, in which new tasks are added to the end of the +// ArrayBuffer and removed from the end. This makes it faster to detect +// tasks that repeatedly fail because whenever a task failed, it is put +// back at the head of the stack. These collections may contain duplicates +// for two reasons: +// (1): Tasks are only removed lazily; when a task is launched, it remains +// in all the pending lists except the one that it was launched from. +// (2): Tasks may be re-added to these lists multiple times as a result +// of failures. +// Duplicates are handled in dequeueTaskFromList, which ensures that a +// task hasn't already started running before launching it. Review comment: turn this into a scaladoc comment so its shows up in IDEs for PendingTasksByLocality ``` /** * Set of ... ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-512005480 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] nvander1 commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API
nvander1 commented on issue #24232: [SPARK-27297] [SQL] Add higher order functions to scala API URL: https://github.com/apache/spark/pull/24232#issuecomment-512005589 I had only launched the ./build/sbt and run compile in interactive mode without any flags to the build. Thanks for helping me find the build config @HyukjinKwon . I'll test locally with that and report back later. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-512005484 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107763/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials
AmplabJenkins commented on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials URL: https://github.com/apache/spark/pull/25176#issuecomment-512005604 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-512005480 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-512005484 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107763/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
SparkQA removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-511996567 **[Test build #107763 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107763/testReport)** for PR 25175 at commit [`566fc84`](https://github.com/apache/spark/commit/566fc84ab87470dfac1103854bc550b33d593bdd). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials
AmplabJenkins commented on issue #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials URL: https://github.com/apache/spark/pull/25176#issuecomment-512005450 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
SparkQA commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-512005183 **[Test build #107763 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107763/testReport)** for PR 25175 at commit [`566fc84`](https://github.com/apache/spark/commit/566fc84ab87470dfac1103854bc550b33d593bdd). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] modi95 opened a new pull request #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials
modi95 opened a new pull request #25176: [SPARK-28417][Spark Core] Wrap File Glob Resolution in DoAs to use ProxyUser Credentials URL: https://github.com/apache/spark/pull/25176 ## What changes were proposed in this pull request? This PR addresses SPARK-28417 by wrapping the Glob file path resolution in a DoAs so that ProxyUser credentials are used for the glob resolution. ## How was this patch tested? This patch was tested using the unit tests and via manual testing. This was a bug encountered by Spark users at Uber. The bug was patched and is currently being used in prod at Uber. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
AmplabJenkins removed a comment on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#issuecomment-512004788 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
AmplabJenkins removed a comment on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#issuecomment-512004794 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107757/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
SparkQA removed a comment on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#issuecomment-511961467 **[Test build #107757 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107757/testReport)** for PR 24817 at commit [`6154bf4`](https://github.com/apache/spark/commit/6154bf486e68dbb5a4c16dc9e71030cc20d8ca58). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
AmplabJenkins commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#issuecomment-512004794 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107757/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
AmplabJenkins commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#issuecomment-512004788 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service.
SparkQA commented on issue #24817: [SPARK-27963][core] Allow dynamic allocation without a shuffle service. URL: https://github.com/apache/spark/pull/24817#issuecomment-512004162 **[Test build #107757 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107757/testReport)** for PR 24817 at commit [`6154bf4`](https://github.com/apache/spark/commit/6154bf486e68dbb5a4c16dc9e71030cc20d8ca58). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on issue #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on issue #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#issuecomment-512001628 @squito @Ngone51 Have updated the PR with the changes. Thank you once again for your valuable reviews. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304135359 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -143,25 +144,18 @@ private[spark] class TaskSetManager( // of failures. // Duplicates are handled in dequeueTaskFromList, which ensures that a // task hasn't already started running before launching it. - private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]] - // Set of pending tasks for each host. Similar to pendingTasksForExecutor, - // but at host level. - private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]] - - // Set of pending tasks for each rack -- similar to the above. - private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]] - - // Set containing pending tasks with no locality preferences. - private[scheduler] var pendingTasksWithNoPrefs = new ArrayBuffer[Int] - - // Set containing all pending tasks (also used as a stack, as above). - private val allPendingTasks = new ArrayBuffer[Int] + private[scheduler] val pendingTasks = new PendingTasksByLocality() // Tasks that can be speculated. Since these will be a small fraction of total - // tasks, we'll just hold them in a HashSet. + // tasks, we'll just hold them in a HashSet. The HashSet here ensures that we do not add + // duplicate speculative tasks. private[scheduler] val speculatableTasks = new HashSet[Int] Review comment: So I ran a Join query on a dataset of size 10TB without this change and out of 10 tasks for the ShuffleMapStage, the maximum number of speculatable tasks that was noted was close to 7900-8000 at a point. That is when we start seeing the bottleneck on the scheduler lock. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] wypoon commented on issue #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
wypoon commented on issue #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#issuecomment-512000423 @attilapiros regarding type aliases: I think the injunction against them applies to writing Java compatible APIs, which does not apply in this case. ExecutorMetricsPoller is an internal implementation detail. I agree that TCMP is cryptic, but I was hoping that the preceding comment makes clear that it stands for "Task Count and Metric Peaks". This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304098114 ## File path: core/src/main/scala/org/apache/spark/executor/ExecutorMetricsPoller.scala ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.executor + +import java.lang.Long.{MAX_VALUE => LONG_MAX_VALUE} +import java.util.concurrent.{ConcurrentHashMap, TimeUnit} +import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray} + +import scala.collection.mutable.HashMap + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryManager +import org.apache.spark.metrics.ExecutorMetricType +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * :: DeveloperApi :: Review comment: why is this a developerapi? I think it should just be internal This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304113971 ## File path: core/src/main/scala/org/apache/spark/executor/ExecutorMetricsPoller.scala ## @@ -0,0 +1,199 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.executor + +import java.lang.Long.{MAX_VALUE => LONG_MAX_VALUE} +import java.util.concurrent.{ConcurrentHashMap, TimeUnit} +import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray} + +import scala.collection.mutable.HashMap + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryManager +import org.apache.spark.metrics.ExecutorMetricType +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * :: DeveloperApi :: + * A class that polls executor metrics, and tracks their peaks per task and per stage. + * Each executor keeps an instance of this class. + * The poll method polls the executor metrics, and is either run in its own thread or + * called by the executor's heartbeater thread, depending on configuration. + * The class keeps two ConcurrentHashMaps that are accessed (via its methods) by the + * executor's task runner threads concurrently with the polling thread. One thread may + * update one of these maps while another reads it, so the reading thread may not get + * the latest metrics, but this is ok. + * + * @param memoryManager the memory manager used by the executor. + * @param pollingInterval the polling interval in milliseconds. + */ +@DeveloperApi +private[spark] class ExecutorMetricsPoller( +memoryManager: MemoryManager, +pollingInterval: Long) + extends Logging { + + type StageKey = (Int, Int) + // tuple for Task Count and Metric Peaks + type TCMP = (AtomicLong, AtomicLongArray) Review comment: I'm not sure I see the value in the extra util methods, but I agree about having a small case class -- if nothing else, it would make the code a lot cleared if instead of `v._2` it was `taskCountAndMetrics.metrics` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304123225 ## File path: core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala ## @@ -342,6 +320,87 @@ class ExecutorSuite extends SparkFunSuite } } + test("Send task executor metrics in DirectTaskResult") { +// Run a successful, trivial result task +// We need to ensure, however, that executor metrics are polled after the task is started +// so this requires some coordination using ExecutorSuiteHelper. +val conf = new SparkConf().setMaster("local").setAppName("executor suite test") +sc = new SparkContext(conf) +val serializer = SparkEnv.get.closureSerializer.newInstance() +ExecutorSuiteHelper.latches = new ExecutorSuiteHelper +val resultFunc = + (context: TaskContext, itr: Iterator[Int]) => { +ExecutorSuiteHelper.latches.latch1.await(300, TimeUnit.MILLISECONDS) +ExecutorSuiteHelper.latches.latch2.countDown() +ExecutorSuiteHelper.latches.latch3.await(500, TimeUnit.MILLISECONDS) Review comment: would help to have a short comment about the purpose of these latches. Eg. "latch2 tells the metricsPoller to poll and send results, and latch3 waits till the poller has sent its results" (I don't see the point of latch1, so you'd probably switch to just using latch1 & latch2) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304127412 ## File path: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala ## @@ -600,8 +611,16 @@ private[spark] object JsonProtocolSuite extends Assertions { assert(stageAttemptId1 === stageAttemptId2) assertSeqEquals[AccumulableInfo](updates1, updates2, (a, b) => a.equals(b)) }) -assertOptionEquals(e1.executorUpdates, e2.executorUpdates, -(e1: ExecutorMetrics, e2: ExecutorMetrics) => assertEquals(e1, e2)) +assertSeqEquals[((Int, Int), ExecutorMetrics)]( + e1.executorUpdates.toSeq.sortWith((x, y) => lexOrder(x._1, y._1)), + e2.executorUpdates.toSeq.sortWith((x, y) => lexOrder(x._1, y._1)), + (a, b) => { +val (k1, v1) = a +val (k2, v2) = b +assert(k1 === k2) +assertEquals(v1, v2) + } Review comment: can use pattern matching earlier to clean up a bit more: ```scala { case ((k1, v1), (k2, v2)) => assert(k1 === k2) assertEquals(v1, v2) } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304098528 ## File path: core/src/main/scala/org/apache/spark/executor/ExecutorMetricsPoller.scala ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.executor + +import java.lang.Long.{MAX_VALUE => LONG_MAX_VALUE} +import java.util.concurrent.{ConcurrentHashMap, TimeUnit} +import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray} + +import scala.collection.mutable.HashMap + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryManager +import org.apache.spark.metrics.ExecutorMetricType +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * :: DeveloperApi :: + * A class that polls executor metrics, and tracks their peaks per task and per stage. + * Each executor keeps an instance of this class. + * The poll method polls the executor metrics, and is either run in its own thread or + * called by the executor's heartbeater thread, depending on configuration. + * The class keeps two ConcurrentHashMaps that are accessed (via its methods) by the + * executor's task runner threads concurrently with the polling thread. One thread may + * update one of these maps while another reads it, so the reading thread may not get + * the latest metrics, but this is ok. + * + * @param memoryManager the memory manager used by the executor. + * @param pollingInterval the polling interval in milliseconds. + */ +@DeveloperApi +private[spark] class ExecutorMetricsPoller( +memoryManager: MemoryManager, +pollingInterval: Long) + extends Logging { Review comment: nit: move this up to the line above ```scala pollingInterval: Long) extends Logging { ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304125444 ## File path: core/src/test/scala/org/apache/spark/util/JsonProtocolSuite.scala ## @@ -600,8 +611,16 @@ private[spark] object JsonProtocolSuite extends Assertions { assert(stageAttemptId1 === stageAttemptId2) assertSeqEquals[AccumulableInfo](updates1, updates2, (a, b) => a.equals(b)) }) -assertOptionEquals(e1.executorUpdates, e2.executorUpdates, -(e1: ExecutorMetrics, e2: ExecutorMetrics) => assertEquals(e1, e2)) +assertSeqEquals[((Int, Int), ExecutorMetrics)]( + e1.executorUpdates.toSeq.sortWith((x, y) => lexOrder(x._1, y._1)), + e2.executorUpdates.toSeq.sortWith((x, y) => lexOrder(x._1, y._1)), Review comment: scala provides a default ordering on tuples, which is the same as your `lexOrder`, so I think you can get rid of that and this becomes `.sortBy(_._1)` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304122257 ## File path: core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala ## @@ -342,6 +320,87 @@ class ExecutorSuite extends SparkFunSuite } } + test("Send task executor metrics in DirectTaskResult") { +// Run a successful, trivial result task +// We need to ensure, however, that executor metrics are polled after the task is started +// so this requires some coordination using ExecutorSuiteHelper. +val conf = new SparkConf().setMaster("local").setAppName("executor suite test") +sc = new SparkContext(conf) +val serializer = SparkEnv.get.closureSerializer.newInstance() +ExecutorSuiteHelper.latches = new ExecutorSuiteHelper +val resultFunc = + (context: TaskContext, itr: Iterator[Int]) => { +ExecutorSuiteHelper.latches.latch1.await(300, TimeUnit.MILLISECONDS) +ExecutorSuiteHelper.latches.latch2.countDown() +ExecutorSuiteHelper.latches.latch3.await(500, TimeUnit.MILLISECONDS) +itr.size + } +val rdd = new RDD[Int](sc, Nil) { + override def compute(split: Partition, context: TaskContext): Iterator[Int] = { +val l = List(1) +l.iterator Review comment: can just be `Iterator(1)` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304109127 ## File path: core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ## @@ -268,12 +278,16 @@ private[spark] class EventLoggingListener( override def onExecutorMetricsUpdate(event: SparkListenerExecutorMetricsUpdate): Unit = { if (shouldLogStageExecutorMetrics) { - // For the active stages, record any new peak values for the memory metrics for the executor - event.executorUpdates.foreach { executorUpdates => -liveStageExecutorMetrics.values.foreach { peakExecutorMetrics => - val peakMetrics = peakExecutorMetrics.getOrElseUpdate( -event.execId, new ExecutorMetrics()) - peakMetrics.compareAndUpdatePeakValues(executorUpdates) + event.executorUpdates.foreach { case (stageKey1, peaks) => +liveStageExecutorMetrics.foreach { case (stageKey2, metricsPerExecutor) => + // If the update came from the driver, stageKey1 will be the dummy key (-1, -1), + // so record those peaks for all active stages. + // Otherwise, record the peaks for the matching stage. + if (stageKey1 == DRIVER_STAGE_KEY || stageKey1 == stageKey2) { +val metrics = metricsPerExecutor.getOrElseUpdate( + event.execId, new ExecutorMetrics()) +metrics.compareAndUpdatePeakValues(peaks) Review comment: minor, can you rename `peaks` -> `newPeaks` ? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304100902 ## File path: core/src/main/scala/org/apache/spark/executor/ExecutorMetricsPoller.scala ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.executor + +import java.lang.Long.{MAX_VALUE => LONG_MAX_VALUE} +import java.util.concurrent.{ConcurrentHashMap, TimeUnit} +import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray} + +import scala.collection.mutable.HashMap + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryManager +import org.apache.spark.metrics.ExecutorMetricType +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * :: DeveloperApi :: + * A class that polls executor metrics, and tracks their peaks per task and per stage. + * Each executor keeps an instance of this class. + * The poll method polls the executor metrics, and is either run in its own thread or + * called by the executor's heartbeater thread, depending on configuration. + * The class keeps two ConcurrentHashMaps that are accessed (via its methods) by the + * executor's task runner threads concurrently with the polling thread. One thread may + * update one of these maps while another reads it, so the reading thread may not get + * the latest metrics, but this is ok. + * + * @param memoryManager the memory manager used by the executor. + * @param pollingInterval the polling interval in milliseconds. + */ +@DeveloperApi +private[spark] class ExecutorMetricsPoller( +memoryManager: MemoryManager, +pollingInterval: Long) + extends Logging { + + type StageKey = (Int, Int) + // tuple for Task Count and Metric Peaks + type TCMP = (AtomicLong, AtomicLongArray) + + // Map of (stageId, stageAttemptId) to (count of running tasks, executor metric peaks) + private val stageTCMP = new ConcurrentHashMap[StageKey, TCMP] + + // Map of taskId to executor metric peaks + private val taskMetricPeaks = new ConcurrentHashMap[Long, AtomicLongArray] + + private val poller = +if (pollingInterval > 0) { + Some(ThreadUtils.newDaemonSingleThreadScheduledExecutor("executor-metrics-poller")) +} else { + None +} + + /** + * Function to poll executor metrics. + * On start, if pollingInterval is positive, this is scheduled to run at that interval. + * Otherwise, this is called by the reportHeartBeat function defined in Executor and passed + * to its Heartbeater. + */ + def poll(): Unit = { +// Note: Task runner threads may update stageTCMP or read from taskMetricPeaks concurrently +// with this function via calls to methods of this class. + +// get the latest values for the metrics +val latestMetrics = ExecutorMetrics.getCurrentMetrics(memoryManager) + +def updatePeaks(metrics: AtomicLongArray): Unit = { + (0 until metrics.length).foreach { i => +metrics.getAndAccumulate(i, latestMetrics(i), math.max) + } +} + +// for each active stage, update the peaks +stageTCMP.forEachValue(LONG_MAX_VALUE, v => updatePeaks(v._2)) + +// for each running task, update the peaks +taskMetricPeaks.forEachValue(LONG_MAX_VALUE, updatePeaks) + } + + /** Starts the polling thread. */ + def start(): Unit = { +poller.foreach { exec => + val pollingTask: Runnable = () => Utils.logUncaughtExceptions(poll()) + exec.scheduleAtFixedRate(pollingTask, 0L, pollingInterval, TimeUnit.MILLISECONDS) +} + } + + /** + * Called by TaskRunner#run. + * + * @param taskId the id of the task being run. + * @param stageId the id of the stage the task belongs to. + * @param stageAttemptId the attempt number of the stage the task belongs to. + */ + def onTaskStart(taskId: Long, stageId: Int, stageAttemptId: Int): Unit = { +// Put an entry in taskMetricPeaks for the task. +taskMetricPeaks.put(taskId, new AtomicLongArray(ExecutorMetricType.numMetrics)) + +// Put a new entry in stageTCMP for the stage if there isn't one already. +// Increment the task
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304118293 ## File path: core/src/main/scala/org/apache/spark/executor/ExecutorMetricsPoller.scala ## @@ -0,0 +1,196 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.executor + +import java.lang.Long.{MAX_VALUE => LONG_MAX_VALUE} +import java.util.concurrent.{ConcurrentHashMap, TimeUnit} +import java.util.concurrent.atomic.{AtomicLong, AtomicLongArray} + +import scala.collection.mutable.HashMap + +import org.apache.spark.annotation.DeveloperApi +import org.apache.spark.internal.Logging +import org.apache.spark.memory.MemoryManager +import org.apache.spark.metrics.ExecutorMetricType +import org.apache.spark.util.{ThreadUtils, Utils} + +/** + * :: DeveloperApi :: + * A class that polls executor metrics, and tracks their peaks per task and per stage. + * Each executor keeps an instance of this class. + * The poll method polls the executor metrics, and is either run in its own thread or + * called by the executor's heartbeater thread, depending on configuration. + * The class keeps two ConcurrentHashMaps that are accessed (via its methods) by the + * executor's task runner threads concurrently with the polling thread. One thread may + * update one of these maps while another reads it, so the reading thread may not get + * the latest metrics, but this is ok. + * + * @param memoryManager the memory manager used by the executor. Review comment: We should also explain why there are metrics tracked per stage and also per task. I think mostly just your comment here: https://issues.apache.org/jira/browse/SPARK-26329?focusedCommentId=16745313=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16745313 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics.
squito commented on a change in pull request #23767: [SPARK-26329][CORE] Faster polling of executor memory metrics. URL: https://github.com/apache/spark/pull/23767#discussion_r304123384 ## File path: core/src/test/scala/org/apache/spark/executor/ExecutorSuite.scala ## @@ -342,6 +320,87 @@ class ExecutorSuite extends SparkFunSuite } } + test("Send task executor metrics in DirectTaskResult") { +// Run a successful, trivial result task +// We need to ensure, however, that executor metrics are polled after the task is started +// so this requires some coordination using ExecutorSuiteHelper. +val conf = new SparkConf().setMaster("local").setAppName("executor suite test") +sc = new SparkContext(conf) +val serializer = SparkEnv.get.closureSerializer.newInstance() +ExecutorSuiteHelper.latches = new ExecutorSuiteHelper +val resultFunc = + (context: TaskContext, itr: Iterator[Int]) => { +ExecutorSuiteHelper.latches.latch1.await(300, TimeUnit.MILLISECONDS) +ExecutorSuiteHelper.latches.latch2.countDown() +ExecutorSuiteHelper.latches.latch3.await(500, TimeUnit.MILLISECONDS) +itr.size + } +val rdd = new RDD[Int](sc, Nil) { + override def compute(split: Partition, context: TaskContext): Iterator[Int] = { +val l = List(1) +l.iterator + } + override protected def getPartitions: Array[Partition] = { +Array(new SimplePartition) + } +} +val taskDescription = createResultTaskDescription(serializer, resultFunc, rdd, 0) + +val mockBackend = mock[ExecutorBackend] +when(mockBackend.statusUpdate(any(), meq(TaskState.RUNNING), any())) + .thenAnswer(new Answer[Unit] { +override def answer(invocationOnMock: InvocationOnMock): Unit = { + ExecutorSuiteHelper.latches.latch1.countDown() +} + }) Review comment: as I mentioned above, i don't see any point to having the call to `statusUpdate` set a latch. the statusUpdate occurs in the same thread as the taskRunner, so you know its completed by the time `resultFunc` is called. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304133121 ## File path: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ## @@ -1655,4 +1657,70 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg // get removed inside TaskSchedulerImpl later. assert(availableResources(GPU) sameElements Array("0", "1", "2", "3")) } + + test("SPARK-26755 Ensure that a speculative task is submitted only once for execution") { +sc = new SparkContext("local", "test") +sched = new FakeTaskScheduler(sc, ("exec1", "host1"), ("exec2", "host2")) +val taskSet = FakeTask.createTaskSet(4) +// Set the speculation multiplier to be 0 so speculative tasks are launched immediately +sc.conf.set(config.SPECULATION_MULTIPLIER, 0.0) +sc.conf.set(config.SPECULATION_ENABLED, true) +sc.conf.set(config.SPECULATION_QUANTILE, 0.5) +val clock = new ManualClock() +val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock) +val accumUpdatesByTask: Array[Seq[AccumulatorV2[_, _]]] = taskSet.tasks.map { task => + task.metrics.internalAccums +} +// Offer resources for 4 tasks to start +for ((k, v) <- List( + "exec1" -> "host1", + "exec1" -> "host1", + "exec2" -> "host2", + "exec2" -> "host2")) { + val taskOption = manager.resourceOffer(k, v, NO_PREF) + assert(taskOption.isDefined) + val task = taskOption.get + assert(task.executorId === k) +} +assert(sched.startedTasks.toSet === Set(0, 1, 2, 3)) +clock.advance(1) +// Complete the first 2 tasks and leave the other 2 tasks in running +for (id <- Set(0, 1)) { + manager.handleSuccessfulTask(id, createTaskResult(id, accumUpdatesByTask(id))) + assert(sched.endedTasks(id) === Success) +} +// checkSpeculatableTasks checks that the task runtime is greater than the threshold for +// speculating. Since we use a threshold of 0 for speculation, tasks need to be running for +// > 0ms, so advance the clock by 1ms here. +clock.advance(1) +assert(manager.checkSpeculatableTasks(0)) +assert(sched.speculativeTasks.toSet === Set(2, 3)) +assert(manager.copiesRunning(2) === 1) +assert(manager.copiesRunning(3) === 1) + +// Offer resource to start the speculative attempt for the running task Review comment: Have written a separate unit test to account for the locality preference feature. Thank you. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304133210 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -1064,7 +979,8 @@ private[spark] class TaskSetManager( val info = taskInfos(tid) val index = info.index if (!successful(index) && copiesRunning(index) == 1 && info.timeRunning(time) > threshold && - !speculatableTasks.contains(index)) { +!speculatableTasks.contains(index)) { + addPendingTask(index, speculative = true) logInfo( "Marking task %d in stage %s (on %s) as speculatable because it ran more than %.0f ms" .format(index, taskSet.id, info.host, threshold)) Review comment: Makes sense, have updated the comment. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304132725 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -234,63 +228,41 @@ private[spark] class TaskSetManager( /** Add a task to all the pending-task lists that it should be on. */ private[spark] def addPendingTask( index: Int, - resolveRacks: Boolean = true): Unit = { + resolveRacks: Boolean = true, + speculative: Boolean = false): Unit = { Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304132824 ## File path: core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala ## @@ -1655,4 +1657,70 @@ class TaskSetManagerSuite extends SparkFunSuite with LocalSparkContext with Logg // get removed inside TaskSchedulerImpl later. assert(availableResources(GPU) sameElements Array("0", "1", "2", "3")) } + + test("SPARK-26755 Ensure that a speculative task is submitted only once for execution") { +sc = new SparkContext("local", "test") +sched = new FakeTaskScheduler(sc, ("exec1", "host1"), ("exec2", "host2")) +val taskSet = FakeTask.createTaskSet(4) +// Set the speculation multiplier to be 0 so speculative tasks are launched immediately +sc.conf.set(config.SPECULATION_MULTIPLIER, 0.0) +sc.conf.set(config.SPECULATION_ENABLED, true) +sc.conf.set(config.SPECULATION_QUANTILE, 0.5) +val clock = new ManualClock() +val manager = new TaskSetManager(sched, taskSet, MAX_TASK_FAILURES, clock = clock) +val accumUpdatesByTask: Array[Seq[AccumulatorV2[_, _]]] = taskSet.tasks.map { task => + task.metrics.internalAccums +} +// Offer resources for 4 tasks to start +for ((k, v) <- List( + "exec1" -> "host1", + "exec1" -> "host1", + "exec2" -> "host2", + "exec2" -> "host2")) { Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304132776 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -302,16 +274,22 @@ private[spark] class TaskSetManager( private def dequeueTaskFromList( execId: String, host: String, - list: ArrayBuffer[Int]): Option[Int] = { + list: ArrayBuffer[Int], + speculative: Boolean = false): Option[Int] = { var indexOffset = list.size while (indexOffset > 0) { indexOffset -= 1 val index = list(indexOffset) - if (!isTaskBlacklistedOnExecOrNode(index, execId, host)) { + if (!isTaskBlacklistedOnExecOrNode(index, execId, host) && +!(speculative && hasAttemptOnHost(index, host))) { Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304132690 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -143,25 +144,18 @@ private[spark] class TaskSetManager( // of failures. // Duplicates are handled in dequeueTaskFromList, which ensures that a // task hasn't already started running before launching it. - private val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]] - // Set of pending tasks for each host. Similar to pendingTasksForExecutor, - // but at host level. - private val pendingTasksForHost = new HashMap[String, ArrayBuffer[Int]] - - // Set of pending tasks for each rack -- similar to the above. - private val pendingTasksForRack = new HashMap[String, ArrayBuffer[Int]] - - // Set containing pending tasks with no locality preferences. - private[scheduler] var pendingTasksWithNoPrefs = new ArrayBuffer[Int] - - // Set containing all pending tasks (also used as a stack, as above). - private val allPendingTasks = new ArrayBuffer[Int] + private[scheduler] val pendingTasks = new PendingTasksByLocality() // Tasks that can be speculated. Since these will be a small fraction of total - // tasks, we'll just hold them in a HashSet. + // tasks, we'll just hold them in a HashSet. The HashSet here ensures that we do not add + // duplicate speculative tasks. Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304132653 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -143,25 +144,18 @@ private[spark] class TaskSetManager( // of failures. // Duplicates are handled in dequeueTaskFromList, which ensures that a // task hasn't already started running before launching it. Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
pgandhi999 commented on a change in pull request #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#discussion_r304132612 ## File path: core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala ## @@ -330,128 +308,65 @@ private[spark] class TaskSetManager( } } - /** - * Return a speculative task for a given executor if any are available. The task should not have - * an attempt running on this host, in case the host is slow. In addition, the task should meet - * the given locality constraint. - */ - // Labeled as protected to allow tests to override providing speculative tasks if necessary - protected def dequeueSpeculativeTask(execId: String, host: String, locality: TaskLocality.Value) -: Option[(Int, TaskLocality.Value)] = - { -speculatableTasks.retain(index => !successful(index)) // Remove finished tasks from set - -def canRunOnHost(index: Int): Boolean = { - !hasAttemptOnHost(index, host) && -!isTaskBlacklistedOnExecOrNode(index, execId, host) -} - -if (!speculatableTasks.isEmpty) { - // Check for process-local tasks; note that tasks can be process-local - // on multiple nodes when we replicate cached blocks, as in Spark Streaming - for (index <- speculatableTasks if canRunOnHost(index)) { -val prefs = tasks(index).preferredLocations -val executors = prefs.flatMap(_ match { - case e: ExecutorCacheTaskLocation => Some(e.executorId) - case _ => None -}) -if (executors.contains(execId)) { - speculatableTasks -= index - return Some((index, TaskLocality.PROCESS_LOCAL)) -} - } - - // Check for node-local tasks - if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) { -for (index <- speculatableTasks if canRunOnHost(index)) { - val locations = tasks(index).preferredLocations.map(_.host) - if (locations.contains(host)) { -speculatableTasks -= index -return Some((index, TaskLocality.NODE_LOCAL)) - } -} - } - - // Check for no-preference tasks - if (TaskLocality.isAllowed(locality, TaskLocality.NO_PREF)) { -for (index <- speculatableTasks if canRunOnHost(index)) { - val locations = tasks(index).preferredLocations - if (locations.size == 0) { -speculatableTasks -= index -return Some((index, TaskLocality.PROCESS_LOCAL)) - } -} - } - - // Check for rack-local tasks - if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) { -for (rack <- sched.getRackForHost(host)) { - for (index <- speculatableTasks if canRunOnHost(index)) { -val racks = tasks(index).preferredLocations.map(_.host).flatMap(sched.getRackForHost) -if (racks.contains(rack)) { - speculatableTasks -= index - return Some((index, TaskLocality.RACK_LOCAL)) -} - } -} - } - - // Check for non-local tasks - if (TaskLocality.isAllowed(locality, TaskLocality.ANY)) { -for (index <- speculatableTasks if canRunOnHost(index)) { - speculatableTasks -= index - return Some((index, TaskLocality.ANY)) -} - } -} - -None - } - /** * Dequeue a pending task for a given node and return its index and locality level. * Only search for tasks matching the given locality constraint. * * @return An option containing (task index within the task set, locality, is speculative?) */ private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value) -: Option[(Int, TaskLocality.Value, Boolean)] = - { -for (index <- dequeueTaskFromList(execId, host, getPendingTasksForExecutor(execId))) { - return Some((index, TaskLocality.PROCESS_LOCAL, false)) + : Option[(Int, TaskLocality.Value, Boolean)] = { Review comment: Done This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-511995847 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/12882/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins removed a comment on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-511995842 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
SparkQA commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-511996567 **[Test build #107763 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107763/testReport)** for PR 25175 at commit [`566fc84`](https://github.com/apache/spark/commit/566fc84ab87470dfac1103854bc550b33d593bdd). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] brkyvz commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
brkyvz commented on issue #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#issuecomment-511996105 Approach and interface LGTM! +9000 on "keep[ing] commits smaller and more focused." in the future. Would really help speed up the development cycle. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-511995847 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/12882/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
AmplabJenkins commented on issue #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175#issuecomment-511995842 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] brkyvz commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
brkyvz commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#discussion_r304127699 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CreateTableExec.scala ## @@ -20,7 +20,7 @@ package org.apache.spark.sql.execution.datasources.v2 import scala.collection.JavaConverters._ import org.apache.spark.rdd.RDD -import org.apache.spark.sql.catalog.v2.{Identifier, TableCatalog} +import org.apache.spark.sql.catalog.v2.{Identifier, StagingTableCatalog, TableCatalog} Review comment: spurious change? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] brkyvz commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
brkyvz commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#discussion_r304127549 ## File path: sql/core/src/test/scala/org/apache/spark/sql/sources/v2/TestInMemoryTableCatalog.scala ## @@ -196,7 +196,103 @@ private class InMemoryTable( } } -private class BufferedRows extends WriterCommitMessage with InputPartition with Serializable { +object TestInMemoryTableCatalog { + val SIMULATE_FAILED_WRITE_OPTION = "spark.sql.test.simulateFailedWrite" + val SIMULATE_FAILED_CREATE_PROPERTY = "spark.sql.test.simulateFailedCreate" + + def maybeSimulateFailedTableCreation(tableProperties: util.Map[String, String]): Unit = { +if (tableProperties.containsKey(TestInMemoryTableCatalog.SIMULATE_FAILED_CREATE_PROPERTY) + && tableProperties.get(TestInMemoryTableCatalog.SIMULATE_FAILED_CREATE_PROPERTY) + .equalsIgnoreCase("true")) { + throw new IllegalStateException("Manual create table failure.") +} + } + + def maybeSimulateFailedTableWrite(tableOptions: CaseInsensitiveStringMap): Unit = { +if (tableOptions.containsKey(TestInMemoryTableCatalog.SIMULATE_FAILED_WRITE_OPTION) + && tableOptions.get(TestInMemoryTableCatalog.SIMULATE_FAILED_WRITE_OPTION) + .equalsIgnoreCase("true")) { Review comment: nit: can these long clauses be evaluated above before an if statement? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mccheah commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
mccheah commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#discussion_r304127138 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala ## @@ -78,7 +80,8 @@ case class CreateTableAsSelectExec( } Utils.tryWithSafeFinallyAndFailureCallbacks({ - catalog.createTable(ident, query.schema, partitioning.toArray, properties.asJava) match { + catalog.createTable( +ident, query.schema, partitioning.toArray, properties.asJava) match { Review comment: I'm not keen to adjust any assumptions from the previous implementation in this PR - that can be a follow-up question perhaps for the dev list or in another JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] mccheah commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
mccheah commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#discussion_r304127138 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala ## @@ -78,7 +80,8 @@ case class CreateTableAsSelectExec( } Utils.tryWithSafeFinallyAndFailureCallbacks({ - catalog.createTable(ident, query.schema, partitioning.toArray, properties.asJava) match { + catalog.createTable( +ident, query.schema, partitioning.toArray, properties.asJava) match { Review comment: I'm not quite willing to adjust any assumptions from the previous implementation in this PR - that can be a follow-up question perhaps for the dev list or in another JIRA. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] huaxingao opened a new pull request #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
huaxingao opened a new pull request #25175: [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored URL: https://github.com/apache/spark/pull/25175 ## What changes were proposed in this pull request? In the following python code ``` df.write.mode("overwrite").insertInto("table") ``` ```insertInto``` ignores ```mode("overwrite")``` and appends by default. ## How was this patch tested? Add Unit test. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully
AmplabJenkins removed a comment on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully URL: https://github.com/apache/spark/pull/25174#issuecomment-511993132 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/12881/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully
AmplabJenkins removed a comment on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully URL: https://github.com/apache/spark/pull/25174#issuecomment-511993126 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] brkyvz commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2
brkyvz commented on a change in pull request #24798: [SPARK-27724][SQL] Implement REPLACE TABLE and REPLACE TABLE AS SELECT with V2 URL: https://github.com/apache/spark/pull/24798#discussion_r304126475 ## File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2Exec.scala ## @@ -78,7 +80,8 @@ case class CreateTableAsSelectExec( } Utils.tryWithSafeFinallyAndFailureCallbacks({ - catalog.createTable(ident, query.schema, partitioning.toArray, properties.asJava) match { + catalog.createTable( +ident, query.schema, partitioning.toArray, properties.asJava) match { Review comment: should the schema be called with `asNullable`? I feel the nullability of the input data cannot be trusted (or just because a column is not-nullable for a query doesn't mean it can't be null in the future) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks…
SparkQA commented on issue #23677: [SPARK-26755][SCHEDULER] : Optimize Spark Scheduler to dequeue speculative tasks… URL: https://github.com/apache/spark/pull/23677#issuecomment-511993929 **[Test build #107762 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107762/testReport)** for PR 23677 at commit [`466849e`](https://github.com/apache/spark/commit/466849e3ebf8e39f994423947f21bcf390c26894). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully
AmplabJenkins commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully URL: https://github.com/apache/spark/pull/25174#issuecomment-511993132 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/12881/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully
AmplabJenkins commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully URL: https://github.com/apache/spark/pull/25174#issuecomment-511993126 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory
SparkQA commented on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory URL: https://github.com/apache/spark/pull/22560#issuecomment-511991052 **[Test build #107761 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107761/testReport)** for PR 22560 at commit [`edd3245`](https://github.com/apache/spark/commit/edd3245ed5604dee2aad56c3ee0b8f99d76b48a0). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully
SparkQA commented on issue #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully URL: https://github.com/apache/spark/pull/25174#issuecomment-511991024 **[Test build #107760 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107760/testReport)** for PR 25174 at commit [`c3ae8c8`](https://github.com/apache/spark/commit/c3ae8c82e345d31320a1b754985d685259375861). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] hvanhovell opened a new pull request #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully
hvanhovell opened a new pull request #25174: [SPARK-27485][BRANCH-2.4] EnsureRequirements.reorder should handle duplicate expressions gracefully URL: https://github.com/apache/spark/pull/25174 Backport of 421d9d56efd447d31787e77316ce0eafb5fe45a5 ## What changes were proposed in this pull request? When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. `Seq(a, a, b)` vs `Seq(a, b, b)`. This fails with an index lookup failure in the `reorder` function. This PR fixes this removing the equality checking logic from the `reorderJoinKeys` function, and by doing the multiset equality in the `reorder` function while building the reordered key sequences. ## How was this patch tested? Added a unit test to the `PlannerSuite` and added an integration test to `JoinSuite` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] joshrosen-stripe commented on a change in pull request #25164: [SPARK-28375] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times
joshrosen-stripe commented on a change in pull request #25164: [SPARK-28375] Prevent the PullupCorrelatedPredicates optimizer rule from removing predicates if run multiple times URL: https://github.com/apache/spark/pull/25164#discussion_r304121011 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala ## @@ -275,13 +275,16 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper plan transformExpressions { case ScalarSubquery(sub, children, exprId) if children.nonEmpty => val (newPlan, newCond) = pullOutCorrelatedPredicates(sub, outerPlans) -ScalarSubquery(newPlan, newCond, exprId) +val conds = newCond ++ children.filter(_.isInstanceOf[Predicate]) Review comment: @gatorsmile > We want to ensure all these optimizer rules can be run multiple times. This is critical for adaptive query execution. Quick clarification question: does Spark 2.4.x perform double-optimization when adaptive execution is used? Or is that only a characteristic of the new / expanded adaptive execution features in Spark 3.x? I ask in order to help determine whether this patch should be a correctness backport for 2.4.x. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] srowen commented on a change in pull request #25134: [SPARK-28366][CORE] Logging in driver when loading single large unsplittable file via sc.textFile
srowen commented on a change in pull request #25134: [SPARK-28366][CORE] Logging in driver when loading single large unsplittable file via sc.textFile URL: https://github.com/apache/spark/pull/25134#discussion_r304120087 ## File path: core/src/main/scala/org/apache/spark/internal/config/package.scala ## @@ -1180,6 +1180,13 @@ package object config { .intConf .createWithDefault(1) + private[spark] val IO_FILE_UNSPLITTABLE_WARNING_THRESHOLD = Review comment: I just don't think this is worth another config This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner
AmplabJenkins removed a comment on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner URL: https://github.com/apache/spark/pull/25111#issuecomment-511987351 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner
AmplabJenkins removed a comment on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner URL: https://github.com/apache/spark/pull/25111#issuecomment-511987361 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107753/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner
AmplabJenkins commented on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner URL: https://github.com/apache/spark/pull/25111#issuecomment-511987361 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107753/ Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins commented on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner
AmplabJenkins commented on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner URL: https://github.com/apache/spark/pull/25111#issuecomment-511987351 Merged build finished. Test PASSed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA removed a comment on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner
SparkQA removed a comment on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner URL: https://github.com/apache/spark/pull/25111#issuecomment-511914738 **[Test build #107753 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107753/testReport)** for PR 25111 at commit [`6f9b59f`](https://github.com/apache/spark/commit/6f9b59f5d360a041f6b825ddeb010dc03abee64c). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] SparkQA commented on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner
SparkQA commented on issue #25111: [SPARK-28346][SQL] clone the query plan between analyzer, optimizer and planner URL: https://github.com/apache/spark/pull/25111#issuecomment-511986682 **[Test build #107753 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/107753/testReport)** for PR 25111 at commit [`6f9b59f`](https://github.com/apache/spark/commit/6f9b59f5d360a041f6b825ddeb010dc03abee64c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory
AmplabJenkins removed a comment on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory URL: https://github.com/apache/spark/pull/22560#issuecomment-511982199 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/107759/ Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory
AmplabJenkins removed a comment on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory URL: https://github.com/apache/spark/pull/22560#issuecomment-511982186 Merged build finished. Test FAILed. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] AmplabJenkins removed a comment on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory
AmplabJenkins removed a comment on issue #22560: [SPARK-25547][SQL] Pluggable JDBC connection factory URL: https://github.com/apache/spark/pull/22560#issuecomment-510680246 Can one of the admins verify this patch? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org