[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080789#comment-16080789 ]

Kay Ousterhout commented on SPARK-21349:

Out of curiosity, what are the task sizes that you're seeing?

[~shivaram] -- I know you've looked at task size a lot. Are these getting bigger, and do you think we should just raise the warning size for everyone?

> Make TASK_SIZE_TO_WARN_KB configurable
> ---------------------------------------
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.3, 2.2.0
> Reporter: Dongjoon Hyun
> Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when a task's size exceeds a threshold (SPARK-2185). Although this is just a warning message, this issue proposes making `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for advanced users. According to the Jenkins log, we also see 123 of these warnings even in our own unit tests.
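
For readers following along: the constant referenced above is currently hard-coded at 100 KB in the scheduler. As a rough, hedged sketch of what "making it configurable" could mean in practice -- using an assumed key name, since the issue does not specify one -- a user-facing setting might be read like this:

{code}
import org.apache.spark.SparkConf

// Sketch only: "spark.scheduler.taskSizeToWarnKb" is a hypothetical key chosen for
// illustration; 100 mirrors the current hard-coded TASK_SIZE_TO_WARN_KB value.
object TaskSizeWarnThreshold {
  def warnThresholdKb(conf: SparkConf): Int =
    conf.getInt("spark.scheduler.taskSizeToWarnKb", 100)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(loadDefaults = false)
      .set("spark.scheduler.taskSizeToWarnKb", "1000")
    println(warnThresholdKb(conf)) // 1000
  }
}
{code}

Whether a setting like this should be documented at all is exactly the question discussed in the comments below.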
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079746#comment-16079746 ]

Kay Ousterhout commented on SPARK-21349:

Does that mean we should just raise this threshold for all users?

> Make TASK_SIZE_TO_WARN_KB configurable
> ---------------------------------------
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.3, 2.2.0
> Reporter: Dongjoon Hyun
> Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when a task's size exceeds a threshold (SPARK-2185). Although this is just a warning message, this issue proposes making `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for advanced users. According to the Jenkins log, we also see 123 of these warnings even in our own unit tests.
[jira] [Commented] (SPARK-21349) Make TASK_SIZE_TO_WARN_KB configurable
[ https://issues.apache.org/jira/browse/SPARK-21349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079724#comment-16079724 ]

Kay Ousterhout commented on SPARK-21349:

Is this a major usability issue (and what's the use case where task sizes are regularly > 100KB)? I'm hesitant to make this a configuration parameter -- Spark already has a huge number of configuration parameters, making it hard for users to figure out which ones are relevant for them.

> Make TASK_SIZE_TO_WARN_KB configurable
> ---------------------------------------
>
> Key: SPARK-21349
> URL: https://issues.apache.org/jira/browse/SPARK-21349
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.3, 2.2.0
> Reporter: Dongjoon Hyun
> Priority: Minor
>
> Since Spark 1.1.0, Spark has emitted a warning when a task's size exceeds a threshold (SPARK-2185). Although this is just a warning message, this issue proposes making `TASK_SIZE_TO_WARN_KB` a normal Spark configuration for advanced users.
[jira] [Commented] (SPARK-20219) Schedule tasks based on size of input from ScheduledRDD
[ https://issues.apache.org/jira/browse/SPARK-20219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15957253#comment-15957253 ]

Kay Ousterhout commented on SPARK-20219:

Like [~mridulm80] (as mentioned on the PR), I'm hesitant about this idea because of the added complexity and the information "leakage" from the TaskScheduler back to the DAGScheduler (in general, we should be making the interface between these components smaller, to make the code easier to reason about -- not larger).

[~jinxing6...@126.com] you mentioned some use cases where this is helpful; can you post some concrete performance numbers about the difference in runtimes?

cc [~imranr] -- thoughts here about whether the performance improvement is worth the added complexity?

> Schedule tasks based on size of input from ScheduledRDD
> ---------------------------------------------------------
>
> Key: SPARK-20219
> URL: https://issues.apache.org/jira/browse/SPARK-20219
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: jin xing
>
> When data is highly skewed on a ShuffledRDD, it makes sense to launch the tasks that process much more input as soon as possible. The current scheduling mechanism in *TaskSetManager* is quite simple:
> {code}
> for (i <- (0 until numTasks).reverse) {
>   addPendingTask(i)
> }
> {code}
> In the scenario where "large tasks" are located in the bottom half of the tasks array, launching the tasks with much more input early can significantly reduce the time cost and save resources when *"dynamic allocation"* is disabled.
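
To make the proposal concrete, here is a small, self-contained sketch (not the actual pull request) of ordering pending task indices by a per-partition input-size estimate so that skewed partitions are offered first. `inputBytes` is a hypothetical callback; in a real implementation it would presumably come from MapOutputTracker shuffle statistics.

{code}
// Illustrative only: order task indices so the largest shuffle inputs come first.
object SizeAwareOrdering {
  def pendingOrder(numTasks: Int, inputBytes: Int => Long): Seq[Int] =
    (0 until numTasks).sortBy(i => -inputBytes(i))

  def main(args: Array[String]): Unit = {
    val sizes = Array(10L, 500L, 20L, 900L, 5L)
    // Partition 3 (900 bytes) and partition 1 (500 bytes) would be offered first.
    println(pendingOrder(sizes.length, sizes(_))) // Vector(3, 1, 2, 0, 4)
  }
}
{code}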
[jira] [Assigned] (SPARK-19868) conflict TasksetManager lead to spark stopped
[ https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout reassigned SPARK-19868:

    Assignee: liujianhui

> conflict TasksetManager lead to spark stopped
> ----------------------------------------------
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: liujianhui
> Assignee: liujianhui
> Fix For: 2.2.0
>
> Scenario: a conflicting TaskSetManager throws an exception, which leads to the SparkContext being stopped. Log:
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 4571114: 4571114.2,4571114.1
>   at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason is that the resubmission of the stage conflicts with the still-running stage; the missing tasks of the stage should only be resubmitted once the old TaskSetManager's zombie flag has been set to true.
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at MainApp.scala:73), which has no missing parents
> {code}
> The executor which the shuffle task ran on was lost:
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The timing of the task set finishing versus the resubmission of the stage:
> {code}
> handleSuccessfulTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed TaskSet 4571114.1, whose tasks have all completed, from pool
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding task set 4571114.2 with 1 tasks
> {code}
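
For context on where the IllegalStateException above comes from: TaskSchedulerImpl.submitTasks refuses to register a new task-set attempt for a stage while another non-zombie attempt for the same stage still exists. The following is a self-contained model of that guard (paraphrased for illustration, not the actual Spark source):

{code}
import scala.collection.mutable

final class AttemptState(val stageId: Int, val attempt: Int, var isZombie: Boolean = false)

object ConflictCheck {
  private val attemptsByStage = mutable.Map.empty[Int, mutable.Map[Int, AttemptState]]

  def submit(ts: AttemptState): Unit = {
    val stageAttempts = attemptsByStage.getOrElseUpdate(ts.stageId, mutable.Map.empty)
    stageAttempts(ts.attempt) = ts
    // A stage may only have one non-zombie attempt at a time.
    val conflicting = stageAttempts.values.exists(other => (other ne ts) && !other.isZombie)
    if (conflicting) {
      val active = stageAttempts.values.filterNot(_.isZombie)
        .map(a => s"${a.stageId}.${a.attempt}").mkString(",")
      throw new IllegalStateException(
        s"more than one active taskSet for stage ${ts.stageId}: $active")
    }
  }
}
{code}

In the log above, attempt 4571114.1 had not yet been marked as a zombie when attempt 4571114.2 was submitted, so a guard like this fired inside the DAGScheduler event loop, which is why the reporter saw the SparkContext stop.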
[jira] [Resolved] (SPARK-19868) conflict TasksetManager lead to spark stopped
[ https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19868.

    Resolution: Fixed
    Fix Version/s: 2.2.0

> conflict TasksetManager lead to spark stopped
> ----------------------------------------------
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: liujianhui
> Fix For: 2.2.0
>
> Scenario: a conflicting TaskSetManager throws an exception, which leads to the SparkContext being stopped. Log:
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 4571114: 4571114.2,4571114.1
>   at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason is that the resubmission of the stage conflicts with the still-running stage; the missing tasks of the stage should only be resubmitted once the old TaskSetManager's zombie flag has been set to true.
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at MainApp.scala:73), which has no missing parents
> {code}
> The executor which the shuffle task ran on was lost:
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The timing of the task set finishing versus the resubmission of the stage:
> {code}
> handleSuccessfulTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed TaskSet 4571114.1, whose tasks have all completed, from pool
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding task set 4571114.2 with 1 tasks
> {code}
[jira] [Created] (SPARK-20116) Remove task-level functionality from the DAGScheduler
Kay Ousterhout created SPARK-20116:

    Summary: Remove task-level functionality from the DAGScheduler
    Key: SPARK-20116
    URL: https://issues.apache.org/jira/browse/SPARK-20116
    Project: Spark
    Issue Type: Sub-task
    Components: Scheduler
    Affects Versions: 2.2.0
    Reporter: Kay Ousterhout
    Assignee: Kay Ousterhout

Long, long ago, the scheduler code was more modular: the DAGScheduler handled the logic of scheduling DAGs of stages (as the name suggests), and the TaskSchedulerImpl handled scheduling the tasks within a stage. Over time, more and more task-specific functionality has been added to the DAGScheduler, and now the DAGScheduler duplicates a bunch of the task tracking that's done by other scheduler components. This makes the scheduler code harder to reason about, and has led to some tricky bugs (e.g., SPARK-19263). We should move all of this functionality back to the TaskSchedulerImpl and TaskSetManager, which should "hide" that complexity from the DAGScheduler.
[jira] [Commented] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15944181#comment-15944181 ]

Kay Ousterhout commented on SPARK-19612:

And another: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75272/console

> Tests failing with timeout
> ---------------------------
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Kay Ousterhout
> Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the timeout.
> cc [~shaneknapp]
[jira] [Assigned] (SPARK-19820) Allow users to kill tasks, and propagate a kill reason
[ https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout reassigned SPARK-19820:

    Assignee: Eric Liang

> Allow users to kill tasks, and propagate a kill reason
> --------------------------------------------------------
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Eric Liang
> Assignee: Eric Liang
> Priority: Minor
> Fix For: 2.2.0
[jira] [Resolved] (SPARK-19820) Allow users to kill tasks, and propagate a kill reason
[ https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19820.

    Resolution: Fixed
    Fix Version/s: 2.2.0

> Allow users to kill tasks, and propagate a kill reason
> --------------------------------------------------------
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Eric Liang
> Priority: Minor
> Fix For: 2.2.0
[jira] [Updated] (SPARK-19820) Allow users to kill tasks, and propagate a kill reason
[ https://issues.apache.org/jira/browse/SPARK-19820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout updated SPARK-19820:

    Summary: Allow users to kill tasks, and propagate a kill reason  (was: Allow reason to be specified for task kill)

> Allow users to kill tasks, and propagate a kill reason
> --------------------------------------------------------
>
> Key: SPARK-19820
> URL: https://issues.apache.org/jira/browse/SPARK-19820
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: Eric Liang
> Priority: Minor
[jira] [Resolved] (SPARK-16929) Speculation-related synchronization bottleneck in checkSpeculatableTasks
[ https://issues.apache.org/jira/browse/SPARK-16929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-16929.

    Resolution: Fixed
    Assignee: jin xing
    Fix Version/s: 2.2.0

> Speculation-related synchronization bottleneck in checkSpeculatableTasks
> --------------------------------------------------------------------------
>
> Key: SPARK-16929
> URL: https://issues.apache.org/jira/browse/SPARK-16929
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Reporter: Nicholas Brown
> Assignee: jin xing
> Fix For: 2.2.0
>
> Our cluster has been running slowly since I got speculation working. I looked into it and noticed that stderr was saying some tasks were taking almost an hour to run, even though the application logs on the nodes showed that those tasks only took a minute or so. Digging into the thread dump for the master node, I noticed that a number of threads are blocked, apparently by the speculation thread. At line 476 of TaskSchedulerImpl it grabs a lock on the TaskScheduler while it looks through the tasks to see what needs to be rerun. Unfortunately that code loops through each of the tasks, so when you have even just a couple hundred thousand tasks to run, that can be prohibitively slow to do inside a synchronized block. Once I disabled speculation, the job went back to having acceptable performance.
> There are no comments around that lock indicating why it was added, and the git history seems to have a couple of refactorings, so it's hard to find where it was added. I'm tempted to believe it is the result of someone assuming that an extra synchronized block never hurt anyone (in reality I've seen probably as many bugs caused by over-synchronization as by too little), as it looks too broad to be actually guarding any potential concurrency issue. But, since concurrency issues can be tricky to reproduce (and yes, I understand that's an extreme understatement), I'm not sure just blindly removing it without being familiar with the history is necessarily safe.
> Can someone look into this? Or at least make a note in the documentation that speculation should not be used with large clusters?
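
As an illustration of the bottleneck being described (a toy model, not the actual Spark code): a periodic speculation check that scans every task while holding the scheduler's lock blocks all other scheduler operations for the duration of the scan.

{code}
// Toy model: checkSpeculatableTasks and statusUpdate contend on one lock, so a
// scan over hundreds of thousands of tasks stalls ordinary task bookkeeping.
class ToyScheduler(numTasks: Int) {
  private val lock = new Object

  def checkSpeculatableTasks(): Unit = lock.synchronized {
    var i = 0
    while (i < numTasks) {
      // inspect task i's runtime against the median to decide whether to speculate
      i += 1
    }
  }

  def statusUpdate(): Unit = lock.synchronized {
    // Ordinary task-completion bookkeeping has to wait for the full scan above.
  }
}
{code}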
[jira] [Updated] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout updated SPARK-19612:

    Affects Version/s: 2.2.0  (was: 2.1.1)

> Tests failing with timeout
> ---------------------------
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Kay Ousterhout
> Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the timeout.
> cc [~shaneknapp]
[jira] [Reopened] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout reopened SPARK-19612:

This seems to be back -- I saw two recent failures:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75124
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75127

> Tests failing with timeout
> ---------------------------
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.2.0
> Reporter: Kay Ousterhout
> Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the timeout.
> cc [~shaneknapp]
[jira] [Resolved] (SPARK-19567) Support some Schedulable variables immutability and access
[ https://issues.apache.org/jira/browse/SPARK-19567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19567.

    Resolution: Fixed
    Assignee: Eren Avsarogullari
    Fix Version/s: 2.2.0

> Support some Schedulable variables immutability and access
> ------------------------------------------------------------
>
> Key: SPARK-19567
> URL: https://issues.apache.org/jira/browse/SPARK-19567
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: Eren Avsarogullari
> Assignee: Eren Avsarogullari
> Priority: Minor
> Fix For: 2.2.0
>
> Some Schedulable variables need refactoring for immutability and access modifiers, as follows:
> - from vars to vals (where there is no requirement for mutability): this is important to support immutability as much as possible. Sample => Pool: weight, minShare, priority, name and taskSetSchedulingAlgorithm.
> - access modifiers: in particular, var access needs to be restricted from other parts of the codebase to prevent potential side effects. Sample => TaskSetManager: tasksSuccessful, totalResultSize, calculatedTasks etc...
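
A simplified sketch of the kind of refactoring being described (illustrative class and field names, not the actual Pool/TaskSetManager code):

{code}
package org.example.scheduler // hypothetical package so the access modifier compiles

private[scheduler] class ExamplePool(initWeight: Int, initMinShare: Int) {
  // Before: `var weight = initWeight` even though nothing ever reassigns it.
  // After: a val, so immutability is enforced by the compiler.
  val weight: Int = initWeight
  val minShare: Int = initMinShare

  // Genuinely mutable state stays a var, but is writable only within the
  // scheduler package rather than from anywhere in the codebase.
  private[scheduler] var tasksSuccessful: Int = 0
}
{code}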
[jira] [Updated] (SPARK-19868) conflict TasksetManager lead to spark stopped
[ https://issues.apache.org/jira/browse/SPARK-19868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout updated SPARK-19868:

    Target Version/s: 2.2.0

> conflict TasksetManager lead to spark stopped
> ----------------------------------------------
>
> Key: SPARK-19868
> URL: https://issues.apache.org/jira/browse/SPARK-19868
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.1.0
> Reporter: liujianhui
>
> Scenario: a conflicting TaskSetManager throws an exception, which leads to the SparkContext being stopped. Log:
> {code}
> java.lang.IllegalStateException: more than one active taskSet for stage 4571114: 4571114.2,4571114.1
>   at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
>   at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> The reason is that the resubmission of the stage conflicts with the still-running stage; the missing tasks of the stage should only be resubmitted once the old TaskSetManager's zombie flag has been set to true.
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks had failed: 0
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at MainApp.scala:73), which has no missing parents
> {code}
> The executor which the shuffle task ran on was lost:
> {code}
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4
> {code}
> The timing of the task set finishing versus the resubmission of the stage:
> {code}
> handleSuccessfulTask
> [INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed TaskSet 4571114.1, whose tasks have all completed, from pool
> resubmit stage
> [INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding task set 4571114.2 with 1 tasks
> {code}
[jira] [Resolved] (SPARK-19613) Flaky test: StateStoreRDDSuite.versioning and immutability
[ https://issues.apache.org/jira/browse/SPARK-19613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19613.

    Resolution: Cannot Reproduce

I'm closing this because, while it had a burst of failures about a month ago (see here: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite&name=versioning+and+immutability), it hasn't failed since.

> Flaky test: StateStoreRDDSuite.versioning and immutability
> ------------------------------------------------------------
>
> Key: SPARK-19613
> URL: https://issues.apache.org/jira/browse/SPARK-19613
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming, Tests
> Affects Versions: 2.1.1
> Reporter: Kay Ousterhout
> Priority: Minor
>
> This test: org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite.versioning and immutability failed on a recent PR: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72948/testReport/junit/org.apache.spark.sql.execution.streaming.state/StateStoreRDDSuite/versioning_and_immutability/
[jira] [Resolved] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19612.

    Resolution: Cannot Reproduce

Closing this for now because I haven't seen this issue in a while (we can re-open if this starts occurring again).

> Tests failing with timeout
> ---------------------------
>
> Key: SPARK-19612
> URL: https://issues.apache.org/jira/browse/SPARK-19612
> Project: Spark
> Issue Type: Bug
> Components: Tests
> Affects Versions: 2.1.1
> Reporter: Kay Ousterhout
> Priority: Minor
>
> I've seen at least one recent test failure due to hitting the 250m timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/
> Filing this JIRA to track this; if it happens repeatedly we should up the timeout.
> cc [~shaneknapp]
[jira] [Resolved] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
[ https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19988.

    Resolution: Fixed
    Assignee: Xiao Li
    Fix Version/s: 2.2.0

Resolved by https://github.com/apache/spark/pull/17344 (which was merged with the wrong JIRA number, but should resolve this issue)

> Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
> ----------------------------------------------------------------------------------------------
>
> Key: SPARK-19988
> URL: https://issues.apache.org/jira/browse/SPARK-19988
> Project: Spark
> Issue Type: Test
> Components: SQL, Tests
> Affects Versions: 2.2.0
> Reporter: Imran Rashid
> Assignee: Xiao Li
> Labels: flaky-test
> Fix For: 2.2.0
>
> Attachments: trimmed-unit-test.log
>
> "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 days here: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite&test_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive
> eg. https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/
> {noformat}
> sbt.ForkMain$ForkError: org.apache.spark.sql.execution.QueryExecutionException: FAILED: SemanticException [Error 10072]: Database does not exist: db2
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621)
>   at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621)
>   at org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   at org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155)
>   ...
> {noformat}
[jira] [Commented] (SPARK-18886) Delay scheduling should not delay some executors indefinitely if one task is scheduled before delay timeout
[ https://issues.apache.org/jira/browse/SPARK-18886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15931009#comment-15931009 ]

Kay Ousterhout commented on SPARK-18886:

Sorry for the slow response here! I realized this is the same issue as SPARK-11460 (although that JIRA proposed a slightly different solution), which stalled for reasons that are completely my fault (I neglected it because I couldn't think of a practical way of solving it).

Imran, unfortunately I don't think your latest idea will quite work. Delay scheduling was originally intended for situations where the number of slots that a particular job could use was limited by a fairness policy. In that case, it can be better to wait a bit for a "better" slot (i.e., one that satisfies locality preferences). In particular, if you never wait, you end up with this "sticky slot" issue where tasks for a job keep finishing up in a "bad" slot (one with no locality preferences), and then that slot is re-offered to the same job, which will again accept the bad slot. If the job just waited a bit, it could get a better slot (e.g., as a result of tasks from another job finishing). [1]

This relates to your idea because of the following situation: suppose you have a cluster with 10 machines, the job has locality preferences for 5 of them (with ids 1, 2, 3, 4, 5), and fairness dictates that the job can only use 3 slots at a time (e.g., it's sharing equally with 2 other jobs). Suppose that for a long time, the job has been running tasks on slots 1, 2, and 3 (so local slots). At this point, the timers for machines 6, 7, 8, 9, and 10 will have expired, because the job has been running for a while. But if the job is now offered a slot on one of those non-local machines (e.g., 6), the job hasn't been waiting long for non-local resources: until this point, it's been running its full share of 3 slots at a time, and it's been doing so on machines that satisfy locality preferences. So, we shouldn't accept that slot on machine 6 -- we should wait a bit to see if we can get a slot on 1, 2, 3, 4, or 5.

The solution I proposed (in a long PR comment) for the other JIRA is: if the task set is using fewer than the number of slots it could be using (where "# slots it could be using" is all of the slots in the cluster if the job is running alone, or the job's fair share, if it's not) for some period of time, increase the locality level. The problem with that solution is that I thought it was completely impractical to determine the number of slots a TSM "should" be allowed to use. However, after thinking about this more today, I think we might be able to do this in a practical way:

- First, I thought that we could use information about when offers are rejected to determine this (e.g., if you've been rejecting offers for a while, then you're not using your fair share). But the problem here is that it's not easy to determine when you *are* using your fair / allowed share: accepting a single offer doesn't necessarily mean that you're now using the allowed share. This is precisely the problem with the current approach, hence this JIRA.

- v1: one possible proxy for this is whether there are slots that are currently available that haven't been accepted by any job. The TaskSchedulerImpl could feasibly pass this information to each TaskSetManager, and the TSM could use it to update its delay timer: something like only reset the delay timer to 0 if (a) the TSM accepts an offer and (b) the flag passed by the TaskSchedulerImpl indicates that there are no other unused slots in the cluster. This fixes the problem described in the JIRA: in that case, the flag would indicate that there *were* other unused slots, even though a task got successfully scheduled with this offer, so the delay timer wouldn't be reset, and would eventually correctly expire.

- v2: The problem with v1 is that it doesn't correctly handle situations where, e.g., you have two jobs A and B with equal shares. B is "greedy" and will accept any slot (e.g., it's a reduce stage), and A is doing delay scheduling. In this case, A might have much less than its share, but the flag from the TaskSchedulerImpl would indicate that there were no other free slots in the cluster, so the delay timer wouldn't ever expire. I suspect we could handle this (e.g., with some logic in the TaskSchedulerImpl to detect when a particular TSM is getting starved: when it keeps rejecting offers that are later accepted by someone else), but before thinking about this further, I wanted to run the general idea by you to see what your thoughts are.

[1] There's a whole side question / discussion of how often this is useful for Spark at all. It can be useful if you're running in a shared cluster where e.g. Yarn might be assigning you more slots over time, and it's also useful when a single Spark context is being shared
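
A rough, self-contained sketch of the "v1" rule described in the comment above (illustrative pseudocode, not Spark source): the locality-wait timer is only reset when the task set launches a task *and* the scheduler reports that no other slots in the cluster are going unused, so a task set that is being starved of its share still relaxes its locality level once the wait expires.

{code}
class LocalityWaitState(clock: () => Long) {
  private var lastResetTime: Long = clock()

  // Called when this task set accepts an offer and launches a task.
  def onTaskLaunched(clusterHasUnusedSlots: Boolean): Unit = {
    // Only treat the launch as "we are getting every slot we can use" when the
    // scheduler says nothing else is sitting idle; otherwise keep the old timestamp
    // so the locality level can still expire.
    if (!clusterHasUnusedSlots) {
      lastResetTime = clock()
    }
  }

  def shouldRelaxLocality(waitMs: Long): Boolean =
    clock() - lastResetTime > waitMs
}
{code}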
[jira] [Resolved] (SPARK-11460) Locality waits should be based on task set creation time, not last launch time
[ https://issues.apache.org/jira/browse/SPARK-11460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-11460.

    Resolution: Duplicate

> Locality waits should be based on task set creation time, not last launch time
> --------------------------------------------------------------------------------
>
> Key: SPARK-11460
> URL: https://issues.apache.org/jira/browse/SPARK-11460
> Project: Spark
> Issue Type: Bug
> Components: Scheduler
> Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0, 1.2.1, 1.2.2, 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1
> Environment: YARN
> Reporter: Shengyue Ji
> Original Estimate: 2h
> Remaining Estimate: 2h
>
> Spark waits for the spark.locality.wait period before going from RACK_LOCAL to ANY when selecting an executor for assignment. The timeout is essentially reset each time a new assignment is made.
> We were running Spark Streaming on Kafka with a 10 second batch window, on 32 Kafka partitions with 16 executors. All executors were in the ANY group. At one point one RACK_LOCAL executor was added and all tasks were assigned to it. Each task took about 0.6 seconds to process, resetting the spark.locality.wait timeout (3000ms) repeatedly. This caused the whole process to underutilize resources and created an increasing backlog.
> spark.locality.wait should be based on the task set creation time, not the last launch time, so that 3000ms after initial creation all executors can get tasks assigned to them.
> We are specifying a zero timeout for now as a workaround to disable locality optimization.
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L556
[jira] [Resolved] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-18890.

    Resolution: Invalid

I closed this because, as [~imranr] pointed out on the PR, these already happen in the same thread. [~witgo], can you change your PR to reference SPARK-19486, which describes the behavior you implemented?

> Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
> --------------------------------------------------------------------------------------------------
>
> Key: SPARK-18890
> URL: https://issues.apache.org/jira/browse/SPARK-18890
> Project: Spark
> Issue Type: Improvement
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: Kay Ousterhout
> Priority: Minor
>
> As part of benchmarking this change: https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and I found that moving task serialization from the TaskSetManager (which happens as part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads to approximately a 10% reduction in job runtime for a job that counted 10,000 partitions (that each had 1 int) using 20 machines. Similar performance improvements were reported in the pull request linked above. This would appear to be because the TaskSchedulerImpl thread is the bottleneck, so moving serialization to CGSB reduces runtime. This change may *not* improve runtime (and could potentially worsen runtime) in scenarios where the CGSB thread is the bottleneck (e.g., if tasks are very large, so calling launch to send the tasks to the executor blocks on the network).
> One benefit of implementing this change is that it makes it easier to parallelize the serialization of tasks (different tasks could be serialized by different threads). Another benefit is that all of the serialization occurs in the same place (currently, the Task is serialized in TaskSetManager, and the TaskDescription is serialized in CGSB).
> I'm not totally convinced we should fix this because it seems like there are better ways of reducing the serialization time (e.g., by re-using a single serialized object with the Task/jars/files and broadcasting it for each stage), but I wanted to open this JIRA to document the discussion.
> cc [~witgo]
[jira] [Commented] (SPARK-19565) After fetching failed, success of old attempt of stage should be taken as valid.
[ https://issues.apache.org/jira/browse/SPARK-19565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930910#comment-15930910 ]

Kay Ousterhout commented on SPARK-19565:

[~jinxing6...@126.com] I closed this because it looks like a duplicate of the work you did for SPARK-19263. Feel free to re-open if I've misunderstood.

> After fetching failed, success of old attempt of stage should be taken as valid.
> ----------------------------------------------------------------------------------
>
> Key: SPARK-19565
> URL: https://issues.apache.org/jira/browse/SPARK-19565
> Project: Spark
> Issue Type: Test
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: jin xing
>
> This is related to SPARK-19263. When a fetch fails, the stage will be resubmitted. There can be running tasks from both the old and new stage attempts. Success of tasks from the old stage attempt should be taken as valid, and the partitionId should be removed from the stage's pendingPartitions accordingly. When pendingPartitions is empty, the downstream stage can be scheduled, even though there are still running tasks in the active (new) stage attempt.
[jira] [Resolved] (SPARK-19565) After fetching failed, success of old attempt of stage should be taken as valid.
[ https://issues.apache.org/jira/browse/SPARK-19565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19565.

    Resolution: Duplicate

> After fetching failed, success of old attempt of stage should be taken as valid.
> ----------------------------------------------------------------------------------
>
> Key: SPARK-19565
> URL: https://issues.apache.org/jira/browse/SPARK-19565
> Project: Spark
> Issue Type: Test
> Components: Scheduler
> Affects Versions: 2.1.0
> Reporter: jin xing
>
> This is related to SPARK-19263. When a fetch fails, the stage will be resubmitted. There can be running tasks from both the old and new stage attempts. Success of tasks from the old stage attempt should be taken as valid, and the partitionId should be removed from the stage's pendingPartitions accordingly. When pendingPartitions is empty, the downstream stage can be scheduled, even though there are still running tasks in the active (new) stage attempt.
[jira] [Commented] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.
[ https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930909#comment-15930909 ]

Kay Ousterhout commented on SPARK-19755:

I'm closing this because the configs you're proposing adding already exist: spark.blacklist.enabled already exists to turn off all blacklisting (this is false by default, so the fact that you're seeing blacklisting behavior means that your configuration enables blacklisting), and spark.blacklist.maxFailedTaskPerExecutor is the other thing you proposed adding. All of the blacklisting parameters are listed here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/internal/config/package.scala#L101

Feel free to re-open this if I've misunderstood and the existing configs don't address the issues you're seeing!

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
> Issue Type: Bug
> Components: Mesos, Scheduler
> Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are dockerized.
> Reporter: Timur Abakumov
>
> When a task fails for some reason, MesosCoarseGrainedSchedulerBackend increases the failure counter for the slave where that task was running. When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded.
> Over time the scheduler cannot create a new executor -- every slave is in the blacklist. A task failure is not necessarily related to host health -- especially for long-running streaming apps.
> If accepted as a bug: a possible solution is to use spark.blacklist.enabled to make that functionality optional and, if it makes sense, to make MAX_SLAVE_FAILURES configurable as well.
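
For anyone hitting this: a minimal check of the existing switch mentioned in the comment above. Per the comment, `spark.blacklist.enabled` defaults to false; the default shown in the snippet is only the fallback used when the key is unset.

{code}
import org.apache.spark.SparkConf

object BlacklistToggle {
  def main(args: Array[String]): Unit = {
    // With no explicit setting, task blacklisting is off; setting this key to
    // "true" is what enables the behavior being reported here.
    val conf = new SparkConf(loadDefaults = false)
    println(conf.getBoolean("spark.blacklist.enabled", defaultValue = false)) // false
  }
}
{code}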
[jira] [Resolved] (SPARK-19755) Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.
[ https://issues.apache.org/jira/browse/SPARK-19755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout resolved SPARK-19755.

    Resolution: Not A Problem

> Blacklist is always active for MesosCoarseGrainedSchedulerBackend. As result - scheduler cannot create an executor after some time.
> -------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-19755
> URL: https://issues.apache.org/jira/browse/SPARK-19755
> Project: Spark
> Issue Type: Bug
> Components: Mesos, Scheduler
> Affects Versions: 2.1.0
> Environment: mesos, marathon, docker - driver and executors are dockerized.
> Reporter: Timur Abakumov
>
> When a task fails for some reason, MesosCoarseGrainedSchedulerBackend increases the failure counter for the slave where that task was running. When the counter is >= 2 (MAX_SLAVE_FAILURES), the Mesos slave is excluded.
> Over time the scheduler cannot create a new executor -- every slave is in the blacklist. A task failure is not necessarily related to host health -- especially for long-running streaming apps.
> If accepted as a bug: a possible solution is to use spark.blacklist.enabled to make that functionality optional and, if it makes sense, to make MAX_SLAVE_FAILURES configurable as well.
[jira] [Commented] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929422#comment-15929422 ]

Kay Ousterhout commented on SPARK-19990:

Thanks [~windpiger]!

> Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
> ----------------------------------------------------------------------------------------------------
>
> Key: SPARK-19990
> URL: https://issues.apache.org/jira/browse/SPARK-19990
> Project: Spark
> Issue Type: Bug
> Components: SQL, Tests
> Affects Versions: 2.2.0
> Reporter: Kay Ousterhout
>
> This test seems to be failing consistently on all of the maven builds: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite&test_name=create+temporary+view+using and is possibly caused by SPARK-19763.
> Here's a stack trace for the failure:
> java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
>   at org.apache.hadoop.fs.Path.initialize(Path.java:206)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:172)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
>   at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
>   at scala.collection.immutable.List.foreach(List.scala:381)
>   at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
>   at scala.collection.immutable.List.flatMap(List.scala:344)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
>   at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
>   at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
>   at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
>   at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
>   at org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
>   at org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
>   at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
>   at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
>   at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
>   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
>   at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
>   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
>   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
>   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
>   at org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41)
>   at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
>   at org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41)
>   at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
>   at org.scalates
[jira] [Updated] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
[ https://issues.apache.org/jira/browse/SPARK-19990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kay Ousterhout updated SPARK-19990:

    Description:
This test seems to be failing consistently on all of the maven builds: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite&test_name=create+temporary+view+using and is possibly caused by SPARK-19763.

Here's a stack trace for the failure:

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.6/sql/core/target/spark-sql_2.11-2.2.0-SNAPSHOT-tests.jar!/test-data/cars.csv
  at org.apache.hadoop.fs.Path.initialize(Path.java:206)
  at org.apache.hadoop.fs.Path.<init>(Path.java:172)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:343)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
  at scala.collection.immutable.List.flatMap(List.scala:344)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:343)
  at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:91)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:67)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:183)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:68)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:617)
  at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
  at org.apache.spark.sql.test.SQLTestUtils$$anonfun$sql$1.apply(SQLTestUtils.scala:62)
  at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38$$anonfun$apply$mcV$sp$8.apply$mcV$sp(DDLSuite.scala:705)
  at org.apache.spark.sql.test.SQLTestUtils$class.withView(SQLTestUtils.scala:186)
  at org.apache.spark.sql.execution.command.DDLSuite.withView(DDLSuite.scala:171)
  at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply$mcV$sp(DDLSuite.scala:704)
  at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
  at org.apache.spark.sql.execution.command.DDLSuite$$anonfun$38.apply(DDLSuite.scala:701)
  at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
  at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
  at org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(HiveDDLSuite.scala:41)
  at org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:255)
  at org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite.runTest(HiveDDLSuite.scala:41)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
  at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
  at scala.collection.immutable.List.foreach(List.scala:381)
  at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
  at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
  at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
  at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
  at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
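
The URISyntaxException above can be reproduced outside Spark: Hadoop's Path constructor cannot parse "jar:file:...!/..." locations, because the part following the "jar:" scheme is not a valid hierarchical path. A minimal, hedged illustration (the jar path below is a placeholder, not the Jenkins path from the trace):

{code}
import org.apache.hadoop.fs.Path

object JarUriRepro {
  def main(args: Array[String]): Unit = {
    // Throws java.lang.IllegalArgumentException: java.net.URISyntaxException:
    // Relative path in absolute URI: jar:file:/tmp/example-tests.jar!/test-data/cars.csv
    new Path("jar:file:/tmp/example-tests.jar!/test-data/cars.csv")
  }
}
{code}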
[jira] [Commented] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
[ https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929334#comment-15929334 ] Kay Ousterhout commented on SPARK-19988: With some help from [~joshrosen] I spent some time digging into this and found: (1) if you look at the failures, they're all from the maven build. In fact, 100% of the maven builds shown there fail (and none of the SBT ones). This is weird because this is also failing on the PR builder, which uses SBT. (2) The maven build failures are all accompanied by 3 other tests; the group of 4 tests seems to consistently fail together. 3 tests fail with errors similar to this one (saying that some database does not exist). The 4th test, org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using, fails with a more real error. I filed SPARK-19990 for that issue. (3) A commit right around the time the tests started failing: https://github.com/apache/spark/commit/09829be621f0f9bb5076abb3d832925624699fa9#diff-b7094baa12601424a5d19cb930e3402fR46 added code to remove all of the databases after each test. I wonder if that's somehow getting run concurrently or asynchronously in the maven build (after the HiveCataloguedDDLSuite fails), which is why the error in the DDLSuite causes the other tests to fail saying that a database can't be found. I have extremely limited knowledge of both (a) how the maven tests are executed and (b) the SQL code so it's possible these are totally unrelated issues. None of this explains why the test is failing in the PR builder, where the failures have been isolated to this test. > Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column > written by Hive > > > Key: SPARK-19988 > URL: https://issues.apache.org/jira/browse/SPARK-19988 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > Attachments: trimmed-unit-test.log > > > "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by > Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 > days here: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite&test_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive > eg. 
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/ > {noformat} > sbt.ForkMain$ForkError: > org.apache.spark.sql.execution.QueryExecutionException: FAILED: > SemanticException [Error 10072]: Database does not exist: db2 > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
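To make the hypothesis in the comment above concrete, here is a minimal Scala sketch of that kind of per-test database cleanup (the class and method shapes are illustrative, not the actual commit): if a hook like this fires while another suite's test is still using one of the dropped databases, that test fails with "Database does not exist".
{code}
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterEach, FunSuite}

// Hypothetical sketch: drop every non-default database after each test.
abstract class CleanDatabasesAfterEachTest extends FunSuite with BeforeAndAfterEach {
  // Provided by the concrete suite in this sketch.
  def spark: SparkSession

  override protected def afterEach(): Unit = {
    try {
      spark.catalog.listDatabases().collect()
        .map(_.name)
        .filter(_ != "default")
        .foreach(db => spark.sql(s"DROP DATABASE IF EXISTS $db CASCADE"))
    } finally {
      super.afterEach()
    }
  }
}
{code}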
[jira] [Created] (SPARK-19990) Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using
Kay Ousterhout created SPARK-19990: -- Summary: Flaky test: org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite: create temporary view using Key: SPARK-19990 URL: https://issues.apache.org/jira/browse/SPARK-19990 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.2.0 Reporter: Kay Ousterhout This test seems to be failing consistently on all of the maven builds: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.execution.HiveCatalogedDDLSuite&test_name=create+temporary+view+using and is possibly caused by SPARK-19763. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19964: --- Summary: Flaky test: SparkSubmitSuite fails due to Timeout (was: SparkSubmitSuite fails due to Timeout) > Flaky test: SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case has been failed due to TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19964) Flaky test: SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929257#comment-15929257 ] Kay Ousterhout edited comment on SPARK-19964 at 3/17/17 12:54 AM: -- [~srowen] it looks like this is failing periodically in master: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite&test_name=includes+jars+passed+in+through+--jars (I added flaky to the name, which I suspect is the source of confusion) was (Author: kayousterhout): [~srowen] it looks like this is failing periodically in master: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite&test_name=includes+jars+passed+in+through+--jars > Flaky test: SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case has been failed due to TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19964) SparkSubmitSuite fails due to Timeout
[ https://issues.apache.org/jira/browse/SPARK-19964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929257#comment-15929257 ] Kay Ousterhout commented on SPARK-19964: [~srowen] it looks like this is failing periodically in master: https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.deploy.SparkSubmitSuite&test_name=includes+jars+passed+in+through+--jars > SparkSubmitSuite fails due to Timeout > - > > Key: SPARK-19964 > URL: https://issues.apache.org/jira/browse/SPARK-19964 > Project: Spark > Issue Type: Bug > Components: Deploy, Tests >Affects Versions: 2.2.0 >Reporter: Eren Avsarogullari > Labels: flaky-test > Attachments: SparkSubmitSuite_Stacktrace > > > The following test case has been failed due to TestFailedDueToTimeoutException > *Test Suite:* SparkSubmitSuite > *Test Case:* includes jars passed in through --packages > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74413/testReport/ > *Stacktrace is also attached.* -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19988) Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by Hive
[ https://issues.apache.org/jira/browse/SPARK-19988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19988: --- Component/s: SQL > Flaky Test: OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column > written by Hive > > > Key: SPARK-19988 > URL: https://issues.apache.org/jira/browse/SPARK-19988 > Project: Spark > Issue Type: Test > Components: SQL, Tests >Affects Versions: 2.2.0 >Reporter: Imran Rashid > Labels: flaky-test > Attachments: trimmed-unit-test.log > > > "OrcSourceSuite SPARK-19459/SPARK-18220: read char/varchar column written by > Hive" fails a lot -- right now, I see about a 50% pass rate in the last 3 > days here: > https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.hive.orc.OrcSourceSuite&test_name=SPARK-19459%2FSPARK-18220%3A+read+char%2Fvarchar+column+written+by+Hive > eg. > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.hive.orc/OrcSourceSuite/SPARK_19459_SPARK_18220__read_char_varchar_column_written_by_Hive/ > {noformat} > sbt.ForkMain$ForkError: > org.apache.spark.sql.execution.QueryExecutionException: FAILED: > SemanticException [Error 10072]: Database does not exist: db2 > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:637) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:288) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:229) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:228) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:271) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runHive(HiveClientImpl.scala:621) > at > org.apache.spark.sql.hive.client.HiveClientImpl.runSqlHive(HiveClientImpl.scala:611) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply$mcV$sp(OrcSourceSuite.scala:160) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > at > org.apache.spark.sql.hive.orc.OrcSuite$$anonfun$7.apply(OrcSourceSuite.scala:155) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19989) Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite
Kay Ousterhout created SPARK-19989: -- Summary: Flaky Test: org.apache.spark.sql.kafka010.KafkaSourceStressSuite Key: SPARK-19989 URL: https://issues.apache.org/jira/browse/SPARK-19989 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 2.2.0 Reporter: Kay Ousterhout Priority: Minor This test failed recently here: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74683/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/stress_test_with_multiple_topics_and_partitions/ And based on Josh's dashboard (https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.kafka010.KafkaSourceStressSuite&test_name=stress+test+with+multiple+topics+and+partitions), seems to fail a few times every month. Here's the full error from the most recent failure: Error Message org.scalatest.exceptions.TestFailedException: Error adding data: replication factor: 1 larger than available brokers: 0 kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:117) kafka.admin.AdminUtils$.createTopic(AdminUtils.scala:403) org.apache.spark.sql.kafka010.KafkaTestUtils.createTopic(KafkaTestUtils.scala:173) org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:903) org.apache.spark.sql.kafka010.KafkaSourceStressSuite$$anonfun$16$$anonfun$apply$mcV$sp$17$$anonfun$37.apply(KafkaSourceSuite.scala:901) org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:93) org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData$$anonfun$addData$1.apply(KafkaSourceSuite.scala:92) scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:316) org.apache.spark.sql.kafka010.KafkaSourceTest$AddKafkaData.addData(KafkaSourceSuite.scala:92) org.apache.spark.sql.streaming.StreamTest$$anonfun$liftedTree1$1$1.apply(StreamTest.scala:494) == Progress ==AssertOnQuery(, )CheckAnswer: StopStream StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@5d888be0,Map()) AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(0, 1, 2, 3, 4, 5, 6, 7, 8), message = )CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9]StopStream StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@1be724ee,Map()) AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(9, 10, 11, 12, 13, 14), message = )CheckAnswer: [1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15]StopStream AddKafkaData(topics = Set(stress4, stress2, stress1, stress5, stress3), data = Range(), message = ) => AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(15), message = Add topic stress7) AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(16, 17, 18, 19, 20, 21, 22), message = Add partition) AddKafkaData(topics = Set(stress4, stress6, stress2, stress1, stress5, stress3), data = Range(23, 24), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = Add topic stress9)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(25, 26, 27, 28, 29, 30, 31, 32, 33), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(), message = 
)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(34, 35, 36, 37, 38, 39), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(40, 41, 42, 43), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(44), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(45, 46, 47, 48, 49, 50, 51, 52), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(53, 54, 55), message = )AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(56, 57, 58, 59, 60, 61, 62, 63), message = Add partition)AddKafkaData(topics = Set(stress4, stress6, stress2, stress8, stress1, stress5, stress3), data = Range(64, 65, 66, 67, 68, 69, 70), message = ) StartStream(ProcessingTime(0),org.apache.spark.util.SystemClock@65068637,Map()) AddKafkaData(topics = Set(stress4, stress6, stress2,
[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927539#comment-15927539 ] Kay Ousterhout commented on SPARK-19803: Awesome thanks! > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Shubham Chopra > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927263#comment-15927263 ] Kay Ousterhout commented on SPARK-19803: This failed again today: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74621/testReport/org.apache.spark.storage/BlockManagerProactiveReplicationSuite/proactive_block_replication___3_replicas___2_block_manager_deletions/ > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Shubham Chopra > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18066) Add Pool usage policies test coverage for FIFO & FAIR Schedulers
[ https://issues.apache.org/jira/browse/SPARK-18066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-18066. Resolution: Fixed Assignee: Eren Avsarogullari Fix Version/s: 2.2.0 > Add Pool usage policies test coverage for FIFO & FAIR Schedulers > > > Key: SPARK-18066 > URL: https://issues.apache.org/jira/browse/SPARK-18066 > Project: Spark > Issue Type: Test > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Minor > Fix For: 2.2.0 > > > The following Pool usage cases need to have Unit test coverage : > - FIFO Scheduler just uses *root pool* so even if *spark.scheduler.pool* > property is set, related pool is not created and *TaskSetManagers* are added > to root pool. > - FAIR Scheduler uses default pool when *spark.scheduler.pool* property is > not set. This can be happened when Properties object is null or empty(*new > Properties()*) or points default pool(*spark.scheduler.pool*=_default_). > - FAIR Scheduler creates a new pool with default values when > *spark.scheduler.pool* property points _non-existent_ pool. This can be > happened when scheduler allocation file is not set or it does not contain > related pool. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
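For reference, a small Scala sketch of the pool-selection behaviour SPARK-18066 describes (the pool name and the commented-out allocation-file path are made-up examples, not taken from the issue):
{code}
import org.apache.spark.sql.SparkSession

object PoolSelectionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("pool-selection-example")
      .config("spark.scheduler.mode", "FAIR")
      // optionally: .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()

    // FAIR mode: jobs submitted after this call are scheduled in the named pool.
    // If the pool is not defined in the allocation file, a pool with default
    // values is created; if the property is not set at all, the default pool is
    // used. FIFO mode ignores the property and puts everything in the root pool.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool_with_min_share")

    spark.range(0, 1000).count()  // this job runs in the selected pool
    spark.stop()
  }
}
{code}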
[jira] [Reopened] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout reopened SPARK-19803: Assignee: Shubham Chopra (was: Genmao Yu) > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Shubham Chopra > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925353#comment-15925353 ] Kay Ousterhout commented on SPARK-19803: This does not appear to be fixed -- it looks like there's some error condition in the underlying code that can cause this to break? From https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74412/testReport/org.apache.spark.storage/BlockManagerProactiveReplicationSuite/proactive_block_replication___5_replicas___4_block_manager_deletions/: org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 493 times over 5.00752125399 seconds. Last failure message: 4 did not equal 5. at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307) at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478) at org.apache.spark.storage.BlockManagerProactiveReplicationSuite.testProactiveReplication(BlockManagerReplicationSuite.scala:492) at org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply$mcV$sp(BlockManagerReplicationSuite.scala:464) at org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply(BlockManagerReplicationSuite.scala:464) at org.apache.spark.storage.BlockManagerProactiveReplicationSuite$$anonfun$12$$anonfun$apply$mcVI$sp$1.apply(BlockManagerReplicationSuite.scala:464) [~shubhamc] and [~cloud_fan], since you worked on the original code for this, can you take a look at this? I looked at this for a bit and based on some experimentation it looked like there were some race conditions in the underlying code. > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
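For context on the stack trace above, this is the assertion pattern the suite relies on, sketched with an illustrative stand-in (the real code is in BlockManagerReplicationSuite.scala). "Attempted 493 times over 5 seconds ... 4 did not equal 5" means the replica count never reached the expected value before the timeout, which is consistent with the race conditions mentioned above rather than the test merely being slow.
{code}
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

// Stand-in for the real check, which asks the master how many peers hold the block.
def replicaCount: Int = 5

// Retries the block every 10 ms until it passes or 5 seconds elapse; the
// failure above means the count was still 4 when the timeout expired.
eventually(timeout(5.seconds), interval(10.millis)) {
  assert(replicaCount == 5)
}
{code}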
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Labels: flaky-test (was: ) > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Affects Version/s: (was: 2.3.0) 2.2.0 > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Component/s: Tests > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19803. Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.2.0 Thanks for fixing this [~uncleGen] and for reporting it [~sitalke...@gmail.com] > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19276) FetchFailures can be hidden by user (or sql) exception handling
[ https://issues.apache.org/jira/browse/SPARK-19276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19276. Resolution: Fixed Assignee: Imran Rashid Fix Version/s: 2.2.0 > FetchFailures can be hidden by user (or sql) exception handling > --- > > Key: SPARK-19276 > URL: https://issues.apache.org/jira/browse/SPARK-19276 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Critical > Fix For: 2.2.0 > > > The scheduler handles node failures by looking for a special > {{FetchFailedException}} thrown by the shuffle block fetcher. This is > handled in {{Executor}} and then passed as a special msg back to the driver: > https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/core/src/main/scala/org/apache/spark/executor/Executor.scala#L403 > However, user code exists in between the shuffle block fetcher and that catch > block -- it could intercept the exception, wrap it with something else, and > throw a different exception. If that happens, spark treats it as an ordinary > task failure, and retries the task, rather than regenerating the missing > shuffle data. The task eventually is retried 4 times, its doomed to fail > each time, and the job is failed. > You might think that no user code should do that -- but even sparksql does it: > https://github.com/apache/spark/blob/278fa1eb305220a85c816c948932d6af8fa619aa/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L214 > Here's an example stack trace. This is from Spark 1.6, so the sql code is > not the same, but the problem is still there: > {noformat} > 17/01/13 19:18:02 WARN scheduler.TaskSetManager: Lost task 0.0 in stage > 1983.0 (TID 304851, xxx): org.apache.spark.SparkException: Task failed while > writing rows. > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:414) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.spark.shuffle.FetchFailedException: Failed to connect > to xxx/yyy:zzz > at > org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:323) > ... > 17/01/13 19:19:29 ERROR scheduler.TaskSetManager: Task 0 in stage 1983.0 > failed 4 times; aborting job > {noformat} > I think the right fix here is to also set a fetch failure status in the > {{TaskContextImpl}}, so the executor can check that instead of just one > exception. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
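A short Scala sketch of the wrapping pattern the issue describes (illustrative user code, not Spark source):
{code}
import org.apache.spark.SparkException

// `writeRow` is a stand-in for any user-defined sink.
def writeAll[T](rows: Iterator[T], writeRow: T => Unit): Unit = {
  try {
    // Pulling from the iterator fetches shuffle blocks, so a
    // FetchFailedException can surface from inside this loop.
    rows.foreach(writeRow)
  } catch {
    case t: Throwable =>
      // The fetch failure is now hidden inside an ordinary SparkException, so
      // the executor reports a plain task failure and the scheduler retries the
      // task instead of regenerating the lost shuffle output.
      throw new SparkException("Task failed while writing rows.", t)
  }
}
{code}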
[jira] [Commented] (SPARK-19796) taskScheduler fails serializing long statements received by thrift server
[ https://issues.apache.org/jira/browse/SPARK-19796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893376#comment-15893376 ] Kay Ousterhout commented on SPARK-19796: Do you think we should (separately) fix the underlying problem? Specifically, we could: (a) not send the SPARK_JOB_DESCRIPTION property to the workers, since it's only used on the master for the UI (and while users *could* access it, the variable name SPARK_JOB_DESCRIPTION is spark-private, which suggests that it shouldn't be used by users). Perhaps this is too risky because users could be using it? (b) Truncate SPARK_JOB_DESCRIPTION to something reasonable (100 characters?) before sending it to the workers. This is more backwards compatible if users are actually reading the property, but maybe a useless intermediate approach? (c) (Possibly in addition to one of the above) Log a warning if any of the properties is longer than 100 characters (or some threshold). Thoughts? I can file a JIRA if you think any of these is worthwhile. > taskScheduler fails serializing long statements received by thrift server > - > > Key: SPARK-19796 > URL: https://issues.apache.org/jira/browse/SPARK-19796 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Giambattista >Priority: Blocker > > This problem was observed after the changes made for SPARK-17931. > In my use-case I'm sending very long insert statements to Spark thrift server > and they are failing at TaskDescription.scala:89 because writeUTF fails if > requested to write strings longer than 64Kb (see > https://www.drillio.com/en/2009/java-encoded-string-too-long-64kb-limit/ for > a description of the issue). > As suggested by Imran Rashid I tracked down the offending key: it is > "spark.job.description" and it contains the complete SQL statement. > The problem can be reproduced by creating a table like: > create table test (a int) using parquet > and by sending an insert statement like: > scala> val r = 1 to 128000 > scala> println("insert into table test values (" + r.mkString("),(") + ")") -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
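A hedged Scala sketch of options (b) and (c) from the comment above (the 100-character limit and the helper name are illustrative, not an actual Spark change):
{code}
import java.util.Properties
import scala.collection.JavaConverters._

def truncateLocalProperties(props: Properties, maxLen: Int = 100): Properties = {
  val result = new Properties()
  props.stringPropertyNames().asScala.foreach { key =>
    val value = props.getProperty(key)
    if (value.length > maxLen) {
      // In Spark this would be a logWarning; println keeps the sketch self-contained.
      println(s"Truncating property $key from ${value.length} to $maxLen characters " +
        "before sending it to executors")
      result.setProperty(key, value.take(maxLen))
    } else {
      result.setProperty(key, value)
    }
  }
  result
}
{code}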
[jira] [Assigned] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks
[ https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout reassigned SPARK-19631: -- Assignee: Patrick Woody > OutputCommitCoordinator should not allow commits for already failed tasks > - > > Key: SPARK-19631 > URL: https://issues.apache.org/jira/browse/SPARK-19631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Patrick Woody >Assignee: Patrick Woody > Fix For: 2.2.0 > > > This is similar to SPARK-6614, but there a race condition where a task may > fail (e.g. Executor heartbeat timeout) and still manage to go through the > commit protocol successfully. After this any retries of the task will fail > indefinitely because of TaskCommitDenied. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19631) OutputCommitCoordinator should not allow commits for already failed tasks
[ https://issues.apache.org/jira/browse/SPARK-19631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19631. Resolution: Fixed Fix Version/s: 2.2.0 > OutputCommitCoordinator should not allow commits for already failed tasks > - > > Key: SPARK-19631 > URL: https://issues.apache.org/jira/browse/SPARK-19631 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Patrick Woody > Fix For: 2.2.0 > > > This is similar to SPARK-6614, but there a race condition where a task may > fail (e.g. Executor heartbeat timeout) and still manage to go through the > commit protocol successfully. After this any retries of the task will fail > indefinitely because of TaskCommitDenied. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13931) Resolve stage hanging up problem in a particular case
[ https://issues.apache.org/jira/browse/SPARK-13931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-13931. Resolution: Fixed Fix Version/s: 2.2.0 > Resolve stage hanging up problem in a particular case > - > > Key: SPARK-13931 > URL: https://issues.apache.org/jira/browse/SPARK-13931 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.4.1, 1.5.2, 1.6.0, 1.6.1 >Reporter: ZhengYaofeng > Fix For: 2.2.0 > > > Suppose the following steps: > 1. Open speculation switch in the application. > 2. Run this app and suppose last task of shuffleMapStage 1 finishes. Let's > get the record straight, from the eyes of DAG, this stage really finishes, > and from the eyes of TaskSetManager, variable 'isZombie' is set to true, but > variable runningTasksSet isn't empty because of speculation. > 3. Suddenly, executor 3 is lost. TaskScheduler receiving this signal, invokes > all executorLost functions of rootPool's taskSetManagers. DAG receiving this > signal, removes all this executor's outputLocs. > 4. TaskSetManager adds all this executor's tasks to pendingTasks and tells > DAG they will be resubmitted (Attention: possibly not on time). > 5. DAG starts to submit a new waitingStage, let's say shuffleMapStage 2, and > going to find that shuffleMapStage 1 is its missing parent because some > outputLocs are removed due to executor lost. Then DAG submits shuffleMapStage > 1 again. > 6. DAG still receives Task 'Resubmitted' signal from old taskSetManager, and > increases the number of pendingTasks of shuffleMapStage 1 each time. However, > old taskSetManager won't resolve new task to submit because its variable > 'isZombie' is set to true. > 7. Finally shuffleMapStage 1 never finishes in DAG together with all stages > depending on it. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19777) Scan runningTasksSet when check speculatable tasks in TaskSetManager.
[ https://issues.apache.org/jira/browse/SPARK-19777?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19777. Resolution: Fixed Assignee: jin xing Fix Version/s: 2.2.0 > Scan runningTasksSet when check speculatable tasks in TaskSetManager. > - > > Key: SPARK-19777 > URL: https://issues.apache.org/jira/browse/SPARK-19777 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing >Priority: Minor > Fix For: 2.2.0 > > > When check speculatable tasks in TaskSetManager, only scan runningTasksSet > instead of scanning all taskInfos. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
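A minimal Scala sketch of the optimization being described (names simplified from TaskSetManager; not the actual patch). Only attempts that are still running can be speculated, so iterating the running set avoids walking every TaskInfo of an already mostly-finished task set.
{code}
final case class TaskInfoLite(index: Int, runtimeMs: Long)

def findSpeculatableTasks(
    runningTasksSet: Set[Long],          // task IDs of attempts still running
    taskInfos: Map[Long, TaskInfoLite],  // info for every attempt, finished or not
    thresholdMs: Long): Seq[Int] = {
  runningTasksSet.toSeq
    .map(taskInfos)                      // look up only the running attempts
    .filter(_.runtimeMs > thresholdMs)   // attempts that have been running "too long"
    .map(_.index)                        // partitions worth speculating
}
{code}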
[jira] [Updated] (SPARK-19772) Flaky test: pyspark.streaming.tests.WindowFunctionTests
[ https://issues.apache.org/jira/browse/SPARK-19772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19772: --- Labels: flaky-test (was: ) > Flaky test: pyspark.streaming.tests.WindowFunctionTests > --- > > Key: SPARK-19772 > URL: https://issues.apache.org/jira/browse/SPARK-19772 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > Labels: flaky-test > > Here's the link to the failed build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73598 > FAIL [16.440s]: test_count_by_value_and_window > (pyspark.streaming.tests.WindowFunctionTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py", > line 668, in test_count_by_value_and_window > self._test_func(input, func, expected) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py", > line 162, in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(0,[312 chars] 2), (5, 1)], [(0, 1), (1, 1), > (2, 1), (3, 1), (4, 1), (5, 1)]] != [[(0,[312 chars] 2), (5, 1)]] > First list contains 1 additional elements. > First extra element 9: > [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)] > [[(0, 1)], >[(0, 2), (1, 1)], >[(0, 3), (1, 2), (2, 1)], >[(0, 4), (1, 3), (2, 2), (3, 1)], >[(0, 5), (1, 4), (2, 3), (3, 2), (4, 1)], >[(0, 5), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)], >[(0, 4), (1, 4), (2, 4), (3, 3), (4, 2), (5, 1)], >[(0, 3), (1, 3), (2, 3), (3, 3), (4, 2), (5, 1)], > - [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)], > ? ^ > + [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)]] > ? ^ > - [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]] > Stdout: > ('timeout after', 15) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19772) Flaky test: pyspark.streaming.tests.WindowFunctionTests
[ https://issues.apache.org/jira/browse/SPARK-19772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888758#comment-15888758 ] Kay Ousterhout commented on SPARK-19772: [~mengxr][~davies] It looks like this came up a while ago in SPARK-7497 and you fixed it. Any chance you could look at this again? > Flaky test: pyspark.streaming.tests.WindowFunctionTests > --- > > Key: SPARK-19772 > URL: https://issues.apache.org/jira/browse/SPARK-19772 > Project: Spark > Issue Type: Bug > Components: PySpark, Structured Streaming, Tests >Affects Versions: 2.2.0 >Reporter: Kay Ousterhout > > Here's the link to the failed build: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73598 > FAIL [16.440s]: test_count_by_value_and_window > (pyspark.streaming.tests.WindowFunctionTests) > -- > Traceback (most recent call last): > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py", > line 668, in test_count_by_value_and_window > self._test_func(input, func, expected) > File > "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py", > line 162, in _test_func > self.assertEqual(expected, result) > AssertionError: Lists differ: [[(0,[312 chars] 2), (5, 1)], [(0, 1), (1, 1), > (2, 1), (3, 1), (4, 1), (5, 1)]] != [[(0,[312 chars] 2), (5, 1)]] > First list contains 1 additional elements. > First extra element 9: > [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)] > [[(0, 1)], >[(0, 2), (1, 1)], >[(0, 3), (1, 2), (2, 1)], >[(0, 4), (1, 3), (2, 2), (3, 1)], >[(0, 5), (1, 4), (2, 3), (3, 2), (4, 1)], >[(0, 5), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)], >[(0, 4), (1, 4), (2, 4), (3, 3), (4, 2), (5, 1)], >[(0, 3), (1, 3), (2, 3), (3, 3), (4, 2), (5, 1)], > - [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)], > ? ^ > + [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)]] > ? ^ > - [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]] > Stdout: > ('timeout after', 15) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19772) Flaky test: pyspark.streaming.tests.WindowFunctionTests
Kay Ousterhout created SPARK-19772: -- Summary: Flaky test: pyspark.streaming.tests.WindowFunctionTests Key: SPARK-19772 URL: https://issues.apache.org/jira/browse/SPARK-19772 Project: Spark Issue Type: Bug Components: PySpark, Structured Streaming, Tests Affects Versions: 2.2.0 Reporter: Kay Ousterhout Here's the link to the failed build: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73598 FAIL [16.440s]: test_count_by_value_and_window (pyspark.streaming.tests.WindowFunctionTests) -- Traceback (most recent call last): File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py", line 668, in test_count_by_value_and_window self._test_func(input, func, expected) File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/streaming/tests.py", line 162, in _test_func self.assertEqual(expected, result) AssertionError: Lists differ: [[(0,[312 chars] 2), (5, 1)], [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]] != [[(0,[312 chars] 2), (5, 1)]] First list contains 1 additional elements. First extra element 9: [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)] [[(0, 1)], [(0, 2), (1, 1)], [(0, 3), (1, 2), (2, 1)], [(0, 4), (1, 3), (2, 2), (3, 1)], [(0, 5), (1, 4), (2, 3), (3, 2), (4, 1)], [(0, 5), (1, 5), (2, 4), (3, 3), (4, 2), (5, 1)], [(0, 4), (1, 4), (2, 4), (3, 3), (4, 2), (5, 1)], [(0, 3), (1, 3), (2, 3), (3, 3), (4, 2), (5, 1)], - [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)], ? ^ + [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2), (5, 1)]] ? ^ - [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1)]] Stdout: ('timeout after', 15) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19597) ExecutorSuite should have test for tasks that are not deserialiazable
[ https://issues.apache.org/jira/browse/SPARK-19597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19597. Resolution: Fixed Fix Version/s: 2.2.0 > ExecutorSuite should have test for tasks that are not deserialiazable > - > > Key: SPARK-19597 > URL: https://issues.apache.org/jira/browse/SPARK-19597 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Imran Rashid >Assignee: Imran Rashid >Priority: Minor > Fix For: 2.2.0 > > > We should have a test case that ensures that Executors gracefully handle a > task that fails to deserialize, by sending back a reasonable failure message. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4681) Turn on executor level blacklisting by default
[ https://issues.apache.org/jira/browse/SPARK-4681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout closed SPARK-4681. - Resolution: Duplicate This was for the old blacklisting mechanism. The linked JIRAs introduce a new blacklisting mechanism that should eventually be enabled by default, but are currently considered experimental. > Turn on executor level blacklisting by default > -- > > Key: SPARK-4681 > URL: https://issues.apache.org/jira/browse/SPARK-4681 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Reporter: Patrick Wendell >Assignee: Kay Ousterhout > > Per discussion in https://github.com/apache/spark/pull/3541. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
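A minimal sketch of opting in to the experimental mechanism instead of waiting for it to become the default (only the top-level flag is shown here; the other spark.blacklist.* tuning keys come from the linked JIRAs):
{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("blacklist-opt-in")
  .set("spark.blacklist.enabled", "true")  // experimental at the time, off by default
{code}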
[jira] [Closed] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor
[ https://issues.apache.org/jira/browse/SPARK-19560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout closed SPARK-19560. -- Resolution: Fixed Target Version/s: 2.2.0 > Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask > from a failed executor > > > Key: SPARK-19560 > URL: https://issues.apache.org/jira/browse/SPARK-19560 > Project: Spark > Issue Type: Test > Components: Scheduler >Affects Versions: 2.1.1 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > There's some tricky code around the case when the DAGScheduler learns of a > ShuffleMapTask that completed successfully, but ran on an executor that > failed sometime after the task was launched. This case is tricky because the > TaskSetManager (i.e., the lower level scheduler) thinks the task completed > successfully, but the DAGScheduler considers the output it generated to be no > longer valid (because it was probably lost when the executor was lost). As a > result, the DAGScheduler needs to re-submit the stage, so that the task can > be re-run. This is tested in some of the tests but not clearly documented, > so we should improve this to prevent future bugs (this was encountered by > [~markhamstra] in attempting to find a better fix for SPARK-19263). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19596) After a Stage is completed, all Tasksets for the stage should be marked as zombie
[ https://issues.apache.org/jira/browse/SPARK-19596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881726#comment-15881726 ] Kay Ousterhout commented on SPARK-19596: I agree that this is an issue (although it would be implicitly fixed if we cancel running tasks in zombie stages, because that would mean that a task attempt from an earlier, still-running stage attempt can't cause a stage to be marked as complete) > After a Stage is completed, all Tasksets for the stage should be marked as > zombie > - > > Key: SPARK-19596 > URL: https://issues.apache.org/jira/browse/SPARK-19596 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.2.0 >Reporter: Imran Rashid > > Fetch Failures can lead to multiple simultaneous tasksets for one stage. The > stage may eventually be finished by task completions from a prior stage > attempt. When this happens, the most recent taskset is not marked as a > zombie. This means that taskset may continue to submit new tasks even after > the stage is complete. > This is not a correctness issue, but it will effect performance, as cluster > resources will get tied up running tasks that are not needed. > This is a follow up to https://issues.apache.org/jira/browse/SPARK-19565. > See some discussion in the pr for that issue: > https://github.com/apache/spark/pull/16901 -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881688#comment-15881688 ] Kay Ousterhout commented on SPARK-19698: My concern is that there are other cases in Spark where this issue could arise (so Spark tasks need to be very careful about how they modify external state). Here's another scenario: - Attempt 0 of a task starts and takes a long time to run - A second, speculative copy of the task is started (attempt 1) - Attempt 0 finishes successfully, but attempt 1 is still running - Attempt 1 gets partway through modifying the external state, but then gets killed because of an OOM on the machine - Attempt 1 won't get re-started, because a copy of the task already finished successfully This seems like it will have the same issue you mentioned in the JIRA, right? > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. > What this means is that if a task is manipulating some state somewhere (for > example: a upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shutdown the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881564#comment-15881564 ] Kay Ousterhout commented on SPARK-19698: I see -- I agree that everything in your description is correct. The driver will allow all tasks to finish if it's still running (e.g., if other tasks are being submitted), but you're right it will shut down the workers while some tasks are still in progress if the Driver shuts down. To think about how to fix this, let me ask you a question about your workload: suppose a task is in the middle of manipulating some external state (as you described in the JIRA description) and it gets killed suddenly because the JVM runs out of memory (e.g., because another concurrently running task used up all of the memory). In that case, the job listener won't be told about the failed task, and it will be re-tried. Does that pose a problem in the same way that the behavior described in the PR is problematic? > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. > What this means is that if a task is manipulating some state somewhere (for > example: a upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shutdown the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14658) when executor lost DagScheduer may submit one stage twice even if the first running taskset for this stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-14658: --- Fix Version/s: 2.2.0 > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > Fix For: 2.2.0 > > > {code} > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > First Time: > {code} > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > {code} > Second Time: > {code} > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 
DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
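For readers seeing this for the first time: the IllegalStateException above comes from the scheduler's invariant that a stage may have at most one non-zombie task set active at a time, so resubmitting stage 57 while attempt 57.1 is still running trips that check. The snippet below is a minimal, hypothetical model of that invariant, with simplified stand-in names rather than the actual TaskSchedulerImpl classes:
{code:scala}
import scala.collection.mutable

// Simplified stand-in for a task set attempt: stageId plus attempt number, and whether the
// attempt has been marked as zombie (no longer allowed to launch new tasks).
case class TaskSetAttempt(stageId: Int, attempt: Int, var isZombie: Boolean = false)

class ActiveTaskSets {
  private val byStage = mutable.Map.empty[Int, mutable.Map[Int, TaskSetAttempt]]

  // Registering a second, still-active attempt for the same stage is treated as a scheduler
  // bug, mirroring the "more than one active taskSet for stage ..." failure in the log above.
  def submit(ts: TaskSetAttempt): Unit = {
    val attempts = byStage.getOrElseUpdate(ts.stageId, mutable.Map.empty)
    attempts(ts.attempt) = ts
    val conflicting = attempts.values.exists(o => o.attempt != ts.attempt && !o.isZombie)
    if (conflicting) {
      throw new IllegalStateException(
        s"more than one active taskSet for stage ${ts.stageId}: " +
          attempts.keys.toSeq.sorted.reverse.map(a => s"${ts.stageId}.$a").mkString(","))
    }
  }
}
{code}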
[jira] [Resolved] (SPARK-14658) when an executor is lost the DAGScheduler may submit one stage twice even if the first running taskset for this stage is not finished
[ https://issues.apache.org/jira/browse/SPARK-14658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-14658. Resolution: Duplicate I'm fairly sure this duplicates SPARK-19263, as Mark mentioned on the PR. Check out this comment for a description of what's going on: https://github.com/apache/spark/pull/16620#issuecomment-279125227 Josh, feel free to re-open if you think this is a different issue. > when executor lost DagScheduer may submit one stage twice even if the first > running taskset for this stage is not finished > -- > > Key: SPARK-14658 > URL: https://issues.apache.org/jira/browse/SPARK-14658 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1, 2.0.0, 2.1.0, 2.2.0 > Environment: spark1.6.1 hadoop-2.6.0-cdh5.4.2 >Reporter: yixiaohua > > {code} > 16/04/14 15:35:22 ERROR DAGSchedulerEventProcessLoop: > DAGSchedulerEventProcessLoop failed; shutting down SparkContext > java.lang.IllegalStateException: more than one active taskSet for stage 57: > 57.2,57.1 > at > org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173) > at > org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) > at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > First Time: > {code} > 16/04/14 15:35:20 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 5, 8, 9, 12, > 13, 16, 17, 18, 19, 23, 26, 27, 28, 29, 30, 31, 40, 42, 43, 48, 49, 50, 51, > 52, 53, 55, 56, 57, 59, 60, 61, 67, 70, 71, 84, 85, 86, 87, 98, 99, 100, 101, > 108, 109, 110, 111, 112, 113, 114, 115, 126, 127, 134, 136, 137, 146, 147, > 150, 151, 154, 155, 158, 159, 162, 163, 164, 165, 166, 167, 170, 171, 172, > 173, 174, 175, 176, 177, 178, 179, 180, 181, 188, 189, 190, 191, 198, 199, > 204, 206, 207, 208, 218, 219, 222, 223, 230, 231, 236, 238, 239 > 16/04/14 15:35:20 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:20 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:20 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:20 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:20 INFO DAGScheduler: Submitting 100 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:20 DEBUG DAGScheduler: New pending partitions: Set(206, 177, > 127, 98, 48, 27, 23, 163, 238, 188, 159, 28, 109, 59, 9, 176, 126, 207, 174, > 43, 170, 208, 158, 108, 29, 8, 204, 154, 223, 173, 219, 190, 111, 61, 40, > 136, 115, 86, 57, 155, 55, 230, 222, 180, 172, 151, 101, 18, 166, 56, 137, > 87, 52, 171, 71, 42, 167, 198, 67, 17, 236, 165, 13, 5, 53, 178, 99, 70, 49, > 218, 147, 164, 114, 85, 60, 31, 179, 150, 19, 100, 50, 175, 146, 134, 113, > 84, 51, 30, 199, 26, 16, 191, 162, 112, 12, 239, 231, 189, 181, 110) > {code} > Second Time: > 
{code} > 16/04/14 15:35:22 INFO DAGScheduler: Resubmitting ShuffleMapStage 57 (run at > AccessController.java:-2) because some of its tasks had failed: 26 > 16/04/14 15:35:22 DEBUG DAGScheduler: submitStage(ShuffleMapStage 57) > 16/04/14 15:35:22 DEBUG DAGScheduler: missing: List() > 16/04/14 15:35:22 INFO DAGScheduler: Submitting ShuffleMapStage 57 > (MapPartitionsRDD[7887] at run at AccessController.java:-2), which has no > missing parents > 16/04/14 15:35:22 DEBUG DAGScheduler: submitMissingTasks(ShuffleMapStage 57) > 16/04/14 15:35:22 INFO DAGScheduler: Submitting 1 missing tasks from > ShuffleMapStage 57 (MapPartitionsRDD[7887] at run at AccessController.java:-2) > 16/04/14 15:35:22 DEBUG DAGScheduler: New pending partitions: Set(26) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19263) DAGScheduler should avoid sending conflicting task set.
[ https://issues.apache.org/jira/browse/SPARK-19263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881406#comment-15881406 ] Kay Ousterhout commented on SPARK-19263: Just noting that this was fixed by https://github.com/apache/spark/pull/16620 (the other PR was accidentally created with the same JIRA ID) > DAGScheduler should avoid sending conflicting task set. > --- > > Key: SPARK-19263 > URL: https://issues.apache.org/jira/browse/SPARK-19263 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing > Fix For: 2.2.0 > > > In current *DAGScheduler handleTaskCompletion* code, when *event.reason* is > *Success*, it will first do *stage.pendingPartitions -= task.partitionId*, > which maybe a bug when *FetchFailed* happens. Think about below: > # Stage 0 runs and generates shuffle output data. > # Stage 1 reads the output from stage 0 and generates more shuffle data. It > has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are > launched on executorA. > # ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to > the driver. The driver marks executorA as lost and updates failedEpoch; > # The driver resubmits stage 0 so the missing output can be re-generated, and > then once it completes, resubmits stage 1 with ShuffleMapTask1x and > ShuffleMapTask2x. > # ShuffleMapTask2 (from the original attempt of stage 1) successfully > finishes on executorA and sends Success back to driver. This causes > DAGScheduler::handleTaskCompletion to remove partition 2 from > stage.pendingPartitions (line 1149), but it does not add the partition to the > set of output locations (line 1192), because the task’s epoch is less than > the failure epoch for the executor (because of the earlier failure on > executor A) > # ShuffleMapTask1x successfully finishes on executorB, causing the driver to > remove partition 1 from stage.pendingPartitions. Combined with the previous > step, this means that there are no more pending partitions for the stage, so > the DAGScheduler marks the stage as finished (line 1196). However, the > shuffle stage is not available (line 1215) because the completion for > ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler > resubmits the stage. > # ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks > is called for the re-submitted stage, it throws an error, because there’s an > existing active task set > To reproduce the bug: > 1. We need to do some modification in *ShuffleBlockFetcherIterator*: check > whether the task's index in *TaskSetManager* and stage attempt equal to 0 at > the same time, if so, throw FetchFailedException; > 2. Rebuild spark then submit following job: > {code} > val rdd = sc.parallelize(List((0, 1), (1, 1), (2, 1), (3, 1), (1, 2), (0, > 3), (2, 1), (3, 1)), 2) > rdd.reduceByKey { > (v1, v2) => { > Thread.sleep(1) > v1 + v2 > } > }.map { > keyAndValue => { > (keyAndValue._1 % 2, keyAndValue._2) > } > }.reduceByKey { > (v1, v2) => { > Thread.sleep(1) > v1 + v2 > } > }.collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
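A compact way to see the inconsistency described in steps 3-6 above is to model the two pieces of bookkeeping side by side. The REPL-style sketch below is a hypothetical simplification (the names and types are stand-ins, not the real DAGScheduler state): pendingPartitions shrinks on every Success, while output registration is skipped when the task's epoch is older than the executor's failure epoch, so a stage can look finished while its shuffle output is still missing.
{code:scala}
import scala.collection.mutable

class ShuffleStageModel(numPartitions: Int) {
  val pendingPartitions: mutable.Set[Int] = mutable.Set(0 until numPartitions: _*)
  val outputLocs: mutable.Map[Int, String] = mutable.Map.empty   // partition -> executor
  val failedEpoch: mutable.Map[String, Long] = mutable.Map.empty // executor -> epoch when marked failed

  def handleSuccess(partition: Int, executor: String, taskEpoch: Long): Unit = {
    pendingPartitions -= partition                           // always removed on Success...
    val stale = failedEpoch.get(executor).exists(taskEpoch < _)
    if (!stale) outputLocs(partition) = executor             // ...but stale output is not registered
  }

  def looksFinished: Boolean = pendingPartitions.isEmpty      // drives "stage complete"
  def isAvailable: Boolean = outputLocs.size == numPartitions // drives "shuffle output ready"
}

// Replaying the scenario: executorA is marked failed after launching one of stage 1's tasks.
val stage1 = new ShuffleStageModel(numPartitions = 2)         // partitions 0 and 1
stage1.failedEpoch("executorA") = 1L
stage1.handleSuccess(partition = 0, executor = "executorB", taskEpoch = 2L) // resubmitted task
stage1.handleSuccess(partition = 1, executor = "executorA", taskEpoch = 0L) // stale original task
assert(stage1.looksFinished && !stage1.isAvailable) // "complete" but unusable -> stage resubmitted
{code}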
[jira] [Comment Edited] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881358#comment-15881358 ] Kay Ousterhout edited comment on SPARK-19698 at 2/23/17 9:57 PM: - I think this is the same issue as SPARK-19263 -- can you check to see if that fixes the problem / have you looked at that JIRA? I wrote a super long description of the problem towards the end of the associated PR. One more note is that right now, Spark won't cancel running task attempts (although there's a JIRA to fix this), even when a stage is marked as failed. So the exact scenario you described, where the 2nd task attempt gets shut down, shouldn't occur (the driver will wait for the 2nd task attempt to complete, but will ignore the result). was (Author: kayousterhout): I think this is the same issue as SPARK-19263 -- can you check to see if that fixes the problem / have you looked at that JIRA? I wrote a super long description of the problem towards the end of the associated PR. > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. > What this means is that if a task is manipulating some state somewhere (for > example: a upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shutdown the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19698) Race condition in stale attempt task completion vs current attempt task completion when task is doing persistent state changes
[ https://issues.apache.org/jira/browse/SPARK-19698?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15881358#comment-15881358 ] Kay Ousterhout commented on SPARK-19698: I think this is the same issue as SPARK-19263 -- can you check to see if that fixes the problem / have you looked at that JIRA? I wrote a super long description of the problem towards the end of the associated PR. > Race condition in stale attempt task completion vs current attempt task > completion when task is doing persistent state changes > -- > > Key: SPARK-19698 > URL: https://issues.apache.org/jira/browse/SPARK-19698 > Project: Spark > Issue Type: Bug > Components: Mesos, Spark Core >Affects Versions: 2.0.0 >Reporter: Charles Allen > > We have encountered a strange scenario in our production environment. Below > is the best guess we have right now as to what's going on. > Potentially, the final stage of a job has a failure in one of the tasks (such > as OOME on the executor) which can cause tasks for that stage to be > relaunched in a second attempt. > https://github.com/apache/spark/blob/v2.1.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1155 > keeps track of which tasks have been completed, but does NOT keep track of > which attempt those tasks were completed in. As such, we have encountered a > scenario where a particular task gets executed twice in different stage > attempts, and the DAGScheduler does not consider if the second attempt is > still running. This means if the first task attempt succeeded, the second > attempt can be cancelled part-way through its run cycle if all other tasks > (including the prior failed) are completed successfully. > What this means is that if a task is manipulating some state somewhere (for > example: a upload-to-temporary-file-location, then delete-then-move on an > underlying s3n storage implementation) the driver can improperly shutdown the > running (2nd attempt) task between state manipulations, leaving the > persistent state in a bad state since the 2nd attempt never got to complete > its manipulations, and was terminated prematurely at some arbitrary point in > its state change logic (ex: finished the delete but not the move). > This is using the mesos coarse grained executor. It is unclear if this > behavior is limited to the mesos coarse grained executor or not. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
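The failure mode in the description is easiest to see as a multi-step, externally visible commit with no atomicity. Below is a hypothetical illustration (FileSystemLike and its methods are invented stand-ins for whatever storage the task manipulates, e.g. the s3n delete-then-move mentioned above); if the stale attempt is killed between the delete and the move, the destination is left in a broken intermediate state:
{code:scala}
// Invented stand-in for the external store the task mutates; not a real Spark or Hadoop API.
trait FileSystemLike {
  def upload(path: String): Unit
  def delete(path: String): Unit
  def move(from: String, to: String): Unit
}

def commitOutput(fs: FileSystemLike, tmp: String, dest: String): Unit = {
  fs.upload(tmp)     // step 1: write the new output to a temporary location
  fs.delete(dest)    // step 2: remove the previous output
  // <-- if the stale-but-still-running attempt is killed here, dest is gone and the new
  //     output was never published, which is the bad persistent state the issue describes
  fs.move(tmp, dest) // step 3: publish the new output
}
{code}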
[jira] [Resolved] (SPARK-19684) Move info about running specific tests to developer website
[ https://issues.apache.org/jira/browse/SPARK-19684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19684. Resolution: Fixed Fix Version/s: 2.2.0 > Move info about running specific tests to developer website > --- > > Key: SPARK-19684 > URL: https://issues.apache.org/jira/browse/SPARK-19684 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.1.1 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 2.2.0 > > > This JIRA accompanies this change to the website: > https://github.com/apache/spark-website/pull/33. > Running individual tests is not something that changes with new versions of > the project, and is primarily used by developers (not users) so should be > moved to the developer-tools page of the main website (with a link from the > building-spark page on the release-specific docs). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19684) Move info about running specific tests to developer website
Kay Ousterhout created SPARK-19684: -- Summary: Move info about running specific tests to developer website Key: SPARK-19684 URL: https://issues.apache.org/jira/browse/SPARK-19684 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.1.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor This JIRA accompanies this change to the website: https://github.com/apache/spark-website/pull/33. Running individual tests is not something that changes with new versions of the project, and is primarily used by developers (not users) so should be moved to the developer-tools page of the main website (with a link from the building-spark page on the release-specific docs). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19263) DAGScheduler should avoid sending conflicting task set.
[ https://issues.apache.org/jira/browse/SPARK-19263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19263. Resolution: Fixed Assignee: jin xing Fix Version/s: 1.2.0 > DAGScheduler should avoid sending conflicting task set. > --- > > Key: SPARK-19263 > URL: https://issues.apache.org/jira/browse/SPARK-19263 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing > Fix For: 1.2.0 > > > In current *DAGScheduler handleTaskCompletion* code, when *event.reason* is > *Success*, it will first do *stage.pendingPartitions -= task.partitionId*, > which maybe a bug when *FetchFailed* happens. Think about below: > # Stage 0 runs and generates shuffle output data. > # Stage 1 reads the output from stage 0 and generates more shuffle data. It > has two tasks: ShuffleMapTask1 and ShuffleMapTask2, and these tasks are > launched on executorA. > # ShuffleMapTask1 fails to fetch blocks locally and sends a FetchFailed to > the driver. The driver marks executorA as lost and updates failedEpoch; > # The driver resubmits stage 0 so the missing output can be re-generated, and > then once it completes, resubmits stage 1 with ShuffleMapTask1x and > ShuffleMapTask2x. > # ShuffleMapTask2 (from the original attempt of stage 1) successfully > finishes on executorA and sends Success back to driver. This causes > DAGScheduler::handleTaskCompletion to remove partition 2 from > stage.pendingPartitions (line 1149), but it does not add the partition to the > set of output locations (line 1192), because the task’s epoch is less than > the failure epoch for the executor (because of the earlier failure on > executor A) > # ShuffleMapTask1x successfully finishes on executorB, causing the driver to > remove partition 1 from stage.pendingPartitions. Combined with the previous > step, this means that there are no more pending partitions for the stage, so > the DAGScheduler marks the stage as finished (line 1196). However, the > shuffle stage is not available (line 1215) because the completion for > ShuffleMapTask2 was ignored because of its epoch, so the DAGScheduler > resubmits the stage. > # ShuffleMapTask2x is still running, so when TaskSchedulerImpl::submitTasks > is called for the re-submitted stage, it throws an error, because there’s an > existing active task set > To reproduce the bug: > 1. We need to do some modification in *ShuffleBlockFetcherIterator*: check > whether the task's index in *TaskSetManager* and stage attempt equal to 0 at > the same time, if so, throw FetchFailedException; > 2. Rebuild spark then submit following job: > {code} > val rdd = sc.parallelize(List((0, 1), (1, 1), (2, 1), (3, 1), (1, 2), (0, > 3), (2, 1), (3, 1)), 2) > rdd.reduceByKey { > (v1, v2) => { > Thread.sleep(1) > v1 + v2 > } > }.map { > keyAndValue => { > (keyAndValue._1 % 2, keyAndValue._2) > } > }.reduceByKey { > (v1, v2) => { > Thread.sleep(1) > v1 + v2 > } > }.collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19613) Flaky test: StateStoreRDDSuite.versioning and immutability
Kay Ousterhout created SPARK-19613: -- Summary: Flaky test: StateStoreRDDSuite.versioning and immutability Key: SPARK-19613 URL: https://issues.apache.org/jira/browse/SPARK-19613 Project: Spark Issue Type: Bug Components: Structured Streaming, Tests Affects Versions: 2.1.1 Reporter: Kay Ousterhout Priority: Minor This test: org.apache.spark.sql.execution.streaming.state.StateStoreRDDSuite.versioning and immutability failed on a recent PR: https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/72948/testReport/junit/org.apache.spark.sql.execution.streaming.state/StateStoreRDDSuite/versioning_and_immutability/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19612) Tests failing with timeout
[ https://issues.apache.org/jira/browse/SPARK-19612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15868565#comment-15868565 ] Kay Ousterhout commented on SPARK-19612: Does that mean we could potentially fix this by limiting the concurrency on Jenkins? > Tests failing with timeout > -- > > Key: SPARK-19612 > URL: https://issues.apache.org/jira/browse/SPARK-19612 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.1 >Reporter: Kay Ousterhout >Priority: Minor > > I've seen at least one recent test failure due to hitting the 250m timeout: > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/ > Filing this JIRA to track this; if it happens repeatedly we should up the > timeout. > cc [~shaneknapp] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19612) Tests failing with timeout
Kay Ousterhout created SPARK-19612: -- Summary: Tests failing with timeout Key: SPARK-19612 URL: https://issues.apache.org/jira/browse/SPARK-19612 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 2.1.1 Reporter: Kay Ousterhout Priority: Minor I've seen at least one recent test failure due to hitting the 250m timeout: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72882/ Filing this JIRA to track this; if it happens repeatedly we should up the timeout. cc [~shaneknapp] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage
[ https://issues.apache.org/jira/browse/SPARK-19537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19537. Resolution: Fixed Fix Version/s: 2.2.0 > Move the pendingPartitions variable from Stage to ShuffleMapStage > - > > Key: SPARK-19537 > URL: https://issues.apache.org/jira/browse/SPARK-19537 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 2.2.0 > > > This variable is only used by ShuffleMapStages, and it is confusing to have > it in the Stage class rather than the ShuffleMapStage class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
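A before/after sketch of the refactoring, with deliberately simplified class shapes (the real Stage and ShuffleMapStage constructors take many more parameters):
{code:scala}
import scala.collection.mutable.HashSet

// Before: the field lived on the shared base class, even though only shuffle map stages use it.
abstract class StageBefore(val id: Int) {
  val pendingPartitions = new HashSet[Int]
}

// After: the field lives only where it is meaningful.
abstract class Stage(val id: Int)

class ShuffleMapStage(id: Int) extends Stage(id) {
  // Partitions of the current stage attempt that have not yet reported a usable result.
  val pendingPartitions = new HashSet[Int]
}

class ResultStage(id: Int) extends Stage(id) // no pendingPartitions here
{code}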
[jira] [Closed] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-19502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout closed SPARK-19502. -- Resolution: Not A Problem This code actually is currently needed to handle cases where a ShuffleMapTask succeeds on an executor, but that executor was marked as failed (so the task needs to be re-run), as described in this comment: https://github.com/apache/spark/pull/16620#issuecomment-279125227 > Remove unnecessary code to re-submit stages in the DAGScheduler > --- > > Key: SPARK-19502 > URL: https://issues.apache.org/jira/browse/SPARK-19502 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.1.1 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > There are a [few lines of code in the > DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215) > to re-submit shuffle map stages when some of the tasks fail. My > understanding is that there should be a 1:1 mapping between pending tasks > (which are tasks that haven't completed successfully) and available output > locations, so that code should never be reachable. Furthermore, the approach > taken by that code (to re-submit an entire stage as a result of task > failures) is not how we handle task failures in a stage (the lower-level > scheduler resubmits the individual tasks) which is what the 5-years-old TODO > on that code seems to be implying should be done. > The big caveat is that there's a bug being fixed in SPARK-19263 that means > there is *not* a 1:1 relationship between pendingTasks and available > outputLocations, so that code is serving as a (buggy) band-aid. This should > be fixed once we resolve SPARK-19263. > cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you > see any reason we actually do need that code) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
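The few lines in question can be paraphrased as the check below. This is a hypothetical simplification rather than the actual DAGScheduler code, but it captures why the code is currently load-bearing: with the SPARK-19263 behaviour, a stage can run out of pending tasks while some of its map output is still missing, and this check is what resubmits it.
{code:scala}
// Simplified stand-ins for the relevant bits of stage state.
final case class ShuffleStageState(pendingPartitions: Set[Int], availableOutputs: Set[Int], numPartitions: Int)

def maybeResubmit(stage: ShuffleStageState, resubmit: () => Unit): Unit = {
  val noTasksPending = stage.pendingPartitions.isEmpty
  val outputMissing  = stage.availableOutputs.size < stage.numPartitions
  if (noTasksPending && outputMissing) {
    // Re-run the whole stage so the missing partitions are recomputed -- the band-aid this
    // issue proposed to remove, kept for now because of the SPARK-19263 scenario.
    resubmit()
  }
}
{code}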
[jira] [Updated] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.
[ https://issues.apache.org/jira/browse/SPARK-19538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19538: --- Priority: Minor (was: Major) > DAGScheduler and TaskSetManager can have an inconsistent view of whether a > stage is complete. > - > > Key: SPARK-19538 > URL: https://issues.apache.org/jira/browse/SPARK-19538 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > The pendingPartitions in Stage tracks partitions that still need to be > computed, and is used by the DAGScheduler to determine when to mark the stage > as complete. In most cases, this variable is exactly consistent with the > tasks in the TaskSetManager (for the current version of the stage) that are > still pending. However, as discussed in SPARK-19263, these can become > inconsistent when an ShuffleMapTask for an earlier attempt of the stage > completes, in which case the DAGScheduler may think the stage has finished, > while the TaskSetManager is still waiting for some tasks to complete (see the > description in this pull request: > https://github.com/apache/spark/pull/16620). This leads to bugs like > SPARK-19263. Another problem with this behavior is that listeners can get > two StageCompleted messages: once when the DAGScheduler thinks the stage is > complete, and a second when the TaskSetManager later decides the stage is > complete. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19560) Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor
Kay Ousterhout created SPARK-19560: -- Summary: Improve tests for when DAGScheduler learns of "successful" ShuffleMapTask from a failed executor Key: SPARK-19560 URL: https://issues.apache.org/jira/browse/SPARK-19560 Project: Spark Issue Type: Test Components: Scheduler Affects Versions: 2.1.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor There's some tricky code around the case when the DAGScheduler learns of a ShuffleMapTask that completed successfully, but ran on an executor that failed sometime after the task was launched. This case is tricky because the TaskSetManager (i.e., the lower level scheduler) thinks the task completed successfully, but the DAGScheduler considers the output it generated to be no longer valid (because it was probably lost when the executor was lost). As a result, the DAGScheduler needs to re-submit the stage, so that the task can be re-run. This is tested in some of the tests but not clearly documented, so we should improve this to prevent future bugs (this was encountered by [~markhamstra] in attempting to find a better fix for SPARK-19263). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19559) Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions
[ https://issues.apache.org/jira/browse/SPARK-19559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19559: --- Description: This test has started failing frequently recently; e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ and https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ cc [~zsxwing] and [~tcondie] who seemed to have modified the related code most recently was:This test has started failing frequently recently; e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ and https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ > Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions > > > Key: SPARK-19559 > URL: https://issues.apache.org/jira/browse/SPARK-19559 > Project: Spark > Issue Type: Bug > Components: Structured Streaming, Tests >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout > > This test has started failing frequently recently; e.g., > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ > and > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ > cc [~zsxwing] and [~tcondie] who seemed to have modified the related code > most recently -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19559) Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions
Kay Ousterhout created SPARK-19559: -- Summary: Fix flaky KafkaSourceSuite.subscribing topic by pattern with topic deletions Key: SPARK-19559 URL: https://issues.apache.org/jira/browse/SPARK-19559 Project: Spark Issue Type: Bug Components: Structured Streaming, Tests Affects Versions: 2.1.0 Reporter: Kay Ousterhout This test has started failing frequently recently; e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72720/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ and https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72725/testReport/org.apache.spark.sql.kafka010/KafkaSourceSuite/subscribing_topic_by_pattern_with_topic_deletions/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19466) Improve Fair Scheduler Logging
[ https://issues.apache.org/jira/browse/SPARK-19466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19466. Resolution: Fixed Assignee: Eren Avsarogullari Fix Version/s: 2.2.0 > Improve Fair Scheduler Logging > -- > > Key: SPARK-19466 > URL: https://issues.apache.org/jira/browse/SPARK-19466 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Minor > Fix For: 2.2.0 > > > Fair Scheduler Logging for the following cases can be useful for the user. > 1- If *valid* spark.scheduler.allocation.file property is set, user can be > informed so user can aware which scheduler file is processed when > SparkContext initializes. > 2- If *invalid* spark.scheduler.allocation.file property is set, currently, > the following stacktrace is shown to user. In addition to this, more > meaningful message can be shown to user by emphasizing the problem at > building level of fair scheduler and covering other potential issues at this > level. > {code:xml} > Exception in thread "main" java.io.FileNotFoundException: INVALID_FILE (No > such file or directory) > at java.io.FileInputStream.open0(Native Method) > at java.io.FileInputStream.open(FileInputStream.java:195) > at java.io.FileInputStream.(FileInputStream.java:138) > at java.io.FileInputStream.(FileInputStream.java:93) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:76) > at > org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:75) > {code} > 3- If spark.scheduler.allocation.file property is not set and *default* fair > scheduler file(fairscheduler.xml) is found in classpath, it will be loaded > but currently, user is not informed so logging can be useful. > 4- If spark.scheduler.allocation.file property is not set and default fair > scheduler file does not exist, currently, user is not informed so logging can > be useful. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
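For reference, the configuration the improved logging is about is set as shown below; the property and class names are real, but the file path is only a placeholder:
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("fair-scheduler-logging-example")
  .set("spark.scheduler.mode", "FAIR")
  // If this points at a missing file, the user today only sees the FileNotFoundException
  // stack trace quoted above; this issue adds clearer log messages for such cases.
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
{code}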
[jira] [Created] (SPARK-19538) DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete.
Kay Ousterhout created SPARK-19538: -- Summary: DAGScheduler and TaskSetManager can have an inconsistent view of whether a stage is complete. Key: SPARK-19538 URL: https://issues.apache.org/jira/browse/SPARK-19538 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 2.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout The pendingPartitions in Stage tracks partitions that still need to be computed, and is used by the DAGScheduler to determine when to mark the stage as complete. In most cases, this variable is exactly consistent with the tasks in the TaskSetManager (for the current version of the stage) that are still pending. However, as discussed in SPARK-19263, these can become inconsistent when an ShuffleMapTask for an earlier attempt of the stage completes, in which case the DAGScheduler may think the stage has finished, while the TaskSetManager is still waiting for some tasks to complete (see the description in this pull request: https://github.com/apache/spark/pull/16620). This leads to bugs like SPARK-19263. Another problem with this behavior is that listeners can get two StageCompleted messages: once when the DAGScheduler thinks the stage is complete, and a second when the TaskSetManager later decides the stage is complete. We should fix this. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19537) Move the pendingPartitions variable from Stage to ShuffleMapStage
Kay Ousterhout created SPARK-19537: -- Summary: Move the pendingPartitions variable from Stage to ShuffleMapStage Key: SPARK-19537 URL: https://issues.apache.org/jira/browse/SPARK-19537 Project: Spark Issue Type: Improvement Components: Scheduler Affects Versions: 2.1.0 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor This variable is only used by ShuffleMapStages, and it is confusing to have it in the Stage class rather than the ShuffleMapStage class. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off
[ https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855615#comment-15855615 ] Kay Ousterhout edited comment on SPARK-18967 at 2/7/17 9:42 PM: Please use 2.2.0, not 2.2 was (Author: rxin): Oops this was me [~rxin] sorry! > Locality preferences should be used when scheduling even when delay > scheduling is turned off > > > Key: SPARK-18967 > URL: https://issues.apache.org/jira/browse/SPARK-18967 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > Fix For: 2.2.0 > > > If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you > effectively turn off the use the of locality preferences when there is a bulk > scheduling event. {{TaskSchedulerImpl}} will use resources based on whatever > random order it decides to shuffle them, rather than taking advantage of the > most local options. > This happens because {{TaskSchedulerImpl}} offers resources to a > {{TaskSetManager}} one at a time, each time subject to a maxLocality > constraint. However, that constraint doesn't move through all possible > locality levels -- it uses [{{tsm.myLocalityLevels}} > |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360]. > And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait > == 0 | > https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953]. > So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to > giving tsms the offers with {{maxLocality = ANY}}. > *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use > {{spark.locality.wait=1ms}}. The one downside of this is if you have tasks > that actually take less than 1ms. You could even run into SPARK-18886. But > that is a relatively unlikely scenario for real workloads. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off
[ https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15856842#comment-15856842 ] Kay Ousterhout commented on SPARK-18967: Oops this was me [~rxin] sorry! > Locality preferences should be used when scheduling even when delay > scheduling is turned off > > > Key: SPARK-18967 > URL: https://issues.apache.org/jira/browse/SPARK-18967 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > Fix For: 2.2.0 > > > If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you > effectively turn off the use the of locality preferences when there is a bulk > scheduling event. {{TaskSchedulerImpl}} will use resources based on whatever > random order it decides to shuffle them, rather than taking advantage of the > most local options. > This happens because {{TaskSchedulerImpl}} offers resources to a > {{TaskSetManager}} one at a time, each time subject to a maxLocality > constraint. However, that constraint doesn't move through all possible > locality levels -- it uses [{{tsm.myLocalityLevels}} > |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360]. > And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait > == 0 | > https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953]. > So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to > giving tsms the offers with {{maxLocality = ANY}}. > *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use > {{spark.locality.wait=1ms}}. The one downside of this is if you have tasks > that actually take less than 1ms. You could even run into SPARK-18886. But > that is a relatively unlikely scenario for real workloads. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
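The mechanism in the description boils down to the toy model below, a hypothetical simplification of the myLocalityLevels computation rather than the real code: a locality level only survives if its configured wait is non-zero, so spark.locality.wait=0 collapses everything to ANY, while the 1ms workaround keeps the more local levels.
{code:scala}
import scala.concurrent.duration._

sealed trait LocalityLevel
case object ProcessLocal extends LocalityLevel
case object NodeLocal extends LocalityLevel
case object RackLocal extends LocalityLevel
case object AnyLevel extends LocalityLevel

// Simplification: one shared wait for all levels, and "hasLocalPrefs" standing in for
// "there are pending tasks with a preference at that level".
def validLocalityLevels(localityWait: Duration, hasLocalPrefs: Boolean): Seq[LocalityLevel] =
  if (hasLocalPrefs && localityWait > Duration.Zero)
    Seq(ProcessLocal, NodeLocal, RackLocal, AnyLevel)
  else
    Seq(AnyLevel)

// validLocalityLevels(0.millis, hasLocalPrefs = true) -> Seq(AnyLevel): preferences are ignored
// validLocalityLevels(1.millis, hasLocalPrefs = true) -> all levels: preferences are consulted
{code}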
[jira] [Comment Edited] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off
[ https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15855615#comment-15855615 ] Kay Ousterhout edited comment on SPARK-18967 at 2/7/17 9:40 PM: Oops this was me [~rxin] sorry! was (Author: rxin): [~imranr] please don't use "2.2" as the version. It should be "2.2.0". Thanks! > Locality preferences should be used when scheduling even when delay > scheduling is turned off > > > Key: SPARK-18967 > URL: https://issues.apache.org/jira/browse/SPARK-18967 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > Fix For: 2.2.0 > > > If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you > effectively turn off the use the of locality preferences when there is a bulk > scheduling event. {{TaskSchedulerImpl}} will use resources based on whatever > random order it decides to shuffle them, rather than taking advantage of the > most local options. > This happens because {{TaskSchedulerImpl}} offers resources to a > {{TaskSetManager}} one at a time, each time subject to a maxLocality > constraint. However, that constraint doesn't move through all possible > locality levels -- it uses [{{tsm.myLocalityLevels}} > |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360]. > And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait > == 0 | > https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953]. > So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to > giving tsms the offers with {{maxLocality = ANY}}. > *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use > {{spark.locality.wait=1ms}}. The one downside of this is if you have tasks > that actually take less than 1ms. You could even run into SPARK-18886. But > that is a relatively unlikely scenario for real workloads. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler
[ https://issues.apache.org/jira/browse/SPARK-19502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19502: --- Description: There are a [few lines of code in the DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215) to re-submit shuffle map stages when some of the tasks fail. My understanding is that there should be a 1:1 mapping between pending tasks (which are tasks that haven't completed successfully) and available output locations, so that code should never be reachable. Furthermore, the approach taken by that code (to re-submit an entire stage as a result of task failures) is not how we handle task failures in a stage (the lower-level scheduler resubmits the individual tasks) which is what the 5-years-old TODO on that code seems to be implying should be done. The big caveat is that there's a bug being fixed in SPARK-19263 that means there is *not* a 1:1 relationship between pendingTasks and available outputLocations, so that code is serving as a (buggy) band-aid. This should be fixed once we resolve SPARK-19263. cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you see any reason we actually do need that code) was: There are a [few lines of code in the DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215) to re-submit shuffle map stages when some of the tasks fail. My understanding is that there should be a 1:1 mapping between pending tasks (which are tasks that haven't completed successfully) and available output locations, so that code should never be reachable. Furthermore, the approach taken by that code (to re-submit an entire stage as a result of task failures) is not how we handle task failures in a stage (the lower-level scheduler resubmits the individual tasks) which is what the 5-years-old TODO on that code seems to be implying should be done. The big caveat is that there's a bug being fixed in SPARK-19263 that means there is *not* a 1:1 relationship between pendingTasks and available outputLocations, so that code is serving as a (buggy) band-aid. This should be fixed once we resolve SPARK-19263. cc [~imranr] [~markhamstra] [~jinxing6...@126.com] > Remove unnecessary code to re-submit stages in the DAGScheduler > --- > > Key: SPARK-19502 > URL: https://issues.apache.org/jira/browse/SPARK-19502 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.1.1 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > > There are a [few lines of code in the > DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215) > to re-submit shuffle map stages when some of the tasks fail. My > understanding is that there should be a 1:1 mapping between pending tasks > (which are tasks that haven't completed successfully) and available output > locations, so that code should never be reachable. Furthermore, the approach > taken by that code (to re-submit an entire stage as a result of task > failures) is not how we handle task failures in a stage (the lower-level > scheduler resubmits the individual tasks) which is what the 5-years-old TODO > on that code seems to be implying should be done. 
> The big caveat is that there's a bug being fixed in SPARK-19263 that means > there is *not* a 1:1 relationship between pendingTasks and available > outputLocations, so that code is serving as a (buggy) band-aid. This should > be fixed once we resolve SPARK-19263. > cc [~imranr] [~markhamstra] [~jinxing6...@126.com] (let me know if any of you > see any reason we actually do need that code) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19502) Remove unnecessary code to re-submit stages in the DAGScheduler
Kay Ousterhout created SPARK-19502: -- Summary: Remove unnecessary code to re-submit stages in the DAGScheduler Key: SPARK-19502 URL: https://issues.apache.org/jira/browse/SPARK-19502 Project: Spark Issue Type: Bug Components: Scheduler Affects Versions: 1.1.1 Reporter: Kay Ousterhout Assignee: Kay Ousterhout Priority: Minor There are a [few lines of code in the DAGScheduler](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1215) to re-submit shuffle map stages when some of the tasks fail. My understanding is that there should be a 1:1 mapping between pending tasks (which are tasks that haven't completed successfully) and available output locations, so that code should never be reachable. Furthermore, the approach taken by that code (to re-submit an entire stage as a result of task failures) is not how we handle task failures in a stage (the lower-level scheduler resubmits the individual tasks) which is what the 5-years-old TODO on that code seems to be implying should be done. The big caveat is that there's a bug being fixed in SPARK-19263 that means there is *not* a 1:1 relationship between pendingTasks and available outputLocations, so that code is serving as a (buggy) band-aid. This should be fixed once we resolve SPARK-19263. cc [~imranr] [~markhamstra] [~jinxing6...@126.com] -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18967) Locality preferences should be used when scheduling even when delay scheduling is turned off
[ https://issues.apache.org/jira/browse/SPARK-18967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-18967. Resolution: Fixed Fix Version/s: 2.2 > Locality preferences should be used when scheduling even when delay > scheduling is turned off > > > Key: SPARK-18967 > URL: https://issues.apache.org/jira/browse/SPARK-18967 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Imran Rashid >Assignee: Imran Rashid > Fix For: 2.2 > > > If you turn delay scheduling off by setting {{spark.locality.wait=0}}, you > effectively turn off the use the of locality preferences when there is a bulk > scheduling event. {{TaskSchedulerImpl}} will use resources based on whatever > random order it decides to shuffle them, rather than taking advantage of the > most local options. > This happens because {{TaskSchedulerImpl}} offers resources to a > {{TaskSetManager}} one at a time, each time subject to a maxLocality > constraint. However, that constraint doesn't move through all possible > locality levels -- it uses [{{tsm.myLocalityLevels}} > |https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L360]. > And {{tsm.myLocalityLevels}} [skips locality levels completely if the wait > == 0 | > https://github.com/apache/spark/blob/1a64388973711b4e567f25fa33d752066a018b49/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L953]. > So with delay scheduling off, {{TaskSchedulerImpl}} immediately jumps to > giving tsms the offers with {{maxLocality = ANY}}. > *WORKAROUND*: instead of setting {{spark.locality.wait=0}}, use > {{spark.locality.wait=1ms}}. The one downside of this is if you have tasks > that actually take less than 1ms. You could even run into SPARK-18886. But > that is a relatively unlikely scenario for real workloads. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19398) Log in TaskSetManager is not correct
[ https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19398: --- Fix Version/s: 2.2 > Log in TaskSetManager is not correct > > > Key: SPARK-19398 > URL: https://issues.apache.org/jira/browse/SPARK-19398 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing >Priority: Trivial > Fix For: 2.2 > > > Log below is misleading: > {code:title="TaskSetManager.scala"} > if (successful(index)) { > logInfo( > s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " + > "but another instance of the task has already succeeded, " + > "so not re-queuing the task to be re-executed.") > } > {code} > If fetch failed, the task is marked as *successful* in *TaskSetManager:: > handleFailedTask*. Then log above will be printed. The *successful* just > means task will not be scheduled any longer, not a real success. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19398) Log in TaskSetManager is not correct
[ https://issues.apache.org/jira/browse/SPARK-19398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19398. Resolution: Fixed Assignee: jin xing > Log in TaskSetManager is not correct > > > Key: SPARK-19398 > URL: https://issues.apache.org/jira/browse/SPARK-19398 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: jin xing >Assignee: jin xing >Priority: Trivial > > Log below is misleading: > {code:title="TaskSetManager.scala"} > if (successful(index)) { > logInfo( > s"Task ${info.id} in stage ${taskSet.id} (TID $tid) failed, " + > "but another instance of the task has already succeeded, " + > "so not re-queuing the task to be re-executed.") > } > {code} > If fetch failed, the task is marked as *successful* in *TaskSetManager:: > handleFailedTask*. Then log above will be printed. The *successful* just > means task will not be scheduled any longer, not a real success. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in a few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15852085#comment-15852085 ] Kay Ousterhout commented on SPARK-19326: I see that makes sense; thanks for the additional explanation. [~andrewor14] did you think about this issue when implementing dynamic allocation originally? I noticed there'a a [comment saying that speculation is not considered for simplicity](https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L579), but it does seem like this functionality can prevent speculation from occurring. > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If the all running executors reside over a > set of slow / bad hosts, they will keep the job running for long time > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots. Since the [speculated copies of tasks should run on different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > and not the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculation task attempts and thinks that all the resource demands are well > taken care of. ([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variation in the job completion times and more importantly SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing the bad hosts or reason for slowness > doesn't scale. > Here is a tiny repro. Note that you need to launch this with (Mesos or YARN > or standalone deploy mode) along with `--conf spark.speculation=true --conf > spark.executor.cores=4 --conf spark.dynamicAllocation.maxExecutors=100` > {code} > val n = 100 > val someRDD = sc.parallelize(1 to n, n) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 1) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
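The interaction the comment and the linked code point at can be summarised with a small, hypothetical version of the executor-target computation (the real ExecutorAllocationManager logic is more involved, but the key point is the same): the target is derived from pending plus running non-speculative tasks, so a pending speculative attempt never raises it.
{code:scala}
// Hypothetical simplification of the executor-target calculation referenced above.
def maxExecutorsNeeded(pendingTasks: Int, runningTasks: Int, tasksPerExecutor: Int): Int =
  (pendingTasks + runningTasks + tasksPerExecutor - 1) / tasksPerExecutor

// With the repro: 100 running tasks, 4 cores per executor -> target of 25 executors.
// A speculatable copy of the stuck task is not counted as pending, so the target stays at 25,
// no new executor is requested, and the copy never finds a free slot on a *different* host.
val target = maxExecutorsNeeded(pendingTasks = 0, runningTasks = 100, tasksPerExecutor = 4)
{code}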
[jira] [Commented] (SPARK-19326) Speculated task attempts do not get launched in few scenarios
[ https://issues.apache.org/jira/browse/SPARK-19326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850548#comment-15850548 ] Kay Ousterhout commented on SPARK-19326: What is the bad behavior that occurs with your example code? Is the problem that only one executor is requested, so no speculation can occur because there's no different node to run tasks on? > Speculated task attempts do not get launched in few scenarios > - > > Key: SPARK-19326 > URL: https://issues.apache.org/jira/browse/SPARK-19326 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.0.2, 2.1.0 >Reporter: Tejas Patil > > Speculated copies of tasks do not get launched in some cases. > Examples: > - All the running executors have no CPU slots left to accommodate a > speculated copy of the task(s). If all the running executors reside on a > set of slow / bad hosts, they will keep the job running for a long time. > - `spark.task.cpus` > 1 and the running executor has not filled up all its > CPU slots; the free slots cannot be used because the [speculated copies of tasks must run on a different > host|https://github.com/apache/spark/blob/2e139eed3194c7b8814ff6cf007d4e8a874c1e4d/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L283] > than the host where the first copy was launched. > In both these cases, `ExecutorAllocationManager` does not know about pending > speculative task attempts and thinks that all the resource demands are well > taken care of. ([relevant > code|https://github.com/apache/spark/blob/6ee28423ad1b2e6089b82af64a31d77d3552bb38/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L265]) > This adds variance to job completion times and, more importantly, causes SLA > misses :( In prod, with a large number of jobs, I see this happening more > often than one would think. Chasing down the bad hosts or the reason for slowness > doesn't scale. > Here is a tiny repro. Note that you need to launch this in Mesos, YARN, > or standalone deploy mode, along with `spark.speculation=true` > {code} > val someRDD = sc.parallelize(1 to 8, 8) > someRDD.mapPartitionsWithIndex( (index: Int, it: Iterator[Int]) => { > if (index == 7) { > Thread.sleep(Long.MaxValue) // fake long running task(s) > } > it.toList.map(x => index + ", " + x).iterator > }).collect > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-18890: --- Issue Type: Improvement (was: Bug) > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Improvement > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (each containing 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). > I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18890) Do all task serialization in CoarseGrainedExecutorBackend thread (rather than TaskSchedulerImpl)
[ https://issues.apache.org/jira/browse/SPARK-18890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15805495#comment-15805495 ] Kay Ousterhout commented on SPARK-18890: I just opened SPARK-19108 for the broadcast issue. In the meantime, after thinking about this more (and also based on your comments on the associated PRs, Imran) I think we should go ahead and merge this change to consolidate the serialization in one place. If nothing else, that change makes the code more readable, and I suspect it will make it easier to implement further optimizations to the serialization in the future. > Do all task serialization in CoarseGrainedExecutorBackend thread (rather than > TaskSchedulerImpl) > > > Key: SPARK-18890 > URL: https://issues.apache.org/jira/browse/SPARK-18890 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Priority: Minor > > As part of benchmarking this change: > https://github.com/apache/spark/pull/15505 and alternatives, [~shivaram] and > I found that moving task serialization from TaskSetManager (which happens as > part of the TaskSchedulerImpl's thread) to CoarseGrainedSchedulerBackend leads > to approximately a 10% reduction in job runtime for a job that counted 10,000 > partitions (each containing 1 int) using 20 machines. Similar performance > improvements were reported in the pull request linked above. This would > appear to be because the TaskSchedulerImpl thread is the bottleneck, so > moving serialization to CGSB reduces runtime. This change may *not* improve > runtime (and could potentially worsen runtime) in scenarios where the CGSB > thread is the bottleneck (e.g., if tasks are very large, so calling launch to > send the tasks to the executor blocks on the network). > One benefit of implementing this change is that it makes it easier to > parallelize the serialization of tasks (different tasks could be serialized > by different threads). Another benefit is that all of the serialization > occurs in the same place (currently, the Task is serialized in > TaskSetManager, and the TaskDescription is serialized in CGSB). > I'm not totally convinced we should fix this because it seems like there are > better ways of reducing the serialization time (e.g., by re-using a single > serialized object with the Task/jars/files and broadcasting it for each > stage) but I wanted to open this JIRA to document the discussion. > cc [~witgo] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
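As a rough illustration of what "consolidating the serialization in one place" could look like, the scheduler backend's launch path would serialize the complete task description itself rather than receiving a pre-serialized Task from TaskSetManager. The helper and field names below (e.g. `TaskDescription.encode`, `executorDataMap`) are assumptions made for the sketch, not necessarily the merged patch:
{code}
// Sketch inside CoarseGrainedSchedulerBackend (illustrative only).
private def launchTasks(tasks: Seq[Seq[TaskDescription]]): Unit = {
  for (task <- tasks.flatten) {
    // All serialization now happens here, on the backend's thread, rather than
    // partly in TaskSetManager (Task) and partly here (TaskDescription).
    val serializedTask = TaskDescription.encode(task)  // assumed helper
    val executorData = executorDataMap(task.executorId)
    executorData.freeCores -= scheduler.CPUS_PER_TASK
    executorData.executorEndpoint.send(
      LaunchTask(new SerializableBuffer(serializedTask)))
  }
}
{code}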
[jira] [Created] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)
Kay Ousterhout created SPARK-19108: -- Summary: Broadcast all shared parts of tasks (to reduce task serialization time) Key: SPARK-19108 URL: https://issues.apache.org/jira/browse/SPARK-19108 Project: Spark Issue Type: Improvement Components: Scheduler Reporter: Kay Ousterhout Expand the amount of information that's broadcasted for tasks, to avoid serializing data per-task that should only be sent to each executor once for the entire stage. Conceptually, this means we'd have new classes specifically for sending the minimal necessary data to the executor, like: {code} /** * metadata about the taskset needed by the executor for all tasks in this taskset. Subset of the * full data kept on the driver to make it faster to serialize and send to executors. */ class ExecutorTaskSetMeta( val stageId: Int, val stageAttemptId: Int, val properties: Properties, val addedFiles: Map[String, String], val addedJars: Map[String, String] // maybe task metrics here? ) class ExecutorTaskData( val partitionId: Int, val attemptNumber: Int, val taskId: Long, val taskBinary: Broadcast[Array[Byte]], val taskSetMeta: Broadcast[ExecutorTaskSetMeta] ) {code} Then all the info you'd need to send to the executors would be a serialized version of ExecutorTaskData. Furthermore, given the simplicity of that class, you could serialize it manually, and then for each task just modify the first two ints & one long directly in the byte buffer. (You could do the same serialization trick even if ExecutorTaskSetMeta were not a broadcast, but broadcasting keeps the messages small as well.) There are a bunch of details I'm skipping here: you'd also need to do some special handling for the TaskMetrics; the way tasks get started in the executor would change; you'd also need to refactor {{Task}} to let it get reconstructed from this information (or add more to ExecutorTaskSetMeta); and probably other details I'm overlooking now. (this is copied from SPARK-18890 and [~imranr]'s comment there; cc [~shivaram]) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
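To illustrate the manual-serialization trick mentioned above (patching the per-task fields directly in a pre-serialized buffer), here is a small sketch. The fixed offsets and field layout are assumptions made for the example, not an existing Spark wire format:
{code}
import java.nio.ByteBuffer

// Serialize one template message for the stage, then for each task overwrite
// only the per-task fields (two ints and one long) at known offsets, instead
// of re-running full serialization for every task.
def perTaskMessage(template: Array[Byte], partitionId: Int, attemptNumber: Int,
    taskId: Long): ByteBuffer = {
  val buf = ByteBuffer.wrap(template.clone())
  buf.putInt(0, partitionId)     // assumed offset 0: partitionId
  buf.putInt(4, attemptNumber)   // assumed offset 4: attemptNumber
  buf.putLong(8, taskId)         // assumed offset 8: taskId
  buf
}
{code}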
[jira] [Closed] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout closed SPARK-3937. - > Unsafe memory access inside of Snappy library > - > > Key: SPARK-3937 > URL: https://issues.apache.org/jira/browse/SPARK-3937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: Patrick Wendell > > This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't > have much information about this other than the stack trace. However, it was > concerning enough I figured I should post it. > {code} > java.lang.InternalError: a fault occurred in a recent unsafe memory access > operation in compiled Java code > org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) > org.xerial.snappy.Snappy.uncompress(Snappy.java:480) > > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) > > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) > > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > > java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) > > java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > scala.collection.Iterator$class.foreach(Iterator.scala:727) > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > scala.collection.AbstractIterator.to(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > 
org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > org.apache.spark.scheduler.Task.run(Task.scala:56) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-m
[jira] [Resolved] (SPARK-3937) Unsafe memory access inside of Snappy library
[ https://issues.apache.org/jira/browse/SPARK-3937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-3937. --- Resolution: Won't Fix Closing this due to lack of activity / reports of issues on recent versions of Spark > Unsafe memory access inside of Snappy library > - > > Key: SPARK-3937 > URL: https://issues.apache.org/jira/browse/SPARK-3937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0, 1.3.0 >Reporter: Patrick Wendell > > This was observed on master between Spark 1.1 and 1.2. Unfortunately I don't > have much information about this other than the stack trace. However, it was > concerning enough I figured I should post it. > {code} > java.lang.InternalError: a fault occurred in a recent unsafe memory access > operation in compiled Java code > org.xerial.snappy.SnappyNative.rawUncompress(Native Method) > org.xerial.snappy.Snappy.rawUncompress(Snappy.java:444) > org.xerial.snappy.Snappy.uncompress(Snappy.java:480) > > org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:355) > > org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) > org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) > > java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2310) > > java.io.ObjectInputStream$BlockDataInputStream.read(ObjectInputStream.java:2712) > > java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2742) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1687) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344) > > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) > > java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) > java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) > java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) > > org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62) > > org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) > org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) > scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:350) > > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) > scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308) > scala.collection.Iterator$class.foreach(Iterator.scala:727) > scala.collection.AbstractIterator.foreach(Iterator.scala:1157) > > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > scala.collection.AbstractIterator.to(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157) > > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > 
scala.collection.AbstractIterator.toArray(Iterator.scala:1157) > > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > org.apache.spark.sql.execution.Limit$$anonfun$4.apply(basicOperators.scala:140) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > > org.apache.spark.SparkContext$$anonfun$runJob$3.apply(SparkContext.scala:1118) > org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > org.apache.spark.scheduler.Task.run(Task.scala:56) > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182) > > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) -
[jira] [Resolved] (SPARK-14958) Failed task hangs if error is encountered when getting task result
[ https://issues.apache.org/jira/browse/SPARK-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-14958. Resolution: Fixed Fix Version/s: 2.2.0 > Failed task hangs if error is encountered when getting task result > -- > > Key: SPARK-14958 > URL: https://issues.apache.org/jira/browse/SPARK-14958 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Rui Li >Assignee: Rui Li > Fix For: 2.2.0 > > > In {{TaskResultGetter}}, if we get an error when deserializing > {{TaskEndReason}}, the TaskScheduler won't have a chance to handle the failed > task and the task just hangs. > {code} > def enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: > TaskState, > serializedData: ByteBuffer) { > var reason : TaskEndReason = UnknownReason > try { > getTaskResultExecutor.execute(new Runnable { > override def run(): Unit = Utils.logUncaughtExceptions { > val loader = Utils.getContextOrSparkClassLoader > try { > if (serializedData != null && serializedData.limit() > 0) { > reason = serializer.get().deserialize[TaskEndReason]( > serializedData, loader) > } > } catch { > case cnd: ClassNotFoundException => > // Log an error but keep going here -- the task failed, so not > catastrophic > // if we can't deserialize the reason. > logError( > "Could not deserialize TaskEndReason: ClassNotFound with > classloader " + loader) > case ex: Exception => {} > } > scheduler.handleFailedTask(taskSetManager, tid, taskState, reason) > } > }) > } catch { > case e: RejectedExecutionException if sparkEnv.isStopped => > // ignore it > } > } > {code} > In my specific case, I got a NoClassDefFoundError and the failed task hangs > forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
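One way to avoid the hang, sketched below using the names from the snippet above: make sure {{scheduler.handleFailedTask}} runs even when deserialization throws an Error such as NoClassDefFoundError rather than an Exception. This is an illustrative sketch, not necessarily the fix that was merged for 2.2.0:
{code}
override def run(): Unit = Utils.logUncaughtExceptions {
  val loader = Utils.getContextOrSparkClassLoader
  try {
    if (serializedData != null && serializedData.limit() > 0) {
      reason = serializer.get().deserialize[TaskEndReason](serializedData, loader)
    }
  } catch {
    case cnd: ClassNotFoundException =>
      logError("Could not deserialize TaskEndReason: ClassNotFound with classloader " + loader)
    case ex: Exception =>
      logError("Exception while deserializing TaskEndReason", ex)
  } finally {
    // Runs even if an Error (e.g. NoClassDefFoundError) propagates out of the
    // try block, so the scheduler is always told that the task failed.
    scheduler.handleFailedTask(taskSetManager, tid, taskState, reason)
  }
}
{code}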
[jira] [Updated] (SPARK-14958) Failed task hangs if error is encountered when getting task result
[ https://issues.apache.org/jira/browse/SPARK-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-14958: --- Affects Version/s: 1.6.0 2.0.0 2.1.0 > Failed task hangs if error is encountered when getting task result > -- > > Key: SPARK-14958 > URL: https://issues.apache.org/jira/browse/SPARK-14958 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.0, 2.0.0, 2.1.0 >Reporter: Rui Li >Assignee: Rui Li > Fix For: 2.2.0 > > > In {{TaskResultGetter}}, if we get an error when deserializing > {{TaskEndReason}}, the TaskScheduler won't have a chance to handle the failed > task and the task just hangs. > {code} > def enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: > TaskState, > serializedData: ByteBuffer) { > var reason : TaskEndReason = UnknownReason > try { > getTaskResultExecutor.execute(new Runnable { > override def run(): Unit = Utils.logUncaughtExceptions { > val loader = Utils.getContextOrSparkClassLoader > try { > if (serializedData != null && serializedData.limit() > 0) { > reason = serializer.get().deserialize[TaskEndReason]( > serializedData, loader) > } > } catch { > case cnd: ClassNotFoundException => > // Log an error but keep going here -- the task failed, so not > catastrophic > // if we can't deserialize the reason. > logError( > "Could not deserialize TaskEndReason: ClassNotFound with > classloader " + loader) > case ex: Exception => {} > } > scheduler.handleFailedTask(taskSetManager, tid, taskState, reason) > } > }) > } catch { > case e: RejectedExecutionException if sparkEnv.isStopped => > // ignore it > } > } > {code} > In my specific case, I got a NoClassDefFoundError and the failed task hangs > forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14958) Failed task hangs if error is encountered when getting task result
[ https://issues.apache.org/jira/browse/SPARK-14958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-14958: --- Assignee: Rui Li > Failed task hangs if error is encountered when getting task result > -- > > Key: SPARK-14958 > URL: https://issues.apache.org/jira/browse/SPARK-14958 > Project: Spark > Issue Type: Bug >Reporter: Rui Li >Assignee: Rui Li > > In {{TaskResultGetter}}, if we get an error when deserializing > {{TaskEndReason}}, the TaskScheduler won't have a chance to handle the failed > task and the task just hangs. > {code} > def enqueueFailedTask(taskSetManager: TaskSetManager, tid: Long, taskState: > TaskState, > serializedData: ByteBuffer) { > var reason : TaskEndReason = UnknownReason > try { > getTaskResultExecutor.execute(new Runnable { > override def run(): Unit = Utils.logUncaughtExceptions { > val loader = Utils.getContextOrSparkClassLoader > try { > if (serializedData != null && serializedData.limit() > 0) { > reason = serializer.get().deserialize[TaskEndReason]( > serializedData, loader) > } > } catch { > case cnd: ClassNotFoundException => > // Log an error but keep going here -- the task failed, so not > catastrophic > // if we can't deserialize the reason. > logError( > "Could not deserialize TaskEndReason: ClassNotFound with > classloader " + loader) > case ex: Exception => {} > } > scheduler.handleFailedTask(taskSetManager, tid, taskState, reason) > } > }) > } catch { > case e: RejectedExecutionException if sparkEnv.isStopped => > // ignore it > } > } > {code} > In my specific case, I got a NoClassDefFoundError and the failed task hangs > forever. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19062) Utils.writeByteBuffer should not modify buffer position
[ https://issues.apache.org/jira/browse/SPARK-19062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19062: --- Affects Version/s: (was: 1.2.1) 2.1.0 > Utils.writeByteBuffer should not modify buffer position > --- > > Key: SPARK-19062 > URL: https://issues.apache.org/jira/browse/SPARK-19062 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 2.2.0 > > > [~mridulm80] pointed out that Utils.writeByteBuffer may change the position > of the underlying byte buffer, which could potentially lead to subtle bugs > for callers of that function. We should change this so Utils.writeByteBuffer > doesn't change the buffer position. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19062) Utils.writeByteBuffer should not modify buffer position
[ https://issues.apache.org/jira/browse/SPARK-19062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19062. Resolution: Fixed Fix Version/s: 2.2.0 > Utils.writeByteBuffer should not modify buffer position > --- > > Key: SPARK-19062 > URL: https://issues.apache.org/jira/browse/SPARK-19062 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0 >Reporter: Kay Ousterhout >Assignee: Kay Ousterhout >Priority: Minor > Fix For: 2.2.0 > > > [~mridulm80] pointed out that Utils.writeByteBuffer may change the position > of the underlying byte buffer, which could potentially lead to subtle bugs > for callers of that function. We should change this so Utils.writeByteBuffer > doesn't change the buffer position. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
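A minimal sketch of the behavior the ticket asks for (an assumed implementation, not necessarily the merged patch): write the buffer's contents without disturbing the caller's position, for example by operating on a duplicate.
{code}
import java.io.DataOutput
import java.nio.ByteBuffer

def writeByteBuffer(bb: ByteBuffer, out: DataOutput): Unit = {
  if (bb.hasArray) {
    // Backed by an array: write directly without touching the buffer's position.
    out.write(bb.array(), bb.arrayOffset() + bb.position(), bb.remaining())
  } else {
    // Direct buffer: copy via a duplicate so only the duplicate's position moves.
    val copy = bb.duplicate()
    val bytes = new Array[Byte](copy.remaining())
    copy.get(bytes)
    out.write(bytes)
  }
}
{code}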
[jira] [Updated] (SPARK-19072) Catalyst's IN always returns false for infinity
[ https://issues.apache.org/jira/browse/SPARK-19072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19072: --- Description: This bug was caused by the fix for SPARK-18999 (https://github.com/apache/spark/pull/16402) This can be reproduced by adding the following test to PredicateSuite.scala (which will consistently fail): val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType) checkEvaluation(In(value, List(value)), true) This bug is causing org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN to fail approximately 10% of the time (it fails anytime the value is Infinity or -Infinity and the correct answer is True -- e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console). was: This can be reproduced by adding the following test to PredicateSuite.scala (which will consistently fail): val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType) checkEvaluation(In(value, List(value)), true) This bug is causing org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN to fail approximately 10% of the time (it fails anytime the value is Infinity or -Infinity and the correct answer is True -- e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console). > Catalyst's IN always returns false for infinity > --- > > Key: SPARK-19072 > URL: https://issues.apache.org/jira/browse/SPARK-19072 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Reporter: Kay Ousterhout > > This bug was caused by the fix for SPARK-18999 > (https://github.com/apache/spark/pull/16402) > This can be reproduced by adding the following test to PredicateSuite.scala > (which will consistently fail): > val value = NonFoldableLiteral(Double.PositiveInfinity, DoubleType) > checkEvaluation(In(value, List(value)), true) > This bug is causing > org.apache.spark.sql.catalyst.expressions.PredicateSuite.IN to fail > approximately 10% of the time (it fails anytime the value is Infinity or > -Infinity and the correct answer is True -- e.g., > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70826/testReport/org.apache.spark.sql.catalyst.expressions/PredicateSuite/IN/, > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70830/console). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org