GitHub user mxm opened a pull request: https://github.com/apache/flink/pull/640
[scheduling] implement backtracking of intermediate results For batch programs, we currently schedule all tasks which are sources and let them kick off the execution of the connected tasks. This approach bears some problems when executing large dataflows with many branches. With backtracking, we traverse the execution graph output-centrically (from the sinks) in a depth-first manner. This enables us to use resources differently. In the course of backtracking, only tasks will be executed that are required to supply inputs to the current task. When a job is newly submitted, this means that the backtracking will reach the sources. When the job has been previously executed and intermediate results are available, old ResultPartitions to resume from can be requested while backtracking. Backtracking is disabled by default. It can be enabled by setting the ScheduleMode in JobGraph to BACKTRACKING. CHANGELOG - new scheduling mode: backtracking - backtracks from the sinks of an ExecutionGraph - checks the availability of IntermediatePartitionResults - marks ExecutionVertex to be scheduled - caches ResultPartitions and reloads them - resumes from intermediate results - test for general behavior of backtracking (BacktrackingTest) - test for resuming from an intermediate result (ResumeITCase) - test for releasing of cached ResultPartitions (ResultPartitionManagerTest) - allow multiple consumers per blocking intermediate result (batch) You can merge this pull request into a Git repository by running: $ git pull https://github.com/mxm/flink backtracking-scheduling-dev Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/640.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #640 ---- commit a580c8973ccb3579c79ebd0dc860c1f754eb87fd Author: Maximilian Michels <m...@apache.org> Date: 2015-04-29T10:34:31Z [FLINK-1843] remove SoftReferences on archived ExecutionGraphs The previously introduced SoftReferences to store archived ExecutionGraphs cleared old graphs in a non-transparent order. commit b9194c01bdbce25af2bf91c617af2eab94f3353c Author: Maximilian Michels <m...@apache.org> Date: 2015-04-29T13:59:11Z [JobManager] default to single-user mode, save the last ExecutionGraph for resuming commit 57c12c36b82e866ba0b05b7a82bf0c42a0e8b042 Author: Maximilian Michels <m...@apache.org> Date: 2015-04-13T17:32:40Z [scheduling] implement backtracking of intermediate results For batch programs, we currently schedule all tasks which are sources and let them kick off the execution of the connected tasks. This approach bears some problems when executing large dataflows with many branches. With backtracking, we traverse the execution graph output-centrically (from the sinks) in a depth-first manner. This enables us to use resources differently. In the course of backtracking, only tasks will be executed that are required to supply inputs to the current task. When a job is newly submitted, this means that the backtracking will reach the sources. When the job has been previously executed and intermediate results are available, old ResultPartitions to resume from can be requested while backtracking. Backtracking is disabled by default. It can be enabled by setting the ScheduleMode in JobGraph to BACKTRACKING. CHANGELOG - new scheduling mode: backtracking - backtracks from the sinks of an ExecutionGraph - checks the availability of IntermediatePartitionResults - marks ExecutionVertex to be scheduled - caches ResultPartitions and reloads them - resumes from intermediate results - test for general behavior of backtracking (BacktrackingTest) - test for resuming from an intermediate result (ResumeITCase) - test for releasing of cached ResultPartitions (ResultPartitionManagerTest) - allow multiple consumers per blocking intermediate result (batch) ---- --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---