GitHub user andrewor14 opened a pull request: https://github.com/apache/spark/pull/204
[SPARK-1276] Add a HistoryServer to render persisted UI

Currently, a persisted UI can only be rendered through the standalone Master. This greatly limits the usefulness of the new feature that logs the details of a Spark application as events, since many people also run Spark on YARN or Mesos.

This PR introduces a new entity called the HistoryServer, which, given a log directory, keeps track of all completed applications independently of a Spark Master. Unlike the Master, the HistoryServer need not be running while the application is still running. It is relatively lightweight in that it only maintains static information about applications after the fact.

To quickly test it out, generate event logs with ```spark.eventLog.enabled=true``` and run ```sbin/start-history-server.sh <log-dir-path>```. Your HistoryServer awaits on port 18080.

A few other changes are included in this PR. The WebUI interface has been refactored, since it was beginning to accumulate a lot of duplicate code as more functionality was added to it. Two new SparkListenerEvents have been introduced (SparkListenerApplicationStart/End) to keep track of the application name and start/finish times. This PR also clarifies the semantics of the ReplayListenerBus introduced in #42.

A potential TODO for the future (not part of this PR) is to render applications that are still logging events, in addition to completed applications. This is useful if an application fails, in which case the current HistoryServer does not render the associated UI unless the user manually signals application completion. Processing the event logs in this case becomes significantly more complicated, however, because we must deal with multiple levels of streams that may each have arbitrary behavior if we want to avoid processing the entire file over and over again.

Comments and feedback are most welcome.
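As a rough illustration of what the two new application events make possible, the following Python sketch folds a replayed event stream into a name and start/finish times. It is not the Scala code in this PR: the classes `ApplicationStart`, `ApplicationEnd`, and `ApplicationSummary`, their field names, and the `summarize` helper are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ApplicationStart:
    """Hypothetical stand-in for SparkListenerApplicationStart."""
    app_name: str
    time_ms: int


@dataclass
class ApplicationEnd:
    """Hypothetical stand-in for SparkListenerApplicationEnd."""
    time_ms: int


@dataclass
class ApplicationSummary:
    """Static, after-the-fact information about one application."""
    name: Optional[str] = None
    start_ms: Optional[int] = None
    end_ms: Optional[int] = None

    @property
    def duration_ms(self) -> Optional[int]:
        # Only defined once both a start and an end event have been seen.
        if self.start_ms is None or self.end_ms is None:
            return None
        return self.end_ms - self.start_ms


def summarize(events: Iterable[object]) -> ApplicationSummary:
    """Replay events in order, keeping only the application-level facts."""
    summary = ApplicationSummary()
    for event in events:
        if isinstance(event, ApplicationStart):
            summary.name = event.app_name
            summary.start_ms = event.time_ms
        elif isinstance(event, ApplicationEnd):
            summary.end_ms = event.time_ms
        # Every other event type is ignored by this sketch.
    return summary
```

An application whose log contains a start event but no end event simply yields a summary with `end_ms` left as `None`.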
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #204

----
commit c086bd5c6837a98d3c989c43f2b75aeaa0e5eff0
Author: Andrew Or <andrewo...@gmail.com>
Date:   2014-03-20T19:43:16Z

    Add HistoryServer and scripts ++ Refactor WebUI interface

    HistoryServer can be launched with ./sbin/start-history-server.sh <log-dir>
    and stopped with ./sbin/stop-history-server.sh. This commit also involves
    refactoring all the UIs to avoid duplicate code.

commit 8aac16355329809b11c76430fa8737d328f2e962
Author: Andrew Or <andrewo...@gmail.com>
Date:   2014-03-20T21:34:34Z

    Add basic application table

commit 758441890dc86c8ed069e6c684b21528038f2ff7
Author: Andrew Or <andrewo...@gmail.com>
Date:   2014-03-21T04:59:34Z

    Report application start/end times to HistoryServer

    This involves adding application start and end events. This also allows us
    to record the actual app name instead of simply using the name of the
    directory.

commit 60bc6d57577742e861d62c183ec56d9893e3ea6a
Author: Andrew Or <andrewo...@gmail.com>
Date:   2014-03-22T01:17:43Z

    First complete implementation of HistoryServer (only for finished apps)

    This involves a change in Spark's event log format. All event logs are now
    prefixed with EVENT_LOG_. If compression is used, the logger creates a
    special empty file prefixed with COMPRESSION_CODEC_ that indicates which
    codec is used. After the application finishes, the logger logs a special
    empty file named APPLICATION_COMPLETE. The ReplayListenerBus is now
    responsible for parsing all of the above file formats.

    In this commit, we establish a one-to-one mapping between ReplayListenerBus
    and event logging applications. The semantics of the ReplayListenerBus are
    further clarified (e.g. replay is not allowed before starting, and can only
    be called once).

    This commit also adds a control mechanism for the frequency at which
    HistoryServer accesses the disk to check for log updates. This enforces a
    minimum interval of N seconds between two checks, where N is arbitrarily
    chosen to be 5.

commit 5dbfbb47826ea2edbf8cf2100228bddb5be473f8
Author: Andrew Or <andrewo...@gmail.com>
Date:   2014-03-22T01:54:28Z

    Merge branch 'master' of github.com:apache/spark

    Conflicts:
        core/src/main/scala/org/apache/spark/deploy/DeployWebUI.scala
        core/src/main/scala/org/apache/spark/deploy/WebUI.scala
        core/src/main/scala/org/apache/spark/deploy/master/Master.scala
        core/src/main/scala/org/apache/spark/ui/WebUI.scala

----
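The log directory layout and the disk-check throttle described in commit 60bc6d5 can be sketched roughly as follows. This is an illustrative Python sketch, not the Scala code in the PR. The three file-name markers and the 5-second interval come from the commit message; everything else is an assumption made for the example, including the names `LogDirInfo`, `classify`, and `ThrottledChecker`, and the idea that the codec name is whatever follows the COMPRESSION_CODEC_ prefix.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional

# File-name markers taken from the commit message.
EVENT_LOG_PREFIX = "EVENT_LOG_"
COMPRESSION_CODEC_PREFIX = "COMPRESSION_CODEC_"
APPLICATION_COMPLETE = "APPLICATION_COMPLETE"


@dataclass
class LogDirInfo:
    """What one application's log directory tells us (hypothetical type)."""
    event_logs: List[str] = field(default_factory=list)
    codec: Optional[str] = None
    complete: bool = False


def classify(file_names: Iterable[str]) -> LogDirInfo:
    """Sort the file names of one log directory into the three roles."""
    info = LogDirInfo()
    for name in sorted(file_names):
        if name.startswith(EVENT_LOG_PREFIX):
            info.event_logs.append(name)
        elif name.startswith(COMPRESSION_CODEC_PREFIX):
            # Assumption: the suffix after the prefix names the codec.
            info.codec = name[len(COMPRESSION_CODEC_PREFIX):]
        elif name == APPLICATION_COMPLETE:
            info.complete = True
        # Any other file is ignored by this sketch.
    return info


class ThrottledChecker:
    """Run `check` at most once per `min_interval_s` seconds."""

    def __init__(self, check: Callable[[], None],
                 min_interval_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._min_interval_s = min_interval_s
        self._clock = clock  # injectable so the throttle can be tested
        self._last: Optional[float] = None

    def maybe_check(self) -> bool:
        """Return True if the check ran, False if it was skipped."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval_s:
            return False
        self._last = now
        self._check()
        return True
```

In this sketch a directory is only treated as a finished application when `classify(...).complete` is true, which mirrors the "only for finished apps" restriction. The throttle skips any check requested less than 5 seconds after the previous one rather than queueing it.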