GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/42
[WIP] [SPARK-1132] Persisting Web UI through refactoring the SparkListener interface
The fleeting nature of the Spark Web UI has long been a problem reported by
many users: the existing Web UI disappears as soon as the associated
application terminates. This is because SparkUI is tightly coupled with
SparkContext and cannot be instantiated independently of it. To solve this,
some state must be saved to persistent storage while the application is still
running.
The approach taken by this PR involves persisting the UI state through
SparkListenerEvents. This requires a major refactor of the SparkListener
interface, because existing events (1) maintain deep references, making
de/serialization difficult, and (2) do not encode all the information
displayed on the UI.
The new architecture is as follows: the SparkUI registers a central
gateway listener with the SparkContext to receive events. For each event this gateway
listener receives, it logs the event to a file and relays it to all of the
child listeners (e.g. ExecutorsListener) used by the UI. Each of these
child listeners then constructs the appropriate information from these events
and supplies it to its parent UI (e.g. ExecutorsUI), which renders the
associated page(s) on demand. Then, after the SparkContext has stopped, the
SparkUI can be revived by replaying all the logged events to the child
listeners through the gateway listener.
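Conceptually, the gateway works along the following lines. This is a rough, self-contained sketch with illustrative stand-ins (UIEvent, UIListener, GatewayListener), not the classes introduced by this patch:

```scala
import java.io.{FileWriter, PrintWriter}
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

// Simplified stand-ins for SparkListenerEvent / SparkListener (illustration only).
sealed trait UIEvent { def toJson: String }
case class TaskEndEvent(taskId: Long, durationMs: Long) extends UIEvent {
  def toJson: String = s"""{"Event":"TaskEnd","Task ID":$taskId,"Duration":$durationMs}"""
}

trait UIListener { def onEvent(event: UIEvent): Unit }

// Receives every event once, persists it, and relays it to the UI's child listeners
// (ExecutorsListener, JobProgressListener, ...).
class GatewayListener(logPath: String) extends UIListener {
  private val children = new ArrayBuffer[UIListener]
  private val out = new PrintWriter(new FileWriter(logPath, true))

  def addChild(listener: UIListener): Unit = children += listener

  override def onEvent(event: UIEvent): Unit = {
    out.println(event.toJson)           // log the event for later replay
    out.flush()
    children.foreach(_.onEvent(event))  // relay to the listeners backing each UI page
  }

  // After the SparkContext has stopped, replaying the log through the same children
  // revives the UI.
  def replay(parse: String => UIEvent): Unit =
    Source.fromFile(logPath).getLines().map(parse).foreach(e => children.foreach(_.onEvent(e)))
}
```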
This patch is WIP and additional features are still expected. The main TODOs at
this point include adding support for logging to HDFS, handling long-running
jobs (perhaps through checkpointing), and performance testing. To try this
patch out, run your Spark application to completion as usual, then run
```bin/spark-class org.apache.spark.ui.UIReloader /tmp/spark-<user name>```.
Your revived Spark Web UI awaits on port 14040.
More details can be found in the commit messages, comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf).
Comments and feedback are most welcome.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/42.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #42
----
commit 164489d6f176bdecfa9dabec2dfce5504d1ee8af
Author: Andrew Or <[email protected]>
Date: 2014-02-04T02:18:04Z
Relax assumptions on compressors and serializers when batching
This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
commit a531d2e347acdcecf2d0ab72cd4f965ab5e145d8
Author: Andrew Or <[email protected]>
Date: 2014-02-04T02:18:04Z
Relax assumptions on compressors and serializers when batching
This commit introduces an intermediate layer of an input stream on the batch level. This guards against interference from higher level streams (i.e. compression and deserialization streams), especially pre-fetching, without specifically targeting particular libraries (Kryo) and forcing shuffle spill compression to use LZF.
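The idea, roughly, is to give each spilled batch its own bounded stream so that a compression or deserialization stream wrapped around it cannot pre-fetch past the batch boundary. A minimal sketch, assuming batch byte offsets are recorded at write time (names are illustrative, not the patch's):

```scala
import java.io.{BufferedInputStream, FileInputStream, InputStream}

object SpillBatchReader {
  // Returns a stream confined to batch `index`; higher-level streams (compression,
  // deserialization) wrap this bounded stream rather than the raw spill file.
  def batchStream(file: java.io.File, batchOffsets: IndexedSeq[Long], index: Int): InputStream = {
    val start = batchOffsets(index)
    val end = batchOffsets(index + 1)
    val in = new FileInputStream(file)
    in.skip(start)
    new BufferedInputStream(boundedStream(in, end - start))
  }

  // Simple byte-count limiter; reads past the batch boundary report end-of-stream.
  private def boundedStream(in: InputStream, limit: Long): InputStream = new InputStream {
    private var remaining = limit
    override def read(): Int =
      if (remaining <= 0) -1
      else { remaining -= 1; in.read() }
  }
}
```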
commit 3df700509955f7074821e9aab1e74cb53c58b5a5
Author: Andrew Or <[email protected]>
Date: 2014-02-04T02:27:49Z
Merge branch 'master' of github.com:andrewor14/incubator-spark
commit 287ef44e593ad72f7434b759be3170d9ee2723d2
Author: Andrew Or <[email protected]>
Date: 2014-02-04T21:38:32Z
Avoid reading the entire batch into memory; also simplify streaming logic
Additionally, address formatting comments.
commit bd5a1d7350467ed3dc19c2de9b2c9f531f0e6aa3
Author: Andrew Or <[email protected]>
Date: 2014-02-04T21:44:24Z
Typo: phyiscal -> physical
commit 13920c918efe22e66a1760b14beceb17a61fd8cc
Author: Andrew Or <[email protected]>
Date: 2014-02-05T00:34:15Z
Update docs
commit 090544a87a0767effd0c835a53952f72fc8d24f0
Author: Andrew Or <[email protected]>
Date: 2014-02-05T18:58:23Z
Privatize methods
commit 3ddeb7ef89a0af2b685fb5d071aa0f71c975cc82
Author: Andrew Or <[email protected]>
Date: 2014-02-05T20:09:32Z
Also privatize fields
commit e3ae35f4fb1ce8e2d7398afdbabab3dbf4bb2ffe
Author: Andrew Or <[email protected]>
Date: 2014-02-11T00:15:15Z
Merge github.com:apache/incubator-spark
Conflicts:
core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala
commit 8e09306f6dd4ab421447d769572de58035d3d66a
Author: Andrew Or <[email protected]>
Date: 2014-02-12T01:48:16Z
Use JSON for ExecutorsUI
commit 10ed49dffe4a515bff42762cb025a3f64d9cd407
Author: Andrew Or <[email protected]>
Date: 2014-02-12T18:53:32Z
Merge github.com:apache/incubator-spark into persist-ui
commit dcbd312b1e4585445868dfb562f9c64ac2fc8cda
Author: Andrew Or <[email protected]>
Date: 2014-02-12T23:58:39Z
Add JSON Serializability for all SparkListenerEvent's
This also involves a clean-up in the way these events are structured. The existing way in which these events are defined maintains a lot of extraneous information. To avoid serializing the whole tree of RDD dependencies, for instance, this commit cherry-picks only the relevant fields. However, this means sacrificing JobLogger's functionality of tracing the entire RDD tree.
Additionally, this commit also involves minor formatting and naming clean-ups within the scope of the above changes.
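As an illustration only (the field names and JSON DSL below are assumptions, not the patch's exact schema), a stage-submitted event might be flattened to just the fields the UI needs rather than carrying the whole RDD lineage:

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

object EventJsonSketch {
  // Flat, UI-oriented summary of a stage; deep references such as the RDD dependency
  // tree are deliberately left out so the event is easy to de/serialize.
  case class StageSummary(stageId: Int, name: String, numTasks: Int)

  def stageSubmittedToJson(stage: StageSummary): String =
    compact(render(
      ("Event" -> "SparkListenerStageSubmitted") ~
      ("Stage ID" -> stage.stageId) ~
      ("Stage Name" -> stage.name) ~
      ("Number of Tasks" -> stage.numTasks)))
}
```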
commit bb222b9f7422cdf9e3a4c682bb271da1f75f4f75
Author: Andrew Or <[email protected]>
Date: 2014-02-13T04:35:09Z
ExecutorUI: render completely from JSON
Additionally, this commit fixes the bug in the local mode, where executor IDs of tasks do not match those of storage statuses (more detail in ExecutorsUI.scala).
This commit currently does not serialize the SparkListenerEvents yet, but instead serializes changes to each executor JSON. This is a big TODO in the upcoming commit.
commit bf0b2e9e92d760d49ba7b26aaa41b9e3aef2420f
Author: Andrew Or <[email protected]>
Date: 2014-02-14T03:12:53Z
ExecutorUI: Serialize events rather than arbitrary executor information
This involves adding a new SparkListenerStorageFetchEvent, and adding JSON serializability to all of the objects it depends on.
commit de8a1cdb833d80423aba629ba932b6f403ecd4ab
Author: Andrew Or <[email protected]>
Date: 2014-02-15T03:22:50Z
Serialize events both to and from JSON (rather than just to)
This requires every field of every event to be completely reconstructible from its JSON representation. This commit may contain incomplete state.
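A rough sketch of what "completely reconstructible from JSON" amounts to (json4s and the field names here are assumptions for illustration, not the actual event schema):

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

object EventRoundTripSketch {
  implicit val formats: Formats = DefaultFormats

  // Illustrative flat event: every field must survive the round trip so that a
  // revived UI sees exactly the same state as the live one did.
  case class TaskEndSummary(stageId: Int, taskId: Long, successful: Boolean)

  def toJson(e: TaskEndSummary): String = compact(render(Extraction.decompose(e)))
  def fromJson(json: String): TaskEndSummary = parse(json).extract[TaskEndSummary]
}
```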
commit 8a2ebe6ba37b2d5efe344aa3bea343cda1411212
Author: Andrew Or <[email protected]>
Date: 2014-02-15T06:01:21Z
Fix bugs for EnvironmentUI and ExecutorsUI
In particular, EnvironmentUI was not rendering until a job began, and ExecutorsUI reported an incorrect number (format) of total tasks.
commit c4cd48022b3a8dbf60f458196e21ba8c9cb3b88f
Author: Andrew Or <[email protected]>
Date: 2014-02-15T06:53:43Z
Also deserialize new events
This includes SparkListenerLoadEnvironment and SparkListenerStorageStatusFetch
commit d859efc34c9a5f07bae7eca7b4ab72fa19fb7e29
Author: Andrew Or <[email protected]>
Date: 2014-02-15T22:01:14Z
BlockManagerUI: Add JSON functionality
commit 8add36bb08126fbcd02d23c446dd3ec970f1f549
Author: Andrew Or <[email protected]>
Date: 2014-02-15T22:40:49Z
JobProgressUI: Add JSON functionality
In addition, refactor FileLogger to log in one directory per logger
commit b3976b0a2eb21b4a887d01fd16869a0f37c36f8b
Author: Andrew Or <[email protected]>
Date: 2014-02-16T06:52:43Z
Add functionality of reconstructing a persisted UI from SparkContext
With this commit, any reconstructed SparkUI resides on port 14040 onwards by default.
Logged events are posted separately from live events, so that the live SparkListeners are not affected.
This commit also fixes a few JSON de/serialization bugs.
commit 4dfcd224504f392302a49ac82280b294c381f381
Author: Andrew Or <[email protected]>
Date: 2014-02-17T19:14:20Z
Merge git://git.apache.org/incubator-spark into persist-ui
commit f3fc13b53725cdfeddcecb2068ab5a533566772f
Author: Andrew Or <[email protected]>
Date: 2014-02-17T21:22:01Z
General refactor
This includes reverting previous formatting and naming changes that are irrelevant to this patch.
commit 3fd584e30aaf6552179bf9e9b350b130fa92d0ad
Author: Andrew Or <[email protected]>
Date: 2014-02-18T02:01:12Z
Fix two major bugs
First, JobProgressListener uses HashSets of TaskInfo and StageInfo, and relies on the equality of these objects to remove from the corresponding HashSets correctly. This is not a luxury that deserialized StageInfo's and TaskInfo's have. Instead, when removing from these collections, we must match by the ID rather than the object itself.
Second, although SparkUI differentiates between persisted and live UI's, its child UI's and their corresponding listeners do not. Thus, each revived UI essentially duplicated all the logs that reconstructed it in the first place. Further, these zombie UI's continued to respond to live SparkListenerEvents. This has been fixed by requiring that revived UI's do not register their listeners with the current SparkContext.
With the former fix, there were major incompatibility issues with the existing way UI classes access and mutate these collections. Formatting improvements associated with smoothing out these inconsistencies are included as part of this commit.
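A minimal sketch of the ID-matching fix (simplified stand-in types, not the actual JobProgressListener code):

```scala
import scala.collection.mutable

object ListenerStateSketch {
  // Stand-in for TaskInfo: a deserialized copy is a different object, so removal must
  // match on the task ID rather than relying on object equality within the HashSet.
  class TaskRecord(val taskId: Long, val host: String)

  val activeTasks = mutable.HashSet[TaskRecord]()

  def removeTask(taskId: Long): Unit =
    activeTasks.find(_.taskId == taskId).foreach(activeTasks -= _)
}
```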
commit 5ac906d4dfd546c5d6b6e80540c8774f3985fecc
Author: Andrew Or <[email protected]>
Date: 2014-02-18T05:38:16Z
Mostly naming, formatting, and code style changes
commit 904c7294ac221a0cd9806af843219aaa8a847085
Author: Andrew Or <[email protected]>
Date: 2014-02-18T06:06:46Z
Fix another major bug
Previously, rendering the old, persisted UI continued to trigger load environment and storage status fetch events. These are now only triggered for the live UI.
A related TODO: under JobProgressUI, the total duration is inaccurate; right now it uses the time when the old UI is revived, rather than when it was live. This should be fixed.
commit 427301371117e9e7889f5df0f6bba51e5916e425
Author: Andrew Or <[email protected]>
Date: 2014-02-18T23:27:39Z
Add a gateway SparkListener to simplify event logging
Instead of having each SparkListener log an independent set of events, centralize event logging to avoid differentiating events across UI's and thus duplicating logged events. Also rename the "fromDisk" parameter to "live".
TODO: Storage page currently still relies on the previous SparkContext and is not rendering correctly.
commit 64d2ce1efee3aa5a8166c5fe108932b2279217fc
Author: Andrew Or <[email protected]>
Date: 2014-02-19T02:29:21Z
Fix BlockManagerUI bug by introducing new event
Previously, the storage information of persisted RDD's continued to rely on the old SparkContext, which is no longer accessible if the UI is rendered from disk. This fix solves it by introducing an event, SparkListenerGetRDDInfo, which captures this information.
Per discussion with Patrick, an alternative is to encapsulate this information within SparkListenerTaskEnd. This would bypass the need to create a new event, but would also require a non-trivial refactor of BlockManager / BlockStore.
commit 6814da0cf9af2a29810b6773463acee3b259c95f
Author: Andrew Or <[email protected]>
Date: 2014-02-19T18:36:01Z
Explicitly register each UI listener rather than through some magic
This (1) allows UISparkListener to be a simple trait and (2) is more intuitive, since it mirrors sc.addSparkListener(listener) for all other non-UI listeners.
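The explicit wiring, in terms of the gateway sketch given earlier in this description, might look roughly like this (illustration only, reusing the GatewayListener / UIListener stand-ins and a hypothetical log path):

```scala
object ExplicitRegistrationSketch {
  def wireUp(): GatewayListener = {
    val gateway = new GatewayListener("/tmp/spark-events.log")  // hypothetical path
    val executorsListener = new UIListener {
      override def onEvent(event: UIEvent): Unit = { /* update executors table state */ }
    }
    gateway.addChild(executorsListener)  // explicit registration, no reflection magic
    gateway
  }
}
```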
commit d646df6786737d67d5ca1dbf593740a02a600991
Author: Andrew Or <[email protected]>
Date: 2014-02-20T02:47:35Z
Completely decouple SparkUI from SparkContext
This involves storing additional fields, such as the scheduling mode and the app name, in the new event, SparkListenerApplicationStart, since these attributes are no longer accessible without a SparkContext. Further, environment information is refactored to be loaded on application start (rather than on job start).
Persisted Spark UI's can no longer be created from SparkContext. The new way of constructing them is through a standalone Scala program. org.apache.spark.ui.UIReloader is introduced as an example of how to do this.
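The application-start idea can be pictured as follows; this is a sketch with assumed field names, not the patch's SparkListenerApplicationStart:

```scala
object ApplicationStartSketch {
  // Attributes the UI previously read from the live SparkContext (app name, scheduling
  // mode) travel in the event itself, so a UI revived purely from logs can render them.
  case class ApplicationStart(appName: String, schedulingMode: String, startTime: Long)

  def onApplicationStart(appName: String, schedulingMode: String): ApplicationStart =
    ApplicationStart(appName, schedulingMode, System.currentTimeMillis())
}
```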
commit e9e1c6dede36788d3cefe3c65366f5a79be97a1d
Author: Andrew Or <[email protected]>
Date: 2014-02-21T07:51:08Z
Move all JSON de/serialization logic to JsonProtocol
This makes all classes involved appear less cluttered.
commit 70e7e7acf09d8efd2c7e459ee450c1db140b8f5a
Author: Andrew Or <[email protected]>
Date: 2014-02-22T02:56:26Z
Formatting changes
commit 6631c02a8791d0321f003bb339344445f4dd0cab
Author: Andrew Or <[email protected]>
Date: 2014-02-24T18:52:21Z
More formatting changes, this time mainly for Json DSL
commit bbe3501c63029ffa9c1fd9053e7ab868d0f28b10
Author: Andrew Or <[email protected]>
Date: 2014-02-26T23:27:43Z
Embed storage status and RDD info in Task events
This commit achieves three main things. First and foremost, it embeds the information from the SparkListenerFetchStorageStatus and SparkListenerGetRDDInfo events into events that are more descriptive of the SparkListenerInterface. In particular, every Task now maintains a list of blocks whose storage statuses have been updated as a result of the task. Previously, this information was retrieved by fetching storage status from the driver, an action arbitrarily associated with a stage. This change involves keeping track of which blocks are dropped during each call to an RDD persist. A big TODO is to also capture the behavior of an RDD unpersist in a SparkListenerEvent.
Second, the SparkListenerEvent interface now handles the dynamic nature of Executors. In particular, a new event, SparkListenerExecutorStateChange, is introduced, which triggers a storage status fetch from the driver. The purpose of this is mainly to decouple fetching storage status from the driver from the Stage. Note that storage status is not ready until the remote BlockManagers have been registered, so this involves attaching a registration listener to the BlockManagerMasterActor.
Third, changes in environment properties are now supported. This accounts for the fact that the user can invoke sc.addFile and sc.addJar in his/her own application, which should be reflected appropriately on the EnvironmentUI. In the previous implementation, coupling this information with application start prevented this from happening.
Other relatively minor changes include: 1) refactoring BlockStatus and BlockManagerInfo to not be a part of the BlockManagerMasterActor object, 2) formatting changes, especially those involving multi-line arguments, and 3) making all UI widgets and listeners private[ui] instead of private[spark].
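The first change can be pictured roughly as follows (a sketch with assumed names, not the patch's types): each task-end summary carries the block-level storage updates it caused, so no separate driver fetch needs to be tied to a stage.

```scala
object TaskStorageSketch {
  // Illustrative stand-in: a block whose storage status changed as a result of a task.
  case class BlockUpdate(blockId: String, storageLevel: String, memSize: Long, diskSize: Long)

  // The task-end summary carries its own list of updated blocks instead of relying on
  // a storage-status fetch from the driver that was arbitrarily tied to a stage.
  case class TaskEndSummary(taskId: Long, updatedBlocks: Seq[BlockUpdate])

  def example: TaskEndSummary =
    TaskEndSummary(
      taskId = 42L,
      updatedBlocks = Seq(BlockUpdate("rdd_0_3", "MEMORY_ONLY", memSize = 1024L, diskSize = 0L)))
}
```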
commit 28019caa5712b8d7f1db039dc41876d91e530998
Author: Andrew Or <[email protected]>
Date: 2014-02-27T00:47:00Z
Merge github.com:apache/spark
Conflicts:
core/src/main/scala/org/apache/spark/CacheManager.scala
core/src/main/scala/org/apache/spark/SparkEnv.scala
core/src/main/scala/org/apache/spark/scheduler/JobLogger.scala
core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala
core/src/main/scala/org/apache/spark/storage/BlockManager.scala
core/src/main/scala/org/apache/spark/storage/MemoryStore.scala
core/src/main/scala/org/apache/spark/storage/StorageUtils.scala
core/src/main/scala/org/apache/spark/ui/SparkUI.scala
core/src/main/scala/org/apache/spark/ui/env/EnvironmentUI.scala
core/src/main/scala/org/apache/spark/ui/exec/ExecutorsUI.scala
core/src/main/scala/org/apache/spark/ui/jobs/IndexPage.scala
core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala
core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala
core/src/main/scala/org/apache/spark/ui/jobs/PoolPage.scala
core/src/main/scala/org/apache/spark/ui/jobs/StagePage.scala
core/src/main/scala/org/apache/spark/ui/jobs/StageTable.scala
core/src/main/scala/org/apache/spark/ui/storage/IndexPage.scala
core/src/main/scala/org/apache/spark/ui/storage/RDDPage.scala
core/src/main/scala/org/apache/spark/util/TimeStampedHashMap.scala
core/src/main/scala/org/apache/spark/util/Utils.scala
core/src/test/scala/org/apache/spark/ui/jobs/JobProgressListenerSuite.scala
commit d1f428591d6c33c2bb86f85468c7842b5ca00311
Author: Andrew Or <[email protected]>
Date: 2014-02-27T01:19:20Z
Migrate from lift-json to json4s-jackson
commit 7b2f8112795a53c35b10bc3d72e5be7b699ceb65
Author: Andrew Or <[email protected]>
Date: 2014-02-27T19:24:32Z
Guard against TaskMetrics NPE + Fix tests
commit 996d7a2f42d4e02c1e40ec22b0c4d7db86aa03e3
Author: Andrew Or <[email protected]>
Date: 2014-02-27T20:26:23Z
Reflect RDD unpersist on UI
This introduces a new event, SparkListenerUnpersistRDD.
commit 472fd8a4845e39a38f8d993a3527a7e77571ffad
Author: Andrew Or <[email protected]>
Date: 2014-02-27T23:03:59Z
Fix a couple of tests
commit d47585f22f243fc7e840af90132edb7e84b003ed
Author: Andrew Or <[email protected]>
Date: 2014-02-28T00:15:21Z
Clean up FileLogger
commit faa113e674a276ddf5cd7dc643c16b7bed2b5e44
Author: Andrew Or <[email protected]>
Date: 2014-02-28T01:12:19Z
General clean up
----