Hi dev@spark, I wanted to quickly ping about Spree
<http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/>,
a live-updating web UI for Spark that I released on Friday (along with some
supporting infrastructure), and mention a couple of things that came up
while working on it that are relevant to this list.

This blog post
<http://www.hammerlab.org/2015/07/25/spree-58-a-live-updating-web-ui-for-spark/>
and the GitHub repo <https://github.com/hammerlab/spree/> have lots of info about
functionality, implementation details, and installation instructions, but
the tl;dr is:

   - You register a SparkListener called JsonRelay
   <https://github.com/hammerlab/spark-json-relay> via the
   spark.extraListeners conf (thanks @JoshRosen!); see the registration
   sketch just after this list.
   - That listener ships SparkListenerEvents to a server called slim
   <https://github.com/hammerlab/slim> that stores them in Mongo.
      - Really what it stores are a bunch of stats similar to those
      maintained by JobProgressListener.
   - A Meteor <https://www.meteor.com/> app displays live-updating views
   of what’s in Mongo.
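
For concreteness, the registration step is pure configuration; a minimal
sketch (assuming the JsonRelay jar is already on the driver classpath, and
leaving out whatever conf JsonRelay needs in order to find slim):

    import org.apache.spark.{SparkConf, SparkContext}

    // Equivalent to passing --conf spark.extraListeners=org.apache.spark.JsonRelay
    // to spark-submit: Spark instantiates the named class and registers it on the
    // driver's listener bus at SparkContext startup.
    val conf = new SparkConf()
      .setAppName("my-app")
      .set("spark.extraListeners", "org.apache.spark.JsonRelay")
    val sc = new SparkContext(conf)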

Feel free to read about it / try it! The rest of this email is just
questions about Spark APIs and plans.

JsonProtocol scoping

The most awkward thing about Spree is that JsonRelay declares itself to be
in org.apache.spark
<https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L1>
so that it can use JsonProtocol.
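
Concretely, the hack is nothing more than a package declaration; a
stripped-down sketch (not JsonRelay's actual source, and EventJson is only
an illustrative name):

    package org.apache.spark // claimed purely so private[spark] members are visible

    import org.apache.spark.scheduler.SparkListenerEvent
    import org.apache.spark.util.JsonProtocol // private[spark] object
    import org.json4s.JsonAST.JValue

    object EventJson {
      // This call only compiles because the file pretends to live in org.apache.spark.
      def toJson(event: SparkListenerEvent): JValue =
        JsonProtocol.sparkEventToJson(event)
    }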

Will JsonProtocol be private[spark] forever, on purpose, or is it just not
considered stable enough yet, so you want to discourage direct use? I’m
relatively impartial at this point since I’ve done the hacky thing and it
works for my purposes, but thought I’d ask in case there are interesting
perspectives on the ideal scope for it going forward.

@DeveloperApi trait SparkListener

Another set of tea leaves I wasn’t sure how to read was the @DeveloperApi-ness
of SparkListener
<https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/SparkListener.scala#L131-L132>.
I assumed I was doing something frowny by having JsonRelay implement the
SparkListener interface. However, I just noticed that I’m actually
extending SparkFirehoseListener
<https://github.com/apache/spark/blob/v1.4.1/core/src/main/java/org/apache/spark/SparkFirehoseListener.java>,
which is *not* @DeveloperApi afaict, so maybe I’m ok there after all?
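
For anyone who hasn’t used it: SparkFirehoseListener funnels every event
type through a single onEvent callback, which is exactly what a
relay-everything listener wants. A rough sketch, with RelayLikeListener and
ship as illustrative stand-ins:

    import org.apache.spark.SparkFirehoseListener
    import org.apache.spark.scheduler.SparkListenerEvent

    class RelayLikeListener extends SparkFirehoseListener {
      // One hook sees every SparkListener*Event, instead of overriding each of
      // SparkListener's onJobStart/onTaskEnd/... callbacks individually.
      override def onEvent(event: SparkListenerEvent): Unit = ship(event)

      private def ship(event: SparkListenerEvent): Unit = {
        // serialize and send it somewhere; JsonRelay writes JSON to a network address
      }
    }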

Are there other SparkListener implementations of note in the wild (seems
like “no”)? Is that an API that people can and should use externally (seems
like “yes” to me)? I saw @vanzin recently imply on this list that the
answers may be “no” and “no”
<http://apache-spark-developers-list.1001551.n3.nabble.com/Slight-API-incompatibility-caused-by-SPARK-4072-tp13257.html>.

Augmenting JsonProtocol

JsonRelay does two things that JsonProtocol does not:

   - adds an appId field to all events; this makes it possible/easy for
   downstream things (slim, in this case) to handle information about
   multiple Spark applications (see the sketch just after this list).
   - JSON-serializes SparkListenerExecutorMetricsUpdate events. This was
   added to JsonProtocol in SPARK-9036
   <https://issues.apache.org/jira/browse/SPARK-9036> (though it’s unused
   in the Spark repo currently), but I’ll have to leave my version in as long
   as I want to support Spark <= 1.4.1.
      - From one perspective, JobProgressListener was sort of “cheating” by
      using these events that were previously not accessible via
      JsonProtocol.
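
The appId part, at least, is tiny; modulo the ExecutorMetricsUpdate special
case, it amounts to prepending one field to whatever JsonProtocol emits. A
paraphrase, not a verbatim excerpt (TaggedEvents is just an illustrative
name, and this again assumes the code lives under org.apache.spark so that
JsonProtocol is visible):

    import org.apache.spark.scheduler.SparkListenerEvent
    import org.apache.spark.util.JsonProtocol
    import org.json4s.JsonAST.{JObject, JString, JValue}

    object TaggedEvents {
      // Tag each serialized event with the application ID so a single downstream
      // consumer (slim) can keep events from multiple Spark apps separate.
      def eventToJson(event: SparkListenerEvent, appId: String): JValue =
        JsonProtocol.sparkEventToJson(event) match {
          case JObject(fields) => JObject(("appId" -> JString(appId)) :: fields)
          case other           => other // events currently always serialize to objects
        }
    }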

Making sure external tools can get at the same kinds of data as the
internal listeners seems like a good principle to maintain, and it’s also
relevant to the JsonProtocol scoping questions above.

Should JsonProtocol add appIds to all events itself? Should Spark make it
easier for downstream things to process events from multiple Spark
applications? JsonRelay currently pulls the app ID out of the SparkConf
that it is instantiated with
<https://github.com/hammerlab/spark-json-relay/blob/1.0.0/src/main/scala/org/apache/spark/JsonRelay.scala#L16>;
it works, but also feels hacky and like maybe I’m doing things I’m not
supposed to.
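
For reference, the spark.extraListeners machinery from SPARK-4072 will use
a single-SparkConf-argument constructor when the listener has one, and
spark.app.id is already set on that conf by the time the listener is
constructed, which is all JsonRelay leans on. A sketch (class name
illustrative):

    import org.apache.spark.{SparkConf, SparkFirehoseListener}
    import org.apache.spark.scheduler.SparkListenerEvent

    class AppIdRelay(conf: SparkConf) extends SparkFirehoseListener {
      // Set by SparkContext before extra listeners are instantiated.
      private val appId: String = conf.get("spark.app.id")

      override def onEvent(event: SparkListenerEvent): Unit = {
        // attach appId to the serialized event and forward it, as sketched above
      }
    }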

Thrift SparkListenerEvent Implementation?

A few months ago I built a first version of this project involving a
SparkListener called Spear <https://github.com/hammerlab/spear> that
aggregated stats from SparkListenerEvents *and* wrote those stats to Mongo,
combining JsonRelay and slim from above.

Spear used a couple of libraries (Rogue
<https://github.com/foursquare/rogue> and Spindle
<https://github.com/foursquare/spindle>) to define schemas in thrift,
generate Scala for those classes, and do all the Mongo querying in a nice,
type-safe way.

Unfortunately for me, all of the Mongo queries were synchronous in that
implementation, which led to events being dropped
<https://github.com/apache/spark/blob/v1.4.1/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L40>
when I tested it on large jobs (thanks a lot to @squito for helping debug
that). I got spooked and decided all possible work had to be removed from
the Spark driver, hence JsonRelay doing the most minimal possible thing and
sending the events to a network address.
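
Not the route I ended up taking — I wanted the work out of the driver
process entirely — but for anyone hitting the same wall: the listener bus
delivers events on a single thread with a bounded queue, so one standard
mitigation is to keep onEvent itself cheap and do the slow writes on your
own thread. Illustrative sketch only:

    import java.util.concurrent.LinkedBlockingQueue

    import org.apache.spark.SparkFirehoseListener
    import org.apache.spark.scheduler.SparkListenerEvent

    class CheapOnEventRelay(write: SparkListenerEvent => Unit) extends SparkFirehoseListener {
      // Unbounded hand-off queue: trades dropped events for driver memory.
      private val queue = new LinkedBlockingQueue[SparkListenerEvent]()

      private val worker = new Thread(new Runnable {
        override def run(): Unit = while (true) write(queue.take())
      }, "event-relay-worker")
      worker.setDaemon(true)
      worker.start()

      // O(enqueue), so LiveListenerBus's dispatch thread never blocks here.
      override def onEvent(event: SparkListenerEvent): Unit = queue.offer(event)
    }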

I also decided to rewrite my JobProgressListener-equivalent piece in Node
rather than Scala, because I was finding the type-safety a little
constraining and because Rogue and Spindle are no longer maintained.

Anyways, a couple of things I want to call out, given this back-story:

   - There is a bunch of boilerplate related to implementing my Mongo
   schemas sitting abandoned in Spear, in thrift
   <https://github.com/hammerlab/spear/blob/5f2affe80fae833a120eb63d51154c3c00ee57b0/src/main/thrift/spark.thrift>
   and case classes
   <https://github.com/hammerlab/spear/blob/6f9efa9aab0ce00fe229083ded237199aafd9b74/src/main/scala/org/hammerlab/spear/SparkIDL.scala>.
      - One thing on my roadmap at the time was to replicate the
      SparkListenerEvents themselves in thrift and/or case classes.
      - They seem like things you’d want to be able to send in and out of
      various services/languages easily.
      - Writing the raw events to Mongo is a related goal for slim, covered
      by slim#35 <https://github.com/hammerlab/slim/issues/35>.
   - It’s probably not super difficult to take something like Spear and
   make it use non-blocking Mongo queries.
      - This might make it quick enough to not have to drop events.
      - It would simplify the infrastructure story by folding slim into the
      listener that is registered with the driver.

Possible Upstreaming

I don’t know what the interest level is for getting any of this
functionality (e.g. the live-updating web UI, event-stats persistence to
Mongo) upstreamed, and I don’t have time to work on that, so I won’t go
into much detail here. But if anyone’s interested, I have some ideas about
how some of it could happen (e.g. a DDP <https://www.meteor.com/ddp> server
implementation in Java/Scala serving the web UI; Spree currently involves
two JS servers, so some rewriting would probably have to happen), why it
might be good to do, and why it might not be good or not worth it (e.g.
Spark should make sure it’s possible and easy to do sophisticated things
like this outside of the Spark repo, putting more work in the driver
process is a bad idea, etc.).

OK, that’s my brain dump. I’d love to hear people’s thoughts on any/all of
this; otherwise, thanks for the APIs and sorry for having to cheat them a
bit! :)

-Ryan