Re: "Spree": Live-updating web UI for Spark

2015-07-29 Thread mkhaitman
We tested this out on our dev cluster (Hadoop 2.7.1 + Spark 1.4.0), and it
looks great! I might also be interested in contributing to it when I get a
chance! Keep up the awesome work! :)

Mark.






Re: "Spree": Live-updating web UI for Spark

2015-07-27 Thread Pedro Rodriguez
+1 to awesome work. I saw it this morning and it solves an annoyance/problem
I have had for a while, and had even thought about contributing something
for myself. I am excited to give it a try.

Pedro

On Mon, Jul 27, 2015 at 2:59 PM, Ryan Williams <
ryan.blake.willi...@gmail.com> wrote:

> [snip; the full message appears in the original post below]

"Spree": Live-updating web UI for Spark

2015-07-27 Thread Ryan Williams
Hi dev@spark, I wanted to quickly ping about Spree, a live-updating web UI
for Spark that I released on Friday (along with some supporting
infrastructure), and mention a couple things that came up while I worked on
it that are relevant to this list.

This blog post and the github repo have lots of info about functionality,
implementation details, and installation instructions, but the tl;dr is:

   - You register a SparkListener called JsonRelay via the
     spark.extraListeners conf (thanks @JoshRosen!); a registration sketch
     follows just after this list.
   - That listener ships SparkListenerEvents to a server called slim that
     stores them in Mongo.
      - Really what it stores are a bunch of stats similar to those
        maintained by JobProgressListener.
   - A Meteor app displays live-updating views of what’s in Mongo.
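
For the curious, here is a minimal sketch of that registration step; the
org.apache.spark.JsonRelay class name follows from the description above,
but the rest (app name, etc.) is illustrative:

    import org.apache.spark.{SparkConf, SparkContext}

    // Register the listener by fully-qualified class name via
    // spark.extraListeners; Spark instantiates it reflectively at startup.
    val conf = new SparkConf()
      .setAppName("spree-demo")
      .set("spark.extraListeners", "org.apache.spark.JsonRelay")
    val sc = new SparkContext(conf)
    // Every SparkListenerEvent is now also delivered to JsonRelay.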

Feel free to read about it / try it! But the rest of this email is just
questions about Spark APIs and plans.

JsonProtocol scoping

The most awkward thing about Spree is that JsonRelay declares itself to be
in org.apache.spark so that it can use JsonProtocol.

Will JsonProtocol be private[spark] forever, on purpose, or is it just not
considered stable enough yet, so you want to discourage direct use? I’m
relatively impartial at this point since I’ve done the hacky thing and it
works for my purposes, but thought I’d ask in case there are interesting
perspectives on the ideal scope for it going forward.
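
(For concreteness: the package trick amounts to compiling into Spark’s own
package so that private[spark] members become visible. A minimal sketch,
with the helper object name being mine:)

    // Compiling into org.apache.spark grants access to private[spark]
    // members like org.apache.spark.util.JsonProtocol.
    package org.apache.spark

    import org.apache.spark.scheduler.SparkListenerEvent
    import org.apache.spark.util.JsonProtocol
    import org.json4s.JsonAST.JValue

    object EventJson { // hypothetical helper
      def toJson(event: SparkListenerEvent): JValue =
        JsonProtocol.sparkEventToJson(event)
    }
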
@DeveloperApi trait SparkListener

Another set of tea leaves I wasn’t sure how to read was the
@DeveloperApi-ness of SparkListener. I assumed I was doing something frowny
by having JsonRelay implement the SparkListener interface. However, I just
noticed that I’m actually extending SparkFirehoseListener, which is *not*
@DeveloperApi afaict, so maybe I’m ok there after all?
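
(For reference, extending SparkFirehoseListener means overriding a single
callback that receives every event; a rough sketch of the shape, with the
class name and body being mine:)

    import org.apache.spark.SparkFirehoseListener
    import org.apache.spark.scheduler.SparkListenerEvent

    // SparkFirehoseListener funnels all listener callbacks into onEvent,
    // so one override sees the full event stream.
    class RelaySketch extends SparkFirehoseListener { // hypothetical name
      override def onEvent(event: SparkListenerEvent): Unit = {
        // serialize the event and ship it somewhere (e.g. to slim)
      }
    }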

Are there other SparkListener implementations of note in the wild (seems
like “no”)? Is that an API that people can and should use externally (seems
like “yes” to me)? I saw @vanzin recently imply on this list that the
answers may be “no” and “no”.
Augmenting JsonProtocol

JsonRelay does two things that JsonProtocol does not:

   - adds an appId field to all events; this makes it possible/easy for
     downstream things (slim, in this case) to handle information about
     multiple Spark applications (a sketch of the merge follows this list).
   - JSON-serializes SparkListenerExecutorMetricsUpdate events. This was
     added to JsonProtocol in SPARK-9036 (though it’s unused in the Spark
     repo currently), but I’ll have to leave my version in as long as I want
     to support Spark <= 1.4.1.
      - From one perspective, JobProgressListener was sort of “cheating” by
        using these events that were previously not accessible via
        JsonProtocol.
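
(The appId addition amounts to a small JSON merge on top of JsonProtocol’s
output; a sketch in json4s, which is what JsonProtocol emits, with the
helper name being mine:)

    import org.json4s.JsonAST.{JObject, JString, JValue}

    // Hypothetical helper: prepend an appId field to a serialized event so
    // downstream consumers can tell applications apart.
    def withAppId(eventJson: JValue, appId: String): JValue =
      eventJson match {
        case JObject(fields) => JObject(("appId" -> JString(appId)) :: fields)
        case other           => other // non-object JSON passes through
      }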

It seems like making an effort to let external tools get the same kinds of
data as the internal listeners is a good principle to try to maintain,
which is also relevant to the scoping questions about JsonProtocol above.

Should JsonProtocol add appIds to all events itself? Should Spark make it
easier for downstream things to process events from multiple Spark
applications? JsonRelay currently pulls the app ID out of the SparkConf
that it is instantiated with; it works, but also feels hacky and like maybe
I’m doing things I’m not supposed to.
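
(Roughly what that looks like: listeners named in spark.extraListeners may
declare a SparkConf constructor, and afaict “spark.app.id” is set by
SparkContext before the extra listeners are instantiated; treat this as a
sketch rather than a guarantee:)

    import org.apache.spark.{SparkConf, SparkFirehoseListener}
    import org.apache.spark.scheduler.SparkListenerEvent

    class RelayWithAppId(conf: SparkConf) extends SparkFirehoseListener {
      // The app ID this listener's events should be tagged with.
      private val appId: String = conf.get("spark.app.id", "<unknown>")

      override def onEvent(event: SparkListenerEvent): Unit = {
        // merge appId into the event JSON before relaying it
      }
    }
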
Thrift SparkListenerEvent Implementation?

A few months ago I built a first version of this project involving a
SparkListener called Spear that aggregated stats from SparkListenerEvents
*and* wrote those stats to Mongo, combining JsonRelay and slim from above.

Spear used a couple of libraries (Rogue and Spindle) to define schemas in
thrift, generate Scala classes from them, and do all the Mongo querying in
a nice, type-safe way.

Unfortunately for me, all of the Mongo queries were synchronous in that
implementation, which led to events being dropped when I tested it on large
jobs (tha