Hello all,

I'm currently looking at how to hook Spark up to the Application
Timeline Server (ATS) work going on in Yarn (YARN-1530). I've got a
reasonable idea of how the Yarn part works, and the basic idea of what
needs to be done in Spark, but I've run into a couple of issues with
the current listener framework and I'd like to ask for some guidance.

(i) SparkContext.addSparkListener() may miss events

Notably, it's not possible to add a listener to catch the initial
SparkListenerEnvironmentUpdate and SparkListenerApplicationStart
events; the second one is particularly important for something like an
event logger. The current EventLoggingListener gets around that by
being initialized by SparkContext itself, but for other listeners,
that option isn't available (and that doesn't seem like a scalable
solution either).
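
To make the problem concrete (MyTimelineListener is just a
placeholder for whatever listener wants those events):

    val sc = new SparkContext(conf)
    // By the time the constructor returns, SparkContext has already
    // posted SparkListenerEnvironmentUpdate and
    // SparkListenerApplicationStart, so this listener never sees them:
    sc.addSparkListener(new MyTimelineListener())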

My initial idea was to add a "listeners" argument to the "big"
SparkContext constructor, but I'm open to different suggestions, since
I kinda dislike it when constructors grow too much. (It currently
has 6 arguments, some of which have default values.) A builder-like
pattern could be an option (e.g.
SparkContext.newBuilder().setConf(...).addJars(...).addListener(...).build()).
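
Just to sketch the shape of that (none of this exists today, and for
it to actually fix the problem, whatever constructor build() ends up
calling would need to register the listeners before the initial
events are posted):

    import scala.collection.mutable
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.SparkListener

    class SparkContextBuilder {
      private var conf = new SparkConf()
      private val listeners = mutable.Buffer[SparkListener]()

      def setConf(c: SparkConf): this.type = { conf = c; this }
      def addJars(jars: Seq[String]): this.type = {
        conf.setJars(jars); this
      }
      def addListener(l: SparkListener): this.type = {
        listeners += l; this
      }

      // Would hand "listeners" to a new SparkContext constructor
      // that registers them before posting the initial events:
      def build(): SparkContext = new SparkContext(conf /*, listeners */)
    }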


(ii) Posting things to the ATS requires an ID.

If you look at the TimelineEntity class in Yarn, it requires both a
type (which would be something like "SparkApplication" for Spark) and
an ID. A SparkContext currently has no concept of an application ID (I
don't count the app name as an ID).
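
For reference, posting an entity looks roughly like this with the
YARN-1530 API (names may still move around while that work is in
flight; yarnConf here would be a YarnConfiguration):

    import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity
    import org.apache.hadoop.yarn.client.api.TimelineClient

    val entity = new TimelineEntity()
    entity.setEntityType("SparkApplication")
    entity.setEntityId(appId)  // <- the ID we have no good source for
    entity.setStartTime(System.currentTimeMillis())

    val client = TimelineClient.createTimelineClient()
    client.init(yarnConf)
    client.start()
    client.putEntities(entity)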

Using a random UUID is possible, but I think it's ugly.

EventLoggingListener uses app name + System.currentTimeMillis, which
is better from a user-friendliness p.o.v. (if you ignore the
possibility of clashes).
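
Paraphrasing what it does (not the exact code):

    val id = appName.replaceAll("[ :/]", "-").toLowerCase +
      "-" + System.currentTimeMillis
    // Two apps with the same name launched in the same millisecond
    // would still collide.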

But really my preferred solution here would be to use the Yarn
application id. That would make it easy to correlate this data with
more generic data kept by Yarn (see YARN-321).

The problem here is that we don't know this ID until way after the
SparkContext is created. I can see two different ways to solve this
issue.

A more hackish way would be to expose the listeners to the Yarn code, so
that it can then find the Yarn-specific listener and trigger some
action to update the application id.
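
e.g., something like this from the Yarn-side code
(YarnTimelineListener and setAppId are made up, and it assumes the
bus exposes its registered listeners):

    sc.listenerBus.sparkListeners.collectFirst {
      case l: YarnTimelineListener => l
    }.foreach(_.setAppId(appAttemptId.getApplicationId.toString))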

A more generic way would be to allow arbitrary events to be posted to
the bus, not just the ones declared in SparkListener.scala. This way,
the Yarn code could publish a "SparkYarnApplicationStarted" event that
has the information the listener wants, and other listeners could
potentially use that information too.
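
Roughly (both the post(Any) hook and the event itself are
hypothetical):

    // Defined by the Yarn integration code:
    case class SparkYarnApplicationStarted(appId: String)

    // Posted from the Yarn side once the ID is known:
    listenerBus.post(SparkYarnApplicationStarted(appId.toString))

    // Interested listeners pattern-match; everyone else ignores it:
    def onEvent(event: Any): Unit = event match {
      case SparkYarnApplicationStarted(id) => // record the ID
      case _ => // not an event we care about
    }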

How do you guys feel about the latter?


Feedback here is greatly appreciated! :-)

-- 
Marcelo
