Hello all, I'm currently taking a look at how to hook Spark up to the Application Timeline Server (ATS) work going on in Yarn (YARN-1530). I've got a reasonable idea of how the Yarn part works, and a basic idea of what needs to be done in Spark, but I've run into a couple of issues with the current listener framework and I'd like to ask for some guidance.
(i) SparkContext.addSparkListener() may miss events

Notably, it's not possible to add a listener that catches the initial SparkListenerEnvironmentUpdate and SparkListenerApplicationStart events; the second one is particularly important for something like an event logger. The current EventLoggingListener gets around that by being initialized by SparkContext itself, but that option isn't available to other listeners (and it doesn't seem like a scalable solution either).

My initial idea was to add a "listeners" argument to the "big" SparkContext constructor, but I'm open to different suggestions, since I kinda dislike constructors that grow too much. (It currently has 6 arguments, some of which have default values.) A builder-like pattern could be an option, e.g. SparkContext.newBuilder().setConf(...).addJars(...).addListener(...).build(); see the first sketch in the P.S.

(ii) Posting things to the ATS requires an ID

If you look at the TimelineEntity class in Yarn, it requires both a type (which would be something like "SparkApplication" for Spark) and an ID. A SparkContext currently has no concept of an application ID (I don't count the app name as an ID). Using a random UUID is possible, but I think it's ugly. EventLoggingListener uses app name + System.currentTimeMillis(), which is better from a user-friendliness p.o.v. (if you ignore the possibility of clashes).

But really, my preferred solution here would be to use the Yarn application id. That would make it easy to correlate this data with the more generic data kept by Yarn (see YARN-321). The problem is that we don't know this id until well after the SparkContext is created.

I can see two ways to solve this. A more hackish way would be to expose the listeners to the Yarn code, so that it can find the Yarn-specific listener and trigger some action to update the application id. A more generic way would be to allow arbitrary events to be posted to the bus, not just the ones declared in SparkListener.scala. That way, the Yarn code could post a "SparkYarnApplicationStarted" event carrying the information the listener wants, and other listeners could potentially use that information too (see the last sketch in the P.S.). How do you guys feel about the latter?

Feedback here is greatly appreciated! :-)

-- Marcelo
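P.S. A few rough sketches to make the above more concrete. First, the builder idea from (i). None of this exists today; the names are just placeholders:

    import scala.collection.mutable
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.scheduler.SparkListener

    // Hypothetical builder; all names here are made up.
    class SparkContextBuilder {
      private val conf = new SparkConf()
      private val jars = mutable.ArrayBuffer.empty[String]
      private val listeners = mutable.ArrayBuffer.empty[SparkListener]

      def setConf(other: SparkConf): this.type = { conf.setAll(other.getAll); this }
      def addJars(paths: Seq[String]): this.type = { jars ++= paths; this }
      def addListener(l: SparkListener): this.type = { listeners += l; this }

      // build() would register the accumulated listeners on the bus
      // *before* the context posts SparkListenerEnvironmentUpdate and
      // SparkListenerApplicationStart, so nothing is missed.
      def build(): SparkContext = ???
    }

The nice thing is that the "big" constructor stops growing, and adding more construction-time options later doesn't break anyone.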
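For (ii), this is roughly what the ATS half would look like once we have an id. I'm going off the timeline client API in the current YARN-1530 patches, so class and package names may still shift:

    import org.apache.hadoop.yarn.api.records.timeline.{TimelineEntity, TimelineEvent}
    import org.apache.hadoop.yarn.client.api.TimelineClient
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    val client = TimelineClient.createTimelineClient()
    client.init(new YarnConfiguration())
    client.start()

    // Placeholder value; this is exactly the id we don't have at
    // SparkContext creation time.
    val appId = "application_1396374910274_0001"

    val entity = new TimelineEntity()
    entity.setEntityType("SparkApplication")
    entity.setEntityId(appId)
    entity.setStartTime(System.currentTimeMillis())

    val event = new TimelineEvent()
    event.setEventType("APPLICATION_START")
    event.setTimestamp(System.currentTimeMillis())
    entity.addEvent(event)

    client.putEntities(entity)
    client.stop()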
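And last, the "arbitrary events" idea. SparkListenerEvent is currently a sealed trait, so this would mean unsealing it (or adding some other extension point), and the onOtherEvent hook below is made up:

    import org.apache.hadoop.yarn.api.records.ApplicationId
    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}

    // Requires SparkListenerEvent to stop being sealed.
    case class SparkYarnApplicationStarted(appId: ApplicationId)
      extends SparkListenerEvent

    class AtsListener extends SparkListener {
      // Hypothetical catch-all for events not declared in
      // SparkListener.scala.
      def onOtherEvent(event: SparkListenerEvent): Unit = event match {
        case SparkYarnApplicationStarted(appId) =>
          // Now we know the real id and can start posting entities.
          println(s"Yarn app id: $appId")
        case _ => // ignore events we don't care about
      }
    }

    // The Yarn backend would then do something like (assuming the bus
    // is reachable from there):
    //   listenerBus.post(SparkYarnApplicationStarted(appId))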