Github user vanzin commented on the pull request: https://github.com/apache/spark/pull/2337#issuecomment-56406159

Lots of questions, let's go one by one.

##### Motivation

This is discussed in SPARK-2636 (and probably a couple of others), but I'll try to summarize it quickly here. Hive-on-Spark generates multiple jobs for a single query and needs to monitor and collect metrics for each of those jobs separately. The way to do this in Spark is through a `SparkListener`. But the missing piece is that calling an action such as `collect()` or `saveAsHadoopFile()` does not return a job ID in any way. So HoS was using the async API, since that was the recommended workaround, and the fix for SPARK-2636 added the job's ID to the `FutureAction` API. The problem is that it did not expose the job IDs correctly, which is why I filed this bug and sent this PR.

##### Job Groups

I was not familiar with the API, and it sounds great to me. It would make monitoring jobs in my remote API prototype (SPARK-3215) much cleaner. The only missing piece, from looking at the API, is that I don't see "job group" anywhere in the events sent to listeners, e.g.:

```scala
case class SparkListenerJobStart(jobId: Int, stageIds: Seq[Int], properties: Properties = null)
  extends SparkListenerEvent
```

Unless `properties` contains the job group info somehow (which would look a little brittle to me, but I could work with it), that's something that would need to be fixed for HoS to be able to gather the information it needs.

##### Async API vs. Something Else

I'm not sold on using the async API; in fact, its use in my remote client prototype looks sort of hacky and ugly. But currently that's the only way to gather the information HoS needs. Any substitute needs to allow the caller to match events to the job that was submitted, which is not possible via other means today (or at least not that I can see).
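To make the gap concrete, here is a rough sketch (not part of the PR, and `JobTrackingListener` is a hypothetical name) of the kind of listener a client like Hive-on-Spark would write today to tie metrics back to a submitted job. Note there is no job-group field on the event itself; any group id would have to be fished out of `properties`:

```scala
import scala.collection.mutable

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Hypothetical sketch: record which stages belong to which job so that
// per-job metrics can be aggregated when the job finishes.
class JobTrackingListener extends SparkListener {
  private val stagesByJob = mutable.Map[Int, Seq[Int]]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    stagesByJob(jobStart.jobId) = jobStart.stageIds
    // There is no jobGroup field on the event; if the group id is present
    // at all, it has to be dug out of jobStart.properties, which is the
    // brittleness mentioned above.
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    // Aggregate and report metrics for stagesByJob(jobEnd.jobId) here,
    // then drop the entry.
    stagesByJob.remove(jobEnd.jobId)
  }
}
```

The listener would be registered with `sc.addSparkListener(new JobTrackingListener)`; the remaining problem, as described above, is matching the `jobId` seen here to the action the caller invoked.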
I assume that job groups still work OK with the current async API, since the thread-local data uses an `InheritableThreadLocal`.
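That assumption can be checked with a plain JVM example, independent of Spark: an `InheritableThreadLocal` copies the parent thread's value into any thread the parent spawns, which is why a job group set on the caller's thread should survive into the threads the async API uses. A minimal stdlib-only demonstration (the names here are illustrative):

```scala
// Demonstrates that a value set in a parent thread's InheritableThreadLocal
// is visible in a child thread it creates, analogous to how a job group set
// via sc.setJobGroup(...) would propagate to async submission threads.
object InheritableDemo {
  val group = new InheritableThreadLocal[String]

  def main(args: Array[String]): Unit = {
    group.set("hive-query-1") // stand-in for setting a job group
    var seenInChild: String = null
    val child = new Thread(() => { seenInChild = group.get() })
    child.start()
    child.join() // join() gives the needed happens-before for the read
    println(seenInChild) // prints hive-query-1: the child inherited the value
  }
}
```

A regular `ThreadLocal` would print `null` here, which is the failure mode the quoted assumption is ruling out.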