Github user vanzin commented on the pull request:

    https://github.com/apache/spark/pull/2337#issuecomment-56406159
  
    Lots of questions, let's go one by one.
    
    ##### Motivation
    
    This is discussed in SPARK-2636 (and probably a couple of others), but I'll 
try to summarize it quickly here. Hive-on-Spark generates multiple jobs for a 
single query, and needs to monitor and collect metrics for each of those jobs 
separately. The way to do this in Spark is through `SparkListener`. The missing 
piece is that an action such as `collect()` or `saveAsHadoopFile()` does not 
return a job ID in any way. So HoS was using the async API, since that was the 
recommended workaround, and the fix for SPARK-2636 added the job's IDs to the 
`FutureAction` API. The problem is that it did not expose the job IDs 
correctly, which is why I filed this bug and sent this PR.
    
    ##### Job Groups
    
    I was not familiar with the API, and it sounds great to me. It would make 
monitoring jobs in my remote API prototype (SPARK-3215) much cleaner. The only 
missing piece, from looking at the API, is that I don't see the job group 
anywhere in the events sent to listeners, e.g.:
    
        case class SparkListenerJobStart(jobId: Int, stageIds: Seq[Int], 
properties: Properties = null)
          extends SparkListenerEvent
    
    Unless `properties` contains the job group info somehow (which would look a 
little brittle to me, but is something I could work with), that's something 
that would need to be fixed for HoS to be able to gather the information it 
needs.
    
    ##### Async API vs. Something Else
    
    I'm not sold on using the async API, and in fact its use in my remote 
client prototype looks sort of hacky and ugly. But currently that's the only 
way to gather the information HoS needs. Any substitute needs to allow the 
caller to match events to the job that was submitted, which is not possible via 
other means today (or, at least, not that I can see).
    
    I assume that job groups still work OK with the current async API, since 
the thread-local data uses an `InheritableThreadLocal`.
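    For reference, this is roughly how the async API lets a caller tie 
listener events back to a submitted job today. A hedged sketch, assuming an 
existing `SparkContext` named `sc`; the `"hos-query-1"` group name is made up 
for illustration:

        import org.apache.spark.SparkContext
        import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd}
        import scala.concurrent.ExecutionContext.Implicits.global

        // Sketch: match job-end events back to a submitted action via the
        // job IDs exposed on FutureAction (the SPARK-2636 addition).
        def monitoredCount(sc: SparkContext): Unit = {
          sc.addSparkListener(new SparkListener {
            override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
              println(s"Job ${jobEnd.jobId} finished")
          })
          // The group should be inherited by the async submission thread,
          // assuming the InheritableThreadLocal behavior described above.
          sc.setJobGroup("hos-query-1", "example query")
          val action = sc.parallelize(1 to 100).countAsync()
          action.onComplete { _ =>
            // action.jobIds is what lets the caller correlate listener
            // events with this particular action.
            println(s"Jobs for this action: ${action.jobIds.mkString(",")}")
          }
        }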

