Github user zentol commented on a diff in the pull request:

    https://github.com/apache/flink/pull/2683#discussion_r84846296
  
    --- Diff: 
flink-runtime/src/main/scala/org/apache/flink/runtime/jobmanager/JobManager.scala
 ---
    @@ -1828,6 +1828,33 @@ class JobManager(
         jobManagerMetricGroup.gauge[Long, Gauge[Long]]("numRunningJobs", new 
Gauge[Long] {
           override def getValue: Long = JobManager.this.currentJobs.size
         })
    +    jobManagerMetricGroup.gauge[Long, Gauge[Long]]("numFailedJobs", new 
Gauge[Long] {
    +      override def getValue: Long = {
    +         var failedJobs = 0
    +         val ourJobs = createJobStatusOverview()
    +         val future = (archive ? 
RequestJobsOverview.getInstance())(timeout)
    +         val archivedJobs : JobsOverview = Await.result(future, 
timeout).asInstanceOf[JobsOverview]
    +         failedJobs += ourJobs.getNumJobsFailed() + 
archivedJobs.getNumJobsFailed()
    +         failedJobs
    +    }})
    +    jobManagerMetricGroup.gauge[Long, Gauge[Long]]("numCancelledJobs", new 
Gauge[Long] {
    +      override def getValue: Long = {
    +         var cancelledJobs = 0
    +         val ourJobs = createJobStatusOverview()
    +         val future = (archive ? 
RequestJobsOverview.getInstance())(timeout)
    +         val archivedJobs : JobsOverview = Await.result(future, 
timeout).asInstanceOf[JobsOverview]
    +         cancelledJobs += ourJobs.getNumJobsCancelled() + 
archivedJobs.getNumJobsCancelled()
    +         cancelledJobs
    +    }})
    +    jobManagerMetricGroup.gauge[Long, Gauge[Long]]("numFinishedJobs", new 
Gauge[Long] {
    +      override def getValue: Long = {
    +         var finishedJobs = 0
    +         val ourJobs = createJobStatusOverview()
    +         val future = (archive ? 
RequestJobsOverview.getInstance())(timeout)
    +         val archivedJobs : JobsOverview = Await.result(future, 
timeout).asInstanceOf[JobsOverview]
    +         finishedJobs += ourJobs.getNumJobsFinished() + 
archivedJobs.getNumJobsFinished()
    +         finishedJobs
    +    }})
    --- End diff --
    
    Generally i would say no, since there is always the chance it may block for 
the full timeout duration. 
    So in this case, in theory, with the default timeout of 10 seconds, we 
could block the reporter thread for half a minute. Now this isn't very likely 
since we query the MemoryArchivist within the JM, but still.
    
    I'm just wondering whether it makes sense to add this metric; with FLIP-6 
around the corner, which will make it obsolete anyway.
    
    if we merge it I would like to see some shared object so that we don't do 
the same RPC call 3 times.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

Reply via email to