[ 
https://issues.apache.org/jira/browse/HADOOP-10090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Mitic updated HADOOP-10090:
--------------------------------

    Attachment: OneBoxRepro.png

I was able to repro the problem by looking at JMX metrics thru jconsole. See 
attached screenshot.

The bug appears to be coming from how the JMX metrics cache is managed in 
MetricsSourceAdapter.java. To be more specific:
a) MetricsSourceAdapter keeps track of the latest metrics records obtained from 
the call into MetricsSource#getMetrics (see MetricsSourceAdapter#getMetrics)
b) Metrics system timer thread which periodically polls on all sources, only 
polls for the metrics that have changed (not all metrics)

Scenario from this bug (JobTracker metrics):
 - Let's assume that the timer runs once per minute
 - At time 0, there were 0 jobs_submitted and 0 jobs_completed, and internal 
JMX cache reflects this state
 - At time t1, MetricsSystem timer thread runs and there were 1 jobs_submitted, 
and 0 jobs_completed. Internal JMX cache is not updated.
 - At time t1 + 0.5 minutes, above job is completed, and jobs_completed counter 
is set to 1.
 - At time t1 + 1 minute, MetricsSystem timer thread runs, and only 
jobs_completed metric changes from 0 to 1. Internal MetricsSourceAdapter 
{{lastRecs}} contains only jobs_completed metric.
 - At time t1 + 1 minute + delta, JMX is polled for the current metrics, the 
current cache is updated using the current {{lastRecs}}, and the final outcome 
is jobs_submitted = 0, jobs_completed = 1.

Since it is not obvious to me what is the intended behavior around metrics 
reported thru JMX is, I see two possible directions for the fix:
# JMX always reports the latest snapshot obtained by the MetricsSystem timer 
thread. This means that JMX does not report the fresh/current metrics state. 
Each time metrics are fetched from the timer thread, internal 
MetricsSourceAdapter JMX cache would be refreshed to keep the state in sync.
# JMX always reports the current metric values (with small TTL on the internal 
cache). In this design, MetricsSourceAdapter would (almost) always have to poll 
on the source for JMX. Additionally, there would have to be means to mark the 
source as dirty internally when this happens, to guarantee that the next time 
timer thread polls for metrics, all metrics are returned (see b. from above).

Option 2 is definitely nicer from users perspective as JMX reflects the current 
in-memory state of servers. 

Please comment back with your thoughts/recommendations.


> Jobtracker metrics not updated properly after execution of a mapreduce job
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-10090
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10090
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: metrics
>    Affects Versions: 1.2.1
>            Reporter: Ivan Mitic
>            Assignee: Ivan Mitic
>         Attachments: OneBoxRepro.png
>
>
> After executing a wordcount mapreduce sample job, jobtracker metrics are not 
> updated properly. Often times the response from the jobtracker has higher 
> number of job_completed than job_submitted (for example 8 jobs completed and 
> 7 jobs submitted). 
> Issue reported by Toma Paunovic.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Reply via email to