[ https://issues.apache.org/jira/browse/MAPREDUCE-802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12738226#action_12738226 ]
Sreekanth Ramakrishnan commented on MAPREDUCE-802:
--------------------------------------------------

Currently, the problems that arise in components which rely on job events fall into two categories:
# Not all code paths raise status change events. The state change is performed in {{JobInProgress}}, which has no handle to the list of {{JobInProgressListener}}s managed by the {{JobTracker}}. As a result, components that need the state change to remove or update their internal structures for a {{JobInProgress}} object are left out of sync.
# Listeners rely on the {{oldStatus}} field and other members of the event being set correctly by the {{JobTracker}} before it calls them. A notable example is start time changes, described in MAPREDUCE-45.

To solve these problems, I propose the following:
* For case 1: whenever {{JobInProgress}} changes its state, it routes the associated event to the {{JobTracker}}. This ensures that any code which changes the {{JobStatus}} actually results in an event being raised.
* For case 2: we remove the {{oldStatus}} field from {{JobStatusChangeEvent}}, as it is not always correct. This is an incompatible change: the old status is currently used by two listeners, {{JobQueueJobInProgressListener}} for the default scheduler and {{JobQueueManager}} for the capacity scheduler. Both would now have to maintain their own association between a {{JobInProgress}} and its old status.

The proposal changes the current pseudocode for raising events from:
{noformat}
JobStatus oldStatus = (JobStatus) job.getStatus().clone();
// make changes to the job's status
JobStatus newStatus = (JobStatus) job.getStatus().clone();
// create an event carrying both the old and the new status
// inform listeners
{noformat}
to the following:
{noformat}
// make changes to the job
// create a JobChanged event
// inform listeners
{noformat}

Schedulers would therefore have to maintain their own association with the scheduling information they used to populate their internal structures, instead of relying on the {{JobTracker}} to send the correct information.

Currently, the default scheduler {{JobQueueTaskScheduler}} maintains the ordered list of jobs in a {{TreeMap<JobSchedulingInfo,JobInProgress>}}; on an update, the key for that map was constructed from the _oldStatus_ field of the {{JobStatusChangeEvent}}. With _oldStatus_ removed, the default scheduler would have to maintain its own association from job to scheduling info, i.e. a {{Map<JobID,JobSchedulingInfo>}} whose value for a {{JobID}} is the current {{JobSchedulingInfo}} used to insert that job into the scheduler's {{TreeMap}}. When {{jobUpdated()}} is called, the old {{JobSchedulingInfo}} is looked up in the {{Map}} and removed from the {{TreeMap}}; then both the {{Map<JobID,JobSchedulingInfo>}} and the {{TreeMap<JobSchedulingInfo,JobInProgress>}} are updated with the most recent {{JobSchedulingInfo}}.

Any comments on the above proposal and the changes it would bring to the framework?

> Simplify the job updated event notification between Jobtracker and schedulers
> -----------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-802
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-802
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>            Reporter: Hemanth Yamijala
>            Assignee: Sreekanth Ramakrishnan
>
> HADOOP-4053 and HADOOP-4149 added events to notify the scheduler of updates
> to the state or properties of a job, such as its run state or priority.
> We've seen some issues with this framework, such as the following:
> - Events are not raised at all the places where they should be. For instance,
> when a new code path is added to kill a job, raising the event is missed out.
> - Events are raised with incorrect event data. For example, the start time
> value is typically missed out.
> The resulting break in the contract between the jobtracker and the schedulers
> has led to problems in the capacity scheduler, such as jobs remaining stuck in
> the queue without ever being removed.
> It has proven complicated to get this right in the framework: fixes have
> typically still left dangling cases, or new code paths have introduced new bugs.
> This JIRA is about simplifying the interaction model so that it is more robust
> and works well.
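A minimal Java sketch of the routing proposed for case 1, under a deliberately simplified model. All names here ({{SketchJob}}, {{SketchTracker}}, {{SketchJobListener}}, {{JobChangedEvent}}, {{RunState}}) are hypothetical stand-ins for {{JobInProgress}}, {{JobTracker}}, {{JobInProgressListener}} and the proposed JobChanged event; they are not Hadoop's actual classes or signatures. The job holds a handle to the tracker, so every mutator, including a newly added kill path, notifies the listeners, and the event carries no old status.

```java
import java.util.ArrayList;
import java.util.List;

// Usage demo, declared first so single-file launch picks it up.
class EventRoutingDemo {
    public static void main(String[] args) {
        SketchTracker tracker = new SketchTracker();
        List<String> seen = new ArrayList<>();
        tracker.addListener(e -> seen.add(e.job.getId() + ":" + e.job.getState()));

        SketchJob job = new SketchJob("job_1", tracker);
        job.setState(RunState.RUNNING);
        job.kill();
        System.out.println(seen); // prints [job_1:RUNNING, job_1:KILLED]
    }
}

// Hypothetical run states; a stand-in for JobStatus run states.
enum RunState { PREP, RUNNING, SUCCEEDED, FAILED, KILLED }

// Hypothetical "JobChanged" event: it carries only the job, with no oldStatus field.
final class JobChangedEvent {
    final SketchJob job;

    JobChangedEvent(SketchJob job) {
        this.job = job;
    }
}

// Hypothetical listener contract, loosely modelled on JobInProgressListener.
interface SketchJobListener {
    void jobUpdated(JobChangedEvent event);
}

// Hypothetical stand-in for JobTracker: it owns the listener list.
final class SketchTracker {
    private final List<SketchJobListener> listeners = new ArrayList<>();

    void addListener(SketchJobListener listener) {
        listeners.add(listener);
    }

    // Single entry point through which every job state change is reported.
    void jobChanged(SketchJob job) {
        JobChangedEvent event = new JobChangedEvent(job);
        for (SketchJobListener listener : listeners) {
            listener.jobUpdated(event);
        }
    }
}

// Hypothetical stand-in for JobInProgress: it keeps a handle to the tracker,
// so every mutator raises an event no matter which code path calls it.
final class SketchJob {
    private final String id;
    private final SketchTracker tracker;
    private RunState state = RunState.PREP;
    private int priority;

    SketchJob(String id, SketchTracker tracker) {
        this.id = id;
        this.tracker = tracker;
    }

    String getId() { return id; }
    RunState getState() { return state; }
    int getPriority() { return priority; }

    void setState(RunState newState) {
        state = newState;
        tracker.jobChanged(this); // listeners read the current state from the job
    }

    void setPriority(int newPriority) {
        priority = newPriority;
        tracker.jobChanged(this);
    }

    // A newly added kill path is covered automatically because it goes through setState.
    void kill() {
        setState(RunState.KILLED);
    }
}
```

In this model listeners read whatever they need from the job at notification time, which is why any previous state they care about has to be remembered on the listener side.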
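The default-scheduler bookkeeping described in the comment (a {{Map<JobID,JobSchedulingInfo>}} alongside the {{TreeMap<JobSchedulingInfo,JobInProgress>}}) can be sketched roughly as follows. This is an illustration under assumptions, not the {{JobQueueJobInProgressListener}} code: {{SchedInfo}} and {{FifoQueueSketch}} are hypothetical names, a String job id stands in for both {{JobID}} and {{JobInProgress}}, and the ordering (priority, then start time, then job id, with a lower priority value first) is assumed. On {{jobUpdated()}} the stale key is taken from the side map rather than from an {{oldStatus}} field.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Usage demo, declared first so single-file launch picks it up.
class QueueDemo {
    public static void main(String[] args) {
        FifoQueueSketch queue = new FifoQueueSketch();
        queue.jobAdded(new SchedInfo("job_1", 1, 100L));
        queue.jobAdded(new SchedInfo("job_2", 1, 200L));
        // job_2 moves to a higher priority (lower value); no old status is supplied.
        queue.jobUpdated(new SchedInfo("job_2", 0, 200L));
        System.out.println(queue.orderedJobIds()); // prints [job_2, job_1]
    }
}

// Hypothetical snapshot of the fields the queue sorts on (stand-in for JobSchedulingInfo).
final class SchedInfo {
    final String jobId;
    final int priority;   // assumption: a lower value is scheduled first
    final long startTime;

    SchedInfo(String jobId, int priority, long startTime) {
        this.jobId = jobId;
        this.priority = priority;
        this.startTime = startTime;
    }
}

// Hypothetical FIFO queue keeping both structures described in the proposal.
final class FifoQueueSketch {
    // Priority first, then start time, then job id, so each job maps to a distinct key.
    private static final Comparator<SchedInfo> ORDER =
            Comparator.comparingInt((SchedInfo s) -> s.priority)
                    .thenComparingLong(s -> s.startTime)
                    .thenComparing(s -> s.jobId);

    // Ordered queue; the String value stands in for the JobInProgress.
    private final TreeMap<SchedInfo, String> queue = new TreeMap<>(ORDER);
    // Job id -> the exact key currently stored in 'queue' for that job.
    private final Map<String, SchedInfo> current = new HashMap<>();

    void jobAdded(SchedInfo info) {
        current.put(info.jobId, info);
        queue.put(info, info.jobId);
    }

    // Receives only the latest scheduling info; the stale key comes from 'current'.
    void jobUpdated(SchedInfo latest) {
        SchedInfo previous = current.put(latest.jobId, latest);
        if (previous != null) {
            queue.remove(previous);
        }
        queue.put(latest, latest.jobId);
    }

    void jobRemoved(String jobId) {
        SchedInfo previous = current.remove(jobId);
        if (previous != null) {
            queue.remove(previous);
        }
    }

    List<String> orderedJobIds() {
        return new ArrayList<>(queue.values());
    }
}
```

Because the {{TreeMap}} locates entries through the comparator, the stale entry can only be removed with a key holding the values it was inserted with. Keeping that snapshot in the side map is what lets a change such as a new start time (the MAPREDUCE-45 case) be handled without the {{JobTracker}} supplying the old values.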