[ https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994140#comment-13994140 ]
Ming Ma commented on MAPREDUCE-5465:
------------------------------------

Thanks, Jason! We discussed the performance implications in https://issues.apache.org/jira/browse/YARN-221. It is good to revisit the issue.

1. I assume job latency is the metric we want to use. The question is how much this change impacts job latency.
2. Say the umbilical notification arrives at t1, the task receives T_ATTEMPT_SUCCEEDED or T_ATTEMPT_FAILED at t2, and MRAppMaster acquires new containers from the RM for the next set of tasks at t3.
3. How much does (t2-t1) impact job latency? It depends on the job characteristics: mapper output can be available sooner, reducer containers can be scheduled sooner, etc. But the impact isn't going to be linear in the number of tasks, given that tasks run in parallel, so it should be much smaller. I don't have a formula. It would be useful to compare the performance difference using actual jobs.
4. Your suggestion of notifying the task/job right after t1 is a good way to reduce (t2-t1). I assume it doesn't change the state transitions of the task attempt. We need to confirm correctness from the state machine point of view, given there might be some assumptions between the task attempt and task state machines.
5. (t3-t1) can also impact job latency. Notifying the task/job earlier won't help reduce (t3-t1).
6. To reduce (t3-t1), perhaps an OutofBandHeartBeat should be sent when a container exits. Currently OutofBandHeartBeat is sent only when stopContainer is called. This may be useful when the NM->RM heartbeat interval is large.
7. There appears to be an issue with how the current stopContainer triggers NodeStatusUpdaterImpl's OutofBandHeartBeat processing. stopContainer first enqueues the "kill" container event and then calls NodeStatusUpdaterImpl's OutofBandHeartBeat. So it is possible that the NodeStatusUpdaterImpl heartbeat thread sends the heartbeat to the RM before the main Dispatcher thread processes the event and marks the container as completed. Thus the OutofBandHeartBeat doesn't include that container in the completed container list. Does stopContainer really need to call NodeStatusUpdaterImpl's OutofBandHeartBeat? It seems better to call it only when a container exits.

> Container killed before hprof dumps profile.out
> -----------------------------------------------
>
>                 Key: MAPREDUCE-5465
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: trunk, 2.0.3-alpha
>            Reporter: Radim Kolar
>            Assignee: Ming Ma
>         Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch,
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch,
> MAPREDUCE-5465.patch
>
>
> If there is profiling enabled for mapper or reducer then hprof dumps
> profile.out at process exit. It is dumped after task signaled to AM that work
> is finished.
> AM kills container with finished work without waiting for hprof to finish
> dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3
> works), it could not finish dump in time before being killed making entire
> dump unusable because cpu and heap stats are missing.
> There needs to be better delay before container is killed if profiling is
> enabled.

--
This message was sent by Atlassian JIRA
(v6.2#6252)