[jira] [Commented] (MAPREDUCE-5465) Container killed before hprof dumps profile.out

Ming Ma (JIRA) Sun, 11 May 2014 18:28:16 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13994140#comment-13994140
 ]


Ming Ma commented on MAPREDUCE-5465:
------------------------------------

Thanks, Jason! We have discussed the performance implication in 
https://issues.apache.org/jira/browse/YARN-221. It is good to revisit the issue.

1. I assume job latency is the metric we want to use. The question is how much 
such a change impacts job latency.

2. Say the umbilical notification arrives at t1, the task receives 
T_ATTEMPT_SUCCEEDED or T_ATTEMPT_FAILED at t2, and MRAppMaster acquires new 
containers from RM for the next set of tasks at t3.

3. How much does (t2-t1) impact job latency? It depends on the job 
characteristics: mapper output can be available sooner, reducer containers can 
be scheduled sooner, etc. But the impact isn't going to be linear in the number 
of tasks, given that tasks run in parallel, so it should be much smaller. I 
don't have a formula. It would be useful to compare the performance difference 
using actual jobs.

4. Your suggestion of notifying the task/job right after t1 is a good way to 
reduce (t2-t1). I assume it doesn't change the state transitions of the task 
attempt. We need to confirm this from a state machine correctness point of 
view, given there might be some assumptions between the task attempt and task 
state machines.

5. (t3-t1) can also impact job latency. Notifying the task/job earlier won't 
reduce (t3-t1).

6. To reduce (t3-t1), perhaps the NM should send an OutofBandHeartBeat when a 
container exits. Currently OutofBandHeartBeat is sent only when stopContainer 
is called. This is probably most useful when the NM->RM heartbeat interval is 
large.

7. There appears to be an issue with how stopContainer currently triggers 
NodeStatusUpdaterImpl's OutofBandHeartBeat processing. stopContainer first 
enqueues the "kill" container event and then calls NodeStatusUpdaterImpl's 
OutofBandHeartBeat. So it is possible that the NodeStatusUpdaterImpl heartbeat 
thread sends the heartbeat to RM before the main Dispatcher thread processes 
the event and marks the container as completed. In that case the 
OutofBandHeartBeat doesn't include that container in the completed container 
list. Does stopContainer really need to call NodeStatusUpdaterImpl's 
OutofBandHeartBeat? It seems better to call it only when a container exits.
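
To make the ordering concern in point 7 concrete, here is a minimal sketch. It 
is not the NodeManager code: the class and method names (OutOfBandHeartbeatSketch, 
enqueueKill, dispatchOne, heartbeat) are hypothetical stand-ins, and it replays 
one possible interleaving on a single thread instead of running real threads. 
In the real NM the bad ordering is only possible, not guaranteed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical model only: none of these names are the real YARN classes.
class OutOfBandHeartbeatSketch {
    private final Queue<String> dispatcherQueue = new ArrayDeque<>();
    private final List<String> completedContainers = new ArrayList<>();

    // Models stopContainer: it only enqueues the kill event.
    void enqueueKill(String containerId) {
        dispatcherQueue.add(containerId);
    }

    // Models the dispatcher handling one event and marking completion.
    void dispatchOne() {
        String containerId = dispatcherQueue.poll();
        if (containerId != null) {
            completedContainers.add(containerId);
        }
    }

    // A heartbeat reports whatever is marked completed at this instant.
    List<String> heartbeat() {
        return new ArrayList<>(completedContainers);
    }

    // Ordering from point 7: the heartbeat fires before the dispatcher runs.
    static List<String> heartbeatOnStop(String containerId) {
        OutOfBandHeartbeatSketch nm = new OutOfBandHeartbeatSketch();
        nm.enqueueKill(containerId);
        List<String> reported = nm.heartbeat();
        nm.dispatchOne();
        return reported;
    }

    // Suggested ordering: heartbeat only after the container is marked completed.
    static List<String> heartbeatOnExit(String containerId) {
        OutOfBandHeartbeatSketch nm = new OutOfBandHeartbeatSketch();
        nm.enqueueKill(containerId);
        nm.dispatchOne();
        return nm.heartbeat();
    }

    public static void main(String[] args) {
        System.out.println("on stop: " + heartbeatOnStop("container_1"));
        System.out.println("on exit: " + heartbeatOnExit("container_1"));
    }
}
```

In this model the heartbeat triggered from the stop path can go out with an 
empty completed list, while the one triggered on exit always carries the 
container, which is the motivation for the question above.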

> Container killed before hprof dumps profile.out
> -----------------------------------------------
>
>                 Key: MAPREDUCE-5465
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5465
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am, mrv2
>    Affects Versions: trunk, 2.0.3-alpha
>            Reporter: Radim Kolar
>            Assignee: Ming Ma
>         Attachments: MAPREDUCE-5465-2.patch, MAPREDUCE-5465-3.patch, 
> MAPREDUCE-5465-4.patch, MAPREDUCE-5465-5.patch, MAPREDUCE-5465-6.patch, 
> MAPREDUCE-5465.patch
>
>
> If there is profiling enabled for mapper or reducer then hprof dumps 
> profile.out at process exit. It is dumped after task signaled to AM that work 
> is finished.
> AM kills container with finished work without waiting for hprof to finish 
> dumps. If hprof is dumping larger outputs (such as with depth=4 while depth=3 
> works) , it could not finish dump in time before being killed making entire 
> dump unusable because cpu and heap stats are missing.
> There needs to be better delay before container is killed if profiling is 
> enabled.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
