[ 
https://issues.apache.org/jira/browse/TEZ-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278074#comment-14278074
 ] 

Jeff Zhang commented on TEZ-1069:
---------------------------------

bq. My thinking was more along the lines for querying the VertexManager to 
allow it to modify the task specifications in such cases. Changing the resource 
is not enough. One would also need to change the java opts. For the latter, we 
would need to write a java opts parser if the user had specified their own java 
opts ( Xmx, etc ).
Agreed, the VertexManager (VM) is the better place for this kind of change, 
and I will update the java opts as well.
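
To illustrate the java opts update, here is a minimal sketch of rewriting a 
user-supplied -Xmx setting. It assumes the heap size is given as 
-Xmx<number>[k|m|g] and simply scales it by a factor. JavaOptsResizer and 
scaleXmx are hypothetical names, not part of Tez, and a real parser would 
also have to handle other memory-related flags. The container resource would 
need to be raised alongside, since changing only one of the two is not enough.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not a Tez API): scales -Xmx values in a java opts string. */
final class JavaOptsResizer {
    // Matches a whitespace-delimited -Xmx<number><unit> token,
    // e.g. -Xmx1024m, -Xmx2G, or -Xmx536870912 (plain bytes).
    private static final Pattern XMX =
        Pattern.compile("(?<!\\S)-Xmx(\\d+)([kKmMgG]?)(?!\\S)");

    private JavaOptsResizer() {}

    /** Returns javaOpts with every -Xmx value multiplied by factor, expressed in MB. */
    static String scaleXmx(String javaOpts, double factor) {
        Matcher m = XMX.matcher(javaOpts);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            long bytes = toBytes(Long.parseLong(m.group(1)), m.group(2));
            // Round to whole megabytes and never go below 1 MB.
            long scaledMb = Math.max(1L, Math.round(bytes * factor / (1024.0 * 1024.0)));
            m.appendReplacement(out, "-Xmx" + scaledMb + "m");
        }
        m.appendTail(out); // copies the remaining, untouched options
        return out.toString();
    }

    private static long toBytes(long value, String unit) {
        switch (unit.toLowerCase()) {
            case "k": return value << 10;
            case "m": return value << 20;
            case "g": return value << 30;
            default:  return value; // no unit means bytes
        }
    }
}
```

For example, scaleXmx("-Xmx1024m -XX:+UseG1GC", 1.5) leaves the GC flag 
untouched and only rewrites the heap size. Options without an -Xmx entry are 
returned unchanged, so a default would still have to be chosen for them.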

bq. Isn't it better to setup hooks in case of OOM failures for a VertexManager 
to resize the task? Furthermore, a lot of OOM failures are due to data skew 
where one task is affected but the rest are not.
I think I would add a method to the VM so that it gets notified of its task 
attempt failures and can decide whether to resize the task. The rough idea is 
to resize only the task whose attempt failed with OOM, and once the number of 
tasks with OOM attempt failures reaches some threshold, resize the whole 
vertex.
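
The rough idea could be sketched as follows. This is only an illustration 
under assumptions: OomResizePolicy, onTaskAttemptFailed and the Decision enum 
are hypothetical names rather than existing Tez APIs, the threshold is 
counted in distinct tasks, and how an attempt is classified as an OOM failure 
is left open.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical policy sketch; none of these names exist in Tez. */
final class OomResizePolicy {
    enum Decision { NONE, RESIZE_TASK, RESIZE_VERTEX }

    private final int vertexThreshold;
    // Indexes of tasks that have had at least one attempt fail with OOM.
    private final Set<Integer> tasksWithOom = new HashSet<>();

    OomResizePolicy(int vertexThreshold) {
        if (vertexThreshold < 1) {
            throw new IllegalArgumentException("vertexThreshold must be >= 1");
        }
        this.vertexThreshold = vertexThreshold;
    }

    /** Called for each failed task attempt; returns what should be resized. */
    Decision onTaskAttemptFailed(int taskIndex, boolean failedWithOom) {
        if (!failedWithOom) {
            return Decision.NONE; // other failure types are not handled here
        }
        tasksWithOom.add(taskIndex); // repeated failures of one task count once
        return tasksWithOom.size() >= vertexThreshold
            ? Decision.RESIZE_VERTEX
            : Decision.RESIZE_TASK;
    }
}
```

Counting distinct tasks rather than attempts keeps a single skewed task that 
fails repeatedly from triggering a resize of the whole vertex.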

bq. Last question on when should this increase be done? Should it be done on 
each attempt failure or only on the last attempt?
If we can identify that a task attempt failed due to OOM, I think the next 
attempt will most likely fail with OOM as well.


> Support ability to re-size a task attempt when previous attempts fail due to 
> resource constraints
> -------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-1069
>                 URL: https://issues.apache.org/jira/browse/TEZ-1069
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Hitesh Shah
>            Assignee: Jeff Zhang
>         Attachments: TEZ-1069-1.patch
>
>
> Consider a case where attempts for the final stage in a long DAG fail due to 
> out-of-memory errors. In such a scenario, the framework (or the base vertex 
> manager) should be able to change the task specifications on the fly to 
> trigger a re-run with modified specs. 
> Changes could include both java opts changes and container resource 
> requirements. 



