[ 
https://issues.apache.org/jira/browse/MAPREDUCE-4495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13478083#comment-13478083
 ] 

Alejandro Abdelnur commented on MAPREDUCE-4495:
-----------------------------------------------

@eric14, thanks for your comment. As I indicated at the end of my last
comment, replanning mid-job (writing a WF in-flight) is possible with the
WFLIB (at most, it may require minor tweaks). My suggestion of first
replacing the existing JobControl with one that runs a workflow (WFAM or Oozie)
is an initial step, which I believe would bring significant value (I
respectfully disagree with your 'does not seem very helpful') to a stable
version of Pig with minimal work. This is the same approach you've suggested
for the WFAM interacting with the MRAM via the JobClient API for the first cut,
so as not to require significant changes in the MRAM. Medium/long term I concur
with you on re-planning mid-job, and I would love to see details on the idea or
a design doc.

@revans2 (Bobby), thanks again for your comments; following up on them:

On *I am more curious about restarting the child AMs..*, I think it is the 
responsibility of each AM implementation to define what its recovery 
capabilities are (clean up and restart job from scratch or continue from a 
stable checkpoint).

On *The concept is great, I think that MR originally had that concept to
reestablish communication with its tasks to..*, note that we are talking at the
AM level, not the task level: you'd be using the client API of an AM to
reconnect; after that it is up to the AM's capabilities. This is how Oozie
works today with WF action jobs: when Oozie goes down and comes back, it
reconnects to Hadoop with the jobID, checks the job status and continues as
appropriate.
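
To make the reconnect step above concrete, here is a minimal Java sketch. It
is not Oozie or Hadoop code: JobState, NextStep and the statusLookup function
are hypothetical stand-ins for an AM's client API, and resubmitting a job
whose status cannot be found is an assumption made only for illustration.

```java
import java.util.Map;
import java.util.function.Function;

final class ReconnectSketch {
    /** Hypothetical job states; not the Hadoop JobStatus API. */
    enum JobState { RUNNING, SUCCEEDED, FAILED, UNKNOWN }

    /** What the workflow engine does next for the action that owns the job. */
    enum NextStep { KEEP_WAITING, ADVANCE_WORKFLOW, HANDLE_FAILURE, RESUBMIT }

    /**
     * Decide how to continue after the workflow engine comes back up.
     * statusLookup stands in for the AM's client API: job ID in, state out.
     */
    static NextStep resume(String jobId, Function<String, JobState> statusLookup) {
        JobState state = statusLookup.apply(jobId);
        switch (state) {
            case RUNNING:
                return NextStep.KEEP_WAITING;      // job survived the outage
            case SUCCEEDED:
                return NextStep.ADVANCE_WORKFLOW;  // move on to the next node
            case FAILED:
                return NextStep.HANDLE_FAILURE;    // apply the retry/error policy
            default:
                return NextStep.RESUBMIT;          // assumption: unknown job is rerun
        }
    }

    public static void main(String[] args) {
        // Hypothetical job IDs recorded by the engine before it went down.
        Map<String, JobState> cluster = Map.of(
                "job_a", JobState.RUNNING,
                "job_b", JobState.SUCCEEDED);
        Function<String, JobState> lookup =
                id -> cluster.getOrDefault(id, JobState.UNKNOWN);
        for (String id : new String[] {"job_a", "job_b", "job_c"}) {
            System.out.println(id + " -> " + resume(id, lookup));
        }
    }
}
```

The only state the engine has to persist is the job ID; everything else is
re-derived from the AM after the restart.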

On *My point is that just replacing the default container allocator..*, agreed.
Last Friday at the YARN meetup I was suggesting (for other reasons (1)) that we
should add a new method to the AM-NM protocol to be able to restart an existing
container with a subset of its currently allocated resources. On such a call
the NM would return the unused resources to the RM and restart the container
with the provided restart command.
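
As a rough illustration of the accounting such a call implies, here is a
small Java sketch. Nothing in it exists in YARN: planRestart, RestartPlan and
the "serve-shuffle" command are hypothetical names, and resources are reduced
to memory only to keep the example short.

```java
final class ContainerRestartSketch {
    /** Outcome of a hypothetical "restart with fewer resources" call. */
    static final class RestartPlan {
        final int keptMemoryMb;      // what the restarted container keeps
        final int releasedMemoryMb;  // what the NM would hand back to the RM
        final String restartCommand; // command used to relaunch the container

        RestartPlan(int keptMemoryMb, int releasedMemoryMb, String restartCommand) {
            this.keptMemoryMb = keptMemoryMb;
            this.releasedMemoryMb = releasedMemoryMb;
            this.restartCommand = restartCommand;
        }
    }

    /**
     * Split the current allocation into the part kept by the restarted
     * container and the part released. The request must be a non-empty
     * subset of what is currently allocated.
     */
    static RestartPlan planRestart(int allocatedMemoryMb, int requestedMemoryMb,
                                   String restartCommand) {
        if (requestedMemoryMb <= 0 || requestedMemoryMb > allocatedMemoryMb) {
            throw new IllegalArgumentException(
                    "requested memory must be in (0, allocated]");
        }
        return new RestartPlan(requestedMemoryMb,
                allocatedMemoryMb - requestedMemoryMb, restartCommand);
    }

    public static void main(String[] args) {
        // A finished map container shrinks to a small footprint to serve shuffle data.
        RestartPlan plan = planRestart(2048, 256, "serve-shuffle");
        System.out.println("kept=" + plan.keptMemoryMb
                + "MB released=" + plan.releasedMemoryMb + "MB");
    }
}
```

Rejecting a request larger than the current allocation keeps the call a pure
shrink operation, so the RM never has to grant anything new.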

On *I get that you are constrained by the DAG,..*

Keeping things as they are today in the WF lib, if you have a fork, the nodes
are started in the order they are defined. If you want one AM to have priority
over another, we could easily add a priority attribute to actions that is used
in parallel runs to decide which one gets started first.
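
A minimal Java sketch of that ordering rule follows. The Action class and its
priority field are hypothetical (the WF lib has no such attribute today), and
treating a higher number as "starts first" is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ForkOrderSketch {
    /** Hypothetical action node: a name plus the proposed priority attribute. */
    static final class Action {
        final String name;
        final int priority; // assumption: higher values start first

        Action(String name, int priority) {
            this.name = name;
            this.priority = priority;
        }
    }

    /** Names in start order: priority descending, definition order on ties. */
    static List<String> startOrder(List<Action> definedOrder) {
        List<Action> sorted = new ArrayList<>(definedOrder);
        // List.sort is stable, so actions with equal priority keep the order
        // in which they were defined, which matches today's behavior.
        sorted.sort(Comparator.comparingInt((Action a) -> a.priority).reversed());
        List<String> names = new ArrayList<>();
        for (Action a : sorted) {
            names.add(a.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Action> fork = List.of(
                new Action("first", 0),
                new Action("urgent", 5),
                new Action("second", 0));
        System.out.println(startOrder(fork));
    }
}
```

With every priority left at the same default value, the result is exactly the
definition order, so existing workflows would behave as they do now.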

On *The MRAM currently does not do anything to allow for clients..*, the WFAM's
child AMs are an implementation detail in my mind; they should not be
visible to the WFAM client.

On *I know that Rob Parker and Jason Lowe..*, I'd love to get details on that.

(1) The reason was that, in the case of MR jobs, after the Map task completes
we could restart the container with a very small footprint to serve the
shuffle data. By doing that we could remove the shuffle service from the NM,
which has no business being there.

                
> Workflow Application Master in YARN
> -----------------------------------
>
>                 Key: MAPREDUCE-4495
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4495
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 2.0.0-alpha
>            Reporter: Bo Wang
>            Assignee: Bo Wang
>         Attachments: MAPREDUCE-4495-v1.1.patch, MAPREDUCE-4495-v1.patch, 
> MapReduceWorkflowAM.pdf, yapp_proposal.txt
>
>
> It is useful to have a workflow application master, which will be capable of 
> running a DAG of jobs. The workflow client submits a DAG request to the AM 
> and then the AM will manage the life cycle of this application in terms of 
> requesting the needed resources from the RM, and starting, monitoring and 
> retrying the application's individual tasks.
> Compared to running Oozie with the current MapReduce Application Master, 
> these are some of the advantages:
>  - Fewer resources consumed, since only one application master will 
> be spawned for the whole workflow.
>  - Reuse of resources, since the same resources can be used by multiple 
> consecutive jobs in the workflow (no need to request/wait for resources for 
> every individual job from the central RM).
>  - More optimization opportunities in terms of collective resource requests.
>  - Optimization opportunities in terms of rewriting and composing jobs in the 
> workflow (e.g. pushing down Mappers).
>  - This Application Master can be reused/extended by higher systems like Pig 
> and Hive to provide an optimized way of running their workflows.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
