Ahh... I forgot about OOZIE-994. My bad, I suggested that change. Everything makes sense now. Thanks!
On 8/7/13 7:38 PM, "Robert Kanter" <[email protected]> wrote:

>The behavior where the ActionCheckXCommand calls handleNonTransient() with
>START_MANUAL when the JT can't be reached after the retries and on RESUME
>command will resubmit the job was something I did for OOZIE-994. In
>hindsight, we shouldn't have done it that way.
>
>Yes, it will fail if job recovery is not enabled in the JT/RM; but I think
>this is the more correct behavior as this is something that the external
>system should be taking care of.
>
>- Robert
>
>On Wed, Aug 7, 2013 at 5:05 PM, Virag Kothari <[email protected]> wrote:
>
>> Alejandro, I agree that functionality would be preserved if action is left
>> in RUNNING during a transient error.
>>
>> Few questions
>>
>> 1) START_MANUAL seems to be set only by handleNonTransient(). If this is a
>> bug, do you know for what purpose it was introduced?
>> I thought having START_MANUAL is a way to distinguish between Oozie
>> suspending job due to transient error and a user manually suspending the
>> job.
>>
>> 2) With no oozie retry on 'RESUME', jobs will fail if JT/RM recovery is
>> not enabled. And it seems that YARN recovery is still not there as
>> YARN-128 is not yet committed (Not sure if looking at right JIRA).
>> It's a concern for us as we ask users to RESUME their jobs after hadoop
>> upgrade. Now they have to resume wf and rerun the failed actions.
>>
>> Thanks,
>> Virag
>>
>> On 8/7/13 2:48 PM, "Alejandro Abdelnur" <[email protected]> wrote:
>>
>> >[joining the party a bit late]
>> >
>> >I just had an offline call with RobertK who brought me up to speed.
>> >
>> >By design, Oozie will retry starting a workflow action ONLY if it couldn't
>> >start the WF action before. If Oozie started the WF action successfully,
>> >the WF action state goes into RUNNING, and from then on it is the
>> >responsibility of the external system running the action to recover it.
>> >Oozie will not attempt any recovery after that point.
>> >
>> >This means that with Hadoop (JT or YARN) job recovery, the launcher job
>> >will be recovered by Hadoop without any intervention from Oozie.
>> >
>> >It is clear that to have recovery for MR action we need to get rid of the
>> >swap and just hold onto the MR launcher job as we do for the other
>> >actions.
>> >
>> >Now, on the whole discussion on the ActionCheckXCommand retries. We have a
>> >bug in the ActionCheckXCommand: on handleNonTransient() we should not
>> >change the status of the WF action to START_MANUAL, we should leave it in
>> >RUNNING. handleNonTransient() will suspend the WF job thus switching off
>> >action checks. On WF job resume, the action checks will start working
>> >again, and if Hadoop has job recovery, things will work fine. Else the WF
>> >action will fail because the launcher job is not known (the external system
>> >does not know how to recover jobs). Because we are resetting the status to
>> >START_MANUAL we are dialing back on the lifecycle of the action; that is
>> >incorrect and that creates the race condition that introduces 2 jobs.
>> >
>> >So again, Oozie is not responsible for recovering actions. With that
>> >assumption, fixing the handleNonTransient() to leave the status in RUNNING
>> >and getting rid of the RM swap logic we should be good.
>> >
>> >Thoughts?
>> >
>> >On Wed, Aug 7, 2013 at 12:27 AM, Virag Kothari <[email protected]> wrote:
>> >
>> >> Robert,
>> >>
>> >> I have been thinking on this for a while and have few more concerns if the
>> >> job retries are not streamlined through Oozie.
>> >>
>> >> 1) Till the JT finishes recovering the job, the wf job/wf action status
>> >> will be SUSPENDED/START_MANUAL.
>> >> Isn't it misleading as the hadoop job is RUNNING while oozie incorrectly
>> >> shows as SUSPENDED? Even if we allow this, after the job completes, what if
>> >> the callback is lost or oozie is down?
>> >> To prevent the job being in SUSPENDED forever, we need to hack our
>> >> services to pull SUSPENDED/START_MANUAL jobs from db and update their
>> >> status.
>> >>
>> >> 2) Should we allow failing of the user RESUME command if the action is in
>> >> START_MANUAL to prevent the race condition we were discussing?
>> >> This would mean changing the semantics of the states.
>> >>
>> >> 3) Confused on mapred.job.restart.recover. Reading
>> >> http://archive.cloudera.com/cdh4/cdh/4/mr1/mapred-default.html, it says
>> >> that the default value of this is true. So,
>> >> if mapred.jobtracker.restart.recover (system config) is already enabled,
>> >> is job recovery on by default? Also, does recover mean the job will start
>> >> where it left from or is it just plain restart?
>> >>
>> >> In summary, IMO allowing hadoop to recover jobs independently bypassing
>> >> Oozie isn't trivial. It would have helped if the JT produced notification
>> >> when it comes online, so Oozie could retry after consuming those. But
>> >> currently, notification only happens when task completes.
>> >>
>> >> An alternate approach is to modify the semantics of START_MANUAL.
>> >> Currently Oozie puts the action/job in START_MANUAL/SUSPENDED and expects
>> >> the user to resume it. We can change this and make Oozie retry the
>> >> START_MANUAL actions at configurable interval (~30 mins or some scheme
>> >> like exp back off). Of course, this is bad as oozie will keep polling
>> >> hadoop at some interval but manual resume of jobs who have faced transient
>> >> errors will no longer be mandatory.
>> >>
>> >> --Virag
>> >>
>> >> On 8/6/13 4:38 PM, "Robert Kanter" <[email protected]> wrote:
>> >>
>> >> >If ActionCheckX is trying to retry, and the JT recovers the job, that
>> >> >should be fine. The "retry" is to simply try connecting to the JT to get
>> >> >the status for the job.
>> >> >If the user issues a "RESUME" for a START_MANUAL
>> >> >job, then yes, Oozie will try to resubmit a new job for that action and
>> >> >we'd have two of them if the JT also recovers it.
>> >> >
>> >> >What if we modified the ActionStartXCommand/ResumeActionXCommand
>> >> >precondition to check if the action already has a Job ID that is valid
>> >> >(i.e. not unknown to the JT), then it fails the precondition check or
>> >> >something similar?
>> >> >
>> >> >- Robert
>> >> >
>> >> >On Tue, Aug 6, 2013 at 4:23 PM, Virag Kothari <[email protected]> wrote:
>> >> >
>> >> >> ActionCheckX first retries for a configurable amount of time and then
>> >> >> sets the status to START_MANUAL.
>> >> >> So, the problem might happen when JT recovers the job during the same time
>> >> >> when 1) ActionCheckX is trying to retry or 2) the user issues a "RESUME"
>> >> >> for a START_MANUAL job.
>> >> >> We have to ensure that this doesn't happen otherwise we will have two
>> >> >> hadoop jobs for the same action.
>> >> >> The callback happens only when the task is completed which might be too
>> >> >> late. During that time, Oozie might have already submitted a new hadoop
>> >> >> job for that wf action.
>> >> >> So it doesn't seem straightforward to prevent Oozie from submitting a new
>> >> >> job if the JT is already recovering the older one.
>> >> >>
>> >> >> On 8/6/13 4:01 PM, "Robert Kanter" <[email protected]> wrote:
>> >> >>
>> >> >> >Yes, if JT recovers the job, it uses the same ID. If the JT comes up
>> >> >> >quickly and recovers the job, Oozie continues working just fine (without
>> >> >> >the ID swap issues discussed earlier). When the JT takes longer than the
>> >> >> >10min ActionCheck interval, and the action is START_MANUAL, that still
>> >> >> >needs to be figured out.
>> >> >> >
>> >> >> >I haven't tested on Hadoop 2.x yet, but I've been told that it should have
>> >> >> >the same behavior. The only differences are that the name of the property
>> >> >> >to enable recoverability on the server (not the job-level one) is
>> >> >> >different, obviously because it doesn't have "jobtracker" in it, and it can
>> >> >> >also recover the completed tasks, which shouldn't be a problem because the
>> >> >> >launcher jar has the one task. I'll of course double check this though.
>> >> >> >
>> >> >> >- Robert
>> >> >> >
>> >> >> >On Tue, Aug 6, 2013 at 3:23 PM, Rohini Palaniswamy
>> >> >> ><[email protected]> wrote:
>> >> >> >
>> >> >> >> Robert,
>> >> >> >> You will not get an unknown hadoop job if JT has retry configured, right?
>> >> >> >> What happens in that case? Especially what happens when Oozie retry happens
>> >> >> >> when JT comes up quickly? Also do you know what is the behaviour with
>> >> >> >> Hadoop 2.x?
>> >> >> >>
>> >> >> >> Mayank,
>> >> >> >> OOZIE-1231 already has the changes to show MapReduce job id in the Child
>> >> >> >> job page to be consistent with other job types. The v1 API has the older
>> >> >> >> behaviour with map job url in externalId, while v2 API has it in
>> >> >> >> childjobids. So there is a UI change but v1 REST API has not changed. But
>> >> >> >> OOZIE-1231 has not changed any code with respect to id swap.
>> >> >> >>
>> >> >> >> Regards,
>> >> >> >> Rohini
>> >> >> >>
>> >> >> >> On Tue, Aug 6, 2013 at 2:39 PM, Robert Kanter <[email protected]>
>> >> >> >> wrote:
>> >> >> >>
>> >> >> >> > Ya, I saw a precondition failed message.
>> >> >> >> >
>> >> >> >> > I just tried out what happens when the job is SUSPENDED, the action is
>> >> >> >> > START_MANUAL, and the JT recovers the hadoop job: It doesn't continue the
>> >> >> >> > workflow. It fails the eagerVerifyPrecondition from
>> >> >> >> > CompletedActionXCommand because the action isn't RUNNING. Perhaps we
>> >> >> >> > should make the CallbackService change the status in this situation?
>> >> >> >> >
>> >> >> >> > Just to clarify, the above only happens when the JT has been down long
>> >> >> >> > enough that the ActionCheckXCommand (every 10min by default) + the retries
>> >> >> >> > (3 x 1min) happen. If it comes back sooner than that, everything works
>> >> >> >> > fine.
>> >> >> >> >
>> >> >> >> > thanks
>> >> >> >> > - Robert
>> >> >> >> >
>> >> >> >> > On Tue, Aug 6, 2013 at 1:43 PM, Virag Kothari <[email protected]> wrote:
>> >> >> >> >
>> >> >> >> > > Oh..okay. Seems like RecoveryService queues the StartX command but the
>> >> >> >> > > verifyPrecondition() fails as the wf job is
>> >> >> >> > > Suspended (Plz verify this from logs).
>> >> >> >> > >
>> >> >> >> > > In that case, if Oozie is not auto-retrying and resubmitting, then it
>> >> >> >> > > seems fair to have the JT recover the job.
>> >> >> >> > > But if JT recovers the job, can we make sure that the workflow job
>> >> >> >> > > transits to RUNNING from SUSPENDED and wf action from START_MANUAL to
>> >> >> >> > > RUNNING?
>> >> >> >> > > It should not happen that the user resumes the job which makes Oozie
>> >> >> >> > > submit a new hadoop job while the JT is also recovering the same job.
>> >> >> >> > > Also, I think the error can still be considered transient from Oozie
>> >> >> >> > > perspective as it is temporary depending on state of JT.
>> >> >> >> > >
>> >> >> >> > > Thanks,
>> >> >> >> > > Virag
>> >> >> >> > >
>> >> >> >> > > On 8/6/13 1:12 PM, "Robert Kanter" <[email protected]> wrote:
>> >> >> >> > >
>> >> >> >> > > >Virag,
>> >> >> >> > > >I just tested out killing the JT and waiting for the Checker service to
>> >> >> >> > > >retry and give up: the action goes to START_MANUAL and the job gets
>> >> >> >> > > >SUSPENDED. I waited around long enough, but the RecoveryService didn't do
>> >> >> >> > > >anything. Does it kick in for you? As a side note, looking at the code,
>> >> >> >> > > >the RecoveryService looks like it can handle START_MANUAL, END_MANUAL, and
>> >> >> >> > > >USER_RETRY, which all sound like things the user should be doing; is it
>> >> >> >> > > >correct that RecoveryService is handling these?
>> >> >> >> > > >The Unknown Hadoop Job error happens when the JT comes back in time because
>> >> >> >> > > >it won't know about the old ID if it's not recovering jobs. So, Oozie tries
>> >> >> >> > > >to ask it about a job that no longer exists. I'm not sure that this should
>> >> >> >> > > >be a transient error because there's no way to determine if it's because the
>> >> >> >> > > >JT restarted and Oozie should resubmit the job or if something else
>> >> >> >> > > >happened.
>> >> >> >> > > >
>> >> >> >> > > >Mayank,
>> >> >> >> > > >That is a good point.
>> >> >> >> > > >We could either make a v3 API or add an oozie-site
>> >> >> >> > > >config to turn on/off the id swap behavior and keep the v2 API.
>> >> >> >> > > >
>> >> >> >> > > >thanks
>> >> >> >> > > >- Robert
>> >> >> >> > > >
>> >> >> >> > > >On Tue, Aug 6, 2013 at 10:48 AM, Mayank Bansal <[email protected]> wrote:
>> >> >> >> > > >
>> >> >> >> > > >> Robert,
>> >> >> >> > > >>
>> >> >> >> > > >> That's a break in backward compatibility. Until now, users have been used
>> >> >> >> > > >> to clicking on the link to go to the MR page.
>> >> >> >> > > >>
>> >> >> >> > > >> Is there a better way to handle this?
>> >> >> >> > > >>
>> >> >> >> > > >> Thanks,
>> >> >> >> > > >> Mayank
>> >> >> >> > > >>
>> >> >> >> > > >> On Tue, Aug 6, 2013 at 10:42 AM, Robert Kanter <[email protected]>
>> >> >> >> > > >> wrote:
>> >> >> >> > > >>
>> >> >> >> > > >> > Mona,
>> >> >> >> > > >> > As far as I'm aware, the "retry" that Oozie is doing is just retrying to
>> >> >> >> > > >> > connect to the JT (which is why when the JT comes back up, Oozie
>> >> >> >> > > >> > can continue monitoring the hadoop job if it still has the same ID); it
>> >> >> >> > > >> > doesn't try to submit the job again as part of the "retry".
>> >> >> >> > > >> >
>> >> >> >> > > >> > Mayank,
>> >> >> >> > > >> > We can put the ID for the actual job in the Child IDs tab (like with
>> >> >> >> > > >> > Pig).
>> >> >> >> > > >> >
>> >> >> >> > > >> > - Robert
>> >> >> >> > > >> >
>> >> >> >> > > >> > On Tue, Aug 6, 2013 at 10:41 AM, Mayank Bansal <[email protected]>
>> >> >> >> > > >> > wrote:
>> >> >> >> > > >> >
>> >> >> >> > > >> > > I agree, we should handle these two scenarios. I am ok with changing the
>> >> >> >> > > >> > > launcher behavior for MR; however, if we remove the id swap then how do we
>> >> >> >> > > >> > > navigate to MR jobs from UI as we do right now?
>> >> >> >> > > >> > >
>> >> >> >> > > >> > > Thanks,
>> >> >> >> > > >> > > Mayank
>> >> >> >> > > >> > >
>> >> >> >> > > >> > > On Tue, Aug 6, 2013 at 10:24 AM, Robert Kanter <[email protected]>
>> >> >> >> > > >> > > wrote:
>> >> >> >> > > >> > >
>> >> >> >> > > >> > > > Suppose we leave the MR ID swap thing as is but set the launcher recover to
>> >> >> >> > > >> > > > 0 and job to 1; then consider these two scenarios:
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > 1. JT gets restarted during the launcher job but before the launcher job
>> >> >> >> > > >> > > > actually launches the real job:
>> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
>> >> >> >> > > >> > > >      - The real job was never launched
>> >> >> >> > > >> > > >      ---> Action never completes and Oozie marks it as failed
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > 2. Launcher job submits the real job, but JT gets restarted before the
>> >> >> >> > > >> > > > Oozie server has a chance to swap IDs (it's not an atomic operation):
>> >> >> >> > > >> > > >      - The launcher job won't be recovered because we told it not to
>> >> >> >> > > >> > > >      - The real job will be recovered and finish successfully
>> >> >> >> > > >> > > >      ---> Oozie marks the action as failed even though the actual job
>> >> >> >> > > >> > > >      succeeded because it didn't know about the ID swap
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > It would only work for the case where the JT gets restarted after the ID
>> >> >> >> > > >> > > > swap occurs.
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > - Robert
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > On Tue, Aug 6, 2013 at 10:17 AM, Mayank Bansal <[email protected]> wrote:
>> >> >> >> > > >> > > >
>> >> >> >> > > >> > > > > Hi Robert,
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > +1 for oozie to set launcher to 1 and 0 to jobs for recovery in all the
>> >> >> >> > > >> > > > > cases except MR.
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > After the ID is swapped, Oozie only knows about the MR job, doesn't it?
>> >> >> >> > > >> > > > > Then there should not be any problem.
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > If we set MR launcher recover to 0 and job to 1 then the job will
>> >> >> >> > > >> > > > > succeed in case of JT restart.
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > Am I missing something?
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > Thanks,
>> >> >> >> > > >> > > > > Mayank
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > On Tue, Aug 6, 2013 at 9:59 AM, Robert Kanter <[email protected]>
>> >> >> >> > > >> > > > > wrote:
>> >> >> >> > > >> > > > >
>> >> >> >> > > >> > > > > > I think you usually just get the "Unknown Hadoop Job" error message because
>> >> >> >> > > >> > > > > > Oozie tries to look up the Hadoop Job ID it already has, but the JT no
>> >> >> >> > > >> > > > > > longer has that ID because it was restarted. With JT Recoverability turned
>> >> >> >> > > >> > > > > > on, it will restart the job using the same ID, so Oozie continues just
>> >> >> >> > > >> > > > > > fine.
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > > > - Robert
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > > > On Mon, Aug 5, 2013 at 5:27 PM, Rohini Palaniswamy
>> >> >> >> > > >> > > > > > <[email protected]> wrote:
>> >> >> >> > > >> > > > > >
>> >> >> >> > > >> > > > > > > Wouldn't oozie poll for the job status and decide that it has failed and
>> >> >> >> > > >> > > > > > > when JT comes up launch another one if retry is configured?
>> >> >> >> > > >> > > > > > >
>> >> >> >> > > >> > > > > > > On Mon, Aug 5, 2013 at 3:11 PM, Robert Kanter <[email protected]>
>> >> >> >> > > >> > > > > > > wrote:
>> >> >> >> > > >> > > > > > >
>> >> >> >> > > >> > > > > > > > Hi,
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > We looked into how to support Job Recoverability (i.e. the JT is restarted
>> >> >> >> > > >> > > > > > > > and it wants to restart the jobs that were running; similarly for YARN) and
>> >> >> >> > > >> > > > > > > > have a pretty simple solution for all of the action types except for
>> >> >> >> > > >> > > > > > > > MapReduce. If we set mapreduce.job.restart.recover=true for the launcher
>> >> >> >> > > >> > > > > > > > job and mapreduce.job.restart.recover=false for the jobs launched by the
>> >> >> >> > > >> > > > > > > > launcher, then when the JT restarts, it will recover the launcher job but
>> >> >> >> > > >> > > > > > > > not the child jobs -- the launcher job will then take care of relaunching
>> >> >> >> > > >> > > > > > > > the child jobs.
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > For MapReduce, because of the optimization with the id swap, this won't
>> >> >> >> > > >> > > > > > > > work.
>> >> >> >> > > >> > > > > > > > It would be very tricky, if it's even practical, to do something
>> >> >> >> > > >> > > > > > > > similar for the MR action. Instead, we think it would be best if we simply
>> >> >> >> > > >> > > > > > > > remove the MR optimization and make it just like the other action types. I
>> >> >> >> > > >> > > > > > > > know we normally don't want to remove optimizations, but there are many
>> >> >> >> > > >> > > > > > > > advantages in this case, and it's only saving a single Map slot for MR jobs
>> >> >> >> > > >> > > > > > > > only.
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > I've created OOZIE-1483 <https://issues.apache.org/jira/browse/OOZIE-1483>
>> >> >> >> > > >> > > > > > > > with more details and should have a patch soon.
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > Thoughts?
>> >> >> >> > > >> > > > > > > >
>> >> >> >> > > >> > > > > > > > thanks
>> >> >> >> > > >> > > > > > > > - Robert
>> >
>> >--
>> >Alejandro
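
For anyone reading this thread later, the two behaviours being compared can be sketched as a toy model. This is not Oozie code: ToyJobTracker, ToyWorkflow, reset_to_start_manual and simulate are hypothetical names made up for illustration, and the model assumes (as stated in the thread) that a recovering JT brings a job back under the same ID, that RESUME on a START_MANUAL action resubmits the job, and that RESUME on a RUNNING action only checks the existing job ID.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class ToyJobTracker:
    """Hypothetical stand-in for the JT/RM; not a Hadoop API."""
    recovery_enabled: bool
    jobs: set = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1))

    def submit(self) -> str:
        job_id = f"job_{next(self._ids)}"
        self.jobs.add(job_id)
        return job_id

    def restart(self) -> None:
        # With recovery on, jobs come back under the same ID;
        # without it, the restarted JT no longer knows them.
        if not self.recovery_enabled:
            self.jobs.clear()


@dataclass
class ToyWorkflow:
    """Hypothetical model of one WF job with a single action."""
    jt: ToyJobTracker
    reset_to_start_manual: bool  # True = behaviour described for OOZIE-994
    job_status: str = "RUNNING"
    action_status: str = "PREP"
    external_id: Optional[str] = None

    def start_action(self) -> None:
        self.external_id = self.jt.submit()
        self.action_status = "RUNNING"

    def handle_non_transient(self) -> None:
        # Called once the action check and its retries have given up.
        self.job_status = "SUSPENDED"
        if self.reset_to_start_manual:
            self.action_status = "START_MANUAL"

    def resume(self) -> None:
        self.job_status = "RUNNING"
        if self.action_status == "START_MANUAL":
            self.start_action()  # resubmits a new Hadoop job
        elif self.external_id not in self.jt.jobs:
            self.action_status = "FAILED"  # "Unknown Hadoop Job"


def simulate(reset_to_start_manual: bool, recovery_enabled: bool):
    """Return (action status, number of live Hadoop jobs) after RESUME."""
    jt = ToyJobTracker(recovery_enabled)
    wf = ToyWorkflow(jt, reset_to_start_manual)
    wf.start_action()
    jt.restart()               # JT is down past the check interval + retries
    wf.handle_non_transient()
    wf.resume()
    return wf.action_status, len(jt.jobs)


if __name__ == "__main__":
    for reset in (True, False):
        for recovery in (True, False):
            print(reset, recovery, simulate(reset, recovery))
```

In this sketch, only the combination of the START_MANUAL reset and JT recovery ends with two Hadoop jobs for one action, which is the race discussed above. Leaving the action in RUNNING never resubmits; the action simply fails on RESUME when the JT has no job recovery.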
