Thanks everyone. I have opened a JIRA and added a link to this discussion https://issues.apache.org/jira/browse/MAPREDUCE-5353
On Mon, Jun 24, 2013 at 8:42 PM, Devaraj k <devara...@huawei.com> wrote:

> It is not mandatory to have a running HS in the cluster. The user can
> still submit a job without an HS in the cluster, and may expect the
> Job/App End Notification.
>
> Thanks
> Devaraj k
>
> *From:* Alejandro Abdelnur [mailto:t...@cloudera.com]
> *Sent:* 24 June 2013 21:42
> *To:* user@hadoop.apache.org
> *Cc:* user@hadoop.apache.org
> *Subject:* Re: Job end notification does not always work (Hadoop 2.x)
>
> if we ought to do this in a yarn service it should be the RM or the HS.
> the RM is, IMO, the natural fit. the HS would be a good choice if we are
> concerned about the extra work this would cause in the RM. the problem
> with the current HS is that it is MR specific, we should generalize it
> for diff AM types.
>
> thx
>
> Alejandro
> (phone typing)
>
> On Jun 23, 2013, at 23:28, Devaraj k <devara...@huawei.com> wrote:
>
> Even if we handle all the failure cases in the AM for Job End
> Notification, we may miss cases like an abrupt kill of the AM when it is
> in its last retry. If we choose the NM to give the notification, the RM
> again needs to identify which NM should give the end-notification, as we
> don't have any direct protocol between AM and NM.
>
> I feel it would be better to move the End-Notification responsibility to
> the RM as a Yarn Service, because it ensures 100% notification and is
> also useful for other types of applications.
>
> Thanks
> Devaraj K
>
> *From:* Ravi Prakash [mailto:ravi...@ymail.com <ravi...@ymail.com>]
> *Sent:* 23 June 2013 19:01
> *To:* user@hadoop.apache.org
> *Subject:* Re: Job end notification does not always work (Hadoop 2.x)
>
> Hi Alejandro,
>
> Thanks for your reply! I was thinking more along the lines Prashant
> suggested, i.e. a failure during init() should still trigger an attempt
> to notify (by the AM). But now that you mention it, maybe we would be
> better off including this as a YARN feature after all (especially with
> all the new AMs being written). We could let the NM of the AM handle the
> notification burden, so that the RM doesn't get unduly taxed. Thoughts?
>
> Thanks
> Ravi
>
> ------------------------------
>
> *From:* Alejandro Abdelnur <t...@cloudera.com>
> *To:* "common-u...@hadoop.apache.org" <user@hadoop.apache.org>
> *Sent:* Saturday, June 22, 2013 7:37 PM
> *Subject:* Re: Job end notification does not always work (Hadoop 2.x)
>
> If the AM fails before doing the job end notification, at any stage of
> the execution and for whatever reason, the job end notification will
> never be delivered. There is no way to fix this unless the notification
> is done by a Yarn service. The 2 'candidate' services for doing this
> would be the RM and the HS. The job notification URL is in the job conf.
> The RM never sees the job conf; that rules the RM out unless we add, at
> AM registration time, the possibility to specify a callback URL. The HS
> has access to the job conf, but the HS is currently a 'passive' service.
>
> thx
>
> On Sat, Jun 22, 2013 at 3:48 PM, Arun C Murthy <a...@hortonworks.com>
> wrote:
>
> Prashanth,
>
> Please file a jira.
>
> One thing to be aware of - AMs get restarted a certain number of times
> for fault-tolerance - which means we can't just assume that failure of a
> single AM is equivalent to failure of the job.
>
> Only the ResourceManager is in the appropriate position to judge failure
> of AM v/s failure-of-job.
>
> hth,
> Arun
>
> On Jun 22, 2013, at 2:44 PM, Prashant Kommireddi <prash1...@gmail.com>
> wrote:
>
> Thanks Ravi.
>
> Well, in this case it's a no-effort :) A failure of AM init should be
> considered as failure of the job? I looked at the code and best-effort
> makes sense with respect to retry logic etc. You make a good point that
> there would be no notification in case the AM OOMs, but I do feel AM
> init failure should send a notification by other means.
>
> On Sat, Jun 22, 2013 at 2:38 PM, Ravi Prakash <ravi...@ymail.com> wrote:
>
> Hi Prashant,
>
> I would tend to agree with you. Although job-end notification is only a
> "best-effort" mechanism (i.e. we cannot always guarantee notification,
> for example when the AM OOMs), I agree with you that we can do more. If
> you feel strongly about this, please create a JIRA and possibly upload a
> patch.
>
> Thanks
> Ravi
>
> ------------------------------
>
> *From:* Prashant Kommireddi <prash1...@gmail.com>
> *To:* "user@hadoop.apache.org" <user@hadoop.apache.org>
> *Sent:* Thursday, June 20, 2013 9:45 PM
> *Subject:* Job end notification does not always work (Hadoop 2.x)
>
> Hello,
>
> I came across an issue that occurs with the job notification callbacks
> in MR2. It works fine if the Application Master has started, but does
> not send a callback if the initialization of the AM fails.
>
> Here is the code from MRAppMaster.java
>
> .....
> .......
>       // set job classloader if configured
>       MRApps.setJobClassLoader(conf);
>       initAndStartAppMaster(appMaster, conf, jobUserName);
>     } catch (Throwable t) {
>       LOG.fatal("Error starting MRAppMaster", t);
>       System.exit(1);
>     }
>   }
>
>   protected static void initAndStartAppMaster(final MRAppMaster appMaster,
>       final YarnConfiguration conf, String jobUserName) throws IOException,
>       InterruptedException {
>     UserGroupInformation.setConfiguration(conf);
>     UserGroupInformation appMasterUgi = UserGroupInformation
>         .createRemoteUser(jobUserName);
>     appMasterUgi.doAs(new PrivilegedExceptionAction<Object>() {
>       @Override
>       public Object run() throws Exception {
>         appMaster.init(conf);
>         appMaster.start();
>         if(appMaster.errorHappenedShutDown) {
>           throw new IOException("Was asked to shut down.");
>         }
>         return null;
>       }
>     });
>   }
>
> appMaster.init(conf) does not dispatch JobFinishEventHandler, which is
> responsible for sending an HTTP callback (via shutDownJob()). If there
> was an exception at this time, the process would simply terminate (via
> System.exit(1)).
>
> appMaster.start(), however, rightly uses the JobFinishEventHandler and
> things work fine.
>
> Shouldn't a failure on init(..) also send a callback suggesting the job
> failed?
>
> Thanks,
> Prashant
>
> --
> Arun C. Murthy
> Hortonworks Inc.
> http://hortonworks.com/
>
> --
> Alejandro
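
For anyone reading this thread later, here is a rough sketch of the behaviour asked for in the original post (notify when init fails), combined with Arun's caveat that a single AM failure is not necessarily a job failure. It is not Hadoop code and not the fix tracked in the JIRA. Startup, startWithNotification, the lastAttempt flag, the job id and the example URL are all hypothetical names made up for illustration, and the "FAILED" status string is an assumption. The only Hadoop detail relied on is that the job end notification URL may contain the $jobId and $jobStatus placeholders. The HTTP call itself is left behind a Consumer so the sketch stays self-contained.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class InitFailureNotificationSketch {

    /** Hypothetical stand-in for "init() then start()" of an application master. */
    interface Startup {
        void run() throws Exception;
    }

    /** Substitutes the $jobId and $jobStatus placeholders in the URL template. */
    static String buildUrl(String template, String jobId, String status) {
        return template.replace("$jobId", jobId).replace("$jobStatus", status);
    }

    /**
     * Runs the startup step. If it throws and this is the last attempt
     * (hypothetical flag), makes one best-effort notification through
     * the supplied sender. Returns true only when startup succeeded.
     */
    static boolean startWithNotification(Startup startup, String template,
            String jobId, boolean lastAttempt, Consumer<String> sender) {
        try {
            startup.run();
            return true;
        } catch (Exception e) {
            // Earlier attempts stay silent: the AM may be restarted and succeed.
            if (lastAttempt) {
                try {
                    sender.accept(buildUrl(template, jobId, "FAILED"));
                } catch (RuntimeException notifyError) {
                    // Best effort only: a failed notification must not mask the startup error.
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // The sender here only records the URL; a real one would issue an HTTP request.
        List<String> sent = new ArrayList<>();
        String template = "http://example.invalid/notify?id=$jobId&status=$jobStatus";
        boolean ok = startWithNotification(
                () -> { throw new IOException("init failed"); },
                template, "job_0001", true, sent::add);
        System.out.println("started=" + ok + " sent=" + sent);
    }
}
```

Note that this still cannot cover the cases Alejandro and Devaraj raise (AM killed abruptly or OOM on the last retry), which is the argument for moving the notification into a YARN service.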