Hey Chris,

Comments are below.
On 4/6/12 9:01 PM, "Mattmann, Chris A (388J)" <[email protected]> wrote:

>Hi Mike,
>
>Thanks, what a great page!
>
>I noticed this comment in the page:
>
>"At the time of this writing, jobs that cannot be added to the queue
>disappear...."
>
>I think we should be more clear than "disappear". They don't disappear.
>The Scheduler will try and send a Job to the BatchMgr, and if there is an
>exception, it tries to re-queue the Job back onto the JobStack. If it's
>unable to do that, then there is an issue, but it at the very least tries
>to re-queue the job if there was an issue.

The reason this blurb was put into the wiki is that when Gabe and I were looking through the Resource Manager code, this is what appeared to be happening. Check out the piece of code that tries to add a job, in JobStack.java:

    public String addJob(JobSpec spec) throws JobQueueException {
        String jobId = safeAddJob(spec);
        if (queue.size() != maxQueueSize) {
            LOG.log(Level.INFO, "Added Job: [" + spec.getJob().getId()
                    + "] to queue");
            queue.add(spec);
            spec.getJob().setStatus(JobStatus.QUEUED);
            safeUpdateJob(spec);
            return jobId;
        } else
            throw new JobQueueException("Reached max queue size: ["
                    + maxQueueSize + "]: Unable to add job: ["
                    + spec.getJob().getId() + "]");
    }

When the JobQueueException gets thrown, XmlRpcResourceManager.java catches it and throws a SchedulerException:

    private String genericHandleJob(Hashtable jobHash, Object jobIn)
            throws SchedulerException {
        ...
        try {
            jobId = scheduler.getJobQueue().addJob(spec);
        } catch (JobQueueException e) {
            LOG.log(Level.WARNING, "JobQueue exception adding job: Message: "
                    + e.getMessage());
            throw new SchedulerException(e.getMessage());
        }
        return jobId;
    }

From here, I can't see where the job gets re-queued if the max queue size is reached. If that's right, I can certainly file a JIRA issue.

>
>Also, in general, you will have as many jobs queued in Resource Manager
>land as the size of that job stack.
>So we should probably note that.
>
>Great resource here, thanks for putting it together!
>
>Cheers,
>Chris

Okay, I've noted this in the wiki as well.

Cheers,
Mike

>
>On Apr 5, 2012, at 8:43 AM, Cayanan, Michael D (388J) wrote:
>
>> Hi all,
>>
>> I recently added a page to the OODT wiki:
>>
>> https://cwiki.apache.org/confluence/display/OODT/Workflow+Manager+Help
>>
>> I ran into some issues with the Workflow hanging up and also Workflow
>> jobs being lost when trying to send them off to the Resource Manager and
>> just wanted to share what I learned and what to do if you run into these
>> issues as well.
>>
>> Cheers,
>> Mike
>
>
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Senior Computer Scientist
>NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>Office: 171-266B, Mailstop: 171-246
>Email: [email protected]
>WWW: http://sunset.usc.edu/~mattmann/
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Adjunct Assistant Professor, Computer Science Department
>University of Southern California, Los Angeles, CA 90089 USA
>++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
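For anyone who wants to see the add-job failure path from the JobStack/XmlRpcResourceManager discussion without standing up a Resource Manager, here is a minimal, self-contained Java sketch. It is not OODT code: BoundedJobStack, QueueFullException, SchedulerException and handleJob are hypothetical stand-ins for JobStack, JobQueueException, SchedulerException and genericHandleJob, and a plain Set stands in for whatever store safeAddJob writes to (an assumption). It keeps the same order of operations as the quoted addJob, where safeAddJob(spec) runs before the size check, and the caller rethrows without any re-queue attempt.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.Set;

public class JobQueueSketch {

    // Hypothetical stand-in for OODT's JobQueueException.
    static class QueueFullException extends Exception {
        QueueFullException(String message) { super(message); }
    }

    // Hypothetical stand-in for OODT's SchedulerException.
    static class SchedulerException extends Exception {
        SchedulerException(String message) { super(message); }
    }

    // Hypothetical stand-in for JobStack, mirroring the order of operations in addJob().
    static class BoundedJobStack {
        final int maxQueueSize;
        final Deque<String> queue = new ArrayDeque<>();
        // Assumption: a simple set stands in for whatever store safeAddJob() writes to.
        final Set<String> repository = new LinkedHashSet<>();

        BoundedJobStack(int maxQueueSize) {
            this.maxQueueSize = maxQueueSize;
        }

        String addJob(String jobId) throws QueueFullException {
            repository.add(jobId); // like safeAddJob(spec): runs before the size check
            if (queue.size() != maxQueueSize) {
                queue.add(jobId);
                return jobId;
            }
            throw new QueueFullException("Reached max queue size: [" + maxQueueSize
                    + "]: Unable to add job: [" + jobId + "]");
        }
    }

    // Hypothetical stand-in for genericHandleJob(): logs, wraps and rethrows, no re-queue.
    static String handleJob(BoundedJobStack stack, String jobId) throws SchedulerException {
        try {
            return stack.addJob(jobId);
        } catch (QueueFullException e) {
            System.out.println("WARNING: JobQueue exception adding job: Message: " + e.getMessage());
            throw new SchedulerException(e.getMessage());
        }
    }

    public static void main(String[] args) {
        BoundedJobStack stack = new BoundedJobStack(2);
        for (String id : new String[] {"job-1", "job-2", "job-3"}) {
            try {
                handleJob(stack, id);
            } catch (SchedulerException e) {
                System.out.println("Rejected " + id);
            }
        }
        // job-3 ends up in the repository stand-in but not in the queue.
        System.out.println("queued = " + stack.queue);
        System.out.println("repository = " + stack.repository);
    }
}
```

With a capacity of 2, job-3 is rejected: it is recorded in the stand-in repository but never reaches the queue, and nothing in the catch block tries to put it back, which is the behavior the wiki note describes as jobs disappearing.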
