Hey Chris and Brian,

I filed a JIRA issue for this:
https://issues.apache.org/jira/browse/OODT-439

So for the wiki page that I just created, should I just reference this JIRA
issue on the page so that users know that this is a workaround (setting the
queue size of the resource manager)? Or should I remove it and document the
workaround with the JIRA issue as Brian has suggested?

I'm okay with either solution.

Thanks,
Mike

On 4/10/12 8:19 PM, "Mattmann, Chris A (388J)" <[email protected]> wrote:

>Hey BFost,
>
>Totally agreed here, and with Mike on it. This is an issue that we need
>to fix. Thanks to Mike and others for taking the time to document this,
>and I am +1 with Brian that along with the documentation, we should
>probably think of a strategy to fix this and implement it in 0.5. Mike,
>I think you offered to file a JIRA issue -- that offer still stand? :)
>
>Thanks!
>
>Cheers,
>Chris
>
>On Apr 10, 2012, at 10:58 AM, Brian Foster wrote:
>
>> hey chris,
>>
>> i believe mike is talking about the following case:
>>
>> 1) queue is full
>> 2) scheduler pops job from queue and begins trying to find a node for job
>> 3) queue now has 1 open slot
>> 4) another job is given to the resource manager and is placed in the queue
>> 5) queue is now full again
>> 6) scheduler fails to schedule popped job
>> 7) scheduler pushes job back into the queue
>> 8) queue is full so exception is thrown and job is lost
>>
>> -brian
>>
>> On Apr 10, 2012, at 07:08 AM, "Mattmann, Chris A (388J)"
>> <[email protected]> wrote:
>>
>>> Hi Mike,
>>>
>>> On Apr 9, 2012, at 9:12 AM, Cayanan, Michael D (388J) wrote:
>>>
>>> > Hey Chris,
>>> >
>>> > Comments are below.
>>> >>
>>> >> "At the time of this writing, jobs that cannot be added to the queue
>>> >> disappear...."
>>> >>
>>> >> I think we should be more clear than "disappear". They don't disappear.
>>> >> The Scheduler will try and send a Job to the BatchMgr, and if there is
>>> >> an exception, it tries to re-queue the Job back onto the JobStack.
>>> >> If it's unable to do that, then there is an issue, but it at the very
>>> >> least tries to re-queue the job if there was an issue.
>>> >
>>> > The reason this blurb was put into the wiki was because when Gabe and I
>>> > were looking through the Resource Manager code, this is what looks to be
>>> > happening. Check out the piece of code that tries to add a job:
>>>
>>> Reaching Max queue size is different than saying that jobs that cannot be
>>> added to the queue disappear. I think we should explicitly state:
>>>
>>> "At the time of this writing, when the queue has reached the max queue
>>> size, a message is logged by the Scheduler saying there is a Job Queue
>>> Exception adding a job to the queue, and then the Job is dropped."
>>>
>>> I think that's more accurate based on your code walk. I was thinking
>>> based on your above message that you were talking about Jobs that
>>> couldn't be Scheduled for whatever reason (e.g., the Batch Mgr being
>>> down, or a Batch Stub being down) in which case they are re-queued.
>>>
>>> Cheers,
>>> Chris
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Senior Computer Scientist
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 171-266B, Mailstop: 171-246
>>> Email: [email protected]
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Assistant Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
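
For reference, the eight-step case Brian lists in the thread can be replayed
with a small single-threaded sketch. This is not the Resource Manager code:
LostJobDemo, BoundedJobQueue, QueueFullException and replay are hypothetical
names made up for illustration, and in the real system the scheduler and job
submission run concurrently rather than in a fixed order. The sketch only
shows why a bounded queue combined with "pop, then push back on failure" can
lose a job.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical demo class; replays Brian's steps 1-8 in a fixed order.
class LostJobDemo {
    // Returns the name of the job that could not be re-queued, or null if none was lost.
    // Assumes maxSize >= 1.
    static String replay(int maxSize) {
        BoundedJobQueue queue = new BoundedJobQueue(maxSize);
        for (int i = 0; i < maxSize; i++) {
            queue.push("job-" + i);        // 1) queue is full
        }
        String popped = queue.pop();       // 2) scheduler pops a job; 3) one open slot
        queue.push("job-new");             // 4) another job arrives; 5) queue is full again
        // 6) assume scheduling of the popped job fails (for example, no node is free)
        try {
            queue.push(popped);            // 7) scheduler pushes the job back
            return null;
        } catch (QueueFullException e) {
            return popped;                 // 8) exception is thrown and the job is lost
        }
    }

    public static void main(String[] args) {
        System.out.println("lost job: " + replay(3));
    }
}

// Hypothetical stand-in for a size-limited job queue; not the OODT JobStack API.
class BoundedJobQueue {
    private final Deque<String> jobs = new ArrayDeque<>();
    private final int maxSize;

    BoundedJobQueue(int maxSize) {
        this.maxSize = maxSize;
    }

    // Rejects the job when the queue is already at capacity.
    void push(String job) {
        if (jobs.size() >= maxSize) {
            throw new QueueFullException("queue full, cannot add " + job);
        }
        jobs.addLast(job);
    }

    // Removes and returns the oldest job, or null when the queue is empty.
    String pop() {
        return jobs.pollFirst();
    }

    boolean contains(String job) {
        return jobs.contains(job);
    }
}

// Hypothetical unchecked exception signalling a full queue.
class QueueFullException extends RuntimeException {
    QueueFullException(String message) {
        super(message);
    }
}
```

With a capacity of 3 this prints "lost job: job-0": the job the scheduler
popped in step 2 has nowhere to go once the new arrival takes the free slot.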
