Reuti, Thanks for the quick response.
I am not running a script, but an executable (I also have "-b y" on the command line). The executable runs a job, and that job seems to finish correctly: it has its own log file, which looks correct, and the job takes the right amount of time. It is not trying to run another qrsh.

This is what it looks like is happening:

  On the submit host, I run:
      qrsh <command>
  <command> runs on an execution host on the grid and finishes successfully, with no error status (as reported in qacct).
  Back on the submit host, at around the time the grid job completes, qrsh returns an exit status of "1" and the message:
      Your "qrsh" request could not be scheduled, try again later.

Note that this happens on only a small percentage of the jobs, all running the same <command> tool with different options. The ones that fail are seemingly random.

I'm hoping someone can suggest a means of debugging this further. There is nothing in the qmaster spool messages log, and the qacct records for the jobs that fail look good as well. It looks like a problem with qrsh on the submit host that occurs only at the end of the job.

- Brian Small
Northwest Logic
Desk: 503-533-5800 x320
Mobile: 503-577-6869

> -----Original Message-----
> From: Reuti [mailto:[email protected]]
> Sent: Thursday, November 13, 2014 4:17 PM
> To: Brian Small
> Cc: [email protected]
> Subject: Re: [gridengine users] Small percentage of qrsh jobs failing on submit host, but successfully run on grid
>
> Hi,
>
> On 14.11.2014 at 00:34, Brian Small wrote:
>
> > Hello all, this is my first time posting to this mailing list.
> >
> > About 1% or less of our qrsh grid jobs are failing in an unusual way.
> >
> > We are running Open Grid Scheduler 2011.11 on CentOS 6.5.
> >
> > The small percentage of failing qrsh jobs get a non-zero exit status back to the submit host (exit status 1), and display this message:
>
> What do you start by `qrsh` - a binary or a script?
>
> This sounds like the probably started script wants to start another `qrsh`. In case it's a script, the first line with "#!/bin/sh -x" will list the executed commands.
>
> -- Reuti
>
> NB: The side effect of "-now n" is that the job will go to a queue of "qtype" set to "BATCH", while "-now y" will route to a queue with "qtype" being "INTERACTIVE" (the same applies when this option is used for `qsub`).
>
> > Your "qrsh" request could not be scheduled, try again later.
> >
> > Note, we do include the "-now n" option on the command line.
> >
> > Also the qacct log shows the job as having completed successfully:
> >
> > qsub_time    Thu Nov 13 14:17:47 2014
> > start_time   Thu Nov 13 14:21:13 2014
> > end_time     Thu Nov 13 14:25:15 2014
> > granted_pe   NONE
> > slots        1
> > failed       0
> > exit_status  0
> > ru_wallclock 242
> > ru_utime     226.439
> > ru_stime     5.383
> >
> > And reviewing the working directory, it does look like the job completed properly.
> >
> > I'm not sure how to take the next step in debugging this problem. Any advice?
> >
> > Brian Small
> > Northwest Logic
> > 1100 NW Compton Drive, Ste. 100
> > Beaverton, OR 97006
> > Desk - 503-533-5800 x-320
> > Cell - 503-577-6869
> > Fax: 503-533-5900
> > E-mail - [email protected]
> > Web - www.nwlogic.com
> >
> > _______________________________________________
> > users mailing list
> > [email protected]
> > https://gridengine.org/mailman/listinfo/users
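
One possible way to collect more data on the mismatch described above (qrsh returning 1 on the submit host while qacct reports failed 0 and exit_status 0) is to wrap each qrsh call and record what the submit host saw. The following is only a minimal Python sketch and not something from this thread: the helper names `parse_qacct`, `classify`, and `run_and_log` are hypothetical, it assumes the accounting text has the "key value" layout shown in the qacct excerpt, and it assumes the "could not be scheduled" message is written to stderr, which has not been verified here.

```python
import subprocess
import time


def parse_qacct(text):
    """Parse the 'key value' lines of one qacct record into a dict of strings."""
    record = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        # Skip blank lines and separator lines such as '=========='.
        if len(parts) == 2 and not parts[0].startswith("="):
            record[parts[0]] = parts[1].strip()
    return record


def classify(qrsh_status, record):
    """Compare the status qrsh returned with what accounting recorded."""
    job_ok = record.get("failed") == "0" and record.get("exit_status") == "0"
    if job_ok and qrsh_status == 0:
        return "ok"
    if job_ok:
        # The case from this thread: the job succeeded, qrsh did not.
        return "submit-side failure"
    return "job failure"


def run_and_log(qrsh_args, log_path):
    """Run qrsh and append a timestamp, exit status, and stderr to a log file."""
    # Assumption: the failure message goes to stderr; stdout is left untouched.
    proc = subprocess.run(["qrsh"] + list(qrsh_args),
                          stderr=subprocess.PIPE, text=True)
    with open(log_path, "a") as log:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        log.write("%s status=%d stderr=%r\n" % (stamp, proc.returncode, proc.stderr))
    return proc.returncode
```

The timestamps logged by `run_and_log` could then be compared with the end_time in the accounting record to see how closely the qrsh failure follows job completion, and `classify` separates the cases where only the submit side reported an error.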
