Do you have multiple nodes that qrsh jobs could be sent to?
I had a similar problem a couple of weeks ago. All the queues were set
up to be batch or interactive, so a qrsh job could be assigned to any
node in the cluster (that has since been fixed). So, the first step was
to see if the problem was on all nodes, or just one that every
successive qrsh job was sent to. To find this out, I used qrsh with the
'-l hostname=...' option inside a loop, like this:
    for host in host1 host2 host3; do  # customize the host list for your hostnames
        qrsh -l "hostname=$host"
    done
Yes, this is tedious and requires typing 'exit' repeatedly, but I found
out that a single host was causing the problem. Once I knew the host, I
was able to look at its logs and configuration more closely to find the
root of the problem.
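If typing 'exit' for every host gets old, a non-interactive variant of the same idea might look like the sketch below. It is only an illustration under a few assumptions: the function name probe_hosts and the QRSH_CMD override variable are made up for this example, the "-now n" option is taken from the original post, and running the trivial command "hostname" on each node is just one possible probe. With a failure rate around 1%, a single pass may well not reproduce the problem, so it may need to be run many times.

```shell
#!/bin/sh
# Hypothetical helper (not part of Grid Engine): run a trivial
# non-interactive qrsh job on each host and print its exit status.
# QRSH_CMD is an assumed override, useful for dry runs or testing;
# when unset, the real qrsh is used.
probe_hosts() {
    for host in "$@"; do
        status=0
        # Double quotes so that $host is expanded by the shell.
        ${QRSH_CMD:-qrsh} -now n -l "hostname=$host" hostname \
            >/dev/null 2>&1 || status=$?
        echo "$host exit_status=$status"
    done
}
```

It could then be called as "probe_hosts host1 host2 host3" (with your own hostnames), and any host that reports a non-zero exit status is a candidate for a closer look at its logs.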
Prentice
On 11/13/2014 06:34 PM, Brian Small wrote:
Hello all, this is my first time posting to this mailing list.
About 1% or less of our qrsh grid jobs are failing in an unusual way.
We are running Open Grid Scheduler 2011.11 on CentOS 6.5.
The small percentage of failing qrsh jobs get a non-zero exit status
back to the submit host (exit status 1), and display this message:
Your "qrsh" request could not be scheduled, try again later.
Note, we do include the "-now n" option on the command line.
Also the qacct log shows the job as having completed successfully:
    qsub_time     Thu Nov 13 14:17:47 2014
    start_time    Thu Nov 13 14:21:13 2014
    end_time      Thu Nov 13 14:25:15 2014
    granted_pe    NONE
    slots         1
    failed        0
    exit_status   0
    ru_wallclock  242
    ru_utime      226.439
    ru_stime      5.383
And reviewing the working directory, it does look like the job
completed properly.
I'm not sure how to take the next step in debugging this problem. Any
advice?
Brian Small
Northwest Logic
1100 NW Compton Drive, Ste. 100
Beaverton, OR 97006
Desk - 503-533-5800 x-320
Cell - 503-577-6869
Fax: 503-533-5900
E-mail - [email protected]
Web - www.nwlogic.com
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
--
Prentice Bisbal
Manager of Information Technology
Rutgers Discovery Informatics Institute (RDI2)
Rutgers University
http://rdi2.rutgers.edu