Do you have multiple nodes that qrsh jobs could be sent to?

I had a similar problem a couple of weeks ago. All the queues were set up to be batch or interactive, so a qrsh job could be assigned to any node in the cluster (that has since been fixed). So the first step was to see whether the problem occurred on all nodes, or on just one node that every successive qrsh job happened to be sent to. To find out, I used qrsh with the '-l hostname=...' option inside a loop, like this:


for host in host1 host2 host3; do  # customize the host list for your hostnames
    # double quotes so the shell expands $host; single quotes would pass it literally
    qrsh -l "hostname=$host"
done

Yes, this is tedious and requires typing 'exit' repeatedly, but I found out that a single host was causing the problem. Once I knew the host, I was able to look at its logs and configuration more closely to find the root of the problem.
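
If typing 'exit' gets old, a non-interactive variant of the same loop might help. This is only a sketch, not what I actually ran: it assumes that running a trivial command ('true') through qrsh is enough to trigger the failure, and that your qrsh returns a non-zero exit status when it does. The probe_hosts function and the QRSH_CMD variable are names made up for this example; QRSH_CMD is only there so the loop can be dry-run without a cluster.

```shell
#!/bin/sh
# Probe each execution host with a trivial, non-interactive qrsh job.
# QRSH_CMD is a hypothetical override (defaults to the real qrsh).
probe_hosts() {
    for host in "$@"; do
        # -now n and -l hostname=... are the options mentioned in this thread
        if "${QRSH_CMD:-qrsh}" -now n -l "hostname=$host" true; then
            echo "$host: ok"
        else
            # in the else branch, $? still holds qrsh's exit status
            echo "$host: FAILED (exit status $?)"
        fi
    done
}
```

You would call it as 'probe_hosts host1 host2 host3' with your own hostnames. Since the failure is intermittent, it may need several passes over the host list before a bad host shows up.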

Prentice


On 11/13/2014 06:34 PM, Brian Small wrote:

Hello all, this is my first time posting to this mailing list.

About 1% or less of our qrsh grid jobs are failing in an unusual way.

We are running Open Grid Scheduler 2011.11 on CentOS 6.5.

The small percentage of failing qrsh jobs get a non-zero exit status back to the submit host (exit status 1), and display this message:

Your "qrsh" request could not be scheduled, try again later.

Note, we do include the "-now n" option on the command line.

Also the qacct log shows the job as having completed successfully:

qsub_time    Thu Nov 13 14:17:47 2014
start_time   Thu Nov 13 14:21:13 2014
end_time     Thu Nov 13 14:25:15 2014
granted_pe   NONE
slots        1
failed       0
exit_status  0
ru_wallclock 242
ru_utime     226.439
ru_stime     5.383

And reviewing the working directory, it does look like the job completed properly.

I'm not sure how to take the next step in debugging this problem. Any advice?

*Brian Small*

*Northwest Logic*

*1100 NW Compton Drive, Ste. 100*

*Beaverton, OR  97006*

*Desk - 503-533-5800 x-320*

*Cell - 503-577-6869*

*Fax: 503-533-5900*

*E-mail - [email protected] <mailto:[email protected]>*

*Web - www.nwlogic.com <http://www.nwlogic.com/>*



_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users

--
Prentice Bisbal
Manager of Information Technology
Rutgers Discovery Informatics Institute (RDI2)
Rutgers University
http://rdi2.rutgers.edu
