Yes, I can reach the troublesome nodes with "qrsh." I was able to submit (via 'qsub') a simple job directly to one of those nodes.
Here's the output:

[root@frame ~]# qalter -w v 133068
verification: found possible assignment with 8 slots
[root@frame ~]# qalter -w p 133068
Job 133068 cannot run in queue "vision.q" because it is not contained in its hard queue list (-q)
Job 133068 cannot run in queue "cuda.q" because it is not contained in its hard queue list (-q)
Job 133068 cannot run in PE "orte" because it only offers 0 slots
verification: no suitable queues

It (133068) was submitted to the "all.q" queue.

Best,
Stephen


On Tue, Feb 11, 2014 at 2:48 PM, Reuti <[email protected]> wrote:

> On 11.02.2014 at 23:37, Stephen Spencer wrote:
>
> > I did swap them initially, sorry.
> >
> > Yes, "qrsh -q all.q@n20 hostname" returns the appropriate FQDN.
>
> So, you can reach the troublesome hosts now?
>
> Next step is:
>
> $ qalter -w v <job_id>
> $ qalter -w p <job_id>
>
> with the waiting jobs.
>
> -- Reuti
>
>
> > Best,
> > Stephen
> >
> >
> > On Tue, Feb 11, 2014 at 2:33 PM, Reuti <[email protected]> wrote:
> > On 11.02.2014 at 23:20, Stephen Spencer wrote:
> >
> > > The definition of "qconf -sconf" is as you expected: all "builtin."
> > >
> > > Could you please be specific as to the commands you'd like me to try from the next line?
> > >
> > > Any output when you use the "-q ..." for `qrsh` too? In addition, you can try "-w v" and "-w p" too.
> >
> > I meant:
> >
> > $ qrsh -q all.q@n20 hostname
> >
> > (queue@host, did you swap them?)
> >
> > -- Reuti
> >
> >
> > > I tried "qrsh -w v" and "qrsh -w p" and both returned "verification: found suitable queue(s)".
> > > "qrsh -q all.q" gave me a shell, surprisingly, on one of the troublesome nodes. (Actually, was three for three.)
> > > All nodes have "BIP" for "qtype" - no limitations, there.
> > >
> > > Best,
> > > Stephen
> > >
> > >
> > > On Tue, Feb 11, 2014 at 1:57 PM, Reuti <[email protected]> wrote:
> > > Hi,
> > >
> > > On 11.02.2014 at 22:37, Stephen Spencer wrote:
> > >
> > > > I have a sixty-node cluster running SGE 6.2u5 (RHEL 6.5).
> > > >
> > > > The immediate issue is that a user has jobs in the "qw" state, and there are idle nodes in the cluster which appear to be able to accept the jobs.
> > > >
> > > > What works and doesn't work?
> > > > * "qsub -q [email protected] job.sh" works - the job runs on "n20"
> > > > * Repeated invocations of "qrsh hostname" will not, however, result in the job running on one of the troublesome hosts.
> > >
> > > What is the definition of:
> > >
> > > $ qconf -sconf
> > > ...
> > > qlogin_command builtin
> > > qlogin_daemon builtin
> > > rlogin_command builtin
> > > rlogin_daemon builtin
> > > rsh_command builtin
> > > rsh_daemon builtin
> > >
> > > Any output when you use the "-q ..." for `qrsh` too? In addition, you can try "-w v" and "-w p" too.
> > >
> > >
> > > > Things I've tried, and know, so far:
> > > > * I've restarted the troublesome nodes - no change.
> > > > * "sge_execd" is running on the troublesome nodes.
> > > > * The troublesome nodes are in the execution host list and the submit host list.
> > > > * Most of the rest of the cluster's pretty busy.
> > > > * Interestingly, the troublesome nodes don't show up in the "scheduling info" list produced as part of the "qstat -j <jobid>" command's output.
> > > > Short of restarting the entire cluster, I'm at a loss as to what to look at next.
> > >
> > > Is "qtype INTERACTIVE" limited to certain nodes/queues?
> > >
> > > -- Reuti
> > >
> > >
> > > > --
> > > > Stephen Spencer
> > > > [email protected]
> > > > _______________________________________________
> > > > users mailing list
> > > > [email protected]
> > > > https://gridengine.org/mailman/listinfo/users
> > >
> > >
> > > --
> > > Stephen Spencer
> > > [email protected]
> >
> >
> > --
> > Stephen Spencer
> > [email protected]
>

--
Stephen Spencer
[email protected]
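
On a cluster with many queues, the "qalter -w p" output earlier in this message gets long, and the queue lines tend to bury the one about the parallel environment. Below is a minimal sketch that keeps only the PE lines. It assumes the message wording is exactly as quoted in this thread; "pe_reasons" is a hypothetical helper name, not an SGE command.

```shell
#!/bin/sh
# pe_reasons: hypothetical helper, not part of SGE.
# Reads `qalter -w p <job_id>` output on stdin and prints only the
# lines that blame a parallel environment (PE).
pe_reasons() {
    grep 'cannot run in PE'
}

# Sample copied from the output quoted in this thread.
sample='Job 133068 cannot run in queue "vision.q" because it is not contained in its hard queue list (-q)
Job 133068 cannot run in queue "cuda.q" because it is not contained in its hard queue list (-q)
Job 133068 cannot run in PE "orte" because it only offers 0 slots
verification: no suitable queues'

printf '%s\n' "$sample" | pe_reasons
```

Against a live job this would be used as "qalter -w p 133068 | pe_reasons". If the messages go to stderr on a given installation (an assumption, not checked here), add "2>&1" before the pipe.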
