Yes, I can reach the troublesome nodes with "qrsh."
I was able to submit (via 'qsub') a simple job directly to one of those
nodes.

Here's the output:

[root@frame ~]# qalter -w v 133068
verification: found possible assignment with 8 slots
[root@frame ~]# qalter -w p 133068
Job 133068 cannot run in queue "vision.q" because it is not contained in its hard queue list (-q)
Job 133068 cannot run in queue "cuda.q" because it is not contained in its hard queue list (-q)
Job 133068 cannot run in PE "orte" because it only offers 0 slots
verification: no suitable queues

Job 133068 was submitted to the "all.q" queue.

Best,
Stephen


On Tue, Feb 11, 2014 at 2:48 PM, Reuti <[email protected]> wrote:

> On 11.02.2014 at 23:37, Stephen Spencer wrote:
>
> > I did swap them initially, sorry.
> >
> > Yes, "qrsh -q all.q@n20 hostname" returns the appropriate FQDN.
>
> So, you can reach the troublesome hosts now?
>
> Next step is:
>
> $ qalter -w v <job_id>
> $ qalter -w p <job_id>
>
> with the waiting jobs.
>
> -- Reuti
>
>
> >
> > Best,
> > Stephen
> >
> >
> > On Tue, Feb 11, 2014 at 2:33 PM, Reuti <[email protected]> wrote:
> > On 11.02.2014 at 23:20, Stephen Spencer wrote:
> >
> > > The definition of "qconf -sconf" is as you expected: all "builtin."
> > >
> > > Could you please be specific about the commands you'd like me to try from the next line?
> > >
> > > Any output when you use the "-q ..." for `qrsh` too? In addition, you can try "-w v" and "-w p" too.
> >
> > I meant:
> >
> > $ qrsh -q all.q@n20 hostname
> >
> > (queue@host, did you swap them?)
> >
> > -- Reuti
> >
> >
> > >
> > > I tried "qrsh -w v" and "qrsh -w p" and both returned "verification: found suitable queue(s)".
> > > "qrsh -q all.q" gave me a shell, surprisingly, on one of the troublesome nodes. (In fact, it was three for three.)
> > > All nodes have "BIP" for "qtype" - no limitations, there.
> > >
> > > Best,
> > > Stephen
> > >
> > >
> > > On Tue, Feb 11, 2014 at 1:57 PM, Reuti <[email protected]> wrote:
> > > Hi,
> > >
> > > On 11.02.2014 at 22:37, Stephen Spencer wrote:
> > >
> > > > I have a sixty-node cluster running SGE 6.2u5 (RHEL 6.5).
> > > >
> > > > The immediate issue is that a user has jobs in the "qw" state, and there are idle nodes in the cluster which appear to be able to accept the jobs.
> > > >
> > > > What works and doesn't work?
> > > >       * "qsub -q [email protected] job.sh" works - the job runs on "n20"
> > > >       * Repeated invocations of "qrsh hostname" will not, however, result in the job running on one of the troublesome hosts.
> > >
> > > What is the definition of:
> > >
> > > $ qconf -sconf
> > > ...
> > > qlogin_command               builtin
> > > qlogin_daemon                builtin
> > > rlogin_command               builtin
> > > rlogin_daemon                builtin
> > > rsh_command                  builtin
> > > rsh_daemon                   builtin
> > >
> > > Any output when you use the "-q ..." for `qrsh` too? In addition, you can try "-w v" and "-w p" too.
> > >
> > >
> > > > Things I've tried, and know, so far:
> > > >       * I've restarted the troublesome nodes - no change.
> > > >       * "sge_execd" is running on the troublesome nodes.
> > > >       * The troublesome nodes are in the execution host list and the submit host list.
> > > >       * Most of the rest of the cluster's pretty busy.
> > > >       * Interestingly, the troublesome nodes don't show up in the "scheduling info" list produced as part of the "qstat -j <jobid>" command's output.
> > > > Short of restarting the entire cluster, I'm at a loss as to what to look at next.
> > >
> > > Is "qtype INTERACTIVE" limited to certain nodes/queues?
> > >
> > > -- Reuti
> > >
> > >
> > > > --
> > > > Stephen Spencer
> > > > [email protected]
> > > > _______________________________________________
> > > > users mailing list
> > > > [email protected]
> > > > https://gridengine.org/mailman/listinfo/users
> > >
> > >
> > >
> > >
> > > --
> > > Stephen Spencer
> > > [email protected]
> >
> >
> >
> >
> > --
> > Stephen Spencer
> > [email protected]
>
>


-- 
Stephen Spencer
[email protected]