Hi guys,

I inherited a cluster running SGE 6.2u3, so it's a bit on the old side. My
storage node crashed the other day, and after the reboot the filesystem was
dirty and wouldn't mount until I'd run xfs_repair. As far as I can tell,
nothing is missing.

The situation now is that while all my Cluster Queues (divided up as short,
medium and long) still show up in qstat and qmon, the Queue Instances have
disappeared for everything except the long queue. I tried modifying the
Cluster Queue for, say, the short queue, and all the hostlists were present
as I'd expect. qmon just shows the broken queues as all zeros: zero in use,
zero avail, zero total, zero in error, and a CQLOAD of -NA-. I dug about in
the filesystem to see if I'd lost files, but the
spool/qinstance/medium/nodexx.cluster type files are all present and
readable. It just seems like GE is ignoring them (although I'm not sure
whether losing them would have caused this behaviour).

By messing around, I found that if I cloned the short Cluster Queue via qmon
to create a short2 queue, the Queue Instances were populated correctly, I
had my usual number of total slots, and the short2 queue appeared to work
fine.

So my questions:
- Any ideas why my GE lost the Queue Instances?
- Is there an easier way to get them back? (Not that cloning a Cluster
Queue is difficult, but if there's a more "correct" way to do it, then I'd
rather know.)
- Is there a qconf equivalent of qmon's Clone button?

I'm a bit out of my depth with this and my google-fu seems to be letting me
down.

Thanks in advance.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users