I have a sixty-node cluster running SGE 6.2u5 (RHEL 6.5). The immediate issue is that a user has jobs in the "qw" state, and there are idle nodes in the cluster which appear to be able to accept the jobs.
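For reference, a minimal sketch of how the state described above can be inspected, assuming the standard SGE client tools ("qstat", "qhost") are on the PATH. The function name "show_cluster_state" is hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of SGE): print pending jobs, queue
# instance states, and per-host load in one go.
# Assumes the standard SGE client tools are installed and on PATH.
show_cluster_state() {
    # Bail out politely if the SGE tools are not available here.
    for tool in qstat qhost; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "skipped: $tool not found"
            return 0
        fi
    done

    # Pending ("qw") jobs for all users.
    qstat -u '*' -s p

    # Full listing of every queue instance with its state column;
    # -explain E adds the reason for any instance in error state.
    qstat -f -explain E

    # Load and memory figures as reported by each host's sge_execd.
    qhost
}
```

Calling "show_cluster_state" on a submit host prints the three listings in sequence; an idle node whose queue instance shows a non-empty state column in the second listing would be worth a closer look.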
What works and doesn't work?

- "qsub -q [email protected] job.sh" works - the job runs on "n20"
- Repeated invocations of "qrsh hostname", however, never result in the job running on one of the troublesome hosts.

Things I've tried, and know, so far:

- I've restarted the troublesome nodes - no change.
- "sge_execd" is running on the troublesome nodes.
- The troublesome nodes are in the execution host list and the submit host list.
- Most of the rest of the cluster's pretty busy.
- Interestingly, the troublesome nodes don't show up in the "scheduling info" list produced as part of the "qstat -j <jobid>" command's output.

Short of restarting the entire cluster, I'm at a loss as to what to look at next.

--
Stephen Spencer
[email protected]
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
