On Sun, 15 Feb 2015 17:08:36 +0000 "Z. Zhong" <[email protected]> wrote:
> We are using the SUN SGE to manage a small cluster system. There are about
> fifty computing nodes. Users could submit their jobs on one independent
> manager node using qsub command.
>
> Recently we have faced a problem: One could submit a job and the SGE could
> dispatch the job into one free computing node. And suppose the returned job
> id is 1000. When using qstat -j 1000, it shows the job status is r which
> means the SGE thought the job is running and at the same time, it also shows
> which machine the job is running on, suppose machine1. But when we using the
> ssh machine1 and using the top command to show the usage of resources by the
> running processes related to the job, it shows nothing. Ideally, we expect it
> shows the CPU usage by the submitting job, but it didn't. We also tried the
> `qrsh` to login into that node, and there are also no information about the
> processes about the job.

The central grid engine process's view of a job can be a little out of
date. Perhaps the job finished quickly? In any case I'd check the log files
under the execd_spool_dir (man sge_conf) on the node in question to see if
they have any clues.

>
> Another problem is, when one submit multiple jobs at the same time, the SGE
> will dispatch these jobs into one or few computing nodes. But in fact, there
> are many other computing nodes are free or not busy. What's the possible
> problem with the SGE? We expect the SGE could preferentially dispatch the
> latest submitted jobs into the free computing nodes.

The way in which grid engine selects queues to run jobs in can be
configured (for serial jobs at least) via the scheduler configuration
(man sched_conf). I'd look particularly at queue_sort_method, load_formula,
job_load_adjustments and load_adjustment_decay_time. For parallel jobs you
might want to take a look at the parallel environment's allocation_rule.
On our cluster we have the JSV enforce a rule where multi-node jobs have
exclusive access to the nodes they run on.
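
As a rough sketch of how one might inspect those scheduler settings:
`qconf -ssconf` prints the current scheduler configuration, and the small
filter below keeps only the parameters named above. The function name
show_placement_settings is made up for illustration, and the snippet assumes
qconf is on the PATH of a host with access to the cell; it does nothing
otherwise.

```shell
# Hypothetical helper: reads scheduler configuration text on stdin and
# prints only the parameters that influence where serial jobs are placed.
show_placement_settings() {
    grep -E '^(queue_sort_method|load_formula|job_load_adjustments|load_adjustment_decay_time)[[:space:]]'
}

# Only query the scheduler if the grid engine client tools are installed.
if command -v qconf >/dev/null 2>&1; then
    qconf -ssconf | show_placement_settings
fi
```

If the values need changing, `qconf -msconf` opens the same configuration in
an editor. Which values suit a given cluster depends on the workload, so
treat this purely as a way to see what is currently set.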
>
> Could anyone help give some advice or references, please? Thanks!

--
William Hay <[email protected]>
_______________________________________________ users mailing list [email protected] https://gridengine.org/mailman/listinfo/users
