On 16.02.2015 at 11:08, William Hay <[email protected]> wrote:

>
> On Sun, 15 Feb 2015 17:08:36 +0000
> "Z. Zhong" <[email protected]> wrote:
>
>> We are using the SUN SGE to manage a small cluster system. There are about
>> fifty computing nodes. Users could submit their jobs on one independent
>> manager node using qsub command.
>>
>> Recently we have faced a problem: One could submit a job and the SGE could
>> dispatch the job into one free computing node. And suppose the returned job
>> id is 1000. When using qstat -j 1000, it shows the job status is r which
>> means the SGE thought the job is running and at the same time, it also shows
>> which machine the job is running on, suppose machine1. But when we using the
>> ssh machine1 and using the top command to show the usage of resources by the
>> running processes related to the job, it shows nothing. Ideally, we expect
>> it shows the CPU usage by the submitting job, but it didn't. We also tried
>> the `qrsh` to login into that node, and there are also no information about
>> the processes about the job.
>
> The central grid engine process can be a little out of date. Perhaps the job
> finished quickly? In any case I'd check the log files under the
> execd_spool_dir (man sge_conf) to see if they have any clues.

Or there is only a `sleep` in the job script: `ps -e f` (the f without a dash) should show the processes attached to the sge_execd.

-- Reuti

>>
>> Another problem is, when one submit multiple jobs at the same time, the SGE
>> will dispatch these jobs into one or few computing nodes. But in fact, there
>> are many other computing nodes are free or not busy. What's the possible
>> problem with the SGE? We expect the SGE could preferentially dispatch the
>> latest submitted jobs into the free computing nodes.
>
> The way in which grid engine selects queues to run jobs in can be configured
> (for serial jobs at least) via the scheduler configuration (man sched_conf).
> I'd look particularly at
> queue_sort_method, load_formula, job_load_adjustments and
> load_adjustment_decay_time. For parallel jobs you might want to take a look
> at the parallel environment's allocation_rule. On our cluster we have the jsv
> enforce a rule where multi-node jobs have exclusive access to the nodes they
> run on.
>
>>
>> Could anyone help give some advice or references, please? Thanks!
>
>
> --
> William Hay <[email protected]>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
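
To make the scheduler settings William lists a bit more concrete: the fragment below is only a sketch of the relevant sched_conf entries, with the values I remember as the stock defaults from sched_conf(5). Treat the values as assumptions and compare them with what `qconf -ssconf` prints on your cluster (`qconf -msconf` opens the same settings in an editor).

```
queue_sort_method            load
load_formula                 np_load_avg
job_load_adjustments         np_load_avg=0.50
load_adjustment_decay_time   0:7:30
```

With queue_sort_method set to load, hosts are ordered by the value of load_formula, and each freshly dispatched job adds the job_load_adjustments amount as a virtual load that fades out over load_adjustment_decay_time. If job_load_adjustments is NONE, or queue_sort_method is seqno, several jobs submitted together can all land on the same host, because the execution daemons report their load only periodically and the scheduler keeps seeing the old value in between.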
