On 16.02.2015 at 11:08, William Hay <[email protected]> wrote:
> 
> On Sun, 15 Feb 2015 17:08:36 +0000
> "Z. Zhong" <[email protected]> wrote:
> 
>> We are using Sun Grid Engine (SGE) to manage a small cluster of about 
>> fifty compute nodes. Users submit their jobs from a separate manager node 
>> using the qsub command.
>> 
>> Recently we ran into a problem. A user submits a job and SGE dispatches it 
>> to a free compute node; suppose the returned job id is 1000. qstat -j 1000 
>> shows the job status as r, meaning SGE considers the job running, and it 
>> also shows which machine the job is running on, say machine1. But when we 
>> ssh to machine1 and run top to see the resources used by the job's 
>> processes, nothing shows up. We expected to see the CPU usage of the 
>> submitted job. We also tried logging in to that node with `qrsh`, and 
>> again found no information about the job's processes.
> 
> The central grid engine process can be a little out of date.  Perhaps the job 
> finished quickly?  In any case I'd check the log files under the 
> execd_spool_dir (man sge_conf) to see if they have any clues.

Or the job script contains only a `sleep`, which uses almost no CPU and so
would not stand out in top. In that case

ps -e f

(f without a leading -) should show the processes attached to sge_execd.
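
To cut the forest listing down to the SGE part, something like the following might help. It is only a sketch: `show_sge_tree` is a made-up helper name, and it assumes the execution daemon and its shepherds appear in the listing as `sge_execd` and `sge_shepherd-<jobid>`, which can differ between installations.

```shell
#!/bin/sh
# Hypothetical helper: read a `ps -e f` forest listing on stdin and keep the
# SGE daemon lines plus the lines that follow them, where children appear.
# $1 = number of lines to keep after each match (default 5).
show_sge_tree() {
    grep -E -A "${1:-5}" 'sge_(execd|shepherd)'
}

# On the execution host (machine1 in the example above):
#   ps -e f | show_sge_tree
```

If job 1000 is really running there, its job script and whatever it started (for example the `sleep`) should be listed beneath a shepherd line.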

-- Reuti


>> 
>> Another problem: when a user submits multiple jobs at the same time, SGE 
>> dispatches them to one or a few compute nodes, even though many other 
>> compute nodes are free or lightly loaded. What could be causing this? We 
>> expected SGE to dispatch newly submitted jobs to free compute nodes 
>> first.
> 
> The way in which grid engine selects queues to run jobs in can be configured 
> (for serial jobs at least) via the scheduler configuration (man sched_conf). 
> I'd look particularly at queue_sort_method, load_formula, 
> job_load_adjustments and load_adjustment_decay_time.  For parallel jobs you 
> might want to take a look at the parallel environment's allocation_rule.  
> On our cluster we have the JSV enforce a rule where multi-node jobs have 
> exclusive access to the nodes they run on.
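
A quick way to see how those scheduler parameters are currently set is to filter the output of `qconf -ssconf`. This is a sketch rather than a recipe: `sched_load_settings` is a made-up helper name, and the pattern assumes the usual one-parameter-per-line layout of the scheduler configuration.

```shell
#!/bin/sh
# Hypothetical helper: read scheduler configuration text on stdin and keep
# only the parameters mentioned above (one "name value" pair per line).
sched_load_settings() {
    grep -E '^(queue_sort_method|load_formula|job_load_adjustments?|load_adjustment_decay_time)[[:space:]]'
}

# On a host that can query the qmaster:
#   qconf -ssconf | sched_load_settings
```

The allocation_rule of a parallel environment is not part of this output; it is shown by `qconf -sp` followed by the name of the parallel environment.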
> 
>> 
>> Could anyone help give some advice or references, please? Thanks!
> 
> 
> --
> William Hay <[email protected]>
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

