Hi,

On 20.04.2018 at 21:04, Ilya M wrote:

> Hello,
> 
> I set up a test queue to test new prolog/epilog scripts, and I am seeing 
> some strange behavior when I submit a PE job to this queue: the job is not 
> scheduled at all, or only after a very long time. I tried several PEs with 
> allocation rules of '1', '2', and '4', all to no avail. 
> Submitting a job without a PE makes it run immediately. I am using SGE 2.6u5.
> 
> Checking why it is not running:
> $ qalter -w v 7301747
> ...
> Job 7301747 cannot run because it exceeds limit "ilya/////" in rule 
> "limit_slots_for_users/1"
> Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots

This error message is often misleading, although there is usually a real 
reason that prevents the job from being scheduled.

> verification: no suitable queues
> 
> $ qconf -sp pe_1
> pe_name            pe_1
> slots              9999999
> user_lists         NONE
> xuser_lists        NONE
> start_proc_args    startmpi.sh $pe_hostfile
> stop_proc_args     stopmpi.sh $pe_hostfile
> allocation_rule    1
> control_slaves     TRUE
> job_is_first_task  TRUE
> urgency_slots      min
> accounting_summary FALSE
> 
> $ qconf -srqs limit_slots_for_users
> {
>    name         limit_slots_for_users
>    description  "limit the number of simultaneous slots any user can use"
>    enabled      TRUE
>    limit        users {*} to slots=800
> }
> 
> And finally, 
> $ qstat
> job-ID  prior   name       user         state submit/start at     queue      slots ja-task-ID
> ---------------------------------------------------------------------------------------------
> 7301584 0.60051 sleep      ilya         qw    04/20/2018 18:29:26                4
> 7301747 0.50051 sleep      ilya         qw    04/20/2018 18:36:23                1
> 
> So I am not running anything at the moment. If I submit a job with the same 
> PE to a production queue, it will get scheduled.
> 
> A job that I left hanging last night finally got scheduled after 7-8 hours.
> 
> The test queue is as follows:
> qconf -sq test_gpu.q
> qname                 test_gpu.q
> hostlist              @gpu

How many hosts are in @gpu? An allocation_rule of 1 means exactly one slot per 
machine; the rule is not applied repeatedly until the node is full (this is 
different from Torque, where it can be assigned several times per host).
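
To make the consequence concrete, here is a rough sketch of the arithmetic behind a fixed integer allocation_rule. It is a simplified model, not how the scheduler is implemented: it ignores resource quotas, load thresholds and everything else that also takes part in the decision. The function names and the host names gpu01/gpu02 are made up for illustration.

```python
def hosts_needed(requested_slots: int, allocation_rule: int) -> int:
    """Distinct hosts a PE job needs when allocation_rule is a fixed integer.

    Simplified model: every host contributes exactly `allocation_rule` slots,
    so the request has to be a multiple of it.
    """
    if requested_slots % allocation_rule != 0:
        raise ValueError("requested slots must be a multiple of allocation_rule")
    return requested_slots // allocation_rule


def can_schedule(requested_slots: int, allocation_rule: int,
                 free_slots_per_host: dict) -> bool:
    """True if enough hosts each have `allocation_rule` free slots (model only)."""
    needed = hosts_needed(requested_slots, allocation_rule)
    usable = [host for host, free in free_slots_per_host.items()
              if free >= allocation_rule]
    return len(usable) >= needed


# Hypothetical host group with two machines, 4 free queue slots each.
gpu_hosts = {"gpu01": 4, "gpu02": 4}

print(hosts_needed(4, 1))             # a 4-slot job with rule 1 wants 4 hosts
print(can_schedule(4, 1, gpu_hosts))  # only 2 hosts exist in this example
print(can_schedule(4, 2, gpu_hosts))  # rule 2 needs just 2 hosts
```

So with allocation_rule 1, a 4-slot job can only start once four different hosts in the queue each have a free slot, no matter how many slots are free on any single host.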


> seq_no                0
> load_thresholds       np_load_avg=1.75
> suspend_thresholds    NONE
> nsuspend              1
> suspend_interval      00:05:00
> priority              0
> min_cpu_interval      00:05:00
> processors            UNDEFINED
> qtype                 BATCH INTERACTIVE
> ckpt_list             NONE
> pe_list               make pe_1 pe_2 pe_3 pe_4 pe_slots
> rerun                 TRUE
> slots                 4
> tmpdir                /data
> shell                 /bin/sh
> prolog                [email protected]
> epilog                [email protected]
> shell_start_mode      unix_behavior
> starter_method        NONE
> suspend_method        NONE
> resume_method         NONE
> terminate_method      custom_kill -p $job_pid -j $job_id

I don't know about your custom_kill procedure, but it should kill -$job_pid 
(the negative PID), i.e. the whole process group and not only a single process.
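
As an illustration of the difference, here is a minimal Python sketch of signalling a whole process group. It has nothing to do with your actual custom_kill script, which I haven't seen. It assumes the job's PID is also its process group ID; the demo arranges this itself with start_new_session=True. The names terminate_group and demo are made up.

```python
import os
import signal
import subprocess


def terminate_group(job_pid: int, sig: int = signal.SIGTERM) -> None:
    """Signal every process in the group led by job_pid.

    Corresponds to `kill -TERM -- -$job_pid` in a shell; assumes job_pid is
    the process group ID.
    """
    os.killpg(job_pid, sig)


def demo() -> int:
    """Start a shell with a background child, then signal the whole group."""
    # start_new_session=True makes the shell a session and group leader,
    # so its PID equals its process group ID.
    proc = subprocess.Popen(["sh", "-c", "sleep 30 & wait"],
                            start_new_session=True)
    terminate_group(proc.pid)
    # A negative return code means the shell was terminated by that signal.
    return proc.wait()


if __name__ == "__main__":
    print(demo())
```

Signalling only the positive PID would stop the shell but could leave the background sleep running, which is exactly the kind of leftover process you want to avoid on the nodes.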

- Reuti