Hi,
On 20.04.2018 at 21:04, Ilya M wrote:
> Hello,
>
> I set up a test queue to test new prolog/epilog scripts and I am seeing
> some strange behavior when I submit a PE job to this queue: the job is
> not scheduled at all, or only after a very long time. I tried several PEs
> with allocation rules of '1', '2', '4', all to no avail.
> Submitting a job without a PE makes it run immediately. I am using SGE 6.2u5.
>
> Checking why it is not running:
> $ qalter -w v 7301747
> ...
> Job 7301747 cannot run because it exceeds limit "ilya/////" in rule
> "limit_slots_for_users/1"
> Job 7301747 cannot run in PE "pe_1" because it only offers 0 slots
This error message is often misleading, although there is a real reason
that prevents the job from being scheduled.
> verification: no suitable queues
>
> $ qconf -sp pe_1
> pe_name pe_1
> slots 9999999
> user_lists NONE
> xuser_lists NONE
> start_proc_args startmpi.sh $pe_hostfile
> stop_proc_args stopmpi.sh $pe_hostfile
> allocation_rule 1
> control_slaves TRUE
> job_is_first_task TRUE
> urgency_slots min
> accounting_summary FALSE
>
> $ qconf -srqs limit_slots_for_users
> {
> name limit_slots_for_users
> description "limit the number of simultaneous slots any user can use"
> enabled TRUE
> limit users {*} to slots=800
> }
>
> And finally,
> $ qstat
> job-ID  prior   name  user state submit/start at     queue slots ja-task-ID
> ---------------------------------------------------------------------------
> 7301584 0.60051 sleep ilya qw    04/20/2018 18:29:26       4
> 7301747 0.50051 sleep ilya qw    04/20/2018 18:36:23       1
>
> So I am not running anything at the moment. If I submit a job with the same
> PE to a production queue, it will get scheduled.
>
> A job that I left hanging last night, finally got scheduled after 7-8 hours.
>
> The test queue is as follows:
> qconf -sq test_gpu.q
> qname test_gpu.q
> hostlist @gpu
How many hosts are in @gpu? An allocation_rule of 1 means exactly one slot per
machine, so a job requesting N slots needs N hosts; the rule is not applied
repeatedly until the node is full (this is different from Torque, where it can
be assigned several times per host).
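
To make the arithmetic concrete, here is a minimal sketch of how many hosts a
job needs under a fixed integer allocation_rule. The function name
hosts_needed is made up for illustration and is not part of SGE; it also
assumes that the requested slot count has to be a multiple of the rule, which
is how I understand fixed integer rules to work.

```sh
# hosts_needed SLOTS RULE
# Hypothetical helper, not an SGE command.
# Prints how many hosts a job requesting SLOTS slots needs when the PE
# places exactly RULE slots on each host (fixed integer allocation_rule).
# Returns 1 if SLOTS is not a multiple of RULE, 2 if RULE is below 1.
hosts_needed() {
    slots=$1
    rule=$2
    [ "$rule" -ge 1 ] || return 2
    if [ $((slots % rule)) -ne 0 ]; then
        return 1
    fi
    echo $((slots / rule))
}

# A 4-slot job in pe_1 (allocation_rule 1) needs four different hosts:
hosts_needed 4 1
```

Comparing that number with the host list printed by `qconf -shgrp @gpu`
should show whether the test queue can ever satisfy the request.
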
> seq_no 0
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> processors UNDEFINED
> qtype BATCH INTERACTIVE
> ckpt_list NONE
> pe_list make pe_1 pe_2 pe_3 pe_4 pe_slots
> rerun TRUE
> slots 4
> tmpdir /data
> shell /bin/sh
> prolog [email protected]
> epilog [email protected]
> shell_start_mode unix_behavior
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method custom_kill -p $job_pid -j $job_id
I don't know what your custom_kill procedure does, but it should kill -$job_pid,
i.e. the whole process group and not just a single process.
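
As a rough sketch of what I mean (your real custom_kill is not shown here, so
the function name kill_job_group and its argument handling are assumptions),
signalling the process group instead of the single PID could look like this:

```sh
# kill_job_group PID [SIGNAL]
# Hypothetical helper, not taken from the actual custom_kill script.
# Sends SIGNAL (default TERM) to the process group whose ID is PID.
# Returns 2 without signalling anything if PID is not a number above 1.
kill_job_group() {
    pid=$1
    sig=${2:-TERM}
    # Accept digits only, so nothing unexpected ends up after the "-".
    case $pid in
        ''|*[!0-9]*) return 2 ;;
    esac
    # Refuse 0 and 1: "kill -- -1" would signal every process the
    # caller is allowed to signal.
    [ "$pid" -gt 1 ] || return 2
    # A negative PID argument addresses the whole process group;
    # "--" keeps kill from reading "-<pid>" as an option.
    kill -s "$sig" -- "-$pid"
}
```

Inside the script this would be called with the value passed via -p, e.g.
`kill_job_group "$pid"`. It only works as intended if $job_pid is the leader
of the job's process group.
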
- Reuti