Hello all,
Our schedd has been dying unexpectedly since a user submitted 750000 jobs in 3
job arrays.
I usually have schedd_job_info set to true, schedule_interval is 30 s, we
have configured a fairshare priority policy, and max_pending_tasks_per_job
is 50.
Is schedd_job_info=true bad? After some googling I've realized that
many people have it set to false, but in my case setting it to
true or false makes no difference.
Anyway, the log says something like:
[....]
05/13/2014 05:13:17|event_|ant-master2|E|removing event client (schedd:0) on
host "ant-master2.linux.crg.es" after acknowledge timeout from event client list
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: job dispatching took 7009.980 s
(1978 fast, 0 fast_soft, 54 pe, 0 pe_soft, 52 res)
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: parallel matching 167
167 2004 14720 11222 14720 167
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: sequential matching 1977
0 20580 1991 1991 1991 1975
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: create pending job orders: 0.000
s
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: scheduled in 7010.120 (u
6199.200 + s 861.580 = 7060.780): 1975 sequential, 2 parallel, 4090 orders, 145
H, 73 Q, 208 QA, 55 J(qw),
67 J(r), 0 J(s), 0 J(h), 0 J(e), 6 J(x), 125 J(all), 51 C, 45 ACL, 3 PE, 37
U, 1 D, 0 PRJ, 1 ST, 1 CKPT, 0 RU, 135 gMes, 22766 jMes, 4090/183 pre-send,
0/0/0 pe-alg
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: send orders and cleanup took:
0.030 (u 0.050,s 0.000) s
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: schedd run took: 7014.870 s
(init: 4.300 s, copy: 0.040 s, run:7010.150, free: 0.010 s, jobs: 125,
categories: 19/0)
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): scheduler thread
profiling summary:
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): other : wc
= 0.320s, utime = 0.350s, stime = 0.030s, utilization = 119%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): packing : wc
= 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): eventclient : wc
= 0.080s, utime = 0.070s, stime = 0.000s, utilization = 88%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): mirror : wc
= 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): gdi : wc
= 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): ht-resize : wc
= 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): scheduler : wc
= 0.020s, utime = 0.010s, stime = 0.010s, utilization = 100%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): pending ticket : wc
= 0.000s, utime = 0.010s, stime = 0.000s, utilization = 0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): job sorting : wc
= 0.000s, utime = 0.000s, stime = 0.000s, utilization = 0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): job dispatching: wc
= 7009.980s, utime = 6199.070s, stime = 861.570s, utilization = 101%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): send orders : wc
= 0.030s, utime = 0.050s, stime = 0.000s, utilization = 167%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): scheduler event: wc
= 0.370s, utime = 0.330s, stime = 0.060s, utilization = 105%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): copy lists : wc
= 4.350s, utime = 4.070s, stime = 0.310s, utilization = 101%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): total : wc
= 7015.270s, utime = 6204.080s, stime = 861.980s, utilization = 101%
05/13/2014 07:00:12|event_|ant-master2|E|no event client known with id 1 to
modify
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: sge_mirror processed 14 events
in 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: static urgency took 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job ticket calculation: init:
0.110 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job ticket calculation: init:
0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: normalizing job tickets took
0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: create active job orders: 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job-order calculation took 0.120
s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job sorting took 0.000 s
05/13/2014 07:10:13|event_|ant-master2|W|acknowledge timeout after 600 seconds
for event client (schedd:0) on host "ant-master2.linux.crg.es"
05/13/2014 07:10:13|event_|ant-master2|E|removing event client (schedd:0) on
host "ant-master2.linux.crg.es" after acknowledge timeout from event client list
05/13/2014 08:49:36|worker|ant-master2|E|no event client known with id 1 to
deliver events immediately
[...]
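To put the numbers side by side: one scheduling run took about 7015 s, while the acknowledge timeout for event clients is 600 s. A run that ended at 07:00:12 after 7015 s would have started around 05:03:17, and the first removal of (schedd:0) is logged at 05:13:17, which seems consistent with the schedd being dropped 600 s into a long run. Below is a minimal Python sketch for picking such runs out of the messages file. It assumes each log entry sits on a single line (unlike the mail-wrapped excerpt above); the function name and the sample list are only for illustration.

```python
import re

# Matches the per-run profiling line, e.g.
# "05/13/2014 07:00:12|schedu|ant-master2|P|PROF: schedd run took: 7014.870 s (..."
RUN_TOOK = re.compile(r"^([^|]+)\|.*PROF: schedd run took:\s*([0-9.]+)\s*s")

# Taken from the "acknowledge timeout after 600 seconds" warning in the log.
ACK_TIMEOUT_S = 600.0


def slow_runs(lines, limit=ACK_TIMEOUT_S):
    """Return (timestamp, seconds) for every scheduler run longer than limit."""
    hits = []
    for line in lines:
        m = RUN_TOOK.match(line)
        if m and float(m.group(2)) > limit:
            hits.append((m.group(1), float(m.group(2))))
    return hits


# Two entries from the excerpt, unwrapped to one line each.
sample = [
    "05/13/2014 07:00:12|schedu|ant-master2|P|PROF: schedd run took: 7014.870 s "
    "(init: 4.300 s, copy: 0.040 s, run:7010.150, free: 0.010 s, jobs: 125, "
    "categories: 19/0)",
    "05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job sorting took 0.000 s",
]

for stamp, seconds in slow_runs(sample):
    print(f"{stamp}: run took {seconds:.0f} s, limit is {ACK_TIMEOUT_S:.0f} s")
```

To run it against a real file, pass an open file object instead of the sample list, for example slow_runs(open(path)) with path pointing at the qmaster messages file.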
After the restart I see many errors like:
05/13/2014 09:01:21|worker|ant-master2|E|The job -j of user(s) jaespinosa does
not exist
this one:
05/13/2014 09:01:48| main|ant-master2|E|error parsing double value from string
"ScCCCCCCCSCCCCCCCC"
or errors like these for all users:
05/13/2014 09:01:48| main|ant-master2|E|unrecognized characters after the
attribute values in line 9: "0.000000"
05/13/2014 09:01:48| main|ant-master2|E|error reading file:
"/var/spool/gridengine/default/qmaster/users/dsantesmasses"
05/13/2014 09:01:48| main|ant-master2|E|unrecognized characters after the
attribute values in line 9: "mem"
05/13/2014 09:01:48| main|ant-master2|E|error reading file:
"/var/spool/gridengine/default/qmaster/users/mmusy"
# cat /var/spool/gridengine/default/qmaster/users/mmusy
# Version: 2011.11p1
#
# DO NOT MODIFY THIS FILE MANUALLY!
#
name mmusy
oticket 0
fshare 0
delete_time 0
usage cpu=0.000000 mem=0.000000 io=0.000000
usage_time_stamp 1372869434
long_term_usage cpu=0.000000 mem=0.000000 io=0.000000
default_project NONE
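For reference, if the "# ..." header lines count towards the line number, line 9 of this file is the usage line, whose value is three space-separated tokens; the "mem" in the error message would then be the second token. A small Python sketch to check this, assuming the cat output above is complete and that whitespace was not altered by the mail client (the helper names are made up for illustration, and I don't know what separator the qmaster parser actually expects):

```python
# Contents of the spool file exactly as pasted above.
SPOOL_TEXT = """\
# Version: 2011.11p1
#
# DO NOT MODIFY THIS FILE MANUALLY!
#
name mmusy
oticket 0
fshare 0
delete_time 0
usage cpu=0.000000 mem=0.000000 io=0.000000
usage_time_stamp 1372869434
long_term_usage cpu=0.000000 mem=0.000000 io=0.000000
default_project NONE
"""


def line_at(text, lineno):
    """Return the 1-based line lineno, counting comment lines too (assumption)."""
    return text.splitlines()[lineno - 1]


def split_attribute(line):
    """Split 'name value...' into the attribute name and its whitespace-separated tokens."""
    name, _, value = line.partition(" ")
    return name, value.split()


name, tokens = split_attribute(line_at(SPOOL_TEXT, 9))
print(name, tokens)
```

This only shows which line the error points at; it does not tell whether the file is corrupt or whether the parser simply rejects this layout.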
Also, there are some warnings about the spool directory:
05/13/2014 09:22:06|worker|ant-master2|W|rule "default rule (spool dir)" in
spooling context "flatfile spooling" failed writing an object
In my case the spool directory is on a local filesystem:
# df -h /var/
Filesystem Size Used Avail Use% Mounted on
/dev/cciss/c0d0p7 75G 767M 70G 2% /var
This is my schedd conf:
# qconf -ssconf
algorithm default
schedule_interval 0:0:30
maxujobs 0
queue_sort_method seqno
job_load_adjustments np_load_avg=0.50
load_adjustment_decay_time 0:7:30
load_formula np_load_avg
schedd_job_info false
flush_submit_sec 0
flush_finish_sec 0
params PROFILE=1 MONITOR=1
reprioritize_interval 0:0:0
halftime 24
usage_weight_list cpu=0.500000,mem=0.500000,io=0.000000
compensation_factor 1.000000
weight_user 0.250000
weight_project 0.250000
weight_department 0.250000
weight_job 0.250000
weight_tickets_functional 0
weight_tickets_share 100000
share_override_tickets TRUE
share_functional_shares TRUE
max_functional_jobs_to_schedule 200
report_pjob_tickets TRUE
max_pending_tasks_per_job 50
halflife_decay_list io=-1
policy_hierarchy S
weight_ticket 1.000000
weight_waiting_time 0.000000
weight_deadline 3600000.000000
weight_urgency 0.001000
weight_priority 0.001000
max_reservation 150
default_duration 9999:00:00
Has anyone seen any of these errors?
Any advice for running this many jobs?
Am I reaching some internal SGE limit?
I'm tempted to enable debugging, but I'm worried about the amount of output
it could generate.
TIA,
Arnau