Hello all,

Our schedd has been dying unexpectedly since a user submitted 750000 jobs in 3
job arrays.
I usually have schedd_job_info set to true, schedule_interval is 30 s, we
have configured the fairshare priority policy, and max_pending_tasks_per_job
is 50.

Is schedd_job_info=true bad? After some googling I've found that
many people have it set to false, but in my case setting it to
true or false makes no difference.


Anyway, the log says something like:

[....]
05/13/2014 05:13:17|event_|ant-master2|E|removing event client (schedd:0) on host "ant-master2.linux.crg.es" after acknowledge timeout from event client list
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: job dispatching took 7009.980 s (1978 fast, 0 fast_soft, 54 pe, 0 pe_soft, 52 res)
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: parallel matching            167          167         2004        14720        11222        14720          167
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: sequential matching         1977            0        20580         1991         1991         1991         1975
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: create pending job orders: 0.000 s
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: scheduled in 7010.120 (u 6199.200 + s 861.580 = 7060.780): 1975 sequential, 2 parallel, 4090 orders, 145 H, 73 Q, 208 QA, 55 J(qw), 67 J(r), 0 J(s), 0 J(h), 0 J(e), 6 J(x), 125 J(all), 51 C, 45 ACL, 3 PE, 37 U, 1 D, 0 PRJ, 1 ST, 1 CKPT, 0 RU, 135 gMes, 22766 jMes, 4090/183 pre-send, 0/0/0 pe-alg
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: send orders and cleanup took: 0.030 (u 0.050,s 0.000) s
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: schedd run took: 7014.870 s (init: 4.300 s, copy: 0.040 s, run:7010.150, free: 0.010 s, jobs: 125, categories: 19/0)
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): scheduler thread profiling summary:
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): other          : wc =      0.320s, utime =      0.350s, stime =      0.030s, utilization = 119%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): packing        : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): eventclient    : wc =      0.080s, utime =      0.070s, stime =      0.000s, utilization =  88%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): mirror         : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): gdi            : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): ht-resize      : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): scheduler      : wc =      0.020s, utime =      0.010s, stime =      0.010s, utilization = 100%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): pending ticket : wc =      0.000s, utime =      0.010s, stime =      0.000s, utilization =   0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): job sorting    : wc =      0.000s, utime =      0.000s, stime =      0.000s, utilization =   0%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): job dispatching: wc =   7009.980s, utime =   6199.070s, stime =    861.570s, utilization = 101%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): send orders    : wc =      0.030s, utime =      0.050s, stime =      0.000s, utilization = 167%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): scheduler event: wc =      0.370s, utime =      0.330s, stime =      0.060s, utilization = 105%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): copy lists     : wc =      4.350s, utime =      4.070s, stime =      0.310s, utilization = 101%
05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): total          : wc =   7015.270s, utime =   6204.080s, stime =    861.980s, utilization = 101%
05/13/2014 07:00:12|event_|ant-master2|E|no event client known with id 1 to modify
05/13/2014 07:00:12|schedu|ant-master2|P|PROF: sge_mirror processed 14 events in 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: static urgency took 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job ticket calculation: init: 0.110 s, pass 0: 0.010 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job ticket calculation: init: 0.000 s, pass 0: 0.000 s, pass 1: 0.000, pass2: 0.000, calc: 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: normalizing job tickets took 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: create active job orders: 0.000 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job-order calculation took 0.120 s
05/13/2014 07:00:17|schedu|ant-master2|P|PROF: job sorting took 0.000 s
05/13/2014 07:10:13|event_|ant-master2|W|acknowledge timeout after 600 seconds for event client (schedd:0) on host "ant-master2.linux.crg.es"
05/13/2014 07:10:13|event_|ant-master2|E|removing event client (schedd:0) on host "ant-master2.linux.crg.es" after acknowledge timeout from event client list
05/13/2014 08:49:36|worker|ant-master2|E|no event client known with id 1 to deliver events immediately
[...]
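
For anyone reading the profiling summary above, here is a minimal Python sketch that pulls the wall-clock time ("wc") per phase out of the PROF(...) lines so the dominant phase is easy to spot. The regular expression and the function name phase_wallclock are assumptions inferred from this excerpt only, not an official SGE log format description, and the sketch assumes each log record sits on a single line (not wrapped by the mail client).

```python
import re

# Assumed shape of the per-phase summary lines, inferred from the excerpt:
#   ...|P|PROF(<thread id>): <phase> : wc = <seconds>s, utime = ...
PROF_RE = re.compile(
    r"PROF\(\d+\):\s*(?P<phase>[^:]+?)\s*:\s*wc\s*=\s*(?P<wc>[\d.]+)s"
)

def phase_wallclock(lines):
    """Return {phase name: wall-clock seconds} for every matching line."""
    result = {}
    for line in lines:
        match = PROF_RE.search(line)
        if match:
            result[match.group("phase")] = float(match.group("wc"))
    return result

# Three records copied from the log above, unwrapped onto single lines.
SAMPLE = [
    "05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): job dispatching: wc =   7009.980s, utime =   6199.070s, stime =    861.570s, utilization = 101%",
    "05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): copy lists     : wc =      4.350s, utime =      4.070s, stime =      0.310s, utilization = 101%",
    "05/13/2014 07:00:12|schedu|ant-master2|P|PROF(1115686656): total          : wc =   7015.270s, utime =   6204.080s, stime =    861.980s, utilization = 101%",
]

times = phase_wallclock(SAMPLE)
# Ignore the "total" row when looking for the most expensive phase.
phases = {name: wc for name, wc in times.items() if name != "total"}
slowest = max(phases, key=phases.get)
print(slowest, phases[slowest])
```

On the lines shown here, nearly all of the 7015.270 s total is spent in "job dispatching".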

After the restart I see many errors like:

05/13/2014 09:01:21|worker|ant-master2|E|The job -j of user(s) jaespinosa does not exist

this one:

05/13/2014 09:01:48|  main|ant-master2|E|error parsing double value from string "ScCCCCCCCSCCCCCCCC"

or errors like these for all users:

05/13/2014 09:01:48|  main|ant-master2|E|unrecognized characters after the attribute values in line 9: "0.000000"
05/13/2014 09:01:48|  main|ant-master2|E|error reading file: "/var/spool/gridengine/default/qmaster/users/dsantesmasses"
05/13/2014 09:01:48|  main|ant-master2|E|unrecognized characters after the attribute values in line 9: "mem"
05/13/2014 09:01:48|  main|ant-master2|E|error reading file: "/var/spool/gridengine/default/qmaster/users/mmusy"

# cat /var/spool/gridengine/default/qmaster/users/mmusy
# Version: 2011.11p1
# 
# DO NOT MODIFY THIS FILE MANUALLY!
# 
name mmusy
oticket 0
fshare 0
delete_time 0
usage cpu=0.000000 mem=0.000000 io=0.000000
usage_time_stamp 1372869434
long_term_usage cpu=0.000000 mem=0.000000 io=0.000000
default_project NONE


Also, there are some warnings about the spool directory:

05/13/2014 09:22:06|worker|ant-master2|W|rule "default rule (spool dir)" in spooling context "flatfile spooling" failed writing an object


In my case the spool directory is on a local disk:

# df -h /var/
Filesystem         Size  Used Avail Use% Mounted on
/dev/cciss/c0d0p7   75G  767M   70G   2% /var

This is my scheduler configuration:

# qconf -ssconf
algorithm                         default
schedule_interval                 0:0:30
maxujobs                          0
queue_sort_method                 seqno
job_load_adjustments              np_load_avg=0.50
load_adjustment_decay_time        0:7:30
load_formula                      np_load_avg
schedd_job_info                   false
flush_submit_sec                  0
flush_finish_sec                  0
params                            PROFILE=1 MONITOR=1
reprioritize_interval             0:0:0
halftime                          24
usage_weight_list                 cpu=0.500000,mem=0.500000,io=0.000000
compensation_factor               1.000000
weight_user                       0.250000
weight_project                    0.250000
weight_department                 0.250000
weight_job                        0.250000
weight_tickets_functional         0
weight_tickets_share              100000
share_override_tickets            TRUE
share_functional_shares           TRUE
max_functional_jobs_to_schedule   200
report_pjob_tickets               TRUE
max_pending_tasks_per_job         50
halflife_decay_list               io=-1
policy_hierarchy                  S
weight_ticket                     1.000000
weight_waiting_time               0.000000
weight_deadline                   3600000.000000
weight_urgency                    0.001000
weight_priority                   0.001000
max_reservation                   150
default_duration                  9999:00:00
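
To put the configured schedule_interval next to the "schedd run took" figure from the log, here is a small Python sketch. The helper names hms_to_seconds and parse_ssconf are hypothetical, the parsing assumes every line of the qconf -ssconf output is a name followed by whitespace and a value, and the run time is simply copied by hand from the log excerpt.

```python
def hms_to_seconds(value):
    """Convert an 'h:m:s' string such as '0:0:30' to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def parse_ssconf(text):
    """Split each 'name   value' line into a dict entry (assumed layout)."""
    conf = {}
    for line in text.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf

# A few lines copied from the configuration listing above.
SAMPLE = """\
schedule_interval                 0:0:30
load_adjustment_decay_time        0:7:30
params                            PROFILE=1 MONITOR=1
"""

conf = parse_ssconf(SAMPLE)
interval = hms_to_seconds(conf["schedule_interval"])
run_seconds = 7014.870  # "schedd run took" value copied from the log excerpt
ratio = run_seconds / interval
print(f"interval = {interval} s, one run lasted about {ratio:.0f} intervals")
```

The same 7014.870 s run is also well beyond the 600 seconds mentioned in the acknowledge timeout message.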

Has anyone seen any of these errors?
Any advice for running this many jobs?
Am I reaching some internal SGE limit?

I'm tempted to enable debug output, but I'm worried about the amount of
information it could generate.

TIA,
Arnau