Hi all,
I recently moved from SGE 6.2u5 to OGS/GE 2011.11p1. It wasn't a full
upgrade; we only replaced the qmaster binary to get rid of the -pe +
hold_jid bug.
Since we moved to 2011.11p1, we have had to restart the qmaster
frequently, about once every two weeks. A couple of times it has even
crashed with a segfault:
# grep -i "kernel: sge" /var/log/messages
Sep 19 14:43:54 floquet kernel: sge_qmaster[7529]: segfault at 18 ip
00000000005e6524 sp 00007fa1b97f5ba0 error 4 in sge_qmaster[400000+27f000]
Sep 30 11:25:20 floquet kernel: sge_qmaster[25325]: segfault at 18 ip
00000000005a1dfd sp 00007f014ddfabc8 error 6 in sge_qmaster[400000+27f000]
Oct 14 20:44:49 floquet kernel: sge_qmaster[9218]: segfault at 98 ip
00000000005e6524 sp 00007fedb32f5ba0 error 4 in sge_qmaster[400000+27f000]
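For what it's worth, the faulting instruction pointer in those kernel lines can be turned into a source location if the matching binary is at hand. A minimal sketch, assuming a non-PIE sge_qmaster loaded at 0x400000 (as the `sge_qmaster[400000+27f000]` mapping suggests); the binary path is an example, not taken from this setup:

```shell
# Offset of the faulting ip within the text mapping (ip - load base);
# values taken from the first segfault line above.
ip=0x5e6524
base=0x400000
offset=$(printf '0x%x' $(( ip - base )))
echo "$offset"
# With a symbol-bearing binary, addr2line can then name the crash site
# (the path below is an assumption, adjust to your install):
# addr2line -f -e /opt/gridengine/bin/lx26-amd64/sge_qmaster "$ip"
```

Note that the same ip (0x5e6524) shows up in two of the three crashes, which suggests a single recurring code path.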
I have been digging a little and have found some hints.
About yesterday's incident:
A few hours before the crash, a bunch of jobs were killed (the user
confirmed he had issued a massive qdel):
spool/qmaster/messages
190x
10/14/2013 18:06:11|worker|floquet|W|job 2995354.1 failed on host compute-0-2.local assumedly after job because: job 2995354.1 died through signal KILL (9)
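For reference, counts like the "190x" above can be produced by grepping the spool messages file for the repeated message. A runnable sketch with two sample lines modeled on the excerpt (on a real cell, point MESSAGES at spool/qmaster/messages instead):

```shell
# Build a tiny sample of spool/qmaster/messages; the lines are modeled on
# the excerpt above (second job id/host are made up for illustration).
MESSAGES=$(mktemp)
cat > "$MESSAGES" <<'EOF'
10/14/2013 18:06:11|worker|floquet|W|job 2995354.1 failed on host compute-0-2.local assumedly after job because: job 2995354.1 died through signal KILL (9)
10/14/2013 18:06:12|worker|floquet|W|job 2995360.1 failed on host compute-0-5.local assumedly after job because: job 2995360.1 died through signal KILL (9)
EOF
# Count how many jobs died through SIGKILL
kills=$(grep -c 'died through signal KILL (9)' "$MESSAGES")
echo "$kills"
rm -f "$MESSAGES"
```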
I also found some jobs that were marked for deletion before the crash
but were not actually deleted until qmaster was started again:
common/reporting
Before 20:44
...
1381776214:acct:all.q:compute-1-7.local:pluisi:pluisi:ENSG00000233013:2957875:sge:0:1381662633:1381775408:1381776214:0:2:806:800.174355:4.092377:363760.000000:0:0:0:0:112512:0:0:0.000000:8:0:0:0:114:81458:NONE:defaultdepartment:NONE:1:0:804.266732:160.128992:25.885151:-U jaume -u pluisi -l h_vmem=4.8G:0.000000:NONE:438968320.000000:0:0
1381776214:job_log:1381776214:finished:2957875:0:NONE:r:execution daemon:compute-1-7.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job exited
1381776214:job_log:1381776214:finished:2957875:0:NONE:r:master:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job waits for schedds deletion
And after 21:40 (when I restarted the daemon):
1381779617:job_log:1381779617:deleted:2957870:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000197070:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957872:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000229926:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957873:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000159247:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957871:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000233998:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957874:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000237419:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957875:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
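The stuck records for a single job can be pulled out of the reporting file by job id (field 5 of the colon-separated job_log records). A minimal sketch with two abbreviated sample records embedded so it runs standalone; on a real cell, point REPORTING at common/reporting:

```shell
# Sample job_log records (abbreviated from the excerpt above);
# use the real common/reporting file on an actual install.
REPORTING=$(mktemp)
cat > "$REPORTING" <<'EOF'
1381776214:job_log:1381776214:finished:2957875:0:NONE:r:master:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job waits for schedds deletion
1381779617:job_log:1381779617:deleted:2957875:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
EOF
# Count the job_log records for job 2957875 (field 2 is the record
# type, field 5 the job id)
matches=$(awk -F: '$2 == "job_log" && $5 == 2957875 {n++} END {print n+0}' "$REPORTING")
echo "$matches"
rm -f "$REPORTING"
```

Comparing the timestamps of the "job waits for schedds deletion" record and the eventual "job deleted by schedd" record shows how long the deletion was stuck.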
Besides that, in the two previous crashes I have also seen peaks in
job deletions right before, or a few hours before, the crash:
spool/qmaster/messages
34x
09/30/2013 11:16:04|worker|floquet|E|master task of job 2678898.1 failed - killing job
32x
09/30/2013 11:16:09|worker|floquet|W|job 2678910.1 failed on host compute-1-4.local assumedly after job because: job 2678910.1 died through signal KILL (9)
32x
09/19/2013 10:28:30|worker|floquet|W|job 2657886.1 failed on host compute-0-11.local assumedly after job because: job 2657886.1 died through signal KILL (9)
And in the common/reporting log we can see that there is a lot of
churn: very fast-finishing jobs, failing jobs, etc.
It seems to me that whenever the scheduler is under high load,
something goes wrong inside the qmaster that ends up crashing it.
Is there any known bug in the 2011.11p1 qmaster? Could it be that my
scheduler interval (5 seconds) is too short, so that when there are too
many jobs the scheduling runs end up overlapping and crash the scheduler?
Thanks in advance,
Txema
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users