Hi all,

I recently moved from SGE 6.2u5 to OGS/GE 2011.11p1. It wasn't a full upgrade; we only replaced the qmaster binary to get rid of the -pe + hold_jid bug.

Since we moved to 2011.11p1, we have had to restart the qmaster frequently, about once every two weeks. A couple of times it even crashed with a segfault:

# grep -i "kernel: sge" /var/log/messages
Sep 19 14:43:54 floquet kernel: sge_qmaster[7529]: segfault at 18 ip 00000000005e6524 sp 00007fa1b97f5ba0 error 4 in sge_qmaster[400000+27f000]
Sep 30 11:25:20 floquet kernel: sge_qmaster[25325]: segfault at 18 ip 00000000005a1dfd sp 00007f014ddfabc8 error 6 in sge_qmaster[400000+27f000]
Oct 14 20:44:49 floquet kernel: sge_qmaster[9218]: segfault at 98 ip 00000000005e6524 sp 00007fedb32f5ba0 error 4 in sge_qmaster[400000+27f000]
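
Two of the three crashes are at the same instruction pointer (0x5e6524). If the binary still has symbols, I guess something like this would tell which function that is (the bin/<arch> directory depends on your install, lx26-amd64 here is just a guess):

# addr2line -f -e $SGE_ROOT/bin/lx26-amd64/sge_qmaster 0x5e6524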


I have been digging around a little and have found some hints:

On yesterday's incident:

A few hours before the crash, a bunch of jobs were killed (the user confirmed he had issued a massive qdel).
spool/qmaster/messages:
190x: 10/14/2013 18:06:11|worker|floquet|W|job 2995354.1 failed on host compute-0-2.local assumedly after job because: job 2995354.1 died through signal KILL (9)
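
In case it is useful, this is roughly how I counted those events per hour (everything after the first "|" stripped; path relative to the cell directory as above):

# grep "died through signal KILL" spool/qmaster/messages | awk -F'|' '{print $1}' | cut -d: -f1 | sort | uniq -c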

And I found some jobs that were marked for deletion before the crash but were not actually removed until qmaster was started again:

common/reporting, before 20:44:

...
1381776214:acct:all.q:compute-1-7.local:pluisi:pluisi:ENSG00000233013:2957875:sge:0:1381662633:1381775408:1381776214:0:2:806:800.174355:4.092377:363760.000000:0:0:0:0:112512:0:0:0.000000:8:0:0:0:114:81458:NONE:defaultdepartment:NONE:1:0:804.266732:160.128992:25.885151:-U jaume -u pluisi -l h_vmem=4.8G:0.000000:NONE:438968320.000000:0:0
1381776214:job_log:1381776214:finished:2957875:0:NONE:r:execution daemon:compute-1-7.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job exited
1381776214:job_log:1381776214:finished:2957875:0:NONE:r:master:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job waits for schedds deletion


And after 21:40 (when I restarted the daemon):

1381779617:job_log:1381779617:deleted:2957870:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000197070:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957872:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000229926:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957873:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000159247:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957871:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000233998:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957874:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000237419:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957875:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
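
If it helps, a rough way to see how many jobs pile up in that state is to grep the reporting file for that message (same file as above; it is just a count, it also includes jobs that were later cleaned up normally):

# grep -c "job waits for schedds deletion" common/reporting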




Besides that, in the two previous crashes I also saw peaks in job deletions right before, or a few hours before, the crash:

spool/qmaster/messages:
34x: 09/30/2013 11:16:04|worker|floquet|E|master task of job 2678898.1 failed - killing job
32x: 09/30/2013 11:16:09|worker|floquet|W|job 2678910.1 failed on host compute-1-4.local assumedly after job because: job 2678910.1 died through signal KILL (9)
32x: 09/19/2013 10:28:30|worker|floquet|W|job 2657886.1 failed on host compute-0-11.local assumedly after job because: job 2657886.1 died through signal KILL (9)

And in the common/reporting log we can see a lot of activity: very fast-finishing jobs, failing jobs, etc.
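
As a rough measure of that churn, I count job_log records per second in the reporting file (assuming it is written chronologically, so consecutive identical timestamps group together):

# awk -F: '$2=="job_log"' common/reporting | cut -d: -f1 | uniq -c | tail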



It seems to me that whenever the scheduler is under high load, something goes wrong inside the qmaster that eventually crashes it. Is there any known bug in the 2011.11p1 qmaster? Could it be that my scheduler interval (5 seconds) is too short, so that when there are too many jobs the scheduling runs end up overlapping and crash the scheduler?
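
For reference, the interval I mean is schedule_interval in the scheduler configuration; I would check it and, if that theory holds, raise it (to e.g. 0:0:15) with:

# qconf -ssconf | grep schedule_interval
# qconf -msconf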

Thanks in advance,

Txema