Hi all,

I recently moved from SGE 6.2u5 to OGS/GE 2011.11p1. It wasn't a full upgrade; we only replaced the qmaster binary to get rid of the -pe + hold_jid bug.

Since we moved to 2011.11p1, we have had to restart the qmaster frequently, about once every two weeks. A couple of times it even crashed with a segfault:

# grep -i "kernel: sge" /var/log/messages
Sep 19 14:43:54 floquet kernel: sge_qmaster[7529]: segfault at 18 ip 00000000005e6524 sp 00007fa1b97f5ba0 error 4 in sge_qmaster[400000+27f000]
Sep 30 11:25:20 floquet kernel: sge_qmaster[25325]: segfault at 18 ip 00000000005a1dfd sp 00007f014ddfabc8 error 6 in sge_qmaster[400000+27f000]
Oct 14 20:44:49 floquet kernel: sge_qmaster[9218]: segfault at 98 ip 00000000005e6524 sp 00007fedb32f5ba0 error 4 in sge_qmaster[400000+27f000]
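
Two of the three crashes are at the same instruction pointer (0x5e6524). If the binary still has symbols, I guess something like this would tell which function that is (the bin/<arch> directory depends on your install, lx26-amd64 here is just a guess):

# addr2line -f -e $SGE_ROOT/bin/lx26-amd64/sge_qmaster 0x5e6524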


I have been digging around a little and have found some hints:

On yesterday's incident:

A few hours before the crash, a bunch of jobs were killed (the user confirmed he had issued a massive qdel).
spool/qmaster/messages:
190x: 10/14/2013 18:06:11|worker|floquet|W|job 2995354.1 failed on host compute-0-2.local assumedly after job because: job 2995354.1 died through signal KILL (9)
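
In case it is useful, this is roughly how I counted those events per hour (everything after the first "|" stripped; path relative to the cell directory as above):

# grep "died through signal KILL" spool/qmaster/messages | awk -F'|' '{print $1}' | cut -d: -f1 | sort | uniq -c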

And I found some jobs that were marked for deletion before the crash but were not actually removed until qmaster was started again:

common/reporting, before 20:44:

...
1381776214:acct:all.q:compute-1-7.local:pluisi:pluisi:ENSG00000233013:2957875:sge:0:1381662633:1381775408:1381776214:0:2:806:800.174355:4.092377:363760.000000:0:0:0:0:112512:0:0:0.000000:8:0:0:0:114:81458:NONE:defaultdepartment:NONE:1:0:804.266732:160.128992:25.885151:-U jaume -u pluisi -l h_vmem=4.8G:0.000000:NONE:438968320.000000:0:0
1381776214:job_log:1381776214:finished:2957875:0:NONE:r:execution daemon:compute-1-7.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job exited
1381776214:job_log:1381776214:finished:2957875:0:NONE:r:master:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job waits for schedds deletion


And after 21:40 (when I restarted the daemon):

1381779617:job_log:1381779617:deleted:2957870:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000197070:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957872:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000229926:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957873:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000159247:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957871:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000233998:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957874:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000237419:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
1381779617:job_log:1381779617:deleted:2957875:0:NONE:T:scheduler:floquet.local:0:1024:1381662633:ENSG00000233013:pluisi:pluisi::defaultdepartment:sge:job deleted by schedd
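
If it helps, a rough way to see how many jobs pile up in that state is to grep the reporting file for that message (same file as above; it is just a count, it also includes jobs that were later cleaned up normally):

# grep -c "job waits for schedds deletion" common/reporting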




Besides that, in the two previous crashes I also saw peaks in job deletions right before, or a few hours before, the crash:

spool/qmaster/messages:
34x: 09/30/2013 11:16:04|worker|floquet|E|master task of job 2678898.1 failed - killing job
32x: 09/30/2013 11:16:09|worker|floquet|W|job 2678910.1 failed on host compute-1-4.local assumedly after job because: job 2678910.1 died through signal KILL (9)
32x: 09/19/2013 10:28:30|worker|floquet|W|job 2657886.1 failed on host compute-0-11.local assumedly after job because: job 2657886.1 died through signal KILL (9)

And in the common/reporting log we can see a lot of activity: very fast-finishing jobs, failing jobs, etc.
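
As a rough measure of that churn, I count job_log records per second in the reporting file (assuming it is written chronologically, so consecutive identical timestamps group together):

# awk -F: '$2=="job_log"' common/reporting | cut -d: -f1 | uniq -c | tail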



It seems to me that whenever the scheduler is under high load, something goes wrong inside the qmaster that eventually crashes it. Is there any known bug in the 2011.11p1 qmaster? Could it be that my scheduler interval (5 seconds) is too short, so that when there are too many jobs the scheduling runs end up overlapping and crash the scheduler?
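
For reference, the interval I mean is schedule_interval in the scheduler configuration; I would check it and, if that theory holds, raise it (to e.g. 0:0:15) with:

# qconf -ssconf | grep schedule_interval
# qconf -msconf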

Thanks in advance,

Txema