There are 32 cores on the machine, and their use is split between interactive and non-interactive jobs. The mix is similar on the other nodes, where we don't experience this issue. The split is done because our interactive jobs tend to be memory intensive but CPU light, while the non-interactive ones tend to be CPU heavy and memory light. So there are other processes running on the node, but they are all inside SGE; only root-related system processes run outside of SGE.
I did find a few processes that were left behind, but cleaning those out had no impact. The gid_range is the default:

gid_range 20000-20100

Regards,
Derek

-----Original Message-----
From: Reuti <re...@staff.uni-marburg.de>
Sent: January 18, 2019 11:26 AM
To: Derek Stephenson <derek.stephen...@awaveip.com>
Cc: users@gridengine.org
Subject: Re: [gridengine users] Dilemma with exec node responsiveness degrading

> On 18.01.2019 at 16:26, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>
> Hi Reuti,
>
> I don't believe anyone has adjusted the scheduler from the defaults, but I see:
> schedule_interval 00:00:04
> flush_submit_sec 1
> flush_finish_sec 1

With a schedule interval of 4 seconds I would set the flush values to zero to avoid too high a load on the qmaster. But this shouldn't be related to the behavior you observe. Are you running jobs with only a few seconds of runtime? Otherwise even a larger schedule interval would do.

> For the qlogin side, I've confirmed that there is no firewall, and previously a reboot alleviated all the issues we were seeing, at least for some time, though the duration seems to be getting smaller... we had to reboot the server 3 weeks ago for the same issue.

Was there anything else running on the node – inside or outside SGE? Were any processes left behind by a former interactive session?

What is the value of:

$ qconf -sconf
…
gid_range 20000-20100

and how many cores are available per node?
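The leftover-process check suggested above can be automated: SGE tags every process belonging to a job with an extra supplementary GID drawn from gid_range, so on an idle node any process still carrying a GID in that range is a likely orphan from a dead job. A minimal sketch, assuming Linux /proc and the 20000-20100 range quoted above (find_leftovers is an illustrative helper, not an SGE tool):

```shell
# find_leftovers LOW HIGH -- print the PID and command name of every
# process carrying a supplementary GID in [LOW, HIGH]. SGE assigns all
# processes of a job an extra GID from gid_range, so on an idle node
# anything this prints is a likely leftover from a finished job.
# Assumes Linux /proc; bounds 20000-20100 are from the qconf output above.
find_leftovers() {
    low=$1 high=$2
    for status in /proc/[0-9]*/status; do
        pid=${status#/proc/}
        pid=${pid%/status}
        # the "Groups:" line lists the supplementary GIDs of the process
        for g in $(awk '/^Groups:/ { $1 = ""; print }' "$status" 2>/dev/null); do
            if [ "$g" -ge "$low" ] && [ "$g" -le "$high" ]; then
                printf '%s %s\n' "$pid" "$(cat "/proc/$pid/comm" 2>/dev/null)"
                break
            fi
        done
    done
}

# On a healthy node with no jobs running, this should print nothing:
find_leftovers 20000 20100
```

Anything it reports while no jobs are on the node is a candidate for cleanup before resorting to a reboot.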
-- Reuti

> Regards,
>
> Derek
>
> -----Original Message-----
> From: Reuti <re...@staff.uni-marburg.de>
> Sent: January 18, 2019 4:51 AM
> To: Derek Stephenson <derek.stephen...@awaveip.com>
> Cc: users@gridengine.org
> Subject: Re: [gridengine users] Dilemma with exec node responsiveness degrading
>
>> On 18.01.2019 at 03:57, Derek Stephenson <derek.stephen...@awaveip.com> wrote:
>>
>> Hello,
>>
>> I should preface this with: I've just recently started getting my head around Grid Engine and as such may not have all the information I should for administering the grid, but someone has to do it. Anyway...
>>
>> Our company came across an issue recently where one of the nodes seems to become very delayed in its response to grid submissions. Whether it is a qsub, qrsh or qlogin submission, jobs can take anywhere from 30 s to 4-5 min to submit successfully. In particular, while users may complain that a qsub job looks like it has been submitted but does nothing, doing a qlogin to the node in question will give the following:
>
> This might, at least for `qsub` jobs, depend on when it was submitted inside the defined scheduling interval. What is the setting of:
>
> $ qconf -ssconf
> ...
> schedule_interval 0:2:0
> ...
> flush_submit_sec 4
> flush_finish_sec 4
>
>> Your job 287104 ("QLOGIN") has been submitted
>> waiting for interactive job to be scheduled ...
>> timeout (3 s) expired while waiting on socket fd 7
>
> For interactive jobs: is there any firewall in place blocking the communication between the submission host and the exec host – maybe switched on at a later point in time? SGE will use a random port for the communication. After the reboot it worked instantly again?
>
> -- Reuti
>
>> Now I've seen a series of forum articles bring up this message while searching through back logs, but there never seems to be any conclusion in those threads for me to start delving into on our end.
>>
>> Our past attempts to resolve the issue have only succeeded by rebooting the node in question, and not having any real idea why is becoming a general frustration.
>>
>> Any initial thoughts/pointers would be greatly appreciated.
>>
>> Kind Regards,
>>
>> Derek
>>
>> _______________________________________________
>> users mailing list
>> users@gridengine.org
>> https://gridengine.org/mailman/listinfo/users
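On the firewall question raised earlier in the thread: since SGE picks a random port for the interactive-job back-channel, a useful check is to probe the specific host/port pair from an actual failing session rather than a fixed port. A rough sketch, assuming bash (for its /dev/tcp pseudo-device) and coreutils `timeout`; port_open and the example host and port are hypothetical:

```shell
# port_open HOST PORT -- exit 0 if a TCP connection succeeds within 3
# seconds (matching qlogin's own 3 s socket timeout). Relies on bash's
# /dev/tcp pseudo-device and the coreutils `timeout` command.
port_open() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Example (placeholder host and port -- substitute the pair from the
# failing session, e.g. as seen in netstat on the submission host):
# port_open submithost 54321 && echo reachable || echo blocked
```

If the probe fails from the exec node while the same port is reachable from elsewhere, something in between is dropping the traffic even though no firewall is believed to be configured.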