It seems to me that this likely isn't a problem with GridEngine itself, but rather something going on with that node. You'd have to debug it by looking at 'top' output, the system logs, iostat, and ifstat to see whether it's heavy usage by some existing job or some kind of kernel hang. Either way, it's likely a system issue, not a GridEngine issue.
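If it helps as a starting point, here is a rough sketch of the kind of check I mean, to run on the slow node itself. It is not a GridEngine tool and the function names are made up for this example. It assumes a Linux node with /proc mounted, prints the load averages, and lists processes stuck in state 'D' (uninterruptible sleep), which often points at hung I/O or a kernel-level stall rather than plain CPU load.

```python
#!/usr/bin/env python3
"""Quick look at load and blocked processes on a slow exec node (Linux only)."""
import glob


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_stat(text):
    """Return (pid, command, state) from the content of a /proc/<pid>/stat file."""
    # The command sits in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    pid = int(text[:text.index("(")])
    comm = text[text.index("(") + 1:text.rindex(")")]
    state = text[text.rindex(")") + 1:].split()[0]
    return pid, comm, state


def blocked_processes():
    """List (pid, command) for processes in state 'D' (uninterruptible sleep)."""
    blocked = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError):
            continue  # the process exited while we were reading it
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    with open("/proc/loadavg") as handle:
        print("load average (1/5/15 min):", parse_loadavg(handle.read()))
    for pid, comm in blocked_processes():
        print("blocked:", pid, comm)
```

A high load average with few busy CPUs in 'top' and a growing list of blocked processes would suggest looking at storage or the kernel log next. A single run is only a snapshot, so it is worth repeating it a few times while a submission is hanging.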
Dan

On Thu, Jan 17, 2019 at 9:58 PM Derek Stephenson <derek.stephen...@awaveip.com> wrote:

> Hello,
>
> I should preface this by saying I've just recently started getting my head
> around grid engine and as such may not have all the information I should
> for administering the grid, but someone has to do it. Anyways...
>
> Our company came across an issue recently where one of the nodes seems to
> become very delayed in its response to grid submissions. Whether it be a
> qsub, qrsh or qlogin submission, jobs can take anywhere from 30s to 4-5min
> to successfully submit. In particular, while users may complain that a qsub
> job looks like it has submitted but does nothing, doing a qlogin to the node
> in question will give the following:
>
> Your job 287104 ("QLOGIN") has been submitted
> waiting for interactive job to be scheduled ...timeout (3 s) expired while
> waiting on socket fd 7
>
> Now I've seen a series of forum articles bring up this message while
> searching through back logs, but there never seem to be any conclusions in
> those threads for me to start delving into on our end.
>
> Our past attempts to resolve the issue have only succeeded by rebooting
> the node in question, and not having any real ideas on why is becoming a
> general frustration.
>
> Any initial thoughts/pointers would be greatly appreciated.
>
> Kind Regards,
>
> Derek
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users