Hi there,

We have recently been seeing issues on our School's SGE grid where
the qmaster (8.0.0) became unresponsive after around a day's uptime,
showing a message akin to

02/21/2014 11:40:00|worker|mymaster|E|not enough memory to allocate 1048576 bytes in init_packbuffer

with clients seeing the usual "gdi timeout" messages when attempting
qstat/qsub etc.

After the qmaster was brought up to 8.1.6, the unresponsiveness
started kicking in almost immediately, although no discernible
memory-related logging appears.

A qping of the master shows that the older version reports

status:                   1

with all the threads in an E state, whilst with the 8.1.6 master, the process
entered

status:                   2

fairly quickly after a restart.

Despite any apparent mismatch between the i386 master and x86_64 execds,
the system here has only just started to misbehave - though perhaps we've just
been "lucky".

Whilst I doubt anyone else out there will have such a system as ours, if anyone
has any suggestions for debugging such issues at a deeper level than a basic
qping, which is what most of the postings a web search unearthed seem to
suggest, I'd be delighted to hear them.

Kevin M. Buckley

eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
