Hi there,

We have recently been seeing issues on our School's SGE grid, where the qmaster (8.0.0) became unresponsive after around a day's uptime, showing a message akin to

  02/21/2014 11:40:00|worker|mymaster|E|not enough memory to allocate 1048576 bytes in init_packbuffer

with clients seeing the usual "gdi timeout" messages when attempting qstat/qsub etc.

After the qmaster was brought up to 8.1.6, the unresponsiveness started kicking in almost immediately, although no discernible memory-related logging appears.

A qping of the master suggests that the older version displays status: 1 with all the threads in an E state, whilst with the 8.1.6 master, the process entered status: 2 fairly quickly after a restart.

Despite any apparent mismatch between the i386 master and x86_64 execds, the system here has only just started to misbehave, though perhaps we've just been "lucky".

Whilst I doubt anyone else out there will have such a system as ours, if anyone has any suggestions for debugging such issues at a deeper level than a basic qping, which is what most of the postings a web search unearthed seem to suggest, I'd be delighted to hear of them.

Kevin M. Buckley
eScience Consultant
School of Engineering and Computer Science
Victoria University of Wellington
New Zealand
