We're running Maui 3.6.2p11 with Torque 2.1.6 (I realize these may not
be the most current versions, but we don't switch that often).

This combination has been running fine for several months. However, in 
the last 3 weeks, every Friday, Maui has been constantly reporting 
messages like the following:

cannot get node info: Unknown Job Id
cannot get node info: NULL
cannot get node info: Execution server rejected request MSG=connection to mom timed out
cannot get node info: Execution server rejected request MSG=cannot send job to mom, state=PRERUN
job '1545086' cannot be started: (rc: 15031  errmsg: 'Premature end of message')

These are followed, of course, by attempts to re-initialize the PBS interface.

These would seem to imply some kind of networking issue, although Maui
is communicating with Torque on the same machine. When these messages 
occur, Maui doesn't successfully re-initialize the PBS interface - we 
always have to kill it and then restart it manually. Often, we're doing
this every 3 minutes or so between 6am and 9am (like we are now).
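
To check whether the failures really cluster on Friday mornings, one
option is to tally the messages by hour. The following is only a minimal
sketch: it assumes each log line starts with a "MM/DD HH:MM:SS" timestamp
(an assumption, so adjust the regular expression to the real log format),
and the names tally_errors and print_report are hypothetical helpers, not
part of Maui or Torque.

```python
import re
from collections import Counter

# Assumed line prefix: "MM/DD HH:MM:SS " (an assumption; adjust to the real log).
TIMESTAMP = re.compile(r"^(\d{2})/(\d{2}) (\d{2}):\d{2}:\d{2}\s")

# Substrings taken from the messages quoted in this post.
PATTERNS = ("cannot get node info", "cannot be started")


def tally_errors(lines):
    """Count matching lines per (month, day, hour) bucket."""
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None:
            continue  # skip lines that do not start with a timestamp
        if any(pattern in line for pattern in PATTERNS):
            month, day, hour = (int(group) for group in match.groups())
            counts[(month, day, hour)] += 1
    return counts


def print_report(path):
    """Print one line per hour bucket that contains matching messages."""
    with open(path, errors="replace") as log:
        for (month, day, hour), n in sorted(tally_errors(log).items()):
            print(f"{month:02d}/{day:02d} {hour:02d}:00  {n}")
```

Calling print_report() with the path to the Maui log would list the hours
in which these messages appear, which could then be compared against cron
jobs, backups, or anything else scheduled for Friday mornings.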

Any ideas about how to troubleshoot this accurately (and quickly)?

Michael Durket


_______________________________________________
mauiusers mailing list
mauiusers@supercluster.org
http://www.supercluster.org/mailman/listinfo/mauiusers
