About 20% of my execute nodes are in the au (alarm, unreachable) error
state, and I have been unable to clear it. Output of
qstat -f
load average
normal2@blade5-5-14 BIP 0/0/24 1.42 lx-amd64
---------------------------------------------------------------------------------
normal2@blade5-5-15 BIP 0/0/24 -NA- lx-amd64 au
---------------------------------------------------------------------------------
normal2@blade5-5-16 BIP 0/0/24 1.25 lx-amd64
---------------------------------------------------------------------------------
normal2@blade5-5-2 BIP 0/0/24 1.31 lx-amd64
---------------------------------------------------------------------------------
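To pull out just the affected queue instances from output like the above, a small filter along these lines might help. This is only a sketch: list_au_hosts is a made-up helper name, not an SGE command, and it assumes the state flags are the last whitespace-separated column of each queue-instance line, as in the listing above.

```shell
# Hypothetical helper (not part of SGE): read `qstat -f` output on stdin
# and print every queue instance whose state column contains both
# 'a' (alarm) and 'u' (unreachable).
# Assumption: queue-instance lines look like
#   queue@host qtype resv/used/tot load arch [states]
# so a states column is present only when there are at least 6 fields.
list_au_hosts() {
  awk '$1 ~ /@/ && NF >= 6 && $NF ~ /a/ && $NF ~ /u/ { print $1 }'
}
```

It would be used as `qstat -f | list_au_hosts`, which for the listing above should print only normal2@blade5-5-15.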
qping from the execute node reaches the master.
Googling the problem suggested trying qping or a reboot.
I rebooted one node, to no avail.
The messages file has hundreds of entries like these:
09/28/2015 10:55:35|worker|blade5-1-1|E|no execd known on host blade5-5-15.dsg.wustl.edu to send conf notification
09/28/2015 10:55:37|worker|blade5-1-1|E|no execd known on host blade5-6-6.dsg.wustl.edu to send conf notification
09/28/2015 10:55:37|worker|blade5-1-1|E|no execd known on host blade5-6-8.dsg.wustl.edu to send conf notification
09/28/2015 10:55:51|worker|blade5-1-1|E|no execd known on host blade5-6-2.dsg.wustl.edu to send conf notification
09/28/2015 10:55:52|worker|blade5-1-1|E|no execd known on host blade5-6-1.dsg.wustl.edu to send conf notification
09/28/2015 10:56:15|worker|blade5-1-1|E|no execd known on host blade5-5-15.dsg.wustl.edu to send conf notification
09/28/2015 10:56:17|worker|blade5-1-1|E|no execd known on host blade5-6-8.dsg.wustl.edu to send conf notificati
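To see which hosts these errors concentrate on, a rough tally per host could be produced with something like the sketch below. count_missing_execd is a made-up helper name, and the sketch assumes each entry occupies a single line in the messages file itself. The path in the usage note is also an assumption (a common default qmaster spool location) and may differ on a given install.

```shell
# Hypothetical helper (not part of SGE): read a messages file on stdin,
# extract the host name from each "no execd known on host <host> ..."
# entry, and print "<count> <host>" with the most frequent host first.
count_missing_execd() {
  sed -n 's/.*no execd known on host \([^ ]*\) .*/\1/p' \
    | sort \
    | uniq -c \
    | sort -rn \
    | awk '{ print $1, $2 }'
}
```

It could be run as `count_missing_execd < "$SGE_ROOT/$SGE_CELL/spool/qmaster/messages"`, adjusting the path to wherever the messages file actually lives.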
qmaster]$ ps -eaf|grep execd
sgeadmin 4147 1 0 Aug31 ? 00:09:43 /opt/sge/bin/lx-amd64/sge_execd
sgeadmin 37198 37150 0 10:59 pts/0 00:00:00 grep execd
We have been migrating from one VLAN to another, but most of the
affected nodes and the master are on the original VLAN.
Any suggestions on where I might go from here?
Thanks,
Dan