Hi guys,

Recently I've run into some quite weird behavior of sgeexecd (OGS 2011.11p1); maybe you can help me investigate the issue.

I have a cluster with local and EC2 nodes; qmaster runs locally. On the local nodes sgeexecd works fine as usual, but on the cloud nodes sgeexecd registers with qmaster at startup and then, exactly 120 seconds later, tries to register with qmaster again (sgeexecd is not restarted at this point)! Of course, qmaster rejects the second registration with a message like this:

    commlib error: endpoint is not unique error (endpoint "cnode001.cm.cluster/execd/1" is already connected)

After that, jobs hang in the t state (although they have finished).

Could you advise me on what I should check, or how I can debug this? I don't see any configuration parameter set to 2 minutes, so I don't understand what could trigger the re-registration after this period. Nothing useful is printed when I set SGE_ND and loglevel=log_info.

The only difference I see between local and cloud nodes is that the cloud nodes have two networks (local nodes have only one). But according to netstat (and tcpdump), sgeexecd on a cloud node connects to qmaster from the same IP both the first time and when it tries to re-register, so the network configuration does not seem to be the cause.

Any ideas are appreciated!

Thanks,
Taras
