Hey, this has been solved. It was networking, as always: the master IP had changed for a while, and I simply missed a couple of packets from the wrong IP in tcpdump.
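For anyone who hits the same symptom, here is a minimal sketch of how the stray packets could be spotted automatically instead of by eye. It assumes the capture was saved as text from `tcpdump -n` (IPv4 only) and that qmaster listens on port 6444, which is the usual default but may differ on your cluster. The function name `unexpected_master_ips` and the addresses in the usage note are made up for illustration.

```python
import re

# Matches the address part of a `tcpdump -n` IPv4 line, for example:
# "15:45:01.000001 IP 10.0.0.1.6444 > 10.0.0.9.40000: Flags [P.], length 10"
LINE_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+):"
)


def unexpected_master_ips(lines, expected_ip, qmaster_port=6444):
    """Return source IPs that sent traffic from qmaster_port but are not expected_ip.

    qmaster_port=6444 is an assumption (the common default); adjust as needed.
    """
    found = set()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not an IPv4 TCP/UDP line in the expected format
        src_ip, src_port = match.group(1), int(match.group(2))
        if src_port == qmaster_port and src_ip != expected_ip:
            found.add(src_ip)
    return found
```

Usage would look something like `unexpected_master_ips(open("capture.txt"), "10.0.0.1")`, where `capture.txt` and `10.0.0.1` are placeholders; a non-empty result means something other than the expected master address answered from the qmaster port.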
Best regards,
Taras

On Fri, May 16, 2014 at 3:45 PM, Taras Shapovalov <[email protected]> wrote:

> Hi guys,
>
> Recently I've run into some quite weird behavior of sgeexecd (OGS 2011.11p1);
> maybe you can help me investigate the issue.
>
> I have a cluster with local and EC2 nodes, and qmaster runs locally. On local
> nodes sgeexecd works fine as usual, but sgeexecd on cloud nodes registers
> with qmaster (when it starts) and then, exactly 120 seconds later, tries to
> register with qmaster again (sgeexecd is not restarted at this point)! Of
> course qmaster rejects the registration with a message like this:
>
> commlib error: endpoint is not unique error (endpoint
> "cnode001.cm.cluster/execd/1" is already connected)
>
> After that, jobs hang in the t state (although they have finished).
>
> Could you advise me what I should check, or how I can debug this? I
> don't see any configuration parameter set to 2 minutes, so I don't understand
> what could trigger the re-registration after this period of time. Nothing
> useful is printed when I set SGE_ND and loglevel=log_info.
>
> The only difference I see between local and cloud nodes is that cloud nodes
> have 2 networks (local nodes have only one). But according to netstat (and
> tcpdump), sgeexecd on a cloud node connects to qmaster from the same IP both
> the first time and when it tries to re-register, so the network configuration
> does not seem to be the reason.
>
> Any idea is appreciated!
>
> Thanks,
> Taras
