Hi William,

Even though there are considerable changes between opensaf 4.4.0 and opensaf 4.7.0 in the DTM sockets, their options, the default buffer configuration, etc., a basic issue like `payloads and controllers periodically lose contact with the cluster` shouldn't happen even with opensaf 4.4.0. We have been using and testing virtual machine networks with both the Host-only and Bridged networking adapter options, and we haven't faced such an issue.
So can you please share your virtual machines' network adapter configuration? Are you observing the same behavior with opensaf 4.7.0 as well?

-AVM

On 10/27/2015 4:26 AM, William R Elliott wrote:
> Hello All,
> We are currently using opensaf 4.4.0. We have a cluster that is running on
> redhat 6 on virtual machines. For some unknown reason various payloads and
> controllers periodically lose contact with the cluster. The /var/log/messages
> logs don't tell us anything that we can see except the node lost message.
>
> I've been looking at the dtmd code to see if I can get some idea about what
> to start looking at to try to figure out what's going on. One of the things
> I've been researching is the TCP idle, interval, and probes settings in the
> dtmd.conf file. From what I can tell so far, the code indicates these values
> are set in the DTM_INTERNODE_CB structure and are used to set attributes on
> the socket by calling the setsockopt function. So it seems to me dtmd is
> relying on the TCP keep alive functionality to determine if a node is lost.
> Currently it looks like the idle time is set to 2 seconds, the interval is 1
> second, and the number of probes is 2. Therefore, if the socket is idle for
> 2 seconds, a keep alive probe is sent; if it is not acknowledged, the next
> probe is sent after a one second interval, and if that is still not
> acknowledged the lost node message is issued.
>
> Since this kind of lower level TCP functionality is new to me, I started
> researching TCP keep alive and encountered the following statements
> concerning relying on TCP keep alive functionality to tell if communication
> has been lost:
>
> Do NOT try to use TCP Keepalive to detect TCP socket failure more quickly
> than a few minutes. People who try to set it for 5 seconds (or for
> milliseconds) invariably cause serious compatibility issues with other
> products - and invariably fail to be satisfied.
> If you truly require detecting a TCP socket failure in 1 second or less,
> which implies your TCP peers normally send data many times per second, then
> use non-blocking sockets with the "socket.timeout" exception to detect
> when no data had been received in your required time-frame. And if you accept
> that a TCP peer quiet for 1 second is bad, then close the socket manually and
> attempt recovery directly. Do not use TCP Keepalive for such short-period
> detection.
>
> Or the following link to a forum site that has several comments discouraging
> relying on TCP keep alive to determine if a connection is alive:
> http://stackoverflow.com/questions/15230922/keepalive-time-cannot-reduce-below-one-minute-in-c
>
> This is the first time I have looked into dtmd and since I don't have the
> history and experience, it's possible I have missed something, or
> misunderstood the code. So here are my questions:
>
> 1) Am I correct that dtmd relies on TCP keep alive to determine if a
> connection is alive?
>
> 2) Since I don't have access to, nor will I be given access to, the
> environment I've mentioned, are there any utilities besides ping,
> traceroute... that I can ask the users of this environment to run to help
> determine what could be causing the periodic lost nodes? I'm currently
> looking at tcpdump, and writing a python script that uses the socket APIs to
> connect to a particular port and use keep alive functionality.
>
> Any suggestions would be greatly appreciated.
>
> thanks

_______________________________________________
Opensaf-users mailing list
https://lists.sourceforge.net/lists/listinfo/opensaf-users
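
P.S. A minimal sketch of the kind of Python script mentioned in the quoted message, in case it helps with the investigation. It assumes a Linux host (the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT socket options are platform-specific), and the function names are made up for illustration; it is not taken from dtmd and does not claim to reproduce what dtmd does. The defaults 2/1/2 are the idle, interval and probe values reported in the quoted message. With those values a silent peer would be declared dead after roughly idle + interval * probes = 2 + 1 * 2 = 4 seconds.

```python
import socket
import time


def keepalive_detection_seconds(idle, interval, probes):
    """Approximate time before a silent peer is declared dead.

    The first probe goes out after `idle` seconds of silence; each of the
    `probes` unanswered probes then costs one `interval`.
    """
    return idle + interval * probes


def enable_keepalive(sock, idle=2, interval=1, probes=2):
    """Turn on TCP keepalive and apply the timing values where supported.

    Defaults are the values reported in the quoted message (assumption:
    they match the dtmd.conf in use on the affected cluster).
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These three options are not available on every platform,
    # so each one is set only if the socket module exposes it.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def watch_connection(host, port, idle=2, interval=1, probes=2):
    """Connect, enable keepalive, and report how the connection ends.

    `host` and `port` are whatever test endpoint is chosen; nothing here
    is specific to OpenSAF.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.settimeout(None)  # block in recv() until data or an error
        enable_keepalive(sock, idle, interval, probes)
        start = time.time()
        try:
            while True:
                if not sock.recv(4096):
                    return "peer closed after %.1fs" % (time.time() - start)
        except OSError as exc:
            # A keepalive failure normally surfaces here (ETIMEDOUT).
            return "connection lost after %.1fs: %s" % (time.time() - start, exc)
```

For example, `watch_connection("192.0.2.10", 5000)` (a hypothetical address and port) blocks until the connection ends and returns a short description of why and after how long. Running tcpdump on the same connection at the same time would show whether keepalive probes are being sent and left unanswered.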
