Hi William,

Please also try this option:
Apply ticket #607, `mds: remove O_NONBLOCK option on MDS TCP transport sockets [#607]`, to OpenSAF 4.4 (the version you are using). The fix is part of OpenSAF 4.5.0. With the #607 patch, Adrian Szwej was able to get past 45 containers, up to 67 containers. See more details in:
https://sourceforge.net/p/opensaf/tickets/607 (Mds : tcp assert in MDS on accumulated unsent messages)

-AVM

On 10/27/2015 1:08 PM, A V Mahesh wrote:
> Hi William,
>
> Even though there are considerable changes from opensaf 4.4.0 to
> opensaf 4.7.0 in the DTM sockets, their options, the default
> buffer configuration, etc., a basic issue like `payloads and
> controllers periodically lose contact with the cluster` shouldn't
> happen even with opensaf 4.4.0. We have been using/testing virtual
> machine networks with both the Host-only networking and Bridged
> networking adapter options, and we haven't faced such an issue.
>
> So can you please share your virtual machines' network adapter
> configuration?
> Are you observing the same behavior with opensaf 4.7.0 as well?
>
> -AVM
>
> On 10/27/2015 4:26 AM, William R Elliott wrote:
>> Hello All,
>> We are currently using opensaf 4.4.0. We have a cluster that is
>> running on redhat 6 on virtual machines. For some unknown reason
>> various payloads and controllers periodically lose contact with the
>> cluster. The /var/log/messages logs don't tell us anything that we
>> can see except the node lost message.
>>
>> I've been looking at the dtmd code to see if I can get some idea
>> about what to start looking at to try to figure out what's going on.
>> One of the things I've been researching is the TCP idle, interval,
>> and probes settings in the dtmd.conf file. From what I can tell so
>> far, the code indicates these values are set in the DTM_INTERNODE_CB
>> structure and are used to set attributes on the socket by calling the
>> setsockopt function. So it seems to me dtmd is relying on the TCP
>> keep alive functionality to determine if a node is lost.
>> Currently it looks like the idle time is set to 2 seconds, the
>> interval is 1 second, and the number of probes is 2. Therefore, if
>> the socket is idle for 2 seconds, a keep alive probe will be sent.
>> If there is no acknowledgement, the next probe will be sent after a
>> one second interval, and if there is still no acknowledgement the
>> lost node message is issued.
>>
>> Since this kind of lower level TCP functionality is new to me, I
>> started researching TCP keep alive and encountered the following
>> statements concerning relying on TCP keep alive functionality to tell
>> if communication has been lost:
>>
>> Do NOT try to use TCP Keepalive to detect TCP socket failure more
>> quickly than a few minutes. People who try to set it for 5 seconds
>> (or for milliseconds) invariably cause serious compatibility issues
>> with other products - and invariably fail to be satisfied.
>> If you truly require detecting a TCP socket failure in 1 second or
>> less, which implies your TCP peers normally send data many times per
>> second, then use non-blocking sockets with the "socket.timeout"
>> exception to detect when no data had been received in your required
>> time-frame. And if you accept that a TCP peer quiet for 1 second is
>> bad, then close the socket manually and attempt recovery directly.
>> Do not use TCP Keepalive for such short-period detection.
>>
>> Or the following link to a forum site that has several comments
>> discouraging relying on TCP keep alive to determine if a connection
>> is alive:
>> http://stackoverflow.com/questions/15230922/keepalive-time-cannot-reduce-below-one-minute-in-c
>>
>> This is the first time I have looked into dtmd and since I don't
>> have the history and experience, it's possible I have missed
>> something, or misunderstood the code. So here are my questions:
>>
>> 1) Am I correct that dtmd relies on TCP keep alive to determine
>> if a connection is alive?
>>
>> 2) Since I don't have access to, nor will I be given access to, the
>> environment I've mentioned, are there any utilities besides ping,
>> traceroute... that I can ask the users of this environment to run to
>> help determine what could be causing the periodic lost nodes? I'm
>> currently looking at tcpdump, and writing a python script that uses
>> the socket APIs to connect to a particular port and use keep alive
>> functionality.
>>
>> Any suggestions would be greatly appreciated.
>>
>> thanks
>>
>> _______________________________________________
>> Opensaf-users mailing list
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
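
For anyone who wants to reproduce the keep alive timings discussed in the quoted message (idle 2 seconds, interval 1 second, 2 probes) on a test socket, here is a minimal Python sketch. It is only an illustration and is not taken from the dtmd source: the function names `apply_keepalive`, `approx_detection_window` and `peer_is_quiet` are made up for this example, the `TCP_KEEPIDLE`, `TCP_KEEPINTVL` and `TCP_KEEPCNT` options are Linux-specific, and the detection window is an approximation whose exact value depends on the kernel. The last function shows the application-level alternative from the quoted advice: treat a peer as quiet when no data arrives within a timeout.

```python
import socket


def apply_keepalive(sock, idle_s=2, interval_s=1, probes=2):
    """Enable TCP keepalive on sock (hypothetical helper, Linux-only options).

    The defaults mirror the values quoted in the thread; they are not read
    from dtmd.conf.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Seconds of idle time before the first keepalive probe is sent.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    # Seconds between successive unanswered probes.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    # Number of unanswered probes before the connection is dropped.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)


def approx_detection_window(idle_s=2, interval_s=1, probes=2):
    """Rough number of seconds before an idle, dead peer is reported.

    Approximation: the idle time plus one interval per unanswered probe.
    """
    return idle_s + interval_s * probes


def peer_is_quiet(sock, timeout_s):
    """Return True if nothing arrives on sock within timeout_s seconds.

    Uses MSG_PEEK so pending data is not consumed. Note that this leaves
    the socket's timeout set to timeout_s. Returns False if data is
    pending or the peer has closed the connection.
    """
    sock.settimeout(timeout_s)
    try:
        sock.recv(1, socket.MSG_PEEK)
    except socket.timeout:
        return True
    return False


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    apply_keepalive(s)
    print("approximate detection window (s):", approx_detection_window())
    s.close()
```

Running tcpdump on the keep alive port while a script like this holds a connection open should show whether the probes and their acknowledgements actually cross the virtual network.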
