Hi William,

Please also try this option:

Apply ticket #607, `mds: remove O_NONBLOCK option on MDS TCP transport 
sockets [#607]`, to OpenSAF 4.4 (the version you are using). The fix is 
part of OpenSAF 4.5.0. With the #607 patch, Adrian Szwej was able to get 
past 45 containers, up to 67 containers.

See more details at: https://sourceforge.net/p/opensaf/tickets/607
(Mds: tcp assert in MDS on accumulated unsent messages)
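
For background on what O_NONBLOCK changes in general, here is a small Python sketch. It is not OpenSAF code and does not reproduce the ticket; it only shows that a non-blocking send fails with "would block" once the kernel buffer is full, so the application has to hold on to the unsent data itself. It uses a local socketpair instead of a real TCP link, and the function name and chunk sizes are made up for the illustration.

```python
import socket

def fill_nonblocking_socket(chunk_size=65536, max_chunks=10000):
    """Send on a non-blocking socket until the kernel buffer is full.

    Returns (bytes_accepted, would_block). Nothing reads from the peer
    end, so the send buffer eventually fills up.
    """
    sender, receiver = socket.socketpair()
    try:
        sender.setblocking(False)  # same effect as setting O_NONBLOCK
        payload = b"x" * chunk_size
        accepted = 0
        for _ in range(max_chunks):
            try:
                accepted += sender.send(payload)
            except BlockingIOError:
                # Buffer full: the caller would have to queue and retry.
                return accepted, True
        return accepted, False
    finally:
        sender.close()
        receiver.close()

accepted, would_block = fill_nonblocking_socket()
print(f"accepted {accepted} bytes, would_block={would_block}")
```

With a blocking socket the same send call would simply wait until the peer drains the buffer.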

- AVM

On 10/27/2015 1:08 PM, A V Mahesh wrote:
> Hi William,
>
> Even though there are considerable changes from OpenSAF 4.4.0 to
> OpenSAF 4.7.0 in the DTM sockets, their options, and the default
> buffer configuration, etc., a basic issue like `payloads and
> controllers periodically lose contact with the cluster` shouldn't
> happen even with OpenSAF 4.4.0. We have been using and testing virtual
> machine networks with both the Host-only networking and Bridged
> networking adapter options, and we haven't faced such an issue.
>
> So can you please share your virtual machines' network adapter
> configuration?
> Are you observing the same behavior with OpenSAF 4.7.0 as well?
>
> -AVM
>
> On 10/27/2015 4:26 AM, William R Elliott wrote:
>> Hello All,
>> We are currently using opensaf 4.4.0.  We have a cluster that is
>> running on Red Hat 6 on virtual machines.  For some unknown reason
>> various payloads and controllers periodically lose contact with the
>> cluster.  The /var/log/messages logs don't tell us anything that we
>> can see except the node lost message.
>>
>> I've been looking at the dtmd code to see if I can get some idea 
>> about what to start looking at to try to figure out what's going on.  
>> One of the things I've been researching is the TCP idle, interval, 
>> and probes settings in the dtmd.conf file.  From what I can tell so 
>> far, the code indicates these values are set in the DTM_INTERNODE_CB 
>> structure and are used to set attributes on the socket by calling the 
>> setsockopt function.  So it seems to me dtmd is relying on the TCP 
>> keep alive functionality to determine if a node is lost.  Currently 
>> it looks like the idle time is set to 2 seconds, the interval is 1 
>> second, and the number of probes is 2.  Therefore, if the socket is
>> idle for 2 seconds, a keep alive probe is sent.  If there is no
>> acknowledgement, the next probe is sent after a one second interval,
>> and if there is still no acknowledgement the lost node message is
>> issued.
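
As a rough check of the timing described above, the arithmetic can be written out as below. This assumes the usual Linux keepalive semantics (idle time, then one probe per interval, connection dropped after the given number of unanswered probes); the function name is made up for the illustration, and the result is an approximation, not a value taken from the dtmd code.

```python
def keepalive_detection_seconds(idle, interval, probes):
    """Approximate time from the last traffic until a silent peer is given up on.

    idle:     seconds of silence before the first probe is sent
    interval: seconds between unanswered probes
    probes:   number of unanswered probes before the connection is dropped
    """
    return idle + interval * probes

# Values quoted above: idle 2 s, interval 1 s, 2 probes.
print(keepalive_detection_seconds(idle=2, interval=1, probes=2))  # 4
```

So with these settings a peer that stops answering would be declared lost after about 4 seconds, which leaves little margin for a virtual machine that is paused or starved for a few seconds.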
>>
>> Since this kind of lower level TCP functionality is new to me, I 
>> started researching TCP keep alive and encountered the following 
>> statements concerning relying on TCP keep alive functionality to tell 
>> if communication has been lost:
>>
>> Do NOT try to use TCP Keepalive to detect TCP socket failure more 
>> quickly than a few minutes. People who try to set it for 5 seconds 
>> (or for milliseconds) invariably cause serious compatibility issues 
>> with other products - and invariably fail to be satisfied.
>> If you truly require detecting a TCP socket failure in 1 second or 
>> less, which implies your TCP peers normally send data many times per 
>> second, then use non-blocking sockets with the "socket.timeout" 
>> exception to detect
>> when no data had been received in your required time-frame. And if 
>> you accept that a TCP peer quiet for 1 second is bad, then close the 
>> socket manually and attempt recovery directly. Do not use TCP 
>> Keepalive for such short-period detection.
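
The alternative recommended in the quoted text (an application-level timeout instead of keepalive) could look roughly like the Python sketch below. It assumes the peer normally sends data more often than max_silence; the function name is hypothetical and the demo uses a local socketpair rather than a real cluster connection.

```python
import socket

def peer_is_talking(sock, max_silence):
    """Return True if data arrives within max_silence seconds, else False."""
    sock.settimeout(max_silence)
    try:
        data = sock.recv(4096)
    except socket.timeout:
        return False  # nothing received within the allowed time
    return len(data) > 0  # empty bytes means the peer closed the connection

# Small local demonstration.
a, b = socket.socketpair()
print(peer_is_talking(a, 0.2))  # peer is silent
b.sendall(b"heartbeat")
print(peer_is_talking(a, 0.2))  # data is waiting
a.close()
b.close()
```

When the function returns False, the caller would close the socket and start its own recovery, as the quoted advice suggests.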
>>
>> Or the following link to a forum site that has several comments
>> discouraging relying on TCP keep alive to determine if a connection
>> is alive:
>> http://stackoverflow.com/questions/15230922/keepalive-time-cannot-reduce-below-one-minute-in-c
>>
>> This is the first time I have looked into dtmd and since I don't
>> have the history and experience, it's possible I have missed
>> something, or misunderstood the code.  So here are my questions:
>>
>> 1) Am I correct that dtmd relies on TCP keep alive to determine
>> if a connection is alive?
>>
>> 2) Since I don't have access to, nor will I be given access to, the
>> environment I've mentioned, are there any utilities besides ping,
>> traceroute... that I can ask the users of this environment to run to
>> help determine what could be causing the periodic lost nodes?  I'm
>> currently looking at tcpdump, and writing a python script that uses
>> the socket APIs to connect to a particular port and use keep alive
>> functionality.
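
A script along the lines described in question 2 might look like the sketch below. It is only a starting point: the function name is made up, the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT option names are Linux-specific, and the defaults simply mirror the 2 / 1 / 2 values mentioned earlier in this message. The host and port are whatever test listener the users can start on the other node.

```python
import socket
import time

def watch_connection(host, port, idle=2, interval=1, probes=2):
    """Connect with TCP keepalive enabled and block until the connection ends.

    Returns (seconds_connected, reason).
    """
    sock = socket.create_connection((host, port))
    start = time.monotonic()
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        # Linux-specific keepalive tuning options.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    reason = "closed by peer"
                    break
        except OSError as exc:
            # For example a timeout error when the probes go unanswered.
            reason = f"error: {exc}"
    finally:
        sock.close()
    return time.monotonic() - start, reason
```

Calling watch_connection with the address of a test listener and logging the returned duration and reason, together with a tcpdump capture on the same port, would show whether the keepalive probes are really going unanswered when a node is reported lost.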
>>
>> Any suggestions would be greatly appreciated.
>>
>>
>> thanks
>>
>> _______________________________________________
>> Opensaf-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>

