Hi
I will certainly look into this.  Thank you for your help!

-----Original Message-----
From: A V Mahesh [mailto:[email protected]]
Sent: Wednesday, October 28, 2015 5:24 AM
To: [email protected]
Subject: Re: [users] Question concerning opensaf and TCP keep alive

Hi William,

Please also try this option:

Apply ticket #607, `mds: remove O_NONBLOCK option on MDS TCP transport
sockets [#607]`, on Opensaf 4.4 (the version you are using). The fix
is part of Opensaf 4.5.0. With the #607 patch, Adrian Szwej was able to
get past 45 containers, up to 67 containers.

See more details in : https://sourceforge.net/p/opensaf/tickets/607
(Mds : tcp assert in MDS on accumulated unsent messages)

- AVM

On 10/27/2015 1:08 PM, A V Mahesh wrote:
> Hi William,
>
> Even though there are considerable changes from opensaf 4.4.0 to
> opensaf 4.7.0 in the DTM sockets, their options, the default buffer
> configuration, etc., a basic issue like `payloads and controllers
> periodically lose contact with the cluster` shouldn't happen even
> with opensaf 4.4.0. We have been using/testing virtual machine
> networks with both the Host-only networking and Bridged networking
> adapter options, and we haven't faced such an issue.
>
> So can you please share your virtual machines' network adapter
> configuration?
> Are you observing the same behavior with opensaf 4.7.0 as well?
>
> -AVM
>
> On 10/27/2015 4:26 AM, William R Elliott wrote:
>> Hello All,
>> We are currently using opensaf 4.4.0.  We have a cluster that is
>> running on redhat 6 on virtual machines.  For some unknown reason
>> various payloads and controllers periodically lose contact with the
>> cluster.  The /var/log/messages logs don't tell us anything that we
>> can see except the node lost message.
>>
>> I've been looking at the dtmd code to see if I can get some idea
>> about what to start looking at to try to figure out what's going on.
>> One of the things I've been researching is the TCP idle, interval,
>> and probes settings in the dtmd.conf file.  From what I can tell so
>> far, the code indicates these values are set in the DTM_INTERNODE_CB
>> structure and are used to set attributes on the socket by calling the
>> setsockopt function.  So it seems to me dtmd is relying on the TCP
>> keep alive functionality to determine if a node is lost.  Currently
>> it looks like the idle time is set to 2 seconds, the interval is 1
>> second, and the number of probes is 2.  Therefore, if the socket is
>> idle for 2 seconds, a keep alive probe is sent.  If there is no
>> acknowledgement, the next probe is sent after a one second interval,
>> and if there is still no acknowledgement the lost node message is issued.
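
For reference, the idle/interval/probes settings described above can be sketched in Python. This is only an illustration, assuming a Linux host (the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT option names are the Linux ones); the function names are made up for this example and are not taken from the dtmd code.

```python
import socket

def keepalive_detection_seconds(idle, interval, probes):
    """Rough time from the last traffic until the kernel drops a silent peer.

    Assumes Linux behaviour: the first probe goes out after `idle` seconds,
    and the connection is dropped once `probes` probes, spaced `interval`
    seconds apart, have gone unanswered.
    """
    return idle + interval * probes

def apply_keepalive(sock, idle=2, interval=1, probes=2):
    """Enable TCP keepalive on `sock` with the given timings (Linux option names)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    apply_keepalive(s)  # idle=2, interval=1, probes=2 as in the values quoted above
    print("approximate detection time (s):", keepalive_detection_seconds(2, 1, 2))
    s.close()
```

With the values quoted above, a peer that goes completely silent would be declared dead after roughly 2 + 1 * 2 seconds, so a short network stall inside a virtual machine could be enough to trigger it.
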
>>
>> Since this kind of lower level TCP functionality is new to me, I
>> started researching TCP keep alive and encountered the following
>> statements concerning relying on TCP keep alive functionality to tell
>> if communication has been lost:
>>
>> Do NOT try to use TCP Keepalive to detect TCP socket failure more
>> quickly than a few minutes. People who try to set it for 5 seconds
>> (or for milliseconds) invariably cause serious compatibility issues
>> with other products - and invariably fail to be satisfied.
>> If you truly require detecting a TCP socket failure in 1 second or
>> less, which implies your TCP peers normally send data many times per
>> second, then use non-blocking sockets with the "socket.timeout"
>> exception to detect
>> when no data had been received in your required time-frame. And if
>> you accept that a TCP peer quiet for 1 second is bad, then close the
>> socket manually and attempt recovery directly. Do not use TCP
>> Keepalive for such short-period detection.
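
The application-level alternative that this advice describes can be sketched as follows. It is a minimal example, assuming the peers normally exchange data more often than the allowed silence; `wait_for_data` is a hypothetical helper name, and it relies on Python's socket timeout mode rather than on TCP keep alive.

```python
import socket

def wait_for_data(sock, max_silence):
    """Wait up to `max_silence` seconds for data on `sock`.

    Returns the received bytes, b"" if the peer closed the connection,
    or None if nothing arrived within `max_silence` seconds.
    """
    sock.settimeout(max_silence)
    try:
        return sock.recv(4096)
    except socket.timeout:
        # The peer was quiet for too long; the caller decides whether to
        # close the socket and attempt recovery.
        return None
```

A caller would treat a `None` result as "peer quiet for too long" and close the socket itself, instead of waiting for the kernel to give up on the connection.
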
>>
>> Or the following link to a forum site that has several comments
>> discouraging relying on TCP keep alive to determine if a connection
>> is alive:
>> http://stackoverflow.com/questions/15230922/keepalive-time-cannot-reduce-below-one-minute-in-c
>>
>>
>> This is the first time I have looked into dtmd and since I don't
>> have the history and experience, it's possible I have missed
>> something, or misunderstood the code.  So here are my questions:
>>
>> 1)      Am I correct that dtmd relies on TCP keep alive to determine
>> if a connection is alive?
>>
>> 2)      Since I don't have access to nor will I be given access to this
>> environment I've mentioned, are there any utilities besides ping,
>> traceroute... that I can ask the users of this environment to run to
>> help determine what could be causing the periodic lost nodes?  I'm
>> currently looking at tcpdump, and writing a python script that uses
>> the socket APIs to connect to a particular port and use keep alive
>> functionality.
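
A diagnostic script of the kind mentioned in question 2 might look like the sketch below. It assumes a Linux host and a reachable TCP port on the node being checked; `watch_connection` and its defaults are hypothetical, and the host and port to probe are left to whoever runs it.

```python
import socket
import time

def watch_connection(host, port, idle=2, interval=1, probes=2):
    """Connect to host:port, enable keepalive, and block until the link ends.

    Returns (reason, seconds_connected), where reason says whether the peer
    closed the connection or the kernel reported an error (for example a
    keepalive timeout).
    """
    sock = socket.create_connection((host, port), timeout=5)
    sock.settimeout(None)  # block in recv(); keepalive decides when to give up
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    start = time.monotonic()
    try:
        while True:
            data = sock.recv(4096)
            if not data:
                reason = "closed by peer"
                break
    except OSError as exc:
        reason = "error: %s" % exc
    finally:
        sock.close()
    return reason, time.monotonic() - start
```

Run between two nodes alongside tcpdump, the returned reason and elapsed time could be compared with the timestamps of the node lost messages in /var/log/messages.
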
>>
>> Any suggestions would be greatly appreciated.
>>
>>
>> thanks
>>
>>
>>
>>
>> ________________________________
>> The information transmitted herein is intended only for the person or
>> entity to which it is addressed and may contain confidential,
>> proprietary and/or privileged material. Any review, retransmission,
>> dissemination or other use of, or taking of any action in reliance
>> upon, this information by persons or entities other than the intended
>> recipient is prohibited. If you received this in error, please
>> contact the sender and delete the material from any computer.
>> ---------------------------------------------------------------------
>> ---------
>>
>> _______________________________________________
>> Opensaf-users mailing list
>> [email protected]
>> https://lists.sourceforge.net/lists/listinfo/opensaf-users
>

