One other piece of information that might be helpful: if I look at "lsof -n -P" for the process, I can see 3 entries for connections to zookeeper on port 2181 in the ESTABLISHED state. However, if I look at a node whose session has timed out, the connection appears to be in the CLOSE_WAIT state instead. Is there a clean way to recover from this?
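
In case it helps anyone reproduce the check, below is a minimal C sketch of how the lsof output could be tallied automatically rather than by eye. It assumes the usual lsof TCP line layout, where the connection is printed as "local->remote:port (STATE)". The names classify_lsof_line and count_states are hypothetical helpers made up for illustration and are not part of the zookeeper C client.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Connection states of interest for the zookeeper client sockets. */
typedef enum { CONN_OTHER = 0, CONN_ESTABLISHED, CONN_CLOSE_WAIT } conn_state;

/*
 * Classify one line of `lsof -n -P` output (hypothetical helper).
 * Assumes TCP entries look like "... TCP a.b.c.d:port->e.f.g.h:2181 (STATE)".
 * Returns CONN_OTHER unless the remote port is exactly 2181 and the
 * state is ESTABLISHED or CLOSE_WAIT.
 */
conn_state classify_lsof_line(const char *line)
{
    assert(line != NULL);

    const char *arrow = strstr(line, "->");
    if (arrow == NULL)
        return CONN_OTHER;              /* not a connected socket entry */

    /* The remote endpoint runs from after "->" to the next whitespace. */
    const char *remote = arrow + 2;
    size_t len = strcspn(remote, " \t\n");

    /* Find the last ':' so IPv6 endpoints such as [::1]:2181 also work. */
    size_t i = len;
    while (i > 0 && remote[i - 1] != ':')
        i--;
    if (i == 0)
        return CONN_OTHER;              /* no port separator found */

    /* The port must be exactly "2181" (so 21810 does not match). */
    if (len - i != 4 || strncmp(remote + i, "2181", 4) != 0)
        return CONN_OTHER;

    /* The TCP state follows the endpoint in parentheses. */
    const char *tail = remote + len;
    if (strstr(tail, "(ESTABLISHED)") != NULL)
        return CONN_ESTABLISHED;
    if (strstr(tail, "(CLOSE_WAIT)") != NULL)
        return CONN_CLOSE_WAIT;
    return CONN_OTHER;
}

/* Tally states over a whole stream of lsof output, e.g. stdin. */
void count_states(FILE *in, int *established, int *close_wait)
{
    char buf[1024];
    *established = 0;
    *close_wait = 0;
    while (fgets(buf, sizeof buf, in) != NULL) {
        conn_state s = classify_lsof_line(buf);
        if (s == CONN_ESTABLISHED)
            (*established)++;
        else if (s == CONN_CLOSE_WAIT)
            (*close_wait)++;
    }
}
```

One way to use it would be a small main() that calls count_states(stdin, ...) and is fed the lsof output for the process through a pipe. A non-zero CLOSE_WAIT count would then flag the same condition I am seeing on the nodes whose session has expired.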

Andrew Jorgensen
@ajorgensen

On Wed, May 3, 2017 at 11:32 PM, Andrew Jorgensen <[email protected]> wrote:

> I am going to try to provide as much information as possible but it might
> be a bit sparse because I am still actively trying to get a grip on what
> exactly I'm seeing with the C client.
>
> Zookeeper client version: 3.4.5
> Zookeeper server version: 3.4.10
> 5 node zookeeper cluster
>
> The workflow I have is essentially this: a long lived process establishes
> an ephemeral node with some data that is read by some number of other
> processes located on separate machines, standard cluster coordination
> stuff. The issue I am seeing is that after about 7-9 hours of runtime,
> zookeeper will expire the client session because it has reached the 30
> second timeout. On the zookeeper client side, I've confirmed there are no
> calls to the supplied watcher functions or context supplied to
> zookeeper_init. The long lived process is doing other things during its
> runtime, but the interaction with zookeeper is only via callback events
> and a pipe after establishing the ephemeral node at the beginning.
>
> One other datapoint is that I created an event loop that uses the same
> client that established the ephemeral node to get the data from the
> ephemeral node every 60 seconds and log it. While this event loop is
> running I do not observe the client session expiring at all, even after
> 14 hours of runtime.
>
> I am not sure how to explain the client disconnecting without any message
> to either the callback function or the context. I also am not sure how to
> explain this behavior happening after many hours of running without issue.
>
> If anyone has seen something similar, how did you go about fixing it? Also,
> if there are any ideas on how to debug this issue, that would be very
> helpful.
>
> Thanks!
> Andrew Jorgensen
> @ajorgensen
