Yes, we are definitely not sharing a durable ID between users. The durable ID
is built from the login id@host plus some other attributes. Even on the same
machine, if a user starts a second instance of the client application, a
unique suffix is generated and appended to the durable ID.
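
For reference, a minimal sketch of how the durable ID ends up on the client
("durable-client-id" and "durable-client-timeout" are the standard Geode
client properties; the ID value, locator host/port, and class name are made
up for illustration):

import org.apache.geode.cache.client.ClientCache;
import org.apache.geode.cache.client.ClientCacheFactory;

public class DurableClientSketch {
  public static void main(String[] args) {
    ClientCache cache = new ClientCacheFactory()
        // login id@host plus a per-instance suffix, as described above
        .set("durable-client-id", "tpdemo2@somehost_gem_x")
        // seconds the server keeps the subscription queue after a disconnect
        .set("durable-client-timeout", "86400")
        .addPoolLocator("locator-host", 10334)
        .addPoolSubscriptionEnabled(true)
        .create();
    // ... register durable interest, call readyForEvents(), do work ...
  }
}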

It seems the durable session cleans itself up on the server side, and then the
same durable ID can connect again as if nothing happened.
* Very mysterious that the unregister is logged with "due to: Unknown reason"...

[2021-11-17 14:08:25,614Z][warn <ClientHealthMonitor Thread> tid=216]
ClientHealthMonitor: Unregistering client with member id
identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]) due to: Unknown reason
[2021-11-17 14:08:25,614Z][info <Client Message Dispatcher for
x(tpdemo2@x:loner):56295:da8de524:tpdemo2@x(version:GEODE 1.14.0)
(tpdemo2@x_gem_x)> tid=30130] available ids = 5 , isEmptyAckList
=true, peekInitialized = true
[2021-11-17 14:08:25,614Z][info <Client Message Dispatcher for
x(tpdemo2@x:loner):56295:da8de524:tpdemo2@x(version:GEODE 1.14.0)
(tpdemo2@x@21.5.3@8.2.6@2021.5.8_gem_x)> tid=30130]
CacheClientProxy[identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] :
Pausing processing
[2021-11-17 14:08:25,629Z][info <ClientHealthMonitor Thread> tid=216]
CacheClientNotifier: Keeping proxy for durable client named
tpdemo2@x_gem_x for 86400 seconds
CacheClientProxy[identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0].
[2021-11-17 14:08:25,637Z][warn <ClientHealthMonitor Thread> tid=216]
Monitoring client with member id
identity(x(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@x_gem_x;
timeout=86400]). It had been 60310 ms since the latest heartbeat. Max
interval is 60000. Terminated client.
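
The "Keeping proxy for durable client named ... for 86400 seconds" message
above matches the durable lifecycle as I understand it. A rough sketch of the
client side of that lifecycle (close(keepAlive) and readyForEvents() are the
standard Geode client API; the ID value and locator address are made up, as
in the sketch above):

// assuming "cache" is the ClientCache created in the sketch above
cache.close(true);   // keepAlive = true: server keeps the durable queue

// a later process reconnects with the SAME durable-client-id (same
// properties as above), re-registers its durable interest, and then asks
// the server to replay anything queued while it was away
ClientCache reconnected = new ClientCacheFactory()
    .set("durable-client-id", "tpdemo2@somehost_gem_x")   // made-up ID
    .set("durable-client-timeout", "86400")
    .addPoolLocator("locator-host", 10334)                // made-up locator
    .addPoolSubscriptionEnabled(true)
    .create();
reconnected.readyForEvents();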




On Wed, Nov 17, 2021 at 3:07 PM Anthony Baker <bak...@vmware.com> wrote:
>
> I’m curious - how do you assign a durable client id?  The id should be unique 
> to a client and not shared among a pool of clients.
>
> Anthony
>
>
> > On Nov 17, 2021, at 11:22 AM, Darrel Schneider <dar...@vmware.com> wrote:
> >
> > I think you found the leak!
> > My understanding of the code in registerClientInternal (I'm looking at
> > the current develop branch) is that when it logs the warning "Duplicate
> > durable clients are not allowed", it considers the current client
> > connect attempt to have failed. It writes the response
> > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT back to the client, which
> > causes the client to throw ServerRefusedConnectionException. What seems
> > wrong about this method is that even though it sets "unsuccessfulMsg"
> > and correctly sends back a handshake saying the client is rejected, it
> > does not throw an exception and it does not close "socket". I think
> > right before it calls performPostAuthorization it should do the
> > following:
> > if (unsuccessfulMsg != null) {
> >   // registration was refused; close the client socket so the fd is not leaked
> >   try {
> >     socket.close();
> >   } catch (IOException ignore) {
> >   }
> > } else {
> >   performPostAuthorization(...);
> > }
> > ________________________________
> > From: Leon Finker <leon...@gmail.com>
> > Sent: Wednesday, November 17, 2021 10:30 AM
> > To: dev@geode.apache.org <dev@geode.apache.org>
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Following Darrel's excellent advice :) I think I tracked down the area
> > of the socket handle leak. As the article suggested, I ran an lsof
> > capture every 5 minutes and traced the fds that later show up as
> > "can't identify protocol" back to the earlier captures where they were
> > still valid entries. I verified a bunch of them, and they all point to
> > the same kind of durable connection failure:
> >
> > [2021-11-17 14:05:13,081Z][info <Client Queue Initialization Thread
> > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > [2021-11-17 14:05:13,083Z][warn <Client Queue Initialization Thread
> > 17> tid=32115] The requested durable client has the same identifier (
> > tpdemo2@x_gem_x ) as an existing durable client (
> > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > Duplicate durable clients are not allowed.
> > [2021-11-17 14:05:13,084Z][warn <Client Queue Initialization Thread
> > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > with identifier
> > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]) and response code 64
> > [2021-11-17 14:05:13,098Z][warn <ServerConnection on port 40011 Thread
> > 34> tid=290] Server connection from
> > [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > timeout=86400]); port=60242]: connection disconnect detected by EOF.
> >
> > I then counted those errors and found the count matches the number of
> > leaked socket handles exactly, so everything points to this area of the
> > server code. But I tried to reproduce this in a debugger and stepped
> > through the Geode code without finding where the socket might be
> > leaked. I put a breakpoint on CacheClientNotifier.registerClientInternal
> > at the line where that error is logged, but the exception seems to
> > bubble up to AcceptorImpl.run, where there is a catch for IOException,
> > and in that case the client socket is closed. Is this the right socket
> > I'm after?
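> >
> > If I read the flow right, it's roughly this (my own simplified sketch of
> > what I stepped through, not the actual Geode source; the helper names
> > are made up):
> >
> > // CacheClientNotifier.registerClientInternal, as I understand it
> > if (duplicateDurableClient) {
> >   logger.warn("Duplicate durable clients are not allowed.");
> >   // a rejection handshake goes back to the client ("response code 64"
> >   // in the log above)...
> >   sendRefusalHandshake(socket);
> >   // ...after which, as far as I can tell, this method just returns;
> >   // is "socket" supposed to be closed somewhere else?
> > }
> >
> > // AcceptorImpl.run, as I understand it
> > try {
> >   handleNewClientConnection(socket);
> > } catch (IOException e) {
> >   closeSocket(socket); // the client socket is closed only if an
> >                        // IOException actually reaches here
> > }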
> >
> > Does this make it any clearer for someone who is more familiar with
> > the code? NOTE: the duplicate durable client situation only happens
> > occasionally, I think due to race conditions during reconnects. It
> > doesn't cause any problems for the client itself, but the server-side
> > leak eventually makes the process run out of file handles :(
> >
> > Thank you!
> >
> >
> > On Tue, Nov 16, 2021 at 4:33 PM Leon Finker <leon...@gmail.com> wrote:
> >>
> >> Hi Darrel,
> >>
> >> Thank you! I'll try to track it!
> >>
> >> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider <dar...@vmware.com> wrote:
> >>>
> >>> This link:
> >>> https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
> >>> points out that once an fd gets into this "can't identify protocol"
> >>> state, you can no longer figure out things like which port(s) and
> >>> addresses are associated with it. They suggest running lsof
> >>> periodically so you catch each fd (in your example output the first fd
> >>> is 133u) while it is still healthy. That would help track it down to
> >>> an area of the Geode code; for example, you could see that it was one
> >>> of the sockets using the port the cache server is listening on.
> >>>
> >>> ________________________________
> >>> From: Leon Finker <leon...@gmail.com>
> >>> Sent: Tuesday, November 16, 2021 9:28 AM
> >>> To: dev@geode.apache.org <dev@geode.apache.org>
> >>> Subject: Open socket handles build up over time (leaking?)
> >>>
> >>> Hi,
> >>>
> >>> We observe in our Geode cache server (1.14, and previously 1.13),
> >>> which supports durable client sessions, a build-up of half-open
> >>> sockets over time. It seems there is a socket leak. Could someone
> >>> recommend how to track the leak down? It's not obvious where it's
> >>> leaking... I can only suspect AcceptorImpl.run, where only IOException
> >>> is handled, but I wasn't able to reproduce it in a debugger yet...
> >>>
> >>> lsof -p 344|grep "can't"
> >>>
> >>> java 344   user  133u  sock                0,6       0t0 115956017
> >>> can't identify protocol
> >>> java 344   user  142u  sock                0,6       0t0 113361870
> >>> can't identify protocol
> >>> java 344   user  143u  sock                0,6       0t0 111979650
> >>> can't identify protocol
> >>> java 344   user  156u  sock                0,6       0t0 117202529
> >>> can't identify protocol
> >>> java 344   user  178u  sock                0,6       0t0 113357568
> >>> can't identify protocol
> >>> ...
> >>>
> >>> lsof -p 344|grep "can't"|wc -l
> >>> 934
> >>>
> >>> Thank you
>
