Re: Open socket handles build up over time (leaking?)

Darrel Schneider Mon, 22 Nov 2021 10:45:45 -0800

Thanks for filing the ticket and for looking into the additional leak
________________________________
From: Leon Finker <[email protected]>
Sent: Monday, November 22, 2021 10:19 AM
To: [email protected] <[email protected]>
Subject: Re: Open socket handles build up over time (leaking?)


Hi,

Bug ticket: 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9819&amp;data=04%7C01%7Cdarrel%40vmware.com%7C8015b9834b2845b645a108d9ade4ab34%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637732019908748063%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=Ngv07U56igvyPoQHxXEtZXFGxEtUVcyJKde%2F1N2gAlQ%3D&amp;reserved=0

I applied the fix locally and now this leak is fixed. Thank you!

There is still a minor leak of about 10 handles per day that I want to
track down next week. From the looks of it it's not as simple to track
as this one :( Still happens in the durable client connection use case
and cache server port. Seems to be when an existing durable connection
is disconnected and automatically reconnected. I need to do more
testing first to know for sure.

On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider <[email protected]> wrote:
>
> Yes, open a geode bug ticket
> ________________________________
> From: Leon Finker <[email protected]>
> Sent: Wednesday, November 17, 2021 1:13 PM
> To: [email protected] <[email protected]>
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Thank you! I'll try to test this change. What's the procedure for
> this? Should I open geode bug ticket?
>
> On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider <[email protected]> wrote:
> >
> > I think you found the leak!
> > My understanding of the code in registerClientInternal (I'm looking at the 
> > current develop branch) is that when it logs the warning "Duplicate durable 
> > clients are not allowed" that it considers the current client connect 
> > attempt to have failed. It writes this response back to it: 
> > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > throw ServerRefusedConnectionException. What seems wrong about this method 
> > is that even though it sets "unsuccessfulMsg" and correctly sends back a 
> > handshake saying the client is rejected, it does not throw an exception and 
> > it does not close "socket". I think right before it calls 
> > performPostAuthorization it should do the followiing:
> > if (unsuccessfulMsg != null) {
> >       try {
> >         socket.close();
> >       } catch (IOException ignore) {
> >       }
> >  } else {
> >     performPostAuthorization(...)
> > }
> > ________________________________
> > From: Leon Finker <[email protected]>
> > Sent: Wednesday, November 17, 2021 10:30 AM
> > To: [email protected] <[email protected]>
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Following Darrel's excellent advice :) I think I tracked down the area
> > of the socket handle leak. As the article suggested, I ran the lsof
> > capture every 5 minutes. I then traced back the cleaned up socket
> > handles to the valid lsof entries. I verified a bunch of them and they
> > all ended up pointing to the same durable connection cases that fail
> > as follows:
> >
> > [2021-11-17 14:05:13,081Z][info <Client Queue Initialization Thread
> > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > [2021-11-17 14:05:13,083Z][warn <Client Queue Initialization Thread
> > 17> tid=32115] The requested durable client has the same identifier (
> > tpdemo2@x_gem_x ) as an existing durable client (
> > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > Duplicate durable clients are not allowed.
> > [2021-11-17 14:05:13,084Z][warn <Client Queue Initialization Thread
> > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > with identifier
> > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> > timeout=86400]) and response code 64
> > [2021-11-17 14:05:13,098Z][warn <ServerConnection on port 40011 Thread
> > 34> tid=290] Server connection from
> > [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> > timeout=86400]); port=60242]: connection disconnect detected by EOF.
> >
> > I then counted the number of those errors and found that the
> > difference is exactly the same as the leaked socket handles. So
> > everything points to this area in the server code. But I tried to
> > reproduce this in debugger and stepped through the geode code without
> > finding where the socket might be leaked. Put breakpoint on
> > CacheClientNotifier.registerClientInternal and the line where that
> > error happens. But then again the exception seems to bubble up to
> > AcceptorImpl.run where the is catch for IOException. And in case of
> > IOException, the client socket is closed. Is this the right socket I'm
> > after?
> >
> > Does this make it any clearer for someone who is more familiar with
> > the code? NOTE: the duplicate durable client happens sometimes due to
> > race conditions with reconnects (I think). It's not happening all the
> > time. It's not causing any problems to the client in general. But the
> > server leak eventually causes us to run out of file handles for the
> > process :(
> >
> > Thank you!
> >
> >
> > On Tue, Nov 16, 2021 at 4:33 PM Leon Finker <[email protected]> wrote:
> > >
> > > Hi Darrel,
> > >
> > > Thank you! I'll try to track it!
> > >
> > > On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider <[email protected]> 
> > > wrote:
> > > >
> > > > This link: 
> > > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&amp;data=04%7C01%7Cdarrel%40vmware.com%7C8015b9834b2845b645a108d9ade4ab34%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637732019908758037%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=yMxwCrwdygrJwRK%2BQE1BIrIn1O55BkksOQcR1AkEQnw%3D&amp;reserved=0
> > > >  points out that once the fd gets into this "can't identify protocol" 
> > > > state you can no longer figure out things like what port(s) and 
> > > > addresses are associated with the fd. They suggest running lsof 
> > > > periodically to catch that fd (in your example output the first fd is 
> > > > 133u) when it was still healthy. This would help track it down to an 
> > > > area of the geode code. For example you could see that it was on of the 
> > > > sockets using the port the cache server is listening on.
> > > > How to identify leaked file descriptors that are shown only as “can't 
> > > > identify protocol” in lsof | MuleSoft Help 
> > > > Center<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&amp;data=04%7C01%7Cdarrel%40vmware.com%7C8015b9834b2845b645a108d9ade4ab34%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637732019908758037%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=yMxwCrwdygrJwRK%2BQE1BIrIn1O55BkksOQcR1AkEQnw%3D&amp;reserved=0>
> > > > Functional cookies enhance functions, performance, and services on the 
> > > > website. Some examples include: cookies used to analyze site traffic, 
> > > > cookies used for market research, and cookies used to display 
> > > > advertising that is not directed to a particular individual.
> > > > help.mulesoft.com
> > > >
> > > > ________________________________
> > > > From: Leon Finker <[email protected]>
> > > > Sent: Tuesday, November 16, 2021 9:28 AM
> > > > To: [email protected] <[email protected]>
> > > > Subject: Open socket handles build up over time (leaking?)
> > > >
> > > > Hi,
> > > >
> > > > We observe in our geode (1.14 - same before as well in 1.13) cache
> > > > server (that supports durable client sessions) an increase in half
> > > > opened sockets. It seems there is a socket leak. Could someone
> > > > recommend how to track the leak down? It's not obvious where it's
> > > > leaking...I can only suspect AcceptorImpl.run and where it only
> > > > handles IOException. but I wasn't able to reproduce it in debugger
> > > > yet...
> > > >
> > > > lsof -p 344|grep "can't"
> > > >
> > > > java 344   user  133u  sock                0,6       0t0 115956017
> > > > can't identify protocol
> > > > java 344   user  142u  sock                0,6       0t0 113361870
> > > > can't identify protocol
> > > > java 344   user  143u  sock                0,6       0t0 111979650
> > > > can't identify protocol
> > > > java 344   user  156u  sock                0,6       0t0 117202529
> > > > can't identify protocol
> > > > java 344   user  178u  sock                0,6       0t0 113357568
> > > > can't identify protocol
> > > > ...
> > > >
> > > > lsof -p 344|grep "can't"|wc -l
> > > > 934
> > > >
> > > > Thank you

Re: Open socket handles build up over time (leaking?)

Reply via email to