Re: Open socket handles build up over time (leaking?)

Leon Finker Fri, 03 Dec 2021 13:09:38 -0800

Hi,

Unfortunately I'm not able to find the cause of the remaining minor
socket handle leak.I've tried tracing it for a week. It won't cause us
any major issues since it leaks about 10 handles per day on average.
For example, after 15 days, there are only 160 socket handles leaked
on one cache peer and on the other one 120. It is definitely for the
cache port and durable client connections. Only those show up in the
lsof captures as problematic.


I tried to capture lsof and enabled debug logging in geode to be able
to correlate ports and maybe get some additional information for that
connection. But there is nothing in the log for this. I've correlated
about 10 of these CLOSE_WAIT and then abandoned sockets and in all
cases geode at debug says the following for those ports as very last
thing:

[2021-12-03 10:20:01,056Z][debug <Timer-1> tid=74] Checking whether to
ping 
CacheClientProxy[identity(192.168.66.7(xxx:loner):65188:000fae7f:xxx(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=xxx_gem_Amoeba;
timeout=86400]); port=59084; primary=true; version=GEODE 1.14.0]

[2021-12-03 10:20:01,056Z][debug <Timer-1> tid=74] Not pinging because
not idle: 
CacheClientProxy[identity(192.168.66.7(xxx:loner):65188:000fae7f:xxx(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=xx_gem_Amoeba;
timeout=86400]); port=59084; primary=true; version=GEODE 1.14.0]

(I replaced durable id details with xxx above). But after the above
lines, there is no longer any mention of port=59084 or :59084. Nothing
after hours or days. I can see the same user/durable id reconnect
after some time but on different port. The port that is lost has no
activity in the logs ever again.

I don't know how to debug it further. The other leak was easy compared
to this since there was an exception in the log and some activity for
the port so was easy to correlate.

But just to confirm the fix for the other leak is working perfectly
fine. After 2 weeks, all good!

Thank you very much!

On Mon, Nov 22, 2021 at 1:45 PM Darrel Schneider <[email protected]> wrote:
>
> Thanks for filing the ticket and for looking into the additional leak
> ________________________________
> From: Leon Finker <[email protected]>
> Sent: Monday, November 22, 2021 10:19 AM
> To: [email protected] <[email protected]>
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Hi,
>
> Bug ticket: 
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9819&amp;data=04%7C01%7Cdarrel%40vmware.com%7C8015b9834b2845b645a108d9ade4ab34%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637732019908748063%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=Ngv07U56igvyPoQHxXEtZXFGxEtUVcyJKde%2F1N2gAlQ%3D&amp;reserved=0
>
> I applied the fix locally and now this leak is fixed. Thank you!
>
> There is still a minor leak of about 10 handles per day that I want to
> track down next week. From the looks of it it's not as simple to track
> as this one :( Still happens in the durable client connection use case
> and cache server port. Seems to be when an existing durable connection
> is disconnected and automatically reconnected. I need to do more
> testing first to know for sure.
>
> On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider <[email protected]> wrote:
> >
> > Yes, open a geode bug ticket
> > ________________________________
> > From: Leon Finker <[email protected]>
> > Sent: Wednesday, November 17, 2021 1:13 PM
> > To: [email protected] <[email protected]>
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Thank you! I'll try to test this change. What's the procedure for
> > this? Should I open geode bug ticket?
> >
> > On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider <[email protected]> wrote:
> > >
> > > I think you found the leak!
> > > My understanding of the code in registerClientInternal (I'm looking at 
> > > the current develop branch) is that when it logs the warning "Duplicate 
> > > durable clients are not allowed" that it considers the current client 
> > > connect attempt to have failed. It writes this response back to it: 
> > > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > > throw ServerRefusedConnectionException. What seems wrong about this 
> > > method is that even though it sets "unsuccessfulMsg" and correctly sends 
> > > back a handshake saying the client is rejected, it does not throw an 
> > > exception and it does not close "socket". I think right before it calls 
> > > performPostAuthorization it should do the followiing:
> > > if (unsuccessfulMsg != null) {
> > >       try {
> > >         socket.close();
> > >       } catch (IOException ignore) {
> > >       }
> > >  } else {
> > >     performPostAuthorization(...)
> > > }
> > > ________________________________
> > > From: Leon Finker <[email protected]>
> > > Sent: Wednesday, November 17, 2021 10:30 AM
> > > To: [email protected] <[email protected]>
> > > Subject: Re: Open socket handles build up over time (leaking?)
> > >
> > > Following Darrel's excellent advice :) I think I tracked down the area
> > > of the socket handle leak. As the article suggested, I ran the lsof
> > > capture every 5 minutes. I then traced back the cleaned up socket
> > > handles to the valid lsof entries. I verified a bunch of them and they
> > > all ended up pointing to the same durable connection cases that fail
> > > as follows:
> > >
> > > [2021-11-17 14:05:13,081Z][info <Client Queue Initialization Thread
> > > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > > [2021-11-17 14:05:13,083Z][warn <Client Queue Initialization Thread
> > > 17> tid=32115] The requested durable client has the same identifier (
> > > tpdemo2@x_gem_x ) as an existing durable client (
> > > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> > > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > > Duplicate durable clients are not allowed.
> > > [2021-11-17 14:05:13,084Z][warn <Client Queue Initialization Thread
> > > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > > with identifier
> > > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> > > timeout=86400]) and response code 64
> > > [2021-11-17 14:05:13,098Z][warn <ServerConnection on port 40011 Thread
> > > 34> tid=290] Server connection from
> > > [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> > > timeout=86400]); port=60242]: connection disconnect detected by EOF.
> > >
> > > I then counted the number of those errors and found that the
> > > difference is exactly the same as the leaked socket handles. So
> > > everything points to this area in the server code. But I tried to
> > > reproduce this in debugger and stepped through the geode code without
> > > finding where the socket might be leaked. Put breakpoint on
> > > CacheClientNotifier.registerClientInternal and the line where that
> > > error happens. But then again the exception seems to bubble up to
> > > AcceptorImpl.run where the is catch for IOException. And in case of
> > > IOException, the client socket is closed. Is this the right socket I'm
> > > after?
> > >
> > > Does this make it any clearer for someone who is more familiar with
> > > the code? NOTE: the duplicate durable client happens sometimes due to
> > > race conditions with reconnects (I think). It's not happening all the
> > > time. It's not causing any problems to the client in general. But the
> > > server leak eventually causes us to run out of file handles for the
> > > process :(
> > >
> > > Thank you!
> > >
> > >
> > > On Tue, Nov 16, 2021 at 4:33 PM Leon Finker <[email protected]> wrote:
> > > >
> > > > Hi Darrel,
> > > >
> > > > Thank you! I'll try to track it!
> > > >
> > > > On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider <[email protected]> 
> > > > wrote:
> > > > >
> > > > > This link: 
> > > > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&amp;data=04%7C01%7Cdarrel%40vmware.com%7C8015b9834b2845b645a108d9ade4ab34%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637732019908758037%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=yMxwCrwdygrJwRK%2BQE1BIrIn1O55BkksOQcR1AkEQnw%3D&amp;reserved=0
> > > > >  points out that once the fd gets into this "can't identify protocol" 
> > > > > state you can no longer figure out things like what port(s) and 
> > > > > addresses are associated with the fd. They suggest running lsof 
> > > > > periodically to catch that fd (in your example output the first fd is 
> > > > > 133u) when it was still healthy. This would help track it down to an 
> > > > > area of the geode code. For example you could see that it was on of 
> > > > > the sockets using the port the cache server is listening on.
> > > > > How to identify leaked file descriptors that are shown only as “can't 
> > > > > identify protocol” in lsof | MuleSoft Help 
> > > > > Center<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&amp;data=04%7C01%7Cdarrel%40vmware.com%7C8015b9834b2845b645a108d9ade4ab34%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637732019908758037%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=yMxwCrwdygrJwRK%2BQE1BIrIn1O55BkksOQcR1AkEQnw%3D&amp;reserved=0>
> > > > > Functional cookies enhance functions, performance, and services on 
> > > > > the website. Some examples include: cookies used to analyze site 
> > > > > traffic, cookies used for market research, and cookies used to 
> > > > > display advertising that is not directed to a particular individual.
> > > > > help.mulesoft.com
> > > > >
> > > > > ________________________________
> > > > > From: Leon Finker <[email protected]>
> > > > > Sent: Tuesday, November 16, 2021 9:28 AM
> > > > > To: [email protected] <[email protected]>
> > > > > Subject: Open socket handles build up over time (leaking?)
> > > > >
> > > > > Hi,
> > > > >
> > > > > We observe in our geode (1.14 - same before as well in 1.13) cache
> > > > > server (that supports durable client sessions) an increase in half
> > > > > opened sockets. It seems there is a socket leak. Could someone
> > > > > recommend how to track the leak down? It's not obvious where it's
> > > > > leaking...I can only suspect AcceptorImpl.run and where it only
> > > > > handles IOException. but I wasn't able to reproduce it in debugger
> > > > > yet...
> > > > >
> > > > > lsof -p 344|grep "can't"
> > > > >
> > > > > java 344   user  133u  sock                0,6       0t0 115956017
> > > > > can't identify protocol
> > > > > java 344   user  142u  sock                0,6       0t0 113361870
> > > > > can't identify protocol
> > > > > java 344   user  143u  sock                0,6       0t0 111979650
> > > > > can't identify protocol
> > > > > java 344   user  156u  sock                0,6       0t0 117202529
> > > > > can't identify protocol
> > > > > java 344   user  178u  sock                0,6       0t0 113357568
> > > > > can't identify protocol
> > > > > ...
> > > > >
> > > > > lsof -p 344|grep "can't"|wc -l
> > > > > 934
> > > > >
> > > > > Thank you

Re: Open socket handles build up over time (leaking?)

Reply via email to