Re: Open socket handles build up over time (leaking?)

Anthony Baker Wed, 17 Nov 2021 12:07:49 -0800

I’m curious - how do you assign a durable client id?  The id should be unique 
to a client and not shared among a pool of clients.


Anthony


> On Nov 17, 2021, at 11:22 AM, Darrel Schneider <[email protected]> wrote:
> 
> I think you found the leak!
> My understanding of the code in registerClientInternal (I'm looking at the 
> current develop branch) is that when it logs the warning "Duplicate durable 
> clients are not allowed" that it considers the current client connect attempt 
> to have failed. It writes this response back to it: 
> REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw 
> ServerRefusedConnectionException. What seems wrong about this method is that 
> even though it sets "unsuccessfulMsg" and correctly sends back a handshake 
> saying the client is rejected, it does not throw an exception and it does not 
> close "socket". I think right before it calls performPostAuthorization it 
> should do the followiing:
> if (unsuccessfulMsg != null) {
>      try {
>        socket.close();
>      } catch (IOException ignore) {
>      }
> } else {
>    performPostAuthorization(...)
> }
> ________________________________
> From: Leon Finker <[email protected]>
> Sent: Wednesday, November 17, 2021 10:30 AM
> To: [email protected] <[email protected]>
> Subject: Re: Open socket handles build up over time (leaking?)
> 
> Following Darrel's excellent advice :) I think I tracked down the area
> of the socket handle leak. As the article suggested, I ran the lsof
> capture every 5 minutes. I then traced back the cleaned up socket
> handles to the valid lsof entries. I verified a bunch of them and they
> all ended up pointing to the same durable connection cases that fail
> as follows:
> 
> [2021-11-17 14:05:13,081Z][info <Client Queue Initialization Thread
> 17> tid=32115] :Cache server: Initializing secondary server-to-client
> communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> [2021-11-17 14:05:13,083Z][warn <Client Queue Initialization Thread
> 17> tid=32115] The requested durable client has the same identifier (
> tpdemo2@x_gem_x ) as an existing durable client (
> CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> Duplicate durable clients are not allowed.
> [2021-11-17 14:05:13,084Z][warn <Client Queue Initialization Thread
> 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> with identifier
> identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> timeout=86400]) and response code 64
> [2021-11-17 14:05:13,098Z][warn <ServerConnection on port 40011 Thread
> 34> tid=290] Server connection from
> [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@[email protected]@[email protected]_gem_Amoeba;
> timeout=86400]); port=60242]: connection disconnect detected by EOF.
> 
> I then counted the number of those errors and found that the
> difference is exactly the same as the leaked socket handles. So
> everything points to this area in the server code. But I tried to
> reproduce this in debugger and stepped through the geode code without
> finding where the socket might be leaked. Put breakpoint on
> CacheClientNotifier.registerClientInternal and the line where that
> error happens. But then again the exception seems to bubble up to
> AcceptorImpl.run where the is catch for IOException. And in case of
> IOException, the client socket is closed. Is this the right socket I'm
> after?
> 
> Does this make it any clearer for someone who is more familiar with
> the code? NOTE: the duplicate durable client happens sometimes due to
> race conditions with reconnects (I think). It's not happening all the
> time. It's not causing any problems to the client in general. But the
> server leak eventually causes us to run out of file handles for the
> process :(
> 
> Thank you!
> 
> 
> On Tue, Nov 16, 2021 at 4:33 PM Leon Finker <[email protected]> wrote:
>> 
>> Hi Darrel,
>> 
>> Thank you! I'll try to track it!
>> 
>> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider <[email protected]> wrote:
>>> 
>>> This link: 
>>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&amp;data=04%7C01%7Cbakera%40vmware.com%7C3b69c55afcfa4eabc50508d9a9ffa1a4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727737664517186%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=EtGD5SQu2%2B7CYGpqvcY362u94VsWy9ZYyrg502325kw%3D&amp;reserved=0
>>>  points out that once the fd gets into this "can't identify protocol" state 
>>> you can no longer figure out things like what port(s) and addresses are 
>>> associated with the fd. They suggest running lsof periodically to catch 
>>> that fd (in your example output the first fd is 133u) when it was still 
>>> healthy. This would help track it down to an area of the geode code. For 
>>> example you could see that it was on of the sockets using the port the 
>>> cache server is listening on.
>>> How to identify leaked file descriptors that are shown only as “can't 
>>> identify protocol” in lsof | MuleSoft Help 
>>> Center<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&amp;data=04%7C01%7Cbakera%40vmware.com%7C3b69c55afcfa4eabc50508d9a9ffa1a4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727737664517186%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=EtGD5SQu2%2B7CYGpqvcY362u94VsWy9ZYyrg502325kw%3D&amp;reserved=0>
>>> Functional cookies enhance functions, performance, and services on the 
>>> website. Some examples include: cookies used to analyze site traffic, 
>>> cookies used for market research, and cookies used to display advertising 
>>> that is not directed to a particular individual.
>>> help.mulesoft.com
>>> 
>>> ________________________________
>>> From: Leon Finker <[email protected]>
>>> Sent: Tuesday, November 16, 2021 9:28 AM
>>> To: [email protected] <[email protected]>
>>> Subject: Open socket handles build up over time (leaking?)
>>> 
>>> Hi,
>>> 
>>> We observe in our geode (1.14 - same before as well in 1.13) cache
>>> server (that supports durable client sessions) an increase in half
>>> opened sockets. It seems there is a socket leak. Could someone
>>> recommend how to track the leak down? It's not obvious where it's
>>> leaking...I can only suspect AcceptorImpl.run and where it only
>>> handles IOException. but I wasn't able to reproduce it in debugger
>>> yet...
>>> 
>>> lsof -p 344|grep "can't"
>>> 
>>> java 344   user  133u  sock                0,6       0t0 115956017
>>> can't identify protocol
>>> java 344   user  142u  sock                0,6       0t0 113361870
>>> can't identify protocol
>>> java 344   user  143u  sock                0,6       0t0 111979650
>>> can't identify protocol
>>> java 344   user  156u  sock                0,6       0t0 117202529
>>> can't identify protocol
>>> java 344   user  178u  sock                0,6       0t0 113357568
>>> can't identify protocol
>>> ...
>>> 
>>> lsof -p 344|grep "can't"|wc -l
>>> 934
>>> 
>>> Thank you

Re: Open socket handles build up over time (leaking?)

Reply via email to