I’m curious - how do you assign a durable client id? The id should be unique to a client and not shared among a pool of clients.
Anthony > On Nov 17, 2021, at 11:22 AM, Darrel Schneider <dar...@vmware.com> wrote: > > I think you found the leak! > My understanding of the code in registerClientInternal (I'm looking at the > current develop branch) is that when it logs the warning "Duplicate durable > clients are not allowed" that it considers the current client connect attempt > to have failed. It writes this response back to it: > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to throw > ServerRefusedConnectionException. What seems wrong about this method is that > even though it sets "unsuccessfulMsg" and correctly sends back a handshake > saying the client is rejected, it does not throw an exception and it does not > close "socket". I think right before it calls performPostAuthorization it > should do the followiing: > if (unsuccessfulMsg != null) { > try { > socket.close(); > } catch (IOException ignore) { > } > } else { > performPostAuthorization(...) > } > ________________________________ > From: Leon Finker <leon...@gmail.com> > Sent: Wednesday, November 17, 2021 10:30 AM > To: dev@geode.apache.org <dev@geode.apache.org> > Subject: Re: Open socket handles build up over time (leaking?) > > Following Darrel's excellent advice :) I think I tracked down the area > of the socket handle leak. As the article suggested, I ran the lsof > capture every 5 minutes. I then traced back the cleaned up socket > handles to the valid lsof entries. I verified a bunch of them and they > all ended up pointing to the same durable connection cases that fail > as follows: > > [2021-11-17 14:05:13,081Z][info <Client Queue Initialization Thread > 17> tid=32115] :Cache server: Initializing secondary server-to-client > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011] > [2021-11-17 14:05:13,083Z][warn <Client Queue Initialization Thread > 17> tid=32115] The requested durable client has the same identifier ( > tpdemo2@x_gem_x ) as an existing durable client ( > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba; > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ). > Duplicate durable clients are not allowed. > [2021-11-17 14:05:13,084Z][warn <Client Queue Initialization Thread > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client > with identifier > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba; > timeout=86400]) and response code 64 > [2021-11-17 14:05:13,098Z][warn <ServerConnection on port 40011 Thread > 34> tid=290] Server connection from > [identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba; > timeout=86400]); port=60242]: connection disconnect detected by EOF. > > I then counted the number of those errors and found that the > difference is exactly the same as the leaked socket handles. So > everything points to this area in the server code. But I tried to > reproduce this in debugger and stepped through the geode code without > finding where the socket might be leaked. Put breakpoint on > CacheClientNotifier.registerClientInternal and the line where that > error happens. But then again the exception seems to bubble up to > AcceptorImpl.run where the is catch for IOException. And in case of > IOException, the client socket is closed. Is this the right socket I'm > after? > > Does this make it any clearer for someone who is more familiar with > the code? NOTE: the duplicate durable client happens sometimes due to > race conditions with reconnects (I think). It's not happening all the > time. It's not causing any problems to the client in general. But the > server leak eventually causes us to run out of file handles for the > process :( > > Thank you! > > > On Tue, Nov 16, 2021 at 4:33 PM Leon Finker <leon...@gmail.com> wrote: >> >> Hi Darrel, >> >> Thank you! I'll try to track it! >> >> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider <dar...@vmware.com> wrote: >>> >>> This link: >>> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&data=04%7C01%7Cbakera%40vmware.com%7C3b69c55afcfa4eabc50508d9a9ffa1a4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727737664517186%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EtGD5SQu2%2B7CYGpqvcY362u94VsWy9ZYyrg502325kw%3D&reserved=0 >>> points out that once the fd gets into this "can't identify protocol" state >>> you can no longer figure out things like what port(s) and addresses are >>> associated with the fd. They suggest running lsof periodically to catch >>> that fd (in your example output the first fd is 133u) when it was still >>> healthy. This would help track it down to an area of the geode code. For >>> example you could see that it was on of the sockets using the port the >>> cache server is listening on. >>> How to identify leaked file descriptors that are shown only as “can't >>> identify protocol” in lsof | MuleSoft Help >>> Center<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhelp.mulesoft.com%2Fs%2Farticle%2FHow-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof&data=04%7C01%7Cbakera%40vmware.com%7C3b69c55afcfa4eabc50508d9a9ffa1a4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637727737664517186%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=EtGD5SQu2%2B7CYGpqvcY362u94VsWy9ZYyrg502325kw%3D&reserved=0> >>> Functional cookies enhance functions, performance, and services on the >>> website. Some examples include: cookies used to analyze site traffic, >>> cookies used for market research, and cookies used to display advertising >>> that is not directed to a particular individual. >>> help.mulesoft.com >>> >>> ________________________________ >>> From: Leon Finker <leon...@gmail.com> >>> Sent: Tuesday, November 16, 2021 9:28 AM >>> To: dev@geode.apache.org <dev@geode.apache.org> >>> Subject: Open socket handles build up over time (leaking?) >>> >>> Hi, >>> >>> We observe in our geode (1.14 - same before as well in 1.13) cache >>> server (that supports durable client sessions) an increase in half >>> opened sockets. It seems there is a socket leak. Could someone >>> recommend how to track the leak down? It's not obvious where it's >>> leaking...I can only suspect AcceptorImpl.run and where it only >>> handles IOException. but I wasn't able to reproduce it in debugger >>> yet... >>> >>> lsof -p 344|grep "can't" >>> >>> java 344 user 133u sock 0,6 0t0 115956017 >>> can't identify protocol >>> java 344 user 142u sock 0,6 0t0 113361870 >>> can't identify protocol >>> java 344 user 143u sock 0,6 0t0 111979650 >>> can't identify protocol >>> java 344 user 156u sock 0,6 0t0 117202529 >>> can't identify protocol >>> java 344 user 178u sock 0,6 0t0 113357568 >>> can't identify protocol >>> ... >>> >>> lsof -p 344|grep "can't"|wc -l >>> 934 >>> >>> Thank you