Following Darrel's excellent advice :) I think I tracked down the area
of the socket handle leak. As the article suggested, I ran the lsof
capture every 5 minutes. I then traced back the cleaned up socket
handles to the valid lsof entries. I verified a bunch of them and they
all ended up pointing to the same durable connection cases that fail
as follows:

[2021-11-17 14:05:13,081Z][info <Client Queue Initialization Thread
17> tid=32115] :Cache server: Initializing secondary server-to-client
communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
[2021-11-17 14:05:13,083Z][warn <Client Queue Initialization Thread
17> tid=32115] The requested durable client has the same identifier (
tpdemo2@x_gem_x ) as an existing durable client (
CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
Duplicate durable clients are not allowed.
[2021-11-17 14:05:13,084Z][warn <Client Queue Initialization Thread
17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
with identifier
identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]) and response code 64
[2021-11-17 14:05:13,098Z][warn <ServerConnection on port 40011 Thread
34> tid=290] Server connection from
[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
timeout=86400]); port=60242]: connection disconnect detected by EOF.

I then counted the number of those errors and found that the
difference is exactly the same as the leaked socket handles. So
everything points to this area in the server code. But I tried to
reproduce this in debugger and stepped through the geode code without
finding where the socket might be leaked. Put breakpoint on
CacheClientNotifier.registerClientInternal and the line where that
error happens. But then again the exception seems to bubble up to
AcceptorImpl.run where the is catch for IOException. And in case of
IOException, the client socket is closed. Is this the right socket I'm
after?

Does this make it any clearer for someone who is more familiar with
the code? NOTE: the duplicate durable client happens sometimes due to
race conditions with reconnects (I think). It's not happening all the
time. It's not causing any problems to the client in general. But the
server leak eventually causes us to run out of file handles for the
process :(

Thank you!


On Tue, Nov 16, 2021 at 4:33 PM Leon Finker <leon...@gmail.com> wrote:
>
> Hi Darrel,
>
> Thank you! I'll try to track it!
>
> On Tue, Nov 16, 2021 at 2:42 PM Darrel Schneider <dar...@vmware.com> wrote:
> >
> > This link: 
> > https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof
> >  points out that once the fd gets into this "can't identify protocol" state 
> > you can no longer figure out things like what port(s) and addresses are 
> > associated with the fd. They suggest running lsof periodically to catch 
> > that fd (in your example output the first fd is 133u) when it was still 
> > healthy. This would help track it down to an area of the geode code. For 
> > example you could see that it was on of the sockets using the port the 
> > cache server is listening on.
> > How to identify leaked file descriptors that are shown only as “can't 
> > identify protocol” in lsof | MuleSoft Help 
> > Center<https://help.mulesoft.com/s/article/How-to-identify-leaked-file-descriptors-that-are-shown-only-as-cant-identify-protocol-in-lsof>
> > Functional cookies enhance functions, performance, and services on the 
> > website. Some examples include: cookies used to analyze site traffic, 
> > cookies used for market research, and cookies used to display advertising 
> > that is not directed to a particular individual.
> > help.mulesoft.com
> >
> > ________________________________
> > From: Leon Finker <leon...@gmail.com>
> > Sent: Tuesday, November 16, 2021 9:28 AM
> > To: dev@geode.apache.org <dev@geode.apache.org>
> > Subject: Open socket handles build up over time (leaking?)
> >
> > Hi,
> >
> > We observe in our geode (1.14 - same before as well in 1.13) cache
> > server (that supports durable client sessions) an increase in half
> > opened sockets. It seems there is a socket leak. Could someone
> > recommend how to track the leak down? It's not obvious where it's
> > leaking...I can only suspect AcceptorImpl.run and where it only
> > handles IOException. but I wasn't able to reproduce it in debugger
> > yet...
> >
> > lsof -p 344|grep "can't"
> >
> > java 344   user  133u  sock                0,6       0t0 115956017
> > can't identify protocol
> > java 344   user  142u  sock                0,6       0t0 113361870
> > can't identify protocol
> > java 344   user  143u  sock                0,6       0t0 111979650
> > can't identify protocol
> > java 344   user  156u  sock                0,6       0t0 117202529
> > can't identify protocol
> > java 344   user  178u  sock                0,6       0t0 113357568
> > can't identify protocol
> > ...
> >
> > lsof -p 344|grep "can't"|wc -l
> > 934
> >
> > Thank you

Reply via email to