Re: Open socket handles build up over time (leaking?)

2022-05-13 Thread Leon Finker
Hi,

I'm trying to find the other slow client socket leak. Is it possible
that the socket sometimes not closed for this condition?
[warn  tid=324]
ClientHealthMonitor: Unregistering client with member id
...,connection=1,durableAttributes=DurableClientAttributes[id=xxx;
timeout=86400]) due to: The connection has been reset while reading
the header

It seems to come from Message.fetchHeader. But I'm not sure on the
usage to review properly if that socket is always closed when the
client side does EOF while reading the header from the socket. Could
someone help review? From my count of that log warning to the metric
of leaked socket handles, it seems to correlate.

Thank you in advance

On Mon, Nov 22, 2021 at 1:45 PM Darrel Schneider  wrote:
>
> Thanks for filing the ticket and for looking into the additional leak
> 
> From: Leon Finker 
> Sent: Monday, November 22, 2021 10:19 AM
> To: dev@geode.apache.org 
> Subject: Re: Open socket handles build up over time (leaking?)
>
> Hi,
>
> Bug ticket: 
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9819&data=04%7C01%7Cdarrel%40vmware.com%7C8015b9834b2845b645a108d9ade4ab34%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637732019908748063%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=Ngv07U56igvyPoQHxXEtZXFGxEtUVcyJKde%2F1N2gAlQ%3D&reserved=0
>
> I applied the fix locally and now this leak is fixed. Thank you!
>
> There is still a minor leak of about 10 handles per day that I want to
> track down next week. From the looks of it it's not as simple to track
> as this one :( Still happens in the durable client connection use case
> and cache server port. Seems to be when an existing durable connection
> is disconnected and automatically reconnected. I need to do more
> testing first to know for sure.
>
> On Wed, Nov 17, 2021 at 4:22 PM Darrel Schneider  wrote:
> >
> > Yes, open a geode bug ticket
> > 
> > From: Leon Finker 
> > Sent: Wednesday, November 17, 2021 1:13 PM
> > To: dev@geode.apache.org 
> > Subject: Re: Open socket handles build up over time (leaking?)
> >
> > Thank you! I'll try to test this change. What's the procedure for
> > this? Should I open geode bug ticket?
> >
> > On Wed, Nov 17, 2021 at 2:22 PM Darrel Schneider  wrote:
> > >
> > > I think you found the leak!
> > > My understanding of the code in registerClientInternal (I'm looking at 
> > > the current develop branch) is that when it logs the warning "Duplicate 
> > > durable clients are not allowed" that it considers the current client 
> > > connect attempt to have failed. It writes this response back to it: 
> > > REPLY_EXCEPTION_DUPLICATE_DURABLE_CLIENT. This will cause the client to 
> > > throw ServerRefusedConnectionException. What seems wrong about this 
> > > method is that even though it sets "unsuccessfulMsg" and correctly sends 
> > > back a handshake saying the client is rejected, it does not throw an 
> > > exception and it does not close "socket". I think right before it calls 
> > > performPostAuthorization it should do the followiing:
> > > if (unsuccessfulMsg != null) {
> > >   try {
> > > socket.close();
> > >   } catch (IOException ignore) {
> > >   }
> > >  } else {
> > > performPostAuthorization(...)
> > > }
> > > 
> > > From: Leon Finker 
> > > Sent: Wednesday, November 17, 2021 10:30 AM
> > > To: dev@geode.apache.org 
> > > Subject: Re: Open socket handles build up over time (leaking?)
> > >
> > > Following Darrel's excellent advice :) I think I tracked down the area
> > > of the socket handle leak. As the article suggested, I ran the lsof
> > > capture every 5 minutes. I then traced back the cleaned up socket
> > > handles to the valid lsof entries. I verified a bunch of them and they
> > > all ended up pointing to the same durable connection cases that fail
> > > as follows:
> > >
> > > [2021-11-17 14:05:13,081Z][info  > > 17> tid=32115] :Cache server: Initializing secondary server-to-client
> > > communication socket: Socket[addr=/x.x.x.x,port=56090,localport=40011]
> > > [2021-11-17 14:05:13,083Z][warn  > > 17> tid=32115] The requested durable client has the same identifier (
> > > tpdemo2@x_gem_x ) as an existing durable client (
> > > CacheClientProxy[identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > > 1.14.0),connection=1,durableAttributes=DurableClientAttributes[id=tpdemo2@TS02493@21.5.3@8.2.6@2021.5.8_gem_Amoeba;
> > > timeout=86400]); port=53976; primary=true; version=GEODE 1.14.0] ).
> > > Duplicate durable clients are not allowed.
> > > [2021-11-17 14:05:13,084Z][warn  > > 17> tid=32115] CacheClientNotifier: Unsuccessfully registered client
> > > with identifier
> > > identity(TS02493(tpdemo2@x:loner):60759:9243242e:tpdemo2@x(version:GEODE
> > > 1.14.0),connection=1,durableAttributes=DurableClientAt

Re: [PROPOSAL] Remove warning logs from FunctionException

2022-05-13 Thread Kirk Lund
Hi Alberto,

I don't really know if this change is good or bad but I do have some
thoughts about logging and changing behavior like this.

First off, we might think about whether or not logging these warnings is
currently expected behavior. Would removing/changing it cause surprise for
any users upgrading to the version that removes it? Would it make things
more difficult for the vendor or Apache Geode to provide support to a user
having issues in FunctionExecution?

It might make more sense to refactor the code that issues these warnings to
use a pluggable mechanism for handling exceptions and error conditions in
FunctionExecution? For example, you might consider introducing an SPI for a
new FunctionExecutionErrorHandler. The default implementation would log the
exceptions at warning level. Since it's an SPI (using Java Service Loader
to load an implementation), you could provide your own implementation that
does anything you want in response, including the possibility of logging it
at debug level instead of warning.

This also highlights a bigger issue in Apache Geode. There has never been a
policy or even set of recommendations on the logging performed by Geode
classes and components. There is also no standard or recommendations for
formatting or wording of log statements. I believe the issue involving
exception logging in FunctionExecution is a direct result of this. Geode
needs detailed guidelines about log levels, log statement wording,
inclusion of exception stacks, use of Markers, etc. Not just per class, but
probably per Geode component or service or layer as well. Geode also needs
some devs to spend time experimenting with specifying custom log4j2.xml
configuration files, finding usability problems and hopefully improving how
this currently works so that it becomes easy for users to customize what
information is actually being logged.

For example, there are problems with how Log4j is currently used by Geode
such as using Markers at TRACE or DEBUG levels instead of at INFO level --
in order to enable Marker controlled logging, one would have to set the
log-level to DEBUG and then also enable the Marker. Since we know DEBUG
level already produces too much logging (much of it useless noise), this
just exacerbates the problem when what is really wanted is well-behaved,
consistent INFO level logging AND the ability to enable Marked controlled
logging without having to use a finer log-level. It's possible the solution
to Marker usage in Geode is to move Marker controlled log statements to
INFO level. That way you could leave the logging at INFO level and just
enable one or more Markers to increase logging from specific areas.

I primarily wanted to point out that you may be trying to address a smaller
part of the real problem here and that you should always consider pluggable
mechanisms for customizing behavior rather than just considering changes
that would directly alter the current behavior. Also, remember to think
about whether a change like this would be best in a major, minor or patch
release. Some users or the devs supporting them may rely on this
information (I'm not saying they do in this case, just that it's possible).

Thanks,
Kirk

On Thu, Apr 28, 2022 at 3:00 AM Alberto Gomez 
wrote:

> Hi Barry,
>
> If the exception is returned by passing it to the ResultCollector's
> sendException() method then the exception is not logged. If the exception
> is returned by passing it to the lastResult() method then the exception
> (and the stack trace) is logged. I am assuming that when you say that the
> exception is returned in its result is done by means of the sendException()
> method.
>
> I agree with you that Geode must be consistent and if an exception is
> thrown by the function, then the exception should be logged no matter if
> isHA is returning false or true. Like you, I have also observed that when
> isHA is returning false the exception is not logged.
>
> I also think it is worth to at least make this logging of the exception
> configurable for those cases where functions prefer to throw the exception
> instead of sending it and still do not want to see those exceptions logged.
>
> Thanks,
>
> Alberto
> 
> From: Barry Oglesby 
> Sent: Thursday, April 28, 2022 2:32 AM
> To: Alberto Gomez ; dev@geode.apache.org <
> dev@geode.apache.org>
> Subject: Re: [PROPOSAL] Remove warning logs from FunctionException
>
> A function can throw an exception and can also return an exception in its
> result. I'm not sure I've seen too many functions where throwing an
> exception is the expected result. In my very quick testing, I see the
> exception and stack logged if the exception is thrown by the function but
> not if the exception is returned. Are you seeing that same behavior, or are
> both cases logging the exception? I also see the behavior you described
> where isHA returning false does not log the exception. I guess I would say
> if an exception is thrown in either case