Re: SocketTimeoutExceptions When Communicating with a Load Balancer

Alexey Panchenko Wed, 12 Feb 2020 06:34:25 -0800

Ah, it's actually much simpler than I thought.
ConnectTimeoutException wraps SocketTimeoutExceptions
but you use ExceptionUtils.getRootCause which retrieves the inner most
exception - which is SocketTimeoutException


so
1. logging or posting a full stackrace would make the reason obvious from
the beginning
2.getRootCause() is not always the right thing to do


On Wed, Feb 12, 2020 at 5:22 AM Evan J <[email protected]> wrote:

> On Tue, Feb 11, 2020 at 3:08 PM Alexey Panchenko <[email protected]
> >
> wrote:
>
> > I guess, It can depends on which level load balancer operates.
> > if load balancer is layer 4 (TCP) then for the operating system it looks
> > like connection is made directly to the target application. In this case
> > exception should happen during connect and be wrapped as
> > ConnectTimeoutException
> > but if load balancer is layer 7 (HTTP) then TCP connection is made to the
> > load balancer (which possibly reads the whole request into memory) and
> then
> > it's trying to establish another TCP connection to the target application
> > and forward request there.
> > While that happens, the client thinks that it already has sent the
> request
> > to the target application and is already waiting for a response.
> > In this situation SocketTimeoutException can happen.
> >
> > On Tue, Feb 11, 2020 at 6:23 PM Evan J <[email protected]>
> > wrote:
> >
> > > Thank you for your response.
> > >
> > > I'm trying to understand your statement. Are you saying when connection
> > > times out, SocketTimeoutException is thrown, and then the library
> > captures
> > > it and rethrows ConnectTimeoutException?
> > >
> > > Bottom line, first, is what I explained the expected behavior, and
> > second,
> > > how can we properly distinguish between these two types of timeouts?
> > >
> > > On Tue, Feb 11, 2020, 4:24 AM Oleg Kalnichevski <[email protected]>
> > wrote:
> > >
> > > > On Mon, 2020-02-10 at 23:25 -0500, Evan J wrote:
> > > > >  Hi,
> > > > >
> > > > > (looks like I'd sent this to a wrong user group originally)
> > > > >
> > > > > We deploy an application B (which is basically a backend
> application
> > > > > serving a web application) to a cluster of application servers
> (JBoss
> > > > > EAP
> > > > > 7.2 -- 8 instances). These instances send HTTP requests to a set of
> > > > > gateway
> > > > > applications, call them application Gs (10 instances), where the
> > > > > requests
> > > > > go through an F5 loadbalancer that sits between them.
> > > > >
> > > > > Instances of application B are Spring Boot applications (2.1.x)
> that
> > > > > have
> > > > > been configured with Apache HttpClient 4.5.5 and HttpCore 4.4.9.
> The
> > > > > configuration is almost identical to this github code:
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
> https://github.com/spring-framework-guru/sfg-blog-posts/tree/master/resttemplate/src/main/java/guru/springframework/resttemplate/config
> > > > >
> > > > > The only exception is that RestTemplateConfig#restTemplate() is
> > > > > configured
> > > > > by creation of RestTemplate and passing
> > > > > HttpComponentsClientHttpRequestFactory to its constructor rather
> than
> > > > > using
> > > > > Spring Boot's RestTemplateBuilder().
> > > > >
> > > > > The keep alive configuration in application B is basically defined
> > > > > something like:
> > > > >
> > > > >
> > > >
> > > >
> > >
> >
> https://github.com/spring-framework-guru/sfg-blog-posts/blob/0152fb0c4acf08d019128ca38c3dd2523871c43c/resttemplate/src/main/java/guru/springframework/resttemplate/config/ApacheHttpClientConfig.java#L52
> > > > >
> > > > > When a load test is executed in a single stack environment where
> the
> > > > > load
> > > > > balancer is omitted (one application B -> one application G), the
> > > > > requests
> > > > > and responses are processed and validated accordingly (each request
> > > > > should
> > > > > correspond to a right response and be sent to the web application
> > > > > front
> > > > > end).
> > > > >
> > > > > However, when we run the same load test in a distributed
> environment
> > > > > with a
> > > > > load balancer between application B's and application G's, we see a
> > > > > lot of
> > > > > SocketTimeoutExceptions being logged, and we notice them very
> quickly
> > > > > (about 5% of total of responses in application B throw that
> > > > > exception).
> > > > >
> > > > > The code structure is very straightforward:
> > > > >
> > > > > try {
> > > > >   // RestTemplate call
> > > > > } catch (RestClientException exception) {
> > > > >   Exception rootCause = ExceptionUtils.getRootCause(exception); //
> > > > > Apache
> > > > > Commons lib
> > > > >   if (rootCause != null) {
> > > > >     if
> > > > >
> (SocketTimeoutException.getClass().getName().equals(rootCause.getClas
> > > > > s().getName())
> > > > > {
> > > > >     // or even if (rootCause instanceof SocketTimeoutException) {
> > > > >       // Log for socket timeout
> > > > >     }
> > > > >     if
> > > > >
> (ConnectTimeoutException.getClass().getName().equals(rootCause.getCla
> > > > > ss().getName())
> > > > > {
> > > > >     // or even if (rootCause instanceof  ConnectTimeoutException )
> {
> > > > >       // Log for connection timeout
> > > > >     }
> > > > >   }
> > > > > }
> > > > >
> > > > > Application B's keep alive has been set to 20 seconds while the
> > > > > socket
> > > > > timeout has been set to 10 seconds by default (connection timeout
> to
> > > > > 1
> > > > > second).
> > > > >
> > > > > After placing timer to log how long it takes for an exception is
> > > > > thrown, we
> > > > > saw, the time that it took from the moment RestTemplate is invoked
> > > > > till the
> > > > > exception is thrown was slightly above 1 second, e.g. 1030 ms, 1045
> > > > > ms,
> > > > > 1020 ms, etc.. This led us to increase the connection timeouts
> from 1
> > > > > second to 2 seconds, and afterward, we didn't get any timeout
> > > > > exception of
> > > > > any sort under the similar load.
> > > > >
> > > > > My question is, why is that the majority of exceptions that are
> being
> > > > > thrown have SocketTimeoutExceptions type as opposed to
> > > > > ConnectTimeoutExceptions which, based on the timeout adjustment
> > > > > mentioned
> > > > > above, appears to be the latter (Connect) vs. socket (read)
> timeout?
> > > > > Note
> > > > > that I said the majority of time, as I've seen a few
> > > > > ConnectTimeoutExceptions as well, but almost 99% of the failed ones
> > > > > are
> > > > > SocketTimeoutExceptions.
> > > > >
> > > > > Also, in our logs, we log the "rootCause's" class name to avoid
> > > > > ambiguity,
> > > > > but as I mentioned, they are being logged as
> SocketTimeoutExceptions
> > > > > class
> > > > > names.
> > > > >
> > > > > What is Apache Components library doing under the hood that signals
> > > > > the
> > > > > underlying JDK code to throw SocketTimeoutExceptions rather than
> > > > > ConnectTimeoutException
> > > >
> > > > Connection management code in HttpClient re-throws
> > > > SocketTimeoutExceptions thrown while connecting a socket to its
> remote
> > > > endpoint as ConnectTimeoutException to help distinguish connect
> timeout
> > > > from socket (read operation) timeout. SocketTimeoutExceptions are
> > > > always thrown by the JRE, not by HttpClient.
> > > >
> > > > Oleg
> > > >
> > > >
> > > >
> > > > ---------------------------------------------------------------------
> > > > To unsubscribe, e-mail: [email protected]
> > > > For additional commands, e-mail: [email protected]
> > > >
> > > >
> > >
> >
>
> Apologize for top posting.
>
> Alexey,
>
> I have to verify, but I think the load balancer is a pass through so the
> first scenario might be true.
>
> However, let's assume the load balancer is running in ISO layer 7 mode. If
> the load balancer is consuming the entire request and tries to establish
> another TCP connection to the target servers, when SocketTimeoutException
> is triggered in the source application (B), it should be an indicative that
> the maximum time allotted to receive packets between the last exchange and
> the current one has exceeded.
>
> But the only reason this could happen if the socket read between two
> packets has exceeded 10 seconds while the logs show from the moment the
> request has been made from the source, to the moment when the exception is
> thrown is merely slightly above 1 second (incidentally a smidgen more than
> the connect timeout).
>
> Moreover, if we increase the socket (read) timeout to 30 seconds, keep the
> connect timeout at 1 second, and run the test in the same distributed
> environment with load balancer, we still see the same result (as a matter
> fact, the error rate/timeout exceptions actually increases by a few
> percentage -- 2%-3%).
>
> For successful requests, we can verify through the logs on the target
> applications that the elapse time from when the requests are received and
> results put on wire by the target instances doesn't exceed more than 3
> seconds (much less than 10 seconds).
>
> For those requests that an exceptions is thrown in the source applications,
> we know for the fact that the requests don't make it to the target
> instances (we have unique Id generated for each request, so if they are not
> getting logged on the target instances, we know they never made it there
> which I assume make the second scenario [layer 7] unlikely).
>
> Any other idea as to why this is happening would be appreciated.
>

Re: SocketTimeoutExceptions When Communicating with a Load Balancer

Reply via email to