Re: SocketTimeoutExceptions When Communicating with a Load Balancer

Evan J Tue, 11 Feb 2020 19:23:26 -0800

On Tue, Feb 11, 2020 at 3:08 PM Alexey Panchenko <alex.panche...@gmail.com>
wrote:


> I guess, It can depends on which level load balancer operates.
> if load balancer is layer 4 (TCP) then for the operating system it looks
> like connection is made directly to the target application. In this case
> exception should happen during connect and be wrapped as
> ConnectTimeoutException
> but if load balancer is layer 7 (HTTP) then TCP connection is made to the
> load balancer (which possibly reads the whole request into memory) and then
> it's trying to establish another TCP connection to the target application
> and forward request there.
> While that happens, the client thinks that it already has sent the request
> to the target application and is already waiting for a response.
> In this situation SocketTimeoutException can happen.
>
> On Tue, Feb 11, 2020 at 6:23 PM Evan J <maps.this.addr...@gmail.com>
> wrote:
>
> > Thank you for your response.
> >
> > I'm trying to understand your statement. Are you saying when connection
> > times out, SocketTimeoutException is thrown, and then the library
> captures
> > it and rethrows ConnectTimeoutException?
> >
> > Bottom line, first, is what I explained the expected behavior, and
> second,
> > how can we properly distinguish between these two types of timeouts?
> >
> > On Tue, Feb 11, 2020, 4:24 AM Oleg Kalnichevski <ol...@apache.org>
> wrote:
> >
> > > On Mon, 2020-02-10 at 23:25 -0500, Evan J wrote:
> > > >  Hi,
> > > >
> > > > (looks like I'd sent this to a wrong user group originally)
> > > >
> > > > We deploy an application B (which is basically a backend application
> > > > serving a web application) to a cluster of application servers (JBoss
> > > > EAP
> > > > 7.2 -- 8 instances). These instances send HTTP requests to a set of
> > > > gateway
> > > > applications, call them application Gs (10 instances), where the
> > > > requests
> > > > go through an F5 loadbalancer that sits between them.
> > > >
> > > > Instances of application B are Spring Boot applications (2.1.x) that
> > > > have
> > > > been configured with Apache HttpClient 4.5.5 and HttpCore 4.4.9. The
> > > > configuration is almost identical to this github code:
> > > >
> > > >
> > >
> > >
> >
> https://github.com/spring-framework-guru/sfg-blog-posts/tree/master/resttemplate/src/main/java/guru/springframework/resttemplate/config
> > > >
> > > > The only exception is that RestTemplateConfig#restTemplate() is
> > > > configured
> > > > by creation of RestTemplate and passing
> > > > HttpComponentsClientHttpRequestFactory to its constructor rather than
> > > > using
> > > > Spring Boot's RestTemplateBuilder().
> > > >
> > > > The keep alive configuration in application B is basically defined
> > > > something like:
> > > >
> > > >
> > >
> > >
> >
> https://github.com/spring-framework-guru/sfg-blog-posts/blob/0152fb0c4acf08d019128ca38c3dd2523871c43c/resttemplate/src/main/java/guru/springframework/resttemplate/config/ApacheHttpClientConfig.java#L52
> > > >
> > > > When a load test is executed in a single stack environment where the
> > > > load
> > > > balancer is omitted (one application B -> one application G), the
> > > > requests
> > > > and responses are processed and validated accordingly (each request
> > > > should
> > > > correspond to a right response and be sent to the web application
> > > > front
> > > > end).
> > > >
> > > > However, when we run the same load test in a distributed environment
> > > > with a
> > > > load balancer between application B's and application G's, we see a
> > > > lot of
> > > > SocketTimeoutExceptions being logged, and we notice them very quickly
> > > > (about 5% of total of responses in application B throw that
> > > > exception).
> > > >
> > > > The code structure is very straightforward:
> > > >
> > > > try {
> > > >   // RestTemplate call
> > > > } catch (RestClientException exception) {
> > > >   Exception rootCause = ExceptionUtils.getRootCause(exception); //
> > > > Apache
> > > > Commons lib
> > > >   if (rootCause != null) {
> > > >     if
> > > > (SocketTimeoutException.getClass().getName().equals(rootCause.getClas
> > > > s().getName())
> > > > {
> > > >     // or even if (rootCause instanceof SocketTimeoutException) {
> > > >       // Log for socket timeout
> > > >     }
> > > >     if
> > > > (ConnectTimeoutException.getClass().getName().equals(rootCause.getCla
> > > > ss().getName())
> > > > {
> > > >     // or even if (rootCause instanceof  ConnectTimeoutException ) {
> > > >       // Log for connection timeout
> > > >     }
> > > >   }
> > > > }
> > > >
> > > > Application B's keep alive has been set to 20 seconds while the
> > > > socket
> > > > timeout has been set to 10 seconds by default (connection timeout to
> > > > 1
> > > > second).
> > > >
> > > > After placing timer to log how long it takes for an exception is
> > > > thrown, we
> > > > saw, the time that it took from the moment RestTemplate is invoked
> > > > till the
> > > > exception is thrown was slightly above 1 second, e.g. 1030 ms, 1045
> > > > ms,
> > > > 1020 ms, etc.. This led us to increase the connection timeouts from 1
> > > > second to 2 seconds, and afterward, we didn't get any timeout
> > > > exception of
> > > > any sort under the similar load.
> > > >
> > > > My question is, why is that the majority of exceptions that are being
> > > > thrown have SocketTimeoutExceptions type as opposed to
> > > > ConnectTimeoutExceptions which, based on the timeout adjustment
> > > > mentioned
> > > > above, appears to be the latter (Connect) vs. socket (read) timeout?
> > > > Note
> > > > that I said the majority of time, as I've seen a few
> > > > ConnectTimeoutExceptions as well, but almost 99% of the failed ones
> > > > are
> > > > SocketTimeoutExceptions.
> > > >
> > > > Also, in our logs, we log the "rootCause's" class name to avoid
> > > > ambiguity,
> > > > but as I mentioned, they are being logged as SocketTimeoutExceptions
> > > > class
> > > > names.
> > > >
> > > > What is Apache Components library doing under the hood that signals
> > > > the
> > > > underlying JDK code to throw SocketTimeoutExceptions rather than
> > > > ConnectTimeoutException
> > >
> > > Connection management code in HttpClient re-throws
> > > SocketTimeoutExceptions thrown while connecting a socket to its remote
> > > endpoint as ConnectTimeoutException to help distinguish connect timeout
> > > from socket (read operation) timeout. SocketTimeoutExceptions are
> > > always thrown by the JRE, not by HttpClient.
> > >
> > > Oleg
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: httpclient-users-unsubscr...@hc.apache.org
> > > For additional commands, e-mail: httpclient-users-h...@hc.apache.org
> > >
> > >
> >
>

Apologize for top posting.

Alexey,

I have to verify, but I think the load balancer is a pass through so the
first scenario might be true.

However, let's assume the load balancer is running in ISO layer 7 mode. If
the load balancer is consuming the entire request and tries to establish
another TCP connection to the target servers, when SocketTimeoutException
is triggered in the source application (B), it should be an indicative that
the maximum time allotted to receive packets between the last exchange and
the current one has exceeded.

But the only reason this could happen if the socket read between two
packets has exceeded 10 seconds while the logs show from the moment the
request has been made from the source, to the moment when the exception is
thrown is merely slightly above 1 second (incidentally a smidgen more than
the connect timeout).

Moreover, if we increase the socket (read) timeout to 30 seconds, keep the
connect timeout at 1 second, and run the test in the same distributed
environment with load balancer, we still see the same result (as a matter
fact, the error rate/timeout exceptions actually increases by a few
percentage -- 2%-3%).

For successful requests, we can verify through the logs on the target
applications that the elapse time from when the requests are received and
results put on wire by the target instances doesn't exceed more than 3
seconds (much less than 10 seconds).

For those requests that an exceptions is thrown in the source applications,
we know for the fact that the requests don't make it to the target
instances (we have unique Id generated for each request, so if they are not
getting logged on the target instances, we know they never made it there
which I assume make the second scenario [layer 7] unlikely).

Any other idea as to why this is happening would be appreciated.

Re: SocketTimeoutExceptions When Communicating with a Load Balancer

Reply via email to