Re: SocketTimeoutExceptions When Communicating with a Load Balancer

Evan J Wed, 12 Feb 2020 17:23:11 -0800

On Wed, Feb 12, 2020 at 9:33 AM Alexey Panchenko <[email protected]>
wrote:


> Ah, it's actually much simpler than I thought.
> ConnectTimeoutException wraps SocketTimeoutExceptions
> but you use ExceptionUtils.getRootCause which retrieves the inner most
> exception - which is SocketTimeoutException
>
> so
> 1. logging or posting a full stackrace would make the reason obvious from
> the beginning
> 2.getRootCause() is not always the right thing to do
>
>
> On Wed, Feb 12, 2020 at 5:22 AM Evan J <[email protected]>
> wrote:
>
> > On Tue, Feb 11, 2020 at 3:08 PM Alexey Panchenko <
> [email protected]
> > >
> > wrote:
> >
> > > I guess, It can depends on which level load balancer operates.
> > > if load balancer is layer 4 (TCP) then for the operating system it
> looks
> > > like connection is made directly to the target application. In this
> case
> > > exception should happen during connect and be wrapped as
> > > ConnectTimeoutException
> > > but if load balancer is layer 7 (HTTP) then TCP connection is made to
> the
> > > load balancer (which possibly reads the whole request into memory) and
> > then
> > > it's trying to establish another TCP connection to the target
> application
> > > and forward request there.
> > > While that happens, the client thinks that it already has sent the
> > request
> > > to the target application and is already waiting for a response.
> > > In this situation SocketTimeoutException can happen.
> > >
> > > On Tue, Feb 11, 2020 at 6:23 PM Evan J <[email protected]>
> > > wrote:
> > >
> > > > Thank you for your response.
> > > >
> > > > I'm trying to understand your statement. Are you saying when
> connection
> > > > times out, SocketTimeoutException is thrown, and then the library
> > > captures
> > > > it and rethrows ConnectTimeoutException?
> > > >
> > > > Bottom line, first, is what I explained the expected behavior, and
> > > second,
> > > > how can we properly distinguish between these two types of timeouts?
> > > >
> > > > On Tue, Feb 11, 2020, 4:24 AM Oleg Kalnichevski <[email protected]>
> > > wrote:
> > > >
> > > > > On Mon, 2020-02-10 at 23:25 -0500, Evan J wrote:
> > > > > >  Hi,
> > > > > >
> > > > > > (looks like I'd sent this to a wrong user group originally)
> > > > > >
> > > > > > We deploy an application B (which is basically a backend
> > application
> > > > > > serving a web application) to a cluster of application servers
> > (JBoss
> > > > > > EAP
> > > > > > 7.2 -- 8 instances). These instances send HTTP requests to a set
> of
> > > > > > gateway
> > > > > > applications, call them application Gs (10 instances), where the
> > > > > > requests
> > > > > > go through an F5 loadbalancer that sits between them.
> > > > > >
> > > > > > Instances of application B are Spring Boot applications (2.1.x)
> > that
> > > > > > have
> > > > > > been configured with Apache HttpClient 4.5.5 and HttpCore 4.4.9.
> > The
> > > > > > configuration is almost identical to this github code:
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/spring-framework-guru/sfg-blog-posts/tree/master/resttemplate/src/main/java/guru/springframework/resttemplate/config
> > > > > >
> > > > > > The only exception is that RestTemplateConfig#restTemplate() is
> > > > > > configured
> > > > > > by creation of RestTemplate and passing
> > > > > > HttpComponentsClientHttpRequestFactory to its constructor rather
> > than
> > > > > > using
> > > > > > Spring Boot's RestTemplateBuilder().
> > > > > >
> > > > > > The keep alive configuration in application B is basically
> defined
> > > > > > something like:
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
> https://github.com/spring-framework-guru/sfg-blog-posts/blob/0152fb0c4acf08d019128ca38c3dd2523871c43c/resttemplate/src/main/java/guru/springframework/resttemplate/config/ApacheHttpClientConfig.java#L52
> > > > > >
> > > > > > When a load test is executed in a single stack environment where
> > the
> > > > > > load
> > > > > > balancer is omitted (one application B -> one application G), the
> > > > > > requests
> > > > > > and responses are processed and validated accordingly (each
> request
> > > > > > should
> > > > > > correspond to a right response and be sent to the web application
> > > > > > front
> > > > > > end).
> > > > > >
> > > > > > However, when we run the same load test in a distributed
> > environment
> > > > > > with a
> > > > > > load balancer between application B's and application G's, we
> see a
> > > > > > lot of
> > > > > > SocketTimeoutExceptions being logged, and we notice them very
> > quickly
> > > > > > (about 5% of total of responses in application B throw that
> > > > > > exception).
> > > > > >
> > > > > > The code structure is very straightforward:
> > > > > >
> > > > > > try {
> > > > > >   // RestTemplate call
> > > > > > } catch (RestClientException exception) {
> > > > > >   Exception rootCause = ExceptionUtils.getRootCause(exception);
> //
> > > > > > Apache
> > > > > > Commons lib
> > > > > >   if (rootCause != null) {
> > > > > >     if
> > > > > >
> > (SocketTimeoutException.getClass().getName().equals(rootCause.getClas
> > > > > > s().getName())
> > > > > > {
> > > > > >     // or even if (rootCause instanceof SocketTimeoutException) {
> > > > > >       // Log for socket timeout
> > > > > >     }
> > > > > >     if
> > > > > >
> > (ConnectTimeoutException.getClass().getName().equals(rootCause.getCla
> > > > > > ss().getName())
> > > > > > {
> > > > > >     // or even if (rootCause instanceof  ConnectTimeoutException
> )
> > {
> > > > > >       // Log for connection timeout
> > > > > >     }
> > > > > >   }
> > > > > > }
> > > > > >
> > > > > > Application B's keep alive has been set to 20 seconds while the
> > > > > > socket
> > > > > > timeout has been set to 10 seconds by default (connection timeout
> > to
> > > > > > 1
> > > > > > second).
> > > > > >
> > > > > > After placing timer to log how long it takes for an exception is
> > > > > > thrown, we
> > > > > > saw, the time that it took from the moment RestTemplate is
> invoked
> > > > > > till the
> > > > > > exception is thrown was slightly above 1 second, e.g. 1030 ms,
> 1045
> > > > > > ms,
> > > > > > 1020 ms, etc.. This led us to increase the connection timeouts
> > from 1
> > > > > > second to 2 seconds, and afterward, we didn't get any timeout
> > > > > > exception of
> > > > > > any sort under the similar load.
> > > > > >
> > > > > > My question is, why is that the majority of exceptions that are
> > being
> > > > > > thrown have SocketTimeoutExceptions type as opposed to
> > > > > > ConnectTimeoutExceptions which, based on the timeout adjustment
> > > > > > mentioned
> > > > > > above, appears to be the latter (Connect) vs. socket (read)
> > timeout?
> > > > > > Note
> > > > > > that I said the majority of time, as I've seen a few
> > > > > > ConnectTimeoutExceptions as well, but almost 99% of the failed
> ones
> > > > > > are
> > > > > > SocketTimeoutExceptions.
> > > > > >
> > > > > > Also, in our logs, we log the "rootCause's" class name to avoid
> > > > > > ambiguity,
> > > > > > but as I mentioned, they are being logged as
> > SocketTimeoutExceptions
> > > > > > class
> > > > > > names.
> > > > > >
> > > > > > What is Apache Components library doing under the hood that
> signals
> > > > > > the
> > > > > > underlying JDK code to throw SocketTimeoutExceptions rather than
> > > > > > ConnectTimeoutException
> > > > >
> > > > > Connection management code in HttpClient re-throws
> > > > > SocketTimeoutExceptions thrown while connecting a socket to its
> > remote
> > > > > endpoint as ConnectTimeoutException to help distinguish connect
> > timeout
> > > > > from socket (read operation) timeout. SocketTimeoutExceptions are
> > > > > always thrown by the JRE, not by HttpClient.
> > > > >
> > > > > Oleg
> > > > >
> > > > >
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: [email protected]
> > > > > For additional commands, e-mail:
> [email protected]
> > > > >
> > > > >
> > > >
> > >
> >
> > Apologize for top posting.
> >
> > Alexey,
> >
> > I have to verify, but I think the load balancer is a pass through so the
> > first scenario might be true.
> >
> > However, let's assume the load balancer is running in ISO layer 7 mode.
> If
> > the load balancer is consuming the entire request and tries to establish
> > another TCP connection to the target servers, when SocketTimeoutException
> > is triggered in the source application (B), it should be an indicative
> that
> > the maximum time allotted to receive packets between the last exchange
> and
> > the current one has exceeded.
> >
> > But the only reason this could happen if the socket read between two
> > packets has exceeded 10 seconds while the logs show from the moment the
> > request has been made from the source, to the moment when the exception
> is
> > thrown is merely slightly above 1 second (incidentally a smidgen more
> than
> > the connect timeout).
> >
> > Moreover, if we increase the socket (read) timeout to 30 seconds, keep
> the
> > connect timeout at 1 second, and run the test in the same distributed
> > environment with load balancer, we still see the same result (as a matter
> > fact, the error rate/timeout exceptions actually increases by a few
> > percentage -- 2%-3%).
> >
> > For successful requests, we can verify through the logs on the target
> > applications that the elapse time from when the requests are received and
> > results put on wire by the target instances doesn't exceed more than 3
> > seconds (much less than 10 seconds).
> >
> > For those requests that an exceptions is thrown in the source
> applications,
> > we know for the fact that the requests don't make it to the target
> > instances (we have unique Id generated for each request, so if they are
> not
> > getting logged on the target instances, we know they never made it there
> > which I assume make the second scenario [layer 7] unlikely).
> >
> > Any other idea as to why this is happening would be appreciated.
> >
>

Alexey,

I see it now. The ConnectTimeoutException is wrapping
SocketTimeoutException which is not the intended exception I am looking for
such scenario.

Yes, dumping the stacktrace from the original exception should provided the
needed details, however, the application has somewhat a unique requirement
to send specific Ids for each of these timeout types in order to notify the
appropriate team (don't ask me why).

I can do an ugly hack to look for a cause, and if it matches connection
timeout, pull it from the immediate cause; otherwise, grab the root cause
to handle others (there are other custom exceptions that are wrapped by
RestClientException, hence the reason).

Was there a reason why the Component's implementation has connection
exception wrapping socket exception?

Re: SocketTimeoutExceptions When Communicating with a Load Balancer

Reply via email to