Hi,

(It looks like I originally sent this to the wrong user group.)

We deploy an application B (basically a backend application serving a web
application) to a cluster of application servers (JBoss EAP 7.2, 8
instances). These instances send HTTP requests to a set of gateway
applications, call them application Gs (10 instances), through an F5 load
balancer that sits between them.

Instances of application B are Spring Boot (2.1.x) applications configured
with Apache HttpClient 4.5.5 and HttpCore 4.4.9. The configuration is almost
identical to this GitHub code:

https://github.com/spring-framework-guru/sfg-blog-posts/tree/master/resttemplate/src/main/java/guru/springframework/resttemplate/config

The only difference is that RestTemplateConfig#restTemplate() creates the
RestTemplate directly, passing an HttpComponentsClientHttpRequestFactory to
its constructor, rather than using Spring Boot's RestTemplateBuilder.

The keep-alive strategy in application B is defined roughly like this:

https://github.com/spring-framework-guru/sfg-blog-posts/blob/0152fb0c4acf08d019128ca38c3dd2523871c43c/resttemplate/src/main/java/guru/springframework/resttemplate/config/ApacheHttpClientConfig.java#L52
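Here is a rough, dependency-free model of what a keep-alive strategy of this kind typically decides. It honours a `timeout=N` parameter (in seconds) from the server's `Keep-Alive` response header and otherwise falls back to a default. This is only a sketch. `KeepAliveDuration` and `resolveMillis` are hypothetical names, not HttpClient's `ConnectionKeepAliveStrategy` API. The 20-second default is assumed to match application B's keep-alive setting.

```java
import java.util.Locale;

/** Hypothetical, simplified model of a keep-alive strategy (not the HttpClient API). */
final class KeepAliveDuration {
    // Assumed default, matching the 20-second keep-alive used by application B
    static final long DEFAULT_KEEP_ALIVE_MILLIS = 20_000L;

    private KeepAliveDuration() {}

    /** Returns how long an idle connection may be reused, in milliseconds. */
    static long resolveMillis(String keepAliveHeader) {
        if (keepAliveHeader == null) {
            return DEFAULT_KEEP_ALIVE_MILLIS;
        }
        // A Keep-Alive header value looks like "timeout=5, max=100"
        for (String element : keepAliveHeader.split(",")) {
            String[] pair = element.trim().split("=", 2);
            if (pair.length == 2 && pair[0].trim().toLowerCase(Locale.ROOT).equals("timeout")) {
                try {
                    return Long.parseLong(pair[1].trim()) * 1000L; // seconds -> millis
                } catch (NumberFormatException ignored) {
                    // Malformed value: fall through to the default below
                }
            }
        }
        return DEFAULT_KEEP_ALIVE_MILLIS;
    }
}
```

With this model, `resolveMillis("timeout=5, max=100")` yields 5000. A missing header, or one without a `timeout` parameter, yields the 20000 ms default.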

When a load test is executed in a single-stack environment without the load
balancer (one application B -> one application G), the requests and
responses are processed and validated correctly (each request gets the right
response, which is then sent on to the web application front end).

However, when we run the same load test in a distributed environment with a
load balancer between the application B and application G instances, we
quickly see a lot of SocketTimeoutExceptions being logged (about 5% of all
requests in application B fail with that exception).

The code structure is very straightforward:

try {
  // RestTemplate call
} catch (RestClientException exception) {
  // Apache Commons Lang: getRootCause() returns a Throwable, not an Exception
  Throwable rootCause = ExceptionUtils.getRootCause(exception);
  if (rootCause != null) {
    if (SocketTimeoutException.class.getName().equals(rootCause.getClass().getName())) {
      // or even: if (rootCause instanceof SocketTimeoutException) {
      // Log for socket timeout
    }
    // ConnectTimeoutException here is org.apache.http.conn.ConnectTimeoutException
    if (ConnectTimeoutException.class.getName().equals(rootCause.getClass().getName())) {
      // or even: if (rootCause instanceof ConnectTimeoutException) {
      // Log for connection timeout
    }
  }
}
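ExceptionUtils.getRootCause() returns only the innermost throwable. Any wrapper exceptions between the RestClientException and the root cause are therefore not visible in a log that records just the root. The following is a minimal, JDK-only sketch that records the whole cause chain instead. `CauseChain`, `chain`, `describe` and `anyInstanceOf` are hypothetical names, not part of Commons Lang or Spring.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for inspecting every exception in a cause chain. */
final class CauseChain {
    private CauseChain() {}

    /** Outermost exception first, root cause last; stops if the chain loops back on itself. */
    static List<Throwable> chain(Throwable top) {
        List<Throwable> result = new ArrayList<>();
        Set<Throwable> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        for (Throwable t = top; t != null && seen.add(t); t = t.getCause()) {
            result.add(t);
        }
        return result;
    }

    /** Class names of the whole chain, e.g. "A -> B -> C", suitable for a log line. */
    static String describe(Throwable top) {
        StringBuilder sb = new StringBuilder();
        for (Throwable t : chain(top)) {
            if (sb.length() > 0) {
                sb.append(" -> ");
            }
            sb.append(t.getClass().getName());
        }
        return sb.toString();
    }

    /** True if any exception in the chain (not only the root) is an instance of {@code type}. */
    static boolean anyInstanceOf(Throwable top, Class<? extends Throwable> type) {
        for (Throwable t : chain(top)) {
            if (type.isInstance(t)) {
                return true;
            }
        }
        return false;
    }
}
```

In the catch block above, logging `CauseChain.describe(exception)` next to the root cause would show every wrapper type. Checking `CauseChain.anyInstanceOf(exception, ConnectTimeoutException.class)` would test the whole chain rather than only its last element.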

Application B's keep-alive duration is set to 20 seconds, while the socket
(read) timeout is set to 10 seconds by default and the connection timeout to
1 second.
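At the plain JDK level, the two settings correspond to two different calls on java.net.Socket:

- The connect timeout bounds `Socket#connect(SocketAddress, int)`.
- The socket (read) timeout is SO_TIMEOUT, set with `Socket#setSoTimeout(int)`, and it applies to each blocking read.

Per the JDK Javadoc, java.net.Socket reports the expiry of either one as java.net.SocketTimeoutException. The sketch below is a plain-JDK illustration of that behaviour only, not a description of how HttpClient is implemented. `RawSocketTimeouts` and `connectAndReadOneByte` are hypothetical names. The 1 s and 10 s values above would be passed in as the two timeout arguments.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical illustration of the two separate JDK-level socket timeouts. */
final class RawSocketTimeouts {
    private RawSocketTimeouts() {}

    /**
     * Opens a TCP connection and reads a single byte.
     * Returns the byte read (0-255), or -1 if the peer closed the connection.
     * Throws java.net.SocketTimeoutException if either timeout expires.
     */
    static int connectAndReadOneByte(InetSocketAddress address, int connectTimeoutMs, int readTimeoutMs)
            throws IOException {
        try (Socket socket = new Socket()) {
            // Connect phase: bounded by connectTimeoutMs (0 would mean "wait indefinitely")
            socket.connect(address, connectTimeoutMs);
            // Read phase: SO_TIMEOUT bounds each blocking read on this socket
            socket.setSoTimeout(readTimeoutMs);
            return socket.getInputStream().read();
        }
    }
}
```

Consider a local ServerSocket that accepts connections but never writes. Calling this against it connects immediately and then fails with a SocketTimeoutException once `readTimeoutMs` has elapsed.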

After adding a timer to log how long it takes for an exception to be thrown,
we saw that the time from the moment RestTemplate is invoked until the
exception is thrown was slightly above 1 second, e.g. 1030 ms, 1045 ms,
1020 ms, etc. This led us to increase the connection timeout from 1 second
to 2 seconds, and afterward we didn't get any timeout exceptions of any sort
under similar load.
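A measurement like the one described above could be taken with a small wrapper such as the following sketch. It uses System.nanoTime(), a monotonic clock, rather than wall-clock time, and it keeps the exception so that type and elapsed time can be logged together. `CallTimer` and `Timed` are hypothetical names. The RestTemplate call is assumed to be passed in as the Callable.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper: runs a call and records how long it took, success or failure. */
final class CallTimer {
    private CallTimer() {}

    /** Result of a timed call; failure is null when the call completed normally. */
    static final class Timed<T> {
        final T value;
        final Exception failure;
        final long elapsedMillis;

        Timed(T value, Exception failure, long elapsedMillis) {
            this.value = value;
            this.failure = failure;
            this.elapsedMillis = elapsedMillis;
        }
    }

    static <T> Timed<T> time(Callable<T> call) {
        // System.nanoTime() is monotonic, so it is suitable for measuring elapsed time
        long start = System.nanoTime();
        try {
            T value = call.call();
            return new Timed<>(value, null, elapsedSince(start));
        } catch (Exception e) {
            // Keep the exception so it can be logged together with the elapsed time
            return new Timed<>(null, e, elapsedSince(start));
        }
    }

    private static long elapsedSince(long startNanos) {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
    }
}
```

For example, with an assumed `restTemplate` and `url` in scope, the call would be timed as follows. `r.elapsedMillis` and `r.failure` would then be logged.

```java
CallTimer.Timed<String> r = CallTimer.time(() -> restTemplate.getForObject(url, String.class));
```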

My question is: why do the majority of the thrown exceptions have the type
SocketTimeoutException rather than ConnectTimeoutException, when, based on
the timeout adjustment mentioned above, the failures appear to be connect
timeouts rather than socket (read) timeouts? Note that I said the majority,
as I've seen a few ConnectTimeoutExceptions as well, but almost 99% of the
failures are SocketTimeoutExceptions.

Also, we log the root cause's class name to avoid ambiguity, and, as I
mentioned, it is logged as SocketTimeoutException.

What is the Apache HttpComponents library doing under the hood that causes
us to see the JDK's SocketTimeoutException rather than a
ConnectTimeoutException?
