The interesting part of my case is that the connection timeout is set to 5
seconds, and while it does time out occasionally, the metrics I've already
created show that some connections take 1-3 seconds to fully establish.

This would indicate that some low-level OS/networking timeout was hit and a
retry/retransmission happened automatically.
I could certainly lower the timeout to something far more reasonable and
implement a retry mechanism to try a few more times, but at this point in
the troubleshooting that feels like masking the underlying issue rather than
solving it, which could bite us later on.

As such, for now I would love to actually find the issue (it's a brand-new
piece of infrastructure we're migrating to), but even with the dumps filtered
heavily, we're looking at a huge number of packets per minute.
Having metrics for each step of the handshake could help reveal
misconfigurations we made while setting it all up.
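
To make it concrete, the kind of wrapper I have in mind would do the plain
TCP connect itself and only delegate the TLS layering, so the two phases can
be timed separately. This is just a rough, untested sketch against the
pre-5.4 interfaces (metric names and the Micrometer wiring are placeholders,
and it ignores the newer TlsConfig/attachment overloads added in 5.2):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.concurrent.TimeUnit;

import javax.net.ssl.SSLContext;

import io.micrometer.core.instrument.MeterRegistry;

import org.apache.hc.client5.http.socket.LayeredConnectionSocketFactory;
import org.apache.hc.client5.http.ssl.SSLConnectionSocketFactory;
import org.apache.hc.core5.http.HttpHost;
import org.apache.hc.core5.http.protocol.HttpContext;
import org.apache.hc.core5.util.TimeValue;

public class TimedTlsSocketFactory implements LayeredConnectionSocketFactory {

    private final SSLConnectionSocketFactory delegate;
    private final MeterRegistry meters;

    public TimedTlsSocketFactory(SSLContext sslContext, MeterRegistry meters) {
        this.delegate = new SSLConnectionSocketFactory(sslContext);
        this.meters = meters;
    }

    @Override
    public Socket createSocket(HttpContext context) throws IOException {
        return new Socket(); // plain socket, TLS gets layered on afterwards
    }

    @Override
    public Socket connectSocket(TimeValue connectTimeout, Socket socket,
            HttpHost host, InetSocketAddress remoteAddress,
            InetSocketAddress localAddress, HttpContext context) throws IOException {
        Socket sock = socket != null ? socket : createSocket(context);
        if (localAddress != null) {
            sock.bind(localAddress);
        }
        int timeoutMs = connectTimeout != null ? (int) connectTimeout.toMilliseconds() : 0;

        // Phase 1: plain TCP connect
        long start = System.nanoTime();
        sock.connect(remoteAddress, timeoutMs);
        meters.timer("httpclient.connect.tcp")
                .record(System.nanoTime() - start, TimeUnit.NANOSECONDS);

        // Phase 2: TLS handshake, timed in createLayeredSocket below
        return createLayeredSocket(sock, host.getHostName(), remoteAddress.getPort(), context);
    }

    @Override
    public Socket createLayeredSocket(Socket socket, String target, int port,
            HttpContext context) throws IOException {
        long start = System.nanoTime();
        Socket layered = delegate.createLayeredSocket(socket, target, port, context);
        meters.timer("httpclient.connect.tls")
                .record(System.nanoTime() - start, TimeUnit.NANOSECONDS);
        return layered;
    }
}

Registered for "https" in the socket factory registry the
PoolingHttpClientConnectionManager is built from, that would split the
numbers I currently only get as a sum.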

Thanks for the mention regardless; we're already using Resilience4j for its
circuit-breaking capabilities, so if the root-cause hunt comes up empty,
adding a retry is certainly an option.
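
If we do end up going that route, the wiring I'd picture is roughly the
following (untested sketch; the names are placeholders and the exception
list would need narrowing to whatever our connect timeouts actually surface
as), with the retry wrapped around the breaker-protected call so the breaker
still sees every attempt:

import java.net.SocketTimeoutException;
import java.time.Duration;
import java.util.function.Supplier;

import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class BackendCallResilience {

    // "backendHttp" is a placeholder; in reality the breaker comes from our
    // existing Resilience4j configuration.
    private final CircuitBreaker circuitBreaker = CircuitBreaker.ofDefaults("backendHttp");

    private final Retry retry = Retry.of("backendHttp", RetryConfig.custom()
            .maxAttempts(3)                       // initial call + up to 2 retries
            .waitDuration(Duration.ofMillis(50))  // short pause between attempts
            .retryExceptions(SocketTimeoutException.class)
            .build());

    // Retry goes around the circuit breaker, so every attempt is recorded.
    public <T> Supplier<T> decorate(Supplier<T> httpCall) {
        return Retry.decorateSupplier(retry,
                CircuitBreaker.decorateSupplier(circuitBreaker, httpCall));
    }
}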

Richard


On Fri, Mar 15, 2024 at 10:05 PM Skylos <sky...@gmail.com> wrote:

> I do have a thought - sometimes it's just important to make it work, not to
> drill into exactly what it is. I have had some luck with the Resilience4J
> library, where I set it up with a short timeout - connections that work,
> work within, say, 30ms - and if they take more than 50ms, they're never
> coming back in my measurements. So I set the timeout to 50ms, the delay to
> 10ms, and the retry count to 3. It can make noise every time it retries -
> but given the squirrely behavior I was seeing under load, which seemed more
> infrastructure- than server-based, the retry solved the problem neatly.
>
> The similar-sounding problem I had was spates of REST calls between
> services at different cloud providers going through different load
> balancer/API management layers. It seemed to me that a low percentage of
> my attempts to connect would just disappear - like packet loss or a process
> crash on the load balancer - the remote service would never see a packet.
>
> Just a thought, hope it helps.  I love to aggressively chase solutions.
>
> David
>
>
>
> On Fri, Mar 15, 2024 at 4:58 PM Richard Tippl <richard.ti...@gmail.com>
> wrote:
>
> > Hello,
> >
> > I am supporting a Spring Boot application, which uses HttpClient 5 under
> > the hood. We're mainly using a PoolingHttpClientConnectionManager to send
> > a large number of requests to a target server.
> >
> > We're experiencing some network issues (socket connect timeouts during
> > high-load scenarios), and in trying to track them down I've started
> > looking into what actually happens during connection establishment.
> > My idea was to measure the time taken by the individual steps of creating
> > a connection, mainly the TCP socket open and the SSL handshake.
> >
> > The initial version I've come up with uses (abuses) the
> > ConnectionSocketFactory interface, wrapping it in a way that measures the
> > execution time of connectSocket. This gives the sum of TCP open and SSL
> > handshake.
> > This way I can at least get some numbers and use them to help locate and
> > resolve the issues.
> >
> > There are two issues with this approach: as far as I can tell, I can't
> > measure these times separately, and in the newest 5.4 alpha the interface
> > I'm using has been deprecated and replaced by the
> > DefaultHttpClientConnectionOperator, which performs all of the connection
> > steps in a single method call.
> >
> > Am I missing an easier way to plug into the connection-creation flow and
> > measure what I want to measure? Will it still be possible after the
> > deprecated interfaces get removed? Is there a way I could measure the
> > socket open and the SSL handshake separately?
> > The metrics I've gathered so far have already started showing us certain
> > trends, and extending them could help us further in solving these issues.
> >
> > Thanks for responding.
> >
> > Richard
> >
>
>
> --
> Dog approved this message.
>
