> On Feb 14, 2024, at 12:24 AM, Romain Manni-Bucau <rmannibu...@gmail.com> 
> wrote:
> 
> Hi Phil,
> 
> You are right, it can be done in [pool] - I'm not sure it is the right level
> (for instance, in my previous example the pool would need to expose something
> like "getCircuitBreakerState" so callers can see whether it can be used or
> not), but maybe I'm too used to decorators ;).
> The key point for [pool] is the last one, the proxying.
> The pool can't do it itself since it manages plain, undecorated instances,
> but if you add the notion of a proxy factory, falling back on a JRE dynamic
> proxy when the pooled type is only an interface, it will work.
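
For the interface-only case, I think a plain JDK dynamic proxy is enough to guard
each call with whatever breaker state the factory tracks. A minimal sketch - the
CircuitBreakingProxy name and the BooleanSupplier hook are placeholders, not
anything that exists in [pool]:

import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.function.BooleanSupplier;

// Sketch only: when the pooled type is an interface, wrap borrowed instances
// in a dynamic proxy that consults the circuit breaker before every call, so
// the pool itself never needs a getCircuitBreakerState() accessor.
public final class CircuitBreakingProxy {

    public static <T> T wrap(Class<T> iface, T target, BooleanSupplier circuitClosed) {
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(),
                new Class<?>[] { iface },
                (proxy, method, args) -> {
                    if (!circuitClosed.getAsBoolean()) {
                        throw new IllegalStateException("circuit open: resource unavailable");
                    }
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause();   // unwrap the real failure
                    }
                }));
    }
}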

Thanks, that confirms that the resilient factory approach might be the way to 
go.  [pool] is interesting because so much happens in the factories and the 
lifecycle events provide natural extension points.  DBCP’s 
PoolableConnectionFactory is a great example.  I will keep playing with the 
ResilientFactory idea and see how much I can get set up generically, using DBCP 
and some of my own apps that pool other things as examples.  Patches welcome!
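
To make that concrete, here is roughly the shape I have been experimenting with:
a delegating PooledObjectFactory that tracks consecutive creation failures. The
threshold handling and the isDown() accessor are placeholders, not a worked-out
design:

import java.util.concurrent.atomic.AtomicInteger;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.PooledObjectFactory;

// Rough sketch only: wraps any PooledObjectFactory and counts consecutive
// makeObject() failures so the pool (or a recovery task) can ask isDown().
public class ResilientFactory<T> implements PooledObjectFactory<T> {

    private final PooledObjectFactory<T> delegate;
    private final int failureThreshold;   // placeholder tuning knob
    private final AtomicInteger consecutiveFailures = new AtomicInteger();

    public ResilientFactory(PooledObjectFactory<T> delegate, int failureThreshold) {
        this.delegate = delegate;
        this.failureThreshold = failureThreshold;
    }

    public boolean isDown() {
        return consecutiveFailures.get() >= failureThreshold;
    }

    @Override
    public PooledObject<T> makeObject() throws Exception {
        try {
            PooledObject<T> p = delegate.makeObject();
            consecutiveFailures.set(0);   // any success resets the count
            return p;
        } catch (Exception e) {
            consecutiveFailures.incrementAndGet();
            throw e;
        }
    }

    @Override
    public void destroyObject(PooledObject<T> p) throws Exception {
        delegate.destroyObject(p);
    }

    @Override
    public boolean validateObject(PooledObject<T> p) {
        return delegate.validateObject(p);
    }

    @Override
    public void activateObject(PooledObject<T> p) throws Exception {
        delegate.activateObject(p);
    }

    @Override
    public void passivateObject(PooledObject<T> p) throws Exception {
        delegate.passivateObject(p);
    }
}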
> 
> Romain Manni-Bucau
> @rmannibucau <https://twitter.com/rmannibucau> |  Blog
> <https://rmannibucau.metawerx.net/> | Old Blog
> <http://rmannibucau.wordpress.com> | Github <https://github.com/rmannibucau> |
> LinkedIn <https://www.linkedin.com/in/rmannibucau> | Book
> <https://www.packtpub.com/application-development/java-ee-8-high-performance>
> 
> 
>> On Tue, Feb 13, 2024 at 10:38 PM Phil Steitz <phil.ste...@gmail.com> wrote:
>> 
>> Thanks, Romain, this is awesome.  I would really like to find a way to get
>> this kind of thing implemented in [pool] or via enhanced factories.  See
>> more on that below.
>> 
>> On Tue, Feb 13, 2024 at 1:27 PM Romain Manni-Bucau <rmannibu...@gmail.com>
>> wrote:
>> 
>>> Hi Phil,
>>> 
>>> What I have used in the past for this kind of thing was to rely on the
>>> pool's timeout plus a healthcheck - external to the pool - with some
>>> trigger (the simplest was "if 5 healthchecks fail without any success in
>>> between", for example). Such a trigger spawns a task (think thread, even
>>> if it runs on an executor, but guarantee there is a slot for this task)
>>> which retries at a faster pace (instead of every 30s it retries 5 times
>>> in a run - the number was tunable but 5 was my default).
>>> If the resource is still detected as down - as opposed to merely
>>> overloaded or the like - it is considered a database outage and another
>>> task is spawned which retries every 30 seconds. When the database comes
>>> back - I added some business check here; the idea is to check not just
>>> the connection but that the tables are accessible, because after such a
>>> downtime the db often does not come back all at once - just
>>> destroy/recreate the pool.
>>> The destroy/recreate was handled with a DataSource proxy in front of the
>>> pool, swapping the delegate.
>>> 
>> 
>> It seems to me that all of this might be possible using what I was calling
>> a ResilientFactory.  The factory could implement the health-checking itself,
>> using pluggable strategies for how to check, how often, and what counts as
>> an outage.  And the factory could (if so configured and in the right state)
>> bounce the pool.  I like the model of escalating concern.
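
Something along these lines is what I picture for the escalation: a strategy
interface plus a checker that tightens the schedule after consecutive failures.
All of the names here (HealthCheckStrategy, EscalatingHealthChecker, the
intervals) are invented for illustration:

import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of "escalating concern": check slowly while healthy,
// switch to a fast retry pace after enough consecutive failures, and invoke
// a recovery callback (e.g. bounce the pool) once the resource looks healthy.
public class EscalatingHealthChecker {

    public interface HealthCheckStrategy {
        boolean isHealthy();   // e.g. a fast query against a known table
    }

    private final HealthCheckStrategy strategy;
    private final Runnable onRecovered;   // e.g. destroy/recreate the pool
    private final int failureThreshold;
    private final Duration normalInterval;
    private final Duration fastInterval;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private int consecutiveFailures;

    public EscalatingHealthChecker(HealthCheckStrategy strategy, Runnable onRecovered,
            int failureThreshold, Duration normalInterval, Duration fastInterval) {
        this.strategy = strategy;
        this.onRecovered = onRecovered;
        this.failureThreshold = failureThreshold;
        this.normalInterval = normalInterval;
        this.fastInterval = fastInterval;
    }

    public void start() {
        schedule(normalInterval);
    }

    private void schedule(Duration interval) {
        scheduler.schedule(this::check, interval.toMillis(), TimeUnit.MILLISECONDS);
    }

    private void check() {
        if (strategy.isHealthy()) {
            boolean wasDown = consecutiveFailures >= failureThreshold;
            consecutiveFailures = 0;
            if (wasDown) {
                onRecovered.run();   // resource is back: bounce the pool
            }
            schedule(normalInterval);
        } else {
            consecutiveFailures++;
            // Escalate: poll faster once we suspect a real outage.
            schedule(consecutiveFailures >= failureThreshold ? fastInterval : normalInterval);
        }
    }
}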
>> 
>> 
>>> Indeed it is not magic inside the pool, but it can only work better than a
>>> pool-only solution because you can integrate with your already existing
>>> checks and add more advanced ones - if you have JPA, just run a fast query
>>> on some table to validate that the db is back, for example.
>>> In the end the code is pretty simple and has another big advantage: you
>>> can circuit-break the database completely while you consider it down,
>>> letting through only 10% - or whatever ratio you want - of the requests
>>> (a kind of canary testing which avoids putting too much pressure on the
>>> pool).
>>> 
>>> I guess it is not exactly the answer you expected, but I think it can be a
>>> good solution and ultimately could sit in a new package in dbcp or similar?
>>> 
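
The ratio-limited pass-through is easy to sketch independently of the pool; the
CanaryAdmission name is invented and the counter-based admission below is just
one way to do it:

import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the "let ~10% through while the breaker is open"
// idea: while the resource is considered down, admit only every Nth request
// so a few canary calls keep probing it without hammering the pool.
public class CanaryAdmission {

    private final int admitOneInN;   // e.g. 10 => roughly 10% pass through
    private final AtomicLong counter = new AtomicLong();

    public CanaryAdmission(int admitOneInN) {
        this.admitOneInN = admitOneInN;
    }

    /** @param circuitOpen true while the resource is considered down */
    public boolean admit(boolean circuitOpen) {
        if (!circuitOpen) {
            return true;   // healthy: everything passes
        }
        return counter.getAndIncrement() % admitOneInN == 0;
    }
}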
>> 
>> I don't see anything here that is really specific to database connections
>> (other than the proxy setup to gracefully handle bounces), so I want to
>> keep thinking about how to solve the general problem by somehow enhancing
>> factories and/or pools.
>> 
>> Phil
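
For the bounce itself, the delegate-swapping facade Romain described seems simple
enough to do generically. SwappableDataSource and the Supplier hook below are
invented names, nothing DBCP-specific, and whoever calls bounce() is responsible
for closing the old delegate:

import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;
import javax.sql.DataSource;

// Hypothetical sketch: a DataSource facade whose pooled delegate can be
// swapped atomically when the health checker decides to bounce the pool.
public class SwappableDataSource implements DataSource {

    private final AtomicReference<DataSource> delegate;
    private final Supplier<DataSource> factory;   // e.g. builds a new BasicDataSource

    public SwappableDataSource(Supplier<DataSource> factory) {
        this.factory = factory;
        this.delegate = new AtomicReference<>(factory.get());
    }

    /** Replace the underlying pool and return the old one so the caller can close it. */
    public DataSource bounce() {
        return delegate.getAndSet(factory.get());
    }

    @Override public Connection getConnection() throws SQLException {
        return delegate.get().getConnection();
    }
    @Override public Connection getConnection(String user, String pass) throws SQLException {
        return delegate.get().getConnection(user, pass);
    }
    @Override public PrintWriter getLogWriter() throws SQLException {
        return delegate.get().getLogWriter();
    }
    @Override public void setLogWriter(PrintWriter out) throws SQLException {
        delegate.get().setLogWriter(out);
    }
    @Override public void setLoginTimeout(int seconds) throws SQLException {
        delegate.get().setLoginTimeout(seconds);
    }
    @Override public int getLoginTimeout() throws SQLException {
        return delegate.get().getLoginTimeout();
    }
    @Override public java.util.logging.Logger getParentLogger() throws SQLFeatureNotSupportedException {
        throw new SQLFeatureNotSupportedException();
    }
    @Override public <T> T unwrap(Class<T> iface) throws SQLException {
        return delegate.get().unwrap(iface);
    }
    @Override public boolean isWrapperFor(Class<?> iface) throws SQLException {
        return delegate.get().isWrapperFor(iface);
    }
}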
>> 
>>> 
>>> Best,
>>> Romain Manni-Bucau
>>> 
>>> 
>>> On Tue, Feb 13, 2024 at 9:11 PM Phil Steitz <phil.ste...@gmail.com> wrote:
>>> 
>>>> POOL-407 tracks a basic liveness problem that we have never been able to
>>>> solve:
>>>> 
>>>> A factory "goes down," resulting in either failed object creation or
>>>> failed validation during the outage.  The pool has capacity to create,
>>>> but the factory fails to serve threads as they arrive, so they end up
>>>> parked waiting on the idle object pool.  After a possibly very brief
>>>> interruption, the factory heals itself (maybe a database comes back up)
>>>> and the waiting threads can be served, but until other threads arrive,
>>>> get served, and return instances to the pool, the parked threads remain
>>>> blocked.  Configuring minIdle and pool maintenance
>>>> (timeBetweenEvictionRuns > 0) can improve the situation, but running the
>>>> evictor at a high enough frequency to handle every transient failure is
>>>> not a great solution.
>>>> 
>>>> I am stuck on how to improve this.  I have experimented with the idea of
>>>> a ResilientFactory, placing the responsibility on the factory to know
>>>> when it is down and when it comes back up; when it recovers, it keeps
>>>> calling its pool's create as long as the pool has threads waiting on
>>>> take and remaining capacity.  But I am not sure that is the best
>>>> approach.  The advantage of this is that resource-specific failure and
>>>> recovery detection can be implemented.
>>>> 
>>>> Another option that I have played with is to have the pool keep track of
>>>> factory failures, and when it observes enough failures over a long enough
>>>> time, it starts a thread that does some kind of exponential backoff to
>>>> keep retrying the factory.  Once the factory comes back, the recovery
>>>> thread creates as many instances as it can without exceeding capacity
>>>> and adds them to the pool.
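
Revisiting that second option now, the backoff-and-refill loop itself is small.
This sketch uses only public GenericObjectPool methods, but the trigger wiring
and the BackoffRefiller name are hypothetical:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.commons.pool2.impl.GenericObjectPool;

// Hypothetical sketch: once the pool decides the factory is down, retry
// creation (via addObject) with exponential backoff; on success, keep adding
// idle instances up to capacity so parked borrowers get served.
public class BackoffRefiller<T> {

    private final GenericObjectPool<T> pool;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public BackoffRefiller(GenericObjectPool<T> pool) {
        this.pool = pool;
    }

    public void start(long initialDelayMillis, long maxDelayMillis) {
        scheduler.schedule(() -> attempt(initialDelayMillis, maxDelayMillis),
                initialDelayMillis, TimeUnit.MILLISECONDS);
    }

    private void attempt(long delayMillis, long maxDelayMillis) {
        try {
            pool.addObject();   // probe: a single creation attempt
            // Factory is back: refill up to capacity so waiters are unblocked.
            while (pool.getNumActive() + pool.getNumIdle() < pool.getMaxTotal()) {
                pool.addObject();
            }
        } catch (Exception e) {
            // Still down: back off exponentially and try again.
            long next = Math.min(delayMillis * 2, maxDelayMillis);
            scheduler.schedule(() -> attempt(next, maxDelayMillis), next, TimeUnit.MILLISECONDS);
        }
    }
}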
>>>> 
>>>> I don't really like either of these.  Anyone have any better ideas?
>>>> 
>>>> Phil
>>>> 
>>> 
>> 
