Hello everyone.

I have a problem with APISIX and I hope I can discuss it with you.

APISIX has a configuration item, `etcd.resync_delay`. When a watch call
against etcd returns an error, APISIX pauses for this long before launching
the next watch request.
I understand that this logic protects the etcd server from being overloaded
by a client that retries without pause after an unexpected error.
I think this protection mechanism is reasonable, but one of the error cases
is the timeout error, which means that no event was generated for the
watched key within the watch period (30s timeout by default). This kind of
error is expected, because the gateway configuration usually does not
change frequently. At present we have no special handling for the timeout
error, so it also causes the next watch call to be delayed by
`etcd.resync_delay` seconds. This is very dangerous.

For example: in the default configuration, when the user's upstream
configuration does not change within 30s, APISIX suspends configuration
synchronization for about 6-7 seconds (5s + jitter), and it cannot pick up
any change to the upstream during this period.

So I think we should exempt the timeout error from the resync delay logic
and start the next watch immediately. This is in line with the
millisecond-level configuration synchronization claimed in the APISIX
documentation.
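
A minimal sketch of the proposed handling, in Python for illustration only
(APISIX itself is written in Lua, and this is not its actual code). The
function name, the "timeout" error string, and the 2s jitter bound are
assumptions chosen to match the 6-7 second figure above:

```python
import random

# Assumed marker for a watch that ended without events; the real etcd
# client may report a timeout differently.
TIMEOUT_ERR = "timeout"


def delay_before_next_watch(err, resync_delay=5.0, max_jitter=2.0):
    """Return how many seconds to wait before issuing the next watch."""
    if err is None or err == TIMEOUT_ERR:
        # No error, or the watch simply saw no change: watch again at once.
        return 0.0
    # Any other error: back off so a failing etcd is not flooded with retries.
    return resync_delay + random.uniform(0.0, max_jitter)
```

With this rule an idle watch that times out is followed immediately by the
next watch, while real failures still go through the resync delay.
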
The impact of doing so: removing the resync delay after a timeout error
will cause APISIX to hold more concurrent etcd connections on average. For
example, in the default configuration (`etcd.timeout=30`,
`etcd.resync_delay=5`), the resync delay after a timeout reduces the number
of concurrent connections by about 1/6 (6/(6+30)). I think this impact is
negligible compared to the configuration not taking effect in time.
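
To make the arithmetic concrete, here is a rough Python sketch. The helper
names are hypothetical, and the 6s delay is an assumption: the 5s
`etcd.resync_delay` plus about 1s of jitter.

```python
def delay_fraction(timeout, delay):
    """Fraction of each idle watch cycle spent waiting instead of watching."""
    return delay / (timeout + delay)


def watch_cycles_per_hour(timeout, delay):
    """Number of idle watch cycles (timeout + delay) that fit in one hour."""
    return 3600.0 / (timeout + delay)


# Default-like values: 30s watch timeout, about 6s delay after each timeout.
print(delay_fraction(30, 6))         # about 0.167, i.e. 1/6
print(watch_cycles_per_hour(30, 6))  # 100.0 with the delay
print(watch_cycles_per_hour(30, 0))  # 120.0 without the delay
```

In other words, with these numbers APISIX is not watching for roughly one
sixth of the time whenever the configuration is idle, and dropping the delay
raises the idle watch rate from about 100 to 120 cycles per hour per watcher.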

What do you think?
