ak11234 opened a new issue, #12377: URL: https://github.com/apache/apisix/issues/12377
### Current Behavior

When using the APISIX Helm chart per the instructions to install APISIX into a Kubernetes cluster, the chart also installs and configures etcd. That configuration does not seem to enable any auto compaction of the etcd database, so it grows without bound. In the 6 months our installation has been live, over 250,000 value revisions have been retained, almost 300 MB of data.

When an etcd node inevitably restarts, it needs many minutes to reload such a large database into memory, but the default health/liveness check kills the pod before it gets the chance. So if an etcd node goes down once the database has grown large enough, it gets stuck in a boot loop and never comes back online, leaving APISIX without etcd.

I was able to solve this by disabling the health check, which allowed the etcd pod to take as long as it needed to load the database (many minutes!). I then used `etcdctl compact` to prune the first 250k revisions and `etcdctl defrag` to reclaim the disk space. After that, the pods were able to restart normally within seconds.

### Expected Behavior

I expect APISIX's Helm chart to configure etcd so that it can run long-term without such manual intervention, or at least to document the required maintenance.

### Error Logs

The primary indicator in the etcd log is "db file is flocked by another process, or taking too long" after 10 seconds. etcd never gets past this point because the default liveness check restarts the pod.

### Steps to Reproduce

1. Install APISIX to a Kubernetes cluster using the Helm chart, per the instructions
2. Set up APISIX routes, upstreams, plugins, etc. to reach a usable state
3. Wait (weeks or months, depending on usage) for the etcd database to grow very large
4. Attempt to restart the etcd pods and note that they fail the liveness check and are killed before getting a chance to restart

### Environment

- APISIX version (run `apisix version`): 3.7.0 (helm chart version 1.7.0)
- Operating system (run `uname -a`): Amazon Linux
- OpenResty / Nginx version (run `openresty -V` or `nginx -V`): openresty/1.21.4.2
- etcd version, if relevant (run `curl http://127.0.0.1:9090/v1/server_info`): 3.5.7
- APISIX Dashboard version, if relevant: 3.0.0
- Plugin runner version, for issues related to plugin runners:
- LuaRocks version, for installation issues (run `luarocks --version`):
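
For reference, the manual cleanup described under Current Behavior can be sketched roughly as below. This is a minimal sketch, not the exact commands that were run: it assumes `etcdctl` (v3 API) is on the PATH inside the etcd pod and that endpoints and credentials come from the environment. The function names `extract_revision` and `compact_and_defrag` are hypothetical helpers introduced only for illustration.

```shell
#!/bin/sh
# Sketch: compact etcd up to its current revision, then defragment.
# Assumption: etcdctl v3 is on PATH; endpoints/auth are set via environment.

# Hypothetical helper: read `etcdctl endpoint status --write-out=json`
# on stdin and print the revision reported by the first endpoint.
extract_revision() {
  grep -o '"revision":[0-9]*' | head -n 1 | cut -d: -f2
}

# Hypothetical helper: discard history older than the current revision,
# then release the freed space back to the filesystem.
compact_and_defrag() {
  rev=$(etcdctl endpoint status --write-out=json | extract_revision)
  if [ -z "$rev" ]; then
    echo "could not determine current etcd revision" >&2
    return 1
  fi
  etcdctl compact "$rev" && etcdctl defrag
}
```

Calling `compact_and_defrag` runs both steps against whichever endpoint `etcdctl` is pointed at; in a multi-member cluster the defrag step likely has to be repeated per member. This only cleans up after the fact. The longer-term fix is probably etcd's own `--auto-compaction-mode` and `--auto-compaction-retention` options, set through the chart.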