ak11234 opened a new issue, #12377:
URL: https://github.com/apache/apisix/issues/12377

   ### Current Behavior
   
   When installing APISIX into a Kubernetes cluster with the APISIX Helm chart, 
per the instructions, the chart also installs and configures etcd. 
   
   That configuration does not seem to enable any auto compaction of the etcd 
database, so it grows without bound. In the 6 months our installation has 
been live, over 250,000 value revisions have been retained, almost 300 MB of 
data. 
   
   When an etcd node inevitably restarts, it needs many minutes to reload 
such a large database into memory, but the default liveness check kills the 
pod before it gets the chance. Once the database has grown large enough, an 
etcd node that goes down gets stuck in a boot loop and never comes back 
online, leaving APISIX without etcd.
   
   I was able to solve this issue by disabling the healthcheck, allowing the 
etcd pod to take as long as it needed to load the database (many minutes!). I 
then used `etcdctl compact` to prune the first 250k revisions and `etcdctl 
defrag` to reclaim the disk space. 
   
   After that, the pods were able to restart normally within seconds.
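
For reference, the manual cleanup described above can be sketched roughly as follows. This is a minimal sketch, not the exact commands that were run: it assumes `etcdctl` (v3 API) is on the PATH and that `ETCDCTL_ENDPOINTS` and any auth/TLS settings are already exported. The function name `compact_and_defrag` is made up for illustration.

```shell
# Sketch only: compact etcd history up to the current revision, then defragment.
# ASSUMPTION: etcdctl v3 is on PATH and ETCDCTL_ENDPOINTS (plus any auth/TLS
# variables) is already set for the target cluster.

compact_and_defrag() {
    # Read the current revision out of the endpoint status JSON.
    # With several endpoints there are several matches; the first is used.
    rev=$(etcdctl endpoint status --write-out=json \
        | grep -o '"revision":[0-9][0-9]*' \
        | grep -o '[0-9][0-9]*$' \
        | head -n 1)

    if [ -z "$rev" ]; then
        echo "could not determine the current etcd revision" >&2
        return 1
    fi

    # Discard all key history older than the current revision.
    etcdctl compact "$rev" || return 1

    # Release the space freed by compaction back to the filesystem.
    # Defrag acts only on the configured endpoints and blocks each member
    # while it runs.
    etcdctl defrag
}
```

Calling `compact_and_defrag` once per cluster should be enough for the compaction step, since compaction is replicated, but defragmentation has to reach every member, so the endpoint list needs to include all of them.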
   
   ### Expected Behavior
   
   I expect APISIX's Helm chart to configure etcd so that it can run long-term 
without such manual intervention, or at least to document the need for it.
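
One possible shape for such a configuration, as an unverified sketch: it assumes the bundled etcd is the Bitnami etcd subchart exposed under the `etcd` values key, and that it accepts `autoCompactionMode` and `autoCompactionRetention`. The release name, namespace, and the `apisix/apisix` chart reference are placeholders, and the function name `enable_auto_compaction` is made up for illustration.

```shell
# Sketch only: turn on periodic auto compaction for the bundled etcd.
# ASSUMPTIONS (not confirmed against the chart):
#   - the etcd subchart is Bitnami's, configured under the "etcd" key
#   - it supports autoCompactionMode / autoCompactionRetention
#   - release name, namespace, and chart reference below are placeholders
RELEASE="${RELEASE:-apisix}"
NAMESPACE="${NAMESPACE:-apisix}"
CHART_VERSION="${CHART_VERSION:-1.7.0}"   # chart version from the Environment section

enable_auto_compaction() {
    # --version pins the chart so the upgrade only changes values.
    helm upgrade "$RELEASE" apisix/apisix \
        --namespace "$NAMESPACE" \
        --version "$CHART_VERSION" \
        --reuse-values \
        --set etcd.autoCompactionMode=periodic \
        --set-string etcd.autoCompactionRetention=1h
}
```

Auto compaction only discards old revisions. It does not defragment, so an already bloated database file keeps its size on disk until `etcdctl defrag` is run, although the freed pages are reused and growth should stop.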
   
   ### Error Logs
   
   The primary indicator in the etcd log is "db file is flocked by another 
process, or taking too long" after 10 seconds, which it never gets past due to 
the default liveness check restarting the pod.
   
   ### Steps to Reproduce
   
   1. Install APISIX to a Kubernetes cluster using the Helm chart, per the 
instructions
   2. Set up APISIX routes, upstreams, plugins, etc. to reach a usable state
   3. Wait (weeks or months, depending on usage) for the etcd database to grow 
very large
   4. Attempt to restart the etcd pods and note that they fail the liveness 
check and are killed before they finish starting
   
   ### Environment
   
   - APISIX version (run `apisix version`): 3.7.0 (helm chart version 1.7.0)
   - Operating system (run `uname -a`): Amazon Linux
   - OpenResty / Nginx version (run `openresty -V` or `nginx -V`): 
openresty/1.21.4.2
   - etcd version, if relevant (run `curl 
http://127.0.0.1:9090/v1/server_info`): 3.5.7
   - APISIX Dashboard version, if relevant: 3.0.0
   - Plugin runner version, for issues related to plugin runners:
   - LuaRocks version, for installation issues (run `luarocks --version`):
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@apisix.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
