Hello. About a year ago, we had two CouchDB 2.3.1 instances running inside Docker containers, each pull-replicating from the other. This way we could read from and write to either server, although we generally chose one as the "active" server and wrote to it. The second server acted as a spare/backup.
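For reference, each pull replication is set up roughly like this. This is only a minimal sketch, assuming replications are defined as documents in the local node's _replicator database; the hostname, database name and credentials below are placeholders, not our real values:

import requests

# Minimal sketch (not the exact setup): one pull replication, defined as a
# document in the local node's _replicator database. Hostname, database name
# and credentials are placeholders.
LOCAL = "http://localhost:5984"
AUTH = ("admin", "password")  # placeholder credentials

replication_doc = {
    "_id": "pull-from-couchdb-b",
    "source": "http://couchdb-b.example.com:5984/mydb",  # remote node we pull from
    "target": f"{LOCAL}/mydb",                           # local copy of the database
    "continuous": True,
}

resp = requests.post(f"{LOCAL}/_replicator", json=replication_doc, auth=AUTH)
resp.raise_for_status()

Each node carries one such document per remote node it pulls from.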
At that point (one year ago) we decided to migrate from CouchDB 2.3.1 to 3.1.1. Instead of upgrading our existing databases, we added two extra instances and configured pull replications on all of them until we ended up with the following scenario:

2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B

where <===> represents two pull replications, one configured on each side, i.e. 2.3.1-A pulls from 2.3.1-B and vice versa. If a write is made on 2.3.1-A, it has to travel through all the servers until it reaches 3.1.1-B. Each server has an exclusive HDD that is not shared with any other service.

We never had a single problem with 2.3.1. After pointing our services to 3.1.1-A, its read I/O wait times gradually increased over several weeks until they reached peaks of 600 ms (totally unworkable). So we stopped sending write requests (HTTP POST) to it and pointed all applications to 3.1.1-B. 3.1.1-A kept receiving writes, but only through the replication protocol, as explained above. On 3.1.1-A the disk stats dropped back to acceptable values, so a few weeks later we pointed the applications back to it in order to confirm whether the problem was related to the write requests sent by our applications. Read I/O times did not increase this time. Instead, 3.1.1-B (which had been handling application traffic for a few weeks) started to show the same behaviour, even though it was no longer handling requests from applications. It feels like some kind of fragmentation is occurring, but the filesystem (ext4) shows none.

Some changes we've made since the problem started:
- Upgraded the kernel from 4.15.0-55-generic to 5.4.0-88-generic
- Upgraded Ubuntu from 18.04 to 20.04
- Deleted the _global_changes database from 3.1.1-A

More info:
- CouchDB is using docker local-persist (https://github.com/MatchbookLab/local-persist) volumes.
- Disks are WD Purple for the 2.3.1 instances and WD Black for the 3.1.1 instances.
- We have only one database of 88 GiB and two views: one of 22 GB and a small one of 30 MB (updated very frequently).
- docker stats shows that CouchDB 3.1.1 uses a lot of memory compared to 2.3.1:
  - 2.5 GiB for 3.1.1-A (not receiving direct write requests)
  - 5.0 GiB for 3.1.1-B (receiving both read and write requests)
  - 900 MiB for 2.3.1-A
  - 800 MiB for 2.3.1-B
- Database compaction runs at night (a sketch of how it is triggered is included below). The problem only occurs during the day, when most of the writes are made.
- Most of the config is default.
- A latency graph from our Munin monitoring is attached (at the peak there is a server outage caused by a kernel upgrade that went wrong).
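For completeness, the nightly compaction is triggered roughly as follows. Again just a sketch: the node URL, credentials and design-document names are placeholders, and in practice this runs from cron against each node:

import requests

# Sketch of the nightly compaction job (placeholder URL, credentials and
# design-document names).
BASE = "http://localhost:5984"
AUTH = ("admin", "password")
DB = "mydb"  # the single large database
JSON_HDR = {"Content-Type": "application/json"}  # required by the _compact endpoints

# Compact the database file itself.
requests.post(f"{BASE}/{DB}/_compact", auth=AUTH, headers=JSON_HDR).raise_for_status()

# Compact the view indexes of each design document, then drop stale index files.
for ddoc in ("big-view", "small-view"):  # placeholder design-document names
    requests.post(f"{BASE}/{DB}/_compact/{ddoc}", auth=AUTH, headers=JSON_HDR).raise_for_status()

requests.post(f"{BASE}/{DB}/_view_cleanup", auth=AUTH, headers=JSON_HDR).raise_for_status()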
Any help is appreciated.

--
*Roberto E. Iglesias*