Hello. About a year ago, we had two CouchDB 2.3.1 instances running inside Docker containers, each pull-replicating from the other. This way we could read from and write to either server, although we generally chose one as the "active" server and wrote to it. The second server acted as a spare/backup.
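For reference, each pull replication is set up roughly like this. This is only a minimal sketch, assuming replications are defined as documents in the local node's _replicator database; the hostname, database name and credentials below are placeholders, not our real values:

import requests

# Minimal sketch (not the exact setup): one pull replication, defined as a
# document in the local node's _replicator database. Hostname, database name
# and credentials are placeholders.
LOCAL = "http://localhost:5984"
AUTH = ("admin", "password")  # placeholder credentials

replication_doc = {
    "_id": "pull-from-couchdb-b",
    "source": "http://couchdb-b.example.com:5984/mydb",  # remote node we pull from
    "target": f"{LOCAL}/mydb",                           # local copy of the database
    "continuous": True,
}

resp = requests.post(f"{LOCAL}/_replicator", json=replication_doc, auth=AUTH)
resp.raise_for_status()

Each node carries one such document per remote node it pulls from.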
At that point (one year ago) we decided to migrate from CouchDB 2.3.1 to 3.1.1. Instead of upgrading our existing databases, we added two extra instances and configured pull replications on all of them until we ended up with the following scenario:

2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B

where <===> represents two pull replications, one configured on each side, i.e. 2.3.1-A pulls from 2.3.1-B and vice versa. If a write is made on 2.3.1-A, it has to travel through all the servers until it reaches 3.1.1-B. Each server has an exclusive HDD that is not shared with any other service.

We never had a single problem with 2.3.1. After pointing our services to 3.1.1-A, its read I/O wait times gradually increased over several weeks until they reached peaks of 600 ms (totally unworkable). So we stopped sending write requests (HTTP POST) to it and pointed all applications to 3.1.1-B. 3.1.1-A kept receiving writes, but only through the replication protocol, as explained above. On 3.1.1-A the disk stats dropped back to acceptable values, so a few weeks later we pointed the applications back to it in order to confirm whether the problem was related to the write requests sent by our applications. Read I/O times did not increase this time. Instead, 3.1.1-B (which had been handling application traffic for a few weeks) started to show the same behaviour, even though it was no longer handling requests from applications. It feels like some kind of fragmentation is occurring, but the filesystem (ext4) shows none.

Some changes we've made since the problem started:
- Upgraded the kernel from 4.15.0-55-generic to 5.4.0-88-generic
- Upgraded Ubuntu from 18.04 to 20.04
- Deleted the _global_changes database from 3.1.1-A

More info:
- CouchDB is using docker local-persist (https://github.com/MatchbookLab/local-persist) volumes.
- Disks are WD Purple for the 2.3.1 instances and WD Black for the 3.1.1 instances.
- We have only one database of 88 GiB and two views: one of 22 GB and a small one of 30 MB (updated very frequently).
- docker stats shows that CouchDB 3.1.1 uses a lot of memory compared to 2.3.1:
  - 2.5 GiB for 3.1.1-A (not receiving direct write requests)
  - 5.0 GiB for 3.1.1-B (receiving both read and write requests)
  - 900 MiB for 2.3.1-A
  - 800 MiB for 2.3.1-B
- Database compaction runs at night (a sketch of how it is triggered is included below). The problem only occurs during the day, when most of the writes are made.
- Most of the config is default.
- A latency graph from our Munin monitoring is attached (at the peak there is a server outage caused by a kernel upgrade that went wrong).
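For completeness, the nightly compaction is triggered roughly as follows. Again just a sketch: the node URL, credentials and design-document names are placeholders, and in practice this runs from cron against each node:

import requests

# Sketch of the nightly compaction job (placeholder URL, credentials and
# design-document names).
BASE = "http://localhost:5984"
AUTH = ("admin", "password")
DB = "mydb"  # the single large database
JSON_HDR = {"Content-Type": "application/json"}  # required by the _compact endpoints

# Compact the database file itself.
requests.post(f"{BASE}/{DB}/_compact", auth=AUTH, headers=JSON_HDR).raise_for_status()

# Compact the view indexes of each design document, then drop stale index files.
for ddoc in ("big-view", "small-view"):  # placeholder design-document names
    requests.post(f"{BASE}/{DB}/_compact/{ddoc}", auth=AUTH, headers=JSON_HDR).raise_for_status()

requests.post(f"{BASE}/{DB}/_view_cleanup", auth=AUTH, headers=JSON_HDR).raise_for_status()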
Any help is appreciated.

--
*Roberto E. Iglesias*