My configuration is different (a lot of small DBs), but I had disk I/O performance issues too when upgrading from CouchDB 2 to CouchDB 3. Maybe it's related, maybe it's not. I use AWS, and the solution for me was to increase the AWS disk IOPS.
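In case it helps, this is roughly what the IOPS bump looked like on my side: a minimal sketch using boto3, where the region, volume ID and the 6000 IOPS target are placeholders, and which assumes a gp3/io1/io2 volume whose provisioned IOPS can be changed online.

```python
import boto3

# Minimal sketch: raise the provisioned IOPS of an EBS volume in place.
# Region, volume ID and the 6000 IOPS target are placeholders.
ec2 = boto3.client("ec2", region_name="eu-west-1")

resp = ec2.modify_volume(
    VolumeId="vol-0123456789abcdef0",  # the data volume backing CouchDB
    Iops=6000,                         # new provisioned IOPS
)

# The modification is applied asynchronously; this only reports its state.
print(resp["VolumeModification"]["ModificationState"])
```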
See the full discussion here: https://github.com/apache/couchdb/discussions/3217

On Mon, Apr 4, 2022 at 18:22, Roberto Iglesias <[email protected]> wrote:

> Hello.
>
> About 1 year ago, we had two CouchDB 2.3.1 instances running inside Docker
> containers and pull-replicating from each other. This way, we could read
> from and write to either of these servers, although we generally chose one
> as the "active" server and wrote to it. The second server acted as a spare
> or backup.
>
> At that point (1 year ago) we decided to migrate from CouchDB 2.3.1 to
> 3.1.1. Instead of upgrading our existing databases, we added two extra
> instances and configured pull replications on all of them until we got the
> following scenario:
>
> 2.3.1-A <===> 2.3.1-B <===> 3.1.1-A <===> 3.1.1-B
>
> where <===> represents two pull replications, one configured on each side,
> i.e. 2.3.1-A pulls from 2.3.1-B and vice versa.
>
> If a write is made at 2.3.1-A, it has to make it through all servers until
> it reaches 3.1.1-B.
>
> Each of them has an exclusive HDD which is not shared with any other
> service.
>
> We never had a single problem with 2.3.1.
>
> After pointing our services to 3.1.1-A, its read I/O wait times gradually
> increased over weeks until they reached peaks of 600 ms (totally
> unworkable). So we stopped making write requests (HTTP POST) to it and
> pointed all applications to 3.1.1-B. 3.1.1-A was still receiving writes,
> but only via the replication protocol, as I explained before.
>
> On the 3.1.1-A server, disk stats decreased to acceptable values, so a few
> weeks later we pointed the applications back to it in order to confirm
> whether or not the problem is related to write requests sent from our
> application. Read I/O times did not increase this time. Instead, 3.1.1-B
> (which had handled application traffic for a few weeks) started to show
> the same behaviour, even though it was no longer handling requests from
> applications.
>
> It feels like some fragmentation is occurring, but the filesystem (ext4)
> shows none.
>
> Some changes we've made since the problem started:
>
> - Upgraded the kernel from 4.15.0-55-generic to 5.4.0-88-generic
> - Upgraded Ubuntu from 18.04 to 20.04
> - Deleted the _global_changes database from CouchDB 3.1.1-A
>
> More info:
>
> - CouchDB is using Docker local-persist
>   (https://github.com/MatchbookLab/local-persist) volumes.
> - Disks are WD Purple for the 2.3.1 instances and WD Black for the 3.1.1
>   instances.
> - We have only one database of 88 GiB and two views: one of 22 GB and a
>   small one of 30 MB (updated very frequently).
> - docker stats shows that CouchDB 3.1.1 uses a lot of memory compared to
>   2.3.1:
>   - 2.5 GiB for 3.1.1-A (not receiving direct write requests)
>   - 5.0 GiB for 3.1.1-B (receiving both read and write requests)
>   - 900 MiB for 2.3.1-A
>   - 800 MiB for 2.3.1-B
> - Database compaction runs at night. The problem only occurs during the
>   day, when most of the writes are made.
> - Most of the config is default.
> - A latency graph from Munin monitoring is attached (at the peak there is
>   a server outage caused by a kernel upgrade that went wrong).
>
> Any help is appreciated.
>
> --
> Roberto E. Iglesias
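For reference, pull replications like the ones described in the quoted message are usually set up with documents in the _replicator database. Below is a minimal sketch of one direction (3.1.1-A pulling from 2.3.1-B); the hostnames, credentials and the database name "mydb" are assumptions, not the original setup.

```python
import requests

# One direction of a pull replication: 3.1.1-A pulls from 2.3.1-B.
# Hostnames, credentials and the database name "mydb" are placeholders.
LOCAL = "http://admin:password@couchdb-3-1-1-a:5984"   # node doing the pulling
REMOTE = "http://admin:password@couchdb-2-3-1-b:5984"  # node being pulled from

doc = {
    "_id": "pull-from-2.3.1-B",
    "source": f"{REMOTE}/mydb",  # remote database to pull from
    "target": f"{LOCAL}/mydb",   # local database to write into
    "continuous": True,          # keep replicating as new writes arrive
}

resp = requests.post(f"{LOCAL}/_replicator", json=doc)
resp.raise_for_status()
print(resp.json())
```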
