Hi Christopher,

It would depend on which bottleneck is causing the slowness: it could be limited by read speed at the source, the replication job itself, or write throughput at the target. Some of the settings you can play with are:

* Increase the batch size (worker_batch_size) if the documents are not too large:
  https://docs.couchdb.org/en/stable/config/replicator.html#replicator/worker_batch_size

* Increase the number of worker processes from 4 to something larger, 10 or 20 perhaps:
  https://docs.couchdb.org/en/stable/config/replicator.html#replicator/worker_processes

* If you increase worker_processes, also increase http_connections a bit, as the workers might otherwise be limited by the number of HTTP connections anyway; you could make that 40 or 100 or so. A rough sketch of all three settings is below.
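Purely as an illustration (the exact numbers are just starting points I made up for this example; the right values depend on your document sizes and hardware), the relevant section of local.ini could look something like:

  [replicator]
  ; larger batches help when individual documents are small
  worker_batch_size = 2000
  ; more worker processes per replication job (default is 4)
  worker_processes = 20
  ; enough HTTP connections so the extra workers aren't starved
  http_connections = 100

These can also be set through the _config HTTP API, but either way I believe a running replication job only picks up the new values once it restarts, so don't be surprised if nothing changes immediately.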
If the bottleneck is the source's disk I/O, there might not be much you can do. In general, depending on where the bottleneck is, adjusting some of these settings might not have any effect at all. A few more notes:

* Keep an eye on the logs and the replication job stats (e.g. in _scheduler/jobs) to see if there are any timeouts or errors, basically to see whether the replication job keeps crashing and restarting.

* Q=256 does seem a bit high; that might affect any change feeds or view queries. But if you have enough disk I/O throughput (parallelism) it could work.

* There are perhaps a few tweaks you can make to vm.args (the Erlang VM arguments) to help with disk I/O. One is setting +SDio to something higher than the default 16 (https://github.com/apache/couchdb/blob/main/rel/overlay/etc/vm.args#L62). We use +SDio 80 in production, but I've seen others use +SDio 128 and such; a sketch of that line is below.
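In vm.args that is just a one-line change, something like the following (80 is only what we happen to use, not a recommendation; note that vm.args is only read at VM start, so the node has to be restarted for the change to take effect):

  # dirty I/O scheduler threads; the shipped default is 16
  +SDio 80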
Cheers,
-Nick

On Fri, Mar 15, 2024 at 3:39 AM Chris Bayliss <[email protected]> wrote:
>
> Hi all,
>
> I inherited a single-node CouchDB database that backs a medical research
> project. We’ve been using CouchDB for 10+ years, so that’s not a concern. Then I
> spotted it uses a single database to store billions, 10^9 if we’re being
> pedantic, of documents (2B at the time, just over a TB of data) across the
> default 2 shards. Not ideal, but technically not a problem. Then I spotted it’s
> ingesting ~30M documents a day and was continuously compressing and
> reindexing everything associated with this database.
>
> Skipping over months of trial and error: I’m currently replicating it to a 4
> node NVMe-backed cluster with n=3 q=256. Everything is running 3.3.3 (the Erlang
> 24.3 version). I’ve read [1] and [2], and right now it’s replicating at 2.25k
> documents a second +/- 0.5k. This is acceptable, it will catch up with the
> initial node eventually, but at the rate it’s going it’ll be ~60 days.
>
> How can I speed this process up, if at all?
>
> I’d add that the code that accesses this database isn’t mine either, so splitting
> the database out into logical subsets isn’t an option at this time.
>
> Thanks
>
> Chris
>
> 1 -
> https://blog.cloudant.com/2023/02/08/Replication-efficiency-improvements.html
> 2 - https://github.com/apache/couchdb/issues/4308
>
>
> --
> Christopher Bayliss
> Senior Software Engineer, Melbourne eResearch Group
>
> School of Computing and Information Systems
> Level 5, Melbourne Connect (Building 290)
> University of Melbourne, VIC, 3010, Australia
>
> Email: [email protected]
>