The answer to your clarifying question is absolutely yes. The “pending_changes” metric refers to the number of committed changes on the shard replica emitting the log event that still need to be cross-checked against another replica; it is not a measure of writes that need to be executed.
Cheers, Adam

> On Jun 5, 2017, at 4:37 PM, Phil May <phil....@motorolasolutions.com> wrote:
>
> Hi Adam,
>
> Thanks for the info!
>
> When we run at high write rates, we will start to fall behind, but when
> we reduce the rate, we eventually catch up.
>
> I have a clarification question – can the warning messages we are seeing
> still occur in a healthy cluster due to the "redundant cross-check"
> taking long enough that more changes have accumulated that now also need
> to be cross-checked (even when no actual writes were needed)?
>
> We have had some luck modifying sync_concurrency (which is exposed in
> the .ini file) and batch_size (which we exposed), and that does give us
> more throughput capacity.
>
> Thanks!
>
> - Phil
>
>
> On Mon, Jun 5, 2017 at 11:38 AM, Adam Kocoloski <kocol...@apache.org> wrote:
>
>> Hi Phil,
>>
>> Here’s the thing to keep in mind about those warning messages: in a
>> healthy cluster, the internal replication traffic that generates them is
>> really just a redundant cross-check. It exists to “heal” a cluster
>> member that was down during some write operations. When you write data
>> into a CouchDB cluster the copies are written to all relevant shard
>> replicas proactively.
>>
>> If your cluster’s steady-state write load is causing internal cluster
>> replication to fall behind permanently, that’s problematic. You should
>> tune the cluster replication parameters to give it more throughput. If
>> the replication is only falling behind during some batch data load and
>> then catches up later, it may be a different story, and you may want to
>> keep things configured as-is.
>>
>> Does that make sense?
>>
>> Cheers, Adam
>>
>>> On Jun 4, 2017, at 11:06 PM, Phil May <phil....@motorolasolutions.com> wrote:
>>>
>>> I'm writing to check whether modifying the replication batch_count and
>>> batch_size parameters for cluster replication is a good idea.
>>>
>>> Some background – our data platform dev team noticed that under heavy
>>> write load, cluster replication was falling behind. The following
>>> warning messages started appearing in the logs, and the pending_changes
>>> value consistently increased while under load.
>>>
>>> [warning] 2017-05-18T20:15:22.320498Z couch-1@couch-1.couchdb <0.316.0>
>>> -------- mem3_sync shards/a0000000-bfffffff/test.1495137986
>>> couch-3@couch-3.couchdb
>>> {pending_changes,474}
>>>
>>> What we saw is described in COUCHDB-3421
>>> <https://issues.apache.org/jira/browse/COUCHDB-3421>. In addition,
>>> CouchDB appears to be CPU bound while this is occurring, not I/O bound
>>> as would seem reasonable to expect for replication.
>>>
>>> When we looked into this, we discovered in the source two values
>>> affecting replication, batch_size and batch_count. For cluster
>>> replication, these values are fixed at 100 and 1 respectively, so we
>>> made them configurable. We tried various values, and it seems that
>>> increasing batch_size (and, to a lesser extent, batch_count) improves
>>> our write performance. As a point of reference, with batch_count=50 and
>>> batch_size=5000 we can handle about double the write throughput with no
>>> warnings. We are experimenting with other values.
>>>
>>> We wanted to know if adjusting these parameters is a sound approach.
>>>
>>> Thanks!
>>>
>>> - Phil
>>
>>
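
[Editor's note: for readers following along, the tunables discussed in the
thread above could be expressed in a local.ini fragment along these lines.
This is only a sketch: the section and key names for batch_size and
batch_count are hypothetical, since per the thread those two values are
hard-coded in stock CouchDB and were exposed by the poster's own patch;
only sync_concurrency is described as already configurable. The numeric
values are the ones reported in the thread, not recommendations.]

```ini
; Hypothetical internal-replication tuning sketch, assuming the patched
; build described in the thread above.
[mem3]
; Number of concurrent internal replication jobs (exposed in the .ini file
; per the thread).
sync_concurrency = 10
; Changes per batch; hard-coded to 100 in stock builds per the thread.
batch_size = 5000
; Batches per replication iteration; hard-coded to 1 in stock builds.
batch_count = 50
```

As the thread notes, larger batches trade memory and per-job latency for
throughput, so values like these are worth validating under your own
steady-state write load rather than adopted verbatim.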