Eli Stevens created COUCHDB-2240:
------------------------------------

             Summary: Many continuous replications cause DOS
                 Key: COUCHDB-2240
                 URL: https://issues.apache.org/jira/browse/COUCHDB-2240
             Project: CouchDB
          Issue Type: Bug
      Security Level: public (Regular issues)
            Reporter: Eli Stevens

Currently, I can configure an arbitrary number of replications between localhost DBs (in my case, they are documents in the _replicator DB with continuous set to true). However, there is a limit beyond which requests to the DB start to fail. Trying to start another replication fails with the error:

    ServerError: (500, ('checkpoint_commit_failure', "Target database out of sync. Try to increase max_dbs_open at the target's server."))

Due to COUCHDB-2239, it's not clear what the actual issue is. I also believe that GET requests for documents were failing while the DB was in this state, but the machine that held the logs has already had its drives wiped. If need be, I can recreate the situation and provide those logs as well.

I think that instead of a single fixed pool of resources that causes errors when exhausted, the system should have a per-task-type pool of resources that results in performance degradation when exhausted. For example: N replication workers with P DB connections, and if that's not enough, they start to round-robin across the replications; that sort of thing. When a user has too much to replicate, replication gets slow instead of failing.

As it stands now, I have a potentially large number of continuous replications that together produce a fixed rate of data to replicate (because a fixed application worker pool writes the data in the first place). We use one DB plus one replication per batch of data to process, and if we receive a burst of batches, CouchDB starts failing.

The current setup means that I'm always going to be playing chicken between burst size and whatever configured limit we're hitting. That sucks, and isn't acceptable for a production system, so we're going to have to re-architect how we do replication and basically implement a poor-man's continuous replication by doing one-off replications at various points in our data processing runs.
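
To make the "gets slow instead of failing" proposal concrete, here is a minimal sketch in Python. It is not CouchDB code and does not reflect how the replicator is implemented; the class name RoundRobinScheduler, the slots parameter, and the job names are all hypothetical. It only illustrates the idea that a fixed number of worker slots can be shared round-robin among any number of replication jobs, so adding a job never raises an error and only lengthens the wait between turns.

```python
from collections import deque
from typing import Deque, List


class RoundRobinScheduler:
    """Hypothetical scheduler: a fixed number of slots shared by all jobs."""

    def __init__(self, slots: int) -> None:
        if slots < 1:
            raise ValueError("slots must be at least 1")
        self.slots = slots                  # N workers available per round
        self.queue: Deque[str] = deque()    # jobs waiting for their next turn

    def add(self, job_id: str) -> None:
        # Adding a job never fails; it only makes the rotation longer.
        self.queue.append(job_id)

    def tick(self) -> List[str]:
        """Return the jobs that get a slot this round, then requeue them."""
        count = min(self.slots, len(self.queue))
        running = [self.queue.popleft() for _ in range(count)]
        self.queue.extend(running)          # back of the line for the next turn
        return running


if __name__ == "__main__":
    scheduler = RoundRobinScheduler(slots=2)
    for name in ["batch-a", "batch-b", "batch-c"]:
        scheduler.add(name)
    for round_number in range(3):
        print(round_number, scheduler.tick())
```

With two slots and three jobs, each job runs in two out of every three rounds. A burst of additional batches would lower that fraction further rather than exhaust a shared limit.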