Eli Stevens created COUCHDB-2240:
------------------------------------

             Summary: Many continuous replications cause DOS
                 Key: COUCHDB-2240
                 URL: https://issues.apache.org/jira/browse/COUCHDB-2240
             Project: CouchDB
          Issue Type: Bug
      Security Level: public (Regular issues)
            Reporter: Eli Stevens


Currently, I can configure an arbitrary number of replications between 
localhost DBs (in my case, they are in the _replicator DB with continuous set 
to true). However, there is a limit beyond which requests to the DB start to 
fail.  Trying to do another replication fails with the error:

ServerError: (500, ('checkpoint_commit_failure', "Target database out of sync. 
Try to increase max_dbs_open at the target's server."))

Due to COUCHDB-2239, it's not clear what the actual issue is. 
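
For reference, here is a minimal sketch of the kind of _replicator document 
described above. The database names and the _id naming scheme are hypothetical; 
only the source, target, and continuous fields come from the setup in this 
report.

```python
import json


def continuous_replication_doc(source, target):
    """Build a _replicator document for one continuous replication."""
    return {
        "_id": f"{source}-to-{target}",  # hypothetical naming scheme
        "source": source,
        "target": target,
        "continuous": True,
    }


# Hypothetical batch databases: one replication document per batch.
docs = [
    continuous_replication_doc(f"batch_{i}", f"batch_{i}_copy")
    for i in range(3)
]
print(json.dumps(docs[0], indent=2))
```

Each such document saved to the _replicator DB starts one more continuous 
replication, so the count grows with the number of batches.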

I also believe that while the DB was in this state GET requests to documents 
were also failing, but the machine that has the logs of this has already had 
its drives wiped. If need be, I can recreate the situation and provide those 
logs as well.

I think that instead of a single fixed pool of resources that causes errors 
when exhausted, the system should have a per-task-type pool of resources that 
results in performance degradation when exhausted: N replication workers with 
P DB connections, and if that's not enough, they start to round-robin; that 
sort of thing. When a user has too much to replicate, it gets slow instead of 
failing.
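
To illustrate the idea, here is a toy simulation of that round-robin behaviour. 
It is only a sketch under assumed names (run_round_robin, workers, slice_size 
are all hypothetical) and does not reflect how CouchDB schedules replications 
internally.

```python
from collections import deque


def run_round_robin(jobs, workers, slice_size):
    """Simulate a fixed pool of replication workers sharing a job queue.

    `jobs` maps a job name to its number of pending changes. In each round,
    at most `workers` jobs replicate up to `slice_size` changes each, and any
    unfinished job goes to the back of the queue. No job is ever rejected;
    with more jobs than workers, the run just takes more rounds.
    """
    queue = deque(jobs.items())
    rounds = 0
    while queue:
        rounds += 1
        # Hand the next jobs in line to the available workers.
        active = [queue.popleft() for _ in range(min(workers, len(queue)))]
        for name, pending in active:
            remaining = pending - slice_size
            if remaining > 0:
                queue.append((name, remaining))  # requeue unfinished work
    return rounds


jobs = {"a": 10, "b": 10, "c": 10, "d": 10}
print(run_round_robin(jobs, workers=4, slice_size=5))
print(run_round_robin(jobs, workers=2, slice_size=5))
```

Halving the worker count makes the same workload take more rounds, but every 
job still completes, which is the degradation-instead-of-failure behaviour I'm 
asking for.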

As it stands now, I have a potentially large number of continuous replications 
that produce a fixed rate of data to replicate (because there's a fixed 
application worker pool that writes the data in the first place). We use a 
DB+replication per batch of data to process, and if we receive a burst of 
batches, then couchdb starts failing. The current setup means that I'm always 
going to be playing chicken between burst size and whatever setting limit we're 
hitting.  That sucks, and isn't acceptable for a production system, so we're 
going to have to re-architect how we do replication, and basically implement a 
poor man's continuous replication by doing one-off replications at various 
points of our data processing runs.
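
A rough sketch of what one such one-off replication request might look like, 
using CouchDB's /_replicate endpoint. The server URL and database names are 
hypothetical, and the request is only built here, not sent.

```python
import json
import urllib.request


def one_off_replication_request(server, source, target):
    """Build (but do not send) a POST to /_replicate for a single run."""
    body = json.dumps({"source": source, "target": target}).encode("utf-8")
    return urllib.request.Request(
        f"{server}/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local server and batch database names.
req = one_off_replication_request(
    "http://localhost:5984", "batch_0", "batch_0_copy"
)
print(req.get_method(), req.full_url)
# Passing req to urllib.request.urlopen would trigger the replication.
```

Without "continuous" in the body, the replication runs once and stops, so it 
would have to be re-issued at each point in the processing run.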



--
This message was sent by Atlassian JIRA
(v6.2#6252)
