[GitHub] Avaq commented on issue #1081: Replicator infinite failure loop

GitBox Thu, 25 Jan 2018 07:28:15 -0800

Avaq commented on issue #1081: Replicator infinite failure loop
URL: https://github.com/apache/couchdb/issues/1081#issuecomment-360500296
 
 
   Hi @nickva, thank you for your response!
   
   > Noticed in the test behavior script you specified a heartbeat. In 2.x 
replicator doesn't use heartbeats, instead it uses timeouts
   
   I didn't know heartbeats were removed. It doesn't really matter for my case 
though. The only reason I'm specifying a heartbeat in my tests is to make the 
result more visual (so you don't have to wait ten seconds using 1.6 to see if 
something is happening). 
   
   I have adjusted my test.
   
   ```sh
   # We create our test database
   curl -X PUT localhost:5984/replication-source
   
   # We insert a design doc with a filter function that is guaranteed to take
   # long. The reason is so we can simulate a database with a lot of documents
   # which are not going to pass in the filtering process.
   curl -X PUT localhost:5984/replication-source/_design/test -d \
     '{"filters":{"test":"function(){var future = Date.now() + 2000; while(Date.now() < future){}; return false}"}}'
   
   # We insert a bunch of documents so that filtering them will take time. Note
   # that I increased the number from 20 to 100, because I have more CPU cores
   # this time around (I didn't consider that before).
   for i in {1..100}; do
     curl -X POST -H 'Content-Type: application/json' \
       localhost:5984/replication-source -d '{"foo":"bar"}'
   done
   ```
   ```sh
   # I send a request for changes to the database. This request closely
   # resembles the request a replication client might send.
   curl 'localhost:5984/replication-source/_changes?feed=normal&style=all_docs&since=0&filter=test%2Ftest&timeout=10000'
   ```
   
   I'm getting better results now. I do indeed see the `{"results":[` line 
printed after about ten seconds, followed by periodic newlines, until the last 
sequence number is finally returned. Unfortunately, this is not what's happening 
in the production environment, but these results are a huge step forward! Thank 
you.
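
To make the timing of the request above easier to compare between runs, it can be wrapped in two small helpers. This is only a sketch: `changes_url` and `time_changes` are names I made up for illustration, and the query parameters are simply the ones from my test. Note that curl's `time_starttransfer` counts from the start of the request to the first byte received, headers included, so it may be earlier than the moment the `{"results":[` line shows up.

```sh
# Build the _changes URL used in the test above.
# Arguments: host, database, filter (as ddoc/name), timeout in milliseconds.
changes_url() {
  # Percent-encode the slash in "ddoc/name", as in the original request.
  filter=$(printf '%s' "$3" | sed 's|/|%2F|g')
  printf '%s/%s/_changes?feed=normal&style=all_docs&since=0&filter=%s&timeout=%s\n' \
    "$1" "$2" "$filter" "$4"
}

# Run the request, discard the body, and print curl's own timing figures.
time_changes() {
  curl -s -N -o /dev/null \
    -w 'first byte: %{time_starttransfer}s, total: %{time_total}s\n' \
    "$(changes_url "$@")"
}

# Usage (needs a running CouchDB with the test database from above):
#   time_changes localhost:5984 replication-source test/test 10000
```

Keeping the URL construction separate from the curl call means the same helper can be pointed at a different host without retyping the query string.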
   
   > To double check, is the replication itself running on a 2.x cluster? What 
are the versions of the targets and source? Are they all 2.x as well?
   
   There is one "central server" to which, and from which, a large number of 
clients push and pull subsets of information. The server runs a 2.x cluster, 
and the clients are single-node CouchDB instances ranging between version 1.6 
on Windows XP and 2.x on Windows 10.
   
   > Are there any proxies or load balancers involved and do you think they 
could affect the connections?
   
   The central server sits behind an nginx reverse proxy, which is now my prime 
suspect. Thank you for pointing this out to me.
   
   > How many replication jobs are running?
   
   There are a few replication jobs running within the central server itself, 
but they do not cause problems. At any given time, some fifty clients running 
their own replication jobs will be polling the server for changes.
   
   > In case of filtered replications, with large source db and a restrictive 
filter, like you have, replications won't checkpoint unless they receive a 
document update via the filter. However if it takes too long and the job is 
swapped out by the scheduler, it might not have a chance to checkpoint, it will 
be stopped. Next time it starts, it will use 0 for the changes feed start, and 
it will wait again, not get a document, will be stopped, etc.
   
   This sounds a lot like what I thought was happening, but every node only 
runs two replication jobs: one for upstream replication, and one for 
downstream. Neither is continuous.
   
   ----
   
   I will be investigating whether nginx might be buffering the response before 
sending it along, causing the connection to drop.
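
A rough way to check for buffering, sketched under assumptions: run the same `_changes` request once directly against CouchDB and once through the proxy, and timestamp every line as it arrives. If the direct run shows lines trickling in while the proxied run prints everything at once at the end, that would point toward the proxy holding the response back. `stamp_lines` is a hypothetical helper name, and `couch.example.com` in the usage comment is a placeholder for the real proxied hostname.

```sh
# Prefix every line read from stdin with the current Unix time, so the
# arrival time of each line (including the periodic empty ones) is visible.
stamp_lines() {
  # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    printf '%s %s\n' "$(date +%s)" "$line"
  done
}

# Usage (hosts are placeholders; -N turns off curl's own output buffering):
#   QUERY='feed=normal&style=all_docs&since=0&filter=test%2Ftest&timeout=10000'
#   curl -s -N "localhost:5984/replication-source/_changes?$QUERY" | stamp_lines
#   curl -s -N "https://couch.example.com/replication-source/_changes?$QUERY" | stamp_lines
```

The timestamps have one-second resolution, which should be enough here given that the filter in my test takes two seconds per document.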


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 


With regards,
Apache Git Services
