Avaq commented on issue #1081: Replicator infinite failure loop
URL: https://github.com/apache/couchdb/issues/1081#issuecomment-360500296

Hi @nickva, thank you for your response!

> Noticed in the test behavior script you specified a heartbeat. In 2.x replicator doesn't use heartbeats, instead it uses timeouts

I didn't know heartbeats were removed. It doesn't really matter for my case though. The only reason I'm specifying a heartbeat in my tests is to make the result more visual (so you don't have to wait ten seconds using 1.6 to see if something is happening). I have adjusted my test.

```sh
# We create our test database
curl -X PUT localhost:5984/replication-source

# We insert a design doc with a filter function that is guaranteed to take long
# The reason is so we can simulate a database with a lot of documents which are
# not going to pass in the filtering process.
curl -X PUT localhost:5984/replication-source/_design/test -d '{"filters":{"test":"function(){var future = Date.now() + 2000; while(Date.now() < future){}; return false}"}}'

# We insert a bunch of documents so that filtering them will take time. Note
# that I increased the number from 20 to 100, because I have more CPU cores
# this time around (I didn't consider that before).
for i in {1..100}; do
  curl -X POST -H 'Content-Type: application/json' localhost:5984/replication-source -d '{"foo":"bar"}'
done
```

```sh
# I send a request for changes to the database. This request resembles the request
# a replication client might send very closely.
curl 'localhost:5984/replication-source/_changes?feed=normal&style=all_docs&since=0&filter=test%2Ftest&timeout=10000'
```

I'm getting better results now. I do indeed see the `{"results":[`-line printed after about ten seconds, followed by a periodic newline, until finally returning the last sequence number. Unfortunately, this is not what's happening in the production environment, but these results are a huge step forward! Thank you.

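To make the timing of the feed easier to read, each line of output can be prefixed with the seconds elapsed since the request started. This is only a minimal sketch and not part of the original test: `stamp_lines` is a hypothetical helper name, and it assumes a `date` that supports `+%s`.

```sh
# Hypothetical helper (not from the original test): prefix every line read
# from stdin with the whole seconds elapsed since the helper started.
# The body runs in a subshell, so its variables do not leak into the caller.
stamp_lines() (
  start=$(date +%s)
  # The "|| [ -n ... ]" part keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    now=$(date +%s)
    printf '[%3ds] %s\n' "$((now - start))" "$line"
  done
)
```

It could be used as `curl -sN 'localhost:5984/replication-source/_changes?feed=normal&style=all_docs&since=0&filter=test%2Ftest&timeout=10000' | stamp_lines`, where `-N` turns off curl's own output buffering so that lines are stamped as they arrive.
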
> To double check, is the replication itself running on a 2.x cluster? What are the versions of the targets and source? Are they all 2.x as well?

There is one "central server" to which, and from which, a large number of clients push and pull subsets of information. The server runs a 2.x cluster, and the clients are single-node CouchDB instances ranging from version 1.6 on Windows XP to 2.x on Windows 10.

> Are there any proxies or load balancers involved and do you think they could affect the connections?

The central server sits behind an nginx reverse proxy, which is now my prime suspect. Thank you for pointing this out to me.

> How many replication jobs are running?

There are a few replication jobs running within the central server itself, but they do not cause problems. At any given time, some fifty clients running their own replication jobs will be polling the server for changes.

> In case of filtered replications, with large source db and a restrictive filter, like you have, replications won't checkpoint unless they receive a document update via the filter. However if it takes too long and the job is swapped out by the scheduler, it might not have chance to checkpoint, it will be stopped. Next time starts will use 0 for the changes feed start 0, and it will wait again, not get a document, will be stopped, etc.

This sounds a lot like what I thought was happening, but every node runs only two replication jobs: one for upstream replication and one for downstream. Neither is continuous.

----

I will be investigating whether nginx might be buffering the response before sending it along, causing the connection to drop.
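
One rough way to probe the buffering suspicion is to measure how long the first line of the changes feed takes to arrive when CouchDB is asked directly, and again when it is asked through nginx. The sketch below assumes curl is available and a `date` that supports `+%s`; `first_line_delay` is a hypothetical helper name, not something from the issue.

```sh
# Hypothetical helper (not from the original test): print the whole seconds
# that pass before the first line of the response body for URL $1 arrives.
# Returns non-zero if no line arrives at all.
first_line_delay() (
  start=$(date +%s)
  # -s hides the progress meter, -N turns off curl's own output buffering.
  curl -sN "$1" | {
    if IFS= read -r _first || [ -n "$_first" ]; then
      echo "$(( $(date +%s) - start ))"
    else
      echo "no response line received" >&2
      exit 1
    fi
  }
)
```

Run once as `first_line_delay 'localhost:5984/replication-source/_changes?feed=normal&style=all_docs&since=0&filter=test%2Ftest&timeout=10000'` on the server itself, and once with the same path behind the proxy's address (for example the placeholder `https://couch.example.com/`). A noticeably larger delay through the proxy would be consistent with buffering, though it would not prove it. The helper returns only after curl notices the closed pipe on its next write.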
