GitHub user nickva opened a pull request: https://github.com/apache/couchdb-couch-replicator/pull/54
Allow configuring maximum document ID length during replication

Currently, due to a bug in the HTTP parser and a lack of document ID length enforcement, large document IDs break replication jobs. Large IDs pass through the _changes feed and revs diffs, but then fail during the open_revs GET request. The open_revs request keeps retrying until it eventually gives up; the replication task then crashes and restarts, repeating the same pattern.

The current effective limit is around 8k. (The buffer size defaults to 8192, and if the first line of the request is larger than that, the request fails.) (See http://erlang.org/pipermail/erlang-questions/2011-June/059567.html for more information about the possible failure mechanism.)

Bypassing the parser bug by increasing the recbuf size will allow replication to finish; however, that simply spreads the abnormal document through the rest of the system, which might not always be desirable.

Also, once long document IDs have been inserted in the source DB, simply deleting them doesn't work, as they'd still appear in the changes feed. They'd have to be purged or somehow skipped during the replication step. This commit helps do the latter.

Operators can configure the maximum length via this setting:

```
replicator.max_document_id_length=0
```

The default value is 0, which means no maximum is enforced; this is the backwards-compatible behavior.

During replication, if a document's ID exceeds the maximum, that document is skipped and an error is written to the log:

```
Replicator: document id `aaaaaaaaaaaaaaaaaaaaa...` from source db `http://.../cdyno-0000001/` is too long, ignoring.
```

and the `"doc_write_failures"` statistic is bumped.
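The skip behavior described above can be sketched roughly as follows. This is an illustration in Python, not the Erlang code from this pull request: the names `ReplicationStats`, `is_doc_id_too_long`, and `filter_changes` are hypothetical, and it assumes that the ID length is measured in UTF-8 bytes and that "too long" means strictly greater than the configured maximum.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicationStats:
    # Counter standing in for the "doc_write_failures" statistic.
    doc_write_failures: int = 0
    # Collected log lines (a real replicator would write to its logger).
    log: list = field(default_factory=list)


def is_doc_id_too_long(doc_id: str, max_len: int) -> bool:
    """Return True when doc_id is over the limit; max_len == 0 disables the check."""
    if max_len == 0:
        return False
    # Assumption: length is counted in UTF-8 bytes, not characters,
    # and the comparison is strict (an ID of exactly max_len passes).
    return len(doc_id.encode("utf-8")) > max_len


def filter_changes(doc_ids, source, max_len, stats):
    """Yield the IDs to replicate, skipping and recording over-long ones."""
    for doc_id in doc_ids:
        if is_doc_id_too_long(doc_id, max_len):
            stats.doc_write_failures += 1
            # Log only a truncated prefix of the ID, as in the message above.
            stats.log.append(
                f"Replicator: document id `{doc_id[:21]}...` from source db "
                f"`{source}` is too long, ignoring."
            )
            continue
        yield doc_id
```

With a hypothetical limit of 4096, passing the IDs `"short"` and `"a" * 9000` through `filter_changes` would yield only `"short"` and increment `doc_write_failures` once; with the default of 0, both IDs pass through unchanged.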
COUCHDB-3291

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloudant/couchdb-couch-replicator couchdb-3291-limit-doc-id-size-in-replicator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/couchdb-couch-replicator/pull/54.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #54

----
commit 3ff2d83893481afd68025a52a6d859a2efaf0bcf
Author: Nick Vatamaniuc <vatam...@apache.org>
Date:   2017-02-03T23:00:37Z

    Allow configuring maximum document ID length during replication

    COUCHDB-3291

----