davisp opened a new pull request #1862: Fix fabric_open_doc_revs
URL: https://github.com/apache/couchdb/pull/1862
 
 
   There was a subtle bug when opening specific revisions in
   fabric_doc_open_revs due to a race condition between updates being
   applied across a cluster.
   
   The underlying cause was revision stemming after a document had been
   updated more than revs_limit times, combined with concurrent reads
   from a node that had not yet applied the update. To illustrate, let's
   consider a document A which has a revision history from `{N, RevN}` to
   `{N+1000, RevN+1000}` (assuming revs_limit is the default 1000). From
   the perspective of a single node, when an update comes in we add the
   new revision and stem the oldest revision. The revisions on that
   node would then be `{N+1, RevN+1}` to `{N+1001, RevN+1001}`.
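
The stemming step described above can be sketched in a few lines. This is a simplified Python model, not CouchDB's Erlang code: `apply_update` is a hypothetical helper, a revision path is modelled as a flat list of `(pos, rev_id)` tuples, and a small `revs_limit` of 5 is assumed so the paths stay readable.

```python
def apply_update(path, revs_limit):
    """Append the next revision, then stem the path to revs_limit entries.

    `path` is a list of (pos, rev_id) tuples ordered oldest to newest.
    This is an illustrative model only, not CouchDB's implementation.
    """
    pos, _ = path[-1]
    extended = path + [(pos + 1, f"Rev{pos + 1}")]
    # Stemming: keep only the newest revs_limit revisions.
    return extended[-revs_limit:]


# A path that already holds revs_limit revisions: positions 1..5.
before = [(pos, f"Rev{pos}") for pos in range(1, 6)]
after = apply_update(before, revs_limit=5)

print(before[0], before[-1])
print(after[0], after[-1])
```

After the update the path still holds five entries, but the oldest one has been dropped and every remaining entry is one position further along.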
   
   The bug occurs when we attempt to open revisions on a different node
   that has yet to apply the new update. In this case
   fabric_doc_open_revs could be called with `{N+1000, RevN+1000}`. This
   results in a response from fabric_doc_open_revs that includes two
   different `{ok, Doc}` results instead of the expected single result. The
   reason is that one document has revisions `{N+1, RevN+1}` to
   `{N+1000, RevN+1000}` from the node that has applied the update, while
   the node without the update responds with revisions `{N, RevN}` to
   `{N+1000, RevN+1000}`.
   
   To rephrase that, a node that has applied an update can end up returning
   a revision path that contains `revs_limit - 1` revisions while a node
   without the update returns all `revs_limit` revisions. This slight
   change in the path prevented the responses from being properly combined
   into a single response.
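
To make the mismatch concrete, here is a small Python model of two nodes answering the same open-revs request. It is a sketch under stated assumptions, not the code touched by this PR: `open_rev` is a hypothetical helper, paths are flat lists of `(pos, rev_id)` tuples, and `revs_limit` is taken to be 5. Keying the responses on the requested revision is shown only to demonstrate that both responses describe the same revision; it is not claimed to be how the fix is implemented.

```python
def open_rev(path, pos):
    """Return the stored path up to and including revision `pos`.

    Hypothetical helper: models what a single node would report for one
    requested revision. The result is a tuple so it can be put in a set.
    """
    return tuple(entry for entry in path if entry[0] <= pos)


# Node that has not applied the update: positions 1..5.
stale_node = [(pos, f"Rev{pos}") for pos in range(1, 6)]
# Node that has applied the update and stemmed: positions 2..6.
updated_node = [(pos, f"Rev{pos}") for pos in range(2, 7)]

# Both nodes are asked for the revision at position 5.
responses = [open_rev(stale_node, 5), open_rev(updated_node, 5)]

# Comparing whole paths treats the two responses as different results.
by_full_path = set(responses)
# Comparing only the requested (newest) revision treats them as one.
by_requested_rev = {response[-1] for response in responses}

print(len(responses[0]), len(responses[1]))
print(len(by_full_path), len(by_requested_rev))
```

The stale node reports all five revisions while the updated node reports four, so a comparison over the full path sees two distinct answers for what is the same revision.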
   
   This bug has existed for many years. However, read repair effectively
   prevents it from being a significant issue by immediately fixing the
   revision history discrepancy. It was discovered because of a recent bug
   in read repair during a mixed cluster upgrade to a release including
   clustered purge. In this situation we end up crashing the design
   document cache, which then leads to all of the design document requests
   being direct reads, which can cause cluster nodes to OOM and
   die. The conditions require a significant number of design document
   edits coupled with already significant load on those modified design
   documents. The most direct example observed was a cluster that had a
   significant number of filtered replications in and out of the cluster.
   
   ## Testing recommendations
   
   `make check`
   
   ## Related Issues or Pull Requests
   
   This was discovered due to issues caused by #1860 
   
   ## Checklist
   
   - [x] Code is written and works correctly;
   - [x] Changes are covered by tests;
   - [ ] Documentation reflects the changes;
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
