[ 
https://issues.apache.org/jira/browse/COUCHDB-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920469#action_12920469
 ] 

Randall Leeds commented on COUCHDB-704:
---------------------------------------

Can I get someone to take a quick look at this, please?
I see no reason not to commit this to 1.0.x and trunk and close the issue, and 
very good reason to commit it.

3 insertions and 1 deletion. Easy review. Get it while it's hot (well, it's 
already a month and a half cold)!

Summary:
The bug - each checkpoint updates the replication log by overwriting the last 
history entry in place. This is fine except when nasty network errors cause the 
log to be written on only one of the two dbs involved. If that happens, the 
last history entry will not have a matching session_id in the other log, and 
the next replication has to fall back to the last entry the two logs share. 
Imagine 3 months of replication checkpoints lost because a switch flapped. Ouch.

The change - replication keeps the same session_id across checkpoints. Even if 
only one log is written, the last entries will still have a matching session_id 
and we can be sure that the recorded_seq is committed to both. At most one 
checkpoint is lost.
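
To make the difference concrete, here is a minimal Python sketch of the two 
behaviours. It is a toy model, not the actual CouchDB code: the helper names 
(write_checkpoint, resume_seq, simulate), the write order (source first, then 
target), and taking the smaller recorded_seq when session_ids match are all 
assumptions made for illustration. Only the field names session_id, 
recorded_seq and history come from the replication log itself.

```python
import itertools

_ids = itertools.count(1)

def new_session_id():
    # Hypothetical id generator; real session ids are opaque strings.
    return "session-%d" % next(_ids)

def write_checkpoint(log, base_history, session_id, seq):
    # Replace the head in place: the new entry always sits on top of the
    # history that existed when this replication started.
    log["history"] = [{"session_id": session_id, "recorded_seq": seq}] + base_history

def resume_seq(source_log, target_log):
    # Walk the source history (newest first) and stop at the first session_id
    # that the target log also knows. Taking the smaller recorded_seq is a
    # conservative assumption of this sketch.
    target_by_id = {e["session_id"]: e["recorded_seq"] for e in target_log["history"]}
    for entry in source_log["history"]:
        sid = entry["session_id"]
        if sid in target_by_id:
            return min(entry["recorded_seq"], target_by_id[sid])
    return 0  # no common entry: start from scratch

def simulate(keep_session_id):
    # Both logs start with one shared entry from an earlier replication.
    old = [{"session_id": "old", "recorded_seq": 100}]
    source_log = {"history": list(old)}
    target_log = {"history": list(old)}
    base_src, base_tgt = list(old), list(old)
    session_id = new_session_id()
    # Two checkpoints; the second one never reaches the target log.
    for seq, target_write_ok in [(200, True), (300, False)]:
        if not keep_session_id:
            session_id = new_session_id()  # old behaviour: new id per checkpoint
        write_checkpoint(source_log, base_src, session_id, seq)
        if target_write_ok:
            write_checkpoint(target_log, base_tgt, session_id, seq)
    return resume_seq(source_log, target_log)

print(simulate(keep_session_id=False))  # 100: falls back to the old entry
print(simulate(keep_session_id=True))   # 200: only the last checkpoint is lost
```

With a fresh session_id per checkpoint the heads no longer match after the 
failed write, so the comparison falls all the way back to the entry from the 
earlier replication. With a stable session_id the heads still match and only 
the checkpoint that failed to land on both sides is lost.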

> Replication can lose checkpoints
> --------------------------------
>
>                 Key: COUCHDB-704
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-704
>             Project: CouchDB
>          Issue Type: Bug
>          Components: Replication
>    Affects Versions: 0.11.2, 1.0.1
>            Reporter: Randall Leeds
>            Priority: Minor
>         Attachments: keep_session_id.patch, save-all-rep-checkpoints.patch, 
> whitespace.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> When saving replication checkpoints in the _local/<repid> document the new 
> entry is always pushed onto the _original_ "history" list property that 
> existed at the start of the replication. When any of a number of failures 
> causes the checkpoint to be written to only one of the databases, the head of 
> the history list gets out of sync. Subsequent attempts to start this 
> replication must 
> start from the latest common replication log entry in the _original_ history, 
> as though this replication never occurred.
> A better idea is to push every checkpoint onto the history instead of 
> replacing the head on each save.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
