[ https://issues.apache.org/jira/browse/COUCHDB-704?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921723#action_12921723 ]
Randall Leeds commented on COUCHDB-704:
---------------------------------------
Filipe,
It's true. This is an edge case, but I have had it happen in production during
pull replication with a database whose writes had become *very* slow. The
checkpoint code updated the source first and the local document was written,
but the response arrived so late that it was treated as a timeout. When the
replicator retried the save, it got a conflict. Replication crashed and the
checkpoint on the target was never written.
I can imagine other, rare instances where this could occur. It's an edge case,
but a potentially nasty one.
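
The failure sequence above can be reproduced with a small simulation. This is
only a sketch of the scenario, not the replicator's code: FakeDb, Timeout,
Conflict and save_checkpoint are hypothetical names, and revisions are plain
integers. It also shows one possible way to recover, which is to treat a
conflict on retry as success when the stored document already holds exactly
the body we tried to write.

```python
class Conflict(Exception):
    """Raised when the supplied revision does not match the stored one."""


class Timeout(Exception):
    """Raised when the write was applied but the reply never arrived."""


class FakeDb:
    """Hypothetical in-memory stand-in for a database holding _local docs."""

    def __init__(self):
        self.docs = {}
        self.drop_next_response = False

    def get(self, doc_id):
        return self.docs.get(doc_id)

    def put(self, doc_id, body, rev):
        current = self.docs.get(doc_id)
        current_rev = current["rev"] if current else 0
        if rev != current_rev:
            raise Conflict(doc_id)
        self.docs[doc_id] = {"rev": current_rev + 1, "body": body}
        if self.drop_next_response:
            # The write landed, but the caller only sees a timeout.
            self.drop_next_response = False
            raise Timeout(doc_id)
        return current_rev + 1


def save_checkpoint(db, doc_id, body, rev, retries=3):
    """Save a checkpoint, tolerating a conflict caused by our own lost reply."""
    for _ in range(retries):
        try:
            return db.put(doc_id, body, rev)
        except Timeout:
            continue  # retry with the same (now stale) revision
        except Conflict:
            stored = db.get(doc_id)
            if stored is not None and stored["body"] == body:
                return stored["rev"]  # our earlier attempt succeeded
            raise  # a genuinely different writer got there first
    raise Timeout(doc_id)
```

Without the body comparison in the Conflict branch, the retry after a lost
reply raises Conflict even though the checkpoint was saved, which matches the
crash described above.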
> Replication can lose checkpoints
> --------------------------------
>
> Key: COUCHDB-704
> URL: https://issues.apache.org/jira/browse/COUCHDB-704
> Project: CouchDB
> Issue Type: Bug
> Components: Replication
> Affects Versions: 0.11.2, 1.0.1
> Reporter: Randall Leeds
> Priority: Minor
> Attachments: keep_session_id.patch, save-all-rep-checkpoints.patch,
> whitespace.patch
>
> Original Estimate: 0h
> Remaining Estimate: 0h
>
> When saving replication checkpoints in the _local/<repid> document, the new
> entry is always pushed onto the _original_ "history" list property that
> existed at the start of the replication. When anything causes the checkpoint
> to be written to only one of the databases, the heads of the two history
> lists get out of sync. Subsequent attempts to start this replication must
> start from the latest common replication log entry in the _original_ history,
> as though this replication never occurred.
> A better idea is to push every checkpoint onto the history instead of
> replacing the head on each save.
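
The difference between the two strategies can be illustrated with a short
simulation. This is a sketch under stated assumptions, not the replicator's
implementation: histories are modeled as newest-first lists of dicts, the
field names session_id and recorded_seq are used for illustration, and
latest_common_entry and simulate are hypothetical helpers.

```python
def latest_common_entry(source_history, target_history):
    """Return the newest source entry whose session_id also appears on the target."""
    target_ids = {entry["session_id"] for entry in target_history}
    for entry in source_history:  # histories are kept newest first
        if entry["session_id"] in target_ids:
            return entry
    return None


def simulate(original, checkpoints, target_fails_at, keep_all):
    """Apply checkpoints to both sides; the target misses one of them.

    keep_all=False pushes each entry onto the ORIGINAL history (the reported
    behavior); keep_all=True pushes onto the current history (the proposal).
    """
    source, target = list(original), list(original)
    for i, entry in enumerate(checkpoints):
        source_base = source if keep_all else original
        target_base = target if keep_all else original
        source = [entry] + source_base
        if i != target_fails_at:
            target = [entry] + target_base
    return source, target


ORIGINAL = [{"session_id": "s0", "recorded_seq": 5}]
CHECKPOINTS = [
    {"session_id": "c1", "recorded_seq": 10},
    {"session_id": "c2", "recorded_seq": 20},
]

for keep_all in (False, True):
    src, tgt = simulate(ORIGINAL, CHECKPOINTS, target_fails_at=1, keep_all=keep_all)
    print(keep_all, latest_common_entry(src, tgt)["recorded_seq"])
```

In this toy run the second checkpoint reaches only the source. When each save
replaces the head, the only shared entry is the original one (seq 5), so the
next run repeats all of the work. When every checkpoint is kept, the two sides
still share the first checkpoint (seq 10) and only the tail is redone.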