[ https://issues.apache.org/jira/browse/CASSANDRA-7720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090268#comment-14090268 ]

Mike Schrag commented on CASSANDRA-7720:
----------------------------------------

With quorum reads and writes, it would take a pretty catastrophic failure for 
A to be missing from every replica in the cluster by the time you get a B. I'm 
not actually sure how that could happen in our case, since our process is 
gated such that A has to be successfully written to the cluster before the 
process that initiates the B write runs. You'd have to have a scenario where 
the A write tells the client it succeeded (which means it landed in the commit 
log and memtable of at least a quorum of replicas, whose data should then be 
flushed to disk as part of the snapshot), then B is written, yet we have 
somehow lost A. The only case I can think of is a network partition hitting 
every replica of A at the point of snapshotting, at which point I would expect 
the snapshot to be considered a failure anyway. I grant that there are 
probably still edge cases, but in the current implementation it's not all that 
edge-y for a system under heavy writes, and something like this approach 
would, I think, make backups more reliable for people who are willing to make 
the snapshot performance trade-off.
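
For illustration only, that kind of gated, quorum-acknowledged write ordering 
might look something like this with the DataStax Python driver (keyspace and 
table names here are hypothetical, not our actual schema):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("demo_ks")

    write_a = SimpleStatement(
        "INSERT INTO table_a (id, val) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    write_b = SimpleStatement(
        "INSERT INTO table_b (id, val) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)

    # B is only initiated after A has been acknowledged by a quorum of
    # replicas; if the A execute() raises, we never reach the B write.
    session.execute(write_a, (1, "a"))
    session.execute(write_b, (1, "b"))

    cluster.shutdown()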

As far as a partial write goes, that can already happen with snapshotting (if 
you managed to crash all of the replicas for a given token in the middle of 
the write, for example), and I think that would be considered a failed 
snapshot, unsuitable for use as a backup target.

> Add a more consistent snapshot mechanism
> ----------------------------------------
>
>                 Key: CASSANDRA-7720
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7720
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Mike Schrag
>
> We’ve hit an interesting issue with snapshotting, which makes sense in 
> hindsight, but presents an interesting challenge for consistent restores:
> * initiate snapshot
> * snapshotting flushes table A and takes the snapshot
> * insert into table A
> * insert into table B
> * snapshotting flushes table B and takes the snapshot
> * snapshot finishes
> So what happens here is that we end up having a B, but NOT having an A, even 
> though B was chronologically inserted after A.
> It makes sense when I think about what snapshot is doing, but I wonder if 
> snapshots actually should get a little fancier to behave a little more like 
> what I think most people would expect. What I think should happen is 
> something along the lines of the following:
> For each node:
> * pass a client timestamp in the snapshot call corresponding to "now"
> * snapshot the tables using the existing procedure
> * walk backwards through the hard-linked sstables in that snapshot
>   * if the earliest update in the sstable is after the client's timestamp, 
>     delete the sstable from the snapshot
>   * if the earliest update in the sstable is before the client's timestamp 
>     but the last update is after it, walk backwards through that sstable:
>     * make a copy of the sstable in the snapshot folder containing only the 
>       updates up to the timestamp, then delete the original sstable from the 
>       snapshot (we need to copy because we're likely holding a shared 
>       hard-linked sstable)
> I think this would guarantee a chronologically consistent view across all 
> machines and columnfamilies within a given snapshot.
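
A minimal sketch of the per-node trimming pass proposed above (Python, not 
Cassandra code; the sstable metadata fields and the rewrite_up_to callback are 
assumptions standing in for an sstable-aware rewriter):

    import os
    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class SnapshotSSTable:
        path: str           # hard link inside the snapshot directory
        min_timestamp: int  # earliest cell write time in the sstable (micros)
        max_timestamp: int  # latest cell write time in the sstable (micros)

    def trim_snapshot(sstables: List[SnapshotSSTable],
                      cutoff: int,
                      rewrite_up_to: Callable[[SnapshotSSTable, int, str], None]
                      ) -> None:
        """Drop or truncate snapshot sstables so nothing written after
        `cutoff` (the client-supplied snapshot timestamp) survives."""
        for s in sstables:
            if s.min_timestamp > cutoff:
                # Every update is newer than the cutoff: remove the sstable
                # from the snapshot entirely.
                os.remove(s.path)
            elif s.max_timestamp > cutoff:
                # The sstable straddles the cutoff: write a private copy
                # holding only updates at or before the cutoff, then swap it
                # in for the shared hard link.
                trimmed = s.path + ".trimmed"
                rewrite_up_to(s, cutoff, trimmed)
                os.remove(s.path)
                os.rename(trimmed, s.path)
            # else: the whole sstable predates the cutoff; keep it untouched.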



--
This message was sent by Atlassian JIRA
(v6.2#6252)
