[ https://issues.apache.org/jira/browse/CASSANDRA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sylvain Lebresne updated CASSANDRA-1546:
----------------------------------------
Attachment: 0001-Remove-IClock-from-internals.patch
            0002-Counters.patch
            0003-Generated-thrift-files-changes.patch

At its core, this patch uses the same main idea as CASSANDRA-1072, but instead of using a Clock, it splits the subparts of the CASSANDRA-1072 context into individual columns. This has, at the very least, two advantages over the clock structure:
# It is much simpler in terms of code, as it reuses a lot of what already exists. In particular, it avoids the fairly complex byte array manipulation of CASSANDRA-1072.
# Each part of the counter (that is, each (host id, count) pair) has a timestamp. This means we get full counters (increment *and* decrement) for free (that alone is worth it, if you ask me).

More anecdotally, this allows a different approach than the one of CASSANDRA-1072 for the 'cleanContext' logic, which I believe is faster and cleaner (and hopefully correct). This is more of an implementation detail, however, so we can discuss it if someone finds a problem.

The patch also includes some modifications of the patch I had once attached to CASSANDRA-580 and CASSANDRA-1397. It adds a new verb handler: when a counter update is submitted, the update is first forwarded to an actual replica of the counter. This replica applies the update and then replicates it. So this patch does support the usual consistency levels (for reads and writes). This involves a local read on the first replica (called the leader in the patch), but at least the patch ensures that at CL.ONE this read happens after the write is acked to the client.

In this patch, a counter is thus a (small) bunch of columns (one per replica), and hence a counter is internally a row. It could have been a super column, but it's not, for reasons that follow. Let me add, though, that when (in 0.8 maybe?) we have super columns that we don't need to deserialize fully, it will make a lot of sense to change counters to super columns instead (and it won't be hard).

CASSANDRA-1072 has a fairly important problem: it is a bit too unreliable. If you do a write and for some reason (TimeoutException or loss of connection to the coordinator) it doesn't succeed, you have NO WAY to know whether the increment has been counted or not. Nothing really new: standard writes have the same problem, but with standard writes, you just replay the update. With CASSANDRA-1072, you can't, because that could mean incrementing the value twice... or not. Given that TimeoutExceptions can really happen, this is far from perfect, even though I perfectly understand that some can live with it.

In this patch, I propose an improvement that, while maybe not perfect, improves this situation a lot. The idea is to tag each counter update with a unique id (typically a uuid, even though the implementation doesn't assume it). If an update times out, you simply replay it with the same uuid and the system will make sure it doesn't get applied twice. To do that, when a new update for a counter is applied, a marker column with the provided uuid is inserted in the same row (remember, counters are rows). But before even applying the update, we try to read the marker. If we find one, we just skip the update; otherwise we apply it (and insert the marker). The details are more complicated, in particular because we don't know whether the same 'leader' will be chosen for the replay of the update, but I'll let people interested look at the implementation for details.

Those marker columns are expiring columns, so they stay only for some time (gcGrace in the patch, but that can change). Choosing the TTL here means choosing the window of time during which you can replay a failed update (btw, CASSANDRA-1537 will make a lot of sense for those marker columns). Note that those markers are only used on writes.
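To make the two ideas above concrete, here is a toy in-memory model (a sketch for illustration only; the class and method names are invented and the real patch is Java inside Cassandra's column machinery): a counter is a row holding one (count, timestamp) column per replica host id, plus optional expiring marker columns keyed by update uuid that make replays idempotent.

```python
import time

class ToyCounter:
    """Toy model of a counter 'row': one (count, timestamp) column per
    replica host id, plus expiring marker columns keyed by update uuid."""

    def __init__(self, marker_ttl=3600):
        self.counts = {}    # host_id -> (count, timestamp)
        self.markers = {}   # update_uuid -> expiry time
        self.marker_ttl = marker_ttl

    def apply(self, host_id, delta, update_uuid=None, now=None):
        now = time.time() if now is None else now
        if update_uuid is not None:
            # Read the marker before applying: if it is present and not yet
            # expired, this update was already counted, so skip the replay.
            expiry = self.markers.get(update_uuid)
            if expiry is not None and expiry > now:
                return False
            self.markers[update_uuid] = now + self.marker_ttl
        count, _ = self.counts.get(host_id, (0, 0))
        # Each per-replica column carries its own timestamp, so increments
        # and decrements both reconcile like ordinary columns.
        self.counts[host_id] = (count + delta, now)
        return True

    def value(self):
        # Read path: just sum the per-replica count columns.
        return sum(count for count, _ in self.counts.values())

    def merge(self, other):
        # Reconciling two versions of the row: per host id, keep the column
        # with the newer timestamp (plain last-write-wins reconciliation).
        for host, (count, ts) in other.counts.items():
            if host not in self.counts or self.counts[host][1] < ts:
                self.counts[host] = (count, ts)

c = ToyCounter()
c.apply("replica-1", +3, update_uuid="u1")
c.apply("replica-2", -1, update_uuid="u2")
c.apply("replica-1", +3, update_uuid="u1")   # replayed: skipped
print(c.value())  # 2
```

Note how the replay of "u1" is a no-op because its marker is found, which is the whole point of tagging updates; and how a decrement is just a negative delta on an ordinary timestamped column.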
The read path only reads the columns that hold the counts. So the read performance is as good as it gets.

Let's sum up: this means 2 reads on the write path, one for the marker and one before replication. BUT:
* The marker is totally optional. If you don't provide one, you will not be able to replay a failed update, but if you can live with that, you can.
* All those columns (for a given counter) being in the same row, you can assign a fair amount of row cache and make those reads fairly painless.

For the actual users of CASSANDRA-1072, this means that if you don't use uuids and you write at CL.ONE, you should have virtually the same write performance with this patch.

All this means that all operations on counters go through new thrift API calls.

The patch is split in 3 parts. The first one removes the clock structure (it is CASSANDRA-1502, in fact). The second part is the actual patch. The third is the thrift generated file changes (included for convenience's sake only; nothing worth reading there).

Now the less good news. This patch is really just out of the oven (written over the course of the week-end). It is clearly under-tested. Actually, the extent of my testing is the system tests included in the patch. That is: no tests on multiple hosts or stress tests yet. I will work on those, but I felt the patch was in good enough shape that it was worth sharing, and that it was a good time to get feedback.

The patch also has a few known problems:
# Even though I include a system test for it, remove is broken. In some cases, where an update issued after a delete gets in before the delete, the delete can be wrongly discarded. For what it's worth, I'm pretty sure CASSANDRA-1072 also suffers from the exact same problem. I have to think more about it, but any idea is welcome.
# There are no hints for the CounterMutation (cf. the patch). This makes the marker idea more fragile than it should be. Those should be added at some point.
It's also worth noting that the marker idea is not bulletproof and probably never will be. But it's my strong belief that it's much better than nothing.
# It could be that streaming is broken with this (for the same reason that CASSANDRA-1072 has code for it). It could be that it's not. I'm not hugely familiar with this part of the code. I have to check, and fix it if need be.
# Because it's not the coordinator but the first replica (the leader) that does the replication, we may have to send UnavailableException back to the coordinator (and then back to the client). This is almost done, but not completely.

I will try to address all of the above asap. But I will have very little time to give to this during the week, so any help (in tests and reviews in particular) will be appreciated. All opinions, remarks, and criticisms are also highly appreciated.

> (Yet another) approach to counting
> ----------------------------------
>
>                 Key: CASSANDRA-1546
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1546
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Sylvain Lebresne
>            Assignee: Sylvain Lebresne
>             Fix For: 0.7.0
>
>         Attachments: 0001-Remove-IClock-from-internals.patch, 0002-Counters.patch, 0003-Generated-thrift-files-changes.patch
>
> This could be described as a mix between CASSANDRA-1072 without clocks and CASSANDRA-1421.
> More details in the comment below.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.