Re: Understanding atomicity in Cassandra
Hi, Patricio,

It's hard to comment on your original questions without knowing the details of your own domain-specific data model and data-processing expectations.

W.r.t. lumping things into one big row, there is a limitation on the data model in Cassandra. You get CFs and SCFs; that is, you have only two levels of nesting at most for an atomic value update. I.e. you cannot lump arbitrarily complex data into a single big row.

Even though the update of one particular row is atomic, you can still run into concurrent read-write operations that conflict with each other. For example, suppose one of your column values is a list. The old value is "a, b, c", and the operation you want is to add "d" to that list, so the desired new value is "a, b, c, d". If there is another concurrent operation that tries to add "e" to the list, you would still have a problem, given the present atomic semantics of row updates in Cassandra.

On the other hand, there are a number of application scenarios where update operations are safe to treat as idempotent, e.g. bulk-loading data from flat files into Cassandra.

If your main worry is the client process crashing, then regardless of what kind of ACID properties Cassandra can provide, you still want a way to verify whether Cassandra has stored the desired state, and/or to log the processed update operations in the context of bulk loading. Then you can decide whether a particular data update needs to be repeated or not.

A full-fledged ACID database ("all or nothing" semantics) can decrease the complexity of verifying that storage succeeded, but it cannot remove that concern completely. Consider the case where the client process crashes right at the moment of "dbConn.commit()": you still don't know for sure whether that update operation has gone through.

Hope this email helps. Thanks!
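The concurrent list-append conflict described above can be sketched in plain Python (no Cassandra client; the row, column name, and values are stand-ins for the "a, b, c" example). Two writers each read the list, append, and write the row back, and one append is silently lost, because row updates are atomic overwrites rather than merges:

```python
# Simulated column store: one row with a comma-separated list value.
# This is a plain-Python illustration of lost updates, not a Cassandra API.
store = {"mylist": "a, b, c"}

def append_via_read_modify_write(store, snapshot, item):
    """Append `item` based on a previously read snapshot.

    Each writer bases its new value on its own (possibly stale) read,
    which is exactly the unsafe pattern described above.
    """
    store["mylist"] = snapshot + ", " + item

# Both clients read the same old value before either one writes.
snap1 = store["mylist"]
snap2 = store["mylist"]

append_via_read_modify_write(store, snap1, "d")  # row becomes "a, b, c, d"
append_via_read_modify_write(store, snap2, "e")  # overwrites: "a, b, c, e"

print(store["mylist"])  # "d" is lost: last write wins
```

The same interleaving against a real cluster has the same outcome, since each row write simply replaces the column value wholesale.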
Regards,
Alex Yiu

On Tue, Jul 20, 2010 at 2:03 PM, Jonathan Ellis wrote:
> 2010/7/20 Patricio Echagüe :
> > Would it be bad design to store all the data that need to be
> > consistent under one big key?
>
> That really depends how unnatural it is from a query perspective. :)
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
Re: Understanding atomicity in Cassandra
2010/7/20 Patricio Echagüe :
> Would it be bad design to store all the data that need to be
> consistent under one big key?

That really depends how unnatural it is from a query perspective. :)

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
Re: Understanding atomicity in Cassandra
Hi, regarding the retrying strategy, I understand that it might make sense assuming that the client can actually perform a retry. We are trying to build a fault-tolerant solution based on Cassandra, and in some scenarios the client machine can go down during a transaction.

Would it be bad design to store all the data that need to be consistent under one big key? In this case the batch_mutate operations will not be big, since just a small part is updated/added at a time. But at least we know that the operation either succeeded or failed.

We basically have:
CF: usernames (similar to the Twitter model)
SCF: User_tree (it has all the information related to the user)

Thanks

On Mon, Jul 19, 2010 at 9:40 PM, Alex Yiu wrote:
>
> Hi, Stuart,
> If I may paraphrase what Jonathan said, typically your batch_mutate
> operation is idempotent.
> That is, you can replay / retry the same operation within a short timeframe
> without any undesirable side effect.
> The assumption behind the "short timeframe" here refers to: there is no
> other concurrent writer trying to write anything conflicting in an
> interleaving fashion.
> Imagine that if there was another writer trying to write:
>> "some-uuid-1": {
>>     "path": "/foo/bar",
>>     "size": 10
>> },
> ...
>> {
>> "/foo/bar": {
>>     "uuid": "some-uuid-1"
>> },
> Then, there is a chance of 4 write operations (two writes for "/a/b/c" into
> 2 CFs and two writes for "/foo/bar" into 2) would interleave each other and
> create an undesirable result.
> I guess that is not a likely situation in your case.
> Hopefully, my email helps.
> See also:
> http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic
>
> Regards,
> Alex Yiu
>
>
> On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis wrote:
>>
>> typically you will update both as part of a batch_mutate, and if it
>> fails, retry the operation. re-writing any part that succeeded will
>> be harmless.
>>
>> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
>> wrote:
>> > Hi, Cassandra people!
>> >
>> > We're looking at Cassandra as a possible replacement for some parts of
>> > our database structures, and on an early look I'm a bit confused about
>> > atomicity guarantees and rollbacks and such, so I wanted to ask what
>> > standard practice is for dealing with the sorts of situation I outline
>> > below.
>> >
>> > Imagine that we're storing information about files. Each file has a path
>> > and a uuid, and sometimes we need to look up stuff about a file by its
>> > path and sometimes by its uuid. The best way to do this, as I understand
>> > it, is to store the data in Cassandra twice: once indexed by nodeid and
>> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
>> >
>> > {
>> >   "some-uuid-1": {
>> >     "path": "/a/b/c",
>> >     "size": 10
>> >   },
>> >   "some-uuid-2" {
>> >     ...
>> >   },
>> >   ...
>> > }
>> >
>> > and one indexed by path
>> >
>> > {
>> >   "/a/b/c": {
>> >     "uuid": "some-uuid-1",
>> >     "size": 10
>> >   },
>> >   "/d/e/f" {
>> >     ...
>> >   },
>> >   ...
>> > }
>> >
>> > So, first, do please correct me if I've misunderstood the terminology
>> > here (and I've shown a "short form" of ColumnFamily here, as per
>> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>> >
>> > The thing I don't quite get is: what happens when I want to add a new
>> > file? I need to add it to both these ColumnFamilies, but there's no "add
>> > it to both" atomic operation. What's the way that people handle the
>> > situation where I add to the first CF and then my program crashes, so I
>> > never added to the second? (Assume that there is lots more data than
>> > I've outlined above, so that "put it all in one SuperColumnFamily,
>> > because that can be updated atomically" won't work because it would end
>> > up with our entire database in one SCF). Should we add to one, and then
>> > if we fail to add to the other for some reason continually retry until
>> > it works? Have a "garbage collection" procedure which finds
>> > discrepancies between indexes like this and fixes them up and run it
>> > from cron? We'd love to hear some advice on how to do this, or if we're
>> > modelling the data in the wrong way and there's a better way which
>> > avoids these problems!
>> >
>> > sil
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com

--
Patricio.-
Re: Understanding atomicity in Cassandra
Hi, Stuart,

If I may paraphrase what Jonathan said, typically your batch_mutate
operation is idempotent. That is, you can replay / retry the same
operation within a short timeframe without any undesirable side effect.

The assumption behind the "short timeframe" here refers to: there is no
other concurrent writer trying to write anything conflicting in an
interleaving fashion. Imagine that if there was another writer trying to
write:

> "some-uuid-1": {
>     "path": "/foo/bar",
>     "size": 10
> },
...
> {
> "/foo/bar": {
>     "uuid": "some-uuid-1"
> },

Then, there is a chance of 4 write operations (two writes for "/a/b/c"
into 2 CFs and two writes for "/foo/bar" into 2) would interleave each
other and create an undesirable result. I guess that is not a likely
situation in your case.

Hopefully, my email helps. See also:
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

Regards,
Alex Yiu

On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis wrote:
> typically you will update both as part of a batch_mutate, and if it
> fails, retry the operation. re-writing any part that succeeded will
> be harmless.
>
> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge wrote:
> > Hi, Cassandra people!
> >
> > We're looking at Cassandra as a possible replacement for some parts of
> > our database structures, and on an early look I'm a bit confused about
> > atomicity guarantees and rollbacks and such, so I wanted to ask what
> > standard practice is for dealing with the sorts of situation I outline
> > below.
> >
> > Imagine that we're storing information about files. Each file has a path
> > and a uuid, and sometimes we need to look up stuff about a file by its
> > path and sometimes by its uuid. The best way to do this, as I understand
> > it, is to store the data in Cassandra twice: once indexed by nodeid and
> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
> >
> > {
> >   "some-uuid-1": {
> >     "path": "/a/b/c",
> >     "size": 10
> >   },
> >   "some-uuid-2" {
> >     ...
> >   },
> >   ...
> > }
> >
> > and one indexed by path
> >
> > {
> >   "/a/b/c": {
> >     "uuid": "some-uuid-1",
> >     "size": 10
> >   },
> >   "/d/e/f" {
> >     ...
> >   },
> >   ...
> > }
> >
> > So, first, do please correct me if I've misunderstood the terminology
> > here (and I've shown a "short form" of ColumnFamily here, as per
> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
> >
> > The thing I don't quite get is: what happens when I want to add a new
> > file? I need to add it to both these ColumnFamilies, but there's no "add
> > it to both" atomic operation. What's the way that people handle the
> > situation where I add to the first CF and then my program crashes, so I
> > never added to the second? (Assume that there is lots more data than
> > I've outlined above, so that "put it all in one SuperColumnFamily,
> > because that can be updated atomically" won't work because it would end
> > up with our entire database in one SCF). Should we add to one, and then
> > if we fail to add to the other for some reason continually retry until
> > it works? Have a "garbage collection" procedure which finds
> > discrepancies between indexes like this and fixes them up and run it
> > from cron? We'd love to hear some advice on how to do this, or if we're
> > modelling the data in the wrong way and there's a better way which
> > avoids these problems!
> >
> > sil
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
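The dual-index write that Jonathan and Alex describe can be sketched in plain Python, with dicts standing in for the two ColumnFamilies (the row keys and column names are taken from the example above; this is not a Cassandra client API). Because each write is a plain overwrite, replaying the whole operation after a suspected partial failure leaves the same end state, which is what makes the retry strategy safe:

```python
# Plain-Python model of the two ColumnFamilies from the example above.
# Dicts stand in for CFs keyed by row key; not a Cassandra API.
by_uuid = {}  # CF indexed by uuid
by_path = {}  # CF indexed by path

def add_file(uuid, path, size):
    """Write the same record under both indexes.

    Each write is a plain overwrite, so replaying the whole operation
    after a partial failure is harmless (the operation is idempotent).
    """
    by_uuid[uuid] = {"path": path, "size": size}
    by_path[path] = {"uuid": uuid, "size": size}

add_file("some-uuid-1", "/a/b/c", 10)
# Simulate a retry after a suspected partial failure: same end state,
# even if the first attempt had already written one or both rows.
add_file("some-uuid-1", "/a/b/c", 10)
```

The caveat from the message above still applies: the retry is only safe so long as no concurrent writer is assigning a conflicting value (e.g. a different uuid) to the same path in the interleaving window.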
Re: Understanding atomicity in Cassandra
typically you will update both as part of a batch_mutate, and if it
fails, retry the operation. re-writing any part that succeeded will
be harmless.

On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge wrote:
> Hi, Cassandra people!
>
> We're looking at Cassandra as a possible replacement for some parts of
> our database structures, and on an early look I'm a bit confused about
> atomicity guarantees and rollbacks and such, so I wanted to ask what
> standard practice is for dealing with the sorts of situation I outline
> below.
>
> Imagine that we're storing information about files. Each file has a path
> and a uuid, and sometimes we need to look up stuff about a file by its
> path and sometimes by its uuid. The best way to do this, as I understand
> it, is to store the data in Cassandra twice: once indexed by nodeid and
> once by path. So, I have two ColumnFamilies, one indexed by uuid:
>
> {
>   "some-uuid-1": {
>     "path": "/a/b/c",
>     "size": 10
>   },
>   "some-uuid-2" {
>     ...
>   },
>   ...
> }
>
> and one indexed by path
>
> {
>   "/a/b/c": {
>     "uuid": "some-uuid-1",
>     "size": 10
>   },
>   "/d/e/f" {
>     ...
>   },
>   ...
> }
>
> So, first, do please correct me if I've misunderstood the terminology
> here (and I've shown a "short form" of ColumnFamily here, as per
> http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>
> The thing I don't quite get is: what happens when I want to add a new
> file? I need to add it to both these ColumnFamilies, but there's no "add
> it to both" atomic operation. What's the way that people handle the
> situation where I add to the first CF and then my program crashes, so I
> never added to the second? (Assume that there is lots more data than
> I've outlined above, so that "put it all in one SuperColumnFamily,
> because that can be updated atomically" won't work because it would end
> up with our entire database in one SCF). Should we add to one, and then
> if we fail to add to the other for some reason continually retry until
> it works? Have a "garbage collection" procedure which finds
> discrepancies between indexes like this and fixes them up and run it
> from cron? We'd love to hear some advice on how to do this, or if we're
> modelling the data in the wrong way and there's a better way which
> avoids these problems!
>
> sil

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com
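The "retry the operation" advice above can be sketched as a small wrapper (plain Python; `mutate` is a hypothetical stand-in for the actual batch_mutate call, and the attempt count and delay are arbitrary). It relies on exactly the property Jonathan states: because re-writing any part that already succeeded is harmless, the whole batch can simply be replayed on failure:

```python
import time

def retry_mutation(mutate, attempts=3, delay=0.1):
    """Retry an idempotent mutation until it succeeds or attempts run out.

    `mutate` stands in for a batch_mutate call; because rewriting
    columns that already succeeded is harmless, the entire batch is
    simply replayed on failure rather than rolled back.
    """
    for attempt in range(attempts):
        try:
            return mutate()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            time.sleep(delay)

# Usage: a simulated flaky mutation that fails once, then succeeds.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] == 1:
        raise RuntimeError("timeout")
    return "ok"

print(retry_mutation(flaky))  # "ok" after one retry
```

Note this only covers transient failures while the client stays up; a client that crashes mid-operation still needs the verify-on-restart or cron "garbage collection" pass discussed elsewhere in the thread.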