Hi Stuart,

If I may paraphrase what Jonathan said: typically your batch_mutate operation is idempotent. That is, you can replay or retry the same operation within a short timeframe without any undesirable side effects.
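A minimal sketch of why the replay is harmless, using a plain in-memory dict instead of a real Cassandra client. The `apply_batch` helper and the `by_uuid` / `by_path` ColumnFamily names are hypothetical, chosen only for illustration; the sketch also ignores Cassandra's per-column timestamps.

```python
import copy


def apply_batch(store, batch):
    """Apply column writes: store[cf][row_key][column] = value (hypothetical helper)."""
    for cf, row_key, column, value in batch:
        store.setdefault(cf, {}).setdefault(row_key, {})[column] = value


# The same file written into both ColumnFamilies (names are assumptions).
batch = [
    ("by_uuid", "some-uuid-1", "path", "/a/b/c"),
    ("by_uuid", "some-uuid-1", "size", 100000),
    ("by_path", "/a/b/c", "uuid", "some-uuid-1"),
    ("by_path", "/a/b/c", "size", 100000),
]

store = {}

# Simulate a partial failure: only the writes to the first CF landed.
apply_batch(store, batch[:2])
partial = copy.deepcopy(store)

# Retry the whole batch; re-writing the part that already succeeded changes nothing.
apply_batch(store, batch)
after_retry = copy.deepcopy(store)

# Replaying once more still leaves the store unchanged.
apply_batch(store, batch)
```

Because every write just sets a column to a fixed value, replaying the batch converges on the same final state no matter how much of the first attempt succeeded.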
The "short timeframe" assumption means that no other concurrent writer is trying to write conflicting data in an interleaved fashion. Imagine there were another writer trying to write:

>   "some-uuid-1": {
>     "path": "/foo/bar",
>     "size": 100000
>   },
...
> {
>   "/foo/bar": {
>     "uuid": "some-uuid-1"
>   },

Then there is a chance that the 4 write operations (two writes for "/a/b/c" into the 2 CFs and two writes for "/foo/bar" into the 2 CFs) would interleave with each other and create an undesirable result. I guess that is not a likely situation in your case.

Hopefully, my email helps.

See also:
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

Regards,
Alex Yiu

On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> typically you will update both as part of a batch_mutate, and if it
> fails, retry the operation. re-writing any part that succeeded will
> be harmless.
>
> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
> <stuart.langri...@canonical.com> wrote:
> > Hi, Cassandra people!
> >
> > We're looking at Cassandra as a possible replacement for some parts of
> > our database structures, and on an early look I'm a bit confused about
> > atomicity guarantees and rollbacks and such, so I wanted to ask what
> > standard practice is for dealing with the sorts of situation I outline
> > below.
> >
> > Imagine that we're storing information about files. Each file has a path
> > and a uuid, and sometimes we need to look up stuff about a file by its
> > path and sometimes by its uuid. The best way to do this, as I understand
> > it, is to store the data in Cassandra twice: once indexed by nodeid and
> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
> >
> > {
> >   "some-uuid-1": {
> >     "path": "/a/b/c",
> >     "size": 100000
> >   },
> >   "some-uuid-2" {
> >     ...
> >   },
> >   ...
> > }
> >
> > and one indexed by path
> >
> > {
> >   "/a/b/c": {
> >     "uuid": "some-uuid-1",
> >     "size": 100000
> >   },
> >   "/d/e/f" {
> >     ...
> >   },
> >   ...
> > }
> >
> > So, first, do please correct me if I've misunderstood the terminology
> > here (and I've shown a "short form" of ColumnFamily here, as per
> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
> >
> > The thing I don't quite get is: what happens when I want to add a new
> > file? I need to add it to both these ColumnFamilies, but there's no "add
> > it to both" atomic operation. What's the way that people handle the
> > situation where I add to the first CF and then my program crashes, so I
> > never added to the second? (Assume that there is lots more data than
> > I've outlined above, so that "put it all in one SuperColumnFamily,
> > because that can be updated atomically" won't work because it would end
> > up with our entire database in one SCF). Should we add to one, and then
> > if we fail to add to the other for some reason continually retry until
> > it works? Have a "garbage collection" procedure which finds
> > discrepancies between indexes like this and fixes them up and run it
> > from cron? We'd love to hear some advice on how to do this, or if we're
> > modelling the data in the wrong way and there's a better way which
> > avoids these problems!
> >
> > sil
> >
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
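As a rough illustration of the conflicting-writer case described at the top of this message, and of the kind of discrepancy check Stuart asks about, here is a small sketch. It models the two ColumnFamilies as plain dicts with no timestamps and no deletes; `apply_writes`, `dangling_paths`, and the `by_uuid` / `by_path` names are hypothetical and not part of any Cassandra API.

```python
def apply_writes(store, writes):
    """Apply column writes: store[cf][row_key][column] = value (hypothetical helper)."""
    for cf, row_key, column, value in writes:
        store.setdefault(cf, {}).setdefault(row_key, {})[column] = value


def dangling_paths(store):
    """Return paths whose by_path row names a uuid whose by_uuid row has a different path."""
    by_uuid = store.get("by_uuid", {})
    return sorted(
        path
        for path, cols in store.get("by_path", {}).items()
        if by_uuid.get(cols["uuid"], {}).get("path") != path
    )


# Two writers that disagree about the path of the same uuid.
writer_a = [
    ("by_uuid", "some-uuid-1", "path", "/a/b/c"),
    ("by_path", "/a/b/c", "uuid", "some-uuid-1"),
]
writer_b = [
    ("by_uuid", "some-uuid-1", "path", "/foo/bar"),
    ("by_path", "/foo/bar", "uuid", "some-uuid-1"),
]

store = {}

# One possible ordering: A's first write, then both of B's writes, then A's second write.
apply_writes(store, writer_a[:1])
apply_writes(store, writer_b)
apply_writes(store, writer_a[1:])

# The by_path row for "/a/b/c" now points at a uuid whose recorded path is "/foo/bar".
stale = dangling_paths(store)
```

In this model, a periodic job could run something like `dangling_paths` over both indexes and repair or remove the rows it reports; how that repair should work against a real cluster is left open here.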