Hi, Cassandra people! We're looking at Cassandra as a possible replacement for some parts of our database structures, and on an early look I'm a bit confused about atomicity guarantees, rollbacks, and so on, so I wanted to ask what standard practice is for dealing with the sort of situation I outline below.
Imagine that we're storing information about files. Each file has a path and a uuid, and sometimes we need to look up a file by its path and sometimes by its uuid. The best way to do this, as I understand it, is to store the data in Cassandra twice: once indexed by uuid and once by path. So I have two ColumnFamilies, one keyed by uuid:

    {
      "some-uuid-1": { "path": "/a/b/c", "size": 100000 },
      "some-uuid-2": { ... },
      ...
    }

and one keyed by path:

    {
      "/a/b/c": { "uuid": "some-uuid-1", "size": 100000 },
      "/d/e/f": { ... },
      ...
    }

So, first, do please correct me if I've misunderstood the terminology here (I've shown a "short form" of a ColumnFamily, as per http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).

The thing I don't quite get is: what happens when I want to add a new file? I need to add it to both of these ColumnFamilies, but there's no atomic "add it to both" operation. How do people handle the case where I add to the first CF and then my program crashes, so the row never gets added to the second? (Assume that there is a lot more data than I've outlined above, so "put it all in one SuperColumnFamily, because that can be updated atomically" won't work; it would end up with our entire database in one SCF.)

Should we:

  - add to one CF, and then if the write to the other fails for some reason, keep retrying until it succeeds?
  - run a "garbage collection" job from cron which finds discrepancies between indexes like this and fixes them up?

We'd love to hear advice on how to do this, or, if we're modelling the data the wrong way, on a better approach which avoids these problems!

sil
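
P.S. In case it makes the question clearer, here's roughly the shape of what I mean in Python with pycassa (the keyspace, ColumnFamily and function names are invented, and I haven't run this exact code, so treat it as a sketch): add_file() is the non-atomic double write, and reconcile() is the "fix it up from cron" idea.

    import pycassa

    pool = pycassa.ConnectionPool('Files', ['localhost:9160'])
    by_uuid = pycassa.ColumnFamily(pool, 'FilesByUUID')
    by_path = pycassa.ColumnFamily(pool, 'FilesByPath')

    def add_file(uuid, path, size):
        # First write: the uuid-keyed row.
        by_uuid.insert(uuid, {'path': path, 'size': str(size)})
        # ...if the process dies right here, FilesByPath never hears
        # about this file, which is the gap I'm worried about...
        # Second write: the path-keyed row.
        by_path.insert(path, {'uuid': uuid, 'size': str(size)})

    def reconcile():
        # Sweep the uuid-keyed CF and re-create any path-keyed rows
        # that are missing (the "garbage collection from cron" idea).
        for uuid, cols in by_uuid.get_range():
            try:
                by_path.get(cols['path'])
            except pycassa.NotFoundException:
                by_path.insert(cols['path'],
                               {'uuid': uuid, 'size': cols['size']})

Is that the sort of thing people actually do, or is there a smarter pattern?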