Hi, Stuart,

If I may paraphrase what Jonathan said, typically your batch_mutate
operation is idempotent.
That is, you can replay / retry the same operation within a short timeframe
without any undesirable side effect.
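
In code, that retry pattern might look roughly like the sketch below. This is
only a minimal sketch, assuming pycassa and two column families named
FilesByUuid and FilesByPath that match your uuid / path layout; the keyspace
name and the put_file helper are made up for illustration.

# Sketch: write both CFs in one batch_mutate and retry the whole thing on
# failure.  Assumes pycassa; CF / keyspace names are hypothetical.
import pycassa
from pycassa.batch import Mutator

pool = pycassa.ConnectionPool('MyKeyspace')            # hypothetical keyspace
files_by_uuid = pycassa.ColumnFamily(pool, 'FilesByUuid')
files_by_path = pycassa.ColumnFamily(pool, 'FilesByPath')

def put_file(uuid, path, size, retries=3):
    # Both inserts go out in a single batch_mutate call when send() runs.
    # If the call fails (e.g. a timeout), simply re-run it: re-writing the
    # columns that already succeeded is harmless because the same values
    # are written again (the operation is idempotent).
    for attempt in range(retries):
        b = Mutator(pool)
        b.insert(files_by_uuid, uuid, {'path': path, 'size': str(size)})
        b.insert(files_by_path, path, {'uuid': uuid, 'size': str(size)})
        try:
            b.send()
            return
        except Exception:
            # In practice you would catch the client's timeout /
            # unavailable exceptions rather than a bare Exception.
            if attempt == retries - 1:
                raise

put_file('some-uuid-1', '/a/b/c', 100000)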

The assumption behind the "short timeframe" is that there is no other
concurrent writer trying to write something conflicting in an interleaved
fashion.

Imagine that another writer was trying to write:
>  "some-uuid-1": {
>    "path": "/foo/bar",
>    "size": 100000
>  },
...
> {
>  "/foo/bar": {
>    "uuid": "some-uuid-1"
>  },

Then there is a chance that the 4 write operations (two writes for "/a/b/c"
into the 2 CFs and two writes for "/foo/bar" into the same 2 CFs) would
interleave with each other and create an undesirable result.
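
To make the interleaving concrete, here is a toy, in-memory simulation of
that race: two plain dicts stand in for the 2 CFs, and I assume the second
writer's columns carry the later timestamps, so its values win per column.
Nothing here is Cassandra-specific; it only shows the inconsistent end state.

# Toy simulation of the race: writer A maps some-uuid-1 to /a/b/c, writer B
# maps it to /foo/bar.  One possible interleaving of their 4 writes:
files_by_uuid = {}
files_by_path = {}

files_by_uuid['some-uuid-1'] = {'path': '/a/b/c', 'size': 100000}    # A, write 1
files_by_uuid['some-uuid-1'] = {'path': '/foo/bar', 'size': 100000}  # B, write 1
files_by_path['/foo/bar'] = {'uuid': 'some-uuid-1'}                  # B, write 2
files_by_path['/a/b/c'] = {'uuid': 'some-uuid-1'}                    # A, write 2

# End state: the uuid index says the file lives at /foo/bar, yet both path
# rows still claim some-uuid-1 -- the two indexes no longer agree.
print(files_by_uuid['some-uuid-1']['path'])  # -> /foo/bar
print(files_by_path['/a/b/c']['uuid'])       # -> some-uuid-1
print(files_by_path['/foo/bar']['uuid'])     # -> some-uuid-1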

I guess that is not a likely situation in your case.

Hopefully, my email helps.

See also:
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic


Regards,
Alex Yiu



On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> typically you will update both as part of a batch_mutate, and if it
> fails, retry the operation.  re-writing any part that succeeded will
> be harmless.
>
> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
> <stuart.langri...@canonical.com> wrote:
> > Hi, Cassandra people!
> >
> > We're looking at Cassandra as a possible replacement for some parts of
> > our database structures, and on an early look I'm a bit confused about
> > atomicity guarantees and rollbacks and such, so I wanted to ask what
> > standard practice is for dealing with the sorts of situation I outline
> > below.
> >
> > Imagine that we're storing information about files. Each file has a path
> > and a uuid, and sometimes we need to look up stuff about a file by its
> > path and sometimes by its uuid. The best way to do this, as I understand
> > it, is to store the data in Cassandra twice: once indexed by uuid and
> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
> >
> > {
> >  "some-uuid-1": {
> >    "path": "/a/b/c",
> >    "size": 100000
> >  },
> >  "some-uuid-2" {
> >    ...
> >  },
> >  ...
> > }
> >
> > and one indexed by path
> >
> > {
> >  "/a/b/c": {
> >    "uuid": "some-uuid-1",
> >    "size": 100000
> >  },
> >  "/d/e/f" {
> >    ...
> >  },
> >  ...
> > }
> >
> > So, first, do please correct me if I've misunderstood the terminology
> > here (and I've shown a "short form" of ColumnFamily here, as per
> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
> >
> > The thing I don't quite get is: what happens when I want to add a new
> > file? I need to add it to both these ColumnFamilies, but there's no "add
> > it to both" atomic operation. What's the way that people handle the
> > situation where I add to the first CF and then my program crashes, so I
> > never added to the second? (Assume that there is lots more data than
> > I've outlined above, so that "put it all in one SuperColumnFamily,
> > because that can be updated atomically" won't work because it would end
> > up with our entire database in one SCF). Should we add to one, and then
> > if we fail to add to the other for some reason continually retry until
> > it works? Have a "garbage collection" procedure which finds
> > discrepancies between indexes like this and fixes them up and run it
> > from cron? We'd love to hear some advice on how to do this, or if we're
> > modelling the data in the wrong way and there's a better way which
> > avoids these problems!
> >
> > sil
> >
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>
