Re: Understanding atomicity in Cassandra

2010-07-20 Thread Patricio Echagüe
Hi, regarding the retrying strategy: I understand that it might make
sense, assuming the client can actually perform a retry.

We are trying to build a fault tolerance solution based on Cassandra.
In some scenarios, the client machine can go down during a
transaction.

Would it be bad design to store all the data that needs to be
consistent under one big key? In this case the batch_mutate operations
would not be big, since only a small part is updated/added at a time.
But at least we would know that the operation either succeeded or failed.

We basically have:

CF: usernames (similar to Twitter model)
SCF: User_tree (it has all the information related to the user)
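
For example (names invented just for illustration), a single user's row
in User_tree might look like:

  patricio: {
    profile: { email: ..., created: ... },
    preferences: { theme: ..., locale: ... }
  }

so every batch_mutate for that user touches only the one row key, and
the row-level atomicity covers the whole update.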

Thanks

On Mon, Jul 19, 2010 at 9:40 PM, Alex Yiu bigcontentf...@gmail.com wrote:

 Hi, Stuart,
 If I may paraphrase what Jonathan said, typically your batch_mutate
 operation is idempotent.
 That is, you can replay / retry the same operation within a short timeframe
 without any undesirable side effect.
 The assumption behind the "short timeframe" here is that there is no
 other concurrent writer trying to write anything conflicting in an
 interleaving fashion.
 Imagine that another writer was trying to write:
 {
   some-uuid-1: {
     path: /foo/bar,
     size: 10
   },
   ...
 }
 and
 {
   /foo/bar: {
     uuid: some-uuid-1
   },
   ...
 }
 Then, there is a chance that the 4 write operations (two writes for
 /a/b/c into the 2 CFs and two writes for /foo/bar into the same 2 CFs)
 would interleave with each other and create an undesirable result.
 I guess that is not a likely situation in your case.
 Hopefully, my email helps.
 See also:
 http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

 Regards,
 Alex Yiu


 On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis jbel...@gmail.com wrote:

 typically you will update both as part of a batch_mutate, and if it
 fails, retry the operation.  re-writing any part that succeeded will
 be harmless.

 On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
 stuart.langri...@canonical.com wrote:
  Hi, Cassandra people!
 
  We're looking at Cassandra as a possible replacement for some parts of
  our database structures, and on an early look I'm a bit confused about
  atomicity guarantees and rollbacks and such, so I wanted to ask what
  standard practice is for dealing with the sorts of situation I outline
  below.
 
  Imagine that we're storing information about files. Each file has a path
  and a uuid, and sometimes we need to look up stuff about a file by its
  path and sometimes by its uuid. The best way to do this, as I understand
  it, is to store the data in Cassandra twice: once indexed by uuid and
  once by path. So, I have two ColumnFamilies, one indexed by uuid:
 
  {
   some-uuid-1: {
     path: /a/b/c,
     size: 10
   },
   some-uuid-2 {
     ...
   },
   ...
  }
 
  and one indexed by path
 
  {
   /a/b/c: {
     uuid: some-uuid-1,
     size: 10
   },
   /d/e/f {
     ...
   },
   ...
  }
 
  So, first, do please correct me if I've misunderstood the terminology
  here (and I've shown a short form of ColumnFamily here, as per
  http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
 
  The thing I don't quite get is: what happens when I want to add a new
  file? I need to add it to both these ColumnFamilies, but there's no
  "add it to both" atomic operation. What's the way that people handle the
  situation where I add to the first CF and then my program crashes, so I
  never added to the second? (Assume that there is lots more data than
  I've outlined above, so that "put it all in one SuperColumnFamily,
  because that can be updated atomically" won't work, because it would end
  up with our entire database in one SCF). Should we add to one, and then
  if we fail to add to the other for some reason continually retry until
  it works? Have a garbage collection procedure which finds
  discrepancies between indexes like this and fixes them up and run it
  from cron? We'd love to hear some advice on how to do this, or if we're
  modelling the data in the wrong way and there's a better way which
  avoids these problems!
 
  sil
 
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com





-- 
Patricio.-


Re: Understanding atomicity in Cassandra

2010-07-20 Thread Jonathan Ellis
2010/7/20 Patricio Echagüe patric...@gmail.com:
 Would it be bad design to store all the data that needs to be
 consistent under one big key?

That really depends on how unnatural it is from a query perspective. :)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Understanding atomicity in Cassandra

2010-07-20 Thread Alex Yiu
Hi, Patricio,

It's hard to comment on your original questions without knowing the
details of your domain-specific data model and data-processing
expectations.

W.R.T. lumping things into one big row, there is a limitation in
Cassandra's data model: you get CFs and SCFs, i.e. at most 2 levels of
nesting for an atomic value update. So you cannot lump arbitrarily
complex data into a single big row.
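
Roughly, the only two shapes you get are:

  standard CF:  row-key: { column: value, ... }
  super CF:     row-key: { super-column: { column: value, ... }, ... }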

Even though the update of one particular row is atomic, you can still
run into situations where concurrent read-write operations conflict with
each other.

For example, suppose one of your column values holds a list.
The old value is: a, b, c
The operation is: add d to that list.
The desired new value is: a, b, c, d
If there is another concurrent operation that tries to add e to the
list, you would still have a problem, given the present atomic semantics
of row updates in Cassandra.
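
A toy sketch of that lost update (plain Python standing in for the
client; not a real driver API):

# Each insert() below is atomic on its own, just like a row update, but
# nothing orders the two read-modify-write cycles against each other.
class FakeRow:
    def __init__(self):
        self.columns = {"members": "a,b,c"}

    def get(self, column):
        return self.columns[column]

    def insert(self, column, value):  # atomic single write
        self.columns[column] = value

row = FakeRow()
snapshot_1 = row.get("members")           # writer 1 reads "a,b,c"
snapshot_2 = row.get("members")           # writer 2 also reads "a,b,c"
row.insert("members", snapshot_1 + ",d")  # writer 1 writes "a,b,c,d"
row.insert("members", snapshot_2 + ",e")  # writer 2 writes "a,b,c,e"
print(row.get("members"))                 # "a,b,c,e" -- the d is lost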

On the other hand, there are a number of application scenarios where
update operations can safely be considered idempotent, e.g. bulk loading
data from flat files into Cassandra.

If your main worry is the client process crashing, then regardless of
what kind of ACID properties Cassandra can provide, you still want a way
to verify whether Cassandra has stored the desired state, and/or to log
each processed update operation in the context of bulk loading. Then you
can decide whether a particular data update needs to be repeated or not.
A full-fledged ACID database (all-or-nothing semantics) can decrease the
complexity of verifying that storage succeeded, but it cannot remove
that concern completely. Consider the case where the client process
crashes right at the moment of dbConn.commit(): you still don't know for
sure whether that update operation went through.
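
For the bulk-loading case, the verification can be a simple read-back,
e.g. (get_row and batch_insert are hypothetical wrapper names, not the
Thrift API):

def ensure_written(client, row_key, expected_columns):
    # Re-read the row and compare against the intended mutation.
    stored = client.get_row(row_key) or {}
    missing = {name: value
               for name, value in expected_columns.items()
               if stored.get(name) != value}
    if missing:
        client.batch_insert(row_key, missing)  # replay is idempotent
    return not missing  # True if nothing had to be replayed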

Hope this email helps.

Thanks!


Regards,
Alex Yiu



On Tue, Jul 20, 2010 at 2:03 PM, Jonathan Ellis jbel...@gmail.com wrote:

 2010/7/20 Patricio Echagüe patric...@gmail.com:
  Would it be bad design to store all the data that needs to be
  consistent under one big key?

 That really depends on how unnatural it is from a query perspective. :)

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com



Re: Understanding atomicity in Cassandra

2010-07-19 Thread Alex Yiu
Hi, Stuart,

If I may paraphrase what Jonathan said, typically your batch_mutate
operation is idempotent.
That is, you can replay / retry the same operation within a short timeframe
without any undesirable side effect.

The assumption behind the "short timeframe" here is that there is no
other concurrent writer trying to write anything conflicting in an
interleaving fashion.

Imagine that another writer was trying to write:

{
  some-uuid-1: {
    path: /foo/bar,
    size: 10
  },
  ...
}

and

{
  /foo/bar: {
    uuid: some-uuid-1
  },
  ...
}

Then, there is a chance that the 4 write operations (two writes for
/a/b/c into the 2 CFs and two writes for /foo/bar into the same 2 CFs)
would interleave with each other and create an undesirable result.
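
A toy model of that interleaving, with plain Python dicts standing in
for the two CFs (each assignment is atomic on its own, like a row
update, but nothing orders the four writes across the two writers):

by_uuid, by_path = {}, {}

# Writer A wants uuid-1 <-> /a/b/c; writer B wants uuid-1 <-> /foo/bar.
by_uuid["some-uuid-1"] = {"path": "/a/b/c"}    # A writes the uuid CF
by_uuid["some-uuid-1"] = {"path": "/foo/bar"}  # B overwrites it
by_path["/foo/bar"] = {"uuid": "some-uuid-1"}  # B writes the path CF
by_path["/a/b/c"] = {"uuid": "some-uuid-1"}    # A writes the path CF last

# by_uuid now says uuid-1 lives at /foo/bar, yet both path rows point
# back at uuid-1 -- the two CFs disagree.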

I guess that is not a likely situation in your case.

Hopefully, my email helps.

See also:
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic


Regards,
Alex Yiu



On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis jbel...@gmail.com wrote:

 typically you will update both as part of a batch_mutate, and if it
 fails, retry the operation.  re-writing any part that succeeded will
 be harmless.

 On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
 stuart.langri...@canonical.com wrote:
  Hi, Cassandra people!
 
  We're looking at Cassandra as a possible replacement for some parts of
  our database structures, and on an early look I'm a bit confused about
  atomicity guarantees and rollbacks and such, so I wanted to ask what
  standard practice is for dealing with the sorts of situation I outline
  below.
 
  Imagine that we're storing information about files. Each file has a path
  and a uuid, and sometimes we need to look up stuff about a file by its
  path and sometimes by its uuid. The best way to do this, as I understand
  it, is to store the data in Cassandra twice: once indexed by uuid and
  once by path. So, I have two ColumnFamilies, one indexed by uuid:
 
  {
   some-uuid-1: {
     path: /a/b/c,
     size: 10
   },
   some-uuid-2 {
     ...
   },
   ...
  }
 
  and one indexed by path
 
  {
   /a/b/c: {
     uuid: some-uuid-1,
     size: 10
   },
   /d/e/f {
     ...
   },
   ...
  }
 
  So, first, do please correct me if I've misunderstood the terminology
  here (and I've shown a short form of ColumnFamily here, as per
  http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
 
  The thing I don't quite get is: what happens when I want to add a new
  file? I need to add it to both these ColumnFamilies, but there's no
  "add it to both" atomic operation. What's the way that people handle the
  situation where I add to the first CF and then my program crashes, so I
  never added to the second? (Assume that there is lots more data than
  I've outlined above, so that "put it all in one SuperColumnFamily,
  because that can be updated atomically" won't work, because it would end
  up with our entire database in one SCF). Should we add to one, and then
  if we fail to add to the other for some reason continually retry until
  it works? Have a garbage collection procedure which finds
  discrepancies between indexes like this and fixes them up and run it
  from cron? We'd love to hear some advice on how to do this, or if we're
  modelling the data in the wrong way and there's a better way which
  avoids these problems!
 
  sil
 
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of Riptano, the source for professional Cassandra support
 http://riptano.com



Re: Understanding atomicity in Cassandra

2010-07-09 Thread Jonathan Ellis
typically you will update both as part of a batch_mutate, and if it
fails, retry the operation.  re-writing any part that succeeded will
be harmless.
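
A sketch of that retry loop, assuming a hypothetical client wrapper (the
CF names and the batch_mutate signature here are stand-ins, not the raw
Thrift API):

import time

def write_file_record(client, uuid, path, size, retries=5):
    # One mutation map covering both CFs; replaying it is harmless
    # because every retry writes the exact same columns and values.
    mutations = {
        "ByUuid": {uuid: {"path": path, "size": str(size)}},
        "ByPath": {path: {"uuid": uuid, "size": str(size)}},
    }
    for attempt in range(retries):
        try:
            client.batch_mutate(mutations)  # idempotent: safe to replay
            return
        except IOError:                     # e.g. timeout / unavailable
            time.sleep(0.1 * 2 ** attempt)  # back off, then try again
    raise RuntimeError("giving up after %d attempts" % retries)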

On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
stuart.langri...@canonical.com wrote:
 Hi, Cassandra people!

 We're looking at Cassandra as a possible replacement for some parts of
 our database structures, and on an early look I'm a bit confused about
 atomicity guarantees and rollbacks and such, so I wanted to ask what
 standard practice is for dealing with the sorts of situation I outline
 below.

 Imagine that we're storing information about files. Each file has a path
 and a uuid, and sometimes we need to look up stuff about a file by its
 path and sometimes by its uuid. The best way to do this, as I understand
 it, is to store the data in Cassandra twice: once indexed by uuid and
 once by path. So, I have two ColumnFamilies, one indexed by uuid:

 {
  some-uuid-1: {
    path: /a/b/c,
    size: 10
  },
  some-uuid-2 {
    ...
  },
  ...
 }

 and one indexed by path

 {
  /a/b/c: {
    uuid: some-uuid-1,
    size: 10
  },
  /d/e/f {
    ...
  },
  ...
 }

 So, first, do please correct me if I've misunderstood the terminology
 here (and I've shown a short form of ColumnFamily here, as per
 http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).

 The thing I don't quite get is: what happens when I want to add a new
 file? I need to add it to both these ColumnFamilies, but there's no
 "add it to both" atomic operation. What's the way that people handle the
 situation where I add to the first CF and then my program crashes, so I
 never added to the second? (Assume that there is lots more data than
 I've outlined above, so that "put it all in one SuperColumnFamily,
 because that can be updated atomically" won't work, because it would end
 up with our entire database in one SCF). Should we add to one, and then
 if we fail to add to the other for some reason continually retry until
 it works? Have a garbage collection procedure which finds
 discrepancies between indexes like this and fixes them up and run it
 from cron? We'd love to hear some advice on how to do this, or if we're
 modelling the data in the wrong way and there's a better way which
 avoids these problems!

 sil






-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com