Re: Understanding atomicity in Cassandra

2010-07-20 Thread Alex Yiu
Hi, Patricio,

It's hard to comment on your original questions without knowing the details of
your domain-specific data model and data-processing expectations.

W.R.T. lumping things into one big row, there is a limitation in Cassandra's
data model. You have CFs and SCFs. That is, you have only 2 levels of nesting
at most for an atomic value update, i.e. you cannot lump arbitrarily
complex data into a single big row.
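
A rough sketch of those two nesting levels as plain Python dicts (purely
illustrative, not any driver's API):

# Standard ColumnFamily: row key -> column name -> value.
standard_cf = {
    "row-key-1": {
        "column-a": "value-a",
        "column-b": "value-b",
    },
}

# Super ColumnFamily: row key -> super column -> column name -> value.
# That second level is as deep as the nesting goes; the row is still
# the unit of atomic update.
super_cf = {
    "row-key-1": {
        "super-column-x": {
            "column-a": "value-a",
            "column-b": "value-b",
        },
    },
}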

Even though the update for one particular row is atomic, you can still run
into concurrent read-write operations that conflict with each other.

For example, suppose one of your column values holds a list.
The old value is: "a, b, c"
And the operation is: you want to add "d" to that list.
The desired new value is: "a, b, c, d"
If there is another concurrent operation that tries to add "e" to the list,
you would still have a problem, given the present atomic semantics of row
updates in Cassandra.
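
A small sketch of that race in Python, using hypothetical get/insert calls
(the names and signatures are illustrative only, not a particular driver's
API):

# NOT safe against concurrent writers: the row write is atomic,
# but the read + write pair is not.
def append_to_list_column(client, cf, row_key, column, new_item):
    current = client.get(cf, row_key, column)            # e.g. "a, b, c"
    updated = current + ", " + new_item if current else new_item
    # If another client appended "e" between our read and this write,
    # one of the two appends is silently lost.
    client.insert(cf, row_key, {column: updated})        # e.g. "a, b, c, d"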

On the other hand, there are a number of application scenarios where update
operations can safely be considered idempotent,
e.g. bulk-loading data from flat files into Cassandra.

If your main worry is the client process crashing, then regardless of what
ACID properties Cassandra can provide, you still want a way to verify whether
Cassandra has stored the desired state and/or to log each processed update
operation in the context of bulk loading. Then you can decide whether a
particular data update needs to be repeated or not. A full-fledged ACID
database ("all or nothing" semantics) can reduce the complexity of verifying
that a write succeeded, but it cannot remove that concern completely. Consider
the case where the client process crashes right at the moment of
"dbConn.commit()": you still don't know for sure whether that update went
through.
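
For the bulk-loading case, here is a minimal sketch of logging each processed
update so a crashed run can be resumed; client.insert is a stand-in for your
driver's write call, and the file layout ("id" column, one key per log line)
is made up for illustration:

import csv
import os

def load_file(client, data_path, progress_log):
    # Keys already confirmed on a previous run.
    done = set()
    if os.path.exists(progress_log):
        with open(progress_log) as f:
            done = set(line.strip() for line in f)
    with open(data_path) as data, open(progress_log, "a") as log:
        for row in csv.DictReader(data):
            key = row["id"]
            if key in done:
                continue                        # skip confirmed updates
            client.insert("Records", key, row)  # idempotent: replaying is harmless
            # Optionally read back here to verify before recording progress.
            log.write(key + "\n")
            log.flush()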

Hope this email helps.

Thanks!


Regards,
Alex Yiu



On Tue, Jul 20, 2010 at 2:03 PM, Jonathan Ellis  wrote:

> 2010/7/20 Patricio Echagüe :
> > Would it be bad design to store all the data that need to be
> > consistent under one big key?
>
> That really depends how unnatural it is from a query perspective. :)
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: Understanding atomicity in Cassandra

2010-07-20 Thread Jonathan Ellis
2010/7/20 Patricio Echagüe :
> Would it be bad design to store all the data that need to be
> consistent under one big key?

That really depends how unnatural it is from a query perspective. :)

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Understanding atomicity in Cassandra

2010-07-20 Thread Patricio Echagüe
Hi, regarding the retrying strategy, I understand that it might make
sense assuming that the client can actually perform a retry.

We are trying to build a fault-tolerant solution based on Cassandra.
In some scenarios, the client machine can go down during a
transaction.

Would it be bad design to store all the data that needs to be
consistent under one big key? In this case the batch_mutate operations
will not be big, since just a small part is updated/added at a time. But
at least we know that the operation either succeeded or failed.

We basically have:

CF: usernames (similar to Twitter model)
SCF: User_tree (it has all the information related to the user)

Thanks

On Mon, Jul 19, 2010 at 9:40 PM, Alex Yiu  wrote:
>
> Hi, Stuart,
> If I may paraphrase what Jonathan said, typically your batch_mutate
> operation is idempotent.
> That is, you can replay / retry the same operation within a short timeframe
> without any undesirable side effect.
> The assumption behind the "short timeframe" here refers to: there is no
> other concurrent writer trying to write anything conflicting in an
> interleaving fashion.
> Imagine that if there was another writer trying to write:
>>  "some-uuid-1": {
>>    "path": "/foo/bar",
>>    "size": 10
>>  },
> ...
>> {
>>  "/foo/bar": {
>>    "uuid": "some-uuid-1"
>>  },
> Then, there is a chance of 4 write operations (two writes for "/a/b/c" into
> 2 CFs and two writes for "/foo/bar" into 2) would interleave each other and
> create an undesirable result.
> I guess that is not a likely situation in your case.
> Hopefully, my email helps.
> See also:
> http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic
>
> Regards,
> Alex Yiu
>
>
> On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis  wrote:
>>
>> typically you will update both as part of a batch_mutate, and if it
>> fails, retry the operation.  re-writing any part that succeeded will
>> be harmless.
>>
>> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
>>  wrote:
>> > Hi, Cassandra people!
>> >
>> > We're looking at Cassandra as a possible replacement for some parts of
>> > our database structures, and on an early look I'm a bit confused about
>> > atomicity guarantees and rollbacks and such, so I wanted to ask what
>> > standard practice is for dealing with the sorts of situation I outline
>> > below.
>> >
>> > Imagine that we're storing information about files. Each file has a path
>> > and a uuid, and sometimes we need to look up stuff about a file by its
>> > path and sometimes by its uuid. The best way to do this, as I understand
>> > it, is to store the data in Cassandra twice: once indexed by nodeid and
>> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
>> >
>> > {
>> >  "some-uuid-1": {
>> >    "path": "/a/b/c",
>> >    "size": 10
>> >  },
>> >  "some-uuid-2" {
>> >    ...
>> >  },
>> >  ...
>> > }
>> >
>> > and one indexed by path
>> >
>> > {
>> >  "/a/b/c": {
>> >    "uuid": "some-uuid-1",
>> >    "size": 10
>> >  },
>> >  "/d/e/f" {
>> >    ...
>> >  },
>> >  ...
>> > }
>> >
>> > So, first, do please correct me if I've misunderstood the terminology
>> > here (and I've shown a "short form" of ColumnFamily here, as per
>> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>> >
>> > The thing I don't quite get is: what happens when I want to add a new
>> > file? I need to add it to both these ColumnFamilies, but there's no "add
>> > it to both" atomic operation. What's the way that people handle the
>> > situation where I add to the first CF and then my program crashes, so I
>> > never added to the second? (Assume that there is lots more data than
>> > I've outlined above, so that "put it all in one SuperColumnFamily,
>> > because that can be updated atomically" won't work because it would end
>> > up with our entire database in one SCF). Should we add to one, and then
>> > if we fail to add to the other for some reason continually retry until
>> > it works? Have a "garbage collection" procedure which finds
>> > discrepancies between indexes like this and fixes them up and run it
>> > from cron? We'd love to hear some advice on how to do this, or if we're
>> > modelling the data in the wrong way and there's a better way which
>> > avoids these problems!
>> >
>> > sil
>> >
>> >
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com
>
>



-- 
Patricio.-


Re: Understanding atomicity in Cassandra

2010-07-19 Thread Alex Yiu
Hi, Stuart,

If I may paraphrase what Jonathan said, typically your batch_mutate
operation is idempotent.
That is, you can replay / retry the same operation within a short timeframe
without any undesirable side effects.

The assumption behind the "short timeframe" is that there is no
other concurrent writer trying to write anything conflicting in an
interleaving fashion.

Imagine that another writer was trying to write:
>  "some-uuid-1": {
>"path": "/foo/bar",
>"size": 10
>  },
...
> {
>  "/foo/bar": {
>"uuid": "some-uuid-1"
>  },

Then, there is a chance that the 4 write operations (two writes for "/a/b/c"
into the 2 CFs and two writes for "/foo/bar" into the same 2 CFs) would
interleave with each other and create an undesirable result.

I guess that is not a likely situation in your case.

Hopefully, my email helps.

See also:
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic
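
To make the interleaving concrete, a small sketch (hypothetical single-row
"insert" call, made-up CF names): each writer issues two separate row
updates, one per CF, and nothing makes that pair atomic.

def index_file(client, uuid, path, size):
    # Writer A calls this with path="/a/b/c"; a concurrent writer B calls
    # it with path="/foo/bar" for the same uuid.
    client.insert("FilesByUuid", uuid, {"path": path, "size": str(size)})
    # <-- writer B's two inserts can land here, so the two CFs can end up
    #     pointing at different paths for the same uuid.
    client.insert("FilesByPath", path, {"uuid": uuid, "size": str(size)})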


Regards,
Alex Yiu



On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis  wrote:

> typically you will update both as part of a batch_mutate, and if it
> fails, retry the operation.  re-writing any part that succeeded will
> be harmless.
>
> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
>  wrote:
> > Hi, Cassandra people!
> >
> > We're looking at Cassandra as a possible replacement for some parts of
> > our database structures, and on an early look I'm a bit confused about
> > atomicity guarantees and rollbacks and such, so I wanted to ask what
> > standard practice is for dealing with the sorts of situation I outline
> > below.
> >
> > Imagine that we're storing information about files. Each file has a path
> > and a uuid, and sometimes we need to look up stuff about a file by its
> > path and sometimes by its uuid. The best way to do this, as I understand
> > it, is to store the data in Cassandra twice: once indexed by nodeid and
> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
> >
> > {
> >  "some-uuid-1": {
> >"path": "/a/b/c",
> >"size": 10
> >  },
> >  "some-uuid-2" {
> >...
> >  },
> >  ...
> > }
> >
> > and one indexed by path
> >
> > {
> >  "/a/b/c": {
> >"uuid": "some-uuid-1",
> >"size": 10
> >  },
> >  "/d/e/f" {
> >...
> >  },
> >  ...
> > }
> >
> > So, first, do please correct me if I've misunderstood the terminology
> > here (and I've shown a "short form" of ColumnFamily here, as per
> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
> >
> > The thing I don't quite get is: what happens when I want to add a new
> > file? I need to add it to both these ColumnFamilies, but there's no "add
> > it to both" atomic operation. What's the way that people handle the
> > situation where I add to the first CF and then my program crashes, so I
> > never added to the second? (Assume that there is lots more data than
> > I've outlined above, so that "put it all in one SuperColumnFamily,
> > because that can be updated atomically" won't work because it would end
> > up with our entire database in one SCF). Should we add to one, and then
> > if we fail to add to the other for some reason continually retry until
> > it works? Have a "garbage collection" procedure which finds
> > discrepancies between indexes like this and fixes them up and run it
> > from cron? We'd love to hear some advice on how to do this, or if we're
> > modelling the data in the wrong way and there's a better way which
> > avoids these problems!
> >
> > sil
> >
> >
> >
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of Riptano, the source for professional Cassandra support
> http://riptano.com
>


Re: Understanding atomicity in Cassandra

2010-07-09 Thread Jonathan Ellis
typically you will update both as part of a batch_mutate, and if it
fails, retry the operation.  re-writing any part that succeeded will
be harmless.
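
A minimal sketch of that retry in Python, assuming a thin hypothetical
wrapper exposing batch_mutate and raising TimeoutError on failure (mutation
layout and CF names are illustrative, not a specific driver's API):

import time

def write_file_entry(client, uuid, path, size, retries=3):
    ts = int(time.time() * 1e6)   # reuse the same timestamp on every retry
    mutations = {
        uuid: {"FilesByUuid": {"path": path, "size": str(size)}},
        path: {"FilesByPath": {"uuid": uuid, "size": str(size)}},
    }
    for attempt in range(retries):
        try:
            client.batch_mutate(mutations, timestamp=ts)
            return                 # both rows written (or harmlessly rewritten)
        except TimeoutError:
            continue               # safe to replay: same keys, same values
    raise RuntimeError("batch_mutate still failing after %d attempts" % retries)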

On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
 wrote:
> Hi, Cassandra people!
>
> We're looking at Cassandra as a possible replacement for some parts of
> our database structures, and on an early look I'm a bit confused about
> atomicity guarantees and rollbacks and such, so I wanted to ask what
> standard practice is for dealing with the sorts of situation I outline
> below.
>
> Imagine that we're storing information about files. Each file has a path
> and a uuid, and sometimes we need to look up stuff about a file by its
> path and sometimes by its uuid. The best way to do this, as I understand
> it, is to store the data in Cassandra twice: once indexed by nodeid and
> once by path. So, I have two ColumnFamilies, one indexed by uuid:
>
> {
>  "some-uuid-1": {
>    "path": "/a/b/c",
>    "size": 10
>  },
>  "some-uuid-2" {
>    ...
>  },
>  ...
> }
>
> and one indexed by path
>
> {
>  "/a/b/c": {
>    "uuid": "some-uuid-1",
>    "size": 10
>  },
>  "/d/e/f" {
>    ...
>  },
>  ...
> }
>
> So, first, do please correct me if I've misunderstood the terminology
> here (and I've shown a "short form" of ColumnFamily here, as per
> http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>
> The thing I don't quite get is: what happens when I want to add a new
> file? I need to add it to both these ColumnFamilies, but there's no "add
> it to both" atomic operation. What's the way that people handle the
> situation where I add to the first CF and then my program crashes, so I
> never added to the second? (Assume that there is lots more data than
> I've outlined above, so that "put it all in one SuperColumnFamily,
> because that can be updated atomically" won't work because it would end
> up with our entire database in one SCF). Should we add to one, and then
> if we fail to add to the other for some reason continually retry until
> it works? Have a "garbage collection" procedure which finds
> discrepancies between indexes like this and fixes them up and run it
> from cron? We'd love to hear some advice on how to do this, or if we're
> modelling the data in the wrong way and there's a better way which
> avoids these problems!
>
> sil
>
>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Understanding atomicity in Cassandra

2010-07-08 Thread Stuart Langridge
Hi, Cassandra people!

We're looking at Cassandra as a possible replacement for some parts of
our database structures, and on an early look I'm a bit confused about
atomicity guarantees and rollbacks and such, so I wanted to ask what
standard practice is for dealing with the sorts of situation I outline
below.

Imagine that we're storing information about files. Each file has a path
and a uuid, and sometimes we need to look up stuff about a file by its
path and sometimes by its uuid. The best way to do this, as I understand
it, is to store the data in Cassandra twice: once indexed by uuid and
once by path. So, I have two ColumnFamilies, one indexed by uuid:

{
  "some-uuid-1": {
"path": "/a/b/c",
"size": 10
  },
  "some-uuid-2" {
...
  },
  ...
}

and one indexed by path

{
  "/a/b/c": {
"uuid": "some-uuid-1",
"size": 10
  },
  "/d/e/f" {
...
  },
  ...
}

So, first, do please correct me if I've misunderstood the terminology
here (and I've shown a "short form" of ColumnFamily here, as per
http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
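
For what it's worth, a tiny sketch of the two lookups this layout is meant to
support (hypothetical get_row call and made-up CF names, just to show the
intent):

def file_by_uuid(client, uuid):
    return client.get_row("FilesByUuid", uuid)   # -> {"path": ..., "size": ...}

def file_by_path(client, path):
    return client.get_row("FilesByPath", path)   # -> {"uuid": ..., "size": ...}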

The thing I don't quite get is: what happens when I want to add a new
file? I need to add it to both these ColumnFamilies, but there's no "add
it to both" atomic operation. What's the way that people handle the
situation where I add to the first CF and then my program crashes, so I
never added to the second? (Assume that there is lots more data than
I've outlined above, so that "put it all in one SuperColumnFamily,
because that can be updated atomically" won't work because it would end
up with our entire database in one SCF). Should we add to one, and then
if we fail to add to the other for some reason continually retry until
it works? Have a "garbage collection" procedure which finds
discrepancies between indexes like this and fixes them up and run it
from cron? We'd love to hear some advice on how to do this, or if we're
modelling the data in the wrong way and there's a better way which
avoids these problems!

sil