from:"AJ"


On 7/12/2011 10:48 AM, Yang wrote:

for example,
coord writes record 1,2 ,3 ,4,5 in sequence
if u have replica A, B, C
currently A can have 1 , 3
B can have 1,3,4,
C can have 2345

by "prefix", I mean I want them to have only 1---n  where n is some
number  between 1 and 5,
for example A having 1,2,3
B having 1,2,3,4
C having 1,2,3,4,5

the way we enforce this prefix pattern is that
1) the leader is ensured to have everything that's sent out, otherwise
it's removed from leader position
2) non-leader replicas is guaranteed to receive a prefix, because of
FIFO of the connection between replica and coordinator, if this
connection breaks, replica must catchup from the authoritative source
of leader

there is one point I hand-waved a bit: there are many coordinators,
the "prefix" from each of them is different, still need to think about
this, worst case is that we need to force the traffic come from the
leader, which is less interesting because it's almost hbase then...



Are you saying:  All replicas will receive the value whether or not they 
actually own the key range for the value.  If a node is not a replica 
for a value, it will not store it, but it will still write it in it's 
transaction log as a backup in case the leader dies.  Is that right?





On Tue, Jul 12, 2011 at 7:37 AM, AJ  wrote:

Yang, I'm not sure I understand what you mean by "prefix of the HLog".
  Also, can you explain what failure scenario you are talking about?  The
major failure that I see is when the leader node confirms to the client a
successful local write, but then fails before the write can be replicated to
any other replica node.  But, then again, you also say that the leader does
not forward replicas in your idea; so it's not real clear.

I'm still trying to figure out how to make this work with normal Cass
operation.

aj

On 7/11/2011 3:48 PM, Yang wrote:

I'm not proposing any changes to be done, but this looks like a very
interesting topic for thought/hack/learning, so the following are only
for thought exercises 


HBase enforces a single write/read entry point, so you can achieve
strong consistency by writing/reading only one node.  but just writing
to one node exposes you to loss of data if that node fails. so the
region server HLog is replicated to 3 HDFS data nodes.  the
interesting thing here is that each replica sees a complete *prefix*
of the HLog: it won't miss a record, if a record sync() to a data node
fails, all the existing bytes in the block are replicated to a new
data node.

if we employ a similar "leader" node among the N replicas of
cassandra (coordinator always waits for the reply from leader, but
leader does not do further replication like in HBase or counters), the
leader sees all writes onto the key range, but the other replicas
could miss some writes, as a result, each of the non-leader replicas'
write history has some "holes", so when the leader dies, and when we
elect a new one, no one is going to have a complete history. so you'd
have to do a repair amongst all the replicas to reconstruct the full
history, which is slow.

it seems possible that we could utilize the FIFO property of the
InComingTCPConnection to simplify history reconstruction, just like
Zookeeper. if the IncomingTcpConnection of a replica fails, that means
that it may have missed some edits, then when it reconnects, we force
it to talk to the active leader first, to catch up to date. when the
leader dies, the next leader is elected to be the replica with the
most recent history.  by maintaining the property that each node has a
complete prefix of history, we only need to catch up on the tail of
history, and avoid doing a complete repair on the entire
memtable+SStable.  but one issue is that the history at the leader has
to be kept really long - if a non-leader replica goes off for 2
days, the leader has to keep all the history for 2 days to feed them
to the replica when it comes back online. but possibly this could be
limited to some max length so that over that length, the woken replica
simply does a complete bootstrap.


thanks
yang
On Sun, Jul 3, 2011 at 8:25 PM, AJwrote:

We seem to be having a fundamental misunderstanding.  Thanks for your
comments. aj

On 7/3/2011 8:28 PM, William Oberman wrote:

I'm using cassandra as a tool, like a black box with a certain contract
to
the world.  Without modifying the "core", C* will send the updates to all
replicas, so your plan would cause the extra write (for the placeholder).
  I
wasn't assuming a modification to how C* fundamentally works.
Sounds like you are hacking (or at least looking) at the source, so all
the
power to you if/when you try these kind of changes.
will
On Sun, Jul 3, 2011 at 8:45 PM, AJwrote:

On 7/3/2011 6:32 PM, William Oberman wrote:

Was just going off of: " Send the value to the primary replica and send
placehold

Re: Anyone using Facebook's flashcache?


On 7/12/2011 9:02 PM, Peter Schuller wrote:

Thanks Peter, but... hmmm, are you saying that even after a cache miss which
results in a disk read and blocks being moved to the ssd, that by the next
cache miss for the same data and subsequent same file blocks, that the ssd
is unlikely to have those same blocks present anymore?

I am saying that regardless of whether the cache is memory, ssd, a
combination of both, or anything else, most workloads tend to be
subject to diminishing returns. Doubling cache from 5 gb to 10 gb
might get you from 10% to 50% cache hit ratio, but doubling again to
20 gb might get you to 60% and doubling to 40 gig to 65% (to use some
completely arbitrary random numbers for demonstration purposes).

The reason a cache can be more effective than the ratio of its size
vs. the total data set, is that there is a hotspot/working set that is
smaller than the total data set. If you have completely random access
this won't be the case, and an cache of size n% of total size will
give you a n% cache hit ratio.

But for most workloads, you have a hotter working set so you get more
bang for the buck when caching. For example, if 99% of all accesses
are accessing 10% of the data, then a cache that is the size of 10% of
the data gets you 99% cache hit ratio. But clearly no matter how much
more cache you ever add, you will never ever cache more than 100% of
reads so in this (artificial arbitrary) scenario, once you're caching
10% of your data the cost of cachine the final small percent of
accesses might be 10 times that of the original cache.

I did a quick Google but didn't find a good piece describing it more
properly, but hopefully the above is helpful. Some related reading
might be http://en.wikipedia.org/wiki/Long_Tail



Of course.  Thanks for the clarification.  On the positive side, this 
flashcache and other solutions like it could be beneficial for all disk 
i/o on the system.  Writes will always benefit.  Reads, only if they are 
read again before being pushed out by other reads.  I wonder if it would 
help to "prime" the ssd by reading in (and discarding) the top 25% 
(250/1000GB) of the usual hot data.


aj

Re: Anyone using Facebook's flashcache?


On 7/12/2011 10:19 AM, Peter Schuller wrote:

Do any Cass developers have any thoughts on this and whether or not it would
be helpful considering Cass' architecture and operation?

A well-functioning L2 cache should definitely be very useful with
Cassandra for read-intensive workloads where the request distribution
is such that additional caching will be beneficial. However, it will
depend in any particular case on how the L2 cache works, and what your
request distribution is like.

I have been wanting to try flashcache but haven't yet, so I cannot speak to it.

In particular though, keep in mind that if you've got say 1 tb of data
and your memory is enough to keep the hot set, and you're disk I/O is
coming form the long tail, increasing the amount of cache to 200 gig
may not necessarily give you a huge improvement in terms of
percentages.


Thanks Peter, but... hmmm, are you saying that even after a cache miss 
which results in a disk read and blocks being moved to the ssd, that by 
the next cache miss for the same data and subsequent same file blocks, 
that the ssd is unlikely to have those same blocks present anymore?

Anyone using Facebook's flashcache?

With big data requirements pressuring me to pack up to a terabyte on one 
node, I suspect that even 32 GB of RAM just will not be large enough for 
Cass' various memory caches to be effective.  32/1000 is a tiny working 
set to data store ratio... even assuming non-random reads.  So, I'm 
investigating whether or not a 256 GB SSD used as a cache between the 
data HDD and the Cass server process.  It won't decrease cache misses, 
but at least the access time would be orders of magnitude faster than 
from the hdd.  Also, write performance is improved because of lazy flushing.


Do any Cass developers have any thoughts on this and whether or not it 
would be helpful considering Cass' architecture and operation?


Links:
http://www.facebook.com/note.php?note_id=388112370932
https://github.com/facebook/flashcache/wiki

aj

Re: Strong Consistency with ONE read/writes

Yang, I'm not sure I understand what you mean by "prefix of the HLog".  
Also, can you explain what failure scenario you are talking about?  The 
major failure that I see is when the leader node confirms to the client 
a successful local write, but then fails before the write can be 
replicated to any other replica node.  But, then again, you also say 
that the leader does not forward replicas in your idea; so it's not real 
clear.


I'm still trying to figure out how to make this work with normal Cass 
operation.


aj

On 7/11/2011 3:48 PM, Yang wrote:

I'm not proposing any changes to be done, but this looks like a very
interesting topic for thought/hack/learning, so the following are only
for thought exercises 


HBase enforces a single write/read entry point, so you can achieve
strong consistency by writing/reading only one node.  but just writing
to one node exposes you to loss of data if that node fails. so the
region server HLog is replicated to 3 HDFS data nodes.  the
interesting thing here is that each replica sees a complete *prefix*
of the HLog: it won't miss a record, if a record sync() to a data node
fails, all the existing bytes in the block are replicated to a new
data node.

if we employ a similar "leader" node among the N replicas of
cassandra (coordinator always waits for the reply from leader, but
leader does not do further replication like in HBase or counters), the
leader sees all writes onto the key range, but the other replicas
could miss some writes, as a result, each of the non-leader replicas'
write history has some "holes", so when the leader dies, and when we
elect a new one, no one is going to have a complete history. so you'd
have to do a repair amongst all the replicas to reconstruct the full
history, which is slow.

it seems possible that we could utilize the FIFO property of the
InComingTCPConnection to simplify history reconstruction, just like
Zookeeper. if the IncomingTcpConnection of a replica fails, that means
that it may have missed some edits, then when it reconnects, we force
it to talk to the active leader first, to catch up to date. when the
leader dies, the next leader is elected to be the replica with the
most recent history.  by maintaining the property that each node has a
complete prefix of history, we only need to catch up on the tail of
history, and avoid doing a complete repair on the entire
memtable+SStable.  but one issue is that the history at the leader has
to be kept really long - if a non-leader replica goes off for 2
days, the leader has to keep all the history for 2 days to feed them
to the replica when it comes back online. but possibly this could be
limited to some max length so that over that length, the woken replica
simply does a complete bootstrap.


thanks
yang
On Sun, Jul 3, 2011 at 8:25 PM, AJ  wrote:

We seem to be having a fundamental misunderstanding.  Thanks for your
comments. aj

On 7/3/2011 8:28 PM, William Oberman wrote:

I'm using cassandra as a tool, like a black box with a certain contract to
the world.  Without modifying the "core", C* will send the updates to all
replicas, so your plan would cause the extra write (for the placeholder).  I
wasn't assuming a modification to how C* fundamentally works.
Sounds like you are hacking (or at least looking) at the source, so all the
power to you if/when you try these kind of changes.
will
On Sun, Jul 3, 2011 at 8:45 PM, AJ  wrote:

On 7/3/2011 6:32 PM, William Oberman wrote:

Was just going off of: " Send the value to the primary replica and send
placeholder values to the other replicas".  Sounded like you wanted to write
the value to one, and write the placeholder to N-1 to me.

Yes, that is what I was suggesting.  The point of the placeholders is to
handle the crash case that I talked about... "like" a WAL does.

But, C* will propagate the value to N-1 eventually anyways, 'cause that's
just what it does anyways :-)
will

On Sun, Jul 3, 2011 at 7:47 PM, AJ  wrote:

On 7/3/2011 3:49 PM, Will Oberman wrote:

Why not send the value itself instead of a placeholder?  Now it takes 2x
writes on a random node to do a single update (write placeholder, write
update) and N*x writes from the client (write value, write placeholder to
N-1). Where N is replication factor.  Seems like extra network and IO
instead of less...

To send the value to each node is 1.) unnecessary, 2.) will only cause a
large burst of network traffic.  Think about if it's a large data value,
such as a document.  Just let C* do it's thing.  The extra messages are tiny
and doesn't significantly increase latency since they are all sent
asynchronously.


Of course, I still think this sounds like reimplementing Cassandra
internals in a Cassandra client (just guessing, I'm not a cassandra dev)

I don't see how.  Maybe you should take a peek at the source.


On Jul 3, 2011, at 5:20 PM,

Feature Request: Multi-key Mapping

2011-07-10 Thread AJ

I think this would be another powerful feature by making it so much 
easier when faced with records/objects that can have multiple unique 
keys where both are not always used.  You wouldn't have to use secondary 
indexes which really aren't suitable for high cardinality (high 
uniqueness) indexes and better suited for range queries.


Of course, some indirection would be needed to avoid the naive solution 
of simply duplicating values.


Maybe Unix inodes is the best analogy here.

aj

Re: Command Request: rename a column

2011-07-08 Thread AJ


On 7/8/2011 2:18 AM, Sylvain Lebresne wrote:

On Fri, Jul 8, 2011 at 9:22 AM, AJ  wrote:

I think it would be really cool to be able to rename a column, or, more
generally, a move command to move data from one column to another in the
same CF without the client having to read and resend the column value.  This
would be extremely powerful, imo.  I suspect the execution would be quick
and could even be made atomic (per node) as I suspect it would mostly entail
only reference updates.

Cassandra don't work like that. We would have no other choice than to read the
column and write it back with a different name


I figured as much :)  Not that bad though.


  (and it would not be atomic). So
the only win we would get from doing this server side would lie in not
transferring
the value across the network.



That would be the main benefit I think, esp with large values.


--
Sylvain

Command Request: rename a column

2011-07-08 Thread AJ



I think it would be really cool to be able to rename a column, or, more 
generally, a move command to move data from one column to another in the 
same CF without the client having to read and resend the column value.  
This would be *extremely* powerful, imo.  I suspect the execution would 
be quick and could even be made atomic (per node) as I suspect it would 
mostly entail only reference updates.  Has anything like this been 
discussed before?  Seems like such a natural operation for a 
hash-table-like data store.


aj

Re: Strong Consistency with ONE read/writes

We seem to be having a fundamental misunderstanding.  Thanks for your 
comments. aj


On 7/3/2011 8:28 PM, William Oberman wrote:
I'm using cassandra as a tool, like a black box with a certain 
contract to the world.  Without modifying the "core", C* will send the 
updates to all replicas, so your plan would cause the extra write (for 
the placeholder).  I wasn't assuming a modification to how C* 
fundamentally works.


Sounds like you are hacking (or at least looking) at the source, so 
all the power to you if/when you try these kind of changes.


will

On Sun, Jul 3, 2011 at 8:45 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


On 7/3/2011 6:32 PM, William Oberman wrote:

Was just going off of: " Send the value to the primary replica
and send placeholder values to the other replicas".  Sounded like
you wanted to write the value to one, and write the placeholder
to N-1 to me.


Yes, that is what I was suggesting.  The point of the placeholders
is to handle the crash case that I talked about... "like" a WAL does.



But, C* will propagate the value to N-1 eventually anyways,
'cause that's just what it does anyways :-)

will

On Sun, Jul 3, 2011 at 7:47 PM, AJ mailto:a...@dude.podzone.net>> wrote:

On 7/3/2011 3:49 PM, Will Oberman wrote:

Why not send the value itself instead of a placeholder?  Now
it takes 2x writes on a random node to do a single update
(write placeholder, write update) and N*x writes from the
client (write value, write placeholder to N-1). Where N is
replication factor.  Seems like extra network and IO instead
of less...


To send the value to each node is 1.) unnecessary, 2.) will
only cause a large burst of network traffic.  Think about if
it's a large data value, such as a document.  Just let C* do
it's thing.  The extra messages are tiny and doesn't
significantly increase latency since they are all sent
asynchronously.



Of course, I still think this sounds like reimplementing
Cassandra internals in a Cassandra client (just guessing,
I'm not a cassandra dev)



I don't see how.  Maybe you should take a peek at the source.




On Jul 3, 2011, at 5:20 PM, AJ mailto:a...@dude.podzone.net>> wrote:


Yang,

How would you deal with the problem when the 1st node
responds success but then crashes before completely
forwarding any replicas?  Then, after switching to the next
primary, a read would return stale data.

Here's a quick-n-dirty way:  Send the value to the primary
replica and send placeholder values to the other replicas. 
The placeholder value is something like, "PENDING_UPDATE". 
The placeholder values are sent with timestamps 1 less than

the timestamp for the actual value that went to the
primary.  Later, when the changes propagate, the actual
values will overwrite the placeholders.  In event of a
crash before the placeholder gets overwritten, the next
read value will tell the client so.  The client will report
to the user that the key/column is unavailable.  The
downside is you've overwritten your data and maybe would
like to know what the old data was!  But, maybe there's
another way using other columns or with MVCC.  The client
would want a success from the primary and the secondary
replicas to be certain of future read consistency in case
the primary goes down immediately as I said above.  The
ability to set an "update_pending" flag on any column value
would probably make this work.  But, I'll think more on
this later.

aj

On 7/2/2011 10:55 AM, Yang wrote:

there is a JIRA completed in 0.7.x that "Prefers" a
certain node in snitch, so this does roughly what you want
MOST of the time


but the problem is that it does not GUARANTEE that the
same node will always be read.  I recently read into the
HBase vs Cassandra comparison thread that started after
Facebook dropped Cassandra for their messaging system, and
understood some of the differences. what you want is
essentially what HBase does. the fundamental difference
there is really due to the gossip protocol: it's a
probablistic, or eventually consistent failure detector
 while HBase/Google Bigtable use Zookeeper/Chubby to
provide a strong failure detector (a distributed lock).
 so in HBase, if a tablet server goes down, it really goes
down, it can not re-grab the tablet from the new tablet
server without going through a start up protocol
(notifying the m

Re: Strong Consistency with ONE read/writes


On 7/3/2011 6:32 PM, William Oberman wrote:
Was just going off of: " Send the value to the primary replica and 
send placeholder values to the other replicas".  Sounded like you 
wanted to write the value to one, and write the placeholder to N-1 to me.


Yes, that is what I was suggesting.  The point of the placeholders is to 
handle the crash case that I talked about... "like" a WAL does.


But, C* will propagate the value to N-1 eventually anyways, 'cause 
that's just what it does anyways :-)


will

On Sun, Jul 3, 2011 at 7:47 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


On 7/3/2011 3:49 PM, Will Oberman wrote:

Why not send the value itself instead of a placeholder?  Now it
takes 2x writes on a random node to do a single update (write
placeholder, write update) and N*x writes from the client (write
value, write placeholder to N-1). Where N is replication factor.
 Seems like extra network and IO instead of less...


To send the value to each node is 1.) unnecessary, 2.) will only
cause a large burst of network traffic.  Think about if it's a
large data value, such as a document.  Just let C* do it's thing. 
The extra messages are tiny and doesn't significantly increase

latency since they are all sent asynchronously.



Of course, I still think this sounds like reimplementing
Cassandra internals in a Cassandra client (just guessing, I'm not
a cassandra dev)



I don't see how.  Maybe you should take a peek at the source.




On Jul 3, 2011, at 5:20 PM, AJ mailto:a...@dude.podzone.net>> wrote:


Yang,

How would you deal with the problem when the 1st node responds
success but then crashes before completely forwarding any
replicas?  Then, after switching to the next primary, a read
would return stale data.

Here's a quick-n-dirty way:  Send the value to the primary
replica and send placeholder values to the other replicas.  The
placeholder value is something like, "PENDING_UPDATE".  The
placeholder values are sent with timestamps 1 less than the
timestamp for the actual value that went to the primary.  Later,
when the changes propagate, the actual values will overwrite the
placeholders.  In event of a crash before the placeholder gets
overwritten, the next read value will tell the client so.  The
client will report to the user that the key/column is
unavailable.  The downside is you've overwritten your data and
maybe would like to know what the old data was!  But, maybe
there's another way using other columns or with MVCC.  The
client would want a success from the primary and the secondary
replicas to be certain of future read consistency in case the
primary goes down immediately as I said above.  The ability to
set an "update_pending" flag on any column value would probably
make this work.  But, I'll think more on this later.

aj

On 7/2/2011 10:55 AM, Yang wrote:

there is a JIRA completed in 0.7.x that "Prefers" a certain
node in snitch, so this does roughly what you want MOST of the
time


but the problem is that it does not GUARANTEE that the same
node will always be read.  I recently read into the HBase vs
Cassandra comparison thread that started after Facebook dropped
Cassandra for their messaging system, and understood some of
the differences. what you want is essentially what HBase does.
the fundamental difference there is really due to the gossip
protocol: it's a probablistic, or eventually consistent failure
detector  while HBase/Google Bigtable use Zookeeper/Chubby to
provide a strong failure detector (a distributed lock).  so in
HBase, if a tablet server goes down, it really goes down, it
can not re-grab the tablet from the new tablet server without
going through a start up protocol (notifying the master, which
would notify the clients etc),  in other words it is guaranteed
that one tablet is served by only one tablet server at any
given time.  in comparison the above JIRA only TRYIES to serve
that key from one particular replica. HBase can have that
guarantee because the group membership is maintained by the
strong failure detector.

just for hacking curiosity, a strong failure detector +
Cassandra replicas is not impossible (actually seems not
difficult), although the performance is not clear. what would
such a strong failure detector bring to Cassandra besides this
ONE-ONE strong consistency ? that is an interesting question I
think.

considering that HBase has been deployed on big clusters, it is
probably OK with the performance of the strong  Zookeeper
failure detector. then a further question was: why did Dynamo
originally choose to use the probablistic failure detector? ye

Re: Strong Consistency with ONE read/writes


On 7/3/2011 4:07 PM, Yang wrote:


I'm no expert. So addressing the question to me probably give you real 
answers :)


The single entry mode makes sure that all writes coming through the 
leader are received by replicas before ack to client. Probably wont be 
stale data




That doesn't sound any different than a TWO write.  I'm trying to save a 
hop (+ 1 data xfer) by ack'ing immediately after the primary 
successfully writes, i.e., ONE write.


On Jul 3, 2011 11:20 AM, "AJ" <mailto:a...@dude.podzone.net>> wrote:

> Yang,
>
> How would you deal with the problem when the 1st node responds success
> but then crashes before completely forwarding any replicas? Then, after
> switching to the next primary, a read would return stale data.
>
> Here's a quick-n-dirty way: Send the value to the primary replica and
> send placeholder values to the other replicas. The placeholder value is
> something like, "PENDING_UPDATE". The placeholder values are sent with
> timestamps 1 less than the timestamp for the actual value that went to
> the primary. Later, when the changes propagate, the actual values will
> overwrite the placeholders. In event of a crash before the placeholder
> gets overwritten, the next read value will tell the client so. The
> client will report to the user that the key/column is unavailable. The
> downside is you've overwritten your data and maybe would like to know
> what the old data was! But, maybe there's another way using other
> columns or with MVCC. The client would want a success from the primary
> and the secondary replicas to be certain of future read consistency in
> case the primary goes down immediately as I said above. The ability to
> set an "update_pending" flag on any column value would probably make
> this work. But, I'll think more on this later.
>
> aj
>
> On 7/2/2011 10:55 AM, Yang wrote:
>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>> snitch, so this does roughly what you want MOST of the time
>>
>>
>> but the problem is that it does not GUARANTEE that the same node will
>> always be read. I recently read into the HBase vs Cassandra
>> comparison thread that started after Facebook dropped Cassandra for
>> their messaging system, and understood some of the differences. what
>> you want is essentially what HBase does. the fundamental difference
>> there is really due to the gossip protocol: it's a probablistic, or
>> eventually consistent failure detector while HBase/Google Bigtable
>> use Zookeeper/Chubby to provide a strong failure detector (a
>> distributed lock). so in HBase, if a tablet server goes down, it
>> really goes down, it can not re-grab the tablet from the new tablet
>> server without going through a start up protocol (notifying the
>> master, which would notify the clients etc), in other words it is
>> guaranteed that one tablet is served by only one tablet server at any
>> given time. in comparison the above JIRA only TRYIES to serve that
>> key from one particular replica. HBase can have that guarantee because
>> the group membership is maintained by the strong failure detector.
>>
>> just for hacking curiosity, a strong failure detector + Cassandra
>> replicas is not impossible (actually seems not difficult), although
>> the performance is not clear. what would such a strong failure
>> detector bring to Cassandra besides this ONE-ONE strong consistency ?
>> that is an interesting question I think.
>>
>> considering that HBase has been deployed on big clusters, it is
>> probably OK with the performance of the strong Zookeeper failure
>> detector. then a further question was: why did Dynamo originally
>> choose to use the probablistic failure detector? yes Dynamo's main
>> theme is "eventually consistent", so the Phi-detector is **enough**,
>> but if a strong detector buys us more with little cost, wouldn't that
>> be great?
>>
>>
>>
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <mailto:a...@dude.podzone.net>

>> <mailto:a...@dude.podzone.net <mailto:a...@dude.podzone.net>>> wrote:
>>
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same
>> node from a client. It seems the only thing needed is to allow
>> the clients to compute which node is the closes replica for the
>> given key using the same algorithm C* uses. When the first
>> replica receives the write request, it will write to itself which
>> should complete before any of the other replicas and then return.
>> The loads should still stay balanced if u

Re: Strong Consistency with ONE read/writes


On 7/3/2011 3:49 PM, Will Oberman wrote:
Why not send the value itself instead of a placeholder?  Now it takes 
2x writes on a random node to do a single update (write placeholder, 
write update) and N*x writes from the client (write value, write 
placeholder to N-1). Where N is replication factor.  Seems like extra 
network and IO instead of less...


To send the value to each node is 1.) unnecessary, 2.) will only cause a 
large burst of network traffic.  Think about if it's a large data value, 
such as a document.  Just let C* do it's thing.  The extra messages are 
tiny and doesn't significantly increase latency since they are all sent 
asynchronously.


Of course, I still think this sounds like reimplementing Cassandra 
internals in a Cassandra client (just guessing, I'm not a cassandra dev)




I don't see how.  Maybe you should take a peek at the source.



On Jul 3, 2011, at 5:20 PM, AJ <mailto:a...@dude.podzone.net>> wrote:



Yang,

How would you deal with the problem when the 1st node responds 
success but then crashes before completely forwarding any replicas?  
Then, after switching to the next primary, a read would return stale 
data.


Here's a quick-n-dirty way:  Send the value to the primary replica 
and send placeholder values to the other replicas.  The placeholder 
value is something like, "PENDING_UPDATE".  The placeholder values 
are sent with timestamps 1 less than the timestamp for the actual 
value that went to the primary.  Later, when the changes propagate, 
the actual values will overwrite the placeholders.  In event of a 
crash before the placeholder gets overwritten, the next read value 
will tell the client so.  The client will report to the user that the 
key/column is unavailable.  The downside is you've overwritten your 
data and maybe would like to know what the old data was!  But, maybe 
there's another way using other columns or with MVCC.  The client 
would want a success from the primary and the secondary replicas to 
be certain of future read consistency in case the primary goes down 
immediately as I said above.  The ability to set an "update_pending" 
flag on any column value would probably make this work.  But, I'll 
think more on this later.


aj

On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a certain node in 
snitch, so this does roughly what you want MOST of the time



but the problem is that it does not GUARANTEE that the same node 
will always be read.  I recently read into the HBase vs Cassandra 
comparison thread that started after Facebook dropped Cassandra for 
their messaging system, and understood some of the differences. what 
you want is essentially what HBase does. the fundamental difference 
there is really due to the gossip protocol: it's a probablistic, or 
eventually consistent failure detector  while HBase/Google Bigtable 
use Zookeeper/Chubby to provide a strong failure detector (a 
distributed lock).  so in HBase, if a tablet server goes down, it 
really goes down, it can not re-grab the tablet from the new tablet 
server without going through a start up protocol (notifying the 
master, which would notify the clients etc),  in other words it is 
guaranteed that one tablet is served by only one tablet server at 
any given time.  in comparison the above JIRA only TRYIES to serve 
that key from one particular replica. HBase can have that guarantee 
because the group membership is maintained by the strong failure 
detector.


just for hacking curiosity, a strong failure detector + Cassandra 
replicas is not impossible (actually seems not difficult), although 
the performance is not clear. what would such a strong failure 
detector bring to Cassandra besides this ONE-ONE strong consistency 
? that is an interesting question I think.


considering that HBase has been deployed on big clusters, it is 
probably OK with the performance of the strong  Zookeeper failure 
detector. then a further question was: why did Dynamo originally 
choose to use the probablistic failure detector? yes Dynamo's main 
theme is "eventually consistent", so the Phi-detector is **enough**, 
but if a strong detector buys us more with little cost, wouldn't 
that  be great?




On Fri, Jul 1, 2011 at 6:53 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


Is this possible?

All reads and writes for a given key will always go to the same
node from a client.  It seems the only thing needed is to allow
the clients to compute which node is the closes replica for the
given key using the same algorithm C* uses.  When the first
replica receives the write request, it will write to itself
which should complete before any of the other replicas and then
return.  The loads should still stay balanced if using random
partitioner.  If the first replica becomes unavailable (however
that is defin

Re: Strong Consistency with ONE read/writes


Yang,

How would you deal with the problem when the 1st node responds success 
but then crashes before completely forwarding any replicas?  Then, after 
switching to the next primary, a read would return stale data.


Here's a quick-n-dirty way:  Send the value to the primary replica and 
send placeholder values to the other replicas.  The placeholder value is 
something like, "PENDING_UPDATE".  The placeholder values are sent with 
timestamps 1 less than the timestamp for the actual value that went to 
the primary.  Later, when the changes propagate, the actual values will 
overwrite the placeholders.  In event of a crash before the placeholder 
gets overwritten, the next read value will tell the client so.  The 
client will report to the user that the key/column is unavailable.  The 
downside is you've overwritten your data and maybe would like to know 
what the old data was!  But, maybe there's another way using other 
columns or with MVCC.  The client would want a success from the primary 
and the secondary replicas to be certain of future read consistency in 
case the primary goes down immediately as I said above.  The ability to 
set an "update_pending" flag on any column value would probably make 
this work.  But, I'll think more on this later.


aj

On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a certain node in 
snitch, so this does roughly what you want MOST of the time



but the problem is that it does not GUARANTEE that the same node will 
always be read.  I recently read into the HBase vs Cassandra 
comparison thread that started after Facebook dropped Cassandra for 
their messaging system, and understood some of the differences. what 
you want is essentially what HBase does. the fundamental difference 
there is really due to the gossip protocol: it's a probablistic, or 
eventually consistent failure detector  while HBase/Google Bigtable 
use Zookeeper/Chubby to provide a strong failure detector (a 
distributed lock).  so in HBase, if a tablet server goes down, it 
really goes down, it can not re-grab the tablet from the new tablet 
server without going through a start up protocol (notifying the 
master, which would notify the clients etc),  in other words it is 
guaranteed that one tablet is served by only one tablet server at any 
given time.  in comparison the above JIRA only TRYIES to serve that 
key from one particular replica. HBase can have that guarantee because 
the group membership is maintained by the strong failure detector.


just for hacking curiosity, a strong failure detector + Cassandra 
replicas is not impossible (actually seems not difficult), although 
the performance is not clear. what would such a strong failure 
detector bring to Cassandra besides this ONE-ONE strong consistency ? 
that is an interesting question I think.


considering that HBase has been deployed on big clusters, it is 
probably OK with the performance of the strong  Zookeeper failure 
detector. then a further question was: why did Dynamo originally 
choose to use the probablistic failure detector? yes Dynamo's main 
theme is "eventually consistent", so the Phi-detector is **enough**, 
but if a strong detector buys us more with little cost, wouldn't that 
 be great?




On Fri, Jul 1, 2011 at 6:53 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


Is this possible?

All reads and writes for a given key will always go to the same
node from a client.  It seems the only thing needed is to allow
the clients to compute which node is the closes replica for the
given key using the same algorithm C* uses.  When the first
replica receives the write request, it will write to itself which
should complete before any of the other replicas and then return.
 The loads should still stay balanced if using random partitioner.
 If the first replica becomes unavailable (however that is
defined), then the clients can send to the next repilca in the
ring and switch from ONE write/reads to QUORUM write/reads
temporarily until the first replica becomes available again.
 QUORUM is required since there could be some replicas that were
not updated after the first replica went down.

Will this work?  The goal is to have strong consistency with a
read/write consistency level as low as possible while secondarily
a network performance boost.

Re: Strong Consistency with ONE read/writes

2011-07-02 Thread AJ


On 7/2/2011 6:03 AM, William Oberman wrote:
Ok, I see the "you happen to choose the 'right' node" idea, but it 
sounds like you want to solve "C* problems" in the client, and they 
already wrote that complicated code to make clients simple.   You're 
talking about reimplementing key<->node mappings, network topology 
(with failures), etc...  Plus, if they change something about 
replication and you get too tricky, your code breaks.  Or, if they 
optimize something, you might not benefit.




I'm only asking if this is possible working within the current design 
and architecture and if not, then why.  I'm not interested in a hack; 
just exploring possibilities.

Re: Strong Consistency with ONE read/writes

2011-07-02 Thread AJ

Yang, you seem to understand all of the details, at least the details 
that have occurred to me, such as having a failure protocol rather than 
a perfect failure detector and new leader coordination.

I finally did some more reading outside of Cassandra space and realized 
HBase has what I was asking about.  If Cass could be flexible enough to 
allow such a setup without violating it's goals, that would be great, imho.

This thread is just a brainstorming exploratory thread (by a non-expert) 
based on a simplistic observation that, if all clients went directly to 
the responsible replica every time, then performance and consistency can 
be increased by:

- providing guaranteed monotonic reads/writes consistency
- read-your-writes consistency
- higher performance (less latency)

all with only a read/write of ONE.

Basically, it's like a mater/slave setup except that the slaves can 
take-over as master, so you still have high availability.

I'm not saying it's easy and I'm only coming at this from a customer 
request point of view.  The question is, would this be useful if it 
could be added to Cass's bag of tricks?  Cass is already a hybrid.

aj

On 7/2/2011 1:57 PM, Yang wrote:

Jonathan:

could you please elaborate more on specifically why they are "not even 
close"?
 --- I kind of see what you mean (please correct me if I 
misunderstood): Cassandra failure detector
is consulted on every write; while HBase failure detector is only used 
when the tablet server joins or leaves.

 in order to have the single write entry point approach originally 
brought up in this thread,
I think you need a strong membership protocol to lock on the key range 
 leadership, once leadership is acquired,

failure detectors do not need to be consulted on every write.

yes by definition of the original requirement brought up in this thread,
Cassandra's write behavior is going to be changed, to be more like 
Hbase, and mongo in "replica set" mode. but
it seems that this leader mode can even co-exist with the multi-entry 
write mode that Cassandra uses now, just as
you can use different CL for each single write request.  in that case 
you would need to keep both the current lightweight Phi-detector

and add the ZK for leader election for single-entry mode write.

Thanks
Yang

(I should correct my terminology  it's not a "strong failure 
detector" that's needed, it's a "strong membership protocol". strongly 
complete and accurate failure detectors do not exist in
async distributed systems (Tushar Chandra  "Unreliable Failure 
Detectors for Reliable Distributed Systems, Journal of the ACM, 
43(2):225-267, 1996 <http://doi.acm.org/10.1145/226643.226647>"  and 
FLP "Impossibility of  Distributed Consensus with One Faulty Process 
<http://www.podc.org/influential/2001.html>" )  )

On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis <mailto:jbel...@gmail.com>> wrote:

The way HBase uses ZK (for master election) is not even close to how
Cassandra uses the failure detector.

Using ZK for each operation would (a) not scale and (b) not work
cross-DC for any reasonable latency requirements.

On Sat, Jul 2, 2011 at 11:55 AM, Yang mailto:tedd...@gmail.com>> wrote:
> there is a JIRA completed in 0.7.x that "Prefers" a certain node
in snitch,
> so this does roughly what you want MOST of the time
>
> but the problem is that it does not GUARANTEE that the same node
will always
> be read.  I recently read into the HBase vs Cassandra comparison
thread that
> started after Facebook dropped Cassandra for their messaging
system, and
> understood some of the differences. what you want is essentially
what HBase
> does. the fundamental difference there is really due to the
gossip protocol:
> it's a probablistic, or eventually consistent failure detector
 while
> HBase/Google Bigtable use Zookeeper/Chubby to provide a strong
failure
> detector (a distributed lock).  so in HBase, if a tablet server
goes down,
> it really goes down, it can not re-grab the tablet from the new
tablet
> server without going through a start up protocol (notifying the
master,
> which would notify the clients etc),  in other words it is
guaranteed that
> one tablet is served by only one tablet server at any given
time.  in
> comparison the above JIRA only TRYIES to serve that key from one
particular
> replica. HBase can have that guarantee because the group
membership is
> maintained by the strong failure detector.
> just for hacking curiosity, a strong failure detector +
Cassandra replicas
> is not impossible (actually seems not difficult), although the
performance
> is not clear. wh

Re: Strong Consistency with ONE read/writes

2011-07-01 Thread AJ

I'm saying I will make my clients forward the C* requests to the first replica 
instead of forwarding to a random node.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Will Oberman  wrote:

Sent from my iPhone

On Jul 1, 2011, at 9:53 PM, AJ  wrote:

> Is this possible?
>
> All reads and writes for a given key will always go to the same node
> from a client.

I don't think that's true. Given a key K, the client will write to N
nodes (N=replication factor). And at consistency level ONE the client
will return after 1 "ack" (from the N writes).

Strong Consistency with ONE read/writes

2011-07-01 Thread AJ


Is this possible?

All reads and writes for a given key will always go to the same node 
from a client.  It seems the only thing needed is to allow the clients 
to compute which node is the closes replica for the given key using the 
same algorithm C* uses.  When the first replica receives the write 
request, it will write to itself which should complete before any of the 
other replicas and then return.  The loads should still stay balanced if 
using random partitioner.  If the first replica becomes unavailable 
(however that is defined), then the clients can send to the next repilca 
in the ring and switch from ONE write/reads to QUORUM write/reads 
temporarily until the first replica becomes available again.  QUORUM is 
required since there could be some replicas that were not updated after 
the first replica went down.


Will this work?  The goal is to have strong consistency with a 
read/write consistency level as low as possible while secondarily a 
network performance boost.

Re: Cassandra ACID

2011-07-01 Thread AJ


On 6/30/2011 1:57 PM, Jeremiah Jordan wrote:
For your Consistency case, it is actually an ALL read that is needed, 
not an ALL write.  ALL read, with what ever consistency level of write 
that you need (to support machines dyeing) is the only way to get 
consistent results in the face of a failed write which was at > 
ONE that went to one node, but not the others.




True, an ALL read is the best and final test for consistency for that 
read.  I think an ALL write is more of a preemptive measure.  If you 
know you'll be needing consistency later, better to get it in while you 
can.  But, this leads to a whole other set of complex topics.  I like 
the flexibility, however.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that one 
specific row atomically.  If part of the single-key batch update fails, 
then all of the updates will be reverted since they all pertained to one 
key/row.  Notice, I said 'reverted' not 'rolled back'.  Note: atomicity 
and isolation are related to the topic of transactions but one does not 
imply the other.  Even though row updates are atomic, they are not 
isolated from other users' updates or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
Cassandra does not provide the same scope of Consistency as defined in 
the ACID standard.  Consistency in C* does not include referential 
integrity since C* is not a relational database.  Any referential 
integrity required would have to be handled by the client.  Also, even 
though the official docs say that QUORUM writes/reads is the minimal 
consistency_level setting to guarantee full consistency, this assumes 
that the write preceding the read does not fail (see comments below).  
What to do in this case is not fully understood by this author.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview

*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make sense 
depending on your application.  Note: this doesn't mean to say that two 
updates of the same column will be corrupted, obviously; columns are the 
smallest atomic unit ('atomic' in the more general thread-safe context).
Refs: None that directly address this explicitly and clearly and in one 
place.


*Durability*
Updates are made highly durable at the level comparable to a DBMS by the 
use of the commit log.  However, this requires "commitlog_sync: batch" 
in cassandra.yaml.  For "some" performance improvement with "some" cost 
in durability you can specify "commitlog_sync: periodic".  See 
discussion below for more details.

Refs: Plenty + this thread.




*From:* AJ [mailto:a...@dude.podzone.net]
*Sent:* Friday, June 24, 2011 11:28 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Cassandra ACID

Ok, here it is reworked; consider it a summary of the thread.  If I 
left out an important point that you think is 100% correct even if you 
already mentioned it, then make some noise about it and provide some 
evidence so it's captured sufficiently.  And, if you're in a debate, 
please try and get to a resolution; all will appreciate it.


It will be evident below that Consistency is not the only thing that 
is "tunable", at least indirectly.  Unfortunately, you still can't 
tunafish.  Ar ar ar.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that 
one specific row atomically.  If part of the single-key batch update 
fails, then all of the updates will be reverted since they all 
pertained to one key/row.  Notice, I said 'reverted' not 'rolled 
back'.  Note: atomicity and isolation are related to the topic of 
transactions but one does not imply the other.  Even though row 
updates are atomic, they are not isolated from other users' updates or 
reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
Cassandra does not provide the same scope of Consistency as defined in 
the ACID standard.  Consistency in C* does not include referential 
integrity since C* is not a relational database.  Any referential 
integrity required would have to be handled by the client.  Also, even 
though the official docs say that QUORUM writes/reads is the minimal 
consistency_level setting to guarantee full consistency, this assumes 
that the write preceding the read does not fail (see comments below).  
Therefore, an ALL write would be necessary prior to a QUORUM read of 
the same data.  For a mu

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

2011-06-30 Thread AJ


It would be helpful if this was automated some how.

Re: No Transactions: An Example

2011-06-29 Thread AJ



On 6/22/2011 9:18 AM, Trevor Smith wrote:
Right -- that's the part that I am more interested in fleshing out in 
this post.




Here is one way.  Use MVCC 
<http://en.wikipedia.org/wiki/Multiversion_concurrency_control>.  A 
single global clean-up process would be acceptable since it's not a 
single point of failure, only a single point of accumulating back-logged 
work and will not affect availability as long as you are notified if 
that process terminates and restart it in a reasonable amount of time 
but this will not affect the validity of subsequent reads.


So, you would have a "balance" column.  And each update will create a 
"balance_" with a positive or negative value indicating a 
credit or debit.  Subsequent clients will read the latest value by doing 
a slice from "balance" to "balance_~" (i.e. all "balance*" columns).  
(You would have to work-out your column naming conventions so that your 
slices return only the pertinent columns.)  Then, the clients would have 
to apply all the credits and debits to the balance to get the current 
balance.


This handles the lost update problem.

For the dirty read and incorrect summary problems by others reading data 
that is in the middle of a transaction that hasn't committed yet, I 
would add a final transaction column to a Transactions CF.  The key 
would be .., e.g., Accounts.1234.balance, 1234 being 
the account # and Accounts being the CF owning the balance column.  
Then, a new column would be added for each successful transaction (e.g., 
after debiting and crediting the two accounts) using the same timestamp 
used in balance_.  So, now, a client wanting the current 
balance would have to do a slice for all of the transactions for that 
column and only apply the balance updates up to the latest transaction.  
Note, you might have to do something else with the transaction naming 
schemes to make sure they are guaranteed to be unique, but you get the 
idea.  If the transaction fails, the client simply does not add a 
transaction column to Transactions and deletes any "balance_" 
columns it added to in the Accounts CF (or let's the clean-up process do 
it... carefully).


This should avoid the need for locks and as long as each account doesn't 
have a crazy amount of updates, the slices shouldn't be so large as to 
be a significant perf hit.


A note about the updates.  You have to make sure the clean-up process 
processes the updates in order and only 1 time.  If you can't guarantee 
these, then you'll have to make sure your updates are idempotent and 
commutative.


Oh yeah, and you must use QUORUM read/writes, of course.

Any critiques?

aj

Re: Sharing Cassandra with Solandra

2011-06-28 Thread AJ


On 6/27/2011 3:39 PM, David Strauss wrote:

On Mon, 2011-06-27 at 15:06 -0600, AJ wrote:

Would anyone care to talk about their experiences with using Solandra
along side another application that uses Cassandra (also on the same
node)?  I'm curious about any resource contention issues or
compatibility between C* versions and Sol.  Also, I read the developer
somewhere say that you have to run Solandra on every C* node in the
ring.  I'm not sure if I interpreted that correctly.  Also, what's the
index size to data size ratio to expect (ballpark)?  How does it
perform?  Any caveats?

We're currently keeping the clusters separate at Pantheon Systems
because our core API (which runs on standard Cassandra) is often ready
for the next Cassandra version at a different time than Solandra.
Solandra recently gained dual 0.7/0.8 support, but we're still opting to
use the version on Cassandra that Solandra is primarily being built and
tested on (which is currently 0.8).


Thanks.  But, I'm finally cluing in that Solandra is also developed by 
DataStax, so I feel safer about future compatibility.

Yikes! I just read your blog Dominic. Now I'm worried since my app was
going to be mostly cloud-based. But, you didn't mention anything about
sleeping for 'max clock variance' after making the ntp-related config
changes (maybe you haven't had the time to blog it).

I'm curious, do you think the sleep is required even in a
non-virtualized environment? Is it only needed when implementing some
kind of lock? Does the type of lock make a difference?

Thanks!
aj (the other one)

On 6/28/2011 11:31 AM, Dominic Williams wrote:

Hi, yes you are correct, and this is a potential problem.

IMPORTANT: If you need to serialize writes from your application
servers, for example using distributed locking, then before releasing
locks you must sleep for a period equal to the maximum variance
between the clocks on your application server nodes.

I had a problem with the clocks on my nodes which led to all kinds of
problems. There is a slightly out of date post, which may not
mentioned the above point, on my experiences here
http://ria101.wordpress.com/2011/02/08/cassandra-the-importance-of-system-clocks-avoiding-oom-and-how-to-escape-oom-meltdown/

Hope this helps
Dominic

On 27 June 2011 23:03, A J <mailto:s5a...@gmail.com>> wrote:

During writes, the timestamp field in the column is the system-time of
that node (correct me if that is not the case and the system-time of
the co-ordinator is what gets applied to all the replicas).
During reads, the latest write wins.

What if there is a clock skew ? It could lead to a stale write
over-riding the actual latest write, just because the clock of that
node is ahead of the other node. Right ?

Re: Auto compaction to be staggered ?

2011-06-27 Thread AJ


On 6/27/2011 4:01 PM, A J wrote:

Is there an enhancement on the roadmap to stagger the auto compactions
on different nodes, to avoid more than one node compacting at any
given time (or as few nodes as possible to compact at any given time).
If not, any workarounds ?

Thanks.



+1.  I proposed the same in my *Ideas for Big Data Support* thread,

"5.)  Postponed Major Compactions:

The option to postpone auto-triggered major compactions until a 
pre-defined time of day or week or until staff can do it manually. "


aj

Sharing Cassandra with Solandra

2011-06-27 Thread AJ


Hi everyone,

Would anyone care to talk about their experiences with using Solandra 
along side another application that uses Cassandra (also on the same 
node)?  I'm curious about any resource contention issues or 
compatibility between C* versions and Sol.  Also, I read the developer 
somewhere say that you have to run Solandra on every C* node in the 
ring.  I'm not sure if I interpreted that correctly.  Also, what's the 
index size to data size ratio to expect (ballpark)?  How does it 
perform?  Any caveats?


Thanks!
aj

Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?

2011-06-25 Thread AJ


On 6/25/2011 8:24 AM, Edward Capriolo wrote:

I played around with the bakery algorithm and had ok success the
challenges are most implementations assume an n size array of fixed
clients and when you get it working it turns out to be a good number
of cassandra ops to acquire your bakery lock.



I was thinking rather than making certain that there is a column 
reserved for each node and having to keep it updated, you can just over 
allocate a large number that would always be enough, like 100.  A slice 
of 100 byte-sized values shouldn't be a significant perf hit vs 3 or 4.  
If you only have 3 nodes in your cluster and the last 97 go unused, that 
would be ok; it would be as if those non-existent "customers" never take 
a number.


For optimizing for C*, I think you can get away with minimal getSlices 
for the loops.  If you're lucky, you can fall through all of them using 
the results from only 1 getSlice.  Only if a process is "entering" or 
has a higher priority number will you need to wait and then do another 
getSlice and only a slice for the remaining columns.  I think my logic 
is correct; do you agree?


Did you have other problems other than performance?

Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?


On 6/24/2011 2:27 PM, Jonathan Ellis wrote:

Might be able to do it with
http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm.  "It is
remarkable that this algorithm is not built on top of some lower level
"atomic" operation, e.g. compare-and-swap."


This looks like it may work.  Jonathan, have you guru's discussed this 
algorithm before and come to a consensus on it by chance?

Re: Cassandra ACID

Ok, here it is reworked; consider it a summary of the thread.  If I left 
out an important point that you think is 100% correct even if you 
already mentioned it, then make some noise about it and provide some 
evidence so it's captured sufficiently.  And, if you're in a debate, 
please try and get to a resolution; all will appreciate it.


It will be evident below that Consistency is not the only thing that is 
"tunable", at least indirectly.  Unfortunately, you still can't 
tunafish.  Ar ar ar.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that one 
specific row atomically.  If part of the single-key batch update fails, 
then all of the updates will be reverted since they all pertained to one 
key/row.  Notice, I said 'reverted' not 'rolled back'.  Note: atomicity 
and isolation are related to the topic of transactions but one does not 
imply the other.  Even though row updates are atomic, they are not 
isolated from other users' updates or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
Cassandra does not provide the same scope of Consistency as defined in 
the ACID standard.  Consistency in C* does not include referential 
integrity since C* is not a relational database.  Any referential 
integrity required would have to be handled by the client.  Also, even 
though the official docs say that QUORUM writes/reads is the minimal 
consistency_level setting to guarantee full consistency, this assumes 
that the write preceding the read does not fail (see comments below).  
Therefore, an ALL write would be necessary prior to a QUORUM read of the 
same data.  For a multi-dc scenario use an ALL write followed by a 
EACH_QUORUM read.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview

*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make sense 
depending on your application.  Note: this doesn't mean to say that two 
updates of the same column will be corrupted, obviously; columns are the 
smallest atomic unit ('atomic' in the more general thread-safe context).
Refs: None that directly address this explicitly and clearly and in one 
place.


*Durability*
Updates are made highly durable at the level comparable to a DBMS by the 
use of the commit log.  However, this requires "commitlog_sync: batch" 
in cassandra.yaml.  For "some" performance improvement with "some" cost 
in durability you can specify "commitlog_sync: periodic".  See 
discussion below for more details.

Refs: Plenty + this thread.



On 6/24/2011 1:46 PM, Jim Newsham wrote:

On 6/23/2011 8:55 PM, AJ wrote:
Can any Cassandra contributors/guru's confirm my understanding of 
Cassandra's degree of support for the ACID properties?


I provide official references when known.  Please let me know if I 
missed some good official documentation.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch 
mutate for one specific key will apply updates to all the columns for 
that one specific row atomically.  If part of the single-key batch 
update fails, then all of the updates will be reverted since they all 
pertained to one key/row.  Notice, I said 'reverted' not 'rolled 
back'.  Note: atomicity and isolation are related to the topic of 
transactions but one does not imply the other.  Even though row 
updates are atomic, they are not isolated from other users' updates 
or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
If you want 100% consistency, use consistency level QUORUM for both 
reads and writes and EACH_QUORUM in a multi-dc scenario.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview



This is a pretty narrow interpretation of consistency.  In a 
traditional database, consistency prevents you from getting into a 
logically inconsistent state, where records in one table do not agree 
with records in another table.  This includes referential integrity, 
cascading deletes, etc.  It seems to me Cassandra has no support for 
this concept whatsoever.



*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make 
sense depending on your application.  Note: this doesn't mean to say 
that two updates of the same column will be corrupted, obviously; 
columns are the smallest atomic unit ('atomic' in the more general 
thread-safe con

Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?


On 6/24/2011 2:27 PM, Jonathan Ellis wrote:

Might be able to do it with
http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm.  "It is
remarkable that this algorithm is not built on top of some lower level
"atomic" operation, e.g. compare-and-swap."

I've been meaning to get back to reading that.  Thanks for the reminder 
Jonathan!

Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?


On 6/24/2011 2:09 PM, Jim Newsham wrote:

On 6/24/2011 9:28 AM, Yang wrote:
without a clear description of your pseudo-code, it's difficult to 
say whether it will work.


but I think it can work fine as an election/agreement protocol, which 
you can use as a lock to some degree, but this requires
all the potential lock contenders to all participate, you can't grab 
a lock before everyone has voiced their vote yet


I agree with this statement.  I think the issue is that the timestamps 
are generated by the clients and their clocks may not be in sync, so 
write A from client A might arrive with timestamp T, and write B from 
client B may reach the node later in time, however it may have an 
earlier timestamp (T', where T' < T).  Client A may perform a read 
immediately after its write and notice that it was the only client to 
request a lock -- so it will assume it has acquired the lock.  After 
Client B's lock request, it will perform a read and observe that it 
has written the request with the earliest timestamp -- so it will also 
assume it has acquired the lock, which would result in a failure of 
the locking scheme.  If each client is required to wait for all other 
clients to "vote", then this issue goes away.




Yes, you both understand the problem.  Hopefully we can find a solution 
without relying on a hack and based on C* design that will be supported 
in the future.


I'll be thinking on this some more.  Thanks.

Concurrency: Does C* support a Happened-Before relation between processes' writes?

Sorry, I know this is long-winded but I just want to make sure before I 
go through the trouble to implement this since it's not something that 
can be reliably tested and requires in-depth knowledge about C* 
internals.  But, this ultimately deals with concurrency control so 
anyone interested in that subject may want to try and answer this.  Thanks!



I would like to know how to do a series of writes and reads such that I 
can tell definitively what process out of many was the first to create a 
unique flag column.


IOW, I would like to have multiple processes (clients) compete to see 
who is first to write a token column.  The tokens start with a known 
prefix, such as "Token_" followed by the name of the process that 
created it and a UUID so that all columns are guaranteed unique and 
don't get overwritten.  For example, Process A could create:


Token_ProcA_

and process B would create:

Token_ProcB_

These writes/reads are asynchronous between the two or more processes.  
After the two processes write their respective tokens, each will read 
back all columns named "Token_*" that exist (a slice).  They each do 
this in order to find out who "won".  The process that wrote the column 
with the lowest timestamp wins.  The purpose is to implement a lock.


I think all that is required is for the processes to use QUORUM 
read/writes to make sure the final read is consistent and will assure 
each process that it can rely on what's returned from the final read and 
that there isn't an earlier write floating around somewhere.  This is 
where the "happened-before" question comes in.  Is it possible that 
Process A which writes it's token with a lower timestamp (and should be 
the winner), that this write may not be seen by Process B when it does 
it's read (which is after it's token write and after Process A wrote 
it's token), and thus conclude incorrectly that itself (Process B) is 
the winner since it will not see Process A's token?  I'm 99% sure using 
QUORUM read/writes will allow this to work because that's the whole 
purpose, but I just wanted to double-check in case there's another 
detail I'm forgetting about C* that would defeat this.


Thanks!

P.S.  I realize this will cost me in performance, but this is only meant 
to be used on occasion.

Cassandra ACID

2011-06-23 Thread AJ

Can any Cassandra contributors/guru's confirm my understanding of 
Cassandra's degree of support for the ACID properties?


I provide official references when known.  Please let me know if I 
missed some good official documentation.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that one 
specific row atomically.  If part of the single-key batch update fails, 
then all of the updates will be reverted since they all pertained to one 
key/row.  Notice, I said 'reverted' not 'rolled back'.  Note: atomicity 
and isolation are related to the topic of transactions but one does not 
imply the other.  Even though row updates are atomic, they are not 
isolated from other users' updates or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
If you want 100% consistency, use consistency level QUORUM for both 
reads and writes and EACH_QUORUM in a multi-dc scenario.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview

*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make sense 
depending on your application.  Note: this doesn't mean to say that two 
updates of the same column will be corrupted, obviously; columns are the 
smallest atomic unit ('atomic' in the more general thread-safe context).
Refs: None that directly address this explicitly and clearly and in one 
place.


*Durability*
Updates are made durable by the use of the commit log.  No worries here.
Refs: Plenty.

Re: Storing files in blob into Cassandra

2011-06-23 Thread AJ


On 6/22/2011 11:43 PM, Sasha Dolgy wrote:

maybe you want to spend a few minutes reading about Haystack over at
facebook to give you some ideas...

https://www.facebook.com/note.php?note_id=76191543919

Not saying what they've done is the right way... just sayin'


Thanks for the tip Sasha; will do.

Re: No Transactions: An Example

2011-06-23 Thread AJ


On 6/23/2011 7:37 AM, Trevor Smith wrote:

AJ,

Thanks for your input. I don't fully follow though how this would work 
with a bank scenario. Could you explain in more detail?


Thanks.

Trevor


I don't know yet.  I'll be researching that.  My working procedure is to 
figure out a way to handle each class of problem that ACID addresses and 
see if there is an acceptable way to compensate or manage it on the 
client or business side; following the ideas in the article.  I bet 
solutions exist somewhere.  In short, the developer needs to be fully 
versed in the potential problems that could arise and have ways to deal 
with it.  It's added responsibility for the developer, but if it keeps 
the infrastructure simple with reduced maintenance costs by not having 
to integrate another service such as ZK/Cages (as useful as they indeed 
are) then it may be worth it.  I'll let you know what I conclude.

Re: Storing files in blob into Cassandra


On 6/22/2011 1:07 AM, Damien Picard wrote:

Hi,

I have to store some files (Images, documents, etc.) for my users in a 
webapp. I use Cassandra for all of my data and I would like to know if 
this is a good idea to store these files into blob on a Cassandra CF ?
Is there some contraindications, or special things to know to achieve 
this ?


Thank you

--
Damien Picard
Axeiya Services : http://axeiya.com/
gwt-ckeditor : http://code.google.com/p/gwt-ckeditor/
Mon livre sur GWT : http://axeiya.com/index.php/ouvrage-gwt.html



I was thinking of doing the same thing.  But, to compensate for the 
bandwidth usage during the read, I was hoping to find a way for the 
httpd or app server to cache the file either in RAM or on disk so 
subsequent reads could just reference the in-mem cache or local hdd.  I 
have big data requirements, so duplicating the storage of file blobs by 
adding them to the hdd would almost double my storage requirements.  So, 
the hdd cache would have to be limited with the LRU removed periodically.


I was thinking about making the key for each file be a relative file 
path as if it were on disk.  This same path could also be used as it's 
actual location on disk in the local disk cache.  Using a path as the 
key makes it flexible in many ways if I ever change my mind and want to 
store all files on disk, or when backing-up or archiving, etc..


But, I'm rusty on my apache http knowledge but I also thought there was 
an apache cache mod that would use both ram and disk depending on the 
frequency of use.  But, I don't know if you can tell it to "cache this 
blob like it's a file".


Just some thoughts.

Re: Is LOCAL_QUORUM as strong as QUORUM?

On 6/22/2011 8:20 PM, mcasandra wrote:

Well it depends on the requirements. If you use any combination of CL with
EACH_QUORUM it means you are accepting the fact that you are ok if one of
the DC is down. And in your scenario you care more about DCs being
consistent even if writes were to fail. Also you are ok with network
latency.

I think there is a broader design question here and you might be able to
solve it with LOCAL_QUORUM if you handled it at application or load
balancing layer. Is this active/active data center? What's your actual
requirements? Are these external clients that can go to any data center?

--
View this message in context:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-LOCAL-QUORUM-as-strong-as-QUORUM-tp6506592p6506937.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at
Nabble.com.

I require 3 (or more) geographically diverse dc's serving local users.
The next arbitrary closest dc will serve as a 1-replica fail-over for
the previous dc in case it becomes unavail altogether. So, each dc is
active for it's locale and a failover for one of the others; like a
daisy chain configuration. I was imagining a series of events where the
primary dc gets updated at local_quorum, followed by that dc losing all
network connectivity before the backup gets the change. Then, the same
user gets redirected to the backup dc and does a read at local_quorum
and gets stale data.

But, I realize now if I substituted each_quorum for local_quorum for
writes, then, in the case of fail-over, the writes would fail. That's
fine for consistency's sake, but is a high price to pay. I have to
think on this more and what I want. Thanks for the help.

Re: Is LOCAL_QUORUM as strong as QUORUM?


On 6/22/2011 6:50 PM, AJ wrote:

On 6/22/2011 5:56 PM, mcasandra wrote:

LOCAL_QUORUM gurantees consistency in the local data center only. Other
replica nodes in the same DC and other DC not part of the QUORUM will be
eventually consistent. If you want to ensure consistency accross DCs 
you can
use EACH_QUORUM but keep in mind the latency involved assuming DCs 
are not

located within short distance.

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-LOCAL-QUORUM-as-strong-as-QUORUM-tp6506592p6506621.html
Sent from the cassandra-u...@incubator.apache.org mailing list 
archive at Nabble.com.




Thanks mcasandra.

I would like to know the minimal consistency_level to assure absolute 
consistency with a multiple data center setup for minimal latency.  
Just as quorum read/writes is the minimal needed to assure consistency 
with a single data center cluster, what is the equivalent read/write 
consistency_level value pair with a multi data center environment?


I'm thinking... writes at EACH_QUORUM and reads at LOCAL_QUORUM?  This 
will handle when a data center gets partitioned.   The write will fail 
if the dc's get partitioned.  If the partition happens after a 
successful write, then that's ok and a local quorum is all that's 
needed for a subsequent read that's consistent.


I meant to say "This will handle when *two or more data centers get* 
partitioned.  The write...".

Re: Is LOCAL_QUORUM as strong as QUORUM?


On 6/22/2011 5:56 PM, mcasandra wrote:

LOCAL_QUORUM gurantees consistency in the local data center only. Other
replica nodes in the same DC and other DC not part of the QUORUM will be
eventually consistent. If you want to ensure consistency accross DCs you can
use EACH_QUORUM but keep in mind the latency involved assuming DCs are not
located within short distance.

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-LOCAL-QUORUM-as-strong-as-QUORUM-tp6506592p6506621.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.



Thanks mcasandra.

I would like to know the minimal consistency_level to assure absolute 
consistency with a multiple data center setup for minimal latency.  Just 
as quorum read/writes is the minimal needed to assure consistency with a 
single data center cluster, what is the equivalent read/write 
consistency_level value pair with a multi data center environment?


I'm thinking... writes at EACH_QUORUM and reads at LOCAL_QUORUM?  This 
will handle when a data center gets partitioned.   The write will fail 
if the dc's get partitioned.  If the partition happens after a 
successful write, then that's ok and a local quorum is all that's needed 
for a subsequent read that's consistent.

Is LOCAL_QUORUM as strong as QUORUM?

Quorum read/writes guarantees consistency.  But, when a keyspace spans 
multiple data centers, does local quorum read/writes also guarantee 
consistency?  I'm thinking maybe not if two data centers get partitioned.


Thanks!

Re: Atomicity Strategies


Thanks Aaron!

On 6/22/2011 5:25 PM, aaron morton wrote:

Atomic on a single machine yes.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23 Jun 2011, at 09:42, AJ wrote:


On 4/9/2011 7:52 PM, aaron morton wrote:

My understanding of what they did with locking (based on the examples) was to achieve 
a level of transaction isolation 
http://en.wikipedia.org/wiki/Isolation_(database_systems)<http://en.wikipedia.org/wiki/Isolation_%28database_systems%29>

I think the issue here is more about atomicity 
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

We cannot guarantee that all or none of the mutations in your batch are 
completed. There is some work in this area though 
https://issues.apache.org/jira/browse/CASSANDRA-1684


Just to be clear, you are speaking in the general sense, right?  The batch 
mutate link you provide says that in the case that ALL the mutates of the batch 
are for the SAME key (row), then the whole batch is atomic:

"As a special case, mutations against a single key are atomic but not 
isolated."

So, is it true that if I want to update multiple columns for one key, then it 
will be an all or nothing update for the whole batch if using batch update?  
But, if your batch mutate containts mutates for more than one key, then all the 
updates for one key will be atomic, followed by all the updates for the next 
key will be atomic, and so on.  Correct?

Thanks!

NTS Replication Strategy - only replicating to a subset of data centers

I'm just double-checking, but when using NTS, is it required to specify 
ALL the data centers in the strategy_options attribute?


IOW, I do NOT want replication to ALL data centers; only a two of the 
three.  So, if my property file snitch describes all of the existing 
data centers and nodes as:


$CASSANDRA_HOME/conf/cassandra-topology.properties:

# Cassandra Node IP=Data Center:Rack
175.1.1.1=DC1:RAC1
175.2.1.1=DC2:RAC1
175.3.1.1=DC3:RAC1
# default for unknown nodes
default=DC1:rac1

Can I specify strategy_options as:

strategy_options=[{DC1:2, DC2:1}]

and just leave out DC3 entirely?

If not, will setting the last one to 0 work?:

strategy_options=[{DC1:2, DC2:1, DC3:0}]


Thanks!

Re: No Transactions: An Example

I think Sasha's idea is worth studying more.  Here is a supporting read 
referenced in the O'Reilly Cassandra book that talks about alternatives 
to 2-phase commit and synchronous transactions:


http://www.eaipatterns.com/ramblings/18_starbucks.html

If it can be done without locks and the business can handle a rare 
incomplete transaction, then this might be acceptable.



On 6/22/2011 9:14 AM, Sasha Dolgy wrote:

I would still maintain a record of the transaction ... so that I can
do analysis post to determine if/when problems occurred ...

On Wed, Jun 22, 2011 at 4:31 PM, Trevor Smith  wrote:

Sasha,
How would you deal with a transfer between accounts in which only one half
of the operation was successfully completed?
Thank you.
Trevor

Re: Atomicity Strategies


On 4/9/2011 7:52 PM, aaron morton wrote:
My understanding of what they did with locking (based on the examples) 
was to achieve a level of transaction isolation 
http://en.wikipedia.org/wiki/Isolation_(database_systems) 



I think the issue here is more about atomicity 
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic


We cannot guarantee that all or none of the mutations in your batch 
are completed. There is some work in this area though 
https://issues.apache.org/jira/browse/CASSANDRA-1684




Just to be clear, you are speaking in the general sense, right?  The 
batch mutate link you provide says that in the case that ALL the mutates 
of the batch are for the SAME key (row), then the whole batch is atomic:


"As a special case, mutations against a single key are atomic but 
not isolated."


So, is it true that if I want to update multiple columns for one key, 
then it will be an all or nothing update for the whole batch if using 
batch update?  But, if your batch mutate containts mutates for more than 
one key, then all the updates for one key will be atomic, followed by 
all the updates for the next key will be atomic, and so on.  Correct?


Thanks!

Re: Storing Accounting Data


On 6/21/2011 3:36 PM, Stephen Connolly wrote:


writes are not atomic.

the first side can succeed at quorum, and the second side can fail 
completely... you'll know it failed, but now what... you retry, still 
failed... erh I'll store it somewhere and retry it later... where do I 
store it?


the consistency level is about tuning whether reads and writes are 
replicated/checked across multiple of the replicates... but at any 
consistency level, each write will either succeed or fail _independently_


you could have one column family which is kind of like a transaction 
log, you write a json object of all the mutations you will make, then 
you go and make the mutations, when they succeed you write a completed 
column to the transaction log... them you can repeat that as often as need


you could have transactions posted as columns in a row, and to get the 
balance you iterate all the columns adding the +'s and -'s


by processing the transaction log, you could establish the highest 
complete timestamp, and add summary balance columns being the running 
total up to that point, so that you don't have to iterate everything


- Stephen



Yeah, it's all more than I want to do.  But, I just rediscovered 
Dominic's Cages <http://code.google.com/p/cages/>.   Has anyone tried it?


---
Sent from my Android phone, so random spelling mistakes, random 
nonsense words and other nonsense are a direct result of using swype 
to type on the screen


On 21 Jun 2011 22:04, "AJ" <mailto:a...@dude.podzone.net>> wrote:

Re: Storing Accounting Data


On 6/21/2011 3:14 PM, Anand Somani wrote:
Not sure if it is that simple, a quorum can fail with writes happening 
on some nodes (there is no rollback). Also there is no concept of 
atomic compare-and-swap.




Good points.  I suppose what I need is for the client to implement the 
part of ACID that C* does not.  So, right off the bat, can anyone tell 
me if that is even possible conceptually?  If so, please throw out some 
terms that I can wiki and some Java API's would be much appreciated as 
well.  Also, can I accomplish this or make things easier by imposing 
some restrictions, such as only allowing single-user access to the data 
for certain operations?


Thanks!

Re: Storing Accounting Data

And I was thinking of using JTA for transaction processing.  I have no 
experience with it but on the surface it looks like it should work.


On 6/21/2011 3:31 PM, AJ wrote:

What's the best accepted way to handle that 100% in the client?  Retries?

On 6/21/2011 3:14 PM, Anand Somani wrote:
Not sure if it is that simple, a quorum can fail with writes 
happening on some nodes (there is no rollback). Also there is no 
concept of atomic compare-and-swap.


On Tue, Jun 21, 2011 at 2:03 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


On 6/21/2011 2:50 PM, Stephen Connolly wrote:


how important are things like transactional consistency for you?

would you have issues if only one side of a transfer was recorded?



Right.  Both of those questions are about consistency.  Isn't the
simple solution is to use QUORUM read/writes?


cassandra, out of the box, on it's own, would not be ideal if
the above two things are important for you.

you can add components to a system to help address these things,
eg zookeeper, etc. a reason why you moght do this is if you
already use cassandra in your app and are trying to limit the
number of databases

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random
nonsense words and other nonsense are a direct result of using
swype to type on the screen

    On 21 Jun 2011 18:30, "AJ" mailto:a...@dude.podzone.net>> wrote:

Re: Storing Accounting Data


What's the best accepted way to handle that 100% in the client?  Retries?

On 6/21/2011 3:14 PM, Anand Somani wrote:
Not sure if it is that simple, a quorum can fail with writes happening 
on some nodes (there is no rollback). Also there is no concept of 
atomic compare-and-swap.


On Tue, Jun 21, 2011 at 2:03 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


On 6/21/2011 2:50 PM, Stephen Connolly wrote:


how important are things like transactional consistency for you?

would you have issues if only one side of a transfer was recorded?



Right.  Both of those questions are about consistency.  Isn't the
simple solution is to use QUORUM read/writes?


cassandra, out of the box, on it's own, would not be ideal if the
above two things are important for you.

you can add components to a system to help address these things,
eg zookeeper, etc. a reason why you moght do this is if you
already use cassandra in your app and are trying to limit the
number of databases

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random
nonsense words and other nonsense are a direct result of using
swype to type on the screen

    On 21 Jun 2011 18:30, "AJ" mailto:a...@dude.podzone.net>> wrote:

Re: Storing Accounting Data


On 6/21/2011 2:50 PM, Stephen Connolly wrote:


how important are things like transactional consistency for you?

would you have issues if only one side of a transfer was recorded?



Right.  Both of those questions are about consistency.  Isn't the simple 
solution is to use QUORUM read/writes?


cassandra, out of the box, on it's own, would not be ideal if the 
above two things are important for you.


you can add components to a system to help address these things, eg 
zookeeper, etc. a reason why you moght do this is if you already use 
cassandra in your app and are trying to limit the number of databases


- Stephen

---
Sent from my Android phone, so random spelling mistakes, random 
nonsense words and other nonsense are a direct result of using swype 
to type on the screen


On 21 Jun 2011 18:30, "AJ" <mailto:a...@dude.podzone.net>> wrote:

Storing Accounting Data