Changing snitch from PropertyFile to Gossip

2016-04-24 Thread AJ
Is it possible to do this without downtime, i.e., run in mixed mode while doing 
a rolling upgrade?

Thrift row cache in Cassandra 2.1

2016-03-30 Thread AJ
Hi,

I am having to tune a legacy app to use row caching (the why is unimportant). I 
know Thrift is EOL, etc.  However, I have to do it.

I am unable to work out what the values to set on the column family are now 
with the changes in Caching (i.e. rows_per_partition). Previously you would set 
them to all, keys_only, rows_only, or none - is this still the case? The docs 
seem to indicate you can only set it to keys or rows_per_partition… When I set 
it to all on a CF via cassandra-cli, it says rows_per_partition: 0 when I look at 
the CQL for the same CF.

Just a bit confused - if anyone can clarify it, would be appreciated.
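
For reference, a minimal sketch (assuming the DataStax Python driver; the keyspace and table names are made up) of the 2.1-era caching map that replaces the old string values. The mapping in the comments is a rough equivalence, not an official table:

    # Roughly: old "all"       -> {'keys': 'ALL',  'rows_per_partition': 'ALL'}
    #          old "keys_only" -> {'keys': 'ALL',  'rows_per_partition': 'NONE'}
    #          old "rows_only" -> {'keys': 'NONE', 'rows_per_partition': 'ALL'}
    #          old "none"      -> {'keys': 'NONE', 'rows_per_partition': 'NONE'}
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])        # contact point is an assumption
    session = cluster.connect('ks')         # hypothetical keyspace
    session.execute("""
        ALTER TABLE legacy_cf
        WITH caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    """)
    cluster.shutdown()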

Thanks,

AJ




Re: Anyone using Facebook's flashcache?

2011-07-18 Thread AJ

On 7/18/2011 1:20 PM, Héctor Izquierdo Seliva wrote:

If using the version that has both rt and wt caches, is it just the wt
cache that's polluted for compactions/flushes?  If not, why does the rt
cache also get polluted?


As I said, all reads go through flashcache, so if you read three 10 GB
sstables for a compaction you will get those 30 GB into the cache.



Of course.  I wasn't thinking clearly.

So, back to a previous point you brought up, I will have heavy reads and 
even heavier writes.  How would you rate the benefits of flashcache in 
such a scenario?  Is it still an overall performance boost worth the 
expense?


Thanks,
aj



Re: Anyone using Facebook's flashcache?

2011-07-18 Thread AJ

On 7/18/2011 12:08 PM, Héctor Izquierdo Seliva wrote:

Interesting.  So, there is no segregation between read and write cache
space?  A compaction or flush can evict blocks in the read cache if it
needs the space for write buffering?

There are two versions, the -wt (write through) that will cache also
what is written, and the normal version that will only cache reads.
Either way you will pollute your cache with compactions.



If using the version that has both rt and wt caches, is it just the wt 
cache that's polluted for compactions/flushes?  If not, why does the rt 
cache also get polluted?





Re: Anyone using Facebook's flashcache?

2011-07-18 Thread AJ

On 7/18/2011 4:14 AM, Héctor Izquierdo Seliva wrote:

Hector, some before/after numbers would be great if you can find them.
Thanks!


I'll try and get some for you :)


What happens when your cache gets trashed?  Do compactions and flushes
go slower?


If you use flashcache-wt flushed and compacted sstables will go to the
cache.

All reads are cached, so if you compact three sstables into one, you are
stuffing your cache with a lot of useless crap and evicting valid blocks
(flashcache won't honor any of the hints set with fadvise, as it's a
block cache layer and doesn't know of them anyway). If your write rate
is low it might work for you.



Interesting.  So, there is no segregation between read and write cache 
space?  A compaction or flush can evict blocks in the read cache if it 
needs the space for write buffering?





aj








Re: Anyone using Facebook's flashcache?

2011-07-17 Thread AJ

On 7/17/2011 12:29 PM, Héctor Izquierdo Seliva wrote:

I've been using flashcache for a while in production. It improves read
performance and latency was halved by a good chunk, though I don't
remember the exact numbers.

Problems: compactions will trash your cache, and so will memtable
flushes. Right now there's no way to avoid that.

If you want, I could dig the numbers for a before/after comparison.



Hector, some before/after numbers would be great if you can find them.  
Thanks!


What happens when your cache gets trashed?  Do compactions and flushes 
go slower?


aj







Re: Strong Consistency with ONE read/writes

2011-07-12 Thread AJ

On 7/12/2011 10:48 AM, Yang wrote:

for example,
coord writes record 1,2 ,3 ,4,5 in sequence
if u have replica A, B, C
currently A can have 1 , 3
B can have 1,3,4,
C can have 2345

by "prefix", I mean I want them to have only 1---n  where n is some
number  between 1 and 5,
for example A having 1,2,3
B having 1,2,3,4
C having 1,2,3,4,5

the way we enforce this prefix pattern is that
1) the leader is ensured to have everything that's sent out, otherwise
it's removed from leader position
2) non-leader replicas are guaranteed to receive a prefix, because of
FIFO of the connection between replica and coordinator, if this
connection breaks, replica must catchup from the authoritative source
of leader

there is one point I hand-waved a bit: there are many coordinators,
the "prefix" from each of them is different, still need to think about
this, worst case is that we need to force the traffic come from the
leader, which is less interesting because it's almost hbase then...



Are you saying:  All replicas will receive the value whether or not they 
actually own the key range for the value.  If a node is not a replica 
for a value, it will not store it, but it will still write it in its 
transaction log as a backup in case the leader dies.  Is that right?





On Tue, Jul 12, 2011 at 7:37 AM, AJ  wrote:

Yang, I'm not sure I understand what you mean by "prefix of the HLog".
  Also, can you explain what failure scenario you are talking about?  The
major failure that I see is when the leader node confirms to the client a
successful local write, but then fails before the write can be replicated to
any other replica node.  But, then again, you also say that the leader does
not forward replicas in your idea; so it's not real clear.

I'm still trying to figure out how to make this work with normal Cass
operation.

aj

On 7/11/2011 3:48 PM, Yang wrote:

I'm not proposing any changes to be done, but this looks like a very
interesting topic for thought/hack/learning, so the following are only
for thought exercises 


HBase enforces a single write/read entry point, so you can achieve
strong consistency by writing/reading only one node.  but just writing
to one node exposes you to loss of data if that node fails. so the
region server HLog is replicated to 3 HDFS data nodes.  the
interesting thing here is that each replica sees a complete *prefix*
of the HLog: it won't miss a record, if a record sync() to a data node
fails, all the existing bytes in the block are replicated to a new
data node.

if we employ a similar "leader" node among the N replicas of
cassandra (coordinator always waits for the reply from leader, but
leader does not do further replication like in HBase or counters), the
leader sees all writes onto the key range, but the other replicas
could miss some writes, as a result, each of the non-leader replicas'
write history has some "holes", so when the leader dies, and when we
elect a new one, no one is going to have a complete history. so you'd
have to do a repair amongst all the replicas to reconstruct the full
history, which is slow.

it seems possible that we could utilize the FIFO property of the
InComingTCPConnection to simplify history reconstruction, just like
Zookeeper. if the IncomingTcpConnection of a replica fails, that means
that it may have missed some edits, then when it reconnects, we force
it to talk to the active leader first, to catch up to date. when the
leader dies, the next leader is elected to be the replica with the
most recent history.  by maintaining the property that each node has a
complete prefix of history, we only need to catch up on the tail of
history, and avoid doing a complete repair on the entire
memtable+SStable.  but one issue is that the history at the leader has
to be kept really long - if a non-leader replica goes off for 2
days, the leader has to keep all the history for 2 days to feed them
to the replica when it comes back online. but possibly this could be
limited to some max length so that over that length, the woken replica
simply does a complete bootstrap.


thanks
yang
On Sun, Jul 3, 2011 at 8:25 PM, AJ wrote:

We seem to be having a fundamental misunderstanding.  Thanks for your
comments. aj

On 7/3/2011 8:28 PM, William Oberman wrote:

I'm using cassandra as a tool, like a black box with a certain contract
to
the world.  Without modifying the "core", C* will send the updates to all
replicas, so your plan would cause the extra write (for the placeholder).
  I
wasn't assuming a modification to how C* fundamentally works.
Sounds like you are hacking (or at least looking) at the source, so all
the
power to you if/when you try these kind of changes.
will
On Sun, Jul 3, 2011 at 8:45 PM, AJ wrote:

On 7/3/2011 6:32 PM, William Oberman wrote:

Was just going off of: " Send the value to the primary replica and send
placehold

Re: Anyone using Facebook's flashcache?

2011-07-12 Thread AJ

On 7/12/2011 9:02 PM, Peter Schuller wrote:

Thanks Peter, but... hmmm, are you saying that even after a cache miss which
results in a disk read and blocks being moved to the ssd, that by the next
cache miss for the same data and subsequent same file blocks, that the ssd
is unlikely to have those same blocks present anymore?

I am saying that regardless of whether the cache is memory, ssd, a
combination of both, or anything else, most workloads tend to be
subject to diminishing returns. Doubling cache from 5 gb to 10 gb
might get you from 10% to 50% cache hit ratio, but doubling again to
20 gb might get you to 60% and doubling to 40 gig to 65% (to use some
completely arbitrary random numbers for demonstration purposes).

The reason a cache can be more effective than the ratio of its size
vs. the total data set, is that there is a hotspot/working set that is
smaller than the total data set. If you have completely random access
this won't be the case, and a cache of size n% of total size will
give you a n% cache hit ratio.

But for most workloads, you have a hotter working set so you get more
bang for the buck when caching. For example, if 99% of all accesses
are accessing 10% of the data, then a cache that is the size of 10% of
the data gets you 99% cache hit ratio. But clearly no matter how much
more cache you ever add, you will never ever cache more than 100% of
reads so in this (artificial arbitrary) scenario, once you're caching
10% of your data, the cost of caching the final small percent of
accesses might be 10 times that of the original cache.

I did a quick Google but didn't find a good piece describing it more
properly, but hopefully the above is helpful. Some related reading
might be http://en.wikipedia.org/wiki/Long_Tail
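
To make the diminishing-returns point concrete, here is a throwaway calculation using the same artificial numbers (99% of accesses hitting 10% of the data). It only illustrates the shape of the curve, not any real workload:

    def hit_ratio(cache_fraction, hot_fraction=0.10, hot_share=0.99):
        """Rough hit ratio for a two-tier hot/cold access pattern."""
        if cache_fraction <= hot_fraction:
            # cache is still smaller than the hot set
            return hot_share * (cache_fraction / hot_fraction)
        # hot set fully cached; the remainder caches the uniform cold tail
        cold_cached = (cache_fraction - hot_fraction) / (1.0 - hot_fraction)
        return hot_share + (1.0 - hot_share) * cold_cached

    for frac in (0.05, 0.10, 0.20, 0.40):
        print("cache = %3.0f%% of data -> ~%5.1f%% hits"
              % (frac * 100, hit_ratio(frac) * 100))
    # 5% -> ~49.5%, 10% -> ~99.0%, 20% -> ~99.1%, 40% -> ~99.3%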



Of course.  Thanks for the clarification.  On the positive side, this 
flashcache and other solutions like it could be beneficial for all disk 
i/o on the system.  Writes will always benefit.  Reads, only if they are 
read again before being pushed out by other reads.  I wonder if it would 
help to "prime" the ssd by reading in (and discarding) the top 25% 
(250/1000GB) of the usual hot data.


aj


Re: Anyone using Facebook's flashcache?

2011-07-12 Thread AJ

On 7/12/2011 10:19 AM, Peter Schuller wrote:

Do any Cass developers have any thoughts on this and whether or not it would
be helpful considering Cass' architecture and operation?

A well-functioning L2 cache should definitely be very useful with
Cassandra for read-intensive workloads where the request distribution
is such that additional caching will be beneficial. However, it will
depend in any particular case on how the L2 cache works, and what your
request distribution is like.

I have been wanting to try flashcache but haven't yet, so I cannot speak to it.

In particular though, keep in mind that if you've got say 1 tb of data
and your memory is enough to keep the hot set, and your disk I/O is
coming from the long tail, increasing the amount of cache to 200 gig
may not necessarily give you a huge improvement in terms of
percentages.


Thanks Peter, but... hmmm, are you saying that even after a cache miss 
which results in a disk read and blocks being moved to the ssd, that by 
the next cache miss for the same data and subsequent same file blocks, 
that the ssd is unlikely to have those same blocks present anymore?




Anyone using Facebook's flashcache?

2011-07-12 Thread AJ
With big data requirements pressuring me to pack up to a terabyte on one 
node, I suspect that even 32 GB of RAM just will not be large enough for 
Cass' various memory caches to be effective.  32/1000 is a tiny working 
set to data store ratio... even assuming non-random reads.  So, I'm 
investigating whether or not a 256 GB SSD used as a cache between the 
data HDD and the Cass server process would help.  It won't decrease cache misses, 
but at least the access time would be orders of magnitude faster than 
from the hdd.  Also, write performance is improved because of lazy flushing.


Do any Cass developers have any thoughts on this and whether or not it 
would be helpful considering Cass' architecture and operation?


Links:
http://www.facebook.com/note.php?note_id=388112370932
https://github.com/facebook/flashcache/wiki

aj


Re: Strong Consistency with ONE read/writes

2011-07-12 Thread AJ
Yang, I'm not sure I understand what you mean by "prefix of the HLog".  
Also, can you explain what failure scenario you are talking about?  The 
major failure that I see is when the leader node confirms to the client 
a successful local write, but then fails before the write can be 
replicated to any other replica node.  But, then again, you also say 
that the leader does not forward replicas in your idea; so it's not real 
clear.


I'm still trying to figure out how to make this work with normal Cass 
operation.


aj

On 7/11/2011 3:48 PM, Yang wrote:

I'm not proposing any changes to be done, but this looks like a very
interesting topic for thought/hack/learning, so the following are only
for thought exercises 


HBase enforces a single write/read entry point, so you can achieve
strong consistency by writing/reading only one node.  but just writing
to one node exposes you to loss of data if that node fails. so the
region server HLog is replicated to 3 HDFS data nodes.  the
interesting thing here is that each replica sees a complete *prefix*
of the HLog: it won't miss a record, if a record sync() to a data node
fails, all the existing bytes in the block are replicated to a new
data node.

if we employ a similar "leader" node among the N replicas of
cassandra (coordinator always waits for the reply from leader, but
leader does not do further replication like in HBase or counters), the
leader sees all writes onto the key range, but the other replicas
could miss some writes, as a result, each of the non-leader replicas'
write history has some "holes", so when the leader dies, and when we
elect a new one, no one is going to have a complete history. so you'd
have to do a repair amongst all the replicas to reconstruct the full
history, which is slow.

it seems possible that we could utilize the FIFO property of the
InComingTCPConnection to simplify history reconstruction, just like
Zookeeper. if the IncomingTcpConnection of a replica fails, that means
that it may have missed some edits, then when it reconnects, we force
it to talk to the active leader first, to catch up to date. when the
leader dies, the next leader is elected to be the replica with the
most recent history.  by maintaining the property that each node has a
complete prefix of history, we only need to catch up on the tail of
history, and avoid doing a complete repair on the entire
memtable+SStable.  but one issue is that the history at the leader has
to be kept really long - if a non-leader replica goes off for 2
days, the leader has to keep all the history for 2 days to feed them
to the replica when it comes back online. but possibly this could be
limited to some max length so that over that length, the woken replica
simply does a complete bootstrap.


thanks
yang
On Sun, Jul 3, 2011 at 8:25 PM, AJ  wrote:

We seem to be having a fundamental misunderstanding.  Thanks for your
comments. aj

On 7/3/2011 8:28 PM, William Oberman wrote:

I'm using cassandra as a tool, like a black box with a certain contract to
the world.  Without modifying the "core", C* will send the updates to all
replicas, so your plan would cause the extra write (for the placeholder).  I
wasn't assuming a modification to how C* fundamentally works.
Sounds like you are hacking (or at least looking) at the source, so all the
power to you if/when you try these kind of changes.
will
On Sun, Jul 3, 2011 at 8:45 PM, AJ  wrote:

On 7/3/2011 6:32 PM, William Oberman wrote:

Was just going off of: " Send the value to the primary replica and send
placeholder values to the other replicas".  Sounded like you wanted to write
the value to one, and write the placeholder to N-1 to me.

Yes, that is what I was suggesting.  The point of the placeholders is to
handle the crash case that I talked about... "like" a WAL does.

But, C* will propagate the value to N-1 eventually anyways, 'cause that's
just what it does anyways :-)
will

On Sun, Jul 3, 2011 at 7:47 PM, AJ  wrote:

On 7/3/2011 3:49 PM, Will Oberman wrote:

Why not send the value itself instead of a placeholder?  Now it takes 2x
writes on a random node to do a single update (write placeholder, write
update) and N*x writes from the client (write value, write placeholder to
N-1). Where N is replication factor.  Seems like extra network and IO
instead of less...

To send the value to each node is 1.) unnecessary, 2.) will only cause a
large burst of network traffic.  Think about if it's a large data value,
such as a document.  Just let C* do its thing.  The extra messages are tiny
and don't significantly increase latency since they are all sent
asynchronously.


Of course, I still think this sounds like reimplementing Cassandra
internals in a Cassandra client (just guessing, I'm not a cassandra dev)

I don't see how.  Maybe you should take a peek at the source.


On Jul 3, 2011, at 5:20 PM,

Feature Request: Multi-key Mapping

2011-07-10 Thread AJ
I think this would be another powerful feature, making it so much easier to 
deal with records/objects that can have multiple unique keys where both are 
not always used.  You wouldn't have to use secondary indexes, which really 
aren't suitable for high-cardinality (high-uniqueness) indexes and are better 
suited for range queries.


Of course, some indirection would be needed to avoid the naive solution 
of simply duplicating values.


Maybe Unix inodes is the best analogy here.
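
As a rough illustration of that indirection (not an existing Cassandra feature; all keyspace, table, and column names below are invented, and this assumes the DataStax Python driver), the value is stored once under a surrogate id and each unique key maps to that id:

    import uuid
    from cassandra.cluster import Cluster

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('ks')

    doc_id = uuid.uuid4()
    session.execute("INSERT INTO documents (id, body) VALUES (%s, %s)",
                    (doc_id, 'stored exactly once'))
    # One lookup table per alternate unique key; both point at the same id.
    session.execute("INSERT INTO docs_by_email (email, doc_id) VALUES (%s, %s)",
                    ('aj@example.com', doc_id))
    session.execute("INSERT INTO docs_by_username (username, doc_id) VALUES (%s, %s)",
                    ('aj', doc_id))

    # Reading by either key is two lookups: resolve the id, then fetch the value.
    ref = session.execute("SELECT doc_id FROM docs_by_username WHERE username = %s",
                          ('aj',)).one()
    doc = session.execute("SELECT body FROM documents WHERE id = %s",
                          (ref.doc_id,)).one()
    cluster.shutdown()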

aj


Re: Command Request: rename a column

2011-07-08 Thread AJ

On 7/8/2011 2:18 AM, Sylvain Lebresne wrote:

On Fri, Jul 8, 2011 at 9:22 AM, AJ  wrote:

I think it would be really cool to be able to rename a column, or, more
generally, a move command to move data from one column to another in the
same CF without the client having to read and resend the column value.  This
would be extremely powerful, imo.  I suspect the execution would be quick
and could even be made atomic (per node) as I suspect it would mostly entail
only reference updates.

Cassandra doesn't work like that. We would have no choice other than to read the
column and write it back with a different name


I figured as much :)  Not that bad though.


  (and it would not be atomic). So
the only win we would get from doing this server side would lie in not
transferring
the value across the network.



That would be the main benefit I think, esp with large values.


--
Sylvain





Command Request: rename a column

2011-07-08 Thread AJ


I think it would be really cool to be able to rename a column, or, more 
generally, a move command to move data from one column to another in the 
same CF without the client having to read and resend the column value.  
This would be *extremely* powerful, imo.  I suspect the execution would 
be quick and could even be made atomic (per node) as I suspect it would 
mostly entail only reference updates.  Has anything like this been 
discussed before?  Seems like such a natural operation for a 
hash-table-like data store.


aj


Re: Strong Consistency with ONE read/writes

2011-07-03 Thread AJ
We seem to be having a fundamental misunderstanding.  Thanks for your 
comments. aj


On 7/3/2011 8:28 PM, William Oberman wrote:
I'm using cassandra as a tool, like a black box with a certain 
contract to the world.  Without modifying the "core", C* will send the 
updates to all replicas, so your plan would cause the extra write (for 
the placeholder).  I wasn't assuming a modification to how C* 
fundamentally works.


Sounds like you are hacking (or at least looking) at the source, so 
all the power to you if/when you try these kind of changes.


will

On Sun, Jul 3, 2011 at 8:45 PM, AJ <a...@dude.podzone.net> wrote:


On 7/3/2011 6:32 PM, William Oberman wrote:

Was just going off of: " Send the value to the primary replica
and send placeholder values to the other replicas".  Sounded like
you wanted to write the value to one, and write the placeholder
to N-1 to me.


Yes, that is what I was suggesting.  The point of the placeholders
is to handle the crash case that I talked about... "like" a WAL does.



But, C* will propagate the value to N-1 eventually anyways,
'cause that's just what it does anyways :-)

will

On Sun, Jul 3, 2011 at 7:47 PM, AJ <a...@dude.podzone.net> wrote:

On 7/3/2011 3:49 PM, Will Oberman wrote:

Why not send the value itself instead of a placeholder?  Now
it takes 2x writes on a random node to do a single update
(write placeholder, write update) and N*x writes from the
client (write value, write placeholder to N-1). Where N is
replication factor.  Seems like extra network and IO instead
of less...


To send the value to each node is 1.) unnecessary, 2.) will
only cause a large burst of network traffic.  Think about if
it's a large data value, such as a document.  Just let C* do
its thing.  The extra messages are tiny and don't
significantly increase latency since they are all sent
asynchronously.



Of course, I still think this sounds like reimplementing
Cassandra internals in a Cassandra client (just guessing,
I'm not a cassandra dev)



I don't see how.  Maybe you should take a peek at the source.




On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:


Yang,

How would you deal with the problem when the 1st node
responds success but then crashes before completely
forwarding any replicas?  Then, after switching to the next
primary, a read would return stale data.

Here's a quick-n-dirty way:  Send the value to the primary
replica and send placeholder values to the other replicas. 
The placeholder value is something like, "PENDING_UPDATE". 
The placeholder values are sent with timestamps 1 less than

the timestamp for the actual value that went to the
primary.  Later, when the changes propagate, the actual
values will overwrite the placeholders.  In event of a
crash before the placeholder gets overwritten, the next
read value will tell the client so.  The client will report
to the user that the key/column is unavailable.  The
downside is you've overwritten your data and maybe would
like to know what the old data was!  But, maybe there's
another way using other columns or with MVCC.  The client
would want a success from the primary and the secondary
replicas to be certain of future read consistency in case
the primary goes down immediately as I said above.  The
ability to set an "update_pending" flag on any column value
would probably make this work.  But, I'll think more on
this later.

aj

On 7/2/2011 10:55 AM, Yang wrote:

there is a JIRA completed in 0.7.x that "Prefers" a
certain node in snitch, so this does roughly what you want
MOST of the time


but the problem is that it does not GUARANTEE that the
same node will always be read.  I recently read into the
HBase vs Cassandra comparison thread that started after
Facebook dropped Cassandra for their messaging system, and
understood some of the differences. what you want is
essentially what HBase does. the fundamental difference
there is really due to the gossip protocol: it's a
probabilistic, or eventually consistent failure detector
 while HBase/Google Bigtable use Zookeeper/Chubby to
provide a strong failure detector (a distributed lock).
 so in HBase, if a tablet server goes down, it really goes
down, it can not re-grab the tablet from the new tablet
server without going through a start up protocol
(notifying the m

Re: Strong Consistency with ONE read/writes

2011-07-03 Thread AJ

On 7/3/2011 6:32 PM, William Oberman wrote:
Was just going off of: " Send the value to the primary replica and 
send placeholder values to the other replicas".  Sounded like you 
wanted to write the value to one, and write the placeholder to N-1 to me.


Yes, that is what I was suggesting.  The point of the placeholders is to 
handle the crash case that I talked about... "like" a WAL does.


But, C* will propagate the value to N-1 eventually anyways, 'cause 
that's just what it does anyways :-)


will

On Sun, Jul 3, 2011 at 7:47 PM, AJ <a...@dude.podzone.net> wrote:


On 7/3/2011 3:49 PM, Will Oberman wrote:

Why not send the value itself instead of a placeholder?  Now it
takes 2x writes on a random node to do a single update (write
placeholder, write update) and N*x writes from the client (write
value, write placeholder to N-1). Where N is replication factor.
 Seems like extra network and IO instead of less...


To send the value to each node is 1.) unnecessary, 2.) will only
cause a large burst of network traffic.  Think about if it's a
large data value, such as a document.  Just let C* do its thing. 
The extra messages are tiny and don't significantly increase

latency since they are all sent asynchronously.



Of course, I still think this sounds like reimplementing
Cassandra internals in a Cassandra client (just guessing, I'm not
a cassandra dev)



I don't see how.  Maybe you should take a peek at the source.




On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:


Yang,

How would you deal with the problem when the 1st node responds
success but then crashes before completely forwarding any
replicas?  Then, after switching to the next primary, a read
would return stale data.

Here's a quick-n-dirty way:  Send the value to the primary
replica and send placeholder values to the other replicas.  The
placeholder value is something like, "PENDING_UPDATE".  The
placeholder values are sent with timestamps 1 less than the
timestamp for the actual value that went to the primary.  Later,
when the changes propagate, the actual values will overwrite the
placeholders.  In event of a crash before the placeholder gets
overwritten, the next read value will tell the client so.  The
client will report to the user that the key/column is
unavailable.  The downside is you've overwritten your data and
maybe would like to know what the old data was!  But, maybe
there's another way using other columns or with MVCC.  The
client would want a success from the primary and the secondary
replicas to be certain of future read consistency in case the
primary goes down immediately as I said above.  The ability to
set an "update_pending" flag on any column value would probably
make this work.  But, I'll think more on this later.

aj

On 7/2/2011 10:55 AM, Yang wrote:

there is a JIRA completed in 0.7.x that "Prefers" a certain
node in snitch, so this does roughly what you want MOST of the
time


but the problem is that it does not GUARANTEE that the same
node will always be read.  I recently read into the HBase vs
Cassandra comparison thread that started after Facebook dropped
Cassandra for their messaging system, and understood some of
the differences. what you want is essentially what HBase does.
the fundamental difference there is really due to the gossip
protocol: it's a probabilistic, or eventually consistent failure
detector  while HBase/Google Bigtable use Zookeeper/Chubby to
provide a strong failure detector (a distributed lock).  so in
HBase, if a tablet server goes down, it really goes down, it
can not re-grab the tablet from the new tablet server without
going through a start up protocol (notifying the master, which
would notify the clients etc),  in other words it is guaranteed
that one tablet is served by only one tablet server at any
given time.  in comparison the above JIRA only TRIES to serve
that key from one particular replica. HBase can have that
guarantee because the group membership is maintained by the
strong failure detector.

just for hacking curiosity, a strong failure detector +
Cassandra replicas is not impossible (actually seems not
difficult), although the performance is not clear. what would
such a strong failure detector bring to Cassandra besides this
ONE-ONE strong consistency ? that is an interesting question I
think.

considering that HBase has been deployed on big clusters, it is
probably OK with the performance of the strong  Zookeeper
failure detector. then a further question was: why did Dynamo
originally choose to use the probabilistic failure detector? ye

Re: Strong Consistency with ONE read/writes

2011-07-03 Thread AJ

On 7/3/2011 4:07 PM, Yang wrote:


I'm no expert, so addressing the question to me probably won't give you real 
answers :)


The single entry mode makes sure that all writes coming through the 
leader are received by replicas before acking to the client. There 
probably won't be stale data.




That doesn't sound any different than a TWO write.  I'm trying to save a 
hop (+ 1 data xfer) by ack'ing immediately after the primary 
successfully writes, i.e., ONE write.


On Jul 3, 2011 11:20 AM, "AJ" <a...@dude.podzone.net> wrote:

> Yang,
>
> How would you deal with the problem when the 1st node responds success
> but then crashes before completely forwarding any replicas? Then, after
> switching to the next primary, a read would return stale data.
>
> Here's a quick-n-dirty way: Send the value to the primary replica and
> send placeholder values to the other replicas. The placeholder value is
> something like, "PENDING_UPDATE". The placeholder values are sent with
> timestamps 1 less than the timestamp for the actual value that went to
> the primary. Later, when the changes propagate, the actual values will
> overwrite the placeholders. In event of a crash before the placeholder
> gets overwritten, the next read value will tell the client so. The
> client will report to the user that the key/column is unavailable. The
> downside is you've overwritten your data and maybe would like to know
> what the old data was! But, maybe there's another way using other
> columns or with MVCC. The client would want a success from the primary
> and the secondary replicas to be certain of future read consistency in
> case the primary goes down immediately as I said above. The ability to
> set an "update_pending" flag on any column value would probably make
> this work. But, I'll think more on this later.
>
> aj
>
> On 7/2/2011 10:55 AM, Yang wrote:
>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>> snitch, so this does roughly what you want MOST of the time
>>
>>
>> but the problem is that it does not GUARANTEE that the same node will
>> always be read. I recently read into the HBase vs Cassandra
>> comparison thread that started after Facebook dropped Cassandra for
>> their messaging system, and understood some of the differences. what
>> you want is essentially what HBase does. the fundamental difference
>> there is really due to the gossip protocol: it's a probabilistic, or
>> eventually consistent failure detector while HBase/Google Bigtable
>> use Zookeeper/Chubby to provide a strong failure detector (a
>> distributed lock). so in HBase, if a tablet server goes down, it
>> really goes down, it can not re-grab the tablet from the new tablet
>> server without going through a start up protocol (notifying the
>> master, which would notify the clients etc), in other words it is
>> guaranteed that one tablet is served by only one tablet server at any
>> given time. in comparison the above JIRA only TRIES to serve that
>> key from one particular replica. HBase can have that guarantee because
>> the group membership is maintained by the strong failure detector.
>>
>> just for hacking curiosity, a strong failure detector + Cassandra
>> replicas is not impossible (actually seems not difficult), although
>> the performance is not clear. what would such a strong failure
>> detector bring to Cassandra besides this ONE-ONE strong consistency ?
>> that is an interesting question I think.
>>
>> considering that HBase has been deployed on big clusters, it is
>> probably OK with the performance of the strong Zookeeper failure
>> detector. then a further question was: why did Dynamo originally
>> choose to use the probabilistic failure detector? yes Dynamo's main
>> theme is "eventually consistent", so the Phi-detector is **enough**,
>> but if a strong detector buys us more with little cost, wouldn't that
>> be great?
>>
>>
>>
>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
>>
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same
>> node from a client. It seems the only thing needed is to allow
>> the clients to compute which node is the closest replica for the
>> given key using the same algorithm C* uses. When the first
>> replica receives the write request, it will write to itself which
>> should complete before any of the other replicas and then return.
>> The loads should still stay balanced if u

Re: Strong Consistency with ONE read/writes

2011-07-03 Thread AJ

On 7/3/2011 3:49 PM, Will Oberman wrote:
Why not send the value itself instead of a placeholder?  Now it takes 
2x writes on a random node to do a single update (write placeholder, 
write update) and N*x writes from the client (write value, write 
placeholder to N-1). Where N is replication factor.  Seems like extra 
network and IO instead of less...


To send the value to each node is 1.) unnecessary, 2.) will only cause a 
large burst of network traffic.  Think about if it's a large data value, 
such as a document.  Just let C* do its thing.  The extra messages are 
tiny and don't significantly increase latency since they are all sent 
asynchronously.


Of course, I still think this sounds like reimplementing Cassandra 
internals in a Cassandra client (just guessing, I'm not a cassandra dev)




I don't see how.  Maybe you should take a peek at the source.



On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:



Yang,

How would you deal with the problem when the 1st node responds 
success but then crashes before completely forwarding any replicas?  
Then, after switching to the next primary, a read would return stale 
data.


Here's a quick-n-dirty way:  Send the value to the primary replica 
and send placeholder values to the other replicas.  The placeholder 
value is something like, "PENDING_UPDATE".  The placeholder values 
are sent with timestamps 1 less than the timestamp for the actual 
value that went to the primary.  Later, when the changes propagate, 
the actual values will overwrite the placeholders.  In event of a 
crash before the placeholder gets overwritten, the next read value 
will tell the client so.  The client will report to the user that the 
key/column is unavailable.  The downside is you've overwritten your 
data and maybe would like to know what the old data was!  But, maybe 
there's another way using other columns or with MVCC.  The client 
would want a success from the primary and the secondary replicas to 
be certain of future read consistency in case the primary goes down 
immediately as I said above.  The ability to set an "update_pending" 
flag on any column value would probably make this work.  But, I'll 
think more on this later.


aj

On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a certain node in 
snitch, so this does roughly what you want MOST of the time



but the problem is that it does not GUARANTEE that the same node 
will always be read.  I recently read into the HBase vs Cassandra 
comparison thread that started after Facebook dropped Cassandra for 
their messaging system, and understood some of the differences. what 
you want is essentially what HBase does. the fundamental difference 
there is really due to the gossip protocol: it's a probabilistic, or 
eventually consistent failure detector  while HBase/Google Bigtable 
use Zookeeper/Chubby to provide a strong failure detector (a 
distributed lock).  so in HBase, if a tablet server goes down, it 
really goes down, it can not re-grab the tablet from the new tablet 
server without going through a start up protocol (notifying the 
master, which would notify the clients etc),  in other words it is 
guaranteed that one tablet is served by only one tablet server at 
any given time.  in comparison the above JIRA only TRIES to serve 
that key from one particular replica. HBase can have that guarantee 
because the group membership is maintained by the strong failure 
detector.


just for hacking curiosity, a strong failure detector + Cassandra 
replicas is not impossible (actually seems not difficult), although 
the performance is not clear. what would such a strong failure 
detector bring to Cassandra besides this ONE-ONE strong consistency 
? that is an interesting question I think.


considering that HBase has been deployed on big clusters, it is 
probably OK with the performance of the strong  Zookeeper failure 
detector. then a further question was: why did Dynamo originally 
choose to use the probabilistic failure detector? yes Dynamo's main 
theme is "eventually consistent", so the Phi-detector is **enough**, 
but if a strong detector buys us more with little cost, wouldn't 
that  be great?




On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:


Is this possible?

All reads and writes for a given key will always go to the same
node from a client.  It seems the only thing needed is to allow
the clients to compute which node is the closest replica for the
given key using the same algorithm C* uses.  When the first
replica receives the write request, it will write to itself
which should complete before any of the other replicas and then
return.  The loads should still stay balanced if using random
partitioner.  If the first replica becomes unavailable (however
that is defin

Re: Strong Consistency with ONE read/writes

2011-07-03 Thread AJ

Yang,

How would you deal with the problem when the 1st node responds success 
but then crashes before completely forwarding any replicas?  Then, after 
switching to the next primary, a read would return stale data.


Here's a quick-n-dirty way:  Send the value to the primary replica and 
send placeholder values to the other replicas.  The placeholder value is 
something like, "PENDING_UPDATE".  The placeholder values are sent with 
timestamps 1 less than the timestamp for the actual value that went to 
the primary.  Later, when the changes propagate, the actual values will 
overwrite the placeholders.  In event of a crash before the placeholder 
gets overwritten, the next read value will tell the client so.  The 
client will report to the user that the key/column is unavailable.  The 
downside is you've overwritten your data and maybe would like to know 
what the old data was!  But, maybe there's another way using other 
columns or with MVCC.  The client would want a success from the primary 
and the secondary replicas to be certain of future read consistency in 
case the primary goes down immediately as I said above.  The ability to 
set an "update_pending" flag on any column value would probably make 
this work.  But, I'll think more on this later.
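
A tiny toy of the timestamp trick above (pure illustration, no Cassandra involved): because the placeholder is written with a timestamp one less than the real value, last-write-wins reconciliation always lets the real value replace it once it arrives:

    def reconcile(versions):
        # Cassandra-style last-write-wins: the highest timestamp wins.
        return max(versions, key=lambda v: v[1])[0]

    ts = 1309736400000000                    # arbitrary example timestamp
    replica = [("PENDING_UPDATE", ts - 1)]   # placeholder lands first
    print(reconcile(replica))                # -> PENDING_UPDATE (client reports "unavailable")
    replica.append(("actual value", ts))     # real value propagates later
    print(reconcile(replica))                # -> actual value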


aj

On 7/2/2011 10:55 AM, Yang wrote:
there is a JIRA completed in 0.7.x that "Prefers" a certain node in 
snitch, so this does roughly what you want MOST of the time



but the problem is that it does not GUARANTEE that the same node will 
always be read.  I recently read into the HBase vs Cassandra 
comparison thread that started after Facebook dropped Cassandra for 
their messaging system, and understood some of the differences. what 
you want is essentially what HBase does. the fundamental difference 
there is really due to the gossip protocol: it's a probabilistic, or 
eventually consistent failure detector  while HBase/Google Bigtable 
use Zookeeper/Chubby to provide a strong failure detector (a 
distributed lock).  so in HBase, if a tablet server goes down, it 
really goes down, it can not re-grab the tablet from the new tablet 
server without going through a start up protocol (notifying the 
master, which would notify the clients etc),  in other words it is 
guaranteed that one tablet is served by only one tablet server at any 
given time.  in comparison the above JIRA only TRIES to serve that 
key from one particular replica. HBase can have that guarantee because 
the group membership is maintained by the strong failure detector.


just for hacking curiosity, a strong failure detector + Cassandra 
replicas is not impossible (actually seems not difficult), although 
the performance is not clear. what would such a strong failure 
detector bring to Cassandra besides this ONE-ONE strong consistency ? 
that is an interesting question I think.


considering that HBase has been deployed on big clusters, it is 
probably OK with the performance of the strong  Zookeeper failure 
detector. then a further question was: why did Dynamo originally 
choose to use the probabilistic failure detector? yes Dynamo's main 
theme is "eventually consistent", so the Phi-detector is **enough**, 
but if a strong detector buys us more with little cost, wouldn't that 
 be great?




On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:


Is this possible?

All reads and writes for a given key will always go to the same
node from a client.  It seems the only thing needed is to allow
the clients to compute which node is the closest replica for the
given key using the same algorithm C* uses.  When the first
replica receives the write request, it will write to itself which
should complete before any of the other replicas and then return.
 The loads should still stay balanced if using random partitioner.
 If the first replica becomes unavailable (however that is
defined), then the clients can send to the next replica in the
ring and switch from ONE write/reads to QUORUM write/reads
temporarily until the first replica becomes available again.
 QUORUM is required since there could be some replicas that were
not updated after the first replica went down.

Will this work?  The goal is to have strong consistency with a
read/write consistency level as low as possible while secondarily
a network performance boost.






Re: Strong Consistency with ONE read/writes

2011-07-02 Thread AJ

On 7/2/2011 6:03 AM, William Oberman wrote:
Ok, I see the "you happen to choose the 'right' node" idea, but it 
sounds like you want to solve "C* problems" in the client, and they 
already wrote that complicated code to make clients simple.   You're 
talking about reimplementing key<->node mappings, network topology 
(with failures), etc...  Plus, if they change something about 
replication and you get too tricky, your code breaks.  Or, if they 
optimize something, you might not benefit.




I'm only asking if this is possible working within the current design 
and architecture and if not, then why.  I'm not interested in a hack; 
just exploring possibilities.




Re: Strong Consistency with ONE read/writes

2011-07-02 Thread AJ
Yang, you seem to understand all of the details, at least the details 
that have occurred to me, such as having a failure protocol rather than 
a perfect failure detector and new leader coordination.


I finally did some more reading outside of Cassandra space and realized 
HBase has what I was asking about.  If Cass could be flexible enough to 
allow such a setup without violating its goals, that would be great, imho.


This thread is just a brainstorming exploratory thread (by a non-expert) 
based on a simplistic observation that, if all clients went directly to 
the responsible replica every time, then performance and consistency can 
be increased by:


- providing guaranteed monotonic reads/writes consistency
- read-your-writes consistency
- higher performance (less latency)

all with only a read/write of ONE.

Basically, it's like a master/slave setup except that the slaves can 
take over as master, so you still have high availability.


I'm not saying it's easy and I'm only coming at this from a customer 
request point of view.  The question is, would this be useful if it 
could be added to Cass's bag of tricks?  Cass is already a hybrid.


aj

On 7/2/2011 1:57 PM, Yang wrote:


Jonathan:

could you please elaborate more on specifically why they are "not even 
close"?
 --- I kind of see what you mean (please correct me if I 
misunderstood): Cassandra failure detector
is consulted on every write; while HBase failure detector is only used 
when the tablet server joins or leaves.


 in order to have the single write entry point approach originally 
brought up in this thread,
I think you need a strong membership protocol to lock on the key range 
 leadership, once leadership is acquired,

failure detectors do not need to be consulted on every write.

yes by definition of the original requirement brought up in this thread,
Cassandra's write behavior is going to be changed, to be more like 
Hbase, and mongo in "replica set" mode. but
it seems that this leader mode can even co-exist with the multi-entry 
write mode that Cassandra uses now, just as
you can use different CL for each single write request.  in that case 
you would need to keep both the current lightweight Phi-detector

and add the ZK for leader election for single-entry mode write.

Thanks
Yang


(I should correct my terminology  it's not a "strong failure 
detector" that's needed, it's a "strong membership protocol". strongly 
complete and accurate failure detectors do not exist in
async distributed systems (Tushar Chandra  "Unreliable Failure 
Detectors for Reliable Distributed Systems, Journal of the ACM, 
43(2):225-267, 1996 <http://doi.acm.org/10.1145/226643.226647>"  and 
FLP "Impossibility of  Distributed Consensus with One Faulty Process 
<http://www.podc.org/influential/2001.html>" )  )



On Sat, Jul 2, 2011 at 10:11 AM, Jonathan Ellis <jbel...@gmail.com> wrote:


The way HBase uses ZK (for master election) is not even close to how
Cassandra uses the failure detector.

Using ZK for each operation would (a) not scale and (b) not work
cross-DC for any reasonable latency requirements.

On Sat, Jul 2, 2011 at 11:55 AM, Yang <tedd...@gmail.com> wrote:
> there is a JIRA completed in 0.7.x that "Prefers" a certain node
in snitch,
> so this does roughly what you want MOST of the time
>
> but the problem is that it does not GUARANTEE that the same node
will always
> be read.  I recently read into the HBase vs Cassandra comparison
thread that
> started after Facebook dropped Cassandra for their messaging
system, and
> understood some of the differences. what you want is essentially
what HBase
> does. the fundamental difference there is really due to the
gossip protocol:
> it's a probabilistic, or eventually consistent failure detector
 while
> HBase/Google Bigtable use Zookeeper/Chubby to provide a strong
failure
> detector (a distributed lock).  so in HBase, if a tablet server
goes down,
> it really goes down, it can not re-grab the tablet from the new
tablet
> server without going through a start up protocol (notifying the
master,
> which would notify the clients etc),  in other words it is
guaranteed that
> one tablet is served by only one tablet server at any given
time.  in
> comparison the above JIRA only TRIES to serve that key from one
particular
> replica. HBase can have that guarantee because the group
membership is
> maintained by the strong failure detector.
> just for hacking curiosity, a strong failure detector +
Cassandra replicas
> is not impossible (actually seems not difficult), although the
performance
> is not clear. wh

Re: Strong Consistency with ONE read/writes

2011-07-01 Thread AJ
I'm saying I will make my clients forward the C* requests to the first replica 
instead of forwarding to a random node.
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Will Oberman  wrote:



Sent from my iPhone

On Jul 1, 2011, at 9:53 PM, AJ  wrote:

> Is this possible?
>
> All reads and writes for a given key will always go to the same node
> from a client.

I don't think that's true. Given a key K, the client will write to N
nodes (N=replication factor). And at consistency level ONE the client
will return after 1 "ack" (from the N writes).



Strong Consistency with ONE read/writes

2011-07-01 Thread AJ

Is this possible?

All reads and writes for a given key will always go to the same node 
from a client.  It seems the only thing needed is to allow the clients 
to compute which node is the closest replica for the given key using the 
same algorithm C* uses.  When the first replica receives the write 
request, it will write to itself which should complete before any of the 
other replicas and then return.  The loads should still stay balanced if 
using random partitioner.  If the first replica becomes unavailable 
(however that is defined), then the clients can send to the next replica 
in the ring and switch from ONE write/reads to QUORUM write/reads 
temporarily until the first replica becomes available again.  QUORUM is 
required since there could be some replicas that were not updated after 
the first replica went down.


Will this work?  The goal is to have strong consistency with a 
read/write consistency level as low as possible while secondarily a 
network performance boost.
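
For what it's worth, the client-side half of this (computing which replica owns the key and sending the request straight to it) is what token-aware routing does in the modern DataStax Python driver; the failover-to-QUORUM half would still be application logic. A minimal sketch with hypothetical keyspace/table names, not the proposal itself:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

    cluster = Cluster(
        ['127.0.0.1'],
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
    )
    session = cluster.connect('ks')

    # Prepared statements carry partition-key metadata, so the driver can
    # route each execution to a replica that owns the key.
    insert = session.prepare("INSERT INTO users (id, name) VALUES (?, ?)")
    bound = insert.bind((42, 'alice'))
    bound.consistency_level = ConsistencyLevel.ONE
    session.execute(bound)
    cluster.shutdown()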


Re: Cassandra ACID

2011-07-01 Thread AJ

On 6/30/2011 1:57 PM, Jeremiah Jordan wrote:
For your Consistency case, it is actually an ALL read that is needed, 
not an ALL write.  An ALL read, with whatever consistency level of write 
you need (to support machines dying), is the only way to get 
consistent results in the face of a failed write at a consistency level 
above ONE that went to one node, but not the others.




True, an ALL read is the best and final test for consistency for that 
read.  I think an ALL write is more of a preemptive measure.  If you 
know you'll be needing consistency later, better to get it in while you 
can.  But, this leads to a whole other set of complex topics.  I like 
the flexibility, however.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that one 
specific row atomically.  If part of the single-key batch update fails, 
then all of the updates will be reverted since they all pertained to one 
key/row.  Notice, I said 'reverted' not 'rolled back'.  Note: atomicity 
and isolation are related to the topic of transactions but one does not 
imply the other.  Even though row updates are atomic, they are not 
isolated from other users' updates or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
Cassandra does not provide the same scope of Consistency as defined in 
the ACID standard.  Consistency in C* does not include referential 
integrity since C* is not a relational database.  Any referential 
integrity required would have to be handled by the client.  Also, even 
though the official docs say that QUORUM writes/reads is the minimal 
consistency_level setting to guarantee full consistency, this assumes 
that the write preceding the read does not fail (see comments below).  
What to do in this case is not fully understood by this author.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview
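
As a hedged illustration of the QUORUM point above (modern Python driver, hypothetical keyspace/table names): with RF=3, a QUORUM write (2 replicas) plus a QUORUM read (2 replicas) must overlap on at least one replica, which is where the "full consistency" claim comes from, provided the preceding write actually succeeded:

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('ks')

    write = SimpleStatement(
        "UPDATE accounts SET balance = 100 WHERE id = 42",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(write)

    read = SimpleStatement(
        "SELECT balance FROM accounts WHERE id = 42",
        consistency_level=ConsistencyLevel.QUORUM)
    row = session.execute(read).one()   # W + R > RF, so this sees the write
    cluster.shutdown()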

*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make sense 
depending on your application.  Note: this doesn't mean to say that two 
updates of the same column will be corrupted, obviously; columns are the 
smallest atomic unit ('atomic' in the more general thread-safe context).
Refs: None that directly address this explicitly and clearly and in one 
place.


*Durability*
Updates are made highly durable at the level comparable to a DBMS by the 
use of the commit log.  However, this requires "commitlog_sync: batch" 
in cassandra.yaml.  For "some" performance improvement with "some" cost 
in durability you can specify "commitlog_sync: periodic".  See 
discussion below for more details.

Refs: Plenty + this thread.




*From:* AJ [mailto:a...@dude.podzone.net]
*Sent:* Friday, June 24, 2011 11:28 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Cassandra ACID

Ok, here it is reworked; consider it a summary of the thread.  If I 
left out an important point that you think is 100% correct even if you 
already mentioned it, then make some noise about it and provide some 
evidence so it's captured sufficiently.  And, if you're in a debate, 
please try and get to a resolution; all will appreciate it.


It will be evident below that Consistency is not the only thing that 
is "tunable", at least indirectly.  Unfortunately, you still can't 
tunafish.  Ar ar ar.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that 
one specific row atomically.  If part of the single-key batch update 
fails, then all of the updates will be reverted since they all 
pertained to one key/row.  Notice, I said 'reverted' not 'rolled 
back'.  Note: atomicity and isolation are related to the topic of 
transactions but one does not imply the other.  Even though row 
updates are atomic, they are not isolated from other users' updates or 
reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
Cassandra does not provide the same scope of Consistency as defined in 
the ACID standard.  Consistency in C* does not include referential 
integrity since C* is not a relational database.  Any referential 
integrity required would have to be handled by the client.  Also, even 
though the official docs say that QUORUM writes/reads is the minimal 
consistency_level setting to guarantee full consistency, this assumes 
that the write preceding the read does not fail (see comments below).  
Therefore, an ALL write would be necessary prior to a QUORUM read of 
the same data.  For a mu

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

2011-06-30 Thread AJ

It would be helpful if this were automated somehow.


Re: No Transactions: An Example

2011-06-29 Thread AJ


On 6/22/2011 9:18 AM, Trevor Smith wrote:
Right -- that's the part that I am more interested in fleshing out in 
this post.




Here is one way.  Use MVCC 
<http://en.wikipedia.org/wiki/Multiversion_concurrency_control>.  A 
single global clean-up process would be acceptable since it's not a 
single point of failure, only a single point of accumulating back-logged 
work.  It will not affect availability as long as you are notified when 
that process terminates and restart it in a reasonable amount of time, 
and in any case it will not affect the validity of subsequent reads.


So, you would have a "balance" column.  And each update will create a 
"balance_" with a positive or negative value indicating a 
credit or debit.  Subsequent clients will read the latest value by doing 
a slice from "balance" to "balance_~" (i.e. all "balance*" columns).  
(You would have to work-out your column naming conventions so that your 
slices return only the pertinent columns.)  Then, the clients would have 
to apply all the credits and debits to the balance to get the current 
balance.


This handles the lost update problem.

For the dirty read and incorrect summary problems by others reading data 
that is in the middle of a transaction that hasn't committed yet, I 
would add a final transaction column to a Transactions CF.  The key 
would be <CF>.<key>.<column>, e.g., Accounts.1234.balance, 1234 being 
the account # and Accounts being the CF owning the balance column.  
Then, a new column would be added for each successful transaction (e.g., 
after debiting and crediting the two accounts) using the same timestamp 
used in balance_<timestamp>.  So, now, a client wanting the current 
balance would have to do a slice for all of the transactions for that 
column and only apply the balance updates up to the latest transaction.  
Note, you might have to do something else with the transaction naming 
schemes to make sure they are guaranteed to be unique, but you get the 
idea.  If the transaction fails, the client simply does not add a 
transaction column to Transactions and deletes any "balance_<timestamp>" 
columns it added to the Accounts CF (or lets the clean-up process do 
it... carefully).


This should avoid the need for locks and as long as each account doesn't 
have a crazy amount of updates, the slices shouldn't be so large as to 
be a significant perf hit.


A note about the updates.  You have to make sure the clean-up process 
processes the updates in order and only 1 time.  If you can't guarantee 
these, then you'll have to make sure your updates are idempotent and 
commutative.


Oh yeah, and you must use QUORUM read/writes, of course.
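
To make the read side concrete, here is a rough Python sketch of the 
balance reconstruction step.  Plain dicts stand in for the slice results 
(no real Cassandra client involved), and the column layout (a base 
"balance" column, "balance_<timestamp>" delta columns, and a set of 
committed transaction timestamps) is only my illustration of the scheme 
above, not an actual API.

# Rough sketch of the read path for the balance scheme described above.
# Plain dicts stand in for slice results; the column names are illustrative.

def current_balance(account_columns, committed_timestamps):
    """Apply committed credit/debit deltas on top of the base balance."""
    balance = account_columns["balance"]
    latest_committed = max(committed_timestamps, default=0)
    for name, value in account_columns.items():
        if not name.startswith("balance_"):
            continue
        ts = int(name[len("balance_"):])
        # Only apply deltas that belong to a committed transaction,
        # up to the latest committed transaction.
        if ts in committed_timestamps and ts <= latest_committed:
            balance += value
    return balance

# Base balance 100, a committed -25 debit, and an in-flight +40 credit
# whose transaction marker has not been written yet.
accounts_row = {"balance": 100, "balance_1001": -25, "balance_1002": 40}
transactions_row = {1001}   # committed transaction timestamps
print(current_balance(accounts_row, transactions_row))   # prints 75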

Any critiques?

aj


Re: Sharing Cassandra with Solandra

2011-06-28 Thread AJ

On 6/27/2011 3:39 PM, David Strauss wrote:

On Mon, 2011-06-27 at 15:06 -0600, AJ wrote:

Would anyone care to talk about their experiences with using Solandra
along side another application that uses Cassandra (also on the same
node)?  I'm curious about any resource contention issues or
compatibility between C* versions and Sol.  Also, I read the developer
somewhere say that you have to run Solandra on every C* node in the
ring.  I'm not sure if I interpreted that correctly.  Also, what's the
index size to data size ratio to expect (ballpark)?  How does it
perform?  Any caveats?

We're currently keeping the clusters separate at Pantheon Systems
because our core API (which runs on standard Cassandra) is often ready
for the next Cassandra version at a different time than Solandra.
Solandra recently gained dual 0.7/0.8 support, but we're still opting to
use the version on Cassandra that Solandra is primarily being built and
tested on (which is currently 0.8).


Thanks.  But, I'm finally cluing in that Solandra is also developed by 
DataStax, so I feel safer about future compatibility.


Re: Clock skew

2011-06-28 Thread AJ
Yikes!  I just read your blog Dominic.  Now I'm worried since my app was 
going to be mostly cloud-based.  But, you didn't mention anything about 
sleeping for 'max clock variance' after making the ntp-related config 
changes (maybe you haven't had the time to blog it).


I'm curious, do you think the sleep is required even in a 
non-virtualized environment?  Is it only needed when implementing some 
kind of lock?  Does the type of lock make a difference?


Thanks!
aj (the other one)

On 6/28/2011 11:31 AM, Dominic Williams wrote:

Hi, yes you are correct, and this is a potential problem.

IMPORTANT: If you need to serialize writes from your application 
servers, for example using distributed locking, then before releasing 
locks you must sleep for a period equal to the maximum variance 
between the clocks on your application server nodes.


I had a problem with the clocks on my nodes which led to all kinds of 
problems. There is a slightly out of date post, which may not 
mentioned the above point, on my experiences here 
http://ria101.wordpress.com/2011/02/08/cassandra-the-importance-of-system-clocks-avoiding-oom-and-how-to-escape-oom-meltdown/


Hope this helps
Dominic

On 27 June 2011 23:03, A J <s5a...@gmail.com> wrote:


During writes, the timestamp field in the column is the system-time of
that node (correct me if that is not the case and the system-time of
the co-ordinator is what gets applied to all the replicas).
During reads, the latest write wins.

What if there is a clock skew ? It could lead to a stale write
over-riding the actual latest write, just because the clock of that
node is ahead of the other node. Right ?






Re: Auto compaction to be staggered ?

2011-06-27 Thread AJ

On 6/27/2011 4:01 PM, A J wrote:

Is there an enhancement on the roadmap to stagger the auto compactions
on different nodes, to avoid more than one node compacting at any
given time (or as few nodes as possible to compact at any given time).
If not, any workarounds ?

Thanks.



+1.  I proposed the same in my *Ideas for Big Data Support* thread,

"5.)  Postponed Major Compactions:

The option to postpone auto-triggered major compactions until a 
pre-defined time of day or week or until staff can do it manually. "


aj


Sharing Cassandra with Solandra

2011-06-27 Thread AJ

Hi everyone,

Would anyone care to talk about their experiences with using Solandra 
along side another application that uses Cassandra (also on the same 
node)?  I'm curious about any resource contention issues or 
compatibility between C* versions and Sol.  Also, I read the developer 
somewhere say that you have to run Solandra on every C* node in the 
ring.  I'm not sure if I interpreted that correctly.  Also, what's the 
index size to data size ratio to expect (ballpark)?  How does it 
perform?  Any caveats?


Thanks!
aj


Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?

2011-06-25 Thread AJ

On 6/25/2011 8:24 AM, Edward Capriolo wrote:

I played around with the bakery algorithm and had ok success the
challenges are most implementations assume an n size array of fixed
clients and when you get it working it turns out to be a good number
of cassandra ops to acquire your bakery lock.



I was thinking rather than making certain that there is a column 
reserved for each node and having to keep it updated, you can just 
over-allocate a large number that would always be enough, like 100.  A slice 
of 100 byte-sized values shouldn't be a significant perf hit vs 3 or 4.  
If you only have 3 nodes in your cluster and the last 97 go unused, that 
would be ok; it would be as if those non-existent "customers" never take 
a number.


For optimizing for C*, I think you can get away with minimal getSlices 
for the loops.  If you're lucky, you can fall through all of them using 
the results from only 1 getSlice.  Only if a process is "entering" or 
has a higher priority number will you need to wait and then do another 
getSlice and only a slice for the remaining columns.  I think my logic 
is correct; do you agree?
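
To sketch what I mean about the slice count, here is a toy Python version 
of the bakery "may I enter?" check over an over-allocated slot array.  A 
slot is just an (entering, number) pair and a list stands in for the two 
column slices; there is no Cassandra I/O here and the slot layout is 
purely hypothetical.

# Toy version of the bakery check over an over-allocated slot array.
# A slot is an (entering, number) pair; unused slots stay (False, 0).

NUM_SLOTS = 100   # over-allocate so we never have to track cluster size

def can_enter(my_slot, my_number, slots):
    """True if a single snapshot of all slots already lets us take the lock.

    Ties are broken by slot index, as in Lamport's bakery algorithm.
    """
    for j, (entering, number) in enumerate(slots):
        if j == my_slot:
            continue
        if entering:
            return False                  # j is still picking its number
        if number != 0 and (number, j) < (my_number, my_slot):
            return False                  # j holds a smaller ticket
    return True

slots = [(False, 0)] * NUM_SLOTS
slots[3] = (False, 7)     # we are slot 3 holding ticket 7
slots[9] = (False, 12)    # another contender holding a larger ticket
print(can_enter(3, 7, slots))    # prints True: one slice read was enough

If the check returns False, you would re-read just the slots that blocked 
you, which matches the "only a slice for the remaining columns" point above.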


Did you have other problems other than performance?


Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?

2011-06-24 Thread AJ

On 6/24/2011 2:27 PM, Jonathan Ellis wrote:

Might be able to do it with
http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm.  "It is
remarkable that this algorithm is not built on top of some lower level
"atomic" operation, e.g. compare-and-swap."


This looks like it may work.  Jonathan, have you guru's discussed this 
algorithm before and come to a consensus on it by chance?


Re: Cassandra ACID

2011-06-24 Thread AJ
Ok, here it is reworked; consider it a summary of the thread.  If I left 
out an important point that you think is 100% correct even if you 
already mentioned it, then make some noise about it and provide some 
evidence so it's captured sufficiently.  And, if you're in a debate, 
please try and get to a resolution; all will appreciate it.


It will be evident below that Consistency is not the only thing that is 
"tunable", at least indirectly.  Unfortunately, you still can't 
tunafish.  Ar ar ar.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that one 
specific row atomically.  If part of the single-key batch update fails, 
then all of the updates will be reverted since they all pertained to one 
key/row.  Notice, I said 'reverted' not 'rolled back'.  Note: atomicity 
and isolation are related to the topic of transactions but one does not 
imply the other.  Even though row updates are atomic, they are not 
isolated from other users' updates or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
Cassandra does not provide the same scope of Consistency as defined in 
the ACID standard.  Consistency in C* does not include referential 
integrity since C* is not a relational database.  Any referential 
integrity required would have to be handled by the client.  Also, even 
though the official docs say that QUORUM writes/reads is the minimal 
consistency_level setting to guarantee full consistency, this assumes 
that the write preceding the read does not fail (see comments below).  
Therefore, an ALL write would be necessary prior to a QUORUM read of the 
same data.  For a multi-dc scenario use an ALL write followed by an 
EACH_QUORUM read.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview

*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make sense 
depending on your application.  Note: this doesn't mean to say that two 
updates of the same column will be corrupted, obviously; columns are the 
smallest atomic unit ('atomic' in the more general thread-safe context).
Refs: None that directly address this explicitly and clearly and in one 
place.


*Durability*
Updates are made highly durable at the level comparable to a DBMS by the 
use of the commit log.  However, this requires "commitlog_sync: batch" 
in cassandra.yaml.  For "some" performance improvement with "some" cost 
in durability you can specify "commitlog_sync: periodic".  See 
discussion below for more details.

Refs: Plenty + this thread.



On 6/24/2011 1:46 PM, Jim Newsham wrote:

On 6/23/2011 8:55 PM, AJ wrote:
Can any Cassandra contributors/guru's confirm my understanding of 
Cassandra's degree of support for the ACID properties?


I provide official references when known.  Please let me know if I 
missed some good official documentation.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch 
mutate for one specific key will apply updates to all the columns for 
that one specific row atomically.  If part of the single-key batch 
update fails, then all of the updates will be reverted since they all 
pertained to one key/row.  Notice, I said 'reverted' not 'rolled 
back'.  Note: atomicity and isolation are related to the topic of 
transactions but one does not imply the other.  Even though row 
updates are atomic, they are not isolated from other users' updates 
or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
If you want 100% consistency, use consistency level QUORUM for both 
reads and writes and EACH_QUORUM in a multi-dc scenario.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview



This is a pretty narrow interpretation of consistency.  In a 
traditional database, consistency prevents you from getting into a 
logically inconsistent state, where records in one table do not agree 
with records in another table.  This includes referential integrity, 
cascading deletes, etc.  It seems to me Cassandra has no support for 
this concept whatsoever.



*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make 
sense depending on your application.  Note: this doesn't mean to say 
that two updates of the same column will be corrupted, obviously; 
columns are the smallest atomic unit ('atomic' in the more general 
thread-safe context).

Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?

2011-06-24 Thread AJ

On 6/24/2011 2:27 PM, Jonathan Ellis wrote:

Might be able to do it with
http://en.wikipedia.org/wiki/Lamport%27s_bakery_algorithm.  "It is
remarkable that this algorithm is not built on top of some lower level
"atomic" operation, e.g. compare-and-swap."

I've been meaning to get back to reading that.  Thanks for the reminder 
Jonathan!


Re: Concurrency: Does C* support a Happened-Before relation between processes' writes?

2011-06-24 Thread AJ

On 6/24/2011 2:09 PM, Jim Newsham wrote:

On 6/24/2011 9:28 AM, Yang wrote:
without a clear description of your pseudo-code, it's difficult to 
say whether it will work.


but I think it can work fine as an election/agreement protocol, which 
you can use as a lock to some degree, but this requires
all the potential lock contenders to all participate, you can't grab 
a lock before everyone has voiced their vote yet


I agree with this statement.  I think the issue is that the timestamps 
are generated by the clients and their clocks may not be in sync, so 
write A from client A might arrive with timestamp T, and write B from 
client B may reach the node later in time, however it may have an 
earlier timestamp (T', where T' < T).  Client A may perform a read 
immediately after its write and notice that it was the only client to 
request a lock -- so it will assume it has acquired the lock.  After 
Client B's lock request, it will perform a read and observe that it 
has written the request with the earliest timestamp -- so it will also 
assume it has acquired the lock, which would result in a failure of 
the locking scheme.  If each client is required to wait for all other 
clients to "vote", then this issue goes away.




Yes, you both understand the problem.  Hopefully we can find a solution 
without relying on a hack and based on C* design that will be supported 
in the future.


I'll be thinking on this some more.  Thanks.


Concurrency: Does C* support a Happened-Before relation between processes' writes?

2011-06-24 Thread AJ
Sorry, I know this is long-winded but I just want to make sure before I 
go through the trouble to implement this since it's not something that 
can be reliably tested and requires in-depth knowledge about C* 
internals.  But, this ultimately deals with concurrency control so 
anyone interested in that subject may want to try and answer this.  Thanks!



I would like to know how to do a series of writes and reads such that I 
can tell definitively what process out of many was the first to create a 
unique flag column.


IOW, I would like to have multiple processes (clients) compete to see 
who is first to write a token column.  The tokens start with a known 
prefix, such as "Token_" followed by the name of the process that 
created it and a UUID so that all columns are guaranteed unique and 
don't get overwritten.  For example, Process A could create:


Token_ProcA_<UUID>

and process B would create:

Token_ProcB_<UUID>

These writes/reads are asynchronous between the two or more processes.  
After the two processes write their respective tokens, each will read 
back all columns named "Token_*" that exist (a slice).  They each do 
this in order to find out who "won".  The process that wrote the column 
with the lowest timestamp wins.  The purpose is to implement a lock.


I think all that is required is for the processes to use QUORUM 
read/writes to make sure the final read is consistent and will assure 
each process that it can rely on what's returned from the final read and 
that there isn't an earlier write floating around somewhere.  This is 
where the "happened-before" question comes in.  Is it possible that 
Process A which writes it's token with a lower timestamp (and should be 
the winner), that this write may not be seen by Process B when it does 
it's read (which is after it's token write and after Process A wrote 
it's token), and thus conclude incorrectly that itself (Process B) is 
the winner since it will not see Process A's token?  I'm 99% sure using 
QUORUM read/writes will allow this to work because that's the whole 
purpose, but I just wanted to double-check in case there's another 
detail I'm forgetting about C* that would defeat this.
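
For what it's worth, here is a toy Python sketch of the decision rule only 
(no Cassandra I/O and no QUORUM here): each process writes a uniquely named 
Token_* column with its timestamp, then reads the whole slice back and the 
lowest timestamp wins.  The name tiebreak is my own addition so the toy 
example stays deterministic if two timestamps ever collide.

import uuid

# Toy model of the token competition: a dict stands in for the row.
row = {}   # column name -> timestamp

def write_token(process_name, timestamp):
    name = "Token_%s_%s" % (process_name, uuid.uuid4())
    row[name] = timestamp
    return name

def winner(columns):
    # Lowest timestamp wins; the column name is only a deterministic tiebreak.
    return min(columns, key=lambda name: (columns[name], name))

write_token("ProcA", 1000)
write_token("ProcB", 1005)
print(winner(row).startswith("Token_ProcA_"))   # prints True: ProcA won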


Thanks!

P.S.  I realize this will cost me in performance, but this is only meant 
to be used on occasion.


Cassandra ACID

2011-06-23 Thread AJ
Can any Cassandra contributors/guru's confirm my understanding of 
Cassandra's degree of support for the ACID properties?


I provide official references when known.  Please let me know if I 
missed some good official documentation.


*Atomicity*
All individual writes are atomic at the row level.  So, a batch mutate 
for one specific key will apply updates to all the columns for that one 
specific row atomically.  If part of the single-key batch update fails, 
then all of the updates will be reverted since they all pertained to one 
key/row.  Notice, I said 'reverted' not 'rolled back'.  Note: atomicity 
and isolation are related to the topic of transactions but one does not 
imply the other.  Even though row updates are atomic, they are not 
isolated from other users' updates or reads.

Refs: http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

*Consistency*
If you want 100% consistency, use consistency level QUORUM for both 
reads and writes and EACH_QUORUM in a multi-dc scenario.

Refs: http://wiki.apache.org/cassandra/ArchitectureOverview

*Isolation*
NOTHING is isolated; because there is no transaction support in the 
first place.  This means that two or more clients can update the same 
row at the same time.  Their updates of the same or different columns 
may be interleaved and leave the row in a state that may not make sense 
depending on your application.  Note: this doesn't mean to say that two 
updates of the same column will be corrupted, obviously; columns are the 
smallest atomic unit ('atomic' in the more general thread-safe context).
Refs: None that directly address this explicitly and clearly and in one 
place.


*Durability*
Updates are made durable by the use of the commit log.  No worries here.
Refs: Plenty.


Re: Storing files in blob into Cassandra

2011-06-23 Thread AJ

On 6/22/2011 11:43 PM, Sasha Dolgy wrote:

maybe you want to spend a few minutes reading about Haystack over at
facebook to give you some ideas...

https://www.facebook.com/note.php?note_id=76191543919

Not saying what they've done is the right way... just sayin'


Thanks for the tip Sasha; will do.



Re: No Transactions: An Example

2011-06-23 Thread AJ

On 6/23/2011 7:37 AM, Trevor Smith wrote:

AJ,

Thanks for your input. I don't fully follow though how this would work 
with a bank scenario. Could you explain in more detail?


Thanks.

Trevor


I don't know yet.  I'll be researching that.  My working procedure is to 
figure out a way to handle each class of problem that ACID addresses and 
see if there is an acceptable way to compensate or manage it on the 
client or business side; following the ideas in the article.  I bet 
solutions exist somewhere.  In short, the developer needs to be fully 
versed in the potential problems that could arise and have ways to deal 
with it.  It's added responsibility for the developer, but if it keeps 
the infrastructure simple with reduced maintenance costs by not having 
to integrate another service such as ZK/Cages (as useful as they indeed 
are) then it may be worth it.  I'll let you know what I conclude.




Re: Storing files in blob into Cassandra

2011-06-22 Thread AJ

On 6/22/2011 1:07 AM, Damien Picard wrote:

Hi,

I have to store some files (Images, documents, etc.) for my users in a 
webapp. I use Cassandra for all of my data and I would like to know if 
this is a good idea to store these files into blob on a Cassandra CF ?
Is there some contraindications, or special things to know to achieve 
this ?


Thank you

--
Damien Picard
Axeiya Services : http://axeiya.com/
gwt-ckeditor : http://code.google.com/p/gwt-ckeditor/
Mon livre sur GWT : http://axeiya.com/index.php/ouvrage-gwt.html



I was thinking of doing the same thing.  But, to compensate for the 
bandwidth usage during the read, I was hoping to find a way for the 
httpd or app server to cache the file either in RAM or on disk so 
subsequent reads could just reference the in-mem cache or local hdd.  I 
have big data requirements, so duplicating the storage of file blobs by 
adding them to the hdd would almost double my storage requirements.  So, 
the hdd cache would have to be limited with the LRU removed periodically.


I was thinking about making the key for each file be a relative file 
path as if it were on disk.  This same path could also be used as its 
actual location on disk in the local disk cache.  Using a path as the 
key makes it flexible in many ways if I ever change my mind and want to 
store all files on disk, or when backing-up or archiving, etc..


But, I'm rusty on my apache http knowledge but I also thought there was 
an apache cache mod that would use both ram and disk depending on the 
frequency of use.  But, I don't know if you can tell it to "cache this 
blob like it's a file".
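
Just to sketch what I have in mind for the local cache (this is not an 
Apache module, only a toy Python illustration): blobs read from Cassandra 
get written under a cache root using their relative-path key, and an 
in-memory LRU index evicts the least recently used files once a size 
budget is exceeded.  The paths and sizes below are made up.

import os
from collections import OrderedDict

class DiskLRUCache:
    """Toy disk cache keyed by the same relative path used as the row key."""

    def __init__(self, cache_root, max_bytes):
        self.cache_root = cache_root
        self.max_bytes = max_bytes
        self.index = OrderedDict()   # relative path -> size in bytes

    def _full_path(self, rel_path):
        return os.path.join(self.cache_root, rel_path)

    def put(self, rel_path, blob):
        path = self._full_path(rel_path)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(blob)
        self.index[rel_path] = len(blob)
        self.index.move_to_end(rel_path)   # most recently used
        self._evict()

    def get(self, rel_path):
        if rel_path not in self.index:
            return None                    # caller falls back to Cassandra
        self.index.move_to_end(rel_path)
        with open(self._full_path(rel_path), "rb") as f:
            return f.read()

    def _evict(self):
        while sum(self.index.values()) > self.max_bytes:
            oldest, _ = self.index.popitem(last=False)
            os.remove(self._full_path(oldest))

cache = DiskLRUCache("/tmp/blob-cache", max_bytes=10 * 1024 * 1024)
cache.put("images/1234/avatar.jpg", b"jpeg bytes here")
print(cache.get("images/1234/avatar.jpg") is not None)   # prints True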


Just some thoughts.


Re: Is LOCAL_QUORUM as strong as QUORUM?

2011-06-22 Thread AJ

On 6/22/2011 8:20 PM, mcasandra wrote:

Well it depends on the requirements. If you use any combination of CL with
EACH_QUORUM it means you are accepting the fact that you are ok if one of
the DC is down. And in your scenario you care more about DCs being
consistent even if writes were to fail. Also you are ok with network
latency.

I think there is a broader design question here and you might be able to
solve it with LOCAL_QUORUM if you handled it at application or load
balancing layer. Is this active/active data center? What's your actual
requirements? Are these external clients that can go to any data center?

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-LOCAL-QUORUM-as-strong-as-QUORUM-tp6506592p6506937.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.



I require 3 (or more) geographically diverse dc's serving local users.  
The next arbitrarily closest dc will serve as a 1-replica fail-over for 
the previous dc in case it becomes unavailable altogether.  So, each dc is 
active for its locale and a failover for one of the others, like a 
daisy-chain configuration.  I was imagining a series of events where the 
primary dc gets updated at local_quorum, followed by that dc losing all 
network connectivity before the backup gets the change.  Then, the same 
user gets redirected to the backup dc and does a read at local_quorum 
and gets stale data.


But, I realize now if I substituted each_quorum for local_quorum for 
writes, then, in the case of fail-over, the writes would fail.  That's 
fine for consistency's sake, but is a high price to pay.  I have to 
think on this more and what I want.  Thanks for the help.


Re: Is LOCAL_QUORUM as strong as QUORUM?

2011-06-22 Thread AJ

On 6/22/2011 6:50 PM, AJ wrote:

On 6/22/2011 5:56 PM, mcasandra wrote:

LOCAL_QUORUM gurantees consistency in the local data center only. Other
replica nodes in the same DC and other DC not part of the QUORUM will be
eventually consistent. If you want to ensure consistency accross DCs 
you can
use EACH_QUORUM but keep in mind the latency involved assuming DCs 
are not

located within short distance.

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-LOCAL-QUORUM-as-strong-as-QUORUM-tp6506592p6506621.html
Sent from the cassandra-u...@incubator.apache.org mailing list 
archive at Nabble.com.




Thanks mcasandra.

I would like to know the minimal consistency_level to assure absolute 
consistency with a multiple data center setup for minimal latency.  
Just as quorum read/writes is the minimal needed to assure consistency 
with a single data center cluster, what is the equivalent read/write 
consistency_level value pair with a multi data center environment?


I'm thinking... writes at EACH_QUORUM and reads at LOCAL_QUORUM?  This 
will handle when a data center gets partitioned.   The write will fail 
if the dc's get partitioned.  If the partition happens after a 
successful write, then that's ok and a local quorum is all that's 
needed for a subsequent read that's consistent.


I meant to say "This will handle when *two or more data centers get* 
partitioned.  The write...".


Re: Is LOCAL_QUORUM as strong as QUORUM?

2011-06-22 Thread AJ

On 6/22/2011 5:56 PM, mcasandra wrote:

LOCAL_QUORUM gurantees consistency in the local data center only. Other
replica nodes in the same DC and other DC not part of the QUORUM will be
eventually consistent. If you want to ensure consistency accross DCs you can
use EACH_QUORUM but keep in mind the latency involved assuming DCs are not
located within short distance.

--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Is-LOCAL-QUORUM-as-strong-as-QUORUM-tp6506592p6506621.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.



Thanks mcasandra.

I would like to know the minimal consistency_level to assure absolute 
consistency with a multiple data center setup for minimal latency.  Just 
as quorum read/writes is the minimal needed to assure consistency with a 
single data center cluster, what is the equivalent read/write 
consistency_level value pair with a multi data center environment?


I'm thinking... writes at EACH_QUORUM and reads at LOCAL_QUORUM?  This 
will handle when a data center gets partitioned.   The write will fail 
if the dc's get partitioned.  If the partition happens after a 
successful write, then that's ok and a local quorum is all that's needed 
for a subsequent read that's consistent.
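
My reasoning boils down to the usual overlap rule, which is easier to see 
with a couple of numbers (toy Python; the per-DC framing and the RF values 
are just an example of what I described above):

# A read is guaranteed to see the latest successful write when the write
# and read replica counts overlap, i.e. W + R > N (per data center here).

def overlaps(n, w, r):
    return w + r > n

def quorum(n):
    return n // 2 + 1

# RF 3 in the local DC: an EACH_QUORUM write leaves 2 replicas here and a
# LOCAL_QUORUM read touches 2 of the 3, so they must overlap.
print(overlaps(3, quorum(3), quorum(3)))   # prints True: 2 + 2 > 3

# Fail-over case: RF 1 in the backup DC, the write may not have reached it
# yet (0 replicas there), and a LOCAL_QUORUM read touches 1 replica.
print(overlaps(1, 0, 1))                   # prints False: stale read possible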


Is LOCAL_QUORUM as strong as QUORUM?

2011-06-22 Thread AJ
Quorum read/writes guarantees consistency.  But, when a keyspace spans 
multiple data centers, does local quorum read/writes also guarantee 
consistency?  I'm thinking maybe not if two data centers get partitioned.


Thanks!


Re: Atomicity Strategies

2011-06-22 Thread AJ

Thanks Aaron!

On 6/22/2011 5:25 PM, aaron morton wrote:

Atomic on a single machine yes.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23 Jun 2011, at 09:42, AJ wrote:


On 4/9/2011 7:52 PM, aaron morton wrote:

My understanding of what they did with locking (based on the examples) was to achieve 
a level of transaction isolation 
http://en.wikipedia.org/wiki/Isolation_(database_systems)

I think the issue here is more about atomicity 
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

We cannot guarantee that all or none of the mutations in your batch are 
completed. There is some work in this area though 
https://issues.apache.org/jira/browse/CASSANDRA-1684


Just to be clear, you are speaking in the general sense, right?  The batch 
mutate link you provide says that in the case that ALL the mutates of the batch 
are for the SAME key (row), then the whole batch is atomic:

"As a special case, mutations against a single key are atomic but not 
isolated."

So, is it true that if I want to update multiple columns for one key, then it 
will be an all or nothing update for the whole batch if using batch update?  
But, if your batch mutate contains mutates for more than one key, then all the 
updates for one key will be atomic, then all the updates for the next 
key will be atomic, and so on.  Correct?

Thanks!







NTS Replication Strategy - only replicating to a subset of data centers

2011-06-22 Thread AJ
I'm just double-checking, but when using NTS, is it required to specify 
ALL the data centers in the strategy_options attribute?


IOW, I do NOT want replication to ALL data centers; only a two of the 
three.  So, if my property file snitch describes all of the existing 
data centers and nodes as:


$CASSANDRA_HOME/conf/cassandra-topology.properties:

# Cassandra Node IP=Data Center:Rack
175.1.1.1=DC1:RAC1
175.2.1.1=DC2:RAC1
175.3.1.1=DC3:RAC1
# default for unknown nodes
default=DC1:rac1

Can I specify strategy_options as:

strategy_options=[{DC1:2, DC2:1}]

and just leave out DC3 entirely?

If not, will setting the last one to 0 work?:

strategy_options=[{DC1:2, DC2:1, DC3:0}]


Thanks!


Re: No Transactions: An Example

2011-06-22 Thread AJ
I think Sasha's idea is worth studying more.  Here is a supporting read 
referenced in the O'Reilly Cassandra book that talks about alternatives 
to 2-phase commit and synchronous transactions:


http://www.eaipatterns.com/ramblings/18_starbucks.html

If it can be done without locks and the business can handle a rare 
incomplete transaction, then this might be acceptable.



On 6/22/2011 9:14 AM, Sasha Dolgy wrote:

I would still maintain a record of the transaction ... so that I can
do analysis post to determine if/when problems occurred ...

On Wed, Jun 22, 2011 at 4:31 PM, Trevor Smith  wrote:

Sasha,
How would you deal with a transfer between accounts in which only one half
of the operation was successfully completed?
Thank you.
Trevor




Re: Atomicity Strategies

2011-06-22 Thread AJ

On 4/9/2011 7:52 PM, aaron morton wrote:
My understanding of what they did with locking (based on the examples) 
was to achieve a level of transaction isolation 
http://en.wikipedia.org/wiki/Isolation_(database_systems) 



I think the issue here is more about atomicity 
http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic


We cannot guarantee that all or none of the mutations in your batch 
are completed. There is some work in this area though 
https://issues.apache.org/jira/browse/CASSANDRA-1684




Just to be clear, you are speaking in the general sense, right?  The 
batch mutate link you provide says that in the case that ALL the mutates 
of the batch are for the SAME key (row), then the whole batch is atomic:


"As a special case, mutations against a single key are atomic but 
not isolated."


So, is it true that if I want to update multiple columns for one key, 
then it will be an all or nothing update for the whole batch if using 
batch update?  But, if your batch mutate contains mutates for more than 
one key, then all the updates for one key will be atomic, then all the 
updates for the next key will be atomic, and so on.  Correct?
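
Here is how I picture it, as a throwaway Python sketch (plain tuples stand 
in for mutations, no client API implied): grouping the batch by row key 
shows which mutations share the single-row atomic guarantee and which ones 
can succeed or fail independently of each other.

from collections import defaultdict

batch = [
    ("user:1", "email", "a@example.com"),
    ("user:1", "name",  "AJ"),
    ("user:2", "email", "b@example.com"),
]

by_row = defaultdict(list)
for row_key, column, value in batch:
    by_row[row_key].append((column, value))

for row_key, mutations in by_row.items():
    # Each group applies atomically (though not isolated); a failure in
    # another group leaves this group's outcome unaffected.
    print(row_key, mutations)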


Thanks!



Re: Storing Accounting Data

2011-06-21 Thread AJ

On 6/21/2011 3:36 PM, Stephen Connolly wrote:


writes are not atomic.

the first side can succeed at quorum, and the second side can fail 
completely... you'll know it failed, but now what... you retry, still 
failed... erh I'll store it somewhere and retry it later... where do I 
store it?


the consistency level is about tuning whether reads and writes are 
replicated/checked across multiple of the replicates... but at any 
consistency level, each write will either succeed or fail _independently_


you could have one column family which is kind of like a transaction 
log, you write a json object of all the mutations you will make, then 
you go and make the mutations, when they succeed you write a completed 
column to the transaction log... them you can repeat that as often as need


you could have transactions posted as columns in a row, and to get the 
balance you iterate all the columns adding the +'s and -'s


by processing the transaction log, you could establish the highest 
complete timestamp, and add summary balance columns being the running 
total up to that point, so that you don't have to iterate everything


- Stephen



Yeah, it's all more than I want to do.  But, I just rediscovered 
Dominic's Cages <http://code.google.com/p/cages/>.   Has anyone tried it?


---
Sent from my Android phone, so random spelling mistakes, random 
nonsense words and other nonsense are a direct result of using swype 
to type on the screen


On 21 Jun 2011 22:04, "AJ" <a...@dude.podzone.net> wrote:




Re: Storing Accounting Data

2011-06-21 Thread AJ

On 6/21/2011 3:14 PM, Anand Somani wrote:
Not sure if it is that simple, a quorum can fail with writes happening 
on some nodes (there is no rollback). Also there is no concept of 
atomic compare-and-swap.




Good points.  I suppose what I need is for the client to implement the 
part of ACID that C* does not.  So, right off the bat, can anyone tell 
me if that is even possible conceptually?  If so, please throw out some 
terms that I can wiki and some Java API's would be much appreciated as 
well.  Also, can I accomplish this or make things easier by imposing 
some restrictions, such as only allowing single-user access to the data 
for certain operations?


Thanks!


Re: Storing Accounting Data

2011-06-21 Thread AJ
And I was thinking of using JTA for transaction processing.  I have no 
experience with it but on the surface it looks like it should work.


On 6/21/2011 3:31 PM, AJ wrote:

What's the best accepted way to handle that 100% in the client?  Retries?

On 6/21/2011 3:14 PM, Anand Somani wrote:
Not sure if it is that simple, a quorum can fail with writes 
happening on some nodes (there is no rollback). Also there is no 
concept of atomic compare-and-swap.


On Tue, Jun 21, 2011 at 2:03 PM, AJ <a...@dude.podzone.net> wrote:


On 6/21/2011 2:50 PM, Stephen Connolly wrote:


how important are things like transactional consistency for you?

would you have issues if only one side of a transfer was recorded?



Right.  Both of those questions are about consistency.  Isn't the
simple solution is to use QUORUM read/writes?


cassandra, out of the box, on it's own, would not be ideal if
the above two things are important for you.

you can add components to a system to help address these things,
eg zookeeper, etc. a reason why you moght do this is if you
already use cassandra in your app and are trying to limit the
number of databases

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random
nonsense words and other nonsense are a direct result of using
swype to type on the screen

    On 21 Jun 2011 18:30, "AJ" <a...@dude.podzone.net> wrote:









Re: Storing Accounting Data

2011-06-21 Thread AJ

What's the best accepted way to handle that 100% in the client?  Retries?

On 6/21/2011 3:14 PM, Anand Somani wrote:
Not sure if it is that simple, a quorum can fail with writes happening 
on some nodes (there is no rollback). Also there is no concept of 
atomic compare-and-swap.


On Tue, Jun 21, 2011 at 2:03 PM, AJ <a...@dude.podzone.net> wrote:


On 6/21/2011 2:50 PM, Stephen Connolly wrote:


how important are things like transactional consistency for you?

would you have issues if only one side of a transfer was recorded?



Right.  Both of those questions are about consistency.  Isn't the
simple solution is to use QUORUM read/writes?


cassandra, out of the box, on it's own, would not be ideal if the
above two things are important for you.

you can add components to a system to help address these things,
eg zookeeper, etc. a reason why you moght do this is if you
already use cassandra in your app and are trying to limit the
number of databases

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random
nonsense words and other nonsense are a direct result of using
swype to type on the screen

    On 21 Jun 2011 18:30, "AJ" <a...@dude.podzone.net> wrote:







Re: Storing Accounting Data

2011-06-21 Thread AJ

On 6/21/2011 2:50 PM, Stephen Connolly wrote:


how important are things like transactional consistency for you?

would you have issues if only one side of a transfer was recorded?



Right.  Both of those questions are about consistency.  Isn't the simple 
solution is to use QUORUM read/writes?


cassandra, out of the box, on it's own, would not be ideal if the 
above two things are important for you.


you can add components to a system to help address these things, eg 
zookeeper, etc. a reason why you moght do this is if you already use 
cassandra in your app and are trying to limit the number of databases


- Stephen

---
Sent from my Android phone, so random spelling mistakes, random 
nonsense words and other nonsense are a direct result of using swype 
to type on the screen


On 21 Jun 2011 18:30, "AJ" <a...@dude.podzone.net> wrote:




Storing Accounting Data

2011-06-21 Thread AJ
Is C* suitable for storing customer account (financial) data, as well as 
billing, payroll, etc?  This is a new company so migration is not an 
issue... starting from scratch.


Thanks!


Re: Docs: Token Selection

2011-06-17 Thread AJ

On 6/17/2011 1:27 PM, Sasha Dolgy wrote:

Replication factor is defined per keyspace if i'm not mistaken.  Can't
remember if NTS is per keyspace or per cluster ... if it's per
keyspace, that would be a way around it ... without having to maintain
multiple clusters  just have multiple keyspaces ...


Hey!  I think you're on to something Sasha!  Placement strat and RF are 
both defined per ks.  I'll let that stew for a while in my brain.


Thanks!


On Fri, Jun 17, 2011 at 9:23 PM, AJ  wrote:

On 6/17/2011 12:32 PM, Jeremiah Jordan wrote:

Run two clusters, one which has {DC1:2, DC2:1} and one which is
{DC1:1,DC2:2}.  You can't have both in the same cluster, otherwise it
isn't possible to tell where the data got written when you want to read
it.  For a given key "XYZ" you must be able to compute which nodes it is
stored on just using "XYZ", so a strategy where it is on nodes
DC1_1,DC1_2, and DC2_1 when a node in DC1 is the coordinator, and to
DC1_1, DC2_1 and DC2_2 when a node in DC2 is the coordinator won't work.
Given just "XYZ" I don't know where to look for the data.
But, from the way you describe what you want to happen, clients
from DC1 aren't using data inserted by clients from DC2, so you should
just make two different Cassandra clusters.  Once for the DC1 guys which
is {DC1:2, DC2:1} and one for the DC2 guys which is {DC1:1,DC2:2}.


Interesting.  Thx.









Re: Docs: Token Selection

2011-06-17 Thread AJ

On 6/17/2011 12:32 PM, Jeremiah Jordan wrote:

Run two clusters, one which has {DC1:2, DC2:1} and one which is
{DC1:1,DC2:2}.  You can't have both in the same cluster, otherwise it
isn't possible to tell where the data got written when you want to read
it.  For a given key "XYZ" you must be able to compute which nodes it is
stored on just using "XYZ", so a strategy where it is on nodes
DC1_1,DC1_2, and DC2_1 when a node in DC1 is the coordinator, and to
DC1_1, DC2_1 and DC2_2 when a node in DC2 is the coordinator won't work.
Given just "XYZ" I don't know where to look for the data.
But, from the way you describe what you want to happen, clients
from DC1 aren't using data inserted by clients from DC2, so you should
just make two different Cassandra clusters.  Once for the DC1 guys which
is {DC1:2, DC2:1} and one for the DC2 guys which is {DC1:1,DC2:2}.



Interesting.  Thx.



Re: Docs: Token Selection

2011-06-17 Thread AJ

On 6/17/2011 12:33 PM, Eric tamme wrote:


As i said previously, trying to build make cassandra treat things
differently based on some kind of persistent locality set it maintains
in memory .. or whatever .. sounds like you will be absolutely
undermining the core principles of how cassandra works.

-Eric



That sounds so funny because figuring out how Cassandra works is the 
hard part! ;o)  But, thanks for your explanation; it helps.


Re: Docs: Token Selection

2011-06-17 Thread AJ

Hi Jeremiah, can you give more details?

Thanks

On 6/17/2011 10:49 AM, Jeremiah Jordan wrote:

Run two Cassandra clusters...

-Original Message-
From: Eric tamme [mailto:eta...@gmail.com]
Sent: Friday, June 17, 2011 11:31 AM
To: user@cassandra.apache.org
Subject: Re: Docs: Token Selection


What I don't like about NTS is I would have to have more replicas than
I need.  {DC1=2, DC2=2}, RF=4 would be the minimum.  If I felt that 2
local replicas was insufficient, I'd have to move up to RF=6 which
seems like a waste... I'm predicting data in the TB range so I'm
trying to keep replicas to a minimum.

My goal is to have 2-3 replicas in a local data center and 1 replica
in another dc.  I think that would be enough barring a major
catastrophe.  But, I'm not sure this is possible.  I define "local" as
in the same data center as the client doing the insert/update.

Yes, not being able to configure the replication factor differently for each 
data center is a bit annoying.  Im assuming you basically want DC1 to have a 
replication factor of {DC1:2, DC2:1} and DC2 to have {DC1:1,DC2:2}.

I would very much like that feature as well, but I dont know the feasibility of 
it.

-Eric





Re: Docs: Token Selection

2011-06-17 Thread AJ

On 6/17/2011 10:31 AM, Eric tamme wrote:

What I don't like about NTS is I would have to have more replicas than I
need.  {DC1=2, DC2=2}, RF=4 would be the minimum.  If I felt that 2 local
replicas was insufficient, I'd have to move up to RF=6 which seems like a
waste... I'm predicting data in the TB range so I'm trying to keep replicas
to a minimum.

My goal is to have 2-3 replicas in a local data center and 1 replica in
another dc.  I think that would be enough barring a major catastrophe.  But,
I'm not sure this is possible.  I define "local" as in the same data center
as the client doing the insert/update.

Yes, not being able to configure the replication factor differently
for each data center is a bit annoying.  Im assuming you basically
want DC1 to have a replication factor of {DC1:2, DC2:1} and DC2 to
have {DC1:1,DC2:2}.


Yes.  But, the more I think about it, the more I see issues.  Here is 
what I envision (Issues marked with *):


Three or more dc's, each serving as fail-overs for the others with 1 
maximum unavailable dc supported at a time.

Each dc is a production dc serving users that I choose.
Each dc also stores 0-1 replicas from the other dc's.
Direct customers to their "home" dc of my choice.
Data coming from the client local to the dc is replicated X times in the 
local dc and 1 time in any other dc (randomly).
In the event a dc becomes unreachable by users, an arbitrary fail-over dc 
can serve their requests albeit with increased latency.
*There will only be 1 replica left amongst the remaining fail-over dc's, 
so this could be a problem depending on the CL used other than CL.ONE.
*During the fail-over state, the cluster needs to know that the real 
"home" of the replicas belongs to the currently unavailable dc.  But, as 
of now, I don't think that's possible and so new writes will start to be 
replicated in the current dc as if the currently-used fail-over dc is 
the home dc.


Maybe these goals can be achieve with a kind of ordered asymmetrical 
replication strategy like you illustrated above.  The hard part will be 
to figure out a simple and elegant way to do this w/o undermining C*.




I would very much like that feature as well, but I dont know the
feasibility of it.

-Eric





Re: Docs: Token Selection

2011-06-17 Thread AJ
+1  Yes, that is what I'm talking about Eric.  Maybe I could write my 
own strategy, I dunno.  I'll have to understand more first.


On 6/17/2011 10:37 AM, Sasha Dolgy wrote:

+1 for this if it is possible...

On Fri, Jun 17, 2011 at 6:31 PM, Eric tamme  wrote:

What I don't like about NTS is I would have to have more replicas than I
need.  {DC1=2, DC2=2}, RF=4 would be the minimum.  If I felt that 2 local
replicas was insufficient, I'd have to move up to RF=6 which seems like a
waste... I'm predicting data in the TB range so I'm trying to keep replicas
to a minimum.

My goal is to have 2-3 replicas in a local data center and 1 replica in
another dc.  I think that would be enough barring a major catastrophe.  But,
I'm not sure this is possible.  I define "local" as in the same data center
as the client doing the insert/update.

Yes, not being able to configure the replication factor differently
for each data center is a bit annoying.  Im assuming you basically
want DC1 to have a replication factor of {DC1:2, DC2:1} and DC2 to
have {DC1:1,DC2:2}.

I would very much like that feature as well, but I dont know the
feasibility of it.

-Eric




Re: Docs: Token Selection

2011-06-17 Thread AJ

On 6/17/2011 7:26 AM, William Oberman wrote:
I haven't done it yet, but when I researched how to make 
geo-diverse/failover DCs, I figured I'd have to do something like 
RF=6, strategy = {DC1=3, DC2=3}, and LOCAL_QUORUM for reads/writes. 
 This gives you an "ack" after 2 local nodes do the read/write, but 
the data eventually gets distributed to the other DC for a full 
failover.  No "ying-yang", but I believe accomplishes the same goal?


will


What I don't like about NTS is I would have to have more replicas than I 
need.  {DC1=2, DC2=2}, RF=4 would be the minimum.  If I felt that 2 
local replicas was insufficient, I'd have to move up to RF=6 which seems 
like a waste... I'm predicting data in the TB range so I'm trying to 
keep replicas to a minimum.


My goal is to have 2-3 replicas in a local data center and 1 replica in 
another dc.  I think that would be enough barring a major catastrophe.  
But, I'm not sure this is possible.  I define "local" as in the same 
data center as the client doing the insert/update.




Re: Docs: Token Selection

2011-06-17 Thread AJ
Thanks Jonathan.  I assumed since each data center owned the full key 
space that the first replica would be stored in the dc of the 
coordinating node, the 2nd in another dc, and the 3rd+ back in the 1st 
dc.  But, are you saying that the first endpoint is selected regardless 
of the location of the coordinating node?  Are you saying that the 
starting endpoint is the one closest to the row token regardless of the 
dc?  So, it is possible that a replica might not even get stored in the 
dc of the coordinator at all depending on how many dc's there are, rf, 
starting token assignments, etc.?



On 6/17/2011 12:20 AM, Jonathan Ellis wrote:

Replication location is determined by the row key, not the location of
the client that inserted it.  (Otherwise, without knowing what DC a
row was inserted in, you couldn't look it up to read it!)

On Fri, Jun 17, 2011 at 12:20 AM, AJ  wrote:

On 6/16/2011 9:45 PM, aaron morton wrote:

But, I'm thinking about using OldNetworkTopStrat.

NetworkTopologyStrategy is where it's at.

Oh yeah?  It didn't look like it would serve my requirements.  I want 2 full
production geo-diverse data centers with each serving as a failover for the
other.  Random Partitioner.  Each dc holds 2 replicas from the local clients
and 1 replica goes to the other dc.  It doesn't look like I can do a
ying-yang setup like that with NTS.  Am I wrong?


A
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com









Re: Docs: Token Selection

2011-06-16 Thread AJ

On 6/16/2011 9:45 PM, aaron morton wrote:

But, I'm thinking about using OldNetworkTopStrat.

NetworkTopologyStrategy is where it's at.


Oh yeah?  It didn't look like it would serve my requirements.  I want 2 
full production geo-diverse data centers with each serving as a failover 
for the other.  Random Partitioner.  Each dc holds 2 replicas from the 
local clients and 1 replica goes to the other dc.  It doesn't look like 
I can do a ying-yang setup like that with NTS.  Am I wrong?



A
-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com




Re: Propose new ConsistencyLevel.ALL_AVAIL for reads

2011-06-16 Thread AJ

On 6/16/2011 9:36 PM, Dan Hendry wrote:
"Help me out here.  I'm trying to visualize a situation where the 
clients can access all the C* nodes but the nodes can't access each 
other.  I don't see how that can happen on a regular ethernet subnet 
in one data center.  Well, I'm sure there is a case that you can point 
out.  Ok, I will concede that this is an issue for some network 
configurations."


First rule of designing/developing/operating distributed systems: 
assume anything and everything can and will happen, regardless of 
network configuration or hardware.


This specific situation actually HAS happened to me. Our Cassandra 
nodes accept client connections on one ethernet interface on one 
network (the production network) yet communicate with each other on a 
separate ethernet interface on a separate network which is Cassandra 
specific. This was done mainly due to the relatively large inter-node 
Cassandra bandwidth requirements in comparison to client bandwidth 
requirements. At one point, the switch for the cassandra network went 
down so clients could connect yet the cassandra nodes could not talk 
to eachother. (We write at ONE and read at ALL so everything behaved 
as expected).




Funny, but that's the exact same setup I'm running.  But, I'm not a 
network guy and kind of assumed it wasn't so typical.  Plus, lately I've 
had my mind on a cloud setup.


Re: Propose new ConsistencyLevel.ALL_AVAIL for reads

2011-06-16 Thread AJ

On 6/16/2011 7:56 PM, Dan Hendry wrote:
How would your solution deal with complete network partitions? A node 
being 'down' does not actually mean it is dead, just that it is 
unreachable from whatever is making the decision to mark it 'down'.


Following from Ryan's example, consider nodes A, B, and C but within a 
fully partitioned network: all of the nodes are up but each thinks all 
the others are down. Your ALL_AVAILABLE consistency level would boil 
down to consistency level ONE for clients connecting to any of the 
nodes. If I connect to A, it thinks it is the last one standing and 
translates 'ALL_AVALIABLE' into 'ONE'. Based on your logic, two 
clients connecting to two different nodes could each modify a value 
then read it, thinking that its 100% consistent yet it is 
actually *completely* inconsistent with the value on other node(s).


Help me out here.  I'm trying to visualize a situation where the clients 
can access all the C* nodes but the nodes can't access each other.  I 
don't see how that can happen on a regular ethernet subnet in one data 
center.  Well, I'm sure there is a case that you can point out.  Ok, I 
will concede that this is an issue for some network configurations.


I suggest you review the principles of the infamous CAP theorem. The 
consistency levels as the stand now, allow for an explicit trade off 
between 'available and partition tolerant' (ONE read/write) OR 
'consistent and available' (QUORUM read/write). Your solution achieves 
only availability and can guarantee neither consistency nor partition 
tolerance.


It looks like CAP may triumph again.  Thanks for the exercise Dan and Ryan.


Re: Propose new ConsistencyLevel.ALL_AVAIL for reads

2011-06-16 Thread AJ

UPDATE to my suggestion is below.



On 6/16/2011 5:50 PM, Ryan King wrote:

On Thu, Jun 16, 2011 at 2:12 PM, AJ  wrote:

On 6/16/2011 2:37 PM, Ryan King wrote:

On Thu, Jun 16, 2011 at 1:05 PM, AJ wrote:


The Cassandra consistency model is pretty elegant and this type of
approach breaks that elegance in many ways. It would also only really be
useful when the value has a high probability of being updated between a
node
going down and the value being read.

I'm not sure what you mean.  A node can be down for days during which
time
the value can be updated.  The intention is to use the nodes available
even
if they fall below the RF.  If there is only 1 node available for
accepting
a replica, that should be enough given the conditions I stated and
updated
below.

If this is your constraint, then you should just use CL.ONE.


My constraint is a CL = "All Available".  So, CL.ONE will not work.

That's a solution, not a requirement. What's your requirement?


Ok.  And this updates my suggestion removing the need for ALL_AVAIL.  
This adds logic to cope with unavailable nodes and still achieve 
consistency for a specific situation.


The general requirement is to completely eliminate read failures for 
reads specifying CL = ALL for values that have been subject to a 
specific data update pattern.  The specific data update pattern consists 
of a value that has been updated (or added) in the face of one or more, 
but less than R, unavailable replica nodes (at least 1 replica node is 
available).  If a particular data value (column value) is updated after 
the most recent node went down, this implies the new value is independent of any 
replica values that are currently unavailable.  Therefore, in this 
situation, the number of available replicas is irrelevant.  After 
querying all *available* replica nodes, the value with the latest 
timestamp is consistent if that timestamp is > the time at which the last 
replica node became unavailable.
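
As a minimal Python sketch of that check (the function and field names are 
hypothetical, not Cassandra internals): given the replies from the 
reachable replicas and the times at which the unreachable replicas were 
marked down, the freshest reply can be declared consistent only if its 
timestamp is newer than the last of those down times.

def read_all_available(replies, down_since):
    """replies: {node: (timestamp, value)} from reachable replicas.
    down_since: {node: time_marked_down} for unreachable replicas."""
    if not replies:
        raise RuntimeError("no replicas available")
    latest_ts, latest_value = max(replies.values())
    last_down = max(down_since.values(), default=0)
    if latest_ts > last_down:
        return latest_value   # written after the last replica went down
    raise RuntimeError("cannot guarantee consistency; value may be stale")

replies = {"nodeA": (1500, "v2"), "nodeB": (1400, "v1")}
down_since = {"nodeC": 1450}
print(read_all_available(replies, down_since))   # prints v2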





Well, theoretically, of course; that's the nature of distributed systems.
  But, Cass does indeed make that determination when it counts the number
of available replica nodes before it decides if enough replica nodes are
available.  But, this is obvious to you I'm sure so maybe I don't understand
your statement.

Consider this scenario: given nodes, A, B and C and A thinks C is down
but B thinks C is up. What do you do? Remember, A doesn't know that B
thinks C is up, it only knows its own state.



What kind of network configuration would have this kind of scenario?  
This method only applies within a data center, which should be OK since 
other replication across data centers seems to be mostly for fault 
tolerance... but I will have to think about this.



-ryan





Re: Propose new ConsistencyLevel.ALL_AVAIL for reads

2011-06-16 Thread AJ

On 6/16/2011 2:37 PM, Ryan King wrote:

On Thu, Jun 16, 2011 at 1:05 PM, AJ  wrote:





The Cassandra consistency model is pretty elegant and this type of
approach breaks that elegance in many ways. It would also only really be
useful when the value has a high probability of being updated between a node
going down and the value being read.

I'm not sure what you mean.  A node can be down for days during which time
the value can be updated.  The intention is to use the nodes available even
if they fall below the RF.  If there is only 1 node available for accepting
a replica, that should be enough given the conditions I stated and updated
below.

If this is your constraint, then you should just use CL.ONE.


My constraint is a CL = "All Available".  So, CL.ONE will not work.

Perhaps the simpler approach which is fairly trivial and does not require
any Cassandra change is to simply downgrade your read from ALL to QUORUM
when you get an unavailable exception for this particular read.

It's not so trivial, esp since you would have to build that into your client
at many levels.  I think it would be more appropriate (if this idea
survives) to put it into Cass.

I think the general answerer for 'maximum consistency' is QUORUM
reads/writes. Based on the fact you are using CL=ALL for reads I assume you
are using CL=ONE for writes: this itself strikes me as a bad idea if you
require 'maximum consistency for one critical operation'.


Very true.  Specifying quorum for BOTH reads/writes provides the 100%
consistency because of the overlapping of the availability numbers.  But,
only if the # of available nodes is not < RF.

No, it will work as long as the available nodes is >= RF/2 + 1
Yes, that's what I meant.  Sorry for any confusion.  Restated: But, only 
if the # of available nodes is not < RF/2 + 1.

Upon further reflection, this idea can be used for any consistency level.
  The general thrust of my argument is:  If a particular value can be
overwritten by one process regardless of its prior value, then that implies
that the value in the down node is no longer up-to-date and can be
disregarded.  Just work with the nodes that are available.

Actually, now that I think about it...

ALL_AVAIL guarantees 100% consistency iff the latest timestamp of the value > 
latest unavailability time of all unavailable replica nodes for that 
value's row key.  Unavailable is defined as a node's Cass process that is 
not reachable from ANY node in the cluster in the same data center.  If the
node in question is available to at least one node, then the read should
fail as there is a possibility that the value could have been updated some
other way.

Node A can't reliably and consistently know  whether node B and node C
can communicate.
Well, theoretically, of course; that's the nature of distributed 
systems.  But, Cass does indeed make that determination when it counts 
the number available replica nodes before it decides if enough replica 
nodes are available.  But, this is obvious to you I'm sure so maybe I 
don't understand your statement.

After looking at the code, it doesn't look like it will be difficult.
  Instead of skipping the request for values from the nodes when CL nodes
aren't available, it would have to go ahead and request the values from the
available nodes as usual and then look at the timestamps which it does
anyways and compare it to the latest unavailability time of the relevant
replica nodes.  The code that keeps track of what nodes are down simply
records the time it went down.  But, I've only been looking at the code for
a few days so I'm not claiming to know everything by any stretch.

-ryan





Re: Propose new ConsistencyLevel.ALL_AVAIL for reads

2011-06-16 Thread AJ

On 6/16/2011 10:58 AM, Dan Hendry wrote:

I think this would add a lot of complexity behind the scenes and be 
conceptually confusing, particularly for new users.
I'm not so sure about this.  Cass is already somewhat sophisticated and 
I don't see how this could trip-up anyone who can already grasp the 
basics.  The only thing I am adding to the CL concept is the concept of 
available replication nodes, versus total replication nodes.  But, don't 
forget; a competitor to Cass is probably in the works this very minute 
so constant improvement is a good thing.

The Cassandra consistency model is pretty elegant and this type of approach 
breaks that elegance in many ways. It would also only really be useful when the 
value has a high probability of being updated between a node going down and the 
value being read.
I'm not sure what you mean.  A node can be down for days during which 
time the value can be updated.  The intention is to use the nodes 
available even if they fall below the RF.  If there is only 1 node 
available for accepting a replica, that should be enough given the 
conditions I stated and updated below.

Perhaps the simpler approach which is fairly trivial and does not require any 
Cassandra change is to simply downgrade your read from ALL to QUORUM when you 
get an unavailable exception for this particular read.
It's not so trivial, esp since you would have to build that into your 
client at many levels.  I think it would be more appropriate (if this 
idea survives) to put it into Cass.

I think the general answer for 'maximum consistency' is QUORUM reads/writes. 
Based on the fact you are using CL=ALL for reads I assume you are using CL=ONE 
for writes: this itself strikes me as a bad idea if you require 'maximum 
consistency for one critical operation'.

Very true.  Specifying quorum for BOTH reads/writes provides the 100% 
consistency because of the overlapping of the availability numbers.  
But, only if the # of available nodes is not < RF.


Upon further reflection, this idea can be used for any consistency 
level.  The general thrust of my argument is:  If a particular value can 
be overwritten by one process regardless of its prior value, then that 
implies that the value in the down node is no longer up-to-date and can 
be disregarded.  Just work with the nodes that are available.


Actually, now that I think about it...

ALL_AVAIL guarantees 100% consistency iff the latest timestamp of the 
value > latest unavailability time of all unavailable replica nodes for 
that value's row key.  Unavailable is defined as a node's Cass process 
that is not reachable from ANY node in the cluster in the same data 
center.  If the node in question is available to at least one node, then 
the read should fail as there is a possibility that the value could have 
been updated some other way.


After looking at the code, it doesn't look like it will be difficult.  
Instead of skipping the request for values from the nodes when CL nodes 
aren't available, it would have to go ahead and request the values from 
the available nodes as usual and then look at the timestamps which it 
does anyways and compare it to the latest unavailability time of the 
relevant replica nodes.  The code that keeps track of what nodes are 
down simply records the time it went down.  But, I've only been looking 
at the code for a few days so I'm not claiming to know everything by any 
stretch.
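
To make that concrete, here is a rough sketch of the check I have in mind, with 
made-up names (plain Python, not the actual Cassandra read path):

    # Proposed ALL_AVAIL rule: accept the read iff the newest value seen on the
    # live replicas is newer than the moment every dead replica was last seen up.
    def all_avail_read_ok(live_value_timestamps, dead_replica_down_times):
        if not live_value_timestamps:
            return False                  # nothing available to read from
        if not dead_replica_down_times:
            return True                   # all replicas up; same as CL.ALL
        return max(live_value_timestamps) > min(dead_replica_down_times)

    # Example: the value was written after the only dead node went down -> OK.
    print(all_avail_read_ok([1308200000], [1308100000]))   # True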



Dan



Thanks for your reply.  I still welcome critiques.


Re: Propose new ConsistencyLevel.ALL_AVAIL for reads

2011-06-16 Thread AJ

On 6/16/2011 10:05 AM, Ryan King wrote:


I don't think this buys you anything that you can't get with quorum
reads and writes.

-ryan



QUORUM <= ALL_AVAIL <= ALL == RF


Propose new ConsistencyLevel.ALL_AVAIL for reads

2011-06-16 Thread AJ

Good morning all.

Hypothetical Setup:
1 data center
RF = 3
Total nodes > 3

Problem:
Suppose I need maximum consistency for one critical operation; thus I 
specify CL = ALL for reads.  However, this will fail if only 1 replica 
endpoint is down.  I don't see why this failure is necessary all of the 
time since the data could have been updated since the node became 
unavailable and its data is old anyway.  If only one node goes down 
and it has the key I need, then the app is not 100% available and it 
could take some time making the node available again.


Proposal:
If all of the *available* replica nodes answer the read operation and 
the latest value timestamp is clearly AFTER the time the down node 
became unavailable, then this situation can meet the requirements for 
*near* 100% consistency since the value in the down node would be 
outdated anyway.  Clearly, the value was updated some time *after* the 
node went down or unavailable.  This way, you can have max availability 
when using read with CL.ALL... or something CL close in meaning to ALL.


I say "near" 100% consistency to leave room for some situation where the 
unavailable node was only unavailable to the coordinating node for some 
reason such as a network issue and thus still received an update by some 
other route after it "appeared" unavailable to the current coordinating 
node.  In a situation like this, there is a chance the read will still 
not return the latest value.  So, this will not be truly 100% consistent 
which CL.ALL guarantees.  However, I think this logic could justify a 
new consistency level slightly lower than ALL, such as ALL_AVAIL.


What do you think?  Is my logic correct?  Is there a conflict with the 
architecture or base principles?  This fits with the tunable consistency 
principle for sure.


Thanks for listening




Re: Docs: Token Selection

2011-06-16 Thread AJ
Thanks Eric!  I've finally got it!  I feel like I've just been initiated 
or something by discovering this "secret".  I kid!


But, I'm thinking about using OldNetworkTopStrat.  Do you, or anyone 
else, know if the same rules for token assignment applies to ONTS?



On 6/16/2011 7:21 AM, Eric tamme wrote:

AJ,

sorry I seemed to miss the original email on this thread.  As Aaron
said, when computing tokens for multiple data centers, you should
compute them independently for each data center - as if it were its
own Cassandra cluster.

You can have "overlapping" token ranges between multiple data centers,
but no two nodes can have the same token, so for subsequent data
centers I just increment the tokens.

For two data centers with two nodes each using RandomPartitioner
calculate the tokens for the first DC normally, but in the second data
center, increment the tokens by one.

In DC 1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

In DC 2
node 1 = 1
node 2 =  85070591730234615865843651857942052865

For RowMutations this will give each data center a local set of nodes
that it can write to for complete coverage of the entire token space.
If you are using NetworkTopologyStrategy for replication, it will give
an offset mirror replication between the two data centers so that your
replicas will not get pinned to a node in the remote DC.  There are
other ways to select the tokens, but the increment method is the
simplest to manage and continue to grow with.
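
For what it's worth, here is a small Python sketch of that recipe (it assumes
RandomPartitioner's 2**127 token space); for 2 DCs of 2 nodes it prints the
tokens above:

    # Evenly spaced tokens per data center, each subsequent DC offset by its
    # index so that no two nodes in the cluster share a token.
    TOKEN_SPACE = 2 ** 127

    def dc_tokens(nodes_per_dc, dc_index):
        return [i * TOKEN_SPACE // nodes_per_dc + dc_index
                for i in range(nodes_per_dc)]

    for dc in range(2):
        for node, token in enumerate(dc_tokens(2, dc)):
            print("DC%d node %d = %d" % (dc + 1, node + 1, token))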

Hope that helps.

-Eric





Re: Docs: Token Selection

2011-06-16 Thread AJ
LOL, I feel Eric's pain.  This double-ring thing can throw you for a 
loop since, like I said, there is only one place it is documented and it 
is only *implied*, so one is not sure he is interpreting it correctly.  
Even the source for NTS doesn't mention this.


Thanks for everyone's help on this.

On 6/16/2011 5:43 AM, aaron morton wrote:
See this thread for background 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html 



In a multi DC environment, if you calculate the initial tokens for the 
entire cluster, data will not be evenly distributed.


Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com





Re: Docs: Token Selection

2011-06-15 Thread AJ
Ok.  I understand the reasoning you laid out.  But, I think it should be 
documented more thoroughly.  I was trying to get an idea as to how 
flexible Cass lets you be with the various combinations of strategies, 
snitches, token ranges, etc..


It would be instructional to see what a graphical representation of a 
cluster ring with multiple data centers looks like.  Google turned-up 
nothing.  I imagine it's a multilayer ring; one layer per data center 
with the nodes of one layer slightly offset from the ones in the other 
(based on the example in the wiki).  I would also like to know which 
node is next in the ring so as to understand replica placement in, 
for example, the OldNetworkTopologyStrategy when its doc states,


"...It places one replica in a different data center from the first (if 
there is any such data center), the third replica in a different rack in 
the first datacenter, and any remaining replicas on the first unused 
nodes on the ring."


I can only assume for now that "the ring" referred to is the "local" 
ring of the first data center.



On 6/15/2011 5:51 PM, Vijay wrote:

No it won't; it will assume you are doing the right thing...

Regards,




On Wed, Jun 15, 2011 at 2:34 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


Vijay, thank you for your thoughtful reply.  Will Cass complain if
I don't setup my tokens like in the examples?


On 6/15/2011 2:41 PM, Vijay wrote:

All you heard is right...
You are not overriding Cassandra's token assignment by saying
here is your token...

Logic is:
Calculate a token for the given key...
find the node in each region independently (If you use NTS and if
you set the strategy options which says you want to replicate to
the other region)...
Search for the ranges in each region independently
Replicate the data to that node.
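
(For illustration only, a rough Python sketch of that per-region walk; the names
are made up and rack handling is left out, so treat it as a simplification
rather than the real NTS code.)

    from bisect import bisect_left

    def replicas_for(key_token, node_tokens, dc_of, replicas_per_dc):
        ring = sorted(node_tokens)
        n = len(ring)
        start = bisect_left(ring, key_token) % n   # first node at/after the key token
        picked = dict((dc, []) for dc in replicas_per_dc)
        for step in range(n):                      # walk the whole ring once
            node = ring[(start + step) % n]
            dc = dc_of[node]
            if dc in picked and len(picked[dc]) < replicas_per_dc[dc]:
                picked[dc].append(node)
        return picked

    # Tokens 0 and 8 in DC1, 4 and 12 in DC2, one replica per DC:
    print(replicas_for(5, [0, 4, 8, 12],
                       {0: 'DC1', 8: 'DC1', 4: 'DC2', 12: 'DC2'},
                       {'DC1': 1, 'DC2': 1}))      # {'DC1': [8], 'DC2': [12]}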

For multi DC cassandra needs nodes to be equally partitioned
within each dc (if you care that the load is equally
distributed), and there shouldn't be any collision of
tokens within a cluster

The documentation tried to explain the same and the example in
the documentation.
Hope this clarifies...

More examples if it helps

DC1 Node 1 : token 0
DC1 Node 2 : token 8..

DC2 Node 1 : token 4..
DC2 Node 1 : token 12..

or

DC1 Node 1 : token 0
DC1 Node 2 : token 1..

DC2 Node 1 : token 8..
DC2 Node 1 : token  7..

Regards,




On Wed, Jun 15, 2011 at 12:28 PM, AJ mailto:a...@dude.podzone.net>> wrote:

On 6/15/2011 12:14 PM, Vijay wrote:

Correction

"The problem in the above approach is you have 2 nodes
between 12 to 4 in DC1 but from 4 to 12  you just have 1"

should be

"The problem in the above approach is you have 1 node
between 0-4 (25%) and one node covering the rest which
is 4-16, 0-0 (75%)"

Regards,




Ok, I think you are saying that the computed token range
intervals are incorrect and that they would be:

DC1
*node 1 = 0  Range: (4, 16], (0, 0]

node 2 = 4  Range: (0, 4]

DC2
*node 3 = 8  Range: (12, 16], (0, 8]

node 4 = 12   Range: (8, 12]

If so, then yes, this is what I am seeking to confirm since I
haven't found any documentation stating this directly and
that reference that I gave only implies this; that is, that
the token ranges are calculated per data center rather than
per cluster.  I just need someone to confirm that 100%
because it doesn't sound right to me based on everything else
I've read.

SO, the question is:  Does Cass calculate the consecutive
node token ranges A.) per cluster, or B.) for the whole data
center?

From all I understand, the answer is B.  But, that
documentation (reprinted below) implies A... or something
that doesn't make sense to me because of the token placement
in the example:

"With NetworkTopologyStrategy, you should calculate the
tokens the nodes in each DC independently...

DC1 node 1 = 0, node 2 = 85070591730234615865843651857942052864
DC2 node 3 = 1, node 4 = 85070591730234615865843651857942052865"


However, I do see why this would be helpful, but first I'm just asking 
if this token assignment is absolutely mandatory
or if it's just a technique to achieve some end.











Re: Docs: Token Selection

2011-06-15 Thread AJ
Vijay, thank you for your thoughtful reply.  Will Cass complain if I 
don't setup my tokens like in the examples?


On 6/15/2011 2:41 PM, Vijay wrote:

All you heard is right...
You are not overriding Cassandra's token assignment by saying here is 
your token...


Logic is:
Calculate a token for the given key...
find the node in each region independently (If you use NTS and if you 
set the strategy options which says you want to replicate to the other 
region)...

Search for the ranges in each region independently
Replicate the data to that node.

For multi DC cassandra needs nodes to be equally partitioned 
within each dc (if you care that the load is equally distributed), and 
there shouldn't be any collision of tokens within a cluster


The documentation tried to explain the same and the example in the 
documentation.

Hope this clarifies...

More examples if it helps

DC1 Node 1 : token 0
DC1 Node 2 : token 8..

DC2 Node 1 : token 4..
DC2 Node 1 : token 12..

or

DC1 Node 1 : token 0
DC1 Node 2 : token 1..

DC2 Node 1 : token 8..
DC2 Node 1 : token  7..

Regards,




On Wed, Jun 15, 2011 at 12:28 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


On 6/15/2011 12:14 PM, Vijay wrote:

Correction

"The problem in the above approach is you have 2 nodes between 12
to 4 in DC1 but from 4 to 12  you just have 1"

should be

"The problem in the above approach is you have 1 node between 0-4
(25%) and one node covering the rest which is 4-16, 0-0 (75%)"

Regards,




Ok, I think you are saying that the computed token range intervals
are incorrect and that they would be:

DC1
*node 1 = 0  Range: (4, 16], (0, 0]

node 2 = 4  Range: (0, 4]

DC2
*node 3 = 8  Range: (12, 16], (0, 8]

node 4 = 12   Range: (8, 12]

If so, then yes, this is what I am seeking to confirm since I
haven't found any documentation stating this directly and that
reference that I gave only implies this; that is, that the token
ranges are calculated per data center rather than per cluster.  I
just need someone to confirm that 100% because it doesn't sound
right to me based on everything else I've read.

SO, the question is:  Does Cass calculate the consecutive node
token ranges A.) per cluster, or B.) for the whole data center?

From all I understand, the answer is B.  But, that documentation
(reprinted below) implies A... or something that doesn't make
sense to me because of the token placement in the example:

"With NetworkTopologyStrategy, you should calculate the tokens the
nodes in each DC independently...

DC1 node 1 = 0, node 2 = 85070591730234615865843651857942052864
DC2 node 3 = 1, node 4 = 85070591730234615865843651857942052865"


However, I do see why this would be helpful, but first I'm just asking if 
this token assignment is absolutely mandatory
or if it's just a technique to achieve some end.








Re: Docs: Token Selection

2011-06-15 Thread AJ

On 6/15/2011 12:14 PM, Vijay wrote:

Correction

"The problem in the above approach is you have 2 nodes between 12 to 4 
in DC1 but from 4 to 12  you just have 1"


should be

"The problem in the above approach is you have 1 node between 0-4 
(25%) and one node covering the rest which is 4-16, 0-0 (75%)


Regards,




Ok, I think you are saying that the computed token range intervals are 
incorrect and that they would be:


DC1
*node 1 = 0  Range: (4, 16], (0, 0]
node 2 = 4  Range: (0, 4]

DC2
*node 3 = 8  Range: (12, 16], (0, 8]
node 4 = 12   Range: (8, 12]

If so, then yes, this is what I am seeking to confirm since I haven't 
found any documentation stating this directly and that reference that I 
gave only implies this; that is, that the token ranges are calculated 
per data center rather than per cluster.  I just need someone to confirm 
that 100% because it doesn't sound right to me based on everything else 
I've read.


SO, the question is:  Does Cass calculate the consecutive node token 
ranges A.) per cluster, or B.) for the whole data center?


From all I understand, the answer is B.  But, that documentation 
(reprinted below) implies A... or something that doesn't make sense to 
me because of the token placement in the example:


"With NetworkTopologyStrategy, you should calculate the tokens the nodes 
in each DC independently...


DC1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

DC2
node 3 = 1
node 4 = 85070591730234615865843651857942052865"


However, I do see why this would be helpful, but first I'm just asking if this 
token assignment is absolutely mandatory
or if it's just a technique to achieve some end.





Re: Forcing Cassandra to free up some space

2011-06-15 Thread AJ
In regards to cleaning-up old sstable files, I posed this question 
before as I noticed after taking a snapshot, the older files 
(pre-compaction) shared no links with the snapshots.  Therefore, (if the 
Cass snapshot functionality is working correctly) those older files can 
be manually deleted.  The reasoning is simply because if you were to do 
a backup based on the snapshots that Cass created, then those older 
(pre-compaction) files would be left out of the backup.  Therefore, they 
are no longer needed.


But, I never got a definitive answer to this.  If the Cass snapshot 
functionality can be relied upon with 100% confidence, then all you have 
to do is take a snapshot, then delete all the files with hard links <= 1 
and with mod times prior to the snapshotted files.  But, again, this is 
only considered safe if the Cass snapshot function is 100% reliable.  I 
have no reason to believe it's not... just saying.
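
Purely as an illustration, a rough Python sketch of that check (the paths are
assumptions for a default install, and it only prints candidates instead of
deleting anything):

    import os, glob

    data_dir = '/var/lib/cassandra/data/Keyspace1'             # assumed location
    snapshot_dir = os.path.join(data_dir, 'snapshots', 'tag')  # assumed snapshot tag
    snapshot_time = os.stat(snapshot_dir).st_mtime

    for path in glob.glob(os.path.join(data_dir, '*')):
        if not os.path.isfile(path):
            continue
        st = os.stat(path)
        # pre-compaction files: no hard link from the snapshot, older than it
        if st.st_nlink <= 1 and st.st_mtime < snapshot_time:
            print('candidate for manual deletion: %s' % path)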


On 6/15/2011 9:48 AM, Terje Marthinussen wrote:
Even if the gc call cleaned all files, it is not really acceptable on 
a decent sized cluster due to the impact full gc has on performance. 
Especially non-needed ones.


The delay in file deletion can also at times make it hard to see how 
much spare disk you actually have.


We easily see 100% increase in disk use which extends for long periods 
of time before anything gets cleaned up. This can be quite misleading 
and I believe on a couple of occasions we seen short term full disk 
scenarios during testing as a result of cleanup not happening entirely 
when it should...


Terje

On Wed, Jun 15, 2011 at 11:50 PM, Shotaro Kamio wrote:


We've encountered the situation that compacted sstable files aren't
deleted after node repair. Even when gc is triggered via jmx, it
sometimes leaves compacted files. In one case, a lot of files are left.
Some files stay more than 10 hours already. There is no guarantee that
gc will cleanup all compacted sstable files.

We have a great interest on the following ticket.
https://issues.apache.org/jira/browse/CASSANDRA-2521


Regards,
Shotaro


On Fri, May 27, 2011 at 11:27 AM, Jeffrey Kesselman
mailto:jef...@gmail.com>> wrote:
> Im also not sure that will guarantee all space is cleaned up.  It
> really depends on what you are doing inside Cassandra.  If you have
> your own garbage collection that is just in some way tied to the gc run,
> then it will run when  it runs.
>
> If otoh you are associating records in your storage with specific
> objects in memory and using one of the post-mortem hooks
(finalize or
> PhantomReference) to tell you to clean up that particular record
then
> its quite possible they wont all get cleaned up.  In general hotspot
> does not find and clean every candidate object on every GC run.  It
> starts with the easiest/fastest to find and then sees what more it
> thinks it needs to do to create enough memory for anticipated near
> future needs.
>
> On Thu, May 26, 2011 at 10:16 PM, Jonathan Ellis
mailto:jbel...@gmail.com>> wrote:
>> In summary, system.gc works fine unless you've deliberately done
>> something like setting the -XX:-DisableExplicitGC flag.
>>
>> On Thu, May 26, 2011 at 5:58 PM, Konstantin  Naryshkin
>> mailto:konstant...@a-bb.net>> wrote:
>>> So, in summary, there is no way to predictably and efficiently
tell Cassandra to get rid of all of the extra space it is using on
disk?
>>>
>>> - Original Message -
>>> From: "Jeffrey Kesselman" mailto:jef...@gmail.com>>
>>> To: user@cassandra.apache.org 
>>> Sent: Thursday, May 26, 2011 8:57:49 PM
>>> Subject: Re: Forcing Cassandra to free up some space
>>>
>>> Which JVM?  Which collector?  There have been and continue to
be many.
>>>
>>> Hotspot itself supports a number of different collectors with
>>> different behaviors.   Many of them do not collect every
candidate on
>>> every gc, but merely the easiest ones to find.  This is why
depending
>>> on finalizers is a *bad* idea in java code.  They may well
never get
>>> run.  (Finalizer is one of a few features the Sun Java team always
>>> regretted putting in Java to start with.  It has caused quite
a few
>>> application problems over the years)
>>>
>>> The really important thing is that NONE of these behaviors of the
>>> colelctors are guaranteed by specification not to change from
version
>>> to version.  Basing your code on non-specified behaviors is a
good way
>>> to hit mysterious failures on updates.
>>>
>>> For instance, in the mid 90s, IBM had a mode of their Vm called
>>> "infinite heap."  it *never* garbage collected, even if you called
>>> System.gc.  Instead it just threw away address space and
counted on
>>> the total memory need

Re: cascading failures due to memory

2011-06-15 Thread AJ

Sasha,

Did you ever nail down the cause of this problem?

On 5/31/2011 4:01 AM, Sasha Dolgy wrote:

hi everyone,

the current nodes i have deployed (4) have all been working fine, with
not a lot of data ... more reads than writes at the moment.  as i had
monitoring disabled, when one node's OS killed the cassandra process
due to out of memory problems ... that was fine.  24 hours later,
another node, 24 hours later, another node ...until finally, all 4
nodes no longer had cassandra running.

When all nodes are started fresh, CPU utilization is at about 21% on
each box.  after 24 hours, this goes up to 32% and then 51% 24 hours
later.

originally I had thought that this may be a result of 'nodetool
repair' not being run consistently ... after adding a cronjob to run
every 24 hours (staggered between nodes) the problem of the increasing
memory utilization does not resolve.

i've read the operations page and also the
http://wiki.apache.org/cassandra/MemtableThresholds page.  i am
running defaults and 0.7.6-02 ...

what are the best places to start in terms of finding why this is
happening?  CF design / usage?  'nodetool cfstats' gives me some good
info ... and i've already implemented some changes to one CF based on
how it had ballooned (too many rows versus not enough columns)

suggestions appreciated





Re: Where is my data?

2011-06-15 Thread AJ

Thanks

On 6/15/2011 3:20 AM, Sylvain Lebresne wrote:

You can use the thrift call describe_ring(). It will returns a map
that associate to each range of the
ring who is a replica. Once any range has all it's endpoint
unavailable, that range of the data is unavailable.
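
For example, a minimal Python/Thrift sketch of that call (it assumes the
thrift-generated cassandra bindings are on the path; host, port and keyspace
are placeholders):

    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from cassandra import Cassandra

    transport = TTransport.TFramedTransport(TSocket.TSocket('127.0.0.1', 9160))
    client = Cassandra.Client(TBinaryProtocol.TBinaryProtocol(transport))
    transport.open()

    # Each TokenRange lists the replicas for that slice of the ring; a range
    # whose endpoints are all down means that slice of the data is unavailable.
    for tr in client.describe_ring('Keyspace1'):
        print('%s -> %s : %s' % (tr.start_token, tr.end_token, tr.endpoints))

    transport.close()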

--
Sylvain





Re: New web client & future API

2011-06-15 Thread AJ

Nice interface... and someone has good taste in music.

BTW, I'm new to web programming, what did you use for the web 
components?  JSF, JavaScript, something else?


On 6/14/2011 7:42 AM, Markus Wiesenbacher | Codefreun.de wrote:


Hi,

what is the future API for Cassandra? Thrift, Avro, CQL?

I just released an early version of my web client 
(http://www.codefreun.de/apollo) which is Thrift-based, and therefore 
I would like to know what the future is ...


Many thanks
MW




Re: Docs: Token Selection

2011-06-14 Thread AJ

Yes, which means that the ranges overlap each other.

Is this just a convention, or is it technically required when using 
NetworkTopologyStrategy?  Would it be acceptable to split the ranges 
into quarters by ignoring the data centers, such as:


DC1
node 1 = 0  Range: (12, 16], (0, 0]
node 2 = 4  Range: (0, 4]

DC2
node 3 = 8  Range: (4, 8]
node 4 = 12   Range: (8, 12]

If this is OK, are there any drawbacks to this?



On 6/14/2011 6:10 PM, Vijay wrote:

Yes... Thats right...  If you are trying to say the below...

DC1
Node1 Owns 50%

(Ranges 8..4 -> 8..5 & 8..5 -> 0)

Node2 Owns 50%

(Ranges 0 -> 1 & 1 -> 8..4)


DC2
Node1 Owns 50%

(Ranges 8..5 -> 0 & 0 -> 1)

Node2 Owns 50%

(Ranges 1 -> 8..4 & 8..4 -> 8..5)


Regards,




On Tue, Jun 14, 2011 at 3:47 PM, AJ <mailto:a...@dude.podzone.net>> wrote:


This http://wiki.apache.org/cassandra/Operations#Token_selection
 says:

"With NetworkTopologyStrategy, you should calculate the tokens the
nodes in each DC independently."

and gives the example:

DC1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

DC2
node 3 = 1
node 4 = 85070591730234615865843651857942052865


So, according to the above, the token ranges would be (abbreviated
nums):

DC1
node 1 = 0  Range: (8..4, 16], (0, 0]
node 2 = 8..4   Range: (0, 8..4]

DC2
node 3 = 1  Range: (8..5, 16], (0, 1]
node 4 = 8..5   Range: (1, 8..5]


If the above is correct, then I would be surprised as this
paragraph is the only place were one would discover this and may
be easy to miss... unless there's a doc buried somewhere in plain
view that I missed.

So, have I interpreted this paragraph correctly?  Was this design
to help keep data somewhat localized if that was important, such
as a geographically dispersed DC?

Thanks!






Re: Docs: "Why do deleted keys show up during range scans?"

2011-06-14 Thread AJ

Thanks, but right now I'm thinking, RTFC ;o)

On 6/14/2011 4:37 PM, aaron morton wrote:

While you can delete a row, if I understand correctly, what happens is a
tombstone is created which matches every column, so in effect it is
deleting the columns, not the whole row.

A tombstone is created at the level of the delete, rather than for every 
column. Otherwise imagine deleting a row with 1 million columns.

Tombstones are created at the Column, Super Column and Row level. Deleting at 
the row level writes a row level tombstone. All these different tombstones are 
resolved during the read process.

My understanding of "So to special case leaving out result entries for deletions, we 
would have to check the entire rest of the row to make sure there is no undeleted data 
anywhere else either (in which case leaving the key out would be an error)." is...

Resolving the predicate to determine if a row contains the specified columns is 
a (somewhat) bound operation. Determining if a row as ANY non deleted columns 
is a potentially unbound operation that could involve lots-o-io .  Imagine a 
row with 1 million columns, and the first 100,000 have been deleted.

For each row in the result set you can say either :

1) It has 1 or more of the columns I requested.
2) It has none of the columns I requested.
3) it has no columns, but cassandra decided it was too much work to 
conclusively prove that. Because after all I asked if it had some specific 
columns not if it had any columns.
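
(In practice the usual workaround is simply to drop the ghost rows on the client, e.g.:)

    # Each KeySlice returned by get_range_slices() has .key and .columns;
    # rows that come back with no columns at all are the "range ghosts".
    def drop_ghosts(key_slices):
        return [ks for ks in key_slices if ks.columns]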

Hope that helps.

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 Jun 2011, at 04:25, Jeremiah Jordan wrote:


Also, tombstone's are not "attached" anywhere.  A tombstone is just a
column with special value which says "I was deleted".  And I am pretty
sure they go into SSTables etc the exact same way regular columns do.

-Original Message-
From: Jeremiah Jordan [mailto:jeremiah.jor...@morningstar.com]
Sent: Tuesday, June 14, 2011 11:22 AM
To: user@cassandra.apache.org
Subject: RE: Docs: "Why do deleted keys show up during range scans?"

I am pretty sure how Cassandra works will make sense to you if you think
of it that way, that rows do not get deleted, columns get deleted.
While you can delete a row, if I understand correctly, what happens is a
tombstone is created which matches every column, so in effect it is
deleting the columns, not the whole row.  A row key will not be
forgotten/deleted until there are no columns or tombstones which
reference it.  Until there are no references to that row key in any
SSTables you can still get that key back from the API.

-Jeremiah

-Original Message-
From: AJ [mailto:a...@dude.podzone.net]
Sent: Monday, June 13, 2011 12:11 PM
To: user@cassandra.apache.org
Subject: Re: Docs: "Why do deleted keys show up during range scans?"

On 6/13/2011 10:14 AM, Stephen Connolly wrote:

store the query inverted.

that way empty ->   deleted


I don't know what that means... get the other columns?  Can you
elaborate?  Is there docs for this or is this a hack/workaround?


the tombstones are stored for each column that had data IIRC... but at
this point my grok of C* is lacking

I suspected this, but wasn't sure.  It sounds like when a row is
deleted, a tombstone is not "attached" to the row, but to each column???
So, if all columns are deleted then the row is considered deleted?
Hmmm, that doesn't sound right, but that doesn't mean it isn't ! ;o)






Docs: Token Selection

2011-06-14 Thread AJ

This http://wiki.apache.org/cassandra/Operations#Token_selection  says:

"With NetworkTopologyStrategy, you should calculate the tokens the nodes 
in each DC independently."


and gives the example:

DC1
node 1 = 0
node 2 = 85070591730234615865843651857942052864

DC2
node 3 = 1
node 4 = 85070591730234615865843651857942052865


So, according to the above, the token ranges would be (abbreviated nums):

DC1
node 1 = 0  Range: (8..4, 16], (0, 0]
node 2 = 8..4   Range: (0, 8..4]

DC2
node 3 = 1  Range: (8..5, 16], (0, 1]
node 4 = 8..5   Range: (1, 8..5]


If the above is correct, then I would be surprised as this paragraph is 
the only place were one would discover this and may be easy to miss... 
unless there's a doc buried somewhere in plain view that I missed.


So, have I interpreted this paragraph correctly?  Was this design to 
help keep data somewhat localized if that was important, such as a 
geographically dispersed DC?


Thanks!


Where is my data?

2011-06-14 Thread AJ
Is there an official deterministic formula to compute the various 
subsets of a given cluster that comprises a complete set of data 
(redundant rows ok)?  IOW, if multiple nodes become unavailable one at a 
time, at what point can I say <100% of my data is available?


Obviously, the method would have to take into consideration the ring 
layout along with the partition type, the # of nodes, 
replication_factor, replication strat, etc..


Thanks!


Re: Is this the proper use of OPP?

2011-06-14 Thread AJ
Thanks.  I found that article later.  I was definitely off-base with 
respect to OPP.  Random partitioning is pretty much the way to go and 
datastax has a good article on geographic distribution: 
http://www.datastax.com/docs/0.8/operations/datacenter


Sorry for the long pointless post previously.  But, FWIW, I don't see 
much use for OPP other than the corner case of a cluster consisting of 1 
ks and 1 cf, such as an index.  I will have to read Dominic's post on 
having multiple Cass clusters running on the same nodes.


On 6/14/2011 4:46 AM, Eric tamme wrote:

I would point you to this article, it does a good job describing OPP
and pretty much answers the specific questions you asked.

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/

-Eric


On Mon, Jun 13, 2011 at 5:06 PM, AJ  wrote:

I'm just becoming aware of the restrictions of using an OPP as compared to
Random.  Please let me know if I understand this correctly.

First off, if using the OPP only for an increased performance of range
queries, then it will probably be very hard to predict if you will end up
with hotspots or not and thus where and even how the data may be clustered
together in a particular node.  This is because all the various keys of the
various CFs may or may not have any correlation with one another.  So, in
effect, you just have a big mess of keys of various ranges and formats, but
they all are partitioned according to one global set of tokens that apply to
ALL CFs of ALL keyspaces.

[main reason for post below...]
OTOH, if you want to use OPP to purposely cluster certain data together on
specific nodes, such as for geographic partitioning, then you have to choose
a prefix for all of the keys of ALL CFs and ALL keyspaces!  This is because
they will all be partitioned based on the tokens assigned to the nodes.
  IOW, if I had two datacenters, one in the US and another in Europe, then
for all rows in all KSs and in all CFs, I would need to prepend a prefix to
the keys, such as "US:" and "EU:".  The problem is I may not want ALL of my
CFs to be partitioned this way; only specific ones.  Also, it may be very
difficult if not impossible for all keys of all keyspaces and CFs to use
keys of this form.  I'm not sure if Cass is designed for this.

However, if using the random partitioner, then there is no problem.  You can
use any key of any type you want (UTF8, Long, etc.) since they are all
hashed before deciding which node gets the key/row.

Do I understand things correctly or am I missing something?  Is Cass
designed to use OPP this way or am I hacking it?  If so, is there an
acceptable way to do geographic partitioning?

Also, what is OPP really good for?

Thanks!





Is this the proper use of OPP?

2011-06-13 Thread AJ
I'm just becoming aware of the restrictions of using an OPP as compared 
to Random.  Please let me know if I understand this correctly.


First off, if using the OPP only for an increased performance of range 
queries, then it will probably be very hard to predict if you will end 
up with hotspots or not and thus where and even how the data may be 
clustered together in a particular node.  This is because all the 
various keys of the various CFs may or may not have any correlation with 
one another.  So, in effect, you just have a big mess of keys of various 
ranges and formats, but they all are partitioned according to one global 
set of tokens that apply to ALL CFs of ALL keyspaces.


[main reason for post below...]
OTOH, if you want to use OPP to purposely cluster certain data together 
on specific nodes, such as for geographic partitioning, then you have to 
choose a prefix for all of the keys of ALL CFs and ALL keyspaces!  This 
is because they will all be partitioned based on the tokens assigned to 
the nodes.  IOW, if I had two datacenters, one in the US and another in 
Europe, then for all rows in all KSs and in all CFs, I would need to 
prepend a prefix to the keys, such as "US:" and "EU:".  The problem is I 
may not want ALL of my CFs to be partitioned this way; only specific 
ones.  Also, it may be very difficult if not impossible for all keys of 
all keyspaces and CFs to use keys of this form.  I'm not sure if Cass is 
designed for this.
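
(A trivial sketch of the kind of key prefixing I mean, purely illustrative:)

    # Prefix every row key with its region so OPP keeps a region's rows
    # contiguous on the ring (e.g. all "US:" keys land on the US-pinned ranges).
    def region_key(region, natural_key):
        return '%s:%s' % (region, natural_key)

    print(region_key('US', 'user-12345'))   # US:user-12345
    print(region_key('EU', 'user-98765'))   # EU:user-98765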


However, if using the random partitioner, then there is no problem.  You 
can use any key of any type you want (UTF8, Long, etc.) since they are 
all hashed before deciding which node gets the key/row.


Do I understand things correctly or am I missing something?  Is Cass 
designed to use OPP this way or am I hacking it?  If so, is there an 
acceptable way to do geographic partitioning?


Also, what is OPP really good for?

Thanks!


Re: Docs: "Why do deleted keys show up during range scans?"

2011-06-13 Thread AJ

On 6/13/2011 10:14 AM, Stephen Connolly wrote:


store the query inverted.

that way empty ->  deleted

I don't know what that means... get the other columns?  Can you 
elaborate?  Is there docs for this or is this a hack/workaround?



the tombstones are stored for each column that had data IIRC... but at
this point my grok of C* is lacking
I suspected this, but wasn't sure.  It sounds like when a row is 
deleted, a tombstone is not "attached" to the row, but to each 
column???  So, if all columns are deleted then the row is considered 
deleted?  Hmmm, that doesn't sound right, but that doesn't mean it isn't 
! ;o)


Re: Docs: "Why do deleted keys show up during range scans?"

2011-06-13 Thread AJ

On 6/13/2011 9:25 AM, Stephen Connolly wrote:

On 13 June 2011 16:14, AJ  wrote:

On 6/13/2011 7:03 AM, Stephen Connolly wrote:

It returns the set of columns for the set of rows... how do you
determine the difference between a completely empty row and a row that
just does not have any of the matching columns?

I would expect it to not return anything (no row at all) for both of those
cases.  Are you saying that an empty row is returned for rows that do not
match the predicate?  So, if I perform a range slice where the range is
every row of the CF and the slice equates to no matches and I have 1 million
rows in the CF, then I will get a result set of 1 million empty rows?


No I am saying that for each row that matches, you will get a result,
even if the columns that you request happen to be empty for that
specific row.



Ok, this I understand I guess.  If I query a range of rows and want only 
a certain column and a row does not have that column, I would like to 
know that.



Likewise, any deleted rows in the same row range will show as empty
because C* would have a ton of work to figure out the difference
between being deleted and being empty.



But, if a row does indeed have the column, but that row was deleted, why 
would I get an empty row?  You say because of a ton of work.  So, the 
tombstone for the row is not stored "close-by" for quick access... or 
something like that?  At any rate, how do I figure out if the empty row 
is empty because it was deleted?  Sorry if I'm being dense.





Re: Docs: "Why do deleted keys show up during range scans?"

2011-06-13 Thread AJ

On 6/13/2011 7:03 AM, Stephen Connolly wrote:

It returns the set of columns for the set of rows... how do you
determine the difference between a completely empty row and a row that
just does not have any of the matching columns?


I would expect it to not return anything (no row at all) for both of 
those cases.  Are you saying that an empty row is returned for rows that 
do not match the predicate?  So, if I perform a range slice where the 
range is every row of the CF and the slice equates to no matches and I 
have 1 million rows in the CF, then I will get a result set of 1 million 
empty rows?


Docs: "Why do deleted keys show up during range scans?"

2011-06-13 Thread AJ

http://wiki.apache.org/cassandra/FAQ#range_ghosts

"So to special case leaving out result entries for deletions, we would 
have to check the entire rest of the row to make sure there is no 
undeleted data anywhere else either (in which case leaving the key out 
would be an error)."


The above doesn't read well and I don't get it.  Can anyone rephrase it 
or elaborate?


Thanks!


Re: SSL & Streaming

2011-06-13 Thread AJ
Performance-wise, I think it would be better to just let the client 
encrypt sensitive data before storing it, versus encrypting all traffic 
all the time.  If individual values are encrypted, then they don't have 
to be encrypted/decrypted during transit between nodes during the 
initial updates as well as during the commissioning of a new node or 
other times.
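
(Something along these lines on the client side; PyCrypto is just an example
library here and key management is deliberately hand-waved:)

    import os
    from Crypto.Cipher import AES      # example library; key must be 16/24/32 bytes

    def encrypt_value(key, plaintext):
        iv = os.urandom(16)            # prepend a random IV so each value stands alone
        return iv + AES.new(key, AES.MODE_CFB, iv).encrypt(plaintext)

    def decrypt_value(key, blob):
        iv, body = blob[:16], blob[16:]
        return AES.new(key, AES.MODE_CFB, iv).decrypt(body)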


A drawback, however, is now you have to manage one or more keys for the 
lifetime of the data.  It will also complicate your data view 
interfaces.  However, if Cassandra had data encryption built-in somehow, 
that would solve this problem... just thinking out loud.


Can anyone think of other pro/cons of both strategies?

On 3/22/2011 2:21 AM, Sasha Dolgy wrote:

Hi,

Is there documentation available anywhere that describes how one can
use org.apache.cassandra.security.streaming.* ?   After the EC2 posts
yesterday, one question I was asked was about the security of data
being shifted between nodes.  Is it done in clear text, or
encrypted..?  I haven't seen anything to suggest that it's encrypted,
but see in the source that security.streaming does leverage SSL ...

Thanks in advance for some pointers to documentation.

Also, for anyone who is using SSL .. how much of a performance impact
have you noticed?  Is it minimal or significant?





Where is the Overview Documentation on Counters?

2011-06-10 Thread AJ
I can't find any that gives an overview of their purpose/benefits/etc, 
only how to code them.  I can only guess that they are more efficient 
for some reason but don't know exactly why or exactly what conditions I 
would choose to use them over a regular column.


Thanks!


Consistency Levels and Replication with Down Nodes

2011-06-10 Thread AJ

The O'Reilly book on Cass says this about READ consistency level ALL:

"Query all nodes. Wait for all nodes to respond, and return to the 
client the record with the most recent timestamp. Then, if necessary, 
perform a read repair in the background.  If any nodes fail to respond, 
fail the read operation."


It says "all nodes".  Shouldn't it say "replication_factor nodes"?  The 
above is implying that if a node, that doesn't even have a copy of the 
row, is down, then the read will fail.  Is that true?




Here is a related question regarding the WRITE consistency level ALL.  
The book says:


"Ensure that the number of nodes specified by replication_factor 
received the write before returning to the client.  If even one replica 
is unresponsive to the write operation, fail the operation."


I can understand this if the given row already exists from a previous 
write and one of the nodes that contains a replica is down.  But, what 
if this is the FIRST time creating this row and one of the nodes that it 
determines should store one of the replicas is down?  Will it choose 
another node to store the replica, or will it use hints to update the 
chosen down node when it comes back up?



Generally speaking for any RF value and for the FIRST write of a 
particular row, does Cass select specific nodes to contain the replicas 
regardless of their availability, and use hints if some of them are 
unavailable?  Or, will it select another available node?


Thanks


Re: Ideas for Big Data Support

2011-06-09 Thread AJ

On 6/9/2011 8:40 AM, Edward Capriolo wrote:




Some of these things are challenges, and a few are being worked on in 
one way or another.


1) Dynamic snitch was implemented to determine slow acting nodes and 
re-balance load.


2) You can budget bootstrap with rsync, as long as you know what data 
to copy where. 0.7.X made the data moving process more efficient.


Still, moving only 1 TB of data over a T-1 would take 61 days.  Or you 
could ship it in a couple.




3) There are many cases where different partition strategies can 
theoretically be better. The question is for the normal use case what 
is the best?


4) Compressed SSTables is on the way. This will be nice because it can 
help maximize disk caches.


5) Compactions *are* a good thing. You can already do this by setting 
compaction thresholds to 0. That is not great because smaller 
compactions can run really fast and you want those to happen 
regularly. Another way I take care of this is forcing major 
compactions on my schedule. This makes it very unlikely that a larger 
compaction will happen at random during peak time. 0.8.X has 
multi-threaded compaction and a throttling limit so that looks promising.


More nodes vs less nodes..+1 more nodes. This does not mean you need 
to go very small, but the larger disk configurations are just more 
painful. Unless you can get very/very/very fast disks.


Even with a massive RAID-0?  At some point, the disk I/O throughput 
should be fast enough that it can compete with cache speeds, perhaps?





Ideas for Big Data Support

2011-06-09 Thread AJ
[Please feel free to correct me on anything or suggest other workarounds 
that could be employed now to help.]


Hello,

This is purely theoretical, as I don't have a big working cluster yet 
and am still in the planning stages, but from what I understand, while 
Cass scales well horizontally, EACH node will not be able to handle well 
a data store in the terabyte range... for reasons that are 
understandable such as simple hardware and bandwidth limitations.  But, 
looking forward and pushing the envelope, I think there might be ways to 
at least manage these issues until broadband speeds, disk and memory 
technology catches up.


The biggest issues with big data clusters that I am currently aware of are:

> disk I/O probs during major compaction and repairs.
> Bandwidth limitations during new node commissioning.

Here are a few ideas I've thought of:

1.)  Load-balancing:

During a major compaction or repair or other similar severe performance 
impacting processes, allow the node to broadcast that it is temporarily 
unavailable so requests for data can be sent to other nodes in the 
cluster.  The node could still "wake-up" and pause or cancel it's 
compaction in the case of a failed node whereby there are no other nodes 
that can provide the data requested.  The node could be considered as 
"degraded" by other nodes, rather than down.  (As a matter of fact, a 
general load-balancing scheme could be devised if each node broadcasts 
its current load level and maybe even hop-count between data centers.)


2.)  Data Transplants:

Since commissioning a new node that is due to receive data in the TB 
range (data xfer could take days or weeks), it would be much more 
efficient to just courier the data.  Perhaps the SSTables (maybe from a 
snapshot) could be transplanted from one production node into a new node 
to help jump-start the bootstrap process.  The new node could sort 
things out during the bootstrapping phase so that it is balanced 
correctly as if it had started out with no data as usual.  If this could 
cut down on half the bandwidth, that would be a great benefit.  However, 
this would work well mostly if the transplanted data came from a 
keyspace that used a random partitioner; coming from an ordered 
partioner may not be so helpful if the rows in the transplanted data 
would never be used in the new node.


3.)  Strategic Partitioning:

Of course, there are surely other issues to contend with, such as RAM 
requirements for caching purposes.  That may be managed by a partition 
strategy that allows certain nodes to specialize in a certain subset of 
the data, such as geographically or whatever the designer chooses.  
Replication would still be done as usual but this may help the cache to 
be better utilized by allowing it to focus on the subset of data that 
comprises the majority of the node's data versus a random sampling of 
the entire cluster.  IOW, while a node may specialize in a certain 
subset and also contain replicated rows from outside that subset, it 
will still only (mostly) be queried for data from within its subset and 
thus the cache will contain mostly data from this special subset which 
could increase the hit rate of the cache.


This may not be a huge help for TB sized data nodes since even 32 GB 
of RAM would still be relatively tiny in comparison to the data size, 
but I include it just in case it spurs other ideas.  Also, I do not know 
how Cass decides on which node to query for data in the first place... 
maybe not the best idea.


4.)  Compressed Columns:

Some sort of data compression of certain columns could be very helpful 
especially since text can be compressed to less than 50% if the 
conditions are right.  Overall native disk compression will not help the 
bandwidth issue since the data would be decompressed before transit.  If 
the data was stored compressed, then Cass could even send the data to 
the client compressed so as to offload the decompression to the client.  
Likewise, during node commission, the data would never have to be 
decompressed saving on CPU and BW.  Alternately, a client could tell 
Cass to decompress the data before transmit if needed.  This, combined 
with idea #1 (transplants) could help speed-up new node bootstraping, 
but only when a large portion of the data consists of very large column 
values and thus compression is practical and efficient.  Of course, the 
client could handle all the compression today without Cass even knowing 
about it, so building this into Cass would be just a convenience, but 
still nice to have, nonetheless.
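
(For example, something as simple as this on the client today:)

    import zlib

    def compress_value(text):
        # compress before the write; Cassandra just sees opaque bytes
        return zlib.compress(text.encode('utf-8'), 6)

    def decompress_value(blob):
        return zlib.decompress(blob).decode('utf-8')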


5.)  Postponed Major Compactions:

The option to postpone auto-triggered major compactions until a 
pre-defined time of day or week or until staff can do it manually.


6.?)  Finally, some have suggested just using more nodes with less data 
storage which may solve most if not all of these problems.  But, I'm 
still fuzzy on that.  The trade-offs would be more infrastructure and 
maintenance costs, hi

Re: CLI set command returns null, ver 0.8.0

2011-06-08 Thread AJ

Thanks Aaron,

I created a script and everything went OK.  I think that the problem is 
when you try to update a CF.  Below, I try to change the column 
comparator and it complains that the 'comparators do not match'.  Can 
you enlighten me on what that means?  There is no data in the CF at this 
point.


[default@Keyspace1] create column family User3;
503dba20-924b-11e0--f1169bb35ddf
Waiting for schema agreement...
... schemas agree across the cluster
[default@Keyspace1] set User3['1']['name'] = 'mike';
org.apache.cassandra.db.marshal.MarshalException: cannot parse 'name' as 
hex bytes
java.lang.RuntimeException: 
org.apache.cassandra.db.marshal.MarshalException: cannot parse 'name' as 
hex bytes
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)

[default@Keyspace1] describe keyspace;
Keyspace: Keyspace1:
  Replication Strategy: 
org.apache.cassandra.locator.NetworkTopologyStrategy

Options: [datacenter1:1]
  Column Families:
ColumnFamily: User3
  Key Validation Class: org.apache.cassandra.db.marshal.BytesType
  Default column value validator: 
org.apache.cassandra.db.marshal.BytesType

  Columns sorted by: org.apache.cassandra.db.marshal.BytesType
  Row cache size / save period in seconds: 0.0/0
  Key cache size / save period in seconds: 20.0/14400
  Memtable thresholds: 0.2859375/61/1440 (millions of ops/MB/minutes)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Replicate on write: false
  Built indexes: []
[default@Keyspace1]

/** Here, I figure the error above is because it cannot find the column 
called 'name' because it's using the BytesType column name 
sorter/comparator, so I try to change it below. */


[default@Keyspace1] update column family User3 with comparator = UTF8Type;
comparators do not match.
java.lang.RuntimeException: comparators do not match.
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]

What does "comparators do not match" mean?

Thanks,
Mike



On 6/8/2011 4:37 PM, aaron morton wrote:

Can you provide the cli script to create the schema and info on how many nodes 
you have.

Thanks

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 8 Jun 2011, at 16:12, AJ wrote:


Can anyone help?  The CLI seems to be having issues.  The count command isn't 
working either:

[default@Keyspace1] count User[long(1)];
Expected 8 or 0 byte long (13)
java.lang.RuntimeException: Expected 8 or 0 byte long (13)
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]
[default@Keyspace1] count User[1];;
Expected 8 or 0 byte long (1)
java.lang.RuntimeException: Expected 8 or 0 byte long (1)
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] count User['1'];
Expected 8 or 0 byte long (1)
java.lang.RuntimeException: Expected 8 or 0 byte long (1)
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] count User['12345678'];
null
java.lang.RuntimeException
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
at org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)
at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]


Granted, there are no rows in the CF yet (see probs below), but this exception 
seems to be during the parsing stage.

I've checked everything else, AFAIK, so I'm at a loss.

Much obliged.

On 6/7/2011 12:44 PM, AJ wrote:

The log only shows INFO level messages about flushes, etc..

The debug mode of the CLI shows an exception after the set:

[mike@mars ~]$ cassandra-cli -h 192.168.1.101 --debug
Connected to: "Test Cluster" on 192.168.1.101/9160
Welcome to the Cassandra CLI.

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use Keyspace1;
Authenticated to keyspace: Keys

Re: Misc Performance Questions

2011-06-08 Thread AJ

Thank you Richard!

On 6/8/2011 2:57 AM, Richard Low wrote:


There is however a difference in running multiple column families
versus putting everything in the same column family and separating
them with e.g. a key prefix.  E.g. if you have a large data set and a
small one, it will be quicker to query the small one if it is in its
own column family.



I assumed that a read would be O(1) for any size CF since Cass is 
implemented with hashmaps.  Do you know why size matters?  (forgive the pun)


Misc Performance Questions

2011-06-08 Thread AJ


Is there a performance hit when dropping a CF?  What if it contains .5 
TB of data?  If not, is there a quick and painless way to drop a large 
amount of data w/minimal perf hit?


Is there a performance hit running multiple keyspaces on a cluster 
versus only one keyspace given a constant total data size?  Is there 
some quantity limit?


Using a Random Partitioner, but with a RF = 1, will the rows still be 
spread-out evenly on the cluster or will there be an affinity to a 
single node (like the one receiving the data from the client)?


I see a lot of mention of using RAID-0, but not RAID-5/6.  Why?  Even 
though Cass can tolerate a down node due to data loss, it would still be 
more efficient to just rebuild a bad hdd live, right?


Maybe perf related:  Will there be a problem having multiple keyspaces 
on a cluster all with different replication factors, from 1-3?


Thanks!


Re: Multiple large disks in server - setup considerations

2011-06-07 Thread AJ

On 6/7/2011 9:32 PM, Edward Capriolo wrote:



I do not like large disk set-ups. I think they end up not being 
economical. Most low latency use cases want high RAM to DISK ratio.  
Two machines with 32GB RAM are usually less expensive than one machine 
with 64GB ram.


For a machine with 1TB drives (or multiple 1TB drives) it is going to 
be difficult to get enough RAM to help with random read patterns.


Also cluster operations like joining, decommissioning, or repair can 
take a *VERY* long time, maybe a day. More, smaller servers (blade-style) 
are more agile.




Is there some rule-of-thumb as to how much RAM is needed per GB of 
data?  I know it probably "depends", but if you could try to explain the 
best you can that would be great!  I too am projecting "big data" 
requirements.


Re: CLI set command returns null, ver 0.8.0

2011-06-07 Thread AJ
Can anyone help?  The CLI seems to be having issues.  The count command 
isn't working either:


[default@Keyspace1] count User[long(1)];
Expected 8 or 0 byte long (13)
java.lang.RuntimeException: Expected 8 or 0 byte long (13)
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]
[default@Keyspace1] count User[1];;
Expected 8 or 0 byte long (1)
java.lang.RuntimeException: Expected 8 or 0 byte long (1)
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] count User['1'];
Expected 8 or 0 byte long (1)
java.lang.RuntimeException: Expected 8 or 0 byte long (1)
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:284)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1] count User['12345678'];
null
java.lang.RuntimeException
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]


Granted, there are no rows in the CF yet (see probs below), but this 
exception seems to be during the parsing stage.


I've checked everything else, AFAIK, so I'm at a loss.

Much obliged.

On 6/7/2011 12:44 PM, AJ wrote:

The log only shows INFO level messages about flushes, etc..

The debug mode of the CLI shows an exception after the set:

[al@mars ~]$ cassandra-cli -h 192.168.1.101 --debug
Connected to: "Test Cluster" on 192.168.1.101/9160
Welcome to the Cassandra CLI.

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use Keyspace1;
Authenticated to keyspace: Keyspace1
[default@Keyspace1] set User[1]['name']='aaa';
null
java.lang.RuntimeException
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]






Re: CLI set command returns null

2011-06-07 Thread AJ

The log only shows INFO level messages about flushes, etc..

The debug mode of the CLI shows an exception after the set:

[al@mars ~]$ cassandra-cli -h 192.168.1.101 --debug
Connected to: "Test Cluster" on 192.168.1.101/9160
Welcome to the Cassandra CLI.

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] use Keyspace1;
Authenticated to keyspace: Keyspace1
[default@Keyspace1] set User[1]['name']='aaa';
null
java.lang.RuntimeException
at 
org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:292)
at 
org.apache.cassandra.cli.CliMain.processStatement(CliMain.java:217)

at org.apache.cassandra.cli.CliMain.main(CliMain.java:345)
[default@Keyspace1]



On 6/7/2011 12:27 PM, Jonathan Ellis wrote:

try running cli with --debug

On Tue, Jun 7, 2011 at 1:22 PM, AJ  wrote:

Ver 0.8.0.

Please help.  I don't know what I'm doing wrong.  One simple keyspace with
one simple CF with one simple column.  I've tried two simple tutorials.  Is
there a common newbie mistake I could be making???

Thanks in advance!


[default@Keyspace1] describe keyspace;
Keyspace: Keyspace1:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Options: [replication_factor:1]
  Column Families:
ColumnFamily: User
  Key Validation Class: org.apache.cassandra.db.marshal.LongType
  Default column value validator:
org.apache.cassandra.db.marshal.UTF8Type
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period in seconds: 0.0/0
  Key cache size / save period in seconds: 20.0/14400
  Memtable thresholds: 0.2859375/61/1440 (millions of ops/MB/minutes)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 1.0
  Replicate on write: false
  Built indexes: []
  Column Metadata:
Column Name: name
  Validation Class: org.apache.cassandra.db.marshal.UTF8Type
[default@Keyspace1]
[default@Keyspace1] set User[long(1)][utf8('name')]=utf8('aaa');
null
[default@Keyspace1] set User[1]['name']='aaa';
null
[default@Keyspace1]
[default@Keyspace1] list User;
Using default limit of 100
null
[default@Keyspace1]










