RE: Cassandra Counters

Roshni Rajagopal Mon, 24 Sep 2012 23:37:25 -0700

Thanks for the reply, and sorry for being bull-headed.
Suppose you're past the stage where you've decided that the system is 
distributed, that it is NoSQL, and that it is Cassandra out of all the NoSQL 
options. Now, to count something, you can do it in different ways in Cassandra. 
In all of them you want to use Cassandra's best features: availability, tunable 
consistency, partition tolerance, etc.
Given this, what are the performance tradeoffs of using counters vs. a standard 
column family for counting? As I see it, if the number in a counter column 
family becomes wrong, it will not be 'eventually consistent'; you will need 
intervention to correct it. So the key question is how much faster a counter 
column family would be, and at what numbers we start seeing a difference.

Date: Tue, 25 Sep 2012 07:57:08 +0200
Subject: Re: Cassandra Counters
From: oleksandr.pet...@gmail.com
To: user@cassandra.apache.org

Maybe I'm missing the point, but counting in a standard column family would be 
a little overkill. 
I assume that "distributed counting" here was more of a map/reduce approach, 
where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a lot. We're 
doing some more complex counting (e.g. based on sets of rules) like that. Of 
course, that would perform _way_ slower than counting beforehand. On the plus 
side, you will always have a consistent result for a consistent dataset.

On the other hand, if you use things like AMQP or Storm (sorry for lumping 
them together like that, as the tools are mostly either orthogonal or 
complementary, but I hope you get my point), you could build a topology that 
makes fault-tolerant writes independently of your original write. Of course, it 
would still have a consistency tradeoff, mostly because of race conditions, 
different network latencies, etc.

So I would say that building a data model in a distributed system often depends 
more on your problem than on the common patterns, because everything has a 
tradeoff. 
Want to have an immediate result? Modify your counter while writing the row.
Can sacrifice speed, but want more counting opportunities? Go with offline 
distributed counting.
Want to have kind of both? Dispatch a message and react upon it, keeping the 
processing logic and writes decoupled from the main application, which allows 
you to care less about speed.
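
[Editor's sketch] The "dispatch a message and react upon it" option can be 
illustrated with a minimal in-process example. It assumes a plain Python queue 
standing in for AMQP or Storm and a dict standing in for the count store; the 
names write_items, count_worker, events and counts are hypothetical and do not 
come from the thread.

```python
import queue
import threading

events = queue.Queue()   # stand-in for a message broker (assumption)
counts = {}              # stand-in for wherever the totals are stored

def count_worker():
    """Consume (list_id, delta) messages and update totals off the write path."""
    while True:
        message = events.get()
        if message is None:          # sentinel: stop the worker
            break
        list_id, delta = message
        counts[list_id] = counts.get(list_id, 0) + delta

def write_items(list_id, delta):
    """The main write would happen here; counting is only signalled."""
    events.put((list_id, delta))

worker = threading.Thread(target=count_worker)
worker.start()

for delta in (3, 70, -20):
    write_items("list-1", delta)

events.put(None)
worker.join()
print(counts["list-1"])  # 53
```

The writer returns as soon as the message is queued, so the count lags the 
write slightly, which is the consistency tradeoff mentioned above.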

However, I may have missed the point somewhere (early morning, you know), so I 
may be wrong in any given statement.
Cheers

On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal 
<roshni_rajago...@hotmail.com> wrote:

Thanks Milind,
Has anyone implemented counting in a standard column family in Cassandra, where 
you can have both increments and decrements to the count? Are there any 
performance comparisons with counter column families?

Regards,
Roshni

Date: Mon, 24 Sep 2012 11:02:51 -0700
Subject: RE: Cassandra Counters
From: milindpar...@gmail.com
To: user@cassandra.apache.org

IMO

You would use Cassandra counters (or another variation of distributed counting) 
once you have determined that a centralized version of counting is not going to 
work.

You'd determine the infeasibility of centralized counting by figuring out the 
rate at which you need to sustain writes and reads, and reconciling that with 
your hard disk seek times (essentially).

Once you have "proved" that you can't do centralized counting, the second layer 
of the arsenal comes into play, which is distributed counting.

In distributed counting, the CAP theorem comes into play, and in Cassandra, 
availability and partition tolerance trump consistency.

So yes, you sacrifice strong consistency for availability and partition 
tolerance, and settle for eventual consistency.

On Sep 24, 2012 10:28 AM, "Roshni Rajagopal" <roshni_rajago...@hotmail.com> 
wrote:

Hi folks,
   I looked at my mail below, and I'm rambling a bit, so I'll try to restate my 
queries pointwise.
a) What are the performance tradeoffs on reads & writes between creating a 
standard column family and manually doing the counts by a lookup on a key, 
versus using counters?
b) What is the current state of counters' limitations in the latest version of 
Apache Cassandra?
c) With there being a possibility of counter values getting out of sync, would 
counters not be recommended where strong consistency is desired? The normal 
benefits of Cassandra's tunable consistency would not be applicable, as 
retries may cause overstating. So the normal use case is one where high 
performance is needed and consistency is not paramount.

Regards,
roshni

From: roshni_rajago...@hotmail.com
To: user@cassandra.apache.org
Subject: Cassandra Counters
Date: Mon, 24 Sep 2012 16:21:55 +0530

Hi,
I'm trying to understand if counters are a good fit for my use case. I've 
watched http://blip.tv/datastax/counters-in-cassandra-5497678 many times over 
now... and still need help!
Suppose I have a list of items, to which I can add or delete a set of items at 
a time, and I want a count of the items, without considering changing the 
database or adding components like ZooKeeper.

I have 2 options: the first is a counter column family, and the second is a 
standard one.

  1. List_Counter_CF

              TotalItems
    ListId    50

  2. List_Std_CF

              TimeUUID1  TimeUUID2  TimeUUID3  TimeUUID4  TimeUUID5
    ListId    3          70         -20        3          -6

In the second, I can add a new column with every set of items added or 
deleted. Over time this row may grow wide. To display the final count, I'd need 
to read the row, slice through all columns and add them.
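
[Editor's sketch] The read path of the second option can be illustrated with a 
small in-memory example. It assumes a plain dict standing in for one 
List_Std_CF row; record_change and current_count are hypothetical helper names, 
and no Cassandra client is involved.

```python
import uuid

# Hypothetical stand-in for one List_Std_CF row: column name -> signed delta.
row = {}

def record_change(row, delta):
    """Add one column holding the size of a batch of added or removed items."""
    column_name = uuid.uuid1()   # time-based UUID, playing the TimeUUID role
    row[column_name] = delta
    return column_name

def current_count(row):
    """Slice the whole row and sum the deltas in application code."""
    return sum(row.values())

# The five deltas from the List_Std_CF example.
for delta in (3, 70, -20, 3, -6):
    record_change(row, delta)

print(current_count(row))  # 50, the same total as TotalItems in List_Counter_CF
```

The cost of current_count grows with the number of columns in the row, which is 
the wide-row concern raised here.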

In both cases the writes should be fast; in fact the standard column family 
should be faster, as there's no read before write. And for a CL ONE write the 
latency should be the same. For reads, the first option is very good: just read 
one column for a key.

For the second, the read involves reading the row and adding each column value 
in application code. I don't think there's a way to do math via CQL yet. There 
should be no hot spotting if the key is sharded well. I could even maintain 
the count derived from List_Std_CF in a separate standard column family that 
holds the final number, and do that as a separate process immediately after the 
write to List_Std_CF completes, so that it's not blocking. I understand 
Cassandra is faster for writes than reads, but how slow would reading by row 
key be? Is there any number for how many columns it takes before performance 
starts deteriorating, or for how much worse the performance would be?

The advantage I see is that I can use the same consistency rules as for the 
rest of the column families. With quorum for reads & writes, you get strongly 
consistent values. With counters, I see that in case of timeout exceptions 
because the first replica is down or not responding, there's a chance of the 
values getting messed up, and retrying can mess them up further. It's not 
idempotent like a standard column family design can be.
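
[Editor's sketch] The retry concern can be shown with a toy model. It assumes 
a hypothetical FlakyStore class whose first write is applied but then reported 
as a timeout; this is a simplification for illustration, not a model of 
Cassandra's actual replication.

```python
class FlakyStore:
    """Hypothetical store: the first write is applied, then reported as timed out."""

    def __init__(self):
        self.counter = 0
        self.columns = {}
        self.fail_next = True

    def _maybe_timeout(self):
        if self.fail_next:
            self.fail_next = False
            raise TimeoutError("write applied, acknowledgement lost")

    def increment(self, delta):
        self.counter += delta          # counter-style write: not idempotent
        self._maybe_timeout()

    def put_column(self, name, delta):
        self.columns[name] = delta     # column write: same name overwrites itself
        self._maybe_timeout()

def with_retry(operation, *args):
    """Retry once on timeout, as a client might."""
    try:
        operation(*args)
    except TimeoutError:
        operation(*args)

counter_store = FlakyStore()
with_retry(counter_store.increment, 5)

column_store = FlakyStore()
with_retry(column_store.put_column, "TimeUUID1", 5)

print(counter_store.counter)               # 10: the retry counted twice
print(sum(column_store.columns.values()))  # 5: the retry rewrote the same column
```

The column write is safe to retry only because the client reuses the same 
column name for the retry; generating a fresh TimeUUID on retry would 
double-count as well.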

If it gets messed up, it would need an administrator's help (is there a 
document on how we could resolve counter values going wrong?).
I believe the rest of the limitations still hold. Has anything changed in 
recent versions? In my opinion, they are not as major as the consistency 
question.

- removing a counter & then modifying its value: behaviour is undetermined
- special process for counter column family sstable loss (need to remove all 
  files)
- no TTL support
- no secondary indexes

In short, I'd say counters can be used for analytics or when dealing with data 
where the exact numbers are not important, or when it's OK to take some time to 
fix a mismatch and the performance requirements are most important.

However, where the numbers should match, it's better to use a standard column 
family and a manual implementation.
Please share your thoughts on this.

Regards,
roshni

-- 
alex p
