Re: clearing tombstones?

2014-04-11 Thread Mina Naguib

Levelled Compaction is a wholly different beast when it comes to tombstones.

The tombstones are inserted, like any other write really, at the lowest level 
of the leveled-compaction hierarchy.

They are only removed after they have had the chance to "naturally" migrate 
upwards through the levels to the highest level in your data store.  How long 
that takes depends on:
1. The amount of data in your store and the number of levels your LCS 
strategy has
2. The volume of new writes entering level 0, forcing upward compaction and 
merging
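
If you want to watch that process on a node, a rough sketch of what I keep an 
eye on (plain nodetool plus disk usage; the keyspace name and data path below 
are placeholders for your own):

  # pending/in-flight compactions doing the upward migration
  nodetool -h localhost compactionstats

  # live vs total space per column family - purged tombstones show up as
  # "Space used (live)" shrinking
  nodetool -h localhost cfstats | grep -A 3 'Column Family:'

  # raw on-disk footprint (path assumes the default data directory)
  du -sh /var/lib/cassandra/data/YourKeyspace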

To give you an idea, I had a similar scenario and ran a (slow, throttled) 
delete job on my cluster around December-January.  Here's a graph of the disk 
space usage on one node.  Notice the still-declining usage long after the 
cleanup job had finished (sometime in January).  I tend to think of tombstones 
in LCS as little bombs that get to explode much later in time:

http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg



On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes 
 wrote:

> I have a similar problem here, I deleted about 30% of a very large CF using 
> LCS (about 80GB per node), but still my data hasn't shrunk, even though I used 
> 1 day for gc_grace_seconds. Would nodetool scrub help? Does nodetool scrub 
> force a minor compaction?
> 
> Cheers,
> 
> Paulo
> 
> 
> On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy  wrote:
> Yes, running nodetool compact (major compaction) creates one large SSTable. 
> This will mess up the heuristics of the SizeTiered strategy (is this the 
> compaction strategy you are using?) leading to multiple 'small' SSTables 
> alongside the single large SSTable, which results in increased read latency. 
> You will incur the operational overhead of having to manage compactions if 
> you wish to compact these smaller SSTables. For all these reasons it is 
> generally advised to stay away from running compactions manually.
> 
> Assuming that this is a production environment and you want to keep 
> everything running as smoothly as possible I would reduce the gc_grace on the 
> CF, allow automatic minor compactions to kick in and then increase the 
> gc_grace once again after the tombstones have been removed.
> 
> 
> On Fri, Apr 11, 2014 at 3:44 PM, William Oberman  
> wrote:
> So, if I was impatient and just "wanted to make this happen now", I could:
> 
> 1.) Change GCGraceSeconds of the CF to 0
> 2.) run nodetool compact (*)
> 3.) Change GCGraceSeconds of the CF back to 10 days
> 
> Since I have ~900M tombstones, even if I miss a few due to impatience, I 
> don't care *that* much as I could re-run my clean up tool against the now 
> much smaller CF.
> 
> (*) A long long time ago I seem to recall reading advice about "don't ever 
> run nodetool compact", but I can't remember why.  Is there any bad long term 
> consequence?  Short term there are several:
> -a heavy operation
> -temporary 2x disk space
> -one big SSTable afterwards
> But moving forward, everything is ok right?  CommitLog/MemTable->SSTables, 
> minor compactions that merge SSTables, etc...  The only flaw I can think of 
> is it will take forever until the SSTable minor compactions build up enough 
> to consider including the big SSTable in a compaction, making it likely I'll 
> have to self manage compactions.
> 
> 
> 
> On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy  wrote:
> Correct, a tombstone will only be removed after gc_grace period has elapsed. 
> The default value is set to 10 days which allows a great deal of time for 
> consistency to be achieved prior to deletion. If you are operationally 
> confident that you can achieve consistency via anti-entropy repairs within a 
> shorter period you can always reduce that 10 day interval.
> 
> 
> Mark
> 
> 
> On Fri, Apr 11, 2014 at 3:16 PM, William Oberman  
> wrote:
> I'm seeing a lot of articles about a dependency between removing tombstones 
> and GCGraceSeconds, which might be my problem (I just checked, and this CF 
> has GCGraceSeconds of 10 days).
> 
> 
> On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli  
> wrote:
> compaction should take care of it; for me it never worked so I run nodetool 
> compaction on every node; that does it.
> 
> 
> 2014-04-11 16:05 GMT+02:00 William Oberman :
> 
> I'm wondering what will clear tombstoned rows?  nodetool cleanup, nodetool 
> repair, or time (as in just wait)?
> 
> I had a CF that was more or less storing session information.  After some 
> time, we decided that one piece of this information was pointless to track 
> (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a 
> row).   I wrote a process to remove all of those columns (which again in a 
> vast majority of cases had the effect of removing the whole row).
> 
> This CF had ~1 billion rows, so I expect to be left with ~100m rows.  After I 
> did this mass delete, everything was the same size on disk (which I expected, 
> knowing how tombstoning works).  

Re: cassandra error on restart

2013-09-10 Thread Mina Naguib

There was mention of a similar crash on the mailing list.  Does this apply to 
your case ?

http://mail-archives.apache.org/mod_mbox/cassandra-user/201306.mbox/%3ccdecfcfa.11e95%25agundabatt...@threatmetrix.com%3E


--
Mina Naguib
AdGear Technologies Inc.
http://adgear.com/

On 2013-09-10, at 10:09 AM, "Langston, Jim"  wrote:

> Hi all,
> 
> I restarted my cassandra ring this morning, but it is refusing to
> start. Everything was fine, but now I get this error in the log:
> 
> ….
>  INFO 14:05:14,420 Compacting 
> [SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-20-Data.db'),
>  
> SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-21-Data.db'),
>  
> SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-23-Data.db'),
>  
> SSTableReader(path='/raid0/cassandra/data/system/local/system-local-ic-22-Data.db')]
>  INFO 14:05:14,493 Compacted 4 sstables to 
> [/raid0/cassandra/data/system/local/system-local-ic-24,].  1,086 bytes to 486 
> (~44% of original) in 66ms = 0.007023MB/s.  4 total rows, 1 unique.  Row 
> merge counts were {1:0, 2:0, 3:0, 4:1, }
>  INFO 14:05:14,543 Starting Messaging Service on port 7000
> java.lang.NullPointerException
> at 
> org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:745)
> at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:554)
> at 
> org.apache.cassandra.service.StorageService.initServer(StorageService.java:451)
> at 
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:348)
> at org.apache.cassandra.service.CassandraDaemon.init(CassandraDaemon.java:381)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at org.apache.commons.daemon.support.DaemonLoader.load(DaemonLoader.java:212)
> Cannot load daemon
> 
> 
> and cassandra will not start. I get the same error on all the nodes in the 
> ring.
> 
> Thoughts?
> 
> Thanks,
> 
> Jim



Re: Why don't you start off with a "single & small" Cassandra server as you usually do it with MySQL ?

2013-08-27 Thread Mina Naguib


On 2013-08-27, at 6:04 AM, Aklin_81  wrote:

> For any website just starting out, the load initially is minimal & grows at 
> a slow pace. People usually start with their MySQL based sites 
> with a single server (***that too a VPS, not a dedicated server) running as 
> both app server as well as DB server & usually get quite far with this setup & 
> only as they feel the need do they separate the DB from the app server, giving 
> it a separate VPS server. This is how a startup expects things to go while 
> planning resource procurement.
> 
> But so far what I have seen, it's something very different with Cassandra. 
> People usually recommend starting out with at least a 3 node cluster (on 
> dedicated servers) with lots & lots of RAM. 4GB or 8GB RAM is what they 
> suggest to start with. So is it that Cassandra requires more hardware 
> resources in comparison to MySQL,  for a website to deliver similar 
> performance, serve similar load/ traffic & same amount of data. I understand 
> about higher storage requirements of Cassandra due to replication but what 
> about other hardware resources ? 
> 
> Can't we start off with Cassandra based apps just like MySQL. Starting with 1 
> or 2 VPS & adding more whenever there's a need. Renting out dedicated servers 
> with lots of RAM just from the beginning may be viable for very well funded 
> startups but not for all.

Yes you can, just make sure you do your homework, evaluate and measure things.

MySQL is a row-oriented RDBMS.  Cassandra is a distributed columnar key-value 
store. While both are "databases", they serve different use cases.

I think it's an illusion that a startup can "get by" on just a single virtual 
instance somewhere.  It's certainly doable, but very risky.  Doing that means 
that if the server catches fire, your startup's data and other IP are lost.

Any reasonable architecture in this day and age must account for such 
disasters.  Cassandra is built around the assumption that failure is the norm, 
and this is handled by encouraging multiple servers and an increased 
replication factor by default.  You can certainly scale that back down to a 
single machine if you want, provided you understand what risks you're taking.
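
For example, in CQL you can start a keyspace at replication factor 1 on a 
single box and raise it once you add hardware (a sketch assuming a CQL3-capable 
version and a hypothetical keyspace name):

  cqlsh> CREATE KEYSPACE startup WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
  cqlsh> ALTER KEYSPACE startup WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

After raising the replication factor, run "nodetool repair" on each node so the 
new replicas actually receive the existing data.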

Performance-wise, cassandra is quite fast even in a single-node scenario.  
Again, don't just take that at face value - do your own benchmarks using your 
own use cases and workloads.
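
The stress tool that ships with cassandra is an easy way to get a baseline on a 
single node.  A sketch - the exact option names vary between versions, so check 
the --help output of the copy in your distribution's tools/bin directory:

  # write 100,000 keys against one local node, then read them back
  tools/bin/cassandra-stress -d 127.0.0.1 -n 100000 -o insert
  tools/bin/cassandra-stress -d 127.0.0.1 -n 100000 -o read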




Re: C language - cassandra

2013-05-17 Thread Mina Naguib

Hi Apostolis

I'm the author of libcassie, a C library for cassandra that wraps the C++ 
libcassandra library.  

It's in use in production where I work, however it has not received much 
traction elsewhere as far as I know.  You can get it here:
https://github.com/minaguib/libcassandra/tree/kickstart-libcassie-0.7

It has not been updated for a while (for example no CQL support, no pooling 
support).  I've been waiting for either the Thrift C (GLib) interface to 
mature, or the thrift-less CQL binary protocol to mature, before putting effort 
into updating/rewriting it.  It might however satisfy your needs with its 
current functionality.



On 2013-05-17, at 10:42 AM, Apostolis Xekoukoulotakis  
wrote:

> Hello, new here, What are my options in using cassandra from a program 
> written in c?
> 
> A)
> Thrift has no documentation, so it will take me time to understand.
> Thrift also doesn't have a balancing pool, asking different nodes every time, 
> which is a big problem.
> 
> B)
> Should I use the hector (java) client and then send the data to my program 
> with my own protocol? 
> Seems a lot of unnecessary work.
> 
> Any other suggestions?
> 
> 
> -- 
> 
> Sincerely yours, 
>  Apostolis Xekoukoulotakis



Re: Is this how to read the output of nodetool cfhistograms?

2013-01-22 Thread Mina Naguib


On 2013-01-22, at 8:59 AM, Brian Tarbox  wrote:

> The output of this command seems to make no sense unless I think of it as 5 
> completely separate histograms that just happen to be displayed together.
> 
> Using this example output should I read it as: my reads all took either 1 or 
> 2 sstables.  And separately, I had write latencies of 3, 7, 19.  And separately 
> I had read latencies of 2, 8, 69, etc?
> 
> In other words...each row isn't really a row...i.e. on those 16033 reads from 
> a single SSTable I didn't have 0 write latency, 0 read latency, 0 row size 
> and 0 column count.  Is that right?

Correct.  A number in any of the metric columns is a count, bucketed by the 
offset on that row.  There is no relationship between the other columns on the 
same row.

So your first row says "16033 reads were satisfied by 1 sstable".  The other 
metrics (for example, the latency of those reads) are reflected in their own 
histograms, such as "Read Latency", under various other bucketed offsets.

> 
> Offset    SSTables  Write Latency  Read Latency  Row Size  Column Count
> 1            16033              0             0         0             0
> 2              303              0             0         0             1
> 3                0              0             0         0             0
> 4                0              0             0         0             0
> 5                0              0             0         0             0
> 6                0              0             0         0             0
> 7                0              0             0         0             0
> 8                0              0             2         0             0
> 10               0              0             0         0          6261
> 12               0              0             2         0           117
> 14               0              0             8         0             0
> 17               0              3            69         0           255
> 20               0              7           163         0             0
> 24               0             19          1369         0             0
> 



Re: continue seeing "Finished hinted handoff of 0 rows to endpoint"

2012-11-25 Thread Mina Naguib


On 2012-11-24, at 10:37 AM, Chuan-Heng Hsiao  wrote:

> However, I continue seeing the following in /var/log/cassandra/system.log
> 
> INFO [HintedHandoff:1] 2012-11-24 22:58:28,088
> HintedHandOffManager.java (line 296) Started hinted handoff for token:
> 27949589543905115548813332729343195104 with IP: /192.168.0.10
> INFO [HintedHandoff:1] 2012-11-24 22:58:28,089
> HintedHandOffManager.java (line 392) Finished hinted handoff of 0 rows
> to endpoint /192.168.0.10
> 
> every ten mins.


See if https://issues.apache.org/jira/browse/CASSANDRA-4740 is relevant in your 
case.

Re: leveled compaction and tombstoned data

2012-11-09 Thread Mina Naguib


On 2012-11-08, at 1:12 PM, B. Todd Burruss  wrote:

> we are having the problem where we have huge SSTABLEs with tombstoned data in 
> them that is not being compacted soon enough (because size tiered compaction 
> requires, by default, 4 like sized SSTABLEs).  this is using more disk space 
> than we anticipated.
> 
> we are very write heavy compared to reads, and we delete the data after N 
> number of days (depends on the column family, but N is around 7 days)
> 
> my question is would leveled compaction help to get rid of the tombstoned 
> data faster than size tiered, and therefore reduce the disk space usage

From my experience, levelled compaction makes space reclamation after deletes 
even less predictable than size-tiered.

The reason is that deletes, like all mutations, are just recorded into 
sstables.  They enter level0, and get slowly, over time, promoted upwards to 
levelN.

Depending on your *total* mutation volume vs your data set size, this may be 
quite a slow process.  It is made even worse when the data you're deleting 
(say, an entire row worth several hundred kilobytes) is to be deleted by a 
small row-level tombstone.  If the row is sitting in level 4, the tombstone 
won't affect it until enough new writes have pushed the tombstone up through 
level 0, level 1, level 2 and level 3.

Finally, to guard against the tombstone missing any data, the tombstone itself 
is not a candidate for removal (I believe even after gc_grace has passed) 
unless it has reached the highest populated level in levelled compaction.  This 
means that if you have 4 levels and issue a ton of deletes (even deletes that 
will never impact existing data), these tombstones are dead weight that cannot 
be purged until they hit level 4.

For a write-heavy workload, I recommend you stick with size-tiered.  You have 
several options at your disposal (compaction min/max thresholds, gc_grace) to 
move things along.  If that doesn't help, I've heard of some fairly reputable 
people doing some fairly blasphemous things (major compactions every night).
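
If you stay on size-tiered, those knobs can be adjusted per column family from 
cassandra-cli.  A sketch with made-up names and values - pick a gc_grace that 
matches how quickly you can run repairs across the cluster:

  [default@YourKeyspace] update column family YourCF with gc_grace = 86400
      and min_compaction_threshold = 2 and max_compaction_threshold = 32;

Lowering min_compaction_threshold makes size-tiered compaction kick in with 
fewer like-sized sstables, at the cost of more compaction I/O.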




Re: virtual memory of all cassandra-nodes is growing extremly since Cassandra 1.1.0

2012-08-01 Thread Mina Naguib

All our servers (cassandra and otherwise) get monitored with nagios + get many 
basic metrics graphed by pnp4nagios.  This covers a large chunk of a box's 
health, as well as cassandra basics (specifically the pending tasks, JVM heap 
state).  IMO it's not possible to clearly debug a cassandra issue if you don't 
have a good holistic view of the boxes' health (CPU, RAM, swap, disk 
throughput, etc.)

Separate from that we have an operational dashboard.  It's a bunch of 
manually-defined RRD files and custom scripts that grab metrics, store, and 
graph the health of various layers in the infrastructure in an 
easy-to-digest way (for example, each data center gets a color scheme - stacked 
machines within multiple DCs can just be eyeballed).  There we can see for 
example our total read volume, total write volume, struggling boxes, dynamic 
endpoint snitch reaction, etc...

Finally, almost all the software we write integrates with statsd + graphite.  
In graphite we have more metrics than we know what to do with, but it's better 
than the other way around.  From there for example we can see cassandra's 
response time including things cassandra itself can't measure (network, thrift, 
etc), across various different client softwares that talk to it.  Within 
graphite we have several dashboards defined (users make their own, some 
infrastructure components have shared dashboards.)
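
For what it's worth, the statsd wire protocol is trivial, which is why it's 
easy to sprinkle metrics everywhere.  A minimal sketch (hostname, port and 
metric name here are made up):

  # one UDP datagram per metric, in the form "name:value|type" ("g" = gauge)
  echo "cassandra.node1.pending_compactions:3|g" | nc -u -w1 statsd.internal 8125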


--
Mina Naguib :: Director, Infrastructure Engineering
Bloom Digital Platforms :: T 514.394.7951 #208
http://bloom-hq.com/



On 2012-08-01, at 3:43 PM, Greg Fausak wrote:

> Mina,
> 
> Thanks for that post.  Very interesting :-)
> 
> What sort of things are you graphing?  Standard *nux stuff
> (mem/cpu/etc)?  Or do you
> have some hooks into the C* process (I saw something about port 1414
> in the .yaml file).
> 
> Best,
> 
> -g
> 
> 
> On Thu, Jul 26, 2012 at 9:27 AM, Mina Naguib
>  wrote:
>> 
>> Hi Thomas
>> 
>> On a modern 64bit server, I recommend you pay little attention to the 
>> virtual size.  It's made up of almost everything within the process's 
>> address space, including on-disk files mmap()ed in for zero-copy access.  
>> It's not unreasonable for a machine with N amount RAM to have a process 
>> whose virtual size is several times the value of N.  That in and of itself 
>> is not problematic
>> 
>> In a default cassandra 1.1.x setup, the bulk of that will be your sstables' 
>> data and index files.  On linux you can invoke the "pmap" tool on the 
>> cassandra process's PID to see what's in there.  Much of it will be 
>> anonymous memory allocations (the JVM heap itself, off-heap data structures, 
>> etc), but lots of it will be references to files on disk (binaries, 
>> libraries, mmap()ed files, etc).
>> 
>> What's more important to keep an eye on is the JVM heap - typically 
>> statically allocated to a fixed size at cassandra startup.  You can get info 
>> about its used/capacity values via "nodetool -h localhost info".  You can 
>> also hook up jconsole and trend it over time.
>> 
>> The other critical piece is the process's RESident memory size, which 
>> includes the JVM heap but also other off-heap data structures and 
>> miscellanea.  Cassandra has recently been making more use of off-heap 
>> structures (for example, row caching via SerializingCacheProvider).  This is 
>> done as a matter of efficiency - a serialized off-heap row is much smaller 
>> than a classical object sitting in the JVM heap - so you can do more with 
>> less.
>> 
>> Unfortunately, in my experience, it's not perfect.  They still have a cost, 
>> in terms of on-heap usage, as well as off-heap growth over time.
>> 
>> Specifically, my experience with cassandra 1.1.0 showed that off-heap row 
>> caches incurred a very high on-heap cost (ironic) - see my post at 
>> http://mail-archives.apache.org/mod_mbox/cassandra-user/201206.mbox/%3c6feb097f-287b-471d-bea2-48862b30f...@bloomdigital.com%3E
>>  - as documented in that email, I managed that with regularly scheduled full 
>> GC runs via System.gc()
>> 
>> I have, since then, moved away from scheduled System.gc() to scheduled row 
>> cache invalidations.  While this had the same effect as System.gc() I 
>> described in my email, it eliminated the 20-30 second pause associated with 
>> it.  It did however introduce (or may be I never noticed earlier), slow 
>> creep in memory usage outside of the heap.
>> 
>> It's typical in my case for example for a process configured with 6G of JVM 
>> heap to start up, stabilize at 6.5 - 7GB RESident usage, th

Re: virtual memory of all cassandra-nodes is growing extremly since Cassandra 1.1.0

2012-07-26 Thread Mina Naguib

Hi Thomas

On a modern 64bit server, I recommend you pay little attention to the virtual 
size.  It's made up of almost everything within the process's address space, 
including on-disk files mmap()ed in for zero-copy access.  It's not 
unreasonable for a machine with N amount RAM to have a process whose virtual 
size is several times the value of N.  That in and of itself is not problematic

In a default cassandra 1.1.x setup, the bulk of that will be your sstables' 
data and index files.  On linux you can invoke the "pmap" tool on the cassandra 
process's PID to see what's in there.  Much of it will be anonymous memory 
allocations (the JVM heap itself, off-heap data structures, etc), but lots of 
it will be references to files on disk (binaries, libraries, mmap()ed files, 
etc).
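
For example, a quick way to see the largest mappings (a rough sketch that 
assumes a single cassandra JVM on the box):

  # sort the cassandra process's mappings by virtual size and show the biggest
  pmap -x $(pgrep -f CassandraDaemon) | sort -k2 -n | tail -20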

What's more important to keep an eye on is the JVM heap - typically statically 
allocated to a fixed size at cassandra startup.  You can get info about its 
used/capacity values via "nodetool -h localhost info".  You can also hook up 
jconsole and trend it over time.

The other critical piece is the process's RESident memory size, which includes 
the JVM heap but also other off-heap data structures and miscellanea.  
Cassandra has recently been making more use of off-heap structures (for 
example, row caching via SerializingCacheProvider).  This is done as a matter 
of efficiency - a serialized off-heap row is much smaller than a classical 
object sitting in the JVM heap - so you can do more with less.

Unfortunately, in my experience, it's not perfect.  They still have a cost, in 
terms of on-heap usage, as well as off-heap growth over time.

Specifically, my experience with cassandra 1.1.0 showed that off-heap row 
caches incurred a very high on-heap cost (ironic) - see my post at 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201206.mbox/%3c6feb097f-287b-471d-bea2-48862b30f...@bloomdigital.com%3E
 - as documented in that email, I managed that with regularly scheduled full GC 
runs via System.gc()

I have, since then, moved away from scheduled System.gc() to scheduled row 
cache invalidations.  While this had the same effect as System.gc() I described 
in my email, it eliminated the 20-30 second pause associated with it.  It did, 
however, introduce (or maybe I had just never noticed it earlier) a slow creep 
in memory usage outside of the heap.

It's typical in my case for example for a process configured with 6G of JVM 
heap to start up, stabilize at 6.5 - 7GB RESident usage, then creep up slowly 
throughout a week to 10-11GB range.  Depending on what else the box is doing, 
I've experienced the linux OOM killer killing cassandra as you've described, or 
heavy swap usage bringing everything down (we're latency-sensitive), etc..

And now for the good news.  Since I've upgraded to 1.1.2:
1. There's no more need for regularly scheduled System.gc()
2. There's no more need for regularly scheduled row cache invalidation
3. The HEAP usage within the JVM is stable over time
4. The RESident size of the process appears also stable over time

Point #4 above is still pending as I only have 3 day graphs since the upgrade, 
but they show promising results compared to the slope of the same graph before 
the upgrade to 1.1.2

So my advice is give 1.1.2 a shot - just be mindful of 
https://issues.apache.org/jira/browse/CASSANDRA-4411


On 2012-07-26, at 2:18 AM, Thomas Spengler wrote:

> I saw this.
> 
> All works fine up to version 1.1.0
> the 0.8.x takes 5GB of memory of an 8GB machine
> the 1.0.x takes between 6 and 7 GB on a 8GB machine
> and
> the 1.1.0 takes all
> 
> and it is a problem
> for me it is no solution to wait for the OOM killer from the linux kernel
> and restart the cassandra process
> 
> when my machine has less than 100MB of RAM available then I have a problem.
> 
> 
> 
> On 07/25/2012 07:06 PM, Tyler Hobbs wrote:
>> Are you actually seeing any problems from this? High virtual memory usage
>> on its own really doesn't mean anything. See
>> http://wiki.apache.org/cassandra/FAQ#mmap
>> 
>> On Wed, Jul 25, 2012 at 1:21 AM, Thomas Spengler <
>> thomas.speng...@toptarif.de> wrote:
>> 
>>> No one has any idea?
>>> 
>>> we tryed
>>> 
>>> update to 1.1.2
>>> DiskAccessMode standard, indexAccessMode standard
>>> row_cache_size_in_mb: 0
>>> key_cache_size_in_mb: 0
>>> 
>>> 
>>> Our next try will to change
>>> 
>>> SerializingCacheProvider to ConcurrentLinkedHashCacheProvider
>>> 
>>> any other proposals are welcom
>>> 
>>> On 07/04/2012 02:13 PM, Thomas Spengler wrote:
 Hi @all,
 
 since our upgrade from cassandra 1.0.3 to 1.1.0 the virtual memory usage
 of the cassandra-nodes explodes
 
 our setup is:
 * 5 - centos 5.8 nodes
 * each 4 CPU's and 8 GB RAM
 * each node holds about 100 GB on data
 * each jvm's uses 2GB Ram
 * DiskAccessMode is standard, indexAccessMode is standard
 
 The memory usage grows upto the whole memory is used.
 
 Just for in

High CPU usage as of 8pm eastern time

2012-06-30 Thread Mina Naguib

Hi folks

Our cassandra (and other java-based apps) started experiencing extremely high 
CPU usage as of 8pm eastern time (midnight UTC).

The issue appears to be related to specific versions of java + linux + ntpd, 
triggered by the leap second inserted at midnight UTC.

There are many solutions floating around on IRC, twitter, stackexchange, LKML.

The simplest one that worked for us is simply to run this command on each 
affected machine:

date; date `date +"%m%d%H%M%C%y.%S"`; date;

CPU drop was instantaneous - there was no need to restart the server, ntpd, or 
any of the affected JVMs.





Re: Random slow connects.

2012-06-14 Thread Mina Naguib

On 2012-06-14, at 10:38 AM, Henrik Schröder wrote:

> Hi everyone,
> 
> We have a problem with our Cassandra cluster, and that is that sometimes it 
> takes several seconds to open a new Thrift connection to the server. We've 
> had this issue when we ran on windows, and we have this issue now that we run 
> on Ubuntu. We've had it with our old networking setup, and we have it with 
> our new networking setup where we're running it over a dedicated gigabit 
> network. Normally establishing a new connection is instant, but once in a 
> while it seems like it's not accepting any new connections until three 
> seconds have passed.
> 
> We're of course running a connection-pooling client which mitigates this, 
> since once a connection is established, it's rock solid.
> 
> We tried switching the rpc_server_type to hsha, but that seems to have made 
> the problem worse, we're seeing more connection timeouts because of this.
> 
> For what it's worth, we're running Cassandra version 1.0.10 on Ubuntu, and our 
> connection pool is configured to abort a connection attempt after two 
> seconds, and each connection lives for six hours and then it's recycled. 
> Under current load we do about 500 writes/s and 100 reads/s, we have 20 
> clients, but each has a very small connection pool of maybe up to 5 
> simultaneous connections against each Cassandra server. We see these 
> connection issues maybe once a day, but always at random intervals.
> 
> We've tried to get more information through Datastax Opscenter, the JMX 
> console, and our own application monitoring and logging, but we can't see 
> anything out of the ordinary. Sometimes, seemingly by random, it's just 
> really slow to connect. We're all out of ideas. Does anyone here have 
> suggestions on where to look and what to do next?

Have you ironed out non-cassandra potential causes ?

A constant 3 seconds sounds like it could be a timeout/retry somewhere.  Do you 
contact cassandra via a hostname or an IP address?  If via hostname, iron out DNS.

Either way, I'd fire up tcpdump on both the client and the server, and 
observe the TCP handshake.  Specifically, see if the SYN packet is sent and 
received, whether the SYN-ACK is sent back right away and received, and the 
final ACK.

If that looks good, then TCP-wise you're in good shape and the problem is in a 
higher layer (thrift).  If not, see where the delay/drop/retry happens.  If 
it's in the first packet, it may be a networking/routing issue.  If in the 
second, it may be a capacity issue at the server (investigate with 
lsof/netstat/JMX), etc..
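
For example, something along these lines on both ends will show just the 
handshake/teardown packets (a sketch; 9160 is the default thrift rpc_port, so 
adjust if yours differs):

  # capture only SYN/FIN/RST packets to and from the thrift port
  tcpdump -i any -nn 'port 9160 and (tcp[tcpflags] & (tcp-syn|tcp-fin|tcp-rst) != 0)'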




Re: memory issue on 1.1.0

2012-06-05 Thread Mina Naguib

Hi Wade

I don't know if your scenario matches mine, but I've been struggling with 
memory pressure in 1.x as well.  I made the jump from 0.7.9 to 1.1.0, along 
with enabling compression and levelled compactions, so I don't know which 
specifically is the main culprit.

Specifically, all my nodes seem to "lose" heap memory.  As parnew and CMS do 
their job, over any reasonable period of time, the "floor" of memory after a GC 
keeps rising.  This is quite visible if you leave jconsole connected for a day 
or so, and manifests itself as a funny-looking cone like so: 
http://mina.naguib.ca/images/cassandra_jconsole.png

Once memory pressure reaches a point where the heap can't be maintained 
reliably below 75%, cassandra goes into survival mode - via a bunch of tunables 
in cassandra.yaml it'll do things like flush memtables, drop caches, etc - all 
of which, in my experience, especially with the recent off-heap data 
structures, exacerbate the problem.

I've been meaning, of course, to collect enough technical data to file a bug 
report, but haven't had the time.  I have not yet tested 1.1.1 to see if it 
improves the situation.

What I have found, however, is a band-aid, which you can see at the rightmost 
section of the graph in the screenshot I posted.  That is simply to hit the 
"Perform GC" button in jconsole.  It seems that a full System.gc() *DOES* 
reclaim heap memory that parnew and CMS fail to reclaim.

On my production cluster I have a full-GC via JMX scheduled in a rolling 
fashion every 4 hours.  It's extremely expensive (20-40 seconds of 
unresponsiveness) but is a necessary evil in my situation.  Without it, my 
nodes enter a nasty spiral of constant flushing, constant compactions, high 
heap usage, instability and high latency.
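
For reference, the scheduled full GC doesn't need jconsole - any command-line 
JMX client can invoke the same MBean operation.  A sketch using jmxterm (the 
jar name and the default JMX port 7199 are assumptions about your setup):

  # ask the JVM for a full GC - same as jconsole's "Perform GC" button
  echo "run -b java.lang:type=Memory gc" | \
    java -jar jmxterm-uber.jar -l localhost:7199 -n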


On 2012-06-05, at 2:56 PM, Poziombka, Wade L wrote:

> Alas, upgrading to 1.1.1 did not solve my issue.
> 
> -Original Message-
> From: Brandon Williams [mailto:dri...@gmail.com] 
> Sent: Monday, June 04, 2012 11:24 PM
> To: user@cassandra.apache.org
> Subject: Re: memory issue on 1.1.0
> 
> Perhaps the deletes: https://issues.apache.org/jira/browse/CASSANDRA-3741
> 
> -Brandon
> 
> On Sun, Jun 3, 2012 at 6:12 PM, Poziombka, Wade L 
>  wrote:
>> Running a very write intensive (new column, delete old column etc.) process 
>> and failing on memory.  Log file attached.
>> 
>> Curiously when I add new data I have never seen this have in past sent 
>> hundreds of millions "new" transactions.  It seems to be when I 
>> modify.  my process is as follows
>> 
>> key slice to get columns to modify in batches of 100, in separate threads 
>> modify those columns.  I advance the slice with the start key each with last 
>> key in previous batch.  Mutations done are update a column value in one 
>> column family(token), delete column and add new column in another (pan).
>> 
>> Runs well until after about 5 million rows then it seems to run out of 
>> memory.  Note that these column families are quite small.
>> 
>> WARN [ScheduledTasks:1] 2012-06-03 17:49:01,558 GCInspector.java (line 
>> 145) Heap is 0.7967470834946492 full.  You may need to reduce memtable 
>> and/or cache sizes.  Cassandra will now flush up to the two largest 
>> memtables to free up memory.  Adjust flush_largest_memtables_at 
>> threshold in cassandra.yaml if you don't want Cassandra to do this 
>> automatically
>>  INFO [ScheduledTasks:1] 2012-06-03 17:49:01,559 StorageService.java 
>> (line 2772) Unable to reduce heap usage since there are no dirty 
>> column families
>>  INFO [GossipStage:1] 2012-06-03 17:49:01,999 Gossiper.java (line 797) 
>> InetAddress /10.230.34.170 is now UP
>>  INFO [ScheduledTasks:1] 2012-06-03 17:49:10,048 GCInspector.java 
>> (line 122) GC for ParNew: 206 ms for 1 collections, 7345969520 used; 
>> max is 8506048512
>>  INFO [ScheduledTasks:1] 2012-06-03 17:49:53,187 GCInspector.java 
>> (line 122) GC for ConcurrentMarkSweep: 12770 ms for 1 collections, 
>> 5714800208 used; max is 8506048512
>> 
>> 
>> Keyspace: keyspace
>>Read Count: 50042632
>>Read Latency: 0.23157864418482224 ms.
>>Write Count: 44948323
>>Write Latency: 0.019460829472992797 ms.
>>Pending Tasks: 0
>>Column Family: pan
>>SSTable count: 5
>>Space used (live): 1977467326
>>Space used (total): 1977467326
>>Number of Keys (estimate): 16334848
>>Memtable Columns Count: 0
>>Memtable Data Size: 0
>>Memtable Switch Count: 74
>>Read Count: 14985122
>>Read Latency: 0.408 ms.
>>Write Count: 19972441
>>Write Latency: 0.022 ms.
>>Pending Tasks: 0
>>Bloom Filter False Postives: 829
>>Bloom Filter False Ratio: 0.00073
>>Bloom Filter Space Used: 37048400
>>Compacted row minimum size: 125
>>Compac

Re: Cassandra C client implementation

2011-12-14 Thread Mina Naguib

Hi Vlad

I'm the author of libcassie.

For what it's worth, it's in production where I work, consuming a heavily-used 
cassandra 0.7.9 cluster.

We do have plans to upgrade the cluster to 1.x, to benefit from all the 
improvements, CQL, etc... but that includes revising all our clients (across 
several programming languages).

So, it's definitely on my todo list to address our C clients by either 
upgrading libcassie, or possibly completely rewriting it.

Currently it's a wrapper around the C++ parent project libcassandra.  I haven't 
been fond of having that many layered abstractions, and the thrift Glib2 
interface has definitely piqued my interest, so I'm leaning towards a complete 
rewrite.

While we're at it, it would also be nice to have features like asynchronous 
modes for popular event loops, connection pooling, etc.

Unfortunately, I have no milestones set for any of this, nor the time 
(currently) to experiment and proof-of-concept it.

I'd be curious to hear from other C hackers whether they've experimented with 
the thrift Glib2 interface and gotten a "hello world" to work against cassandra 
1.x.  Perhaps there's room for some code sharing/collaboration on a new library 
to supersede the existing libcassie+libcassandra.


On 2011-12-14, at 5:16 PM, Vlad Paiu wrote:

> Hello Eric,
> 
> We have that, thanks a lot for the contribution.
> The idea is to not play around with including C++ code in a C app, if there's 
> an alternative ( the thrift g_libc ).
> 
> Unfortunately, since thrift does not generate a skeleton for the glibc code, 
> I don't know how to find out what the API functions are called, and guessing 
> them is not going that well :)
> 
> I'll wait a little longer & see if anybody can help with the C thrift, or at 
> least tell me it's not working. :)
> 
> Regards,
> Vlad
> 
> Eric Tamme  wrote:
> 
>> On 12/14/2011 04:18 PM, Vlad Paiu wrote:
>>> Hi,
>>> 
>>> Just tried libcassie and seems it's not compatible with latest cassandra, 
>>> as even simple inserts and fetches fail with InvalidRequestException...
>>> 
>>> So can anybody please provide a very simple example in C for connecting&  
>>> fetching columns with thrift ?
>>> 
>>> Regards,
>>> Vlad
>>> 
>>> Vlad Paiu  wrote:
>>> 
>> 
>> Vlad,
>> 
>> We have written a specific cassandra db module for usrloc with opensips 
>> and have open sourced it on github.  We use the thrift generated c++ 
>> bindings and extern stuff to c.  I spoke to bogdan about this a while 
>> ago, and gave him the github link, but here it is for your reference   
>> https://github.com/junction/db_jnctn_usrloc
>> 
>> Hopefully that helps.  I idle in #opensips too,  just ask about 
>> cassandra in there and I'll probably see it.
>> 
>> - Eric Tamme
>> 



Re: Peculiar imbalance affecting 2 machines in a 6 node cluster

2011-08-10 Thread Mina Naguib

Hi Aaron

Thank you very much for the reply and the pointers to the previous list 
discussions.  The second one was particularly telling.

I'm happy to say that the problem is fixed, and it's so trivial it's quite 
embarrassing - but I'll state it here for the sake of the archives.

There was an extra colon in the topology file, in the line defining IPLA3.  
It's just as visible in my prod config as it is in my example below ;-)

I'm guessing the parser splits each entry's datacenter:rack tuple on ":", so it 
probably parsed the IPLA3 entry as datacenter "DCLA" and rack ":RAC1" (which is 
different from the other nodes' "RAC1"), and so NTS did its thing distributing 
replicas evenly between racks, and IPLA3 got more of the data while IPLA2 got 
less.
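
Spelled out, the broken and fixed lines differ by just that one character (IPs 
replaced with placeholders, matching the redacted example in the original post):

  # broken - the extra colon effectively puts IPLA3 in rack ":RAC1"
  IPLA3:DCLA::RAC1
  # fixed
  IPLA3:DCLA:RAC1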

I've fixed it, and the reads/s and writes/s immediately equalized.  I'm now 
doing a round of repairs/compactions/cleanups to equalize the data load as well.

Unfortunately it's not easy in cassandra 0.7.8 to actually see the parsed 
topology state (unlike 0.8's nice ring output which shows the DC and rack), so 
I'm ashamed to say it took much longer than it should have to troubleshoot.

Thanks for your help.


On 2011-08-10, at 5:12 AM, aaron morton wrote:

> WRT the load imbalance checking the basics: you've run cleanup after any 
> tokens moves? Repair is running ?  Also sometimes nodes get a bit bloated 
> from repair and will settle down with compaction. 
> 
> Your slightly odd tokens in the MTL DC are making it a little tricky to 
> understand whats going on. But I'm trying to check if you've followed the 
> multi DC token selection here  
> http://wiki.apache.org/cassandra/Operations#Token_selection . Background 
> about what can happen in a multi dc deployment if the tokens are not right 
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Replica-data-distributing-between-racks-td6324819.html
> 
> This is what you currently have….
> 
> DC: LA
> IPLA1   Up  Normal  34.57 GB  11.11%  0
> IPLA2   Up  Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
> IPLA3   Up  Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
> 
> DC: MTL
> IPMTL1  Up  Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
> IPMTL2  Up  Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
> IPMTL3  Up  Normal  34.71 GB  22.22%  151236607520417094872610936636341427313
> 
> Using the bump approach you would have 
> 
> IPLA1 0 
> IPLA2 56713727820156410577229101238628035242
> IPLA3 113427455640312821154458202477256070484 
> 
> IPMTL1  1
> IPMTL2  56713727820156410577229101238628035243
> IPMTL3  113427455640312821154458202477256070485
> 
> Using the interleaving you would have 
> 
> IPLA1   0
> IPMTL1  28356863910078205288614550619314017621
> IPLA2   56713727820156410577229101238628035242
> IPMTL2  85070591730234615865843651857942052863
> IPLA3   113427455640312821154458202477256070484
> IPMTL3  141784319550391026443072753096570088105
> 
> The current setup in LA gives each node in LA 33% of the LA-local ring, which 
> should be right - just checking.  
> 
> If cleanup / repair / compaction is all good and you are confident the tokens 
> are right try poking around with nodetool getendpoints to see which nodes 
> keys are sent to.  Like you I cannot see anything obvious in NTS that would 
> cause load to be imbalanced if they are all in the same rack. 
> 
> Cheers
> 
> 
> -
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 10 Aug 2011, at 11:24, Mina Naguib wrote:
> 
>> Hi everyone
>> 
>> I'm observing a very peculiar type of imbalance and I'd appreciate any help 
>> or ideas to try.  This is on cassandra 0.7.8.
>> 
>> The original cluster was 3 machines in the DCMTL, equally balanced at 33.33% 
>> each and each holding roughly 34G.
>> 
>> Then, I added to it 3 machines in the LA data center.  The ring is currently 
>> as follows (IP addresses redacted for clarity):
>> 
>> Address Status State   LoadOwnsToken 
>>   
>>   
>> 151236607520417094872610936636341427313 
>> IPLA1   Up Normal  34.57 GB11.11%  0 
>>   
>

Peculiar imbalance affecting 2 machines in a 6 node cluster

2011-08-09 Thread Mina Naguib
Hi everyone

I'm observing a very peculiar type of imbalance and I'd appreciate any help or 
ideas to try.  This is on cassandra 0.7.8.

The original cluster was 3 machines in the DCMTL, equally balanced at 33.33% 
each and each holding roughly 34G.

Then, I added to it 3 machines in the LA data center.  The ring is currently as 
follows (IP addresses redacted for clarity):

Address      Status  State   Load      Owns    Token
                                               151236607520417094872610936636341427313
IPLA1        Up      Normal  34.57 GB  11.11%  0
IPMTL1       Up      Normal  34.43 GB  22.22%  37809151880104273718152734159085356828
IPLA2        Up      Normal  17.55 GB  11.11%  56713727820156410577229101238628035242
IPMTL2       Up      Normal  34.56 GB  22.22%  94522879700260684295381835397713392071
IPLA3        Up      Normal  51.37 GB  11.11%  113427455640312821154458202477256070485
IPMTL3       Up      Normal  34.71 GB  22.22%  151236607520417094872610936636341427313

The bump in the 3 MTL nodes (22.22%) is in anticipation of 3 more machines in 
yet another data center, but they're not ready yet to join the cluster.  Once 
that third DC joins all nodes will be at 11.11%. However, I don't think this is 
related.

The problem I'm currently observing is visible in the LA machines, specifically 
IPLA2 and IPLA3.  IPLA2 has 50% the expected volume, and IPLA3 has 150% the 
expected volume.

Putting their load side by side shows the peculiar ratio of 2:1:3 between the 3 
LA nodes:
34.57 17.55 51.37
(the same 2:1:3 ratio is reflected in our internal tools trending reads/second 
and writes/second)

I've tried several iterations of compactions/cleanups to no avail.  In terms of 
config this is the main keyspace:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Options: [DCMTL:2, DCLA:2]
And this is the cassandra-topology.properties file (IPs again redacted for 
clarity):
  IPMTL1:DCMTL:RAC1
  IPMTL2:DCMTL:RAC1
  IPMTL3:DCMTL:RAC1
  IPLA1:DCLA:RAC1
  IPLA2:DCLA:RAC1
  IPLA3:DCLA::RAC1
  IPLON1:DCLON:RAC1
  IPLON2:DCLON:RAC1
  IPLON3:DCLON:RAC1
  # default for unknown nodes
  default=DCBAD:RACBAD


One thing that did occur to me while reading the source code for the 
NetworkTopologyStrategy's calculateNaturalEndpoints is that it prefers placing 
data on different racks.  Since all my machines are defined as in the same 
rack, I believe that the 2-pass approach would still yield balanced placement.

However, just to test, I modified live the topology file to specify that IPLA1, 
IPLA2 and IPLA3 are in 3 different racks, and sure enough I saw immediately 
that the reads/second and writes/second equalized to expected fair volume (I 
quickly reverted that change).

So, it seems somehow related to rack awareness, but I've been racking my brain 
and I can't figure out how/why, or why the three MTL machines are not affected 
the same way.

If the solution is to specify them in different racks and run repair on 
everything, I'm okay with that - but I hate doing that without first 
understanding *why* the current behavior is the way it is.

Any ideas would be hugely appreciated.

Thank you.



Re: Read latency is over 1 minute on a column family with 400,000 rows

2011-07-31 Thread Mina Naguib

Did you run that verbatim, or did you substitute your actual keyspace and 
column family names for "keyspace" and "columnfamily1"?

Also, is there anything in cassandra's log file (system.log)?  Compacting 150GB 
across 2057 SSTables should take a fair bit of time...
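
A quick way to check is to grep the log for compaction activity and errors (a 
sketch; the path assumes a default package install):

  grep -i -E 'compact|error|exception' /var/log/cassandra/system.log | tail -50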


On 2011-07-31, at 11:47 PM, myreasoner wrote:

> Thanks.
> 
> I did *./nodetool -h localhost compact keyspace columnfamily1 *.  But it
> came back really quick and the cfstats doesn't seem change much.
> 
> After compaction:
>Column Family: Fingerprint
>SSTable count: 2057
>Space used (live): 164351343468
>Space used (total): 164742957014
>Memtable Columns Count: 33224
>Memtable Data Size: 22410133
>Memtable Switch Count: 378
>Read Count: 7
>Read Latency: NaN ms.
>Write Count: 30972
>Write Latency: 1.579 ms.
>Pending Tasks: 0
>Key cache capacity: 20
>Key cache size: 8157
>Key cache hit rate: 0.0
>Row cache: disabled
>Compacted row minimum size: 104
>Compacted row maximum size: 315852
>Compacted row mean size: 33846



Re: Cassandra 0.7.8 and 0.8.1 fail when major compaction on 37GB database

2011-07-24 Thread Mina Naguib

From experience with similar-sized data sets, 1.5GB may be too little.  
Recently I bumped our java HEAP limit from 3GB to 4GB to get past an OOM doing 
a major compaction.

Check "nodetool -h localhost info" while the compaction is running for a simple 
view into the memory state.

If you can, also hook in jconsole and you'll get a better view, over time, of 
how cassandra's memory usage trends, the effect of GC, and the pressure of 
various operations such as compactions.
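
If the node genuinely needs more room, the heap is set in conf/cassandra-env.sh 
(a sketch for 0.7/0.8-era configs; only do this if the box has physical RAM to 
spare, and keep both values in sync):

  # conf/cassandra-env.sh - override the auto-calculated heap
  MAX_HEAP_SIZE="4G"
  HEAP_NEWSIZE="400M"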


On 2011-07-24, at 8:08 AM, lebron james wrote:

>   Hi, please help me with my problem. For better performance I turned off 
> compaction and ran massive inserts; after the database reached 37GB I stopped 
> the inserts and started compaction with "NodeTool compaction Keyspace CFamily". 
> After half an hour of work cassandra falls over with an "Out of memory" error. 
> I give 1500M to the JVM, and all parameters in the yaml file are default. Test 
> OSes are ubuntu 11.04 and windows server 2008 DC edition. Thanks!  



Re: Equalizing nodes storage load

2011-07-22 Thread Mina Naguib

Hi Peter

That was precisely it.  Thank you :)

Doing a major compaction on the heaviest node (74.65GB) reduced it to 33.55GB.

I'll compact the other 2 nodes as well.  I anticipate they will also settle 
around that size.


On 2011-07-22, at 5:00 PM, Peter Tillotson wrote:

> I'm not sure if this is the answer, but major compaction on each node
> for each column family. I suspect the data shuffle has left quite a few
> deleted keys which may get cleaned out on major compaction. As I
> remember, major compaction doesn't happen automatically in 0.7.x; I'm not sure
> if it is triggered by repair.
> 
> p
> 
> On 22/07/11 16:08, Mina Naguib wrote:
>> 
>> I'm trying to balance Load ( 41.98GB vs 59.4GB vs 74.65GB )
>> 
>> Owns looks ok. They're all 33.33% which is what I want.  It was calculated 
>> simply by 2^127 / num_nodes.  The only reason the first one doesn't start at 
>> 0 is that I''ve actually carved the ring planning for 9 machines (2 new data 
>> centers of 3 machines each).  However only 1 data center (DCMTL) is 
>> currently up.
>> 
>> 
>> On 2011-07-22, at 10:56 AM, Sasha Dolgy wrote:
>> 
>>> are you trying to balance "load" or "owns" ?  "owns" looks fine ...
>>> 33.33% each ... which to me says balanced.
>>> 
>>> how did you calculate your tokens?
>>> 
>>> 
>>> On Fri, Jul 22, 2011 at 4:37 PM, Mina Naguib
>>>  wrote:
>>>> 
>>>> Address      Status  State   Load      Owns    Token
>>>> xx.xx.x.105  Up      Normal  41.98 GB  33.33%  37809151880104273718152734159085356828
>>>> xx.xx.x.107  Up      Normal  59.4 GB   33.33%  94522879700260684295381835397713392071
>>>> xx.xx.x.18   Up      Normal  74.65 GB  33.33%  151236607520417094872610936636341427313
>> 
>> 
> 



Re: Equalizing nodes storage load

2011-07-22 Thread Mina Naguib

I'm trying to balance Load ( 41.98GB vs 59.4GB vs 74.65GB )

Owns looks ok. They're all 33.33% which is what I want.  It was calculated 
simply by 2^127 / num_nodes.  The only reason the first one doesn't start at 0 
is that I've actually carved the ring planning for 9 machines (2 new data 
centers of 3 machines each).  However only 1 data center (DCMTL) is currently 
up.


On 2011-07-22, at 10:56 AM, Sasha Dolgy wrote:

> are you trying to balance "load" or "owns" ?  "owns" looks fine ...
> 33.33% each ... which to me says balanced.
> 
> how did you calculate your tokens?
> 
> 
> On Fri, Jul 22, 2011 at 4:37 PM, Mina Naguib
>  wrote:
>> 
>> Address Status State   LoadOwnsToken
>> xx.xx.x.105 Up Normal  41.98 GB33.33%  
>> 37809151880104273718152734159085356828
>> xx.xx.x.107 Up Normal  59.4 GB 33.33%  
>> 94522879700260684295381835397713392071
>> xx.xx.x.18  Up Normal  74.65 GB33.33%  
>> 151236607520417094872610936636341427313



Equalizing nodes storage load

2011-07-22 Thread Mina Naguib

Hi everyone

I've been struggling trying to get the data volume ("load") to equalize across 
a balanced cluster, and I'm not sure what else I can try.

Background: This was originally a 5-node cluster.  We re-balanced the 3 faster 
machines across the ring, and decommissioned the 2 older ones.  We also 
upgraded cassandra a few times from 0.7.4 through 0.7.5, 0.7.6-2 to 0.7.7.  The 
ring currently looks like so:

Address      Status  State   Load      Owns    Token
                                               151236607520417094872610936636341427313
xx.xx.x.105  Up      Normal  41.98 GB  33.33%  37809151880104273718152734159085356828
xx.xx.x.107  Up      Normal  59.4 GB   33.33%  94522879700260684295381835397713392071
xx.xx.x.18   Up      Normal  74.65 GB  33.33%  151236607520417094872610936636341427313

What I've tried so far:
1. Running repair on each node (sequentially of course).
2. Running cleanup on the largest node (.18) hoping it would shed 
unneeded data

The repairs helped a bit by slightly bumping up the load of the first 2 
machines, but the cleanup on the 3rd failed to reduce its data volume.

So, at this point, I'm out of ideas.  In terms of tpstats metrics, each of the 
3 nodes is serving roughly the same volume of ReadStage and MutationStage, so 
they're balanced in that respect.  However I'm concerned about the imbalance of 
the data load ( 24% / 34% / 42% ) and being unable to equalize it.

For the record, there's only 1 keyspace of meaningful data in the cluster, with 
the following schema settings:
Keyspace: ZZ:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Options: [DCMTL:2]
  Column Families:
ColumnFamily: AA
  default_validation_class: org.apache.cassandra.db.marshal.UTF8Type
  Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period in seconds: 256000.0/0
  Key cache size / save period in seconds: 20.0/14400
  Memtable thresholds: 0.88125/1440/188 (millions of ops/minutes/MB)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.1
  Built indexes: []
ColumnFamily: B (Super)
  default_validation_class: org.apache.cassandra.db.marshal.UTF8Type
  Columns sorted by: 
org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.UTF8Type
  Row cache size / save period in seconds: 75000.0/0
  Key cache size / save period in seconds: 20.0/14400
  Memtable thresholds: 0.88125/1440/188 (millions of ops/minutes/MB)
  GC grace seconds: 864000
  Compaction min/max thresholds: 4/32
  Read repair chance: 0.25
  Built indexes: []

Any tips or ideas to help get the nodes' load equalized would be highly 
appreciated.  If this is normal behaviour and I shouldn't be trying too hard to 
get it equalized, I'd appreciate any notes/links explaining why.

Thank you.