understanding memory footprint

2013-08-12 Thread Paul Ingalls
I'm trying to get a handle on how newer Cassandra versions handle memory.  Most of what
I'm finding via Google, on the wiki, etc. appears old.  For example, this wiki
article appears out of date relative to post-1.0 releases:

http://wiki.apache.org/cassandra/MemtableThresholds

Specifically, this is the section I'm looking at:

 For a rough rule of thumb, Cassandra's internal datastructures will require 
about memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches.

Is this still true?  I thought memtable_throughput_in_mb was deprecated.  Is
there an equivalent calculation/rule of thumb for Cassandra post 1.0?

At the core, my question really is:

"Does the number of column families still significantly impact the memory 
footprint? If so, what is the incremental cost of a column family/table?"

and the available docs aren't helping me answer it…

Thanks!

Paul

Re: understanding memory footprint

2013-08-12 Thread Robert Coli
On Mon, Aug 12, 2013 at 10:22 AM, Paul Ingalls wrote:

> At the core, my question really is:
>
> "Does the number of column families still significantly impact the memory
> footprint? If so, what is the incremental cost of a column family/table?"
>

This question has been asked about a kabillion times.

Someone with more time than me to mess around with Java heap dumps should
design a test case and publish their findings.

Even columnfamilies that take no writes consume memory via their MBeans,
etc...

=Rob


Re: understanding memory footprint

2013-08-12 Thread Paul Ingalls
I don't really need exact numbers, just a rough cost would be sufficient.  I'm 
running into memory problems on my cluster, and I'm trying to decide if 
reducing the number of column families would be worth the effort.  Looking at 
the rule of thumb from the wiki entry made it seem like reducing the number of 
tables would make a big impact, but I'm running 1.2.8 so not sure if it is 
still true.

Is there a new rule of thumb?

Re: understanding memory footprint

2013-08-12 Thread Robert Coli
On Mon, Aug 12, 2013 at 11:14 AM, Paul Ingalls wrote:

> I don't really need exact numbers, just a rough cost would be sufficient.
>  I'm running into memory problems on my cluster, and I'm trying to decide
> if reducing the number of column families would be worth the effort.
>  Looking at the rule of thumb from the wiki entry made it seem like
> reducing the number of tables would make a big impact, but I'm running
> 1.2.8 so not sure if it is still true.
>
> Is there a new rule of thumb?
>

If you want a cheap/quick measure of how much space partially full
memtables are taking, just nodetool flush and check heap usage before and
after?
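In practice that's something like the following (a rough sketch; the used-heap
figure may not drop until the JVM actually GCs the flushed memtables, and the
exact nodetool output labels can vary a bit by version):

  nodetool info | grep "Heap Memory"    # note used/total heap before
  nodetool flush
  nodetool info | grep "Heap Memory"    # compare used heap after the flush (and a GC)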

If you want a cheap/quick measure of how much space empty sstables take in
heap, I think you're out of luck.

=Rob


Re: understanding memory footprint

2013-08-13 Thread Alain RODRIGUEZ
If you are using 1.2.*, Bloom filters are in native memory, so they are not
pressuring your heap. How much data do you have per node? If that value is big,
you have index samples in the heap consuming a lot of memory, for sure, and
growing as your data per node grows.

Solutions: increase the heap if it is < 8GB, and/or reduce the index sampling by
raising index_interval from 128 to a bigger value (256 - 512), and/or wait for
2.0.*, which, off the top of my head, should move the sampling into native
memory, allowing heap size to be independent of the data size per node.
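For example (a sketch only, assuming a stock 1.2-era install; file locations may
differ on your setup):

  # conf/cassandra.yaml
  index_interval: 256        # default is 128; larger values keep fewer index samples on heap

  # conf/cassandra-env.sh
  MAX_HEAP_SIZE="8G"         # only if the machine actually has the RAM to spare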

This should alleviate things. Yet these are only guesses since I know
almost nothing about your cluster...

Hope this helps somehow.


Re: understanding memory footprint

2013-08-14 Thread Aaron Morton
> "Does the number of column families still significantly impact the memory 
> footprint? If so, what is the incremental cost of a column family/table?"
IMHO there would be little difference in memory use between a node with zero data
that had 10 CF's and one that had 100 CF's. When you start putting data in, the
story changes.

As Alain said, the number of rows can impact the memory use. In 1.2+ that's
less of an issue, but the index samples are still on heap. In my experience, with
a normal heap (4GB to 8GB), this is not an issue until you get into 500+ million
rows.

The number of CF's is still used when calculating when to flush to disk. If you
have 100 CF's, the server will flush to disk more frequently than if you have
10, because it needs to leave more room for the memtables to grow.
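If it helps, the knob behind that behaviour (as far as I recall) is the shared
memtable budget in cassandra.yaml, which all the CF's on a node split between them:

  # conf/cassandra.yaml (1.2.x)
  memtable_total_space_in_mb: 1024    # left blank it defaults to one third of the heap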

The best way to get help on this is to provide details on the memory settings, the
number of CF's, the total number of rows, and the cache settings.

Hope that helps. 
 
-
Aaron Morton
Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com


Re: understanding memory footprint

2013-08-15 Thread Janne Jalkanen

Also, if you are using leveled compaction, remember that each SSTable will take
a couple of MB of heap space.  You can tune this by choosing a good
sstable_size_in_mb value for those CFs which are on LCS and contain lots of
data.  The default is 5 MB, which is inadequate for many cases, so most people
seem to be happy running with sizes from 64 MB and up.  The right size
for you will most probably vary.
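It is just a per-table compaction option, so changing it looks something like
this (keyspace/table names made up; existing SSTables keep their old size until
they are recompacted):

  ALTER TABLE my_keyspace.my_table
    WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 128};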

/Janne


Re: understanding memory footprint

2013-08-15 Thread Robert Coli
On Thu, Aug 15, 2013 at 6:58 AM, Janne Jalkanen wrote:

>
> Also, if you are using leveled compaction, remember that each SSTable will
> take a couple of MB of heap space.  You can tune this by choosing a good
> sstable_size_in_mb value for those CFs which are on LCS and contain lots of
> data.  Default is 5 MB, which is for many cases inadequate, so most people
> seem to be happy running with sizes that range from 64 MB and up.  The
> right size for you will most probably vary.
>

The 2.0-era default is 160 MB.

https://issues.apache.org/jira/browse/CASSANDRA-5727

=Rob


Re: understanding memory footprint

2013-08-15 Thread Paul Ingalls
Hey Aaron,

I went ahead and changed the model around to reduce the number of CF's from
around 60 or so to 7, but I'm still running into OOM messages and eventual node
crashes after I've pushed in about 30GB of data per node.  And it seems that,
under load, once one node goes down, the others seem to follow within a few
minutes.  It's like the cluster just hits a wall.

Some details from my cluster are below; as you can see, most settings are at
their defaults. Let me know if you need more data.

5 nodes running 1.2.8 on 4-CPU VMs with 7GB RAM
750GB RAID 0 disk

num_tokens = 256
MAX_HEAP_SIZE = 3G
HEAP_NEW_SIZE = 300M
key_cache_size_in_mb is empty, so using the default
row cache is disabled
commitlog_sync_period_in_ms:1
commitlog_segment_size_in_mb: 32
flush_largest_memtables_at: 0.75
reduce_cache_sizes_at: 0.85
reduce_cache_capacity_to: 0.6
concurrent_reads: 32
concurrent_writes: 32
commitlog_total_space_in_mb: 768 - I reduced a bit
memtable_flush_queue_size: 5
rpc_server_type: hsha
column_index_size_in_kb: 64
in_memory_compaction_limit_in_mb: 64
compaction_throughput_mb_per_sec: 16

One of my tables creates a lot of large rows; would it make sense to change the
partition key to break them into more, smaller rows?
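Roughly, I mean something like this (a sketch only, all names made up), adding a
time bucket to the partition key so one big logical row becomes several smaller
ones:

  CREATE TABLE user_events (
      user_id   text,
      day       text,        -- hypothetical bucket column, e.g. '2013-08-15'
      event_id  timeuuid,
      payload   text,
      PRIMARY KEY ((user_id, day), event_id)   -- composite partition key splits a user across partitions
  );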

Thanks for the help!

Paul

