> On a node with 300M rows (a small node) there will be 585937 index sample
> entries with 512 sampling. Let's say 100 bytes per entry: this will be 585
> MB; bloom filters are 884 MB. With the default sampling of 128, sampled
> entries will use the majority of node memory. Index sampling should be
> reworked like bloom filters to avoid allocating one large array per
> sstable. Hadoop's MapFile uses a sampling of 128 by default too, and it
> reads the entire MapFile index into memory.
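To put the quoted numbers in perspective, here is a rough back-of-envelope sketch (plain Java, not Cassandra code; the ~100 bytes per entry is just the placeholder figure from the quote, and as explained below the actual per-entry cost on the heap will be higher):

// Rough sketch: index sample count and a flat bytes-per-entry estimate
// for a given row count and sampling interval. Not Cassandra code; the
// 100 B/entry figure is the assumption carried over from the quote above.
// Real heap cost per sample is higher (object header, references, the key
// bytes, the position), which is the object-overhead point made below.
public class IndexSampleEstimate {
    static long sampleCount(long rows, int indexInterval) {
        return rows / indexInterval;
    }

    static double estimatedMiB(long rows, int indexInterval, int bytesPerEntry) {
        return sampleCount(rows, indexInterval) * (double) bytesPerEntry / (1024 * 1024);
    }

    public static void main(String[] args) {
        long rows = 300_000_000L;        // the "small node" from the quote
        int assumedBytesPerEntry = 100;  // assumption taken from the quote
        for (int interval : new int[] { 512, 128 }) {
            System.out.printf("interval %d: %,d samples, ~%.0f MiB at %d B/entry%n",
                    interval, sampleCount(rows, interval),
                    estimatedMiB(rows, interval, assumedBytesPerEntry),
                    assumedBytesPerEntry);
        }
    }
}

At the default interval of 128 that is roughly 2.3 million samples before any per-object overhead is counted, which is what the point below about the index summary's ArrayList is getting at.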
The index summary does have an ArrayList, which is backed by an array that could become large; however, larger than that array (which amounts to one object reference per sample, or 1-2 taking into account internal growth of the ArrayList) will be the overhead of the objects in the array (regular Java objects). This is also why it is non-trivial to report on the data size.

> It should be clearly documented in
> http://wiki.apache.org/cassandra/LargeDataSetConsiderations that bloom
> filters + index sampling will be responsible for most of the memory used
> by a node. Caching itself has minimal use on a large data set used for
> OLAP.

I added some information at the end.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)