On Wed, Jun 8, 2011 at 12:19 AM, AJ <a...@dude.podzone.net> wrote:
> On 6/7/2011 9:32 PM, Edward Capriolo wrote:
> <snip>
>> I do not like large disk set-ups. I think they end up not being
>> economical. Most low-latency use cases want a high RAM-to-disk ratio.
>> Two machines with 32GB RAM are usually less expensive than one machine
>> with 64GB RAM.
>>
>> For a machine with 1TB drives (or multiple 1TB drives) it is going to be
>> difficult to get enough RAM to help with random read patterns.
>>
>> Also, cluster operations like joining, decommissioning, or repair can
>> take a *VERY* long time, maybe a day. More, smaller servers (blade
>> style) are more agile.
>>
> Is there some rule of thumb as to how much RAM is needed per GB of data?
> I know it probably "depends", but if you could try to explain the best
> you can, that would be great! I too am projecting "big data" requirements.
The way this is normally explained is the active set. I.e., you have
100,000,000 users, but at any given time only 1,000,000 are active, so you
need enough RAM to keep those users cached. No, there is no rule of thumb;
it depends on access patterns. In the most extreme case you are using
Cassandra for an ETL workload. In that case your data will far exceed your
RAM, and since most operations will be like a "full table scan", caching is
almost hopeless and useless. On the other side, there are those who want
every lookup to be predictable low latency with totally random reads, and
they might want to maintain a 1:1 RAM-to-data ratio.

I would track these things over time:
- reads/writes to c*
- disk utilization
- size of CF on disk
- cache hit rate
- latency

And eventually you find what your ratio is. I.e.:

last month:
  I had 30 reads/sec
  my disk was 40% utilized
  my column family was 40 GB
  my cache hit rate was 70%
  my latency was 1ms

this month:
  I had 45 reads/sec
  my disk was 95% utilized
  my column family was 40 GB
  my cache hit rate was 30%
  my latency was 5ms

Conclusion: my disk is maxed out and my cache hit rate is dropping. I
probably need more nodes or more RAM.
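The month-over-month comparison above can be sketched as a small script.
This is a minimal illustration, not Cassandra tooling: the snapshot fields,
thresholds, and the `needs_capacity` helper are all hypothetical names I am
assuming here; you would feed it numbers gathered from nodetool, iostat,
and your own latency monitoring.

```python
# Hypothetical sketch of the month-over-month capacity check described
# above. Field names and thresholds are illustrative assumptions, not a
# Cassandra API.

def needs_capacity(prev, curr, disk_util_limit=0.90, min_hit_rate=0.50):
    """Compare two monthly metric snapshots and list capacity concerns."""
    reasons = []
    if curr["disk_util"] >= disk_util_limit:
        reasons.append("disk near saturation")
    if (curr["cache_hit_rate"] < min_hit_rate
            and curr["cache_hit_rate"] < prev["cache_hit_rate"]):
        reasons.append("cache hit rate dropping")
    if curr["latency_ms"] > 2 * prev["latency_ms"]:
        reasons.append("latency regression")
    return reasons

# Numbers taken from the example in the text.
last_month = {"reads_per_sec": 30, "disk_util": 0.40, "cf_gb": 40,
              "cache_hit_rate": 0.70, "latency_ms": 1.0}
this_month = {"reads_per_sec": 45, "disk_util": 0.95, "cf_gb": 40,
              "cache_hit_rate": 0.30, "latency_ms": 5.0}

print(needs_capacity(last_month, this_month))
# ['disk near saturation', 'cache hit rate dropping', 'latency regression']
```

When all three flags fire at once while the CF size is flat (40 GB both
months), the bottleneck is the read path rather than data growth, which is
what points at "more nodes or more RAM" rather than more disk.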