This is true, but for larger installations I end up needing more servers to hold the disks, and more racks to hold the servers, to the point where the overall cost per GB climbs (granted, the cost per IOP is probably still good). AIUI, a chunk of that 50% is replicated data, such that the truly available space in the cluster is lower than 50% when capacity planning? If so, for some workloads where it's just data pouring in with very few updates, that would have me thinking I'd want a tiered model, archiving cold data onto a filer/HDFS.
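To make the arithmetic behind my question concrete, here's a rough back-of-the-envelope sketch (my own illustration, not anything from the thread or from Cassandra itself; the function name and example numbers are made up). It just combines the "keep nodes under 50%" rule from Tyler's reply below with a replication factor:

    # Rough capacity sketch -- my own illustration, not from Cassandra.
    # Combines the "stay under 50% per node" headroom rule with a
    # replication factor, to estimate how much *unique* data fits.
    def usable_capacity_tb(nodes, disk_per_node_tb, replication_factor,
                           headroom=0.5):
        """Unique data the cluster can hold once you leave room for the
        temporary disk doubling (compaction/repair/cleanup) and account
        for each row being stored replication_factor times."""
        raw = nodes * disk_per_node_tb
        under_headroom = raw * headroom       # keep every node < 50% full
        return under_headroom / replication_factor

    # e.g. 10 nodes x 2 TB each at RF=3:
    # 20 TB raw -> 10 TB under headroom -> ~3.3 TB of unique data
    print(usable_capacity_tb(10, 2.0, 3))

If that's roughly right, the effective cost per GB of unique data is raw cost times 2 times RF, which is what drives my tiering question.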
Bill

On Thu, 2010-12-09 at 13:26 -0600, Tyler Hobbs wrote:
> That depends on your scenario. In the worst case of one big CF,
> there's not much that can be easily done for the disk usage of
> compaction and cleanup (which is essentially compaction).
>
> If, instead, you have several column families and no single CF makes
> up the majority of your data, you can push your disk usage a bit
> higher.
>
> A fundamental idea behind Cassandra's architecture is that disk space
> is cheap (which, indeed, it is). If you are particularly sensitive to
> this, Cassandra might not be the best solution to your problem. Also
> keep in mind that Cassandra performs well with average disks, so you
> don't need to spend a lot there. Additionally, most people find that
> the replication protects their data enough to allow them to use RAID 0
> instead of 1, 10, 5, or 6.
>
> - Tyler
>
> On Thu, Dec 9, 2010 at 12:20 PM, Rustam Aliyev <rus...@code.az> wrote:
> > Are there any plans to improve this in the future?
> >
> > For big data clusters this could be very expensive. Based on
> > your comment, I will need 200TB of storage for 100TB of data
> > to keep Cassandra running.
> >
> > --
> > Rustam.
> >
> > On 09/12/2010 17:56, Tyler Hobbs wrote:
> > > If you are on 0.6, repair is particularly dangerous with
> > > respect to disk space usage. If your replica is
> > > sufficiently out of sync, you can triple your disk usage
> > > pretty easily. This has been improved in 0.7, so repairs
> > > should use about half as much disk space, on average.
> > >
> > > In general, yes, keep your nodes under 50% disk usage at all
> > > times. Any of: compaction, cleanup, snapshotting, repair,
> > > or bootstrapping (the latter two are improved in 0.7) can
> > > double your disk usage temporarily.
> > >
> > > You should plan to add more disk space or add nodes when you
> > > get close to this limit. Once you go over 50%, it's more
> > > difficult to add nodes, at least in 0.6.
> > >
> > > - Tyler
> > >
> > > On Thu, Dec 9, 2010 at 11:19 AM, Mark <static.void....@gmail.com> wrote:
> > > > I recently ran into a problem during a repair
> > > > operation where my nodes completely ran out of space
> > > > and my whole cluster was... well, clusterfucked.
> > > >
> > > > I want to make sure I know how to prevent this problem in
> > > > the future.
> > > >
> > > > Should I make sure that at all times every node is
> > > > under 50% of its disk space? Are there any normal
> > > > day-to-day operations that would cause any one
> > > > node to double in size that I should be aware of? If
> > > > one or more nodes surpass the 50% mark, what
> > > > should I plan to do?
> > > >
> > > > Thanks for any advice