If you have a workload with overwrites you will end up with some data needing compaction. Running a nightly manual compaction would remove this, but it will also soak up some IO so it may not be the best solution.
I do not know if Leveled compaction would result in a smaller disk load for the same workload.

I agree with other people, turn on compression.

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/04/2012, at 9:19 AM, Yiming Sun wrote:

> Yup Jeremiah, I learned a hard lesson on how cassandra behaves when it runs
> out of disk space :-S. I didn't try the compression, but when it ran out
> of disk space, or near running out, compaction would fail because it needs
> space to create some tmp data files.
>
> I shall get a tattoo that says keep it around 50% -- this is a valuable tip.
>
> -- Y.
>
> On Sun, Apr 1, 2012 at 11:25 PM, Jeremiah Jordan
> <jeremiah.jor...@morningstar.com> wrote:
> Is that 80% with compression? If not, the first thing to do is turn on
> compression. Cassandra doesn't behave well when it runs out of disk space.
> You really want to try and stay around 50%; 60-70% works, but only if it is
> spread across multiple column families, and even then you can run into issues
> when doing repairs.
>
> -Jeremiah
>
> On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:
>
>> Thanks Aaron. Well I guess it is possible the data files from supercolumns
>> could've been reduced in size after compaction.
>>
>> This brings yet another question. Say I am on a shoestring budget and can
>> only put together a cluster with very limited storage space. The first
>> iteration of pushing data into cassandra would drive the disk usage up into
>> the 80% range. As time goes by, there will be updates to the data, and many
>> columns will be overwritten. If I just push the updates in, the disks will
>> run out of space on all of the cluster nodes. What would be the best way to
>> handle such a situation if I cannot buy larger disks? Do I need to delete
>> the rows/columns that are going to be updated, do a compaction, and then
>> insert the updates? Or is there a better way? Thanks
>>
>> -- Y.
>>
>> On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aa...@thelastpickle.com>
>> wrote:
>>> does cassandra 1.0 perform some default compression?
>> No.
>>
>> The on disk size depends to some degree on the work load.
>>
>> If there are a lot of overwrites or deletes you may have rows/columns that
>> need to be compacted. You may have some big old SSTables that have not been
>> compacted for a while.
>>
>> There is some overhead involved in the super columns: the super col name,
>> length of the name and the number of columns.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>>
>>> Actually, after I read an article on cassandra 1.0 compression just now
>>> (http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I
>>> am more puzzled. In our schema, we didn't specify any compression options
>>> -- does cassandra 1.0 perform some default compression? or is the data
>>> reduction purely because of the schema change? Thanks.
>>>
>>> -- Y.
>>>
>>> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming....@gmail.com> wrote:
>>> Hi,
>>>
>>> We are trying to estimate the amount of storage we need for a production
>>> cassandra cluster. While I was doing the calculation, I noticed a very
>>> dramatic difference in terms of storage space used by cassandra data files.
>>>
>>> Our previous setup consists of a single-node cassandra 0.8.x with no
>>> replication, and the data is stored using supercolumns, and the data files
>>> total about 534GB on disk.
>>>
>>> A few weeks ago, I put together a cluster consisting of 3 nodes running
>>> cassandra 1.0 with replication factor of 2, and the data is flattened out
>>> and stored using regular columns. And the aggregated data file size is
>>> only 488GB (would be 244GB if no replication).
>>>
>>> This is a very dramatic reduction in terms of storage needs, and is
>>> certainly good news in terms of how much storage we need to provision.
>>> However, because of the dramatic reduction, I also would like to make sure
>>> it is absolutely correct before submitting it - and also get a sense of why
>>> there was such a difference. -- I know cassandra 1.0 does data compression,
>>> but does the schema change from supercolumn to regular column also help
>>> reduce storage usage? Thanks.
>>>
>>> -- Y.
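
The storage figures quoted in this thread can be checked with a little arithmetic. The following is only a rough sketch and not something any of the posters wrote: the function names are made up for illustration, the per-node figure assumes data is spread evenly across the 3 nodes (an assumption, real clusters may be unbalanced), and the 50% target is simply Jeremiah's rule of thumb from above.

```python
def one_copy_size_gb(total_on_disk_gb, replication_factor):
    """Size of a single copy of the data, given the cluster-wide total."""
    return total_on_disk_gb / replication_factor


def disk_needed_gb(data_gb, target_utilization=0.5):
    """Disk capacity needed so that data_gb fills only target_utilization of it."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return data_gb / target_utilization


# Figures taken from the thread.
old_single_node_gb = 534    # cassandra 0.8.x, supercolumns, no replication
new_cluster_total_gb = 488  # cassandra 1.0, regular columns, RF = 2
replication_factor = 2
node_count = 3

# One copy of the data in the new cluster (244 GB, as stated in the thread).
one_copy_gb = one_copy_size_gb(new_cluster_total_gb, replication_factor)

# Fractional reduction relative to the old single-node layout.
reduction = 1 - one_copy_gb / old_single_node_gb

# ASSUMPTION: data is evenly balanced across nodes.
per_node_data_gb = new_cluster_total_gb / node_count

# Disk per node needed to stay around 50% full, leaving room for compaction.
per_node_disk_gb = disk_needed_gb(per_node_data_gb, target_utilization=0.5)

print(f"one copy: {one_copy_gb:.0f} GB, reduction: {reduction:.1%}")
print(f"per node: {per_node_data_gb:.0f} GB data, "
      f"{per_node_disk_gb:.0f} GB disk for 50% utilization")
```

Under those assumptions, a node that starts out at 80% full has much less headroom than the 50% guideline leaves, which is why compaction was failing for lack of temporary space.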