Thanks Aaron.  Well, I guess it is possible the data files from supercolumns
could have been reduced in size after compaction.

This brings up yet another question.  Say I am on a shoestring budget and can
only put together a cluster with very limited storage space.  The first
iteration of pushing data into Cassandra would drive the disk usage up into
the 80% range.  As time goes by, there will be updates to the data, and
many columns will be overwritten.  If I just push the updates in, the disks
will run out of space on all of the cluster nodes.  What would be the best
way to handle such a situation if I cannot afford to buy larger disks?  Do I
need to delete the rows/columns that are going to be updated, run a
compaction, and then insert the updates?  Or is there a better way?
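For concreteness, here is roughly what I had in mind for the update path, as
a minimal pycassa sketch (the keyspace, column family, row key, and column
names below are just made-up placeholders, not our real schema):

    import pycassa

    # Connect over Thrift to one of the nodes (placeholder keyspace/host).
    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
    docs = pycassa.ColumnFamily(pool, 'Documents')

    # Initial load: write the original column value.
    docs.insert('doc-001', {'body': 'original text'})

    # Later update: inserting the same row key / column name again simply
    # shadows the old value; no explicit delete beforehand.
    docs.insert('doc-001', {'body': 'revised text'})

    pool.dispose()

My understanding is that the space held by the shadowed values is only
reclaimed once the SSTables containing both versions get merged, e.g. after a
manually triggered "nodetool compact MyKeyspace Documents", and with the disks
already around 80% full I am not sure there is enough headroom for that
compaction to run.  Please correct me if I have that wrong.  Thanks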

-- Y.

On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aa...@thelastpickle.com> wrote:

> does cassandra 1.0 perform some default compression?
>
> No.
>
> The on disk size depends to some degree on the work load.
>
> If there are a lot of overwrites or deletes you may have rows/columns that
> need to be compacted. You may have some big old SSTables that have not been
> compacted for a while.
>
> There is some overhead involved in the super columns: the super col name,
> length of the name and the number of columns.
>
> Cheers
>
>   -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>
> Actually, after I read an article on cassandra 1.0 compression just now (
> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression),
> I am more puzzled.  In our schema, we didn't specify any compression
> options -- does cassandra 1.0 perform some default compression? or is the
> data reduction purely because of the schema change?  Thanks.
>
> -- Y.
>
> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming....@gmail.com> wrote:
>
>> Hi,
>>
>> We are trying to estimate the amount of storage we need for a production
>> cassandra cluster.  While I was doing the calculation, I noticed a very
>> dramatic difference in terms of storage space used by cassandra data files.
>>
>> Our previous setup consisted of a single-node cassandra 0.8.x with no
>> replication, the data was stored using supercolumns, and the data files
>> totaled about 534GB on disk.
>>
>> A few weeks ago, I put together a cluster consisting of 3 nodes running
>> cassandra 1.0 with a replication factor of 2, and the data is flattened out
>> and stored using regular columns.  The aggregated data file size is
>> only 488GB (it would be 244GB with no replication).
>>
>> This is a very dramatic reduction in storage needs, and is certainly good
>> news in terms of how much storage we need to provision.  However, because
>> the reduction is so dramatic, I would like to make sure it is correct
>> before submitting the estimate, and also to get a sense of why there was
>> such a difference.  I know cassandra 1.0 does data compression, but does
>> the schema change from supercolumns to regular columns also help reduce
>> storage usage?  Thanks.
>>
>> -- Y.
>>
>
>
>
