If you have a workload with overwrites you will end up with some data that 
needs compaction. Running a nightly manual compaction would remove it, but it 
would also soak up some IO, so it may not be the best solution. 
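
If you do go the manual route it is just nodetool, something like this 
(keyspace and column family names are only placeholders):

    # run off peak; a major compaction needs free disk to write the new SSTables
    nodetool -h localhost compact MyKeyspace MyColumnFamily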

I do not know whether Leveled compaction would result in a smaller disk 
footprint for the same workload. 
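
If you want to experiment with it, it can be switched per column family from 
cassandra-cli with something like the following (the column family name is a 
placeholder, and check the exact syntax against your version):

    update column family MyColumnFamily
      with compaction_strategy = 'LeveledCompactionStrategy';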

I agree with the others: turn on compression. 

Cheers

-----------------
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 3/04/2012, at 9:19 AM, Yiming Sun wrote:

> Yup Jeremiah, I learned a hard lesson about how cassandra behaves when it runs 
> out of disk space :-S.    I didn't try compression, but when it ran out of 
> disk space, or got close to running out, compaction would fail because it 
> needs space to create temporary data files.
> 
> I shall get a tattoo that says keep it around 50% -- this is a valuable tip.
> 
> -- Y.
> 
> On Sun, Apr 1, 2012 at 11:25 PM, Jeremiah Jordan 
> <jeremiah.jor...@morningstar.com> wrote:
> Is that 80% with compression?  If not, the first thing to do is turn on 
> compression.  Cassandra doesn't behave well when it runs out of disk space.  
> You really want to try to stay around 50%; 60-70% works, but only if the data 
> is spread across multiple column families, and even then you can run into 
> issues when doing repairs.
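> 
> For reference, compression in 1.0 is set per column family and can be turned 
> on from cassandra-cli with something like this (the column family name is 
> just a placeholder):
> 
>     update column family MyColumnFamily
>       with compression_options = {sstable_compression: SnappyCompressor,
>                                   chunk_length_kb: 64};
> 
> Existing SSTables only pick up compression as they are rewritten (by 
> compaction, or by forcing a rewrite with nodetool scrub), so the savings 
> show up gradually.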
> 
> -Jeremiah
> 
> 
> 
> On Apr 1, 2012, at 9:44 PM, Yiming Sun wrote:
> 
>> Thanks Aaron.  Well, I guess it is possible the data files from supercolumns 
>> could have been reduced in size after compaction.
>> 
>> This brings up another question.  Say I am on a shoestring budget and can 
>> only put together a cluster with very limited storage space.  The first 
>> iteration of pushing data into cassandra would drive the disk usage up into 
>> the 80% range.  As time goes by, there will be updates to the data, and many 
>> columns will be overwritten.  If I just push the updates in, the disks will 
>> run out of space on all of the cluster nodes.  What would be the best way to 
>> handle such a situation if I cannot buy larger disks?  Do I need to delete 
>> the rows/columns that are going to be updated, do a compaction, and then 
>> insert the updates?  Or is there a better way?  Thanks
>> 
>> -- Y.
>> 
>> On Sat, Mar 31, 2012 at 3:28 AM, aaron morton <aa...@thelastpickle.com> 
>> wrote:
>>> does cassandra 1.0 perform some default compression? 
>> No. 
>> 
>> The on-disk size depends to some degree on the workload. 
>> 
>> If there are a lot of overwrites or deletes you may have rows/columns that 
>> need to be compacted. You may also have some big old SSTables that have not 
>> been compacted for a while. 
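>> 
>> Something like nodetool cfstats (or just looking at the data directory) will 
>> show how many SSTables each column family currently has, e.g.
>> 
>>     nodetool -h localhost cfstats
>>     ls -lh /var/lib/cassandra/data/MyKeyspace/
>> 
>> (the host, path and keyspace name above are only examples).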
>> 
>> There is some overhead involved in super columns: each one stores the super 
>> column name, the length of the name and the number of sub columns.  
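>> 
>> For illustration (the names here are made up), each super column in a row 
>> carries that extra header, i.e. alongside
>> 
>>     set MyCF['row1']['superA']['col1'] = 'some value';
>> 
>> the SSTable also records 'superA', its length and its sub column count, 
>> which a flattened layout along the lines of
>> 
>>     set MyCF['row1']['superA:col1'] = 'some value';
>> 
>> does not need (though the composite column names themselves get longer, and 
>> how your data was actually flattened may differ).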
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 29/03/2012, at 9:47 AM, Yiming Sun wrote:
>> 
>>> Actually, after I read an article on cassandra 1.0 compression just now ( 
>>> http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression), I 
>>> am even more puzzled.  In our schema we didn't specify any compression 
>>> options -- does cassandra 1.0 perform some default compression?  Or is the 
>>> data reduction purely because of the schema change?  Thanks.
>>> 
>>> -- Y.
>>> 
>>> On Wed, Mar 28, 2012 at 4:40 PM, Yiming Sun <yiming....@gmail.com> wrote:
>>> Hi,
>>> 
>>> We are trying to estimate the amount of storage we need for a production 
>>> cassandra cluster.  While I was doing the calculation, I noticed a very 
>>> dramatic difference in terms of storage space used by cassandra data files.
>>> 
>>> Our previous setup consisted of a single-node cassandra 0.8.x with no 
>>> replication; the data was stored using supercolumns, and the data files 
>>> totaled about 534GB on disk.
>>> 
>>> A few weeks ago, I put together a cluster of 3 nodes running cassandra 1.0 
>>> with a replication factor of 2, and the data is flattened out and stored 
>>> using regular columns.  The aggregated data file size is only 488GB (which 
>>> would be about 244GB with no replication).
>>> 
>>> This is a very dramatic reduction in storage needs, and certainly good news 
>>> for how much storage we need to provision.  However, because of the size of 
>>> the reduction, I would like to make sure the number is correct before 
>>> submitting it, and also to get a sense of why there is such a difference.  
>>> I know cassandra 1.0 supports data compression, but does the schema change 
>>> from supercolumns to regular columns also help reduce storage usage?  
>>> Thanks.
>>> 
>>> -- Y.
>>> 
>> 
>> 
> 
> 
