Re: Data overhead discussion in Cassandra

2011-07-17 Thread aaron morton
What RF (replication factor) are you using?

On disk, each column has 15 bytes of overhead, plus the column name and the 
column value. So for an 8-byte long name and an 8-byte double value there will 
be 16 bytes of data and 15 bytes of overhead per column. 

The index file also contains the row key, the MD5 token (for the 
RandomPartitioner) and the row offset into the data file. 
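
To put rough numbers on that, here is a quick back-of-the-envelope sketch in 
Python using only the per-column figures above. It ignores row-level overhead, 
the index and bloom filter files, and temporary compaction space, so treat it 
as a lower bound rather than an exact prediction:

    # Rough on-disk estimate for the data files, per replica, based on the
    # per-column numbers above. Row headers, index files, bloom filters and
    # temporary compaction space are not included.
    COLUMN_OVERHEAD = 15        # bytes of per-column overhead on disk
    NAME_BYTES = 8              # column name is a long
    VALUE_BYTES = 8             # column value is a double
    COLUMNS_PER_ROW = 35040
    ROWS = 1000000

    bytes_per_column = COLUMN_OVERHEAD + NAME_BYTES + VALUE_BYTES   # 31 bytes
    bytes_per_row = bytes_per_column * COLUMNS_PER_ROW              # ~1.09 MB
    per_replica = bytes_per_row * ROWS                              # ~1.09 TB

    # Multiply by the replication factor to get cluster-wide disk usage.
    for rf in (1, 2, 3):
        print("RF=%d: ~%.2f TB" % (rf, per_replica * rf / 1e12))

With the million rows and 35,040 columns per row you describe, that works out 
to roughly 1.1 TB per replica before replication is taken into account.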

Cheers

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15 Jul 2011, at 07:09, Sameer Farooqui wrote:

 We just set up a 12-node demo cluster running Cassandra 0.8.1 and loaded 
 1.5 TB of data into it. However, the actual space on disk being used by data 
 files in Cassandra is 3 TB. We're using a standard column family with a 
 million rows (key=string) and 35,040 columns per key. The column name is a 
 long and the column value is a double.
 
 I was just hoping to understand more about why the data overhead is so large. 
 We're not using expiring columns. Even considering indexing and bloom 
 filters, it shouldn't have bloated up the data size to 2x the original 
 amount. Or should it have?
 
 How can we better anticipate the actual data usage on disk in the future?
 
 - Sameer


