Right, row stats in 0.6 are just "what I've seen during the
compactions that happened to run since this node restarted last."

0.7 has persistent (and more fine-grained) statistics.
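
To make that concrete, here is a toy sketch of the effect (not Cassandra's actual code; the names compact and RowSizeStats are made up, and it assumes a row's size is simply the sum of its fragments). A minor compaction that only touches the small SSTables records row sizes of a few hundred bytes, even though every complete row is over 100,000 bytes:

```python
# Toy model only: each "sstable" maps row key -> size in bytes of the
# fragment of that row stored in that sstable. Hypothetical names throughout.

def compact(sstables):
    """Merge row fragments from the given sstables.

    Returns the merged sstable and the list of row sizes this compaction saw.
    """
    merged = {}
    for table in sstables:
        for key, size in table.items():
            merged[key] = merged.get(key, 0) + size
    return merged, list(merged.values())


class RowSizeStats:
    """In-memory stats, fed only by compactions run since 'restart'."""

    def __init__(self):
        self.sizes = []

    def record(self, sizes):
        self.sizes.extend(sizes)

    def minimum(self):
        return min(self.sizes)

    def maximum(self):
        return max(self.sizes)


# One sstable holds the original large column for each row.
big = {key: 100_300 for key in (1, 2, 3)}
# Later flushes hold only the small timestamp columns.
ts1 = {1: 238, 2: 238}
ts2 = {2: 238, 3: 238}

stats = RowSizeStats()

# A minor compaction over just the two small sstables never sees the
# large column, so it records tiny "row" sizes.
merged_small, seen = compact([ts1, ts2])
stats.record(seen)

# A later compaction that includes the big sstable sees the whole rows.
merged_all, seen = compact([big, merged_small])
stats.record(seen)

print(stats.minimum(), stats.maximum())
```

The minimum comes from the first compaction and the maximum from the second, so neither has to match the size of a row as a client reads it.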

On Thu, Aug 12, 2010 at 1:28 PM, Ryan King <r...@twitter.com> wrote:
> On Thu, Aug 12, 2010 at 9:08 AM, Julie <julie.su...@nextcentury.com> wrote:
>> I am chasing down a row size discrepancy and am confused.
>>
>> I populated a single node Cassandra cluster with 10,000 rows of data, using
>> numeric keys 1-10,000, where each row is a little over 100kB in length and
>> has a single column in it.
>>
>> When I perform a cfstats on the node immediately after writing the data, it
>> reports that the Compacted row minimum size = Compacted row maximum size
>> which is a little over 100,000 bytes.  This is what I expect.
>>
>> I then run an application that randomly reads rows and adds a timestamp
>> column to each row read.  This timestamp column name and column value is just
>> adding a few bytes to the row.
>>
>> But after running my reading app for a few hours, cfstats reports a very odd
>> minimum row size (and compacted mean row size):
>>
>> [r...@ec2-server1 ~]# /mnt/server/apache-cassandra-0.6.2/bin/nodetool -h ec2-server1 -p 8080 cfstats
>> Keyspace: Keyspace1
>>        Read Count: 670434
>>        Read Latency: 36.22349047035205 ms.
>>        Write Count: 1519933
>>        Write Latency: 0.02940705741634664 ms.
>>        Pending Tasks: 0
>>                Column Family: Standard1
>>                SSTable count: 6
>>                Space used (live): 11130225642
>>                Space used (total): 11130225642
>>                Memtable Columns Count: 1435
>>                Memtable Data Size: 40344907
>>                Memtable Switch Count: 1329
>>                Read Count: 670434
>>                Read Latency: 41.768 ms.
>>                Write Count: 1519933
>>                Write Latency: 0.025 ms.
>>                Pending Tasks: 0
>>                Key cache capacity: 200000
>>                Key cache size: 200000
>>                Key cache hit rate: 0.48049934471509675
>>                Row cache: disabled
>>                Compacted row minimum size: 238
>>                Compacted row maximum size: 100323
>>                Compacted row mean size: 67548
>>
>> I thought I had a bug in my code so I wrote another app to read every row
>> in the database, keys 1-10,000.  I get the size of each row after reading it
>> (by adding up all column names and column values in the row and the size of
>> the key string) and this matches what I expect -- every single key in this
>> table has a size of just over 100,000 bytes.  (I know there are some
>> overhead columns in each row but I assume these will only make the row
>> larger, not smaller.)
>>
>> So I am confused about where cfstats is getting the row sizes it is working
>> with?
>>
>> When I add the timestamp column to each row, I am not deleting the other
>> column (large) in the row but I am not rewriting the large column either.
>
> I'm guessing (haven't read this part of the source) that the min size
> is being generated in minor compaction, which doesn't see the whole
> row.
>
> -ryan
>
-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com