You said compaction can't keep up. Are you manually running compaction all the
time, or just letting Cassandra kick off compactions when needed? Is compaction
always running at 100%, or are you saying your disk is growing faster than you
like and you would like compactions to always be running at 100%? (Compactions
in LCS can be kicked off manually; you may want to try that and then check your
sstables again.)

Dean

From: cem <cayiro...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Tuesday, May 28, 2013 1:17 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: data clean up problem

Thanks for the answer, but it is already set to 0 since I don't do any deletes.

Cem


On Tue, May 28, 2013 at 9:03 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
You need to change the gc_grace time of the column family. It defaults to 10 
days. By default the tombstones will not go away for 10 days.


On Tue, May 28, 2013 at 2:46 PM, cem <cayiro...@gmail.com> wrote:
Hi Experts,


We have a general problem with cleaning up data from the disk. I need to free
the disk space after the retention period, and the customer wants to dimension
the disk space based on that.

After running multiple performance tests with a TTL of 1 day, we saw that
compaction couldn't keep up with the request rate. Disks were getting full
after 3 days. After 3 days there were also a lot of sstables that were older
than 1 day.

Things that we tried:

- Change the compaction strategy to leveled. (helped a bit but not much)

- Use a big sstable size (10G) with leveled compaction to have more aggressive
compaction. (helped a bit but not much)

- Upgrade Cassandra from 1.0 to 1.2 to use TTL histograms (didn't help at all
since it has a key-overlap estimation algorithm that generates a 100% match.
Although we don't have...)

Our column family structure is like this:

Event_data_cf: (we store event data. Event_id is randomly generated, and each
event has attributes like location=london)

row                  data

event id          data blob

timeseries_cf: (the key is the attribute that we want to index. It can be
location=london. We didn't use secondary indexes because the indexes are
dynamic.)

row                  data

index key       time series of event ids (event1_id, event2_id, ...)

timeseries_inv_cf: (this is used for removing an event by its event row key.)

row                  data

event id          set of index keys

Candidate Solution: Implementing time range partitions.

Each partition will have its own set of column families and will be managed by
the client.

Suppose that you want to have a 7-day retention period. Then you can configure
the partition size as 1 day and have 7 active partitions at any time. Then you
can drop inactive partitions (older than 7 days). Dropping will immediately
remove the data from the disk. (With proper cassandra.yaml configuration)
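
A minimal sketch of the partition bookkeeping described above, assuming 1-day
partitions counted from the Unix epoch in UTC. The function names and the
integer partition numbering are hypothetical, chosen only for illustration; the
actual drop of the column families is left to the client.

```python
from datetime import datetime, timedelta, timezone

PARTITION = timedelta(days=1)   # partition size from the example
RETENTION_PARTITIONS = 7        # 7 active partitions = 7 days of retention
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def partition_id(ts):
    """Number of whole partition periods between the epoch and ts."""
    return (ts - EPOCH) // PARTITION


def active_partitions(now):
    """Partition ids that are still inside the retention period."""
    current = partition_id(now)
    return list(range(current - RETENTION_PARTITIONS + 1, current + 1))


def expired_partitions(existing, now):
    """Partition ids from `existing` that are older than the retention period."""
    oldest_active = partition_id(now) - RETENTION_PARTITIONS + 1
    return sorted(p for p in existing if p < oldest_active)
```

The client would call expired_partitions periodically and drop the column
family set of each returned partition.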

Storing an event:

Find the current partition p1

Store the event data to Event_data_cf_p1

Store the indexes to timeseries_cf_p1

Store the inverted indexes to timeseries_inv_cf_p1
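
The write path above could look roughly like this. This is only a sketch: a
plain dict stands in for Cassandra (column family name -> rows), timestamps are
seconds since the epoch, and the helper names and the "p<number>" suffix scheme
are assumptions made for illustration.

```python
from collections import defaultdict

PARTITION_SECONDS = 24 * 60 * 60  # 1-day partitions


def partition_suffix(ts_seconds):
    """Suffix of the partition that owns this timestamp, e.g. 'p3'."""
    return "p%d" % (ts_seconds // PARTITION_SECONDS)


def store_event(store, event_id, blob, index_keys, ts_seconds):
    """Write one event to the three column families of its partition."""
    p = partition_suffix(ts_seconds)
    # Event data: row key is the event id, value is the data blob.
    store["Event_data_cf_" + p][event_id] = blob
    # Index: row key is the index key, value is a time series of event ids.
    for key in index_keys:
        store["timeseries_cf_" + p].setdefault(key, []).append((ts_seconds, event_id))
    # Inverted index: row key is the event id, value is its set of index keys.
    store["timeseries_inv_cf_" + p][event_id] = set(index_keys)
    return p


# In-memory stand-in for the cluster, used only for this sketch.
store = defaultdict(dict)
store_event(store, "event1_id", b"blob", ["location=london"], 3 * PARTITION_SECONDS + 5)
```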


A time range query with an index:

Find all the partitions that belong to that time range

Read starting from the first partition until you reach the limit
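
A sketch of that read path, again with a plain dict standing in for Cassandra
and hypothetical names. It assumes index rows hold (timestamp, event id) pairs
with timestamps in seconds since the epoch, and that partitions are read
oldest first.

```python
PARTITION_SECONDS = 24 * 60 * 60  # 1-day partitions


def partitions_in_range(start, end):
    """Partition numbers covering [start, end], oldest first."""
    return list(range(start // PARTITION_SECONDS, end // PARTITION_SECONDS + 1))


def query_index(store, index_key, start, end, limit):
    """Collect event ids for index_key in [start, end], stopping at limit."""
    results = []
    if limit <= 0:
        return results
    for p in partitions_in_range(start, end):
        # A partition that was never written or was already dropped is skipped.
        rows = store.get("timeseries_cf_p%d" % p, {})
        for ts, event_id in sorted(rows.get(index_key, [])):
            if start <= ts <= end:
                results.append(event_id)
                if len(results) == limit:
                    return results
    return results
```

Stopping as soon as the limit is reached means later partitions are not read
at all when the first ones already hold enough events.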

.....

Could you please provide your comments and concerns?

Is there any other option that we can try?

What do you think about the candidate solution?

Does anyone have the same issue? How would you solve it in another way?


Thanks in advance!

Cem

