Couple of approaches to exporting…

1) If you know the list of keys you want to export, you could use / modify the 
sstable2json tool and pass in the list of keys. If expiring columns are used, 
either strip the expiration fields from the output afterwards or modify 
sstable2json to not include them. 

2) If the list of keys is too big, but you can parse a key to determine whether 
it should be exported, it would be possible to modify sstable2json to use a 
regex or similar to select the keys which match, or to include some business 
logic (there is a sketch of that selection logic after point 3 below). 

Both of these would require reading all sstables. Even though the maximum 
timestamp in an sstable is stored on disk, it would be of no use for selecting 
keys. In the server we work out the min and max keys in each sstable, so it 
would be possible to be a little smarter about this by reading the -Index.db 
file.

If you are already monkeying around with the sstable export code, it could also 
be changed to emit your preferred output format. 

3) Use Hadoop + Hive. 
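
For 1) and 2) the selection logic itself is only a few lines; the real work is 
wiring it into the export. A rough sketch of just that logic (plain Java, 
nothing Cassandra specific -- the key layout, the example regex and the idea of 
hooking it into sstable2json's row loop, or into a post-processor over its JSON 
output, are all assumptions for illustration):

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Decides whether a row key should be exported, either because it is in an
// explicit key list (approach 1) or because it matches a pattern / business
// rule (approach 2).
public class KeyFilter
{
    private final Set<String> explicitKeys; // approach 1, may be null
    private final Pattern keyPattern;       // approach 2, may be null

    public KeyFilter(Set<String> explicitKeys, Pattern keyPattern)
    {
        this.explicitKeys = explicitKeys;
        this.keyPattern = keyPattern;
    }

    // Keys are compared in whatever form the export hands you, e.g. the hex
    // string sstable2json prints.
    public boolean shouldExport(String key)
    {
        if (explicitKeys != null && explicitKeys.contains(key))
            return true;
        return keyPattern != null && keyPattern.matcher(key).matches();
    }

    public static void main(String[] args)
    {
        // Made up key layout "<customer>:<yyyy-MM>": export December 2011 plus
        // two specific keys.
        KeyFilter filter = new KeyFilter(
                new HashSet<String>(Arrays.asList("customer-1:2012-03",
                                                  "customer-2:2012-03")),
                Pattern.compile(".*:2011-12"));
        System.out.println(filter.shouldExport("customer-7:2011-12")); // true (regex)
        System.out.println(filter.shouldExport("customer-1:2012-03")); // true (key list)
        System.out.println(filter.shouldExport("customer-7:2012-06")); // false
    }
}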


To purge the data from Cassandra…

Use TTL, a low gc_grace_seconds and compaction. You can specify the files you 
want to compact via JMX, so this could be added to nodetool. It may be 
necessary to use some smarts to work out which sstables to compact; the 
maxTimestamp in the sstable header will help here. 
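
To give an idea of the JMX side, something like the following should work from 
a standalone Java program (the host, port, keyspace and file names below are 
placeholders, and the exact name and signature of the user defined compaction 
operation on the CompactionManager MBean varies between versions -- check your 
node with jconsole first):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Asks a node to compact an explicit list of sstables via the
// CompactionManager MBean, the same hook nodetool could use.
public class ForceCompaction
{
    public static void main(String[] args) throws Exception
    {
        // Placeholder connection details; 7199 is the default JMX port.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try
        {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName compactionManager =
                    new ObjectName("org.apache.cassandra.db:type=CompactionManager");

            // Comma separated list of -Data.db files to compact together.
            // Working out which files are old enough is the "smarts" mentioned
            // above (e.g. via the sstable max timestamp).
            String dataFiles = "MyKeyspace-MyCF-hf-101-Data.db,"
                             + "MyKeyspace-MyCF-hf-102-Data.db";

            // Assumed two argument (keyspace, files) form; some versions take
            // only the file list, so adjust to match your node.
            mbs.invoke(compactionManager,
                       "forceUserDefinedCompaction",
                       new Object[]{ "MyKeyspace", dataFiles },
                       new String[]{ "java.lang.String", "java.lang.String" });
        }
        finally
        {
            connector.close();
        }
    }
}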

Note that columns are not purged unless all fragments from the row are included 
in the compaction.  This could be a problem. It probably depends on your 
workload though. 

Hope that helps. 
 

-----------------
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 19/12/2012, at 6:37 AM, Michael Kjellman <mkjell...@barracuda.com> wrote:

> Yeah. No JOINs as of now in Cassandra.
> 
> What if you dumped the CF in question once a month to json and rewrote each 
> record in the json data if it met the timestamp you were interested in 
> archiving? 
> 
> You could then bulk load each "month" back in if you had to restore. 
> 
> Doesn't help with deletes though, and I would advise against large mass delete 
> operations each month -- they tend to lead to a very unhappy cluster. 
> 
> On Dec 18, 2012, at 9:23 AM, "stephen.m.thomp...@wellsfargo.com" 
> <stephen.m.thomp...@wellsfargo.com> wrote:
> 
>> Michael - That is one approach I have considered, but that also makes 
>> querying the system particularly onerous since every column family would 
>> require its own query – I don’t think there is any good way to “join” those, 
>> right?
>>  
>> Chris – that is an interesting concept, but as Viktor and Keith note, it 
>> seems to have problems. 
>>  
>> Could we do this simply by mass deletes?  For example, if I created a column 
>> which was just YYYY/MM, then during our maintenance we could spool off 
>> records that match the month we are archiving, then do a bulk delete by that 
>> key.  We would need to have a secondary index for that, I would assume.
>>  
>>  
>> From: Michael Kjellman [mailto:mkjell...@barracuda.com] 
>> Sent: Tuesday, December 18, 2012 11:15 AM
>> To: user@cassandra.apache.org
>> Subject: Re: Partition maintenance
>>  
>> You could make a column family for each period of time and then drop the 
>> column family when you want to destroy it. Before you drop it you could use 
>> the sstable2json converter and write the json files out to tape.
>>  
>> Might make your life difficult however if you need an input split for map 
>> reduce between each time period because you would be limited to working on 
>> one column family at a time. 
>> 
>> On Dec 18, 2012, at 8:09 AM, "stephen.m.thomp...@wellsfargo.com" 
>> <stephen.m.thomp...@wellsfargo.com> wrote:
>> 
>> Hi folks.  Still working through the details of building out a Cassandra 
>> solution and I have an interesting requirement that I’m not sure how to 
>> implement in Cassandra:
>>  
>> In our current Oracle world, we have the data for this system partitioned by 
>> month, and each month the data that are now 18-months old are archived to 
>> tape/cold storage and then the partition for that month is dropped.  Is 
>> there a way to do something similar with Cassandra without destroying our 
>> overall performance?
>>  
>> Thanks in advance,
>> Steve
>>  
> 
