I guess you can also vote for this ticket: https://issues.apache.org/jira/browse/CASSANDRA-699 :)
</advertising>

--
Sylvain

On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
> On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
>>
>> Let's take Twitter as an example. All the tweets are timestamped. I want
>> to keep only a month's worth of tweets for each user. The number of tweets
>> that fit within this one-month window varies from user to user. What is the
>> best way to accomplish this?
>
> This is the "expiry" problem that has been discussed on this list before. As
> far as I can see, there is no easy way to do it with 0.5.
>
> If you use the ordered partitioner and make the first part of the keys a
> timestamp (or part of it), then you can get the keys and delete them.
>
> However, these deletes will be quite inefficient: currently each row must be
> deleted individually (there was a patch for range delete kicking around, I
> don't know whether it has been accepted yet).
>
> But even if range delete is implemented, it's still quite inefficient and
> not really what you want, and it doesn't work with the RandomPartitioner.
>
> If you have some metadata to say who tweeted within a given period (say 10
> days or 30 days) and you store the tweets all in the same key per user per
> period (say with one column per tweet, or using supercolumns), then you can
> just delete one key per user per period.
>
> One of the problems with using a time-based key with the ordered partitioner
> is that you're always going to have a data imbalance, so you may want to try
> hashing *part* of the key (the first part) so you can still range scan the
> next part. This may fix load balancing while still enabling you to use range
> scans to do data expiry.
>
> e.g. your key is
>
> Hash of day number + user id + timestamp
>
> Then you can range scan the entire day's tweets to expire them, and range
> scan a given user's tweets for a given day efficiently (and doing this for
> 30 days is just 30 range scans).
>
> Putting a hash in there fixes load balancing with OPP.
>
> Mark
>
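
For what it's worth, here is a rough sketch of the "hash of day number + user id + timestamp" key layout Mark describes. Everything specific in it is my assumption, not something from the thread: the helper names, the truncated MD5 prefix, the ":" separator, the zero-padded timestamp, and the idea that keys compare as plain strings. It only builds keys and range bounds; no Cassandra client calls are shown, and user ids are assumed not to contain ":".

```python
import hashlib

SECONDS_PER_DAY = 86400


def day_number(ts: int) -> int:
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day: int) -> str:
    """Hash of the day number, so consecutive days land far apart under OPP.

    Truncating MD5 to 8 hex characters is an arbitrary choice for illustration.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]


def tweet_key(user_id: str, ts: int) -> str:
    """Row key: hash(day) + user id + timestamp.

    The timestamp is zero-padded so string order matches numeric order.
    """
    return f"{day_prefix(day_number(ts))}:{user_id}:{ts:012d}"


def day_range(day: int) -> tuple[str, str]:
    """[start, end) bounds covering every key written on the given day."""
    prefix = day_prefix(day)
    # ';' is the character right after ':' in ASCII, so it closes the range.
    return prefix + ":", prefix + ";"


def user_day_range(user_id: str, day: int) -> tuple[str, str]:
    """[start, end) bounds covering one user's keys on the given day."""
    base = f"{day_prefix(day)}:{user_id}"
    return base + ":", base + ";"


def user_window_ranges(user_id: str, now_ts: int, days: int = 30) -> list[tuple[str, str]]:
    """One range per day for today and the previous days - 1 days."""
    today = day_number(now_ts)
    return [user_day_range(user_id, today - i) for i in range(days)]
```

Expiring a day is then a single range scan over day_range(old_day) followed by deletes, and reading one user's month is the 30 ranges from user_window_ranges. The hash only spreads days around the ring; within one day all keys still share a prefix, so that day's writes go to the same part of the ring.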