Honestly, this has been by far my single biggest obstacle with Cassandra for time-based data--cleaning up old data when the deletion criterion (i.e., the date) isn't the primary key. I've asked about a few different approaches, but I haven't seen any feasible options that are easy to implement. The options I've considered:

1. Use date-based tables and drop the old ones, a la "audit_table_20140610", "audit_table_20140609", etc. But then I run into the issue of having to query every table--I'd have to execute a query against every day's table and merge the results myself. Unless there's something in the binary driver I'm missing, this doesn't sound practical.

2. Use a TTL. But then I basically have to decide on a single value that works for everything, and if it ever turns out I overestimated, I'm basically SOL, because my cluster will be out of space.

3. Maintain a separate index of days to keys, and use this index as the reference for which keys to delete (a rough sketch of what I mean is below). But this requires maintaining another index and a relatively manual delete.

I can't help but feel that I'm way over-engineering this, or that I'm overlooking something basic in my data model.
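For what it's worth, here's roughly what I mean by the third option--a sketch only, and the table name, day granularity, and literal values are all made up:

CREATE TABLE audit_keys_by_day (
    day timestamp,
    id bigint,
    region ascii,
    PRIMARY KEY (day, id, region)
);

-- On every audit write, also record the audit partition key under its day:
INSERT INTO audit_keys_by_day (day, id, region)
VALUES ('2014-06-10', 12345, 'us-east');

-- To purge a day: read the keys back, delete each audit partition,
-- then remove the day's index partition itself:
SELECT id, region FROM audit_keys_by_day WHERE day = '2014-06-10';
DELETE FROM audit WHERE id = 12345 AND region = 'us-east';
DELETE FROM audit_keys_by_day WHERE day = '2014-06-10';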
Andrew

Of course, Jonathan, I'll do my best!

It's an auditing table that currently uses a primary key consisting of a combined partition key (the region and the object ID), plus the date and the process ID. Each event in our system creates anywhere from 1 to 20 rows, and multiple parts of the system might be working on the same "object ID". So the CF is constantly being appended to, but reads are rare.

CREATE TABLE audit (
    id bigint,
    region ascii,
    date timestamp,
    pid int,
    PRIMARY KEY ((id, region), date, pid)
);

Data is queried on a specific object ID and region. Optionally, users can restrict their query to a specific date range, which the above data model provides. However, we generate quite a bit of data, and we want a convenient way to get rid of the oldest data. Our volume scales with the time of year: we might get 50GB a day during peak and 5GB a day off peak.

We could pick the safest number--let's say, 30 days--and set the TTL using that. The problem there is that, for 90% of the year, we'd be using only a very small percentage of our available space.

What I'd like to be able to do is drop the oldest data as needed--i.e., when we hit 80% load across the cluster (or some such metric that takes cluster-wide load into account), drop the oldest day's records until we're back under 80%. That way, we're always using the maximum amount of space we can, without having to worry about running out of space cluster-wide.

My thought is that we could always make the date part of the partition key (a sketch of what I mean is below), but then we'd either a) have to query the entire range of dates, or b) have to force a small date range when querying. What are the penalties? Do you have any other suggestions?
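To be concrete, something like this is what I have in mind--again just a sketch, assuming day-granularity buckets, with made-up literal values:

CREATE TABLE audit (
    id bigint,
    region ascii,
    day timestamp,   -- the bucket: midnight of the event's day
    date timestamp,
    pid int,
    PRIMARY KEY ((id, region, day), date, pid)
);

-- Old data can now be dropped one day-partition at a time, but every
-- read has to supply (or loop over) the day buckets it cares about:
SELECT * FROM audit
 WHERE id = 12345 AND region = 'us-east' AND day = '2014-06-10'
   AND date >= '2014-06-10 00:00:00' AND date < '2014-06-10 12:00:00';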
On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield <jlacefi...@datastax.com> wrote:

> Hello,
>
> Will you please describe the use case and what you are trying to model?
> What are some questions/queries that you would like to serve via
> Cassandra? This will help the community help you a little better.
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
> <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba <redmu...@gmail.com> wrote:
>
>> I've been trying to work around using "date-based tables" because I'd
>> like to avoid the overhead. It seems, however, that this is just not
>> going to work.
>>
>> So here's a question--for these date-based tables (i.e., a table per
>> day/week/month/whatever), how are they queried? If I keep 60 days'
>> worth of auditing data, for example, I'd need to query all 60
>> tables--can I do that smoothly? Or do I have to issue 60 different
>> SELECT statements? Is there a way to run the same query against all
>> the tables?
>>
>> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba <redmu...@gmail.com> wrote:
>>
>>> Ah, so the secondary indexes are really secondary against the primary
>>> key. That makes sense.
>>>
>>> I'm beginning to see why the whole "date-based table" approach is the
>>> only one I've been able to find... thanks for the quick responses, guys!
>>>
>>> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
>>> michal.michal...@boxever.com> wrote:
>>>
>>>> Internally, secondary indexes are just CFs that map the indexed value
>>>> to the row key that value belongs to, so you can only query these
>>>> indexes using "=", not ">", ">=", etc.
>>>>
>>>> However, your query does not require the index *IF* you provide a row
>>>> key--you can use "<" or ">" on the date column like you did, as long
>>>> as you refer to a single row (example below). If you don't provide
>>>> the row key, though, it's not going to work.
>>>>
>>>> M.
>>>>
>>>> Kind regards,
>>>> Michał Michalski,
>>>> michal.michal...@boxever.com
>>>>
>>>> On 9 June 2014 21:18, Redmumba <redmu...@gmail.com> wrote:
>>>>
>>>>> I have a table with a timestamp column on it; however, when I try to
>>>>> query based on it, it fails, saying that I must use ALLOW
>>>>> FILTERING--which, to me, means it's not using the secondary index.
>>>>> The table definition is (snipping out irrelevant parts):
>>>>>
>>>>> CREATE TABLE audit (
>>>>>     id bigint,
>>>>>     date timestamp,
>>>>>     ...
>>>>>     PRIMARY KEY (id, date)
>>>>> );
>>>>> CREATE INDEX date_idx ON audit (date);
>>>>>
>>>>> There are other fields, but they are not relevant to this example.
>>>>> The date is part of the primary key, and I have a secondary index on
>>>>> it. When I run a SELECT against it, I get an error:
>>>>>
>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>>>>> Bad Request: Cannot execute this query as it might involve data
>>>>> filtering and thus may have unpredictable performance. If you want
>>>>> to execute this query despite the performance unpredictability, use
>>>>> ALLOW FILTERING
>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
>>>>> ALLOW FILTERING;
>>>>> Request did not complete within rpc_timeout.
>>>>>
>>>>> How can I force it to use the index? I've seen rebuild_index tasks
>>>>> running, but can I verify the "health" of the index?
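>>>>
>>>> For example (a sketch--the id value here is made up), with the row
>>>> key provided, a range on date needs neither the index nor ALLOW
>>>> FILTERING:
>>>>
>>>> SELECT * FROM asinauditing.asinaudit
>>>>  WHERE id = 12345 AND date < '2014-05-01';
>>>>
>>>> The date_idx index alone can only ever serve an exact match, e.g.
>>>> WHERE date = '2014-05-01'.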