Some other ways to track old records are:

1) Use external queues - one queue per week or month, for instance - and
pile up the data on the queue cluster.
2) Create one more table in C* that tracks the keys written per week or
month, which you can scan to read the keys of the audit table. When
purging, delete the entire tracking row rather than individual columns,
because a single row tombstone is more efficient. A sketch of this is
below.
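
A minimal sketch of option 2, assuming a tracking table named
audit_keys_by_week (the table name and week format are just
illustrations):

    -- One partition per week bucket, holding the audit keys written
    -- during that week.
    CREATE TABLE audit_keys_by_week (
        week text,      -- e.g. '2014-W24'
        id bigint,
        region ascii,
        PRIMARY KEY (week, id, region)
    );

    -- Scan the week's partition to find which audit keys to purge:
    SELECT id, region FROM audit_keys_by_week WHERE week = '2014-W24';

    -- After deleting the audit data itself, drop the whole tracking
    -- partition with one row tombstone instead of per-column deletes:
    DELETE FROM audit_keys_by_week WHERE week = '2014-W24';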

On Fri, Jun 13, 2014 at 2:01 PM, Jonathan Lacefield <jlacefi...@datastax.com
> wrote:

> Hello,
>
>   What you are attempting to do reminds me of the old "sliding window"
> partitioning trick in RDBMS systems.  You're right, there is no
> system-provided tool that allows you to perform a similar operation.  You
> could always leverage option 3, and then create a service that helps
> manage the effort of the manual delete.  However, you would still have to
> insert into this separate table for each indexed item.
>
>   The cost of the "every once in a while" delete may be infrequent enough
> for you to do what you were actually trying to do in the first place: use
> a secondary index and query the table with the ALLOW FILTERING clause.
>
>   My recommendation would be to:
>
>   1)  leverage TTLs (see the sketch after this list)
>   2)  see what type of load your original plan would put on your system,
> and if it's acceptable and/or you have downtime in which to execute this
> costly operation, go that route when TTLs aren't keeping up with your load.
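>
>   For example, a minimal sketch of both TTL forms against your audit
> table (the values and the 30-day TTL are only illustrations):
>
>   -- per-write TTL, in seconds:
>   INSERT INTO audit (id, region, date, pid)
>   VALUES (1, 'us-east', '2014-05-01', 42)
>   USING TTL 2592000;
>
>   -- or a table-wide default applied to all new writes:
>   ALTER TABLE audit WITH default_time_to_live = 2592000;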
>
>   You have such a special use case for this functionality that the small,
> infrequent performance hit is preferable to the complexity of implementing
> options 1 and 3.
>
>   Thanks,
>
> Jonathan
>
>
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
> <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Tue, Jun 10, 2014 at 1:39 PM, Redmumba <redmu...@gmail.com> wrote:
>
>> Honestly, this has been by far my single biggest obstacle with Cassandra
>> for time-based data--cleaning up the old data when the deletion criterion
>> (i.e., the date) isn't the primary key.  I've asked about a few different
>> approaches, but I haven't really seen any feasible options that can be
>> implemented easily.  I've seen the following:
>>
>>    1. Use date-based tables, then drop old tables, a la
>>    "audit_table_20140610", "audit_table_20140609", etc.
>>    But then I run into the issue of having to query every table--I would
>>    have to execute queries against every day to get the data, and then
>>    merge the data myself (see the sketch after this list).  Unless there's
>>    something in the binary driver I'm missing, it doesn't sound like this
>>    would be practical.
>>    2. Use a TTL
>>    But then I have to basically decide on a value that works for
>>    everything and, if it ever turns out I overestimated, I'm basically SOL,
>>    because my cluster will be out of space.
>>    3. Maintain a separate index of days to keys, and use this index as
>>    the reference for which keys to delete.
>>    But then this requires maintaining another index and a relatively
>>    manual delete.
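>>
>>    For option 1, the reads would end up looking something like this
>>    sketch (the per-day table names are hypothetical), with the client
>>    merging the result sets itself:
>>
>>    SELECT * FROM audit_table_20140610 WHERE id = 1 AND region = 'us-east';
>>    SELECT * FROM audit_table_20140609 WHERE id = 1 AND region = 'us-east';
>>    -- ...one statement per day in the window, merged client-side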
>>
>> I can't help but feel that I am just way over-engineering this, or that
>> I'm missing something basic in my data model.  With every approach except
>> the last, it seems like I'm overlooking something obvious.
>>
>> Andrew
>>
>>
>> Of course, Jonathan, I'll do my best!
>>
>> It's an auditing table that, right now, uses a primary key consisting of
>> a composite partition key (the region and the object ID), plus the date
>> and the process ID as clustering columns.  Each event in our system will
>> create anywhere from 1-20 rows, for example, and multiple parts of the
>> system might be working on the same "object ID".  So the CF is constantly
>> being appended to, but reads are rare.
>>
>> CREATE TABLE audit (
>>>     id bigint,
>>>     region ascii,
>>>     date timestamp,
>>>     pid int,
>>>     PRIMARY KEY ((id, region), date, pid)
>>> );
>>
>>
>> Data is queried on a specific object ID and region.  Optionally, users
>> can restrict their query to a specific date range, which the above data
>> model provides.
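>>
>> A typical read would look something like this sketch (the values are
>> illustrative):
>>
>> SELECT * FROM audit
>> WHERE id = 1 AND region = 'us-east'
>>   AND date >= '2014-05-01' AND date < '2014-06-01';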
>>
>> However, we generate quite a bit of data, and we want a convenient way to
>> get rid of the oldest data.  Since our system scales with the time of
>> year, we might get 50GB a day during peak, and 5GB a day off peak.  We
>> could pick the safest number--let's say, 30 days--and set the TTL using
>> that.  The problem there is that we'd then be using only a very small
>> percentage of our available space for 90% of the year.
>>
>> What I'd like to be able to do is drop old tables as needed--i.e., let's
>> say when we hit 80% load across the cluster (or some such metric that takes
>> the cluster-wide load into account), I want to drop the oldest day's
>> records until we're under 80%.  That way, we're always using the maximum
>> amount of space we can, without having to worry about getting to the point
>> where we run out of space cluster-wide.
>>
>> My thoughts are--we could always make the date part of the partition key,
>> but then we'd either a) have to query the entire range of dates, or b)
>> have to force a small date range when querying (see the sketch below).
>> What are the penalties?  Do you have any other suggestions?
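>>
>> For illustration, here's a sketch of what (b) might look like with a day
>> bucket added to the partition key (the "day" column is hypothetical):
>>
>> CREATE TABLE audit (
>>     id bigint,
>>     region ascii,
>>     day text,           -- e.g. '2014-06-10'
>>     date timestamp,
>>     pid int,
>>     PRIMARY KEY ((id, region, day), date, pid)
>> );
>>
>> -- Every query must then name the day(s) it touches:
>> SELECT * FROM audit
>> WHERE id = 1 AND region = 'us-east' AND day = '2014-06-10';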
>>
>>
>> On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield <
>> jlacefi...@datastax.com> wrote:
>>
>>> Hello,
>>>
>>>   Will you please describe the use case and what you are trying to
>>> model?  What are some questions/queries that you would like to serve via
>>> Cassandra?  This will help the community help you a little better.
>>>
>>> Jonathan Lacefield
>>> Solutions Architect, DataStax
>>> (404) 822 3487
>>>  <http://www.linkedin.com/in/jlacefield>
>>>
>>> <http://www.datastax.com/cassandrasummit14>
>>>
>>>
>>>
>>> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba <redmu...@gmail.com> wrote:
>>>
>>>> I've been trying to work around using "date-based tables" because I'd
>>>> like to avoid the overhead.  It seems, however, that this is just not going
>>>> to work.
>>>>
>>>> So here's a question--for these date-based tables (i.e., a table per
>>>> day/week/month/whatever), how are they queried?  If I keep 60 days' worth of
>>>> auditing data, for example, I'd need to query all 60 tables--can I do that
>>>> smoothly?  Or do I have to have 60 different select statements?  Is there a
>>>> way for me to run the same query against all the tables?
>>>>
>>>>
>>>> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba <redmu...@gmail.com> wrote:
>>>>
>>>>> Ah, so the secondary indices are really secondary against the primary
>>>>> key.  That makes sense.
>>>>>
>>>>> I'm beginning to see why the whole "date-based table" approach is the
>>>>> only one I've been able to find... thanks for the quick responses, guys!
>>>>>
>>>>>
>>>>> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
>>>>> michal.michal...@boxever.com> wrote:
>>>>>
>>>>>> Secondary indexes internally are just CFs that map the indexed value
>>>>>> to the row key that the value belongs to, so you can only query these
>>>>>> indexes using "=", not ">", ">=", etc.
>>>>>>
>>>>>> However, your query does not require an index *IF* you provide a row
>>>>>> key - you can use "<" or ">" on the date column like you did, as long
>>>>>> as you refer to a single row. If you don't provide the row key,
>>>>>> though, it's not going to work.
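>>>>>>
>>>>>> For example, against your table (a sketch - the id value is only
>>>>>> illustrative):
>>>>>>
>>>>>> -- works: single partition, range on the clustering column
>>>>>> SELECT * FROM audit WHERE id = 1 AND date < '2014-05-01';
>>>>>>
>>>>>> -- still needs ALLOW FILTERING: no partition key provided
>>>>>> SELECT * FROM audit WHERE date < '2014-05-01';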
>>>>>>
>>>>>> M.
>>>>>>
>>>>>> Kind regards,
>>>>>> MichaƂ Michalski,
>>>>>> michal.michal...@boxever.com
>>>>>>
>>>>>>
>>>>>> On 9 June 2014 21:18, Redmumba <redmu...@gmail.com> wrote:
>>>>>>
>>>>>>> I have a table with a timestamp column on it; however, when I try to
>>>>>>> query based on it, it fails saying that I must use ALLOW FILTERING--
>>>>>>> which, to me, means it's not using the secondary index.  The table
>>>>>>> definition is (snipping out irrelevant parts)...
>>>>>>>
>>>>>>> CREATE TABLE audit (
>>>>>>>>     id bigint,
>>>>>>>>     date timestamp,
>>>>>>>> ...
>>>>>>>>     PRIMARY KEY (id, date)
>>>>>>>> );
>>>>>>>> CREATE INDEX date_idx ON audit (date);
>>>>>>>>
>>>>>>>
>>>>>>> There are other fields, but they are not relevant to this example.
>>>>>>> The date is part of the primary key, and I have a secondary index on it.
>>>>>>> When I run a SELECT against it, I get an error:
>>>>>>>
>>>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date <
>>>>>>>> '2014-05-01';
>>>>>>>> Bad Request: Cannot execute this query as it might involve data
>>>>>>>> filtering and thus may have unpredictable performance. If you want to
>>>>>>>> execute this query despite the performance unpredictability, use ALLOW
>>>>>>>> FILTERING
>>>>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date <
>>>>>>>> '2014-05-01' ALLOW FILTERING;
>>>>>>>> Request did not complete within rpc_timeout.
>>>>>>>>
>>>>>>>
>>>>>>> How can I force it to use the index?  I've seen rebuild_index tasks
>>>>>>> running, but can I verify the "health" of the index?
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
