Of course, Jonathan, I'll do my best!

It's an auditing table that, right now, uses a primary key made up of a
composite partition key (the object id and the region), clustered by the
date and the process ID.  Each event in our system creates anywhere from
1 to 20 rows, and multiple parts of the system might be working on the
same "object ID".  So the CF is constantly being appended to, but reads
are rare.

CREATE TABLE audit (
    id bigint,
    region ascii,
    date timestamp,
    pid int,
    PRIMARY KEY ((id, region), date, pid)
);


Data is queried on a specific object ID and region.  Optionally, users can
restrict their query to a specific date range, which the above data model
provides.
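
Concretely, the two query shapes look something like this (the id and
region values here are made up for illustration):

SELECT * FROM audit WHERE id = 12345 AND region = 'us-west';

SELECT * FROM audit WHERE id = 12345 AND region = 'us-west'
    AND date >= '2014-04-01' AND date < '2014-05-01';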

However, we generate quite a bit of data, and we want a convenient way to
get rid of the oldest data.  Since our system scales with the time of year,
we might get 50GB a day during peak, and 5GB of data off peak.  We could
pick the safest number--let's say, 30 days--and set the TTL using that.
The problem there is that, for most of the year, we'd be using only a
small percentage of our available space.

What I'd like to be able to do is drop the oldest data as needed--i.e.,
let's say when we hit 80% disk usage across the cluster (or some similar
metric that takes cluster-wide load into account), I want to drop the
oldest day's records until we're back under 80%.  That way, we're always using the maximum
amount of space we can, without having to worry about getting to the point
where we run out of space cluster-wide.
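
If we did go the route of one table per day, my understanding is that the
cleanup step itself becomes trivial--a metadata-only operation, something
like (table name illustrative):

DROP TABLE audit_2014_04_01;

The hard part would then be the triggering logic around cluster-wide disk
usage, not the drop itself.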

My thoughts are--we could always make the date part of the partition key,
but then we'd either a) have to query the entire range of dates, or b)
have to force a small date range on every query.  What are the penalties?
Do you have any other suggestions?
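
To make option (b) concrete, I'm picturing something like a day bucket in
the partition key (the "day" column and the example values are just for
illustration):

CREATE TABLE audit (
    id bigint,
    region ascii,
    day timestamp,
    date timestamp,
    pid int,
    PRIMARY KEY ((id, region, day), date, pid)
);

SELECT * FROM audit WHERE id = 12345 AND region = 'us-west'
    AND day IN ('2014-04-01', '2014-04-02');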


On Mon, Jun 9, 2014 at 5:15 PM, Jonathan Lacefield <jlacefi...@datastax.com>
wrote:

> Hello,
>
>   Will you please describe the use case and what you are trying to model?
>  What are some questions/queries that you would like to serve via
> Cassandra?  This will help the community help you a little better.
>
> Jonathan Lacefield
> Solutions Architect, DataStax
> (404) 822 3487
>  <http://www.linkedin.com/in/jlacefield>
>
> <http://www.datastax.com/cassandrasummit14>
>
>
>
> On Mon, Jun 9, 2014 at 7:51 PM, Redmumba <redmu...@gmail.com> wrote:
>
>> I've been trying to work around using "date-based tables" because I'd
>> like to avoid the overhead.  It seems, however, that this is just not going
>> to work.
>>
>> So here's a question--for these date-based tables (i.e., a table per
>> day/week/month/whatever), how are they queried?  If I keep 60 days' worth of
>> auditing data, for example, I'd need to query all 60 tables--can I do that
>> smoothly?  Or do I have to have 60 different select statements?  Is there a
>> way for me to run the same query against all the tables?
>>
>>
>> On Mon, Jun 9, 2014 at 3:42 PM, Redmumba <redmu...@gmail.com> wrote:
>>
>>> Ah, so the secondary indices are really secondary against the primary
>>> key.  That makes sense.
>>>
>>> I'm beginning to see why the whole "date-based table" approach is the
>>> only one I've been able to find... thanks for the quick responses, guys!
>>>
>>>
>>> On Mon, Jun 9, 2014 at 2:45 PM, Michal Michalski <
>>> michal.michal...@boxever.com> wrote:
>>>
>>>> Internally, secondary indexes are just CFs that map each indexed value
>>>> to the row keys it belongs to, so you can only query these indexes
>>>> using "=", not ">", ">=" etc.
>>>>
>>>> However, your query does not require an index *IF* you provide a row
>>>> key - you can use "<" or ">" on the date column, like you did, as long
>>>> as you restrict the query to a single row.  If you don't provide a row
>>>> key, it's not going to work.
>>>>
>>>> M.
>>>>
>>>> Kind regards,
>>>> MichaƂ Michalski,
>>>> michal.michal...@boxever.com
>>>>
>>>>
>>>> On 9 June 2014 21:18, Redmumba <redmu...@gmail.com> wrote:
>>>>
>>>>> I have a table with a timestamp column on it; however, when I try to
>>>>> query based on it, it fails saying that I must use ALLOW
>>>>> FILTERING--which, to me, means it's not using the secondary index.
>>>>> The table definition is (snipping out irrelevant parts)...
>>>>>
>>>>> CREATE TABLE audit (
>>>>>>     id bigint,
>>>>>>     date timestamp,
>>>>>> ...
>>>>>>     PRIMARY KEY (id, date)
>>>>>> );
>>>>>> CREATE INDEX date_idx ON audit (date);
>>>>>>
>>>>>
>>>>> There are other fields, but they are not relevant to this example.
>>>>> The date is part of the primary key, and I have a secondary index on it.
>>>>> When I run a SELECT against it, I get an error:
>>>>>
>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01';
>>>>>> Bad Request: Cannot execute this query as it might involve data
>>>>>> filtering and thus may have unpredictable performance. If you want to
>>>>>> execute this query despite the performance unpredictability, use ALLOW
>>>>>> FILTERING
>>>>>> cqlsh> SELECT * FROM asinauditing.asinaudit WHERE date < '2014-05-01'
>>>>>> ALLOW FILTERING;
>>>>>> Request did not complete within rpc_timeout.
>>>>>>
>>>>>
>>>>> How can I force it to use the index?  I've seen rebuild_index tasks
>>>>> running, but can I verify the "health" of the index?
>>>>>
>>>>
>>>>
>>>
>>
>
