Hi all,
I have daily snapshots of a database. After sometimes some snapshots are
deleted for retention purpose.

So I have a situation like this:
1 month ago: I have every day snapshots for every day of the past month
>= 2 month ago: I have a snapshots every 15 day
>= 6 month ago: I have a snapshot per month (say on every 1st of the month).

Now I would like to be able to query all this data. Since every snapshot is
about 250GB I naturally went to Big Data direction.
I think that Apache Drill is a great tool, and can be used to resolve my
problem, because i also need to be able to upload data from other sources
and compare it with the snapshots (for analysis purpose).

Also my snapshots can change Schema in time (another reason to search for a
schema-less solution).

Now, if I would need to do be able to execute statistics query, for
example:

Snapshot day 1:
    My_table
    id | title         | category   | date
    1  | My First Post | helloworld | 2015-01-01
Snapshot day 2 (A new post is added today):
    My_table
    id | title          | category   | date
    1  | My first post  | helloworld |  2015-01-01
    2  | My second post | other      | 2015-01-02
Snapshot day 3 (Post 2 is removed today):
    My_table
    id | title         | category   | date
    1  | My First Post | helloworld | 2015-01-01

So between days, a row of the table may or not may be constant. Now, I need
to be able to use a query like this:

SELECT category, COUNT(*) from day1.My_table group by category

This is for one table of one day. If we *want to count the daily average of
posts by category in a month* we should do something like:

SELECT category, SUM(cnt) / 30 from (
    SELECT category, COUNT(*) as cnt from day1.My_table group by category
    UNION ALL     SELECT category, COUNT(*) as cnt from day2.My_table
group by category
    UNION ALL ...
    UNION ALL     SELECT category, COUNT(*) as cnt from day30.My_table
group by category) group by category

Another example, the *number of post published in a month*:

SELECT COUNT(distinct id) from ( SELECT id FROM day1.My_table UNION
ALL ... UNION ALL SELECT id from day30.My_table);

The problem is that because of retention, there can be situation where *there
aren't snapshot* for every day of the month. So for example, after 1 year
we could have 1 snapshot per month. So we couldn't be able to perform a
query like that.

Basically we would need to consider a weight.

Reply via email to