Hi all, I have daily snapshots of a database. After sometimes some snapshots are deleted for retention purpose.
So I have a situation like this: 1 month ago: I have every day snapshots for every day of the past month >= 2 month ago: I have a snapshots every 15 day >= 6 month ago: I have a snapshot per month (say on every 1st of the month). Now I would like to be able to query all this data. Since every snapshot is about 250GB I naturally went to Big Data direction. I think that Apache Drill is a great tool, and can be used to resolve my problem, because i also need to be able to upload data from other sources and compare it with the snapshots (for analysis purpose). Also my snapshots can change Schema in time (another reason to search for a schema-less solution). Now, if I would need to do be able to execute statistics query, for example: Snapshot day 1: My_table id | title | category | date 1 | My First Post | helloworld | 2015-01-01 Snapshot day 2 (A new post is added today): My_table id | title | category | date 1 | My first post | helloworld | 2015-01-01 2 | My second post | other | 2015-01-02 Snapshot day 3 (Post 2 is removed today): My_table id | title | category | date 1 | My First Post | helloworld | 2015-01-01 So between days, a row of the table may or not may be constant. Now, I need to be able to use a query like this: SELECT category, COUNT(*) from day1.My_table group by category This is for one table of one day. If we *want to count the daily average of posts by category in a month* we should do something like: SELECT category, SUM(cnt) / 30 from ( SELECT category, COUNT(*) as cnt from day1.My_table group by category UNION ALL SELECT category, COUNT(*) as cnt from day2.My_table group by category UNION ALL ... UNION ALL SELECT category, COUNT(*) as cnt from day30.My_table group by category) group by category Another example, the *number of post published in a month*: SELECT COUNT(distinct id) from ( SELECT id FROM day1.My_table UNION ALL ... UNION ALL SELECT id from day30.My_table); The problem is that because of retention, there can be situation where *there aren't snapshot* for every day of the month. So for example, after 1 year we could have 1 snapshot per month. So we couldn't be able to perform a query like that. Basically we would need to consider a weight.