On Nov 16, 2012, at 2:59 PM, Rob Crowell wrote:
Thanks for the help so far, Michael! I can explain a little more about
what I'm trying to do (I'm using a fictional application here, but I
think it translates pretty accurately into my actual application).
BACKGROUND
--
Let's say I'm writing an issue tracking application. Each issue that
we're tracking has a type (an issue must have exactly one type), and
each issue may have an unlimited number of user-provided labels.
Each day, people browse our issue tracker, and each visit generates a
page view on the issue. Here's an example of one day's worth of data:
IssueA: Views: 1, Type: Bug, Labels: [login-page (id=1), show-stopper (id=2)]
IssueB: Views: 20, Type: One-Time Task, Labels: [disk-full (id=3), show-stopper (id=2)]
IssueC: Views: 300, Type: Bug, Labels: [login-page (id=1)]
The BigCo. I'm working for is very interested in knowing which issues
are read by the most people, and they need the ability to generate
reports sliced by arbitrary date ranges. However, we can tolerate a
day's delay, so we write summary tables each night. Two of these
summary tables are aggregated by issue type and by label respectively,
and we also write a third table that can be used to drill down and see
page visits bucketed by both type and label:
CREATE TABLE `issue_type` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`type` enum('Bug','Feature','Improvement','One-Time Task') NOT NULL,
`num_visits` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
)
CREATE TABLE `issue_label` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`label_id` int(10) unsigned NOT NULL,
`num_visits` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
)
CREATE TABLE `issue_type_label` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`created` datetime NOT NULL,
`type` enum('Bug','Feature','Improvement','One-Time Task') NOT NULL,
`label_id` int(10) unsigned NOT NULL,
`num_visits` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`)
)
So we'd run these insert statements at midnight:
INSERT INTO issue_type (created, type, num_visits) VALUES
('2012-11-15', 'Bug', 301),
('2012-11-15', 'One-Time Task', 20);
INSERT INTO issue_label (created, label_id, num_visits) VALUES
('2012-11-15', 1, 301),
('2012-11-15', 2, 21),
('2012-11-15', 3, 20);
INSERT INTO issue_type_label (created, type, label_id, num_visits)
VALUES
('2012-11-15', 'Bug', 1, 301),
('2012-11-15', 'Bug', 2, 1),
('2012-11-15', 'One-Time Task', 3, 20),
('2012-11-15', 'One-Time Task', 2, 20);
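As a sanity check, the rollup rows above can be recomputed from the one-day sample data; here's a minimal Python sketch (the issue records are hard-coded from the example, and the variable names are just illustrative):

```python
from collections import Counter

# One day's raw data from the example: (views, type, label_ids)
issues = [
    (1,   "Bug",           [1, 2]),  # IssueA
    (20,  "One-Time Task", [3, 2]),  # IssueB
    (300, "Bug",           [1]),     # IssueC
]

by_type, by_label, by_type_label = Counter(), Counter(), Counter()
for views, type_, labels in issues:
    by_type[type_] += views                        # rows for issue_type
    for label_id in labels:
        by_label[label_id] += views                # rows for issue_label
        by_type_label[(type_, label_id)] += views  # rows for issue_type_label

# by_type:       Bug=301, One-Time Task=20
# by_label:      1=301, 2=21, 3=20
# by_type_label: (Bug,1)=301, (Bug,2)=1,
#                (One-Time Task,3)=20, (One-Time Task,2)=20
```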
Now when we want to generate the summary reports, we query one of the
first two tables (if we're generating a report aggregated by issue
type we hit issue_type; if we're generating a report aggregated by
label we hit issue_label), and when we want to run a drill-down query
by both type and label, we hit issue_type_label:
# broken down by type
SELECT type, SUM(num_visits) FROM issue_type WHERE created >=
'2012-11-01' AND created < '2012-11-16' GROUP BY type;
# broken down by label
SELECT label_id, SUM(num_visits) FROM issue_label WHERE created >=
'2012-11-01' AND created < '2012-11-16' GROUP BY label_id;
# broken down by both type and label
SELECT type, label_id, SUM(num_visits) FROM issue_type_label WHERE
created >= '2012-11-01' AND created < '2012-11-16' GROUP BY type,
label_id;
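To show those range queries end to end, here's a minimal sketch using Python's stdlib sqlite3 module in place of MySQL (TEXT stands in for the ENUM, and dates are ISO strings so string comparison gives the right range semantics; a second day's row is added to show it falls outside the range):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no ENUM or unsigned ints; TEXT/INTEGER stand in here.
conn.execute("""CREATE TABLE issue_type (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created TEXT NOT NULL,
    type TEXT NOT NULL,
    num_visits INTEGER NOT NULL)""")
conn.executemany(
    "INSERT INTO issue_type (created, type, num_visits) VALUES (?, ?, ?)",
    [("2012-11-15", "Bug", 301),
     ("2012-11-15", "One-Time Task", 20),
     ("2012-11-16", "Bug", 40)],  # outside the half-open range below
)
rows = conn.execute(
    """SELECT type, SUM(num_visits) FROM issue_type
       WHERE created >= '2012-11-01' AND created < '2012-11-16'
       GROUP BY type"""
).fetchall()
# sorted(rows) -> [('Bug', 301), ('One-Time Task', 20)]
```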
This works fine and is very fast for our application. However, it's a
bit redundant: since each issue has exactly one issue type, we can
safely ignore the type when generating the label summary report. So
really, we can get away with not having the issue_label table at all,
and just run this instead (we must continue to have the issue_type
table, however, since a single issue can have many labels or none at
all):
# broken down by label
SELECT label_id, SUM(num_visits) FROM issue_type_label WHERE
created >= '2012-11-01' AND created < '2012-11-16' GROUP BY label_id;
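To double-check that the label rollup computed from issue_type_label matches what issue_label would hold, here's a small sqlite3 sketch (schema trimmed to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE issue_type_label (
    created TEXT NOT NULL, type TEXT NOT NULL,
    label_id INTEGER NOT NULL, num_visits INTEGER NOT NULL)""")
conn.executemany(
    "INSERT INTO issue_type_label VALUES (?, ?, ?, ?)",
    [("2012-11-15", "Bug", 1, 301),
     ("2012-11-15", "Bug", 2, 1),
     ("2012-11-15", "One-Time Task", 3, 20),
     ("2012-11-15", "One-Time Task", 2, 20)],
)
# Summing over the finer-grained table collapses the type dimension,
# reproducing the issue_label rows: 1 -> 301, 2 -> 21, 3 -> 20.
rows = conn.execute(
    """SELECT label_id, SUM(num_visits) FROM issue_type_label
       WHERE created >= '2012-11-01' AND created < '2012-11-16'
       GROUP BY label_id"""
).fetchall()
# sorted(rows) -> [(1, 301), (2, 21), (3, 20)]
```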
THE ISSUE
--
In reality, we have many types of fields that we generate reports on
(not just type and label), so we have more than a dozen of these
aggregate tables rolled up by different sets of fields. While porting
our data access layer to SQLAlchemy (we have a separate process to
write these tables which we aren't going to move to SQLAlchemy), I've
created a bunch of functions that use reflection on the various Model
objects I've built to figure out what columns we need to select.
Each column that we don't want to GROUP BY or filter out with a WHERE
clause gets wrapped by func.sum() and passed to a call to
session.query(). My columns are very consistently named, so I have
been able to automate almost all of this with reflection: my code
figures out on its own which columns to select, group by, and sum
over. I really don't want to write a custom call to query()
for each table type that