Probably the most expensive query we have right now is the one that gets
the user's latest notices and notices from subscribed users ("friends
timeline" or "all"). The SQL for it looks something like this:

        SELECT notice.*
        FROM notice JOIN subscription ON notice.profile_id = 
subscription.subscribed
        WHERE subscription.subscriber = ?
        ORDER BY notice.created DESC, notice.id DESC
        LIMIT 0, 20

This gets the last 20 notices that were posted by someone that the
current user is subscribed to. (Note that every user has a "ghost"
subscription to their own account.) It's not 100% correct -- it would
probably be good to add a "AND notice.created >= subscription.created"
in the WHERE clause, to only show notices that were posted after the
user subscribed to the other person. Nobody's complained about this yet,
though.

There are two big problems with this query:

     1. It's pretty darn slow. I've added some memcached caching which
        speeds it up considerably, but for users with a lot of
        subscriptions to people who update pretty often (say, me, for
        example), the cache gets busted too often to be sufficient.
     2. We can't handle functionality where users follow a group, a
        search term, or a hashtag very easily. We'd either need to do
        some crazy UNION query, or get a bunch of query results and
        interleave them by date posted, or even do some tricky temporary
        tables hoohaw. None of this is appealing or extensible for new
        features.

The way I'd like to fix this query is to denormalize the database a
little bit. Instead of calculating which notices a user should see at
the time when someone's hitting their "friends timeline", I'd rather do
that calculation ahead of time, just once.

So, for every user, we'd have a notice inbox (kudos to whoever comes up
with a better name) that stores a user ID and a list of notice IDs --
one for every notice in their inbox. When a new notice is created, we
query for every user who's subscribed to the author's stream, and add
the notice's ID to those subscribers' notice inboxes. Then, in our
friends timeline query, we can just say, "Get the latest 20 notices from
this user's notice inbox." It should be significantly faster (I'd like
to benchmark it ahead of time, though.)

This would also let us do some of the more interesting functionality
around "subscribing" to tags, search terms, and groups. We'd do some
more queries at notice creation time, and slip the notice into the right
inboxes at that point.

There are still some things I'd like to figure out before rolling this
out. First, there are two possible ways I can think of doing it. First
would be to add a new many-to-many join table, notice_inbox, with two
columns, user_id (references user.id) and notice_id (references
notice.id), with one row for every notice in every user's notice inbox.
The other would be a denormalized table with two columns, user_id and
notices. notices would be a BLOB or TEXT column with a delimited list of
notices in the user's inbox -- "114,82,35,12,2" -- so we'd only need one
row per user. We could get the notices column for a given user, and then
do a query like this:

        "SELECT * FROM notices where ID in (" . $notices . ") ORDER BY "
        etc.

Second is the question of archives. This kind of denormalization is
going to really blow out our data storage requirements. For performance
and sanity reasons, we'll want to keep only a sliding window of notices
in the notice inbox at any one time -- I'd probably guess that ~10 pages
worth, or ~1 month's worth, would probably do the trick. We get very,
very very few requests for anything older than that. The question is,
though: what to do if someone asks for stuff older than that? Refuse it?
Recalculate it? Store the data in a secondary archive database?

Anyways, this is probably something I'm going to attack this weekend at
the Codefest, so I'd love to get some feedback ahead of time.

-Evan

-- 
Evan Prodromou <[EMAIL PROTECTED]>
Control Yourself, Inc.
_______________________________________________
Laconica-dev mailing list
[email protected]
http://mail.laconi.ca/mailman/listinfo/laconica-dev

Reply via email to