Leonardo,

I think a new compound key on email_id and activity in the activities table
may help.

I'm not sure if this will help or not, Its hard to test w/o having a large
data set to test against.


On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <leonardoborges...@gmail.com
> wrote:

> Sure can:
>
> show create table activities;
>
> CREATE TABLE `activities` (
>   `id` int(11) NOT NULL AUTO_INCREMENT,
>   `user_id` int(11) DEFAULT NULL,
>   `email` varchar(100) DEFAULT NULL,
>   `country_iso` varchar(2) DEFAULT NULL,
>   `tags` varchar(255) DEFAULT NULL,
>   `postcode` int(11) DEFAULT NULL,
>   `activity` varchar(100) DEFAULT NULL,
>   `page_id` int(11) DEFAULT NULL,
>   `donation_frequency` varchar(100) DEFAULT NULL,
>   `email_id` int(11) DEFAULT NULL,
>   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
> CURRENT_TIMESTAMP,
>   PRIMARY KEY (`id`),
>   KEY `user_idx` (`user_id`),
>   KEY `email_idx` (`email_id`),
>   KEY `activity_idx` (`activity`)
> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>
>
> And the explain:
>
>
> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
> | id | select_type | table | type  | possible_keys                   | key
>      | key_len | ref        | rows   | Extra       |
>
> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
> |  1 | SIMPLE      | u     | index | NULL                            |
> id_idx   | 5       | NULL       | 972064 | Using index |
> |  1 | SIMPLE      | a     | ref   | user_idx,email_idx,activity_idx |
> user_idx | 5       | getup.u.id |     20 | Using where |
>
> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>
>
> Cheers,
> Leonardo Borges
> www.leonardoborges.com
>
>
> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers <joh...@pixelated.net>wrote:
>
>> Can you post show create table for activity and explain output of the
>> problem query?
>>
>> On Jul 7, 2011 8:51 PM, "Leonardo Borges" <leonardoborges...@gmail.com>
>> wrote:
>>
>> Hello everyone,
>>
>> I have an increasingly popular web application running on top of mysql and
>> due to its popularity, I'm running into performance issues. After
>> carefully
>> examining database indexes and tuning queries I was able to pin down the
>> slowest part of the system.
>>
>> The app's got a user segmentation tool that allows you to filter users
>> based
>> on a range of criteria from which the slowest is: "Select all users that
>> did
>> not receive the email of id 100"
>>
>> To answer this question we turn to the activities table, which is
>> basically
>> a denormalized log of actions taken by the user in this format:
>> user_id  | activity        | email_id | ...
>> 10         | email_sent   | 100        | ...
>> 10         | subscribed   | NULL     | ...
>> 10         | email_open  | 100        | ...
>>
>>
>> Given this table and the question above, the usual way of finding out all
>> users who did not receive this email is through the use of a left outer
>> join, such as:
>>
>> select u.id
>> from users u
>> left outer join activities a
>> on u.id = a.user_id
>> and a.activity = 'email_sent'
>> and a.email_id = 100
>> where a.user_id is null
>>
>> That's all fine  for medium-ish tables. However our current activities
>> table
>> has over 13 million rows, slowing the hell out of this left outer join,
>> taking about 52 seconds in my machine.
>>
>> What this query is trying to do is to get the relative complement of set
>> A(users) to B(activities). As far as I know mysql doesn't support set
>> subtraction, thus the reason for these queries being slow.
>>
>> Based on that I've setup a test database on Postgresql, which supports
>> this
>> very set operation and rewrote the query to look like this:
>>
>> select u.id
>> from users u
>> except
>> select a.user_id
>> from activities a
>> where a.activity = 'email_sent'
>> and a.email_id = 100;
>>
>> The fact that postgresql knows how to subtract sets brought this query
>> down
>> to only 4 seconds.
>>
>> My question then is: since this is a somewhat common query in our system,
>> are there any workarounds I could use in mysql to improve things?
>>
>> I did find one myself, but it's a bit convoluted and might not perform
>> well
>> under load, but the following sql script gives me similar performance in
>> mysql:
>>
>> create temporary table email_sent_100
>> select a.user_id
>> from user_activity_events a
>> where a.activity = 'email_sent'
>>
>>
>> and a.email_id = 100;
>>
>> create index user_id_idx on email_sent_100(user_id);   //this could
>> potentially bring the runtime down in the case of a larg temp table.
>>
>> select count(u.id)
>> from users u
>> left outer join email_sent_100 s
>>   on u.id = s.user_id
>>   and s.user_id is null;
>>
>> A lot more lines and a lot more complex, but does the job in this example.
>>
>> I'd appreciate your thoughts.
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>


-- 
-----------------------------
Johnny Withers
601.209.4985
joh...@pixelated.net

Reply via email to