Same as before, but with the new index now listed under possible_keys:

+----+-------------+-------+-------+----------------------------------------------+----------+---------+------------+--------+-------------+
| id | select_type | table | type  | possible_keys                                | key      | key_len | ref        | rows   | Extra       |
+----+-------------+-------+-------+----------------------------------------------+----------+---------+------------+--------+-------------+
|  1 | SIMPLE      | u     | index | NULL                                         | id_idx   | 5       | NULL       | 972064 | Using index |
|  1 | SIMPLE      | a     | ref   | user_idx,email_idx,activity_idx,compound_idx | user_idx | 5       | getup.u.id |     20 | Using where |
+----+-------------+-------+-------+----------------------------------------------+----------+---------+------------+--------+-------------+
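
The optimizer is still choosing user_idx even though compound_idx now shows up
in possible_keys. Assuming the compound index was created on (email_id, activity)
as suggested earlier in the thread (the actual DDL and column order aren't shown
here, so this is only a sketch), one way to check whether it would do any better
is to force it on the joined table and compare the EXPLAIN output and timings:

-- assumed DDL for the compound index; column order is a guess
ALTER TABLE activities ADD INDEX compound_idx (email_id, activity);

-- force the index only to compare plans/timings, not as a permanent change
select u.id
from users u
left outer join activities a force index (compound_idx)
  on u.id = a.user_id
  and a.activity = 'email_sent'
  and a.email_id = 100
where a.user_id is null;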



On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers <joh...@pixelated.net> wrote:

> What did the explain output look like after the new index?
>
>
> On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges <
> leonardoborges...@gmail.com> wrote:
>
>> Hi Johnny,
>>
>> I just gave that a try but, as I suspected, it didn't help.
>>
>> I still believe the problem is that mysql can't handle set
>> subtraction directly. It has to work harder to produce the rows that
>> represent a "no match", filling them with NULL values so they can then
>> be filtered out by the WHERE clause.
>>
>>
>> This type of query seems to be a corner case in mysql that one should be
>> aware of when working with large datasets.
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>> On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers <joh...@pixelated.net> wrote:
>>
>>> Leonardo,
>>>
>>> I think a new compound key on email_id and activity in the activities
>>> table may help.
>>>
>>> I'm not sure if this will help or not; it's hard to test without having a
>>> large data set to test against.
>>>
>>>
>>> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
>>> leonardoborges...@gmail.com> wrote:
>>>
>>>> Sure can:
>>>>
>>>> show create table activities;
>>>>
>>>> CREATE TABLE `activities` (
>>>>   `id` int(11) NOT NULL AUTO_INCREMENT,
>>>>   `user_id` int(11) DEFAULT NULL,
>>>>   `email` varchar(100) DEFAULT NULL,
>>>>   `country_iso` varchar(2) DEFAULT NULL,
>>>>   `tags` varchar(255) DEFAULT NULL,
>>>>   `postcode` int(11) DEFAULT NULL,
>>>>   `activity` varchar(100) DEFAULT NULL,
>>>>   `page_id` int(11) DEFAULT NULL,
>>>>   `donation_frequency` varchar(100) DEFAULT NULL,
>>>>   `email_id` int(11) DEFAULT NULL,
>>>>   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
>>>>   PRIMARY KEY (`id`),
>>>>   KEY `user_idx` (`user_id`),
>>>>   KEY `email_idx` (`email_id`),
>>>>   KEY `activity_idx` (`activity`)
>>>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>>>
>>>>
>>>> And the explain:
>>>>
>>>>
>>>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>>> | id | select_type | table | type  | possible_keys                   | key      | key_len | ref        | rows   | Extra       |
>>>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>>> |  1 | SIMPLE      | u     | index | NULL                            | id_idx   | 5       | NULL       | 972064 | Using index |
>>>> |  1 | SIMPLE      | a     | ref   | user_idx,email_idx,activity_idx | user_idx | 5       | getup.u.id |     20 | Using where |
>>>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>>>
>>>>
>>>> Cheers,
>>>> Leonardo Borges
>>>> www.leonardoborges.com
>>>>
>>>>
>>>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers <joh...@pixelated.net> wrote:
>>>>
>>>>> Can you post the SHOW CREATE TABLE for activities and the EXPLAIN output
>>>>> of the problem query?
>>>>>
>>>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges" <leonardoborges...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I have an increasingly popular web application running on top of mysql
>>>>> and
>>>>> due to its popularity, I'm running into performance issues. After
>>>>> carefully
>>>>> examining database indexes and tuning queries I was able to pin down
>>>>> the
>>>>> slowest part of the system.
>>>>>
>>>>> The app's got a user segmentation tool that allows you to filter users
>>>>> based on a range of criteria, of which the slowest is: "Select all users
>>>>> that did not receive the email of id 100"
>>>>>
>>>>> To answer this question we turn to the activities table, which is
>>>>> basically
>>>>> a denormalized log of actions taken by the user in this format:
>>>>> user_id | activity   | email_id | ...
>>>>> 10      | email_sent | 100      | ...
>>>>> 10      | subscribed | NULL     | ...
>>>>> 10      | email_open | 100      | ...
>>>>>
>>>>>
>>>>> Given this table and the question above, the usual way of finding out
>>>>> all
>>>>> users who did not receive this email is through the use of a left outer
>>>>> join, such as:
>>>>>
>>>>> select u.id
>>>>> from users u
>>>>> left outer join activities a
>>>>> on u.id = a.user_id
>>>>> and a.activity = 'email_sent'
>>>>> and a.email_id = 100
>>>>> where a.user_id is null
>>>>>
>>>>> That's all fine for medium-ish tables. However, our current activities
>>>>> table has over 13 million rows, which slows this left outer join to a
>>>>> crawl: it takes about 52 seconds on my machine.
>>>>>
>>>>> What this query is trying to do is compute the relative complement of B
>>>>> (activities) in A (users), i.e. all users minus those with a matching
>>>>> activity row. As far as I know mysql doesn't support set subtraction,
>>>>> which is why these queries are slow.
>>>>>
>>>>> Based on that, I've set up a test database on PostgreSQL, which supports
>>>>> this very set operation, and rewrote the query to look like this:
>>>>>
>>>>> select u.id
>>>>> from users u
>>>>> except
>>>>> select a.user_id
>>>>> from activities a
>>>>> where a.activity = 'email_sent'
>>>>> and a.email_id = 100;
>>>>>
>>>>> The fact that postgresql knows how to subtract sets brought this query
>>>>> down
>>>>> to only 4 seconds.
>>>>>
>>>>> My question then is: since this is a somewhat common query in our
>>>>> system,
>>>>> are there any workarounds I could use in mysql to improve things?
>>>>>
>>>>> I did find one myself. It's a bit convoluted and might not perform well
>>>>> under load, but the following SQL script gives me similar performance in
>>>>> mysql:
>>>>>
>>>>> create temporary table email_sent_100
>>>>> select a.user_id
>>>>> from user_activity_events a
>>>>> where a.activity = 'email_sent'
>>>>> and a.email_id = 100;
>>>>>
>>>>> -- this could potentially bring the runtime down in the case of a large temp table
>>>>> create index user_id_idx on email_sent_100(user_id);
>>>>>
>>>>> select count(u.id)
>>>>> from users u
>>>>> left outer join email_sent_100 s
>>>>>   on u.id = s.user_id
>>>>> where s.user_id is null;
>>>>>
>>>>> It's a lot more lines and a lot more complex, but it does the job in this
>>>>> example.
>>>>>
>>>>> I'd appreciate your thoughts.
>>>>>
>>>>> Cheers,
>>>>> Leonardo Borges
>>>>> www.leonardoborges.com
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> -----------------------------
>>> Johnny Withers
>>> 601.209.4985
>>> joh...@pixelated.net
>>>
>>
>>
>
>
> --
> -----------------------------
> Johnny Withers
> 601.209.4985
> joh...@pixelated.net
>
