What did the explain output look like after the new index?

On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges <leonardoborges...@gmail.com
> wrote:

> Hi Johnny,
>
> I just gave that a try but it didn't help as I suspected.
>
> I still believe the problem is in mysql not being able to handle set
> subtractions. Therefore, it has to perform the work harder to return the
> rows that represent a "no match" with NULL values in place so they can then
> be filtered by the WHERE clause.
>
>
> This type of query seems to be a corner case in mysql one should be aware
> about when working with large datasets.
>
> Cheers,
> Leonardo Borges
> www.leonardoborges.com
>
>
> On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers <joh...@pixelated.net>wrote:
>
>> Leonardo,
>>
>> I think a new compound key on email_id and activity in the activities
>> table may help.
>>
>> I'm not sure if this will help or not, Its hard to test w/o having a large
>> data set to test against.
>>
>>
>> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
>> leonardoborges...@gmail.com> wrote:
>>
>>> Sure can:
>>>
>>> show create table activities;
>>>
>>> CREATE TABLE `activities` (
>>>   `id` int(11) NOT NULL AUTO_INCREMENT,
>>>   `user_id` int(11) DEFAULT NULL,
>>>   `email` varchar(100) DEFAULT NULL,
>>>   `country_iso` varchar(2) DEFAULT NULL,
>>>   `tags` varchar(255) DEFAULT NULL,
>>>   `postcode` int(11) DEFAULT NULL,
>>>   `activity` varchar(100) DEFAULT NULL,
>>>   `page_id` int(11) DEFAULT NULL,
>>>   `donation_frequency` varchar(100) DEFAULT NULL,
>>>   `email_id` int(11) DEFAULT NULL,
>>>   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
>>> CURRENT_TIMESTAMP,
>>>   PRIMARY KEY (`id`),
>>>   KEY `user_idx` (`user_id`),
>>>   KEY `email_idx` (`email_id`),
>>>   KEY `activity_idx` (`activity`)
>>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>>
>>>
>>> And the explain:
>>>
>>>
>>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>> | id | select_type | table | type  | possible_keys                   |
>>> key      | key_len | ref        | rows   | Extra       |
>>>
>>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>> |  1 | SIMPLE      | u     | index | NULL                            |
>>> id_idx   | 5       | NULL       | 972064 | Using index |
>>> |  1 | SIMPLE      | a     | ref   | user_idx,email_idx,activity_idx |
>>> user_idx | 5       | getup.u.id |     20 | Using where |
>>>
>>> +----+-------------+-------+-------+---------------------------------+----------+---------+------------+--------+-------------+
>>>
>>>
>>> Cheers,
>>> Leonardo Borges
>>> www.leonardoborges.com
>>>
>>>
>>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers <joh...@pixelated.net>wrote:
>>>
>>>> Can you post show create table for activity and explain output of the
>>>> problem query?
>>>>
>>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges" <leonardoborges...@gmail.com>
>>>> wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I have an increasingly popular web application running on top of mysql
>>>> and
>>>> due to its popularity, I'm running into performance issues. After
>>>> carefully
>>>> examining database indexes and tuning queries I was able to pin down the
>>>> slowest part of the system.
>>>>
>>>> The app's got a user segmentation tool that allows you to filter users
>>>> based
>>>> on a range of criteria from which the slowest is: "Select all users that
>>>> did
>>>> not receive the email of id 100"
>>>>
>>>> To answer this question we turn to the activities table, which is
>>>> basically
>>>> a denormalized log of actions taken by the user in this format:
>>>> user_id  | activity        | email_id | ...
>>>> 10         | email_sent   | 100        | ...
>>>> 10         | subscribed   | NULL     | ...
>>>> 10         | email_open  | 100        | ...
>>>>
>>>>
>>>> Given this table and the question above, the usual way of finding out
>>>> all
>>>> users who did not receive this email is through the use of a left outer
>>>> join, such as:
>>>>
>>>> select u.id
>>>> from users u
>>>> left outer join activities a
>>>> on u.id = a.user_id
>>>> and a.activity = 'email_sent'
>>>> and a.email_id = 100
>>>> where a.user_id is null
>>>>
>>>> That's all fine  for medium-ish tables. However our current activities
>>>> table
>>>> has over 13 million rows, slowing the hell out of this left outer join,
>>>> taking about 52 seconds in my machine.
>>>>
>>>> What this query is trying to do is to get the relative complement of set
>>>> A(users) to B(activities). As far as I know mysql doesn't support set
>>>> subtraction, thus the reason for these queries being slow.
>>>>
>>>> Based on that I've setup a test database on Postgresql, which supports
>>>> this
>>>> very set operation and rewrote the query to look like this:
>>>>
>>>> select u.id
>>>> from users u
>>>> except
>>>> select a.user_id
>>>> from activities a
>>>> where a.activity = 'email_sent'
>>>> and a.email_id = 100;
>>>>
>>>> The fact that postgresql knows how to subtract sets brought this query
>>>> down
>>>> to only 4 seconds.
>>>>
>>>> My question then is: since this is a somewhat common query in our
>>>> system,
>>>> are there any workarounds I could use in mysql to improve things?
>>>>
>>>> I did find one myself, but it's a bit convoluted and might not perform
>>>> well
>>>> under load, but the following sql script gives me similar performance in
>>>> mysql:
>>>>
>>>> create temporary table email_sent_100
>>>> select a.user_id
>>>> from user_activity_events a
>>>> where a.activity = 'email_sent'
>>>>
>>>>
>>>> and a.email_id = 100;
>>>>
>>>> create index user_id_idx on email_sent_100(user_id);   //this could
>>>> potentially bring the runtime down in the case of a larg temp table.
>>>>
>>>> select count(u.id)
>>>> from users u
>>>> left outer join email_sent_100 s
>>>>   on u.id = s.user_id
>>>>   and s.user_id is null;
>>>>
>>>> A lot more lines and a lot more complex, but does the job in this
>>>> example.
>>>>
>>>> I'd appreciate your thoughts.
>>>>
>>>> Cheers,
>>>> Leonardo Borges
>>>> www.leonardoborges.com
>>>>
>>>>
>>>
>>
>>
>> --
>> -----------------------------
>> Johnny Withers
>> 601.209.4985
>> joh...@pixelated.net
>>
>
>


-- 
-----------------------------
Johnny Withers
601.209.4985
joh...@pixelated.net

Reply via email to