Re: MySQL and set complements
Just now realized I answered to Mike only oops. So posting it again... forcing the use of the use_id index didn't really improve things, unfortunately. Cheers, Leonardo Borges www.leonardoborges.com On Sat, Jul 9, 2011 at 7:24 AM, mos mo...@fastmail.fm wrote: Leonardo, What happens when you use force index(user_id) ? See http://dev.mysql.com/doc/**refman/5.1/en/index-hints.htmlhttp://dev.mysql.com/doc/refman/5.1/en/index-hints.html Mike At 09:19 AM 7/8/2011, you wrote: Same as before, but with the new index listed in the possible keys: ++-+---+--**-+** --+---**---+-++---** -+-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+--**-+** --+---**---+-++---** -+-+ | 1 | SIMPLE | u | index | NULL | id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_**idx,compound_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+--**-+** --+---**---+-++---** -+-+ On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.net wrote: What did the explain output look like after the new index? On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges leonardoborges...@gmail.com wrote: Hi Johnny, I just gave that a try but it didn't help as I suspected. I still believe the problem is in mysql not being able to handle set subtractions. Therefore, it has to perform the work harder to return the rows that represent a no match with NULL values in place so they can then be filtered by the WHERE clause. This type of query seems to be a corner case in mysql one should be aware about when working with large datasets. Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.net wrote: Leonardo, I think a new compound key on email_id and activity in the activities table may help. I'm not sure if this will help or not, Its hard to test w/o having a large data set to test against. On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Sure can: show create table activities; CREATE TABLE `activities` ( `id` int(11) NOT NULL AUTO_INCREMENT, `user_id` int(11) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `country_iso` varchar(2) DEFAULT NULL, `tags` varchar(255) DEFAULT NULL, `postcode` int(11) DEFAULT NULL, `activity` varchar(100) DEFAULT NULL, `page_id` int(11) DEFAULT NULL, `donation_frequency` varchar(100) DEFAULT NULL, `email_id` int(11) DEFAULT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `user_idx` (`user_id`), KEY `email_idx` (`email_id`), KEY `activity_idx` (`activity`) ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1 And the explain: ++-+---+--**-+** -+--+-**+++---**--+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+--**-+** -+--+-**+++---**--+ | 1 | SIMPLE | u | index | NULL | id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_**idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+--**-+** -+--+-**+++---**--+ Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote: Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically
Re: MySQL and set complements
Leonardo, I think a new compound key on email_id and activity in the activities table may help. I'm not sure if this will help or not, Its hard to test w/o having a large data set to test against. On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Sure can: show create table activities; CREATE TABLE `activities` ( `id` int(11) NOT NULL AUTO_INCREMENT, `user_id` int(11) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `country_iso` varchar(2) DEFAULT NULL, `tags` varchar(255) DEFAULT NULL, `postcode` int(11) DEFAULT NULL, `activity` varchar(100) DEFAULT NULL, `page_id` int(11) DEFAULT NULL, `donation_frequency` varchar(100) DEFAULT NULL, `email_id` int(11) DEFAULT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `user_idx` (`user_id`), KEY `email_idx` (`email_id`), KEY `activity_idx` (`activity`) ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1 And the explain: ++-+---+---+-+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+-+--+-+++-+ | 1 | SIMPLE | u | index | NULL| id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+-+--+-+++-+ Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote: Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id = a.user_id and a.activity = 'email_sent' and a.email_id = 100 where a.user_id is null That's all fine for medium-ish tables. However our current activities table has over 13 million rows, slowing the hell out of this left outer join, taking about 52 seconds in my machine. What this query is trying to do is to get the relative complement of set A(users) to B(activities). As far as I know mysql doesn't support set subtraction, thus the reason for these queries being slow. Based on that I've setup a test database on Postgresql, which supports this very set operation and rewrote the query to look like this: select u.id from users u except select a.user_id from activities a where a.activity = 'email_sent' and a.email_id = 100; The fact that postgresql knows how to subtract sets brought this query down to only 4 seconds. My question then is: since this is a somewhat common query in our system, are there any workarounds I could use in mysql to improve things? I did find one myself, but it's a bit convoluted and might not perform well under load, but the following sql script gives me similar performance in mysql: create temporary table email_sent_100 select a.user_id from user_activity_events a where a.activity = 'email_sent' and a.email_id = 100; create index user_id_idx on email_sent_100(user_id); //this could potentially bring the runtime down in the case of a larg temp table. select count(u.id) from users u left outer join email_sent_100 s on u.id = s.user_id and s.user_id is null; A lot more lines and a lot more complex, but does the job in this example. I'd appreciate your thoughts. Cheers, Leonardo Borges www.leonardoborges.com -- - Johnny Withers 601.209.4985 joh...@pixelated.net
Re: MySQL and set complements
Hi Johnny, I just gave that a try but it didn't help as I suspected. I still believe the problem is in mysql not being able to handle set subtractions. Therefore, it has to perform the work harder to return the rows that represent a no match with NULL values in place so they can then be filtered by the WHERE clause. This type of query seems to be a corner case in mysql one should be aware about when working with large datasets. Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote: Leonardo, I think a new compound key on email_id and activity in the activities table may help. I'm not sure if this will help or not, Its hard to test w/o having a large data set to test against. On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Sure can: show create table activities; CREATE TABLE `activities` ( `id` int(11) NOT NULL AUTO_INCREMENT, `user_id` int(11) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `country_iso` varchar(2) DEFAULT NULL, `tags` varchar(255) DEFAULT NULL, `postcode` int(11) DEFAULT NULL, `activity` varchar(100) DEFAULT NULL, `page_id` int(11) DEFAULT NULL, `donation_frequency` varchar(100) DEFAULT NULL, `email_id` int(11) DEFAULT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `user_idx` (`user_id`), KEY `email_idx` (`email_id`), KEY `activity_idx` (`activity`) ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1 And the explain: ++-+---+---+-+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+-+--+-+++-+ | 1 | SIMPLE | u | index | NULL| id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+-+--+-+++-+ Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote: Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id = a.user_id and a.activity = 'email_sent' and a.email_id = 100 where a.user_id is null That's all fine for medium-ish tables. However our current activities table has over 13 million rows, slowing the hell out of this left outer join, taking about 52 seconds in my machine. What this query is trying to do is to get the relative complement of set A(users) to B(activities). As far as I know mysql doesn't support set subtraction, thus the reason for these queries being slow. Based on that I've setup a test database on Postgresql, which supports this very set operation and rewrote the query to look like this: select u.id from users u except select a.user_id from activities a where a.activity = 'email_sent' and a.email_id = 100; The fact that postgresql knows how to subtract sets brought this query down to only 4 seconds. My question then is: since this is a somewhat common query in our system, are there any workarounds I could use in mysql to improve things? I did find one myself, but it's a bit convoluted and might not perform well under load, but the following sql script gives me similar performance in mysql: create temporary table email_sent_100 select a.user_id from user_activity_events a where a.activity =
Re: MySQL and set complements
What did the explain output look like after the new index? On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges leonardoborges...@gmail.com wrote: Hi Johnny, I just gave that a try but it didn't help as I suspected. I still believe the problem is in mysql not being able to handle set subtractions. Therefore, it has to perform the work harder to return the rows that represent a no match with NULL values in place so they can then be filtered by the WHERE clause. This type of query seems to be a corner case in mysql one should be aware about when working with large datasets. Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote: Leonardo, I think a new compound key on email_id and activity in the activities table may help. I'm not sure if this will help or not, Its hard to test w/o having a large data set to test against. On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Sure can: show create table activities; CREATE TABLE `activities` ( `id` int(11) NOT NULL AUTO_INCREMENT, `user_id` int(11) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `country_iso` varchar(2) DEFAULT NULL, `tags` varchar(255) DEFAULT NULL, `postcode` int(11) DEFAULT NULL, `activity` varchar(100) DEFAULT NULL, `page_id` int(11) DEFAULT NULL, `donation_frequency` varchar(100) DEFAULT NULL, `email_id` int(11) DEFAULT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `user_idx` (`user_id`), KEY `email_idx` (`email_id`), KEY `activity_idx` (`activity`) ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1 And the explain: ++-+---+---+-+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+-+--+-+++-+ | 1 | SIMPLE | u | index | NULL| id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+-+--+-+++-+ Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote: Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id = a.user_id and a.activity = 'email_sent' and a.email_id = 100 where a.user_id is null That's all fine for medium-ish tables. However our current activities table has over 13 million rows, slowing the hell out of this left outer join, taking about 52 seconds in my machine. What this query is trying to do is to get the relative complement of set A(users) to B(activities). As far as I know mysql doesn't support set subtraction, thus the reason for these queries being slow. Based on that I've setup a test database on Postgresql, which supports this very set operation and rewrote the query to look like this: select u.id from users u except select a.user_id from activities a where a.activity = 'email_sent' and a.email_id = 100; The fact that postgresql knows how to subtract sets brought this query down to only 4 seconds. My question then is: since this is a somewhat common query in our system, are there any workarounds I could use in mysql to improve things? I did find one myself, but it's a bit convoluted and might not perform well under load, but the following sql
Re: MySQL and set complements
Same as before, but with the new index listed in the possible keys: ++-+---+---+--+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+--+--+-+++-+ | 1 | SIMPLE | u | index | NULL | id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx,compound_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+--+--+-+++-+ On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.netwrote: What did the explain output look like after the new index? On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges leonardoborges...@gmail.com wrote: Hi Johnny, I just gave that a try but it didn't help as I suspected. I still believe the problem is in mysql not being able to handle set subtractions. Therefore, it has to perform the work harder to return the rows that represent a no match with NULL values in place so they can then be filtered by the WHERE clause. This type of query seems to be a corner case in mysql one should be aware about when working with large datasets. Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote: Leonardo, I think a new compound key on email_id and activity in the activities table may help. I'm not sure if this will help or not, Its hard to test w/o having a large data set to test against. On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Sure can: show create table activities; CREATE TABLE `activities` ( `id` int(11) NOT NULL AUTO_INCREMENT, `user_id` int(11) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `country_iso` varchar(2) DEFAULT NULL, `tags` varchar(255) DEFAULT NULL, `postcode` int(11) DEFAULT NULL, `activity` varchar(100) DEFAULT NULL, `page_id` int(11) DEFAULT NULL, `donation_frequency` varchar(100) DEFAULT NULL, `email_id` int(11) DEFAULT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `user_idx` (`user_id`), KEY `email_idx` (`email_id`), KEY `activity_idx` (`activity`) ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1 And the explain: ++-+---+---+-+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+-+--+-+++-+ | 1 | SIMPLE | u | index | NULL| id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+-+--+-+++-+ Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote: Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id = a.user_id and a.activity = 'email_sent' and a.email_id = 100 where a.user_id is null That's all fine for medium-ish tables. However our current activities table has over
Re: MySQL and set complements
Leonardo, What happens when you use force index(user_id) ? See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html Mike At 09:19 AM 7/8/2011, you wrote: Same as before, but with the new index listed in the possible keys: ++-+---+---+--+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+--+--+-+++-+ | 1 | SIMPLE | u | index | NULL | id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx,compound_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+--+--+-+++-+ On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.netwrote: What did the explain output look like after the new index? On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges leonardoborges...@gmail.com wrote: Hi Johnny, I just gave that a try but it didn't help as I suspected. I still believe the problem is in mysql not being able to handle set subtractions. Therefore, it has to perform the work harder to return the rows that represent a no match with NULL values in place so they can then be filtered by the WHERE clause. This type of query seems to be a corner case in mysql one should be aware about when working with large datasets. Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote: Leonardo, I think a new compound key on email_id and activity in the activities table may help. I'm not sure if this will help or not, Its hard to test w/o having a large data set to test against. On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Sure can: show create table activities; CREATE TABLE `activities` ( `id` int(11) NOT NULL AUTO_INCREMENT, `user_id` int(11) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `country_iso` varchar(2) DEFAULT NULL, `tags` varchar(255) DEFAULT NULL, `postcode` int(11) DEFAULT NULL, `activity` varchar(100) DEFAULT NULL, `page_id` int(11) DEFAULT NULL, `donation_frequency` varchar(100) DEFAULT NULL, `email_id` int(11) DEFAULT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `user_idx` (`user_id`), KEY `email_idx` (`email_id`), KEY `activity_idx` (`activity`) ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1 And the explain: ++-+---+---+-+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+-+--+-+++-+ | 1 | SIMPLE | u | index | NULL| id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+-+--+-+++-+ Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote: Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id =
MySQL and set complements
Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id = a.user_id and a.activity = 'email_sent' and a.email_id = 100 where a.user_id is null That's all fine for medium-ish tables. However our current activities table has over 13 million rows, slowing the hell out of this left outer join, taking about 52 seconds in my machine. What this query is trying to do is to get the relative complement of set A(users) to B(activities). As far as I know mysql doesn't support set subtraction, thus the reason for these queries being slow. Based on that I've setup a test database on Postgresql, which supports this very set operation and rewrote the query to look like this: select u.id from users u except select a.user_id from activities a where a.activity = 'email_sent' and a.email_id = 100; The fact that postgresql knows how to subtract sets brought this query down to only 4 seconds. My question then is: since this is a somewhat common query in our system, are there any workarounds I could use in mysql to improve things? I did find one myself, but it's a bit convoluted and might not perform well under load, but the following sql script gives me similar performance in mysql: create temporary table email_sent_100 select a.user_id from user_activity_events a where a.activity = 'email_sent' and a.email_id = 100; create index user_id_idx on email_sent_100(user_id); //this could potentially bring the runtime down in the case of a larg temp table. select count(u.id) from users u left outer join email_sent_100 s on u.id = s.user_id and s.user_id is null; A lot more lines and a lot more complex, but does the job in this example. I'd appreciate your thoughts. Cheers, Leonardo Borges www.leonardoborges.com
Re: MySQL and set complements
Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id = a.user_id and a.activity = 'email_sent' and a.email_id = 100 where a.user_id is null That's all fine for medium-ish tables. However our current activities table has over 13 million rows, slowing the hell out of this left outer join, taking about 52 seconds in my machine. What this query is trying to do is to get the relative complement of set A(users) to B(activities). As far as I know mysql doesn't support set subtraction, thus the reason for these queries being slow. Based on that I've setup a test database on Postgresql, which supports this very set operation and rewrote the query to look like this: select u.id from users u except select a.user_id from activities a where a.activity = 'email_sent' and a.email_id = 100; The fact that postgresql knows how to subtract sets brought this query down to only 4 seconds. My question then is: since this is a somewhat common query in our system, are there any workarounds I could use in mysql to improve things? I did find one myself, but it's a bit convoluted and might not perform well under load, but the following sql script gives me similar performance in mysql: create temporary table email_sent_100 select a.user_id from user_activity_events a where a.activity = 'email_sent' and a.email_id = 100; create index user_id_idx on email_sent_100(user_id); //this could potentially bring the runtime down in the case of a larg temp table. select count(u.id) from users u left outer join email_sent_100 s on u.id = s.user_id and s.user_id is null; A lot more lines and a lot more complex, but does the job in this example. I'd appreciate your thoughts. Cheers, Leonardo Borges www.leonardoborges.com
Re: MySQL and set complements
Sure can: show create table activities; CREATE TABLE `activities` ( `id` int(11) NOT NULL AUTO_INCREMENT, `user_id` int(11) DEFAULT NULL, `email` varchar(100) DEFAULT NULL, `country_iso` varchar(2) DEFAULT NULL, `tags` varchar(255) DEFAULT NULL, `postcode` int(11) DEFAULT NULL, `activity` varchar(100) DEFAULT NULL, `page_id` int(11) DEFAULT NULL, `donation_frequency` varchar(100) DEFAULT NULL, `email_id` int(11) DEFAULT NULL, `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, PRIMARY KEY (`id`), KEY `user_idx` (`user_id`), KEY `email_idx` (`email_id`), KEY `activity_idx` (`activity`) ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1 And the explain: ++-+---+---+-+--+-+++-+ | id | select_type | table | type | possible_keys | key | key_len | ref| rows | Extra | ++-+---+---+-+--+-+++-+ | 1 | SIMPLE | u | index | NULL| id_idx | 5 | NULL | 972064 | Using index | | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx | user_idx | 5 | getup.u.id | 20 | Using where | ++-+---+---+-+--+-+++-+ Cheers, Leonardo Borges www.leonardoborges.com On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote: Can you post show create table for activity and explain output of the problem query? On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com wrote: Hello everyone, I have an increasingly popular web application running on top of mysql and due to its popularity, I'm running into performance issues. After carefully examining database indexes and tuning queries I was able to pin down the slowest part of the system. The app's got a user segmentation tool that allows you to filter users based on a range of criteria from which the slowest is: Select all users that did not receive the email of id 100 To answer this question we turn to the activities table, which is basically a denormalized log of actions taken by the user in this format: user_id | activity| email_id | ... 10 | email_sent | 100| ... 10 | subscribed | NULL | ... 10 | email_open | 100| ... Given this table and the question above, the usual way of finding out all users who did not receive this email is through the use of a left outer join, such as: select u.id from users u left outer join activities a on u.id = a.user_id and a.activity = 'email_sent' and a.email_id = 100 where a.user_id is null That's all fine for medium-ish tables. However our current activities table has over 13 million rows, slowing the hell out of this left outer join, taking about 52 seconds in my machine. What this query is trying to do is to get the relative complement of set A(users) to B(activities). As far as I know mysql doesn't support set subtraction, thus the reason for these queries being slow. Based on that I've setup a test database on Postgresql, which supports this very set operation and rewrote the query to look like this: select u.id from users u except select a.user_id from activities a where a.activity = 'email_sent' and a.email_id = 100; The fact that postgresql knows how to subtract sets brought this query down to only 4 seconds. My question then is: since this is a somewhat common query in our system, are there any workarounds I could use in mysql to improve things? I did find one myself, but it's a bit convoluted and might not perform well under load, but the following sql script gives me similar performance in mysql: create temporary table email_sent_100 select a.user_id from user_activity_events a where a.activity = 'email_sent' and a.email_id = 100; create index user_id_idx on email_sent_100(user_id); //this could potentially bring the runtime down in the case of a larg temp table. select count(u.id) from users u left outer join email_sent_100 s on u.id = s.user_id and s.user_id is null; A lot more lines and a lot more complex, but does the job in this example. I'd appreciate your thoughts. Cheers, Leonardo Borges www.leonardoborges.com