Re: MySQL and set complements

2011-07-08 Thread Johnny Withers
Leonardo,

I think a new compound key on email_id and activity in the activities table
may help.

I'm not sure if this will help or not, Its hard to test w/o having a large
data set to test against.


On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   | key
  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users that
 did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 13 million rows, slowing the hell out of this left outer join,
 taking about 52 seconds in my machine.

 What this query is trying to do is to get the relative complement of set
 A(users) to B(activities). As far as I know mysql doesn't support set
 subtraction, thus the reason for these queries being slow.

 Based on that I've setup a test database on Postgresql, which supports
 this
 very set operation and rewrote the query to look like this:

 select u.id
 from users u
 except
 select a.user_id
 from activities a
 where a.activity = 'email_sent'
 and a.email_id = 100;

 The fact that postgresql knows how to subtract sets brought this query
 down
 to only 4 seconds.

 My question then is: since this is a somewhat common query in our system,
 are there any workarounds I could use in mysql to improve things?

 I did find one myself, but it's a bit convoluted and might not perform
 well
 under load, but the following sql script gives me similar performance in
 mysql:

 create temporary table email_sent_100
 select a.user_id
 from user_activity_events a
 where a.activity = 'email_sent'


 and a.email_id = 100;

 create index user_id_idx on email_sent_100(user_id);   //this could
 potentially bring the runtime down in the case of a larg temp table.

 select count(u.id)
 from users u
 left outer join email_sent_100 s
   on u.id = s.user_id
   and s.user_id is null;

 A lot more lines and a lot more complex, but does the job in this example.

 I'd appreciate your thoughts.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com





-- 
-
Johnny Withers
601.209.4985
joh...@pixelated.net


Re: MySQL and set complements

2011-07-08 Thread Leonardo Borges
Hi Johnny,

I just gave that a try but it didn't help as I suspected.

I still believe the problem is in mysql not being able to handle set
subtractions. Therefore, it has to perform the work harder to return the
rows that represent a no match with NULL values in place so they can then
be filtered by the WHERE clause.


This type of query seems to be a corner case in mysql one should be aware
about when working with large datasets.

Cheers,
Leonardo Borges
www.leonardoborges.com


On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote:

 Leonardo,

 I think a new compound key on email_id and activity in the activities table
 may help.

 I'm not sure if this will help or not, Its hard to test w/o having a large
 data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   | key
  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users that
 did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 13 million rows, slowing the hell out of this left outer join,
 taking about 52 seconds in my machine.

 What this query is trying to do is to get the relative complement of set
 A(users) to B(activities). As far as I know mysql doesn't support set
 subtraction, thus the reason for these queries being slow.

 Based on that I've setup a test database on Postgresql, which supports
 this
 very set operation and rewrote the query to look like this:

 select u.id
 from users u
 except
 select a.user_id
 from activities a
 where a.activity = 'email_sent'
 and a.email_id = 100;

 The fact that postgresql knows how to subtract sets brought this query
 down
 to only 4 seconds.

 My question then is: since this is a somewhat common query in our system,
 are there any workarounds I could use in mysql to improve things?

 I did find one myself, but it's a bit convoluted and might not perform
 well
 under load, but the following sql script gives me similar performance in
 mysql:

 create temporary table email_sent_100
 select a.user_id
 from user_activity_events a
 where a.activity = 

Re: MySQL and set complements

2011-07-08 Thread Johnny Withers
What did the explain output look like after the new index?


On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hi Johnny,

 I just gave that a try but it didn't help as I suspected.

 I still believe the problem is in mysql not being able to handle set
 subtractions. Therefore, it has to perform the work harder to return the
 rows that represent a no match with NULL values in place so they can then
 be filtered by the WHERE clause.


 This type of query seems to be a corner case in mysql one should be aware
 about when working with large datasets.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote:

 Leonardo,

 I think a new compound key on email_id and activity in the activities
 table may help.

 I'm not sure if this will help or not, Its hard to test w/o having a large
 data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   |
 key  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users that
 did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out
 all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 13 million rows, slowing the hell out of this left outer join,
 taking about 52 seconds in my machine.

 What this query is trying to do is to get the relative complement of set
 A(users) to B(activities). As far as I know mysql doesn't support set
 subtraction, thus the reason for these queries being slow.

 Based on that I've setup a test database on Postgresql, which supports
 this
 very set operation and rewrote the query to look like this:

 select u.id
 from users u
 except
 select a.user_id
 from activities a
 where a.activity = 'email_sent'
 and a.email_id = 100;

 The fact that postgresql knows how to subtract sets brought this query
 down
 to only 4 seconds.

 My question then is: since this is a somewhat common query in our
 system,
 are there any workarounds I could use in mysql to improve things?

 I did find one myself, but it's a bit convoluted and might not perform
 well
 under load, but the following sql 

Re: MySQL and set complements

2011-07-08 Thread Leonardo Borges
Same as before, but with the new index listed in the possible keys:

++-+---+---+--+--+-+++-+
| id | select_type | table | type  | possible_keys
 | key  | key_len | ref| rows   | Extra   |
++-+---+---+--+--+-+++-+
|  1 | SIMPLE  | u | index | NULL
  | id_idx   | 5   | NULL   | 972064 | Using index |
|  1 | SIMPLE  | a | ref   |
user_idx,email_idx,activity_idx,compound_idx | user_idx | 5   |
getup.u.id | 20 | Using where |
++-+---+---+--+--+-+++-+



On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.netwrote:

 What did the explain output look like after the new index?


 On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Hi Johnny,

 I just gave that a try but it didn't help as I suspected.

 I still believe the problem is in mysql not being able to handle set
 subtractions. Therefore, it has to perform the work harder to return the
 rows that represent a no match with NULL values in place so they can then
 be filtered by the WHERE clause.


 This type of query seems to be a corner case in mysql one should be aware
 about when working with large datasets.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote:

 Leonardo,

 I think a new compound key on email_id and activity in the activities
 table may help.

 I'm not sure if this will help or not, Its hard to test w/o having a
 large data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   |
 key  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers 
 joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down
 the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users
 that did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out
 all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 

Re: MySQL and set complements

2011-07-08 Thread mos

Leonardo,
What happens when you use force index(user_id) ?

See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Mike

At 09:19 AM 7/8/2011, you wrote:

Same as before, but with the new index listed in the possible keys:

++-+---+---+--+--+-+++-+
| id | select_type | table | type  | possible_keys
 | key  | key_len | ref| rows   | Extra   |
++-+---+---+--+--+-+++-+
|  1 | SIMPLE  | u | index | NULL
  | id_idx   | 5   | NULL   | 972064 | Using index |
|  1 | SIMPLE  | a | ref   |
user_idx,email_idx,activity_idx,compound_idx | user_idx | 5   |
getup.u.id | 20 | Using where |
++-+---+---+--+--+-+++-+



On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.netwrote:

 What did the explain output look like after the new index?


 On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Hi Johnny,

 I just gave that a try but it didn't help as I suspected.

 I still believe the problem is in mysql not being able to handle set
 subtractions. Therefore, it has to perform the work harder to return the
 rows that represent a no match with NULL values in place so they can 
then

 be filtered by the WHERE clause.


 This type of query seems to be a corner case in mysql one should be aware
 about when working with large datasets.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers 
joh...@pixelated.netwrote:


 Leonardo,

 I think a new compound key on email_id and activity in the activities
 table may help.

 I'm not sure if this will help or not, Its hard to test w/o having a
 large data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 
++-+---+---+-+--+-+++-+

 | id | select_type | table | type  | possible_keys   |
 key  | key_len | ref| rows   | Extra   |

 
++-+---+---+-+--+-+++-+

 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 
++-+---+---+-+--+-+++-+



 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers 
joh...@pixelated.netwrote:


 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down
 the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users
 that did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out
 all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id =