Re: MySQL and set complements

2011-07-12 Thread Leonardo Borges
Just now realized I answered to Mike only oops.

So posting it again... forcing the use of the use_id index didn't really
improve things, unfortunately.


Cheers,
Leonardo Borges
www.leonardoborges.com


On Sat, Jul 9, 2011 at 7:24 AM, mos mo...@fastmail.fm wrote:

 Leonardo,
What happens when you use force index(user_id) ?

 See 
 http://dev.mysql.com/doc/**refman/5.1/en/index-hints.htmlhttp://dev.mysql.com/doc/refman/5.1/en/index-hints.html

 Mike

 At 09:19 AM 7/8/2011, you wrote:

 Same as before, but with the new index listed in the possible keys:

 ++-+---+--**-+**
 --+---**---+-++---**
 -+-+
 | id | select_type | table | type  | possible_keys
 | key  | key_len | ref| rows   | Extra   |
 ++-+---+--**-+**
 --+---**---+-++---**
 -+-+
 |  1 | SIMPLE  | u | index | NULL
  | id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   |
 user_idx,email_idx,activity_**idx,compound_idx | user_idx | 5   |
 getup.u.id | 20 | Using where |
 ++-+---+--**-+**
 --+---**---+-++---**
 -+-+



 On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.net
 wrote:

  What did the explain output look like after the new index?
 
 
  On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges 
  leonardoborges...@gmail.com wrote:
 
  Hi Johnny,
 
  I just gave that a try but it didn't help as I suspected.
 
  I still believe the problem is in mysql not being able to handle set
  subtractions. Therefore, it has to perform the work harder to return
 the
  rows that represent a no match with NULL values in place so they can
 then
  be filtered by the WHERE clause.
 
 
  This type of query seems to be a corner case in mysql one should be
 aware
  about when working with large datasets.
 
  Cheers,
  Leonardo Borges
  www.leonardoborges.com
 
 
  On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.net
 wrote:
 
  Leonardo,
 
  I think a new compound key on email_id and activity in the activities
  table may help.
 
  I'm not sure if this will help or not, Its hard to test w/o having a
  large data set to test against.
 
 
  On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
  leonardoborges...@gmail.com wrote:
 
  Sure can:
 
  show create table activities;
 
  CREATE TABLE `activities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL,
`country_iso` varchar(2) DEFAULT NULL,
`tags` varchar(255) DEFAULT NULL,
`postcode` int(11) DEFAULT NULL,
`activity` varchar(100) DEFAULT NULL,
`page_id` int(11) DEFAULT NULL,
`donation_frequency` varchar(100) DEFAULT NULL,
`email_id` int(11) DEFAULT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
  CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `user_idx` (`user_id`),
KEY `email_idx` (`email_id`),
KEY `activity_idx` (`activity`)
  ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
 
 
  And the explain:
 
 
  ++-+---+--**-+**
 -+--+-**+++---**--+
  | id | select_type | table | type  | possible_keys
 |
  key  | key_len | ref| rows   | Extra   |
 
  ++-+---+--**-+**
 -+--+-**+++---**--+
  |  1 | SIMPLE  | u | index | NULL
  |
  id_idx   | 5   | NULL   | 972064 | Using index |
  |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_**idx
 |
  user_idx | 5   | getup.u.id | 20 | Using where |
 
  ++-+---+--**-+**
 -+--+-**+++---**--+
 
 
  Cheers,
  Leonardo Borges
  www.leonardoborges.com
 
 
  On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers 
 joh...@pixelated.netwrote:
 
  Can you post show create table for activity and explain output of
 the
  problem query?
 
  On Jul 7, 2011 8:51 PM, Leonardo Borges 
 leonardoborges...@gmail.com
  wrote:
 
  Hello everyone,
 
  I have an increasingly popular web application running on top of
 mysql
  and
  due to its popularity, I'm running into performance issues. After
  carefully
  examining database indexes and tuning queries I was able to pin down
  the
  slowest part of the system.
 
  The app's got a user segmentation tool that allows you to filter
 users
  based
  on a range of criteria from which the slowest is: Select all users
  that did
  not receive the email of id 100
 
  To answer this question we turn to the activities table, which is
  basically
  

Re: MySQL and set complements

2011-07-08 Thread Johnny Withers
Leonardo,

I think a new compound key on email_id and activity in the activities table
may help.

I'm not sure if this will help or not, Its hard to test w/o having a large
data set to test against.


On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   | key
  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users that
 did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 13 million rows, slowing the hell out of this left outer join,
 taking about 52 seconds in my machine.

 What this query is trying to do is to get the relative complement of set
 A(users) to B(activities). As far as I know mysql doesn't support set
 subtraction, thus the reason for these queries being slow.

 Based on that I've setup a test database on Postgresql, which supports
 this
 very set operation and rewrote the query to look like this:

 select u.id
 from users u
 except
 select a.user_id
 from activities a
 where a.activity = 'email_sent'
 and a.email_id = 100;

 The fact that postgresql knows how to subtract sets brought this query
 down
 to only 4 seconds.

 My question then is: since this is a somewhat common query in our system,
 are there any workarounds I could use in mysql to improve things?

 I did find one myself, but it's a bit convoluted and might not perform
 well
 under load, but the following sql script gives me similar performance in
 mysql:

 create temporary table email_sent_100
 select a.user_id
 from user_activity_events a
 where a.activity = 'email_sent'


 and a.email_id = 100;

 create index user_id_idx on email_sent_100(user_id);   //this could
 potentially bring the runtime down in the case of a larg temp table.

 select count(u.id)
 from users u
 left outer join email_sent_100 s
   on u.id = s.user_id
   and s.user_id is null;

 A lot more lines and a lot more complex, but does the job in this example.

 I'd appreciate your thoughts.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com





-- 
-
Johnny Withers
601.209.4985
joh...@pixelated.net


Re: MySQL and set complements

2011-07-08 Thread Leonardo Borges
Hi Johnny,

I just gave that a try but it didn't help as I suspected.

I still believe the problem is in mysql not being able to handle set
subtractions. Therefore, it has to perform the work harder to return the
rows that represent a no match with NULL values in place so they can then
be filtered by the WHERE clause.


This type of query seems to be a corner case in mysql one should be aware
about when working with large datasets.

Cheers,
Leonardo Borges
www.leonardoborges.com


On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote:

 Leonardo,

 I think a new compound key on email_id and activity in the activities table
 may help.

 I'm not sure if this will help or not, Its hard to test w/o having a large
 data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   | key
  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users that
 did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 13 million rows, slowing the hell out of this left outer join,
 taking about 52 seconds in my machine.

 What this query is trying to do is to get the relative complement of set
 A(users) to B(activities). As far as I know mysql doesn't support set
 subtraction, thus the reason for these queries being slow.

 Based on that I've setup a test database on Postgresql, which supports
 this
 very set operation and rewrote the query to look like this:

 select u.id
 from users u
 except
 select a.user_id
 from activities a
 where a.activity = 'email_sent'
 and a.email_id = 100;

 The fact that postgresql knows how to subtract sets brought this query
 down
 to only 4 seconds.

 My question then is: since this is a somewhat common query in our system,
 are there any workarounds I could use in mysql to improve things?

 I did find one myself, but it's a bit convoluted and might not perform
 well
 under load, but the following sql script gives me similar performance in
 mysql:

 create temporary table email_sent_100
 select a.user_id
 from user_activity_events a
 where a.activity = 

Re: MySQL and set complements

2011-07-08 Thread Johnny Withers
What did the explain output look like after the new index?


On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hi Johnny,

 I just gave that a try but it didn't help as I suspected.

 I still believe the problem is in mysql not being able to handle set
 subtractions. Therefore, it has to perform the work harder to return the
 rows that represent a no match with NULL values in place so they can then
 be filtered by the WHERE clause.


 This type of query seems to be a corner case in mysql one should be aware
 about when working with large datasets.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote:

 Leonardo,

 I think a new compound key on email_id and activity in the activities
 table may help.

 I'm not sure if this will help or not, Its hard to test w/o having a large
 data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   |
 key  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users that
 did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out
 all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 13 million rows, slowing the hell out of this left outer join,
 taking about 52 seconds in my machine.

 What this query is trying to do is to get the relative complement of set
 A(users) to B(activities). As far as I know mysql doesn't support set
 subtraction, thus the reason for these queries being slow.

 Based on that I've setup a test database on Postgresql, which supports
 this
 very set operation and rewrote the query to look like this:

 select u.id
 from users u
 except
 select a.user_id
 from activities a
 where a.activity = 'email_sent'
 and a.email_id = 100;

 The fact that postgresql knows how to subtract sets brought this query
 down
 to only 4 seconds.

 My question then is: since this is a somewhat common query in our
 system,
 are there any workarounds I could use in mysql to improve things?

 I did find one myself, but it's a bit convoluted and might not perform
 well
 under load, but the following sql 

Re: MySQL and set complements

2011-07-08 Thread Leonardo Borges
Same as before, but with the new index listed in the possible keys:

++-+---+---+--+--+-+++-+
| id | select_type | table | type  | possible_keys
 | key  | key_len | ref| rows   | Extra   |
++-+---+---+--+--+-+++-+
|  1 | SIMPLE  | u | index | NULL
  | id_idx   | 5   | NULL   | 972064 | Using index |
|  1 | SIMPLE  | a | ref   |
user_idx,email_idx,activity_idx,compound_idx | user_idx | 5   |
getup.u.id | 20 | Using where |
++-+---+---+--+--+-+++-+



On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.netwrote:

 What did the explain output look like after the new index?


 On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Hi Johnny,

 I just gave that a try but it didn't help as I suspected.

 I still believe the problem is in mysql not being able to handle set
 subtractions. Therefore, it has to perform the work harder to return the
 rows that represent a no match with NULL values in place so they can then
 be filtered by the WHERE clause.


 This type of query seems to be a corner case in mysql one should be aware
 about when working with large datasets.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers joh...@pixelated.netwrote:

 Leonardo,

 I think a new compound key on email_id and activity in the activities
 table may help.

 I'm not sure if this will help or not, Its hard to test w/o having a
 large data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 ++-+---+---+-+--+-+++-+
 | id | select_type | table | type  | possible_keys   |
 key  | key_len | ref| rows   | Extra   |

 ++-+---+---+-+--+-+++-+
 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 ++-+---+---+-+--+-+++-+


 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers 
 joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down
 the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users
 that did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out
 all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 

Re: MySQL and set complements

2011-07-08 Thread mos

Leonardo,
What happens when you use force index(user_id) ?

See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Mike

At 09:19 AM 7/8/2011, you wrote:

Same as before, but with the new index listed in the possible keys:

++-+---+---+--+--+-+++-+
| id | select_type | table | type  | possible_keys
 | key  | key_len | ref| rows   | Extra   |
++-+---+---+--+--+-+++-+
|  1 | SIMPLE  | u | index | NULL
  | id_idx   | 5   | NULL   | 972064 | Using index |
|  1 | SIMPLE  | a | ref   |
user_idx,email_idx,activity_idx,compound_idx | user_idx | 5   |
getup.u.id | 20 | Using where |
++-+---+---+--+--+-+++-+



On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers joh...@pixelated.netwrote:

 What did the explain output look like after the new index?


 On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Hi Johnny,

 I just gave that a try but it didn't help as I suspected.

 I still believe the problem is in mysql not being able to handle set
 subtractions. Therefore, it has to perform the work harder to return the
 rows that represent a no match with NULL values in place so they can 
then

 be filtered by the WHERE clause.


 This type of query seems to be a corner case in mysql one should be aware
 about when working with large datasets.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers 
joh...@pixelated.netwrote:


 Leonardo,

 I think a new compound key on email_id and activity in the activities
 table may help.

 I'm not sure if this will help or not, Its hard to test w/o having a
 large data set to test against.


 On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges 
 leonardoborges...@gmail.com wrote:

 Sure can:

 show create table activities;

 CREATE TABLE `activities` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `user_id` int(11) DEFAULT NULL,
   `email` varchar(100) DEFAULT NULL,
   `country_iso` varchar(2) DEFAULT NULL,
   `tags` varchar(255) DEFAULT NULL,
   `postcode` int(11) DEFAULT NULL,
   `activity` varchar(100) DEFAULT NULL,
   `page_id` int(11) DEFAULT NULL,
   `donation_frequency` varchar(100) DEFAULT NULL,
   `email_id` int(11) DEFAULT NULL,
   `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
 CURRENT_TIMESTAMP,
   PRIMARY KEY (`id`),
   KEY `user_idx` (`user_id`),
   KEY `email_idx` (`email_id`),
   KEY `activity_idx` (`activity`)
 ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


 And the explain:


 
++-+---+---+-+--+-+++-+

 | id | select_type | table | type  | possible_keys   |
 key  | key_len | ref| rows   | Extra   |

 
++-+---+---+-+--+-+++-+

 |  1 | SIMPLE  | u | index | NULL|
 id_idx   | 5   | NULL   | 972064 | Using index |
 |  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
 user_idx | 5   | getup.u.id | 20 | Using where |

 
++-+---+---+-+--+-+++-+



 Cheers,
 Leonardo Borges
 www.leonardoborges.com


 On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers 
joh...@pixelated.netwrote:


 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql
 and
 due to its popularity, I'm running into performance issues. After
 carefully
 examining database indexes and tuning queries I was able to pin down
 the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users
 that did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is
 basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out
 all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = 

MySQL and set complements

2011-07-07 Thread Leonardo Borges
Hello everyone,

I have an increasingly popular web application running on top of mysql and
due to its popularity, I'm running into performance issues. After carefully
examining database indexes and tuning queries I was able to pin down the
slowest part of the system.

The app's got a user segmentation tool that allows you to filter users based
on a range of criteria from which the slowest is: Select all users that did
not receive the email of id 100

To answer this question we turn to the activities table, which is basically
a denormalized log of actions taken by the user in this format:
user_id  | activity| email_id | ...
10 | email_sent   | 100| ...
10 | subscribed   | NULL | ...
10 | email_open  | 100| ...


Given this table and the question above, the usual way of finding out all
users who did not receive this email is through the use of a left outer
join, such as:

select u.id
from users u
left outer join activities a
on u.id = a.user_id
and a.activity = 'email_sent'
and a.email_id = 100
where a.user_id is null

That's all fine  for medium-ish tables. However our current activities table
has over 13 million rows, slowing the hell out of this left outer join,
taking about 52 seconds in my machine.

What this query is trying to do is to get the relative complement of set
A(users) to B(activities). As far as I know mysql doesn't support set
subtraction, thus the reason for these queries being slow.

Based on that I've setup a test database on Postgresql, which supports this
very set operation and rewrote the query to look like this:

select u.id
from users u
except
select a.user_id
from activities a
where a.activity = 'email_sent'
and a.email_id = 100;

The fact that postgresql knows how to subtract sets brought this query down
to only 4 seconds.

My question then is: since this is a somewhat common query in our system,
are there any workarounds I could use in mysql to improve things?

I did find one myself, but it's a bit convoluted and might not perform well
under load, but the following sql script gives me similar performance in
mysql:

create temporary table email_sent_100
select a.user_id
from user_activity_events a
where a.activity = 'email_sent'


and a.email_id = 100;

create index user_id_idx on email_sent_100(user_id);   //this could
potentially bring the runtime down in the case of a larg temp table.

select count(u.id)
from users u
left outer join email_sent_100 s
   on u.id = s.user_id
   and s.user_id is null;

A lot more lines and a lot more complex, but does the job in this example.

I'd appreciate your thoughts.

Cheers,
Leonardo Borges
www.leonardoborges.com


Re: MySQL and set complements

2011-07-07 Thread Johnny Withers
Can you post show create table for activity and explain output of the
problem query?

On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
wrote:

Hello everyone,

I have an increasingly popular web application running on top of mysql and
due to its popularity, I'm running into performance issues. After carefully
examining database indexes and tuning queries I was able to pin down the
slowest part of the system.

The app's got a user segmentation tool that allows you to filter users based
on a range of criteria from which the slowest is: Select all users that did
not receive the email of id 100

To answer this question we turn to the activities table, which is basically
a denormalized log of actions taken by the user in this format:
user_id  | activity| email_id | ...
10 | email_sent   | 100| ...
10 | subscribed   | NULL | ...
10 | email_open  | 100| ...


Given this table and the question above, the usual way of finding out all
users who did not receive this email is through the use of a left outer
join, such as:

select u.id
from users u
left outer join activities a
on u.id = a.user_id
and a.activity = 'email_sent'
and a.email_id = 100
where a.user_id is null

That's all fine  for medium-ish tables. However our current activities table
has over 13 million rows, slowing the hell out of this left outer join,
taking about 52 seconds in my machine.

What this query is trying to do is to get the relative complement of set
A(users) to B(activities). As far as I know mysql doesn't support set
subtraction, thus the reason for these queries being slow.

Based on that I've setup a test database on Postgresql, which supports this
very set operation and rewrote the query to look like this:

select u.id
from users u
except
select a.user_id
from activities a
where a.activity = 'email_sent'
and a.email_id = 100;

The fact that postgresql knows how to subtract sets brought this query down
to only 4 seconds.

My question then is: since this is a somewhat common query in our system,
are there any workarounds I could use in mysql to improve things?

I did find one myself, but it's a bit convoluted and might not perform well
under load, but the following sql script gives me similar performance in
mysql:

create temporary table email_sent_100
select a.user_id
from user_activity_events a
where a.activity = 'email_sent'


and a.email_id = 100;

create index user_id_idx on email_sent_100(user_id);   //this could
potentially bring the runtime down in the case of a larg temp table.

select count(u.id)
from users u
left outer join email_sent_100 s
  on u.id = s.user_id
  and s.user_id is null;

A lot more lines and a lot more complex, but does the job in this example.

I'd appreciate your thoughts.

Cheers,
Leonardo Borges
www.leonardoborges.com


Re: MySQL and set complements

2011-07-07 Thread Leonardo Borges
Sure can:

show create table activities;

CREATE TABLE `activities` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) DEFAULT NULL,
  `email` varchar(100) DEFAULT NULL,
  `country_iso` varchar(2) DEFAULT NULL,
  `tags` varchar(255) DEFAULT NULL,
  `postcode` int(11) DEFAULT NULL,
  `activity` varchar(100) DEFAULT NULL,
  `page_id` int(11) DEFAULT NULL,
  `donation_frequency` varchar(100) DEFAULT NULL,
  `email_id` int(11) DEFAULT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `user_idx` (`user_id`),
  KEY `email_idx` (`email_id`),
  KEY `activity_idx` (`activity`)
) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


And the explain:

++-+---+---+-+--+-+++-+
| id | select_type | table | type  | possible_keys   | key
   | key_len | ref| rows   | Extra   |
++-+---+---+-+--+-+++-+
|  1 | SIMPLE  | u | index | NULL|
id_idx   | 5   | NULL   | 972064 | Using index |
|  1 | SIMPLE  | a | ref   | user_idx,email_idx,activity_idx |
user_idx | 5   | getup.u.id | 20 | Using where |
++-+---+---+-+--+-+++-+


Cheers,
Leonardo Borges
www.leonardoborges.com


On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers joh...@pixelated.netwrote:

 Can you post show create table for activity and explain output of the
 problem query?

 On Jul 7, 2011 8:51 PM, Leonardo Borges leonardoborges...@gmail.com
 wrote:

 Hello everyone,

 I have an increasingly popular web application running on top of mysql and
 due to its popularity, I'm running into performance issues. After carefully
 examining database indexes and tuning queries I was able to pin down the
 slowest part of the system.

 The app's got a user segmentation tool that allows you to filter users
 based
 on a range of criteria from which the slowest is: Select all users that
 did
 not receive the email of id 100

 To answer this question we turn to the activities table, which is basically
 a denormalized log of actions taken by the user in this format:
 user_id  | activity| email_id | ...
 10 | email_sent   | 100| ...
 10 | subscribed   | NULL | ...
 10 | email_open  | 100| ...


 Given this table and the question above, the usual way of finding out all
 users who did not receive this email is through the use of a left outer
 join, such as:

 select u.id
 from users u
 left outer join activities a
 on u.id = a.user_id
 and a.activity = 'email_sent'
 and a.email_id = 100
 where a.user_id is null

 That's all fine  for medium-ish tables. However our current activities
 table
 has over 13 million rows, slowing the hell out of this left outer join,
 taking about 52 seconds in my machine.

 What this query is trying to do is to get the relative complement of set
 A(users) to B(activities). As far as I know mysql doesn't support set
 subtraction, thus the reason for these queries being slow.

 Based on that I've setup a test database on Postgresql, which supports this
 very set operation and rewrote the query to look like this:

 select u.id
 from users u
 except
 select a.user_id
 from activities a
 where a.activity = 'email_sent'
 and a.email_id = 100;

 The fact that postgresql knows how to subtract sets brought this query down
 to only 4 seconds.

 My question then is: since this is a somewhat common query in our system,
 are there any workarounds I could use in mysql to improve things?

 I did find one myself, but it's a bit convoluted and might not perform well
 under load, but the following sql script gives me similar performance in
 mysql:

 create temporary table email_sent_100
 select a.user_id
 from user_activity_events a
 where a.activity = 'email_sent'


 and a.email_id = 100;

 create index user_id_idx on email_sent_100(user_id);   //this could
 potentially bring the runtime down in the case of a larg temp table.

 select count(u.id)
 from users u
 left outer join email_sent_100 s
   on u.id = s.user_id
   and s.user_id is null;

 A lot more lines and a lot more complex, but does the job in this example.

 I'd appreciate your thoughts.

 Cheers,
 Leonardo Borges
 www.leonardoborges.com