RE: Solved - median (was Re: mean/median/mode)
Using your exact dataset listed at the bottom, I tried using derived tables to combine your implementation under Alpha 4.1.1 as a single statement. I expected a message saying it was to complex or out of some resource but the server goes to 100 percent cpu and I have to kill it. I even got it to crash once. Funny thing is after I kill it my client(mysqlcc) gets the correct result setgo figure. Thanks for the post, I really like your histogram example. Using mysqld-nt Alpha 4.1.1 on WinXP, Dell 2.6ghz 600mg ram. Thanks, Ed SELECT s.name, AVG(s.val) AS median FROM (SELECT x.name, x.val FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING SUM(y.val = x.val) = COUNT(*)/2 AND SUM(y.val = x.val) = COUNT(*)/2) AS s GROUP BY s.name -Original Message- On Friday, December 5, Robert Citek wrote: On Wednesday, December 3, 2003, at 06:27 PM, Robert Citek wrote: How can I calculate the mean/median/mode from a set of data using SQL? After a bit of googling, I found this link: http://mysql.progen.com.tr/doc/en/Group_by_functions.html and a few answers in the comments towards the bottom. Below I've included a sample table and the solution I used to calculate the median. Regards, - Robert - DROP TABLE IF EXISTS data; CREATE TABLE data ( name char(1) default NULL, val int(5) default NULL ) TYPE=MyISAM; INSERT INTO data VALUES ('a',2), ('a',2), ('a',2), ('a',2), ('a',20), ('b',4), ('b',4), ('b',4), ('b',4), ('b',40); CREATE TEMPORARY TABLE medians SELECT x.name, x.val FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING ((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)) and ((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)); SELECT name, AVG(val) AS median FROM medians group by name; Output +--++ | name | median | +--++ | a| 2. | | b| 4. | +--++ Robert, I don't believe this is correct. I think it only appears correct due to the particular nature of your sample data. Try it with different data to see what I mean. For example, DROP TABLE IF EXISTS data; CREATE TABLE data (name char(1) default NULL, val int default NULL); INSERT INTO data VALUES ('a',1), ('a',2), ('a',2), ('a',2), ('a',3), ('a',6), ('a',7), ('a',11), ('a',11), ('a',12), ('a',13), ('a',13), ('a',20); INSERT INTO data VALUES ('b',2), ('b',3), ('b',4), ('b',5); DROP TABLE IF EXISTS medians; CREATE TEMPORARY TABLE medians SELECT x.name, x.val FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING ((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)) AND ((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)); SELECT name, AVG(val) AS median FROM medians group by name; Output +--++ | name | median | +--++ | a| 6.5000 | | b| 3.5000 | +--++ As there are an odd number of values with name = 'a', we should choose the middle value, 7, as the median, but we got 6.5. (I'm assuming we agree that the definition of median is the middle value for N odd and the average of the 2 middle values for N even.) I tried to see if I could tweak your query to get the right result, but, between the complexity of your HAVING clause and the wrinkle of getting separate answers for each value in the name column, I just couldn't wrap my head around it. So, I went looking and found a supposed solution at http://www.oreilly.com/catalog/transqlcook/chapter/ch08.html. It is overly complex, purports to get an answer in a single select, and is simply wrong in the case where there are an even number of values. (The author also gives a definition of statistical median, as opposed to financial median, which doesn't match my Intro Stats book.) But I understood what it was trying to do and came up with the following: To get the median of the values in a column: DROP TABLE IF EXISTS medians; CREATE TEMPORARY TABLE medians SELECT x.val medians FROM data x, data y GROUP BY x.val HAVING SUM(y.val = x.val) = COUNT(*)/2 AND SUM(y.val = x.val) = COUNT(*)/2; SELECT AVG(medians) AS median FROM medians; Output using my sample data given above ++ | median | ++ | 5. | ++ To get the median of the values in a column for each value in another column: DROP TABLE IF EXISTS medians; CREATE TEMPORARY TABLE medians SELECT x.name, x.val medians FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING SUM(y.val = x.val) = COUNT(*)/2 AND SUM(y.val = x.val) = COUNT(*)/2; SELECT name, AVG(medians) AS median FROM medians GROUP BY name; Output using my sample data given above +--++ | name | median | +--++ | a| 7. | | b| 3.5000 | +--++ I've tested these with several different data sets, so I'm fairly
Re: Solved - median (was Re: mean/median/mode)
On Friday, December 5, Robert Citek wrote: On Wednesday, December 3, 2003, at 06:27 PM, Robert Citek wrote: How can I calculate the mean/median/mode from a set of data using SQL? After a bit of googling, I found this link: http://mysql.progen.com.tr/doc/en/Group_by_functions.html and a few answers in the comments towards the bottom. Below I've included a sample table and the solution I used to calculate the median. Regards, - Robert - DROP TABLE IF EXISTS data; CREATE TABLE data ( name char(1) default NULL, val int(5) default NULL ) TYPE=MyISAM; INSERT INTO data VALUES ('a',2), ('a',2), ('a',2), ('a',2), ('a',20), ('b',4), ('b',4), ('b',4), ('b',4), ('b',40); CREATE TEMPORARY TABLE medians SELECT x.name, x.val FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING ((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)) and ((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)); SELECT name, AVG(val) AS median FROM medians group by name; Output +--++ | name | median | +--++ | a| 2. | | b| 4. | +--++ Robert, I don't believe this is correct. I think it only appears correct due to the particular nature of your sample data. Try it with different data to see what I mean. For example, DROP TABLE IF EXISTS data; CREATE TABLE data (name char(1) default NULL, val int default NULL); INSERT INTO data VALUES ('a',1), ('a',2), ('a',2), ('a',2), ('a',3), ('a',6), ('a',7), ('a',11), ('a',11), ('a',12), ('a',13), ('a',13), ('a',20); INSERT INTO data VALUES ('b',2), ('b',3), ('b',4), ('b',5); DROP TABLE IF EXISTS medians; CREATE TEMPORARY TABLE medians SELECT x.name, x.val FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING ((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)) AND ((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)); SELECT name, AVG(val) AS median FROM medians group by name; Output +--++ | name | median | +--++ | a| 6.5000 | | b| 3.5000 | +--++ As there are an odd number of values with name = 'a', we should choose the middle value, 7, as the median, but we got 6.5. (I'm assuming we agree that the definition of median is the middle value for N odd and the average of the 2 middle values for N even.) I tried to see if I could tweak your query to get the right result, but, between the complexity of your HAVING clause and the wrinkle of getting separate answers for each value in the name column, I just couldn't wrap my head around it. So, I went looking and found a supposed solution at http://www.oreilly.com/catalog/transqlcook/chapter/ch08.html. It is overly complex, purports to get an answer in a single select, and is simply wrong in the case where there are an even number of values. (The author also gives a definition of statistical median, as opposed to financial median, which doesn't match my Intro Stats book.) But I understood what it was trying to do and came up with the following: To get the median of the values in a column: DROP TABLE IF EXISTS medians; CREATE TEMPORARY TABLE medians SELECT x.val medians FROM data x, data y GROUP BY x.val HAVING SUM(y.val = x.val) = COUNT(*)/2 AND SUM(y.val = x.val) = COUNT(*)/2; SELECT AVG(medians) AS median FROM medians; Output using my sample data given above ++ | median | ++ | 5. | ++ To get the median of the values in a column for each value in another column: DROP TABLE IF EXISTS medians; CREATE TEMPORARY TABLE medians SELECT x.name, x.val medians FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING SUM(y.val = x.val) = COUNT(*)/2 AND SUM(y.val = x.val) = COUNT(*)/2; SELECT name, AVG(medians) AS median FROM medians GROUP BY name; Output using my sample data given above +--++ | name | median | +--++ | a| 7. | | b| 3.5000 | +--++ I've tested these with several different data sets, so I'm fairly confident they are correct. Michael -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
Re: mean/median/mode
Mike Johnson wrote: From: Robert Citek [mailto:[EMAIL PROTECTED] Hello all, How can I calculate the mean/median/mode from a set of data using SQL? Mean seems to exist as the average (avg): select name, avg(value) from table group by name Is there a way to calculate median and mode with a group by clause? Median: the value at which 50% of the samples are above and below that value. Mode: the most common value For mode, this should work: SELECT COUNT(value) AS mode FROM table GROUP BY value ORDER BY mode DESC LIMIT 1; I expect you meant to include the value itself: SELECT value, COUNT(value) AS mode FROM table GROUP BY value ORDER BY mode DESC LIMIT 1; This will work in many cases, but modes are tricky because: - There may not be a mode. For a value to be a mode, it must occur at least twice. - There may be more than one mode. Any value whose frequency matches the max frequency is a mode. For example, if values 2, 8, and 13 each occur 8 times, and all other values occur less frequently, then 2, 8, and 13 are modes. I feel that there must be a better way, but the following works: CREATE TEMPORARY TABLE counts SELECT val, COUNT(val) AS freq FROM data GROUP BY val HAVING COUNT(*)1 ORDER BY freq DESC; SELECT @mode:=MAX(freq) FROM counts; SELECT val AS mode, freq FROM counts WHERE freq = @mode; DROP TABLE IF EXISTS counts; As a further wrinkle, considering the error inherent in statistical sampling, many statisticians would consider a value whose frequency is much greater than most frequencies but not quite as high as the max frequency to be a mode, or at least sort of a mode. For example, if values 36 and 48 have counts of 102, value 76 has a count of 98, and every other value occurs no more than 25 times, most would call the distribution trimodal (3 modes), even though 76 is not strictly a mode. Put another way, finding the modes is usually about finding the shape of the distribution. An alternative to calculating the modes would be to look for them visually by plotting the distribution. Something like this: SELECT val, COUNT(val) AS frequency, repeat('.',COUNT(val)) AS histogram FROM data GROUP BY val ORDER BY val; +--+---+---+ | val | frequency | histogram | +--+---+---+ |1 | 3 | ... | |2 | 8 | | |3 | 3 | ... | |4 | 3 | ... | |5 | 2 | ..| |6 | 1 | . | |7 | 3 | ... | |8 | 8 | | |9 | 1 | . | | 11 | 2 | ..| | 12 | 3 | ... | | 13 | 8 | | | 14 | 1 | . | | 15 | 1 | . | | 16 | 2 | ..| +--+---+---+ 15 rows in set (0.00 sec) You might want to know something about the size of your counts relative to your screen width before you try that. You could adjust accordingly, however, with something like SELECT val, COUNT(val) AS frequency, repeat('+',ROUND(COUNT(val)/10)) AS histogram FROM data GROUP BY val ORDER BY val; Michael -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
RE: mean/median/mode
From: Robert Citek [mailto:[EMAIL PROTECTED] Hello all, How can I calculate the mean/median/mode from a set of data using SQL? Mean seems to exist as the average (avg): select name, avg(value) from table group by name Is there a way to calculate median and mode with a group by clause? Median: the value at which 50% of the samples are above and below that value. Mode: the most common value For mode, this should work: SELECT COUNT(value) AS mode FROM table GROUP BY value ORDER BY mode DESC LIMIT 1; As for median, it's sort of a hack, but this may do the trick: SELECT FLOOR(COUNT(value)/2) FROM table; SELECT name, value FROM table ORDER BY value ASC LIMIT previous result, 1; Caveat: That'll work if you have an odd # of rows in the table (I have 15 in my test table); I don't know, mathematically, what median should return for a set of data of an even number, actually. The middle two? Or should it pick one? Good luck either way. -- Mike Johnson Web Developer/Systems Asst. Smarter Living, Inc. phone (617) 497-2500 x226 -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
Solved - median (was Re: mean/median/mode)
On Wednesday, December 3, 2003, at 06:27 PM, Robert Citek wrote: How can I calculate the mean/median/mode from a set of data using SQL? After a bit of googling, I found this link: http://mysql.progen.com.tr/doc/en/Group_by_functions.html and a few answers in the comments towards the bottom. Below I've included a sample table and the solution I used to calculate the median. Regards, - Robert - DROP TABLE IF EXISTS data; CREATE TABLE data ( name char(1) default NULL, val int(5) default NULL ) TYPE=MyISAM; INSERT INTO data VALUES ('a',2), ('a',2), ('a',2), ('a',2), ('a',20), ('b',4), ('b',4), ('b',4), ('b',4), ('b',40); CREATE TEMPORARY TABLE medians SELECT x.name, x.val FROM data x, data y WHERE x.name=y.name GROUP BY x.name, x.val HAVING ((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)) and ((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val) = floor((COUNT(*) +1)/2)); SELECT name, AVG(val) AS median FROM medians group by name; Output +--++ | name | median | +--++ | a| 2. | | b| 4. | +--++ -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]
mean/median/mode
Hello all, How can I calculate the mean/median/mode from a set of data using SQL? Mean seems to exist as the average (avg): select name, avg(value) from table group by name Is there a way to calculate median and mode with a group by clause? Median: the value at which 50% of the samples are above and below that value. Mode: the most common value Regards, - Robert -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe:http://lists.mysql.com/[EMAIL PROTECTED]