On Friday, December 5, Robert Citek wrote:

On Wednesday, December 3, 2003, at 06:27 PM, Robert Citek wrote:


How can I calculate the mean/median/mode from a set of data using SQL?


After a bit of googling, I found this link:
  http://mysql.progen.com.tr/doc/en/Group_by_functions.html
and a few answers in the comments towards the bottom.

Below I've included a sample table and the solution I used to calculate the median.

Regards,
- Robert

-----

DROP TABLE IF EXISTS data;

CREATE TABLE data (
  name char(1) default NULL,
  val int(5) default NULL
) TYPE=MyISAM;

INSERT INTO data VALUES
  ('a',2), ('a',2), ('a',2), ('a',2), ('a',20),
  ('b',4), ('b',4), ('b',4), ('b',4), ('b',40);

CREATE TEMPORARY TABLE medians SELECT x.name, x.val
FROM data x, data y
WHERE x.name=y.name
GROUP BY x.name, x.val
HAVING ((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val))))) <= floor((COUNT(*) +1)/2)) and
((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val))))) <= floor((COUNT(*) +1)/2));


SELECT name, AVG(val) AS median FROM medians group by name;

#### Output

+------+--------+
| name | median |
+------+--------+
| a    | 2.0000 |
| b    | 4.0000 |
+------+--------+

Robert,


I don't believe this is correct. I think it only appears correct due to the particular nature of your sample data. Try it with different data to see what I mean. For example,

DROP TABLE IF EXISTS data;

CREATE TABLE data (name char(1) default NULL, val int default NULL);

INSERT INTO data
VALUES ('a',1),  ('a',2),  ('a',2),  ('a',2),  ('a',3),  ('a',6),
       ('a',7),
       ('a',11), ('a',11), ('a',12), ('a',13), ('a',13), ('a',20);
INSERT INTO data VALUES ('b',2), ('b',3), ('b',4), ('b',5);

DROP TABLE IF EXISTS medians;

CREATE TEMPORARY TABLE medians SELECT x.name, x.val
FROM data x, data y
WHERE x.name=y.name
GROUP BY x.name, x.val
HAVING
((COUNT(*)-(SUM(SIGN(1-SIGN(y.val-x.val))))) <= floor((COUNT(*) +1)/2)) AND
((COUNT(*)-(SUM(SIGN(1+SIGN(y.val-x.val))))) <= floor((COUNT(*) +1)/2));

SELECT name, AVG(val) AS median FROM medians group by name;

#### Output
+------+--------+
| name | median |
+------+--------+
| a    | 6.5000 |
| b    | 3.5000 |
+------+--------+

As there are an odd number of values with name = 'a', we should choose the middle value, 7, as the median, but we got 6.5. (I'm assuming we agree that the definition of median is the middle value for N odd and the average of the 2 middle values for N even.)

I tried to see if I could tweak your query to get the right result, but, between the complexity of your HAVING clause and the wrinkle of getting separate answers for each value in the name column, I just couldn't wrap my head around it. So, I went looking and found a supposed solution at <http://www.oreilly.com/catalog/transqlcook/chapter/ch08.html>.

It is overly complex, purports to get an answer in a single select, and is simply wrong in the case where there are an even number of values. (The author also gives a definition of statistical median, as opposed to financial median, which doesn't match my Intro Stats book.) But I understood what it was trying to do and came up with the following:

#### To get the median of the values in a column:

DROP TABLE IF EXISTS medians;

CREATE TEMPORARY TABLE medians
SELECT x.val medians
FROM data x, data y
GROUP BY x.val
HAVING SUM(y.val <= x.val) >= COUNT(*)/2
AND SUM(y.val >= x.val) >= COUNT(*)/2;

SELECT AVG(medians) AS median FROM medians;

#### Output using my sample data given above
+--------+
| median |
+--------+
| 5.0000 |
+--------+


#### To get the median of the values in a column for each value in #### another column:

DROP TABLE IF EXISTS medians;

CREATE TEMPORARY TABLE medians
SELECT x.name, x.val medians
FROM data x, data y
WHERE x.name=y.name
GROUP BY x.name, x.val
HAVING SUM(y.val <= x.val) >= COUNT(*)/2
AND SUM(y.val >= x.val) >= COUNT(*)/2;

SELECT name, AVG(medians) AS median FROM medians GROUP BY name;

#### Output using my sample data given above
+------+--------+
| name | median |
+------+--------+
| a    | 7.0000 |
| b    | 3.5000 |
+------+--------+

I've tested these with several different data sets, so I'm fairly confident they are correct.

Michael


-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]



Reply via email to