This post deals with algorithms to calculate the median and the 
repercussions of such algorithms on the definition of the median.

To calculate the median, many textbooks indicate that the array must be 
*sorted first*. This is both unnecessary as well as very *expensive* as 
a full sort is very time-consuming.

1. I will therefore try to introduce some methods that do NOT need full 
sort to calculate the median. These are significantly faster than the 
classic algorithm. Because the sorted array is NOT further used in a 
spreadsheet after calculating the median, the sort is in most 
circumstances superfluous.

2. The *OASIS open formula document* puts the sorting of the array as a 
prerequisite to defining the median. This is misleading and should 
therefore be replaced with a more algorithm neutral definition. 
Unfortunately, the subscription to OASIS is too expensive for me, 
therefore I hope that other persons that do have access to the 
development board point this out.

I do not know, which algorithms do gnumeric and OOo Calc use when 
calculating the median, BUT, because these alternative algorithms offer 
many advantages, they should be strongly considered (IF the current 
algorithm uses the usual full sort).

1. FAST ALGORITHMS
==================
Some fast algorithms are presented on the following page: 
http://ndevilla.free.fr/median/median/node20.html

I have imagined a different algorithm, too, although I never tested it. 
I will briefly describe the case with 2*n elements (2n+1 should be similar):
1. take first (n+1) elements
2. sort these elements using e.g. Quick Sort (if enough memory) or a 
different O(n log (n)) algorithm => this sorted array is x[0] to x[n] 
(has n+1 elements)
3. there are still (n-1) elements to process
4. WHILE (still unsorted elements) {
5. IF (next element > x[n]) => drop this element and drop x[0]; => we 
have (n-2) elements left to process; and (n) elements in the sorted 
array: x[1] to x[n]; => renaming array to x[0] to x[n-1];
6. ELSE IF (next element < x[0]) => drop this element; drop x[n], too; 
=> (n-2) elements left to process; (n) elements left in sorted array: 
x[0] to x[n-1];
7. ELSE: drop x[0] and x[n]; new array is x[1] to x[n-1] (n-2 elements); 
put new element in this array and sort this new array (because this 
array is already sorted, adding one value should proceed with log(n) 
speed) => will have now (n-1) elements => renaming (new x[0]) to (new 
x[n-1]);
8. REPEAT (instead of 'n' use last element in sorted array)

Initially: (n+1) values in sorted array; (n-1) still to process
1st iteration: (n) values; (n-2) values;
2nd iteration: (n-1) values; (n-3) values;
...
(n-2) iterations: 3 values; 1 value left to process;
last iteration: in the end, the sorted array will consist of only 2 
values => median = (x[0]+x[1])/2

This algorithm should have at most O(n/2 * log(n) -1 + log[(n/2 -1)!]) 
(where '!' is factorial), but it is probably faster. [I am not sure, 
that my calculation is actually accurate, so take this result carefully.]

It is NOT recursive, it computes actually the median accurate even for 
an even number of elements and does NOT consume as much memory as other 
algorithms.


2. OASIS Open Formula Format
=======================
The OASIS open formula document (2007-02-08) describes the median as:
> MEDIAN
> Summary: Returns the median (middle) value in the list.
> Syntax: MEDIAN( { NumberSequence X}+ )
> Returns: Number
> Semantics:
> MEDIAN logically sorts the numbers (lowest to highest).  If given an 
> odd number of values, MEDIAN returns the middle value. If given an 
> even number of values, MEDIAN returns the arithmetic average of the 
> two middle values.

This is a bit misleading. *MEDIAN logically sorts the numbers* is NOT 
really necessary to calculate the median. Actually, such an algorithm is 
notoriously slow. There are far better alternatives to get the median, 
and therefore a more neutral definition for the median seems warranted. 
Also, this definition is ambiguous for sequences like 1,1,2,3, where the 
two middle values are in effect 3 elements (two ones and the element 2).

DEFINITION
==========
This is an algorithmic definition, too, BUT leaves the option open which 
algorithm is chosen.
1. Odd numbers: The median is that particular element from the list that 
bisects the list into 2 halves, so that one can select 2 non-overlapping 
sets with an equal number of elements, so that all elements are greater 
or equal to the median in one set and smaller or equal in the 2nd set 
(the element that is the median is excluded).
2. Even numbers: The  median is the mean of those 2 values that bisect 
the list in two equal halves, so that we can find 2 non-overlapping sets 
with an equal number of elements, one set in which all elements are 
larger or equal to the median and the second where all elements are 
smaller or equal to the median (the 2 elements from which the median was 
calculated are excluded).

Test Case:
1,2,3 => median is 2 => 2 equal lists: a.) element 1; b.) element 3;
1,2,3,4 => 2 and 3 bisect the list; median is 2.5 => 2 equal lists: a.) 
element 1; b.) element 4;
1,1,3 => median is 1 => 2 equal lists: a.) element 1 (only one); b.) 
element 3;
1,1,2,2 => 1 and 2 bisect list; median is 1.5 => 2 equal lists: a.) 
element 1; b.) element 2 (both only once)
1,1,1 => median is 1 => 2 equal lists: a.) element 1; b.) element 1 
(both only once; 3rd element 1 is excluded as it is median)
1,1,1,3 => 1 and 1 bisect list => median is 1 => 2 equal lists: a.) 
element 1 (only once; the 2 other elements 1 have been excluded); b.) 
element 3;
1,1,1,1 => 1 and 1 bisect list => median is 1; 2 equal lists, each 
consisting of one element 1 (the other 2 elements of 1 have been 
excluded per definition);

Although this definition is somehow more complex, I believe it is more 
accurate and more algorithm neutral.


Kind regards,

Leonard Mada
_______________________________________________
gnumeric-list mailing list
gnumeric-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gnumeric-list

Reply via email to