Paul McGuire wrote: > <[EMAIL PROTECTED]> wrote in message > news:[EMAIL PROTECTED] >> Paul McGuire wrote: >>> <[EMAIL PROTECTED]> wrote in message >>> news:[EMAIL PROTECTED] >>>> [snip] >>>> >>>> no sort() is needed to calculate the median of a list. >>>> >>>> you just need one temp var. >>>> >>> Ok, I'll bite. How do you compute the median of a list using just a >>> single >>> temp var? >>> >>> -- Paul >> hi Paul; well when this was a stats-class assignment (back when pascal >> was popular :) i just stepped through the vector and compared it >> >> (pseudo-code) >> >> ptr p = [with values]. >> >> fun median { >> var x = 0. >> while( *p++) { >> if( (*p) > x) x = *p. >> } >> return x. >> } >> >> of course, pascal is more verbose but that's median() >> > > No, that's the maximum. The median value is the value that is in the middle > of the list when the list is sorted. Many analyses prefer median to mean > (also known as "average") because the median is less sensitive to wild > outlier points. > > My original question was in response to your post, that sort() wasn't > required but only a temp variable. I am very interested in seeing your > solution that does not require the data to be sorted. (This is not just an > academic exercise - given a large historical data set, sorting the data is > one of the costliest parts of computing the median, and I would greatly > appreciate seeing an alternative algorithm.)
Here's a K&R C function I wrote almost 20 years ago. It's a general purpose quantile. The essential idea is to choose an element at random (thus mostly avoiding perverse behavior with already sorted data) and divide the set into two pieces around it. Then you figure out which piece contains the quantile you want, and what quantile it is within that piece, and recurse. When you see enough identical elements in the piece you're processing, it's done. In the extreme case you'll get down to one element. ixrand(n) generates a random integer in the range 0..n-1. I think that's the only nonstandard function used. The style is torqued by what Unisoft C could and couldn't optimize: I wouldn't write it quite like that today. One of my students pointed out that all of the recursion is tail recursion so it should be easy to flatten. Left as an exercise to the reader... Somewhere, in Knuth I think, I saw a somewhat similar algorithm that promised a little better performance by estimating the median from a sample of the data, breaking the data up there, and then looking for a quantile (statistically guaranteed to be) close to the min or max of one of the subsets. It shouldn't be hard to recode in Python, Forth, or whatever. That wasn't my purpose in the exercise that started the thread though: I wanted to see if I could import modules good enough to do the job from public sources. In Python I could, and the entire image processing program is 15 lines. In Forth I couldn't. Anyway, here it is: /* Find the nth from the minimum value in an array */ /* Monte Carlo method intended for finding medians */ /* 2/13/85 jpd */ /* For random data, this routine takes about */ /* 2.6*numdata + O( log( numdata )) comparisons */ /* If the data is tightly clustered about the mean, */ /* there is a speedup; it may take as few as /* 0.5*numdata comparisons. */ /* There is a slight penalty if the array is completely */ /* or partially sorted; it is at most 25%. */ /* NTH will be nthi, nths, etc., depending on DATATYPE */ NTH( data, numdata, n ) DATATYPE data[]; /* Data array (will be scrambled on return) */ int numdata; /* lemgth of data array */ int n; /* index if item to find: 1 <= n <= numdata */ { register DATATYPE boundary, thisdata; register DATATYPE *lowp, *highp; DATATYPE v1, v2; int nlowbin; lowp = data; /* Init data pointer */ v1 = data[ ixrand( numdata )]; { register DATATYPE v1r = v1; int nc = 1 + numdata - n; /* "Complement" of n */ if( nc > n ) highp = lowp + nc; else highp = lowp + n; /* Limit to test for done */ /* Scan for the first point which doesn't match the boundary point. If we encounter enough matching points, the boundary is the answer */ while( *lowp++ == v1r ) { if( lowp >= highp ) return( v1r ); } v2 = *--lowp; /* Back up to get point */ } boundary = ( v1 >> 1 ) + ( v2 >> 1 ); /* Beware overflows */ highp = data + numdata; /* Now process the whole thing */ thisdata = *lowp; /* Prime the pump */ if( v2 < v1 ) { /* Bin 2 is low bin */ for( ; lowp < highp; thisdata = *lowp ) { if( thisdata <= boundary ) { /* Bin 2 */ *lowp = *--highp; /* Exchange */ *highp = thisdata; } else ++lowp; /* Data point in right place */ } nlowbin = numdata - ( lowp - data ); if( nlowbin >= n ) return( NTH( highp, nlowbin, n )); else return( NTH( data, lowp - data, n - nlowbin )); } else { /* Primary bin is low bin */ for( ; lowp < highp; thisdata = *lowp ) { if( thisdata > boundary ) { /* Bin 2 */ *lowp = *--highp; /* Exchange */ *highp = thisdata; } else ++lowp; /* Don't move point */ } nlowbin = ( lowp - data ); if( nlowbin >= n ) return( NTH( data, nlowbin, n )); else return( NTH( highp, numdata - nlowbin, n - nlowbin )); } } -- John Doty, Noqsi Aerospace, Ltd. -- Specialization is for robots. -- http://mail.python.org/mailman/listinfo/python-list