Re: [PHP] Variance Function
At 4:19 AM + 1/12/07, Andrew Brampton wrote:

> - Original Message -
> From: Richard Lynch [EMAIL PROTECTED]
> To: php-general@lists.php.net
> Sent: Thursday, January 11, 2007 11:29 PM
> Subject: [PHP] Variance Function
>
> > Any advice?
> > Anybody got a good variance function to do what I'm trying to do?
>
> Hey, I've seen you solve many questions on this list, and I feel honoured to be able to try and help :)
>
> Well the solution that pops into my head is clustering. Since you have a set of numbers and 1 or more of them may be abnormal, then you can cluster them into one or more groups of similar values.

-snip-

Very impressive work Andrew.

You might also look into cluster analysis, which will also provide the degree of similarity between items. It's often surprising how hidden data falls out of such analyses.

Cheers,
tedd

--
---
http://sperling.com http://ancientstones.com http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP] Variance Function
Richard,

I think you are looking for data about both the variance and the standard deviation of the array of dates. If the dates are roughly normally distributed, they would form a bell-shaped curve, like the one you get when administering an intelligence test to a sample population. About two thirds (~68%) of the dates fall within +/- 1 standard deviation of the mean date, ~95% of the dates fall within 2 standard deviations, and so on.

I think this site does a good job of making the calculations and equations understandable: http://davidmlane.com/hyperstat/A16252.html

hth,
David
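As a rough illustration of the variance and standard deviation described above, a minimal PHP sketch might look like the following. The function names `variance` and `std_dev` are hypothetical (they come neither from the thread nor from PHP's stats extension), and the sketch assumes the population form of the variance, dividing by n rather than n - 1.

```php
<?php
// Sketch only: variance() and std_dev() are hypothetical helper names.

/** Population variance: the mean of the squared deviations from the mean. */
function variance(array $values): float
{
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('variance() needs at least one value');
    }
    $mean = array_sum($values) / $n;
    $sumSquares = 0.0;
    foreach ($values as $v) {
        $sumSquares += ($v - $mean) ** 2;
    }
    return $sumSquares / $n;
}

/** Standard deviation: the square root of the variance, in the same unit as the input. */
function std_dev(array $values): float
{
    return sqrt(variance($values));
}

// Example: Unix timestamps taken from Received: headers (values borrowed from the thread).
$fromdates = [1132565342, 1132565359, 1132565360, 1132565360];
printf("standard deviation: %.1f seconds\n", std_dev($fromdates));
```

One caveat with this approach: in a small sample, a single wildly wrong timestamp (say, one from 1970) pulls the mean towards itself and inflates the standard deviation, so a fixed "within 2 standard deviations" rule may fail to flag it.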
Re: [PHP] Variance Function
hi Richard,

your email was hard to follow, and I don't have real answers for you, but maybe my simpleton's view of the situation might offer you new avenues of thought to consider.

Richard Lynch wrote:
> It's been 20+ years since I took a stats class...

20 years ago I was mostly riding a push bike ... and I've never taken a stats class as such (bear this in mind :-)

> I didn't enjoy that class, and doubt if I remember 1% of what was covered.

you'd be 1 up on me ;-)

> ...
> And the sheer number of functions in the stats package is making my head spin.
> ...
> Some fools have their PC clock set to, like, 1970 or whatever. So let's be generous and assume their CMOS battery has died, and they haven't had a chance to change it. Fine. Deal with it.
>
> Okay, so *NOW* the algorithm is to do this:
> Take the Date: header, or Sent: header if no Date: header -> $whatdate
> Parse the Received: headers for the MTA date-stamps -> $fromdates[]
> Compare the values in $fromdates array with $whatdate.
> If the variance is too high, then ignore the $whatdate, and take the, errr, first?, average?, $fromdates[].

does it matter, so long as you're consistent in what you pick/use/calculate? I would tend to go for the oldest date in any given array of processed dates, as this would seem to be the closest to the likely actual send date.

> No, wait, maybe I should do a variance within the $fromdates in case some stupid MTA server has a bad clock?

I would start by setting out a few acceptable boundaries and 'knowns', for instance:

1. the first mail was sent no earlier than timestampX (so any timestamp encountered that is earlier than this is bogus.)
2. a maximum time an email could be expected to hang out at any given MTA whilst waiting to be moved on. (could be used to drop any outer timestamp [oldest or newest] from a given array of timestamps extracted from a mail when its difference to its 'neighbour' is greater than this agreed maximum period.)

> Any advice?

1. don't forget to normalize all dates found in a given mail's array of dates into UTC (if that is even an issue) before doing any actual processing/analysis of the collected dates.
2. I would tar the dates found in the Date: and/or Sent: headers with the same brush as any dates found in the Received: headers - your explanation suggests that no one header can be construed as being more reliable than another.
3. er, there is no 3, unless you consider 'buy a bigger brain' real advice ;-)

> Anybody got a good variance function to do what I'm trying to do?
> Am I on the entirely wrong path here?

dunno - but it's another typical Lynch problem that was just too interesting for me to let slide :-) please do keep us posted as to your progress!

> Sheesh! We may just ignore any obviously wrong dates, and process those by hand...

indeed, anything that is blatantly 'dodgy' with regard to dates is probably easier (and more accurate) to process by hand than it is to create some wizzo algo. for it - it's a matter of getting the number of 'dodgy' dates down to an acceptable level, of course.
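The boundary rules suggested above (a lower bound on any timestamp, a maximum delay between neighbouring timestamps, and normalizing everything before comparing) could be sketched roughly as follows. The function names, parameters, and thresholds here are illustrative assumptions, not anything posted in the thread. The sketch relies on `strtotime()` returning a Unix timestamp, which already accounts for the timezone offset given in the header.

```php
<?php
// Sketch only: to_timestamps() and plausible_dates() are hypothetical helpers.

/** Parse header date strings into Unix timestamps, skipping anything unparseable. */
function to_timestamps(array $dateStrings): array
{
    $out = [];
    foreach ($dateStrings as $s) {
        $ts = strtotime($s);   // false when the string cannot be parsed
        if ($ts !== false) {
            $out[] = $ts;
        }
    }
    return $out;
}

/**
 * Apply the two 'knowns':
 *   1. drop every timestamp earlier than $earliest;
 *   2. trim the oldest/newest timestamp while it sits more than
 *      $maxHopDelay seconds away from its neighbour.
 * Returns the surviving timestamps, sorted oldest first.
 */
function plausible_dates(array $timestamps, int $earliest, int $maxHopDelay): array
{
    $kept = [];
    foreach ($timestamps as $ts) {
        if ($ts >= $earliest) {
            $kept[] = $ts;
        }
    }
    sort($kept);

    // With only two values left there is no way to tell which one is wrong,
    // so trimming stops at two (an assumption of this sketch).
    while (count($kept) > 2 && $kept[1] - $kept[0] > $maxHopDelay) {
        array_shift($kept);    // oldest value is isolated
    }
    while (count($kept) > 2
        && $kept[count($kept) - 1] - $kept[count($kept) - 2] > $maxHopDelay) {
        array_pop($kept);      // newest value is isolated
    }
    return $kept;
}
```

The oldest surviving timestamp (`$kept[0]` of the result) would then be the candidate send date, in line with the preference for the oldest date stated above. Anything that leaves fewer than two survivors, or two that still disagree, is probably a case for processing by hand.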
Re: [PHP] Variance Function
- Original Message -
From: Richard Lynch [EMAIL PROTECTED]
To: php-general@lists.php.net
Sent: Thursday, January 11, 2007 11:29 PM
Subject: [PHP] Variance Function

> Any advice?
> Anybody got a good variance function to do what I'm trying to do?

Hey, I've seen you solve many questions on this list, and I feel honoured to be able to try and help :)

Well the solution that pops into my head is clustering. Since you have a set of numbers and 1 or more of them may be abnormal, then you can cluster them into one or more groups of similar values.

I quickly read up on clustering and coded a function to do something you might find useful.

cluster.php

<?php

function mean($arr) {
    return array_sum($arr) / count($arr);
}

function find_k_clusters($arr, $k) {

    if ($k <= 1)
        return array($arr);

    // Setup n clusters (and their means)
    $cluster = array();
    $clusterMean = array();
    foreach ($arr as $a) {
        $cluster[] = array($a);
        $clusterMean[] = $a;
    }

    // Populate an array of all the differences between pairs.
    // Only entries with $i > $j are stored, so each pair appears once.
    $diff = array();
    foreach ($clusterMean as $i => $c1) {
        $diff[$i] = array();
        foreach ($clusterMean as $j => $c2) {
            // Only loop until we get to j, so we don't duplicate results
            if ($i <= $j)
                break;
            $diff[$i][$j] = abs( $c1 - $c2 );
        }
    }

    while ( count($cluster) > $k ) {
        // Find the smallest value (hence the closest pair)
        $p1 = false;
        $p2 = false;
        foreach ($diff as $i => $diffi) {
            foreach ($diffi as $j => $d) {
                if ($p1 === false || $d < $diff[$p1][$p2]) {
                    $p1 = $i;
                    $p2 = $j;
                }
            }
        }

        // Add the 2nd cluster to the first, and remove the 2nd
        $cluster[ $p1 ] = array_merge( $cluster[ $p1 ], $cluster[ $p2 ] );
        $clusterMean[ $p1 ] = mean( $cluster[ $p1 ] );
        unset( $cluster[ $p2 ] );
        unset( $clusterMean[ $p2 ] );

        // Now recalc any diffs that would have changed
        unset( $diff[ $p2 ] ); // Remove the $p2 row

        // Remove the $p2 col (unset on $diff itself, not on a loop copy)
        foreach ( array_keys($diff) as $i ) {
            unset( $diff[$i][$p2] );
        }

        // Recalc the full $p1 row
        foreach ( array_keys($diff[$p1]) as $j ) {
            $diff[$p1][$j] = abs( $clusterMean[$p1] - $clusterMean[$j] );
        }

        // Recalc the $p1 col, since the mean of cluster $p1 has changed
        foreach ( array_keys($diff) as $i ) {
            if ( isset($diff[$i][$p1]) ) {
                $diff[$i][$p1] = abs( $clusterMean[$i] - $clusterMean[$p1] );
            }
        }
    }

    return array_values( $cluster );
}

$a = array( 1132565342, 0, 1132565360, 100, 1132565359, 1132565360, 1 );
print_r( find_k_clusters($a, 2) );

?>

-

Now you pass the function an array of values, and the number of clusters you wish to find. So for example entering the array
1132565342, 0, 1132565360, 100, 1132565359, 1132565360, 1
will return 2 clusters like so (the order of the values within each cluster may differ):
[0] => 1132565342, 1132565360, 1132565359, 1132565360
[1] => 0, 100, 1

It works by putting each value in its own cluster, and then finding and merging the two closest clusters (compared by their means) again and again until you are left with $k clusters. I haven't used the concept of variance.

Now it's just up to you to figure out which cluster is correct, and voila, you can throw away (or correct) the bad cluster values.

The problem might get more complex if you have, for example, dates such as 1970, 1990, 2006... because then the 1990 will be nearer to the 2006 and be clustered in the good cluster. If you have values such as this you might want to change this so that, instead of creating k clusters, it only clusters values within a suitable distance of each other (for example within 72 hours of each other, which is a max acceptable time for an email to be bounced around).

I hope this helps in some way. If not, it was fun quickly coding up a clustering algorithm :)

On reflection it might be a lot easier to not use clustering and instead just look at today's date, and throw away any value more than X days out.

Andrew Brampton
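The distance-based variant mentioned above (grouping only values that lie within, say, 72 hours of each other instead of asking for a fixed number of clusters) might be sketched as follows. This is not the function from the post: `cluster_by_gap` and `largest_cluster` are hypothetical names, and picking the largest cluster as the "good" one is an assumption of this sketch.

```php
<?php
// Sketch only: cluster_by_gap() and largest_cluster() are hypothetical helpers.

/**
 * Sort the values and start a new cluster whenever the gap to the
 * previous value exceeds $maxGap (for example 72 * 3600 seconds).
 */
function cluster_by_gap(array $values, int $maxGap): array
{
    if (count($values) === 0) {
        return [];
    }
    sort($values);
    $clusters = [[$values[0]]];
    for ($i = 1, $n = count($values); $i < $n; $i++) {
        if ($values[$i] - $values[$i - 1] > $maxGap) {
            $clusters[] = [];   // gap too wide: open a new cluster
        }
        $clusters[count($clusters) - 1][] = $values[$i];
    }
    return $clusters;
}

/** Return the cluster with the most members (the first one wins a tie). */
function largest_cluster(array $clusters): array
{
    $best = [];
    foreach ($clusters as $c) {
        if (count($c) > count($best)) {
            $best = $c;
        }
    }
    return $best;
}

$a = [1132565342, 0, 1132565360, 100, 1132565359, 1132565360, 1];
print_r(largest_cluster(cluster_by_gap($a, 72 * 3600)));
```

Because only neighbouring values are compared, a chain of timestamps each within 72 hours of the next ends up in one cluster even if its two ends are far apart. For a handful of Received: headers that is unlikely to matter, but it is worth knowing before reusing the idea on larger data sets.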