Re: [PHP] Variance Function

2007-01-13 Thread tedd

At 4:19 AM + 1/12/07, Andrew Brampton wrote:

- Original Message - From: Richard Lynch [EMAIL PROTECTED]
To: php-general@lists.php.net
Sent: Thursday, January 11, 2007 11:29 PM
Subject: [PHP] Variance Function


Any advice?

Anybody got a good variance function to do what I'm trying to do?



Hey,
I've seen you solve many questions on this list, and I feel honoured 
to be able to try and help :)


Well the solution that pops into my head is clustering. Since you 
have a set of numbers and 1 or more of them may be abnormal, then 
you can cluster them into one or more groups of similar values.

-snip-


Very impressive work Andrew.

You might also look into cluster analysis, which will also provide 
the degree of similarity between items. It's often surprising how 
hidden data falls out of such analyses.


Cheers,

tedd

--
---
http://sperling.com  http://ancientstones.com  http://earthstones.com

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP] Variance Function

2007-01-12 Thread David Giragosian

Richard,

I think you are looking for data about both the variance and the standard
deviation of the array of dates. If the dates are roughly normally
distributed, they form a bell-shaped curve, like one gets when administering
an intelligence test to a sample population: about two thirds (~68%) of the
dates fall within +/- 1 standard deviation of the mean date, ~95% fall within
2 standard deviations, and so on.
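
A minimal sketch of the two quantities, assuming the dates have already been turned into Unix timestamps. The function names variance() and std_dev() and the sample values are made up for illustration, and this computes the population variance (dividing by n), which may or may not be the variant Richard wants:

```php
<?php
// Population variance: the mean of the squared deviations from the mean.
function variance(array $values): float {
    $n = count($values);
    if ($n === 0) {
        throw new InvalidArgumentException('variance() needs at least one value');
    }
    $mean = array_sum($values) / $n;
    $sumSquares = 0.0;
    foreach ($values as $v) {
        $sumSquares += ($v - $mean) * ($v - $mean);
    }
    return $sumSquares / $n;
}

// Standard deviation: square root of the variance, in the same unit as the input.
function std_dev(array $values): float {
    return sqrt(variance($values));
}

// Hypothetical timestamps, 60 seconds apart.
$dates = array(1168560000, 1168560060, 1168560120);
printf("variance: %.1f, std dev: %.2f seconds\n", variance($dates), std_dev($dates));
```

A single wildly wrong date (say, 1970) inflates both numbers enormously, so a large standard deviation flags that something is off but does not say which date is the bad one.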

I think this site does a good job of making the calculations and
equations understandable -- http://davidmlane.com/hyperstat/A16252.html .

hth,

David


Re: [PHP] Variance Function

2007-01-11 Thread Jochem Maas
hi Richard,

your email was hard to follow, and I don't have real answers for
you but maybe my simpleton's view of the situation might offer
you new avenues of thought to consider.

Richard Lynch wrote:
 It's been 20+ years since I took a stats class...

20 years ago I was mostly riding a push bike ... and I've never
taken a stats class as such (bear this in mind :-)

 
 I didn't enjoy that class, and doubt if I remember 1% of what was
 covered.

you'd be 1 up on me ;-)

 


...

 
 And the sheer number of functions in the stats package is making my
 head spin.
 

...

 
 Some fools have their PC clock set to, like, 1970 or whatever.  So
 let's be generous and assume their CMOS battery has died, and they
 haven't had a chance to change it.  Fine.  Deal with it.
 
 Okay, so *NOW* the algorithm is to do this:
 
 Take the Date: header, or Sent: header if no Date: header -> $whatdate
 
 Parse the Received: headers for the MTA date-stamps -> $fromdates[]
 
 Compare the values in $fromdates array with $whatdate.
 
 If the variance is too high, then ignore the $whatdate, and take
 the, errr, first?, average?, $fromdates[].

does it matter so long as you're consistent in what you pick/use/calculate?

I would tend to go for the oldest date in any given array of processed dates
as this would seem to be the closest to the likely actual send date.
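
One possible reading of the above, as a rough sketch: keep $whatdate when it sits close to the oldest MTA stamp, otherwise fall back to that oldest stamp. The function name pick_send_date() and the $maxSkew tolerance are assumptions for illustration; the 72-hour default is borrowed from the figure Andrew mentions elsewhere in this thread, not something Richard specified.

```php
<?php
// Choose a send date for one mail.
//   $whatdate:  timestamp from the Date: (or Sent:) header
//   $fromdates: timestamps parsed from the Received: headers
//   $maxSkew:   assumed tolerance in seconds (72 hours by default)
function pick_send_date(int $whatdate, array $fromdates, int $maxSkew = 259200): int {
    if (count($fromdates) === 0) {
        return $whatdate; // nothing to compare against
    }
    $oldest = min($fromdates);
    if (abs($whatdate - $oldest) <= $maxSkew) {
        return $whatdate; // the client clock looks sane
    }
    return $oldest; // the client clock is off, trust the first MTA instead
}

// Hypothetical mail whose sender clock is stuck at the epoch.
echo pick_send_date(0, array(1168560000, 1168560300)), "\n";
```

This deliberately avoids computing a variance at all: it only answers "is the header date plausible", which may be all the original problem needs.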

 
 No, wait, maybe I should do a variance within the $fromdates in case
 some stupid MTA server has a bad clock?

I would start by setting out a few acceptable boundaries and 'knowns'
for instance:

1. the first mail was sent no earlier than timestampX
(so any timestamp encountered that is earlier than this is bogus.)  
2. a maximum time an email could be expected to hang out at any given MTA whilst
waiting to be moved on.
(could be used to drop any outer timestamps [oldest & newest] from a given
array of timestamps extracted from a mail whose difference to its 'neighbour'
is greater than this agreed maximum period.)
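
The two 'knowns' above could be applied roughly like this. It is only a sketch: trim_timestamps(), $earliest and $maxHop are names invented here, and the values in the usage lines are placeholders rather than agreed limits.

```php
<?php
// Drop bogus timestamps using two "knowns":
//   $earliest: no mail can predate this timestamp (assumed value)
//   $maxHop:   longest plausible wait at a single MTA, in seconds (assumed value)
function trim_timestamps(array $stamps, int $earliest, int $maxHop): array {
    // Rule 1: anything before the first possible send time is bogus.
    $stamps = array_values(array_filter($stamps, function ($t) use ($earliest) {
        return $t >= $earliest;
    }));
    sort($stamps);

    // Rule 2: peel off the oldest or newest stamp while it sits
    // further than $maxHop from its neighbour.
    while (count($stamps) > 1) {
        $last = count($stamps) - 1;
        if ($stamps[1] - $stamps[0] > $maxHop) {
            array_shift($stamps);
        } elseif ($stamps[$last] - $stamps[$last - 1] > $maxHop) {
            array_pop($stamps);
        } else {
            break;
        }
    }
    return $stamps;
}

// Hypothetical stamps: one from 1970 and one from the far future.
$stamps = array(0, 1168560000, 1168560100, 1168560200, 2000000000);
print_r(trim_timestamps($stamps, 1000000000, 72 * 3600));
```

Note the weak spot: when only two stamps are left and they disagree, this drops the older one for no better reason than that it is checked first, so such mails are probably candidates for processing by hand.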

 
 Any advice?

1. don't forget to normalize all found dates in a given mail's array of dates
into UTC (if that is even an issue) before doing any actual processing/analysis
of the collected dates.

2. I would tar the dates found in the Date: and/or Sent: headers with the same
brush as any dates found in the Received: headers - your explanation suggests
that no one header could be construed as being more reliable than another.

3. er there is no 3, unless you consider 'buy a bigger brain' real advice ;-)
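
On point 1, a small sketch of why normalizing may turn out to be a non-issue: strtotime() honours the numeric offset in an RFC 2822 style date and returns a Unix timestamp, which is the same number whatever zone the header was written in. The helper name received_timestamp() and the sample header are invented for illustration, and the parsing assumes the date follows the last ';' in a Received: header.

```php
<?php
// Extract the date-stamp from a Received: header value and return it as a
// Unix timestamp, or false if no date can be parsed.
function received_timestamp(string $header) {
    $pos = strrpos($header, ';');
    if ($pos === false) {
        return false; // no date part at all
    }
    $date = trim(substr($header, $pos + 1));
    // Drop a trailing "(CST)" style comment before parsing.
    $date = trim(preg_replace('/\s*\([^)]*\)\s*$/', '', $date));
    return strtotime($date);
}

// Hypothetical header; hosts are placeholders.
$stamp = received_timestamp('from a.example by b.example; Thu, 11 Jan 2007 17:29:00 -0600 (CST)');
echo gmdate('Y-m-d H:i:s', $stamp), " UTC\n";
```

The same strtotime() call works for the Date: and Sent: values, so all the dates from one mail can go into a single array of timestamps and be treated alike, as point 2 suggests.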

 
 Anybody got a good variance function to do what I'm trying to do?
 
 Am I on the entirely wrong path here?

dunno - but it's another typical Lynch problem that was just too interesting
for me to let slide :-) please do keep us posted as to your progress!

 Sheesh!
 
 We may just ignore any obviously wrong dates, and process those by
 hand...

indeed anything that is blatantly 'dodgy' with regard to dates is probably
easier (and more accurately) processed by hand than by creating some wizzo
algo. for it - it's a matter of getting the number of 'dodgy' dates down to an
acceptable level of course.

 




Re: [PHP] Variance Function

2007-01-11 Thread Andrew Brampton
- Original Message - 
From: Richard Lynch [EMAIL PROTECTED]

To: php-general@lists.php.net
Sent: Thursday, January 11, 2007 11:29 PM
Subject: [PHP] Variance Function


Any advice?

Anybody got a good variance function to do what I'm trying to do?



Hey,
I've seen you solve many questions on this list, and I feel honoured to be 
able to try and help :)


Well the solution that pops into my head is clustering. Since you have a set 
of numbers and 1 or more of them may be abnormal, then you can cluster them 
into one or more groups of similar values.


I quickly read up on clustering and coded a function to do something you 
might find useful.


 cluster.php 
<?php

// Arithmetic mean of a non-empty array of numbers.
function mean($arr) {
    return array_sum($arr) / count($arr);
}

// Repeatedly merge the two closest clusters until only $k remain.
function find_k_clusters($arr, $k) {

    if ($k <= 1)
        return array($arr);

    // Set up n clusters (and their means), one per value
    $cluster = array();
    $clusterMean = array();
    foreach ($arr as $a) {
        $cluster[] = array($a);
        $clusterMean[] = $a;
    }

    // Populate an array of all the differences between pairs
    $diff = array();
    foreach ($clusterMean as $i => $c1) {
        $diff[$i] = array();
        foreach ($clusterMean as $j => $c2) {
            // Only loop until j reaches i, so we don't duplicate results
            if ($i <= $j)
                break;
            $diff[$i][$j] = abs( $c1 - $c2 );
        }
    }

    while ( count($cluster) > $k ) {

        // Find the smallest value (hence the closest pair).
        // Because only $j < $i is stored, $p1 > $p2 always holds.
        $p1 = false;
        $p2 = false;

        foreach ($diff as $i => $diffi) {
            foreach ($diffi as $j => $d) {
                if ($p1 === false || $d < $diff[$p1][$p2]) {
                    $p1 = $i;
                    $p2 = $j;
                }
            }
        }

        // Debug output: the pair of clusters being merged
        echo "$p1 $p2\n";
        //print_r($cluster);

        // Add the 2nd cluster to the first, and remove the 2nd
        $cluster[ $p1 ] = array_merge ($cluster[ $p1 ], $cluster[ $p2 ]);
        $clusterMean[$p1] = mean( $cluster[ $p1 ] );
        unset( $cluster[ $p2 ] );
        unset( $clusterMean[ $p2 ] );

        // Now recalc any diffs that would have changed
        unset( $diff[ $p2 ] ); // Remove the $p2 row

        // Remove the p2 col (unset on $diff itself, not on a loop copy)
        foreach ( array_keys($diff) as $i ) {
            unset( $diff[$i][$p2] );
        }

        // Recalc the full p1 row
        foreach ($diff[$p1] as $j => $d) {
            $diff[$p1][$j] = abs( $clusterMean[$p1] - $clusterMean[$j] );
        }

        // Recalc the p1 col, which lives in the rows after p1
        foreach ( array_keys($diff) as $i ) {
            if ( isset($diff[$i][$p1]) )
                $diff[$i][$p1] = abs( $clusterMean[$i] - $clusterMean[$p1] );
        }
    }

    return array_values( $cluster );
}

$a = array( 1132565342 , 0, 1132565360, 100, 1132565359, 1132565360, 1 );

print_r ( find_k_clusters($a, 2) ) ;

?>
-

Now you pass the function an array of values, and the number of clusters you 
wish to find. So for example entering the array

1132565342 , 0, 1132565360, 100, 1132565359, 1132565360, 1
will return 2 clusters like so:
[0] = 1132565342 , 1132565360, 1132565359, 1132565360
[1] = 0, 100, 1

It works by putting each value in its own cluster, and then finding the two 
closest clusters again and again until you are left with $k clusters. I 
haven't used the concept of variance.


Now it's just up to you to figure out which cluster is correct, and voila you 
can throw away (or correct) the bad cluster values.


The problem might get more complex if you have for example dates such as 
1970, 1990, 2006... Because then the 1990 will be nearer to the 2006 and be 
clustered in the good cluster. If you have values such as this you might 
want to change this so instead of creating k clusters, it only clusters 
values within a suitable distance of each other (for example within 72 hours 
of each other, which is a max acceptable time for an email to be bounced 
around).
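
A rough sketch of that distance-based variant, for one-dimensional values such as timestamps. Instead of asking for k clusters it sorts the values and starts a new cluster wherever the gap to the previous value exceeds a limit. The name cluster_by_gap() and the sample values are made up here; the 72-hour default is just the figure suggested above.

```php
<?php
// Group values so that neighbours in sorted order are at most
// $maxGap seconds apart (72 hours by default).
function cluster_by_gap(array $values, int $maxGap = 259200): array {
    sort($values);
    $clusters = array();
    $current = array();
    foreach ($values as $v) {
        // A gap larger than $maxGap closes the current cluster.
        if ($current && $v - end($current) > $maxGap) {
            $clusters[] = $current;
            $current = array();
        }
        $current[] = $v;
    }
    if ($current) {
        $clusters[] = $current;
    }
    return $clusters;
}

// Hypothetical stamps from 1970, 1990 and two from late 2005.
print_r(cluster_by_gap(array(1132565342, 0, 631152000, 1132565360)));
```

With this version the 1990 value ends up in a cluster of its own rather than being pulled into the good one. Deciding which cluster is the good one is still left open; taking the largest is one guess.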


I hope this helps in some way. If not it was fun quickly coding up a 
clustering algorithm :)


On reflection it might be a lot easier to not use clustering and instead 
just look at today's date, and throw away any value more than X days out.
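
That simpler approach might look something like this. It is a sketch under assumptions: within_days() and its parameters are names invented for illustration, and $now can be passed in so the check can be run against a reference date other than the current time.

```php
<?php
// Keep only the timestamps that lie within $maxDays of $now.
// $now defaults to the current time when omitted.
function within_days(array $stamps, int $maxDays, ?int $now = null): array {
    if ($now === null) {
        $now = time();
    }
    $limit = $maxDays * 86400; // days to seconds
    return array_values(array_filter($stamps, function ($t) use ($now, $limit) {
        return abs($t - $now) <= $limit;
    }));
}

// Hypothetical stamps, checked against a fixed reference time.
print_r(within_days(array(0, 100, 1132565342, 1132565360), 3, 1132565400));
```

This only works if the mail is processed at roughly the time it arrives. For an old archive, some other reference date would have to stand in for today.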


Andrew Brampton 

