So there's N people with a given full name ('Stefan Magdalinski', for example).
There's L registered lobbyists, and V White House visitors.
(the population of America is P, so there's an L/P chance of being a lobbyist,
and a V/P chance of being a visitor, unless there's a way of reducing this?)

Assuming that lobbyists and visitors are independent (i.e. there is no true
correlation), the probability that, for a given name, both a lobbyist and
a visitor exist is given by:

    p(lobbyist in N) * p(visitor in N)

where

    p(visitor in N) = 1 - p(NOT visitor in N)
                    = 1 - (1 - (V/P))^N

However, we actually *know* whether there is a lobbyist with that name, so we
can skip p(lobbyist in N) and assign it as 0 or 1 (1 whenever we care, i.e.
whenever we have a lobbyist). Or should we do this for the visitor instead?
Or for whichever is the largest number? Someone who knows more stats than me
should probably do this.

So for John Adams, N=8568
http://futureboy.homeip.net/fsp/namefreq.fsp?firstName=john&lastName=adams&pop=300+million
P=300e6, and V is guessed at 1000.

    1 - (1 - (1000/300e6))^8568 = 0.03

i.e. there is a 3% chance that this match would occur by chance.

For V=10000 this goes up to about 25%; for V=100 it goes down to 0.3%.
For N=80000 this goes up to 23%; for N=800 it goes down to 0.3%.

Someone who actually remembers more stats than me, please check I'm not
barking up the wrong tree.

Dave, who doesn't trust the source he used for the name: being called John
and Adams are STRONGLY CORRELATED.
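
For anyone who wants to re-run the numbers, here is a minimal Python sketch of the 1 - (1 - V/P)^N calculation from this post. The function name `p_at_least_one` and the list of cases are my own, not from any existing code, and it rests on the same independence assumption, so treat it as a sanity check rather than a proper statistical test.

```python
def p_at_least_one(n_name, group_size, population):
    """Chance that at least one of n_name people sharing a name belongs to
    a group of group_size people, if each person independently has a
    group_size / population chance of being in that group."""
    return 1.0 - (1.0 - group_size / population) ** n_name


P = 300e6  # population figure used in the post

# (N, V) pairs: the John Adams guess, then the variations on V and N
cases = [(8568, 1000), (8568, 10000), (8568, 100), (80000, 1000), (800, 1000)]

for n, v in cases:
    print(f"N={n:>6}  V={v:>6}  p={p_at_least_one(n, v, P):.4f}")
```

Swapping in L for V gives the same calculation for lobbyists, if that turns out to be the more sensible way round.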
