Subject: OT: Tiangco's lecture on statistics
From: [EMAIL PROTECTED] (Mark Constantino)
Date: 2/23/03 10:09 PM Pacific Standard Time
Message-id: <[EMAIL PROTECTED]>

I offered once that a statistical sampling of votes over a normal distribution
of voters could predict, or accurately determine, the winner in a democratic
election.  The polls you might see on a news network such as CNN with Robin
Meade are always stated as "to an accuracy of within +/- 3%" or the like.  What
does this mean?

It occurs to me that this is still mumbo jumbo to most of the average reading
populace, so I'll try to get as in-depth as I can without confusing cuppa Joe
student.

First we should define several terms used in mathematics in the field of
statistics.  My engineering background required that I take one or two classes
in engineering statistics, which is a mathematician's way of trying to get as
in-depth as the math department can without confusing cuppa Joe engineering
student.  That is, I learned the algorithm without really paying too much
attention to the theory.  But now I need to explain why the above statement on
CNN on behalf of Robin Meade, as she breathes out the words with as much
confidence as I have popping my eyes out at the headlights on her classy
well-maintained Jaguar, should be taken with 100% confidence in CNN's data
analysis department.

I quote from an excellent website from which I distil this lecture:
http://www.bized.ac.uk/timeweb/crunching/crunch_process_expl.htm

Central tendency: any measure of the central tendency is an average. In
practice, three different types of average exist: The 'mode' is the most
frequently occurring value in a set of data.

The 'median' is the middle value in a set of data, when the data is arranged
in ascending order.

The 'mean' is the measure of central tendency that takes into account all of
the values in a set of data. There are different versions of the mean, but the
most commonly used is the 'arithmetic' mean, which is calculated by summing the
values in a dataset and dividing the result by the number of values the dataset
contains. The arithmetic mean is the most frequently used measure of central
tendency. It is the most easily understood 'average' and is relatively simple
to calculate. It is a very useful statistic for comparing countries, time
periods and so on. It is perhaps at its weakest when the dataset contains a few
outliers at one end of the range of data. The effect of these outliers is to
'pull' the mean towards them, thus making the mean unrepresentative of the
dataset as a whole.

The formula used to calculate the mean is:

X-bar = [sigma X(i)] / n

where X-bar = the mean of the observations
X(i) = the individual observations
n = the number of observations
sigma = the sum of
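As a quick sanity check, the three averages can be computed in a few lines of
Python; the dataset here is made up purely for illustration:

```python
from statistics import median, mode

# Hypothetical dataset, chosen only to illustrate the three averages.
data = [2, 3, 3, 5, 7, 10]

x_bar = sum(data) / len(data)  # arithmetic mean: [sigma X(i)] / n
mid = median(data)             # middle value of the sorted data
most = mode(data)              # most frequently occurring value

print(x_bar, mid, most)  # 5.0 4.0 3
```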

The mean average is often used to help interpret data if it is grouped. This
use of the so-called 'weighted average' is illustrated elsewhere.

The formula for calculating an index is:

Index = value / base value x 100

Notice that all indices are constructed using a base year. This is the
starting point for any index, because it provides the foundation for comparing
what is happening now with what happened in the base year. This base will
change from time to time, and part of the task in the worksheet on indices will
require you to carry out the re-basing of an index.
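The index formula, including the re-basing mentioned above, can be sketched as
follows; the price series and years are invented for the example:

```python
# Hypothetical price series; 2000 is chosen as the base year.
prices = {2000: 80.0, 2001: 88.0, 2002: 92.0}
base = prices[2000]

# Index = value / base value x 100, so the base year reads exactly 100.
index = {year: round(value / base * 100, 1) for year, value in prices.items()}
print(index)  # {2000: 100.0, 2001: 110.0, 2002: 115.0}

# Re-basing: make 2001 the new base year by dividing by its index value.
rebased = {year: round(v / index[2001] * 100, 1) for year, v in index.items()}
print(rebased)  # {2000: 90.9, 2001: 100.0, 2002: 104.5}
```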

The 'dispersion' of a set of values is the spread of the data. One measure of
dispersion is the 'range' - simply the difference between the highest and
lowest values in the dataset - but this only takes account of the two extremes
of the dataset. Sometimes the highest and lowest figures are stated;
alternatively, the difference between the two is quoted.

Another measure of dispersion is the 'quartile range', which is half the range
of the middle 50% of values. The quartile range is unaffected by extreme values
in the dataset and is useful when values are 'skewed'. However, as suggested
above, the quartile range does not take account of all values in a dataset.

The Mean Absolute Deviation: When looking at how items within a dataset differ
from the mean of that dataset, some observations will be below and some above
the arithmetic mean, and these differences must sum to zero. The mean absolute
deviation takes account of these differences by ignoring the sign: it measures
the average absolute deviation from the mean over all the observations.

Finally, the Standard Deviation. This measure of dispersion avoids the
disadvantages associated with the two earlier measures in that it takes account
of all the values in the dataset. The negative and positive differences from
the mean are taken account of by squaring the differences. The variance
measures the average squared deviation from the mean, and the standard
deviation is the square root of the variance. The size of the standard
deviation relative to the mean tells us how dispersed the items in the
population are around the average. Although it is complicated to work out and
may seem hard to visualise, the standard deviation (S.D.) is most valuable to
us when we are working with sample data.
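A short sketch of the three dispersion measures just described - mean absolute
deviation, variance, and standard deviation - on a made-up dataset:

```python
from math import sqrt

data = [4, 8, 6, 5, 3, 7, 9]  # hypothetical observations
n = len(data)
mean = sum(data) / n

# Signed deviations from the mean sum to zero, so we ignore the sign.
mad = sum(abs(x - mean) for x in data) / n

# Variance: average of the squared deviations; the S.D. is its square root.
variance = sum((x - mean) ** 2 for x in data) / n
sd = sqrt(variance)

print(mean, variance, sd)  # 6.0 4.0 2.0
```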

/end quote

The most important concept, in polling analysis at least, is the standard
deviation.  When we graph normal distributions what we see is the familiar
bell-shaped curve, regardless of the units of the data or the actual numbers
themselves.  That is, even though one bell-shaped curve might be narrower or
wider than another, the basic form is bell-shaped.  Converting the data from
the specific units used (apples, oranges, voters, yada) into units of standard
deviation, we derive only one bell-shaped curve.  That curve is called the
Z-distribution, and from it we derive Z scores, or n number of standard
deviations (as defined above) from the statistical mean.
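That unit conversion is just z = (x - mean) / S.D.; a minimal sketch, with the
mean and standard deviation invented for illustration:

```python
# Convert raw observations into Z-scores: z = (x - mean) / sd.
# The mean of 50 and standard deviation of 10 are assumed values.
mean, sd = 50.0, 10.0
observations = [40.0, 50.0, 65.0]

z_scores = [(x - mean) / sd for x in observations]
print(z_scores)  # [-1.0, 0.0, 1.5]
```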

The link http://www.bized.ac.uk/timeweb/crunching/crunch_process_expl.htm has
a very nice illustration of a standard distribution curve and a short Z table
(to three standard deviations).  I suggest that the table itself is built by
simple integration over x [data points, or more specifically data points
converted to standard deviations] of a particular bell curve formula, to find
the area under the curve up to a given point, but it's a hassle to illustrate
or prove.  Or not.  I can already see it just by looking at the table.
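That integral has no closed form, but any standard math library exposes it via
the error function; a sketch that reproduces the table's values, using the same
"area from the mean to z" convention as the table quoted below:

```python
from math import erf, sqrt

def area_mean_to_z(z):
    """Area under the standard normal curve between the mean and z."""
    # Integrating the bell curve from 0 to z gives erf(z / sqrt(2)) / 2.
    return 0.5 * erf(z / sqrt(2))

print(round(area_mean_to_z(1.0), 4))  # 0.3413
print(round(area_mean_to_z(2.0), 4))  # 0.4772
print(round(area_mean_to_z(3.0), 4))  # 0.4987
```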

I quote again from the website to get more of a familiarity with standard
distributions [my commentary and explanation]:

Summary of normal distribution
When we say that a particular population is normally distributed, we mean the
following:

1.)  The normal frequency curve shows that the highest frequency falls in the
centre of the chart, at the mean of the values in the distribution, with an
equal and exactly similar curve on either side of that centre. So, the most
frequent value in a normal distribution is the average, with half the values
falling below the average and half above it.
[The peak of the curve sits at the mean - the point below which exactly 50% of
the data points fall and above which the other 50% fall.  Calculus-ers will
note that the peak is the maximum, where the derivative (slope) is zero.  In
simpler terms, if the graph is of how many apples fall m meters from a tree,
with m on the x axis and the count of apples on the y axis, the top of the bell
curve marks the mean distance: half the apples fall nearer to the tree than
that distance, and half fall farther.]

2.)  The normal curve, which is often called a bell curve, is perfectly
symmetrical. So the mean (arithmetic average), the mode (most frequent value),
and the median (the middle value) all coincide at the centre of the curve -
which is the high point of the curve.

3.)  The further away any particular value is from the average, the less
frequent that value will be.

4.)  Because the two halves either side of the centre of the curve are
symmetrical, the frequency of values above and below the mean will match
exactly, provided that the distances between the values and the mean are
identical.

5.)  The total frequency of all values in the population will be contained by
the area under the curve. In other words, the total area under the curve
represents all the possible occurrences of that characteristic.

6.)  Certain areas under the curve therefore indicate the percentage of the
total frequency. For instance, 50% of the area under the curve lies to the 
left
of the mean, and 50% lies to the right. This means that 50% of all scores lie
to the left and 50% to the right. Equal areas under the curve represent equal
numbers in the frequency.

7.)  68% of a population lies within plus or minus one standard deviation of
the mean.
[This exact figure is derived from the Z table - the table of areas under the
bell curve of any normal distribution expressed in standard units, with the
number of standard deviations from the mean on the x axis and the relative
frequency of data points on the y axis.]

8.)  Approximately 95% of the items in a population are contained within two
standard deviations above and below the mean.
[this exact figure is derived from calculating the Z table]

9.)  Approximately 99% of the population are contained within three standard
deviations above and below the mean.
[derived and written down on the Z table]

10.)  Normal curves may have different shapes. What determines the overall
shape of the curve is the value of the mean and the standard deviation in the
population. But whatever the shape, these general characteristics remain the
same. 
[My explanation is clearer, but between the website's version and mine you'll
understand it better.  Here is the Z table from the website as well.]

Z-scores and Confidence Intervals
The Z-score is the standard normal unit of measurement. Tables have been
created from which we can read off particular distances from the mean and 
their
corresponding areas under the normal curve. So we can easily determine the
level of confidence that we want and find the distance appropriate to it.

z       0.00    0.05

0.0     0.0000  0.0199 
0.1     0.0398  0.0596 
0.2     0.0793  0.0987 
0.3     0.1179  0.1368 
0.4     0.1554  0.1736 
0.5     0.1915  0.2088 
0.6     0.2257  0.2422 
0.7     0.2580  0.2734 
0.8     0.2881  0.3023 
0.9     0.3159  0.3289 
1.0     0.3413  0.3531 
1.1     0.3643  0.3749 
1.2     0.3849  0.3944 
1.3     0.4032  0.4115 
1.4     0.4192  0.4265 
1.5     0.4332  0.4394 
1.6     0.4452  0.4505 
1.7     0.4554  0.4599 
1.8     0.4641  0.4678 
1.9     0.4713  0.4744 
2.0     0.4772  0.4798 
2.1     0.4821  0.4842 
2.2     0.4861  0.4878 
2.3     0.4893  0.4906 
2.4     0.4918  0.4929 
2.5     0.4938  0.4946 
2.6     0.4953  0.4960 
2.7     0.4965  0.4970 
2.8     0.4974  0.4978 
2.9     0.4981  0.4984 
3.0     0.4987  0.4989 

This table indicates in the left-hand column the distance from the mean in
standard deviation units, to one decimal place. This distance is the same as
the Z-score.  The figure in each cell of the table indicates the area under the
curve between the mean and that particular Z-score (the second column adds 0.05
to the Z-score in the first column). Note that this covers only one half of the
normal curve.

So in the first line, the area under the curve at a Z-score of 0.00 is 0.0000.
This means that when we are exactly on the mean, the area under the curve is 0,
because there is no distance between the mean and itself. If we move one column
to the right, to a Z-score of 0.05, the corresponding area under the curve
between the mean and this distance away from it is 0.0199.

Now check the area figure for a z-score of 1.00. Notice that it reads .3413.
This means that of all the scores under the curve, 34.13% of them will fall
between the mean and a Z-score of 1 on either side of the mean. To include all
the scores within 1 standard deviation of the mean on both sides, we double
that figure to 68.26%. Notice that we have been using the figure 68% as an
approximation of that value.
[To rephrase: a Z-score of 1 corresponds to the fraction of all the data points
lying between the mean and one standard deviation to one side of the mean - how
many apples fall between the average distance and one standard deviation beyond
it, as a fraction of all the apples being considered.  In the ideal
distribution that is exactly 34.13%.  In any particular finite sample it won't
be exact - for example, out of 1000 apples from a tree perhaps 330 fall within
one standard deviation beyond the mean where the table predicts 341 or 342 -
because you are considering a finite statistical sample, whereas the actual
Z-score is like pi or e, a number fixed in the ideal universe.  A voter poll is
a statistical sample that matches this Z-distribution to within a stated
confidence level.  You are confident that a statement like "33% of apples from
all apple trees fall farther than 2 meters away", obtained from your actual
sample apple tree graph, is within the bounds determined by the statement "+/-
3% of the actual data of all apple trees".  Though we have yet to explain how
to determine that last phrase.  Becomes clearer now, hey?]

The decimal numbers in the cells of the table also indicate the probability
that any value in a normal distribution will fall between the mean and the
particular Z-score that corresponds to that value. So if we wanted to know the
Z-score that gives us a confidence level of .75 (or 75%), we could find this
out easily.

Half of 75 percent is 37.5, which is expressed as a decimal as .375. Looking
at the table, we can find the number closest to that value in the cell that
corresponds to a Z-score of 1.15. The value in the table is .3749. This means
that in a normal distribution, we can be 75% certain that any value will fall
within the range of 1.15 standard deviations from the mean.
[To rephrase: the earlier apple example assumed a mean distance of 2 meters.
Let's say the standard deviation is 1 meter, for simplicity's sake.  Find a set
of apple trees where this is more or less true if you're that nitpicky - say
MacIntosh trees of 20 meters height or something like that.  It's rough, but
it's only an example to illustrate.  If you want the range of distances from
the archetypal apple tree within which you are certain that 75% of all apples
from that class of trees fall, you look at the Z table and find a score of 1.15
deviations, at 1 meter per deviation.  So with 2 meters as the mean, 1.15
deviations at 1 meter per deviation gives a range of 0.85 meters to 3.15 meters
from the tree.  75% of all apples should fall in that distance range.  This
example comes at it from a different angle than implied on the website; if you
don't see that, read on.  Otherwise, you understand completely.]

/end quote
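Reading the table backwards like this - from a desired confidence level to a
Z-score - is the inverse of the normal cumulative distribution; a sketch using
the Python standard library, with the 2-meter mean and 1-meter deviation being
the made-up apple-tree numbers from above:

```python
from statistics import NormalDist

def z_for_confidence(conf):
    """Z-score whose symmetric interval around the mean covers `conf`."""
    # conf/2 lies between the mean and z; the other 0.5 lies below the mean.
    return NormalDist().inv_cdf(0.5 + conf / 2)

z = z_for_confidence(0.75)
print(round(z, 2))  # 1.15

# Apple-tree example: mean distance 2 m, standard deviation 1 m (assumed).
mean, sd = 2.0, 1.0
print(round(mean - z * sd, 2), round(mean + z * sd, 2))  # 0.85 3.15
```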

Let's move from apples to voters in a sample election poll.  Say Ron Gonzales
versus Claire Bustamonte in an election for Mayor of San Jose.  We know that
the voting population in San Jose is about 1 million, and even though only 20%
of those people actually vote in real life, let's say that all 1 million vote
in the actual election.  We can take polls from certain districts within San
Jose, which aren't a normal distribution over the whole populace of San Jose
because of the demographics of the city - some parts are mostly Mexican
immigrants, some parts are mostly high tech workers, yada blah blah.  But say
we solve that problem and are reasonably sure our sample poll is representative
of a normal distribution over the whole voting population of San Jose (to
within +/- 1%, ha!).  Our poll asks 1000 voters to list the candidates in the
order in which they would vote for them in an election.  To give arbitrary
values, we get 490 for Ronzales, 420 for Busta-rhyme-ee, and 90 for the
write-ins at the top of the lists, or 49% Ronzales, 42% Busta, and 9% Other.
Then, to get more complicated, a secondary graph of the probabilities at the
2nd name on the voter lists: 50% for Busta, 42% for Ronzales, and 8% for Other.
To create a semblance of a normal distribution curve, we set the y axis to the
probability of voting for Ronzales and the x axis to the number of voters.  To
get even more complicated, we could weight the names and positions on the list,
put them through a tried and true formula which I don't know offhand, and
derive an adjusted probability of voting for Ronzales (to within +/- n% again,
ha!).

Let's go simple; it's only a half-glutic example.

Anyway, the real key in voter polls is the sample size and the accuracy of the
standard deviation.
[The cite is http://www.lordsutch.com/pol251/schacht-08-web.pdf ]

In our example of 1000 sampled out of 1000000, the margin of error at a given
confidence level follows from the Z table.  The standard formula is

margin = z x sqrt[ p(1 - p) / n ]

where p is the observed proportion, n is the sample size, and z is the Z-score
for the chosen confidence level.  For 95% confidence we need an area of 0.4750
on each side of the mean, which the Z table puts at about z = 1.96.  Taking the
worst case p = 0.5 and n = 1000, we get 1.96 x sqrt(0.25 / 1000), or about
0.031 -- right at the "+/- 3%" figure CNN quotes.  Notice that the population
size hardly enters at all: a sample of 1000 gives roughly the same margin
whether the population is one million or one hundred million.  In any case, a
1000-voter sample for a poll representative of almost any population size
imaginable is very, very accurate, as long as you can get the standard
deviation figure for the population very accurate and the distribution curve to
be as normal as possible.
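The polling margin of error can be sketched in a few lines; this is the
standard worst-case formula z x sqrt(p(1 - p)/n), not anything specific to
CNN's actual methodology:

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(n, p=0.5, conf=0.95):
    """Worst-case polling margin of error for a sample of size n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)  # about 1.96 for 95% confidence
    return z * sqrt(p * (1 - p) / n)

print(round(margin_of_error(1000) * 100, 1))   # 3.1 -- the familiar +/- 3%
print(round(margin_of_error(10000) * 100, 1))  # 1.0 -- 10x sample, ~3x gain
```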
     
I hope this explains things better than anywhere else you can find it.

Mark Tiangco-huangco






_______________________________________________
http://www.mccmedia.com/mailman/listinfo/brin-l
