Subject: OT: Tiangco's lecture on statistics
From: [EMAIL PROTECTED] (Mark Constantino)
Date: 2/23/03 10:09 PM Pacific Standard Time
Message-id: <[EMAIL PROTECTED]>
I offered once that a statistical sampling of votes over a normal distribution of voters could predict or accurately determine the winner in a democratic election. The polls you might see on a news network such as CNN with Robin Meade are always stated as "to an accuracy of within +/- 3%" or the like. What does this mean? It occurs to me that this is still mumbo jumbo to most of the average reading populace, so I'll try to get as in-depth as I can without confusing the cuppa-Joe student.

First we should define several terms used in the mathematical field of statistics. My engineering background required that I take a class or two in engineering statistics, which is a mathematician's way of trying to get as in-depth as the math department can without confusing the cuppa-Joe engineering student. That is, I learned the algorithms without really paying too much attention to the theory. But now I need to explain why the above statement on CNN -- which Robin Meade breathes out with as much confidence as I have popping my eyes out at the headlights on her classy well-maintained Jaguar -- should be taken with 100% confidence in CNN's data analysis department.

I quote from an excellent website, from which I distil this lecture:
http://www.bized.ac.uk/timeweb/crunching/crunch_process_expl.htm

Central tendency: any measure of the central tendency is an average. In practice, three different types of average exist:

The 'mode' is the most frequently occurring value in a set of data.
The 'median' is the middle value in a set of data, when the data is arranged in ascending order.
The 'mean' is the measure of central tendency that takes into account all of the values in a set of data.
There are different versions of the mean, but the most commonly used is the 'arithmetic' mean, which is calculated by summing the values in a dataset and dividing the result by the number of values the dataset contains. The arithmetic mean is the most frequently used measure of central tendency. It is the most easily understood 'average' and is relatively simple to calculate. It is a very useful statistic to compare countries, time periods and so on. It is perhaps at its weakest when there are within the dataset a few outliers at one end of the range of data. The effect of these will be to 'pull' the mean towards them, thus making the mean unrepresentative of the dataset as a whole. The formula used to calculate the mean is:

X-bar = [sigma X(i)] / n

where
X-bar = the mean of the observations
X(i) = the individual observations
n = the number of observations
sigma = the sum of

The mean average is often used to help interpret data if it is grouped. This use of the so-called 'weighted average' is illustrated elsewhere. The formula for calculating an index is:

Index = value / base value x 100

Notice that all indices are constructed using a base year. This is the starting point for any index, because it provides the foundation for comparing what is happening now with what happened in the base year. This base will change from time to time, and part of the task in the worksheet on indices will require you to carry out the re-basing of an index.

The 'dispersion' of a set of values is the spread of the data. One measure of dispersion is the 'range' - simply the difference between the highest and lowest values in the dataset - but this only takes account of the two extremes of the dataset. Sometimes the highest and lowest figures are stated; alternatively, the difference between the two is quoted. Another measure of dispersion is the 'quartile range', which is half the range of the middle 50% of values.
The quartile range is unaffected by extreme values in the dataset and is useful when values are 'skewed'. However, as suggested above, the quartile range does not take account of all values in a dataset.

The Mean Absolute Deviation: When looking at how items within a dataset differ from the mean of that dataset, some observations will be below and some above the arithmetic mean. These differences must sum to zero. The mean absolute deviation takes account of these differences by ignoring the sign. It measures the absolute deviation from the mean over all the observations.

Finally, the Standard Deviation. This measure of dispersion avoids the disadvantages associated with the two earlier measures in that it takes account of all the values in the dataset. The negative and positive differences from the mean are taken account of by squaring the differences. The size of the standard deviation relative to the mean tells us how dispersed the items in the population are from the average for the sample. The variance measures the average squared deviation from the mean; the standard deviation is the square root of this. Although it is complicated to work out and may seem hard to visualise, the standard deviation (S.D.) is most valuable to us when we are working with sample data.

/end quote

The most important concept, in polling analysis at least, is the standard deviation. When we graph normal distributions, what we see is the familiar bell-shaped curve, regardless of the units of data or the actual numbers themselves. That is, even though one bell-shaped curve might be narrower or wider than another, the basic form is bell-shaped. Converting the data from the specific units used (apples, oranges, voters, yada) into units of standard deviation, we derive only one bell-shaped curve. That's the curve that is called the Z-distribution, and from it we derive Z scores, or n number of standard deviations (as defined above) from the statistical mean.
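All of the measures defined above are easy to check numerically. Here is a minimal sketch in Python's standard library; the dataset is made up purely for illustration, and `pvariance`/`pstdev` treat it as a whole population rather than a sample:

```python
from statistics import mean, median, mode, pvariance, pstdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up dataset

# Central tendency
print(mode(data))    # most frequent value: 4
print(median(data))  # middle of the sorted values: 4.5
print(mean(data))    # arithmetic mean, sum / count: 5

# An outlier at one end 'pulls' the mean but barely moves the median
print(mean(data + [50]))    # mean jumps to 10
print(median(data + [50]))  # median only moves to 5

# Dispersion
print(max(data) - min(data))  # the range, using only the two extremes: 7

m = mean(data)
# Mean absolute deviation: average distance from the mean, signs ignored
print(sum(abs(x - m) for x in data) / len(data))  # 1.5

# Variance averages the squared deviations; the S.D. is its square root
print(pvariance(data))  # variance: 4
print(pstdev(data))     # standard deviation: 2
```

Note how the outlier demonstration matches the warning above: one value of 50 doubles the mean while the median hardly notices.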
The link http://www.bized.ac.uk/timeweb/crunching/crunch_process_expl.htm has a very nice illustration of a normal distribution curve and a short Z table (to three standard deviations). I suggest that the table itself uses simple integration over x [data points, or more specifically data points converted to standard deviations] of a particular bell curve formula to find the area under the curve up to a given point, but it's a hassle to illustrate or prove. Or not. I can already see it just by looking at the table.

I quote again from the website to get more of a familiarity with normal distributions [my commentary and explanation in brackets]:

Summary of normal distribution

When we say that a particular population is normally distributed, we mean the following:

1.) The normal frequency curve shows that the highest frequency falls in the centre of the chart, at the mean of the values in the distribution, with an equal and exactly similar curve on either side of that centre. So, the most frequent value in a normal distribution is the average, with half the values falling below the average and half above it. [Label the top-most point on the curve 50%, because that is where exactly 50% of the data points should fall below and 50% should fall above. Calculus-ers will note that it's the maximum, where the derivative (slope) is zero. In simpler terms, if the graph is of how many apples fall m meters from a tree, with m on the x axis and how many apples on the y axis, the top of the bell curve will show the exact 50% point of apples falling either nearer to the tree or farther from it.]

2.) The normal curve, which is often called a bell curve, is perfectly symmetrical. So the mean (arithmetic average), the mode (most frequent value), and the median (the middle value) all coincide at the centre of the curve - which is the high point of the curve.

3.)
The further away any particular value is from the average, the less frequent that value will be.

4.) Because the two halves on either side of the centre of the curve are symmetrical, the frequency of values above and below the mean will match exactly, provided that the distances between the values and the mean are identical.

5.) The total frequency of all values in the population will be contained by the area under the curve. In other words, the total area under the curve represents all the possible occurrences of that characteristic.

6.) Certain areas under the curve therefore indicate the percentage of the total frequency. For instance, 50% of the area under the curve lies to the left of the mean, and 50% lies to the right. This means that 50% of all scores lie to the left and 50% to the right. Equal areas under the curve represent equal numbers in the frequency.

7.) 68% of a population lies within plus or minus one standard deviation of the mean. [This exact figure is derived by calculating the Z table - the table of areas under the bell curve of any normal distribution, with the fraction of data points as the tabulated value and the number of standard deviations from the mean as the index.]

8.) Approximately 95% of the items in a population are contained within two standard deviations above and below the mean. [This exact figure is derived from calculating the Z table.]

9.) Approximately 99% of the population is contained within three standard deviations above and below the mean. [Derived and written down on the Z table.]

10.) Normal curves may have different shapes. What determines the overall shape of the curve is the value of the mean and the standard deviation in the population. But whatever the shape, these general characteristics remain the same. [My explanation is clearer, but having read this you'll understand it better. Here is the Z table from the website as well.]

Z-scores and Confidence Intervals

The Z-score is the standard normal unit of measurement.
Tables have been created from which we can read off particular distances from the mean and their corresponding areas under the normal curve. So we can easily determine the level of confidence that we want and find the distance appropriate to it.

z     0.00    0.05    ...
0.0   0.0000  0.0199
0.1   0.0398  0.0596
0.2   0.0793  0.0987
0.3   0.1179  0.1368
0.4   0.1554  0.1736
0.5   0.1915  0.2088
0.6   0.2257  0.2422
0.7   0.2580  0.2734
0.8   0.2881  0.3023
0.9   0.3159  0.3289
1.0   0.3413  0.3531
1.1   0.3643  0.3749
1.2   0.3849  0.3944
1.3   0.4032  0.4115
1.4   0.4192  0.4265
1.5   0.4332  0.4394
1.6   0.4452  0.4505
1.7   0.4554  0.4599
1.8   0.4641  0.4678
1.9   0.4713  0.4744
2.0   0.4772  0.4798
2.1   0.4821  0.4842
2.2   0.4861  0.4878
2.3   0.4893  0.4906
2.4   0.4918  0.4929
2.5   0.4938  0.4946
2.6   0.4953  0.4960
2.7   0.4965  0.4970
2.8   0.4974  0.4978
2.9   0.4981  0.4984
3.0   0.4987  0.4989

This table indicates in the left-hand column the distance from the mean in standard deviation units, to one decimal place. This distance is the same as the z-score. The figure in each cell of the table indicates the area under the curve between the mean and that particular Z-score. Note that this is only for one half of the normal curve.

So in the first line, the area under the curve at a Z-score of 0.00 is 0.0000. This means that when we are exactly on the mean, the area under the curve is 0, because there is no distance between the mean and itself. If we move one column to the right, to a Z-score of 0.05, the corresponding area under the curve between the mean and this distance away from it is 0.0199.

Now check the area figure for a z-score of 1.00. Notice that it reads .3413. This means that of all the scores under the curve, 34.13% of them will fall between the mean and a Z-score of 1 on either side of the mean. To include all the scores within 1 standard deviation of the mean on both sides, we double that figure to 68.26%. Notice that we have been using the figure 68% as an approximation of that value.
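As I suggested above, the body of this table really is just the integral of the bell-curve formula. A quick sketch in Python using the standard library's `math.erf` (the brute-force scan for the 75% z-score is only for illustration, not how real tables are built):

```python
import math

def area_mean_to_z(z):
    """One-sided area under the standard normal curve between the
    mean and z standard deviations away -- the body of the Z table."""
    return 0.5 * math.erf(z / math.sqrt(2))

# Reproduce a few table entries
print(round(area_mean_to_z(0.05), 4))  # 0.0199
print(round(area_mean_to_z(1.00), 4))  # 0.3413
print(round(area_mean_to_z(3.00), 4))  # 0.4987

# Doubling the z = 1 figure gives the familiar 68% rule
print(round(2 * area_mean_to_z(1.00), 4))  # 0.6827

# Reverse lookup: which z gives 75% two-sided confidence (0.375 per side)?
zs = [i / 100 for i in range(301)]  # 0.00, 0.01, ..., 3.00
z75 = min(zs, key=lambda z: abs(area_mean_to_z(z) - 0.375))
print(z75)  # 1.15
```

The relation used here, area = erf(z / sqrt(2)) / 2, is just the integral of the standard normal density from 0 to z, which is exactly what the table tabulates.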
[To rephrase: the area for a z score of 1 is the fraction of all the data points falling within one standard deviation of the mean, on one side of the mean - how many apples fall farther from the tree than the average (but within one standard deviation of it), as a fraction of the total number of apples being considered. Ideally that should be 34.13% of all the apples, but it won't be exactly that in any particular statistical sample -- for example, out of 1000 apples from a tree the exact same size as our experiment tree, 330 might fall within the one-standard-deviation distance where the table predicts 341 or 342 -- because you are considering a finite sample, whereas the ideal z-table value is like pi or e, a number fixed in the ideal universe. A voter poll is a statistical sample that matches this z distribution only to within a stated confidence level. You are confident that a statement like "33% of apples fall farther than 2 meters away", obtained from your actual sample tree, is within the bounds given by the phrase "+/- 3% of the actual data of all apple trees" - though we have yet to explain how to determine that last phrase. Becomes clearer now, hey?]

The decimal numbers in the cells of the table also indicate the probability that any value in a normal distribution will fall between the mean and the particular Z-score that corresponds to that value. So if we wanted to know the Z-score that gives us a confidence level of .75 (or 75%), we could find this out easily. Half of 75 percent is 37.5, which expressed as a decimal is .375. Looking at the table, we can find the number closest to that value in the cell that corresponds to a Z-score of 1.15. The value in the table is .3749. This means that in a normal distribution, we can be 75% certain that any value will fall within the range of 1.15 standard deviations from the mean.

[To rephrase: the above rephrasing assumed that the mean distance is 2 meters.
Let's say that the standard deviation is 1 meter, for simplicity's sake. Find a set of apple trees where this is more or less true, if you're that nitpicky - say MacIntosh trees of 20 meters height or something like that. It's problematic, but it's only an example to illustrate. If you want to find the distances from the archetype apple tree within which you are certain that 75% of all apples from that class of trees fall, you would look at the z table and find a score of 1.15 deviations, at 1 meter per deviation. So you know 2 meters is the mean, and 1.15 deviations at 1 meter per deviation gives a range of 0.85 meters to 3.15 meters from the tree. 75% of all apples should fall in that distance range. This example is coming at it from a different angle than implied in the website; if you don't see that, read on. Otherwise, you understand completely.]

/end quote

Let's move from apples to voters in a sample election poll. Say Ron Gonzales versus Claire Bustamonte in an election for Mayor of San Jose. We know that the voting population in San Jose is about 1 million, and even though only 20% of those people actually vote in real life, let's say that all 1 million vote in the actual election. We can take polls from certain districts within San Jose, which aren't in a normal distribution for the whole populace of San Jose because of the demographics of the city: some parts are mostly Mexican immigrants, some parts are mostly high-tech workers, yada blah blah. But say we solve that problem and are reasonably sure our sample poll is representative of a normal distribution over the whole voting population of San Jose (to within +/- 1%, ha!). Our poll asks 1000 voters to list the candidates in the order in which they would vote for them.
To give arbitrary values, we get 490 for Ronzales, 420 for Busta-rhyme-ee, and 90 for the write-ins at the top of the lists, or 49% Ronzales, 42% Busta, and 9% Other. Then, maybe to get even more complicated, a secondary graph of probability from the 2nd name on the voter lists: 50% for Busta, 42% for Ronzales, and 8% for Other. To create a semblance of a normal distribution curve, we set the y axis to the probability of voting for Ronzales and the x axis to the number of voters. To get even more complicated, we could weight the names and positions on the list, put them through a tried and true formula which I don't know offhand, and derive an adjusted probability of voting for Ronzales (to within +/- n% again, ha!). Let's go simple; it's only a half-glutic example.

Anyway, the real key in voter polls is the sample size and the accuracy of the standard deviation. [The cite is http://www.lordsutch.com/pol251/schacht-08-web.pdf ] In our example, 1000 out of 1,000,000, and assuming one standard deviation is the difference between voting for Ronzales or Busta-rhyme-ee or another, it turns out that the relevant z score is just about the square root of the natural log of the ratio of overall population to sample size. Taking the square root of ln(10^6) - ln(10^3) [that is, ln(10^6 / 10^3) = ln(1000)], we get 2.6282 -- and looking at the z table for a z score of about 2.63, we find the area to be about 0.4957, which leaves 1.00 - 2(0.4957) = 0.0086 in the two tails, or a confidence of +/- 0.86%. My exact figures may well be wrong. In any case, a sample size of 1000 for a poll is representative of almost any population size imaginable and is very, very accurate, as long as you can get the standard deviation figure for the population very accurate and the distribution curve to be as normal as possible. I hope this explains things better than anywhere else you can find it.
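For what it's worth, the "+/- 3%" that news networks actually quote comes not from my log-ratio heuristic above but from the textbook margin-of-error formula for a sample proportion, z * sqrt(p(1-p)/n), which depends on the sample size but (once the population is large) not on the population size at all. A sketch with our made-up poll numbers:

```python
import math

def margin_of_error(p, n, z=1.96):
    """Margin of error for a sample proportion p out of n respondents,
    at the confidence level implied by z (1.96 ~ 95% two-sided)."""
    return z * math.sqrt(p * (1 - p) / n)

# 49% for Ronzales from a 1000-voter sample, at 95% confidence
moe = margin_of_error(0.49, 1000)
print(round(100 * moe, 1))  # 3.1 -- the familiar "+/- 3%"

# Note that the population size (1 million here) never enters the formula,
# which is why 1000 respondents suffice for almost any population.
```

This backs up the point above: a well-drawn sample of 1000 pins down the result to about three percentage points regardless of how big the electorate is.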
Mark Tiangco-huangco _______________________________________________ http://www.mccmedia.com/mailman/listinfo/brin-l