RE: Chi-square chart in Excel
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Ronny Richardson
Sent: Wednesday, February 20, 2002 7:29 PM
To: [EMAIL PROTECTED]
Subject: Chi-square chart in Excel

Can anyone tell me how to produce a chart of the chi-square distribution in Excel? (I know how to find chi-square values but not how to turn those into a chart of the chi-square curve.)

Ronny Richardson
---
Excel does not have a function that gives the chi-square density. The following might be helpful regarding future graphs. It is a fraction of a larger package I am preparing, and it is awkward to present in .txt format.

Distribution               Density        Cumulative     Inverse
Beta                                      BETADIST       BETAINV
Binomial                                  BINOMDIST      CRITBINOM
Chi-Square                                CHIDIST        CHINV
Exponential                EXPONDIST      EXPONDIST
F                                         FDIST          FINV
Gamma                      GAMMADIST      GAMMADIST      GAMMAINV
Hypergeometric             HYPGEOMDIST
Log Normal                                LOGNORMDIST    LOGINV
Negative Binomial          NEGBINOMDIST
Normal (with parameters)   NORMDIST       NORMDIST       NORMINV
Normal (z values)                         NORMSDIST      NORMSINV
Poisson                    POISSON
t                                         TDIST          TINV
Weibull                    WEIBULL

You have to build a column (say B) of x values. Then build an expression in column C that calculates the chi-square density, given the x value in column B and the df value in A1. It would be

  =EXP(-($A$1/2)*LN(2) - GAMMALN($A$1/2) + (($A$1/2)-1)*LN(B1) - B1/2)

You can drag this cell down column C for each x value. Now build a smoothed scatter plot with the x values in column B as series 1's X values and the densities in column C as its Y values.

DAHeiser
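A quick way to sanity-check the column-C values outside Excel is a short Python sketch of the same log-form density, with math.lgamma standing in for GAMMALN. This is only an illustration added here; the df value and the x grid are arbitrary choices, not anything from the original post.

    import math

    def chisq_density(x, df):
        # log form mirrors the spreadsheet formula: avoids overflow in Gamma(df/2)
        log_f = (-(df / 2.0) * math.log(2.0)
                 - math.lgamma(df / 2.0)
                 + (df / 2.0 - 1.0) * math.log(x)
                 - x / 2.0)
        return math.exp(log_f)

    df = 5                                    # value you would keep in cell A1
    xs = [0.1 * i for i in range(1, 201)]     # "column B": x grid from 0.1 to 20
    ys = [chisq_density(x, df) for x in xs]   # "column C": density values to plot
    print(xs[49], ys[49])                     # e.g. the density at x = 5.0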
An Interesting Student Problem
Was Darwin's statement, "It has been experimentally proved that if a plot of ground be sown with one species of grass, and a similar plot be sown with several distinct genera of grasses, a greater number of plants and a greater weight of dry herbage can be raised," a valid statement?

In the 25th January 2002 edition of Science, on page 639, Andy Hector ([EMAIL PROTECTED]) and Rowan Hooper ([EMAIL PROTECTED]) describe the work conducted at Woburn Abbey in 1824, which was the basis of Darwin's statement. The article describes the plots used and gives a full list of references. The results described in the 1824 source have been put into an interesting table at (www.sciencemag.org/cgi/content/full/295//639/DC1). Basically there were 242 experimental plots set up at Woburn Abbey, each 4 square feet, enclosed by boards set in cast iron frames, with leaded tanks for aquatic species. This work predates modern methods of experimental design and statistical analysis.

Given the table, the problem given to students would be:
1) What would be your hypothesis to test?
2) What statistical methods would you apply given the actual missing data and unknowns (from the table)? Apply them and state your conclusions.
3) What would be a modern experimental design (for 242 plots) given the variables suggested by the table (and the additional information in the article and, if available, in the references)?

The interesting aspect is that there are many ways to do 1), 2) and 3). The interest would be in verbal presentations by each student. As Hector and Hooper point out, how ecosystems operate is one of the hottest and most active areas in ecology today. This really is a tremendous area for the application of statistics.

DAHeiser
RE: Interpreting multiple regression Beta is only way?
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Wuzzy
Sent: Monday, February 04, 2002 4:14 PM
To: [EMAIL PROTECTED]
Subject: Re: Interpreting multiple regression Beta is only way?

... I've heard of ridge regression, will try to investigate this area more...
---
Ridge analysis perturbs the coefficient set obtained from least squares regression. It does so in a way that shrinks the main-effects coefficient values. The result is no longer the minimum-variance solution for the residuals, but a more tolerant solution for the coefficients. It was originated by Hoerl back in the 1950's.

Basically its effectiveness is in its ability to make reasonable (and better) predictions of Y values from X values outside of the data set used to obtain the coefficient values. The other advantage is that it tends to bring coefficient values closer to reality in a physical sense. The biggest disadvantage is that there is no logical stopping point at some optimum set of coefficient values. This is why it is seldom used except in industry, where valid prediction is more important than the correct model.

DAHeiser
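As a rough illustration of the shrinkage idea (not from the original exchange; the data, the seed and the ridge constant k below are made up), a ridge fit simply adds kI to X'X before solving the normal equations:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)        # nearly collinear with x1
    y = 1.0 + 2.0 * x1 + 2.0 * x2 + rng.normal(size=n)

    X = np.column_stack([np.ones(n), x1, x2])
    ols = np.linalg.solve(X.T @ X, X.T @ y)    # ordinary least squares

    k = 0.1                                    # ridge constant; no "right" stopping point
    ridge = np.linalg.solve(X.T @ X + k * np.eye(3), X.T @ y)

    print("OLS  :", ols)    # often wildly offsetting coefficients on x1 and x2
    print("Ridge:", ridge)  # shrunken, usually more physically plausible values

For simplicity this sketch penalizes the intercept along with the slopes; a production version would typically center the data and leave the intercept alone.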
RE: how to adjust for variables
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Wuzzy
Sent: Thursday, January 24, 2002 3:30 PM
To: [EMAIL PROTECTED]
Subject: Re: how to adjust for variables

I find it extremely difficult to interpret multivariate equations. Are there any good books on conceptualizing the equation? For instance: if you are assessing whether protein, fat, or carbohydrate is important in obesity independent of calories, do you use the following model:

  Disease = carb + protein + fat + calories

and if so, isn't the word "calories" meaningless, as it is equal to the sum of the other three? Perhaps it should not be included in the model. I have read of studies where they will use everything except carb, as follows:

  Disease = protein + fat + calories

and from here you can determine what effect substituting carb with protein or fat will have on the disease. It is very difficult to conceptualize, and very difficult to understand what the word "calories" means anymore in a multivariate model. It seems that if you use univariate adjusted values it is easier to model. I have very little experience in statistics, as everyone can tell. Just commenting, no real question here. I will probably understand it with time.
---
The points that Wuzzy makes here illustrate one of the difficulties of model building. There are two ways to adjust. I will illustrate below using totally artificial numbers, because I don't have my handbooks in front of me with realistic values.

Assume a subject consumed 120 grams of fat (with a calorie content of 20 cals/gram), 211 grams of protein (with a calorie content of 15 cals/gram) and 350 grams of starch (with a calorie content of 33 cals/gram), for a total of 17,115 calories.

I. Adjust to a total calorie level of 10,000 calories per day: multiply each intake by 10,000/17,115, which gives an adjusted fat consumption of 70.1 grams, protein of 123.3 grams and starch of 204.5 grams. The three X values then would be 70.1, 123.3 and 204.5 rather than the actual values of 120, 211 and 350 grams.

II. Adjust to standard individual calorie levels (15, 10 and 20 cals/gram): multiply 120 by 20/15 to give a standard fat intake of 160.0 grams. Do the same for the others to get 211*15/10 = 316.5 and 350*33/20 = 577.5. The three X values would then be 160.0, 316.5 and 577.5 rather than the actual values of 120, 211 and 350 grams.

This of course ignores the really important attributes that may correlate with the disease, such as:

Fat
   Saturated
   Unsaturated
   Animal
   Vegetable
   Calories (heat of combustion, higher or lower)
   Thermal processing
   Chemical and physical modifications (i.e. hydrogenation, extraction, low temperature filtering, etc.)
Protein
   Animal
   Vegetable
   Fraction of added chemicals (i.e. lysine)
   Thermal processing
   Calories
Carbohydrates
   Starch from cereals
   Cellulose
   Hydrolyzed cellulose
   Starches from animal sources
   Water soluble sugars (sucrose)
   Polysaccharides
   Chemically generated sugars (i.e. corn starch derived sweeteners)
   Thermally processed starches
   Calories (heat of combustion)
   Chemically esterified starches

If you are going to measure the amount of fat by a calorie measure, this excludes all the other attributes.

DAHeiser
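The two adjustments are easy to mis-compute by hand, so here is a small Python sketch (added here, using the same artificial intake and calorie-density numbers as in the message above) that reproduces them:

    # Artificial numbers from the message above: grams consumed and cals/gram.
    grams = {"fat": 120.0, "protein": 211.0, "starch": 350.0}
    cal_per_gram = {"fat": 20.0, "protein": 15.0, "starch": 33.0}
    standard_cal_per_gram = {"fat": 15.0, "protein": 10.0, "starch": 20.0}

    total_cal = sum(grams[k] * cal_per_gram[k] for k in grams)   # 17,115 calories

    # I. Scale every intake to a 10,000-calorie day.
    scale = 10000.0 / total_cal
    adjusted_total = {k: round(grams[k] * scale, 1) for k in grams}

    # II. Re-express each intake in "standard" grams carrying the same calories.
    adjusted_individual = {k: round(grams[k] * cal_per_gram[k] / standard_cal_per_gram[k], 1)
                           for k in grams}

    print(total_cal)            # 17115.0
    print(adjusted_total)       # {'fat': 70.1, 'protein': 123.3, 'starch': 204.5}
    print(adjusted_individual)  # {'fat': 160.0, 'protein': 316.5, 'starch': 577.5}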
RE: Faults and Errors in EXCEL
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Humberto Barreto
Sent: Monday, January 21, 2002 6:29 AM
To: David Heiser; [EMAIL PROTECTED]
Subject: Re: Faults and Errors in EXCEL

At 05:41 PM 1/20/02 -0800, David Heiser wrote:

"These are my questions: 1. Given that X is singular, how can I test this in EXCEL, given that I only have the MMULT, MINVERSE and MDETERM matrix functions available in EXCEL? One test is to calculate MDETERM on the X'X matrix. For this data set, the determinant of X'X is ~E-08, which is not particularly small. Another test is to multiply (X'X)^-1 times X'X and look for something clearly not an identity matrix. In this case it very clearly is nothing like 'I'. I tend to favor the latter."

If you are saying that you don't trust the latter ((X'X)^-1 times (X'X) should equal the identity matrix) in other cases, then why not always do both, and if either fails, you know there's a problem?
---
I don't see this as a 'trust' problem. I see it as a problem in making a decision about the results of the regression. What is the (true/false) decision on the validity of the values of the regression coefficients? Are they good to 2 significant figures, 3, 4 or what? DAH
---
How about taking the inverse of the (X'X)^-1 result? How does that do? I was playing with the matrix

  2  3
  2  3.01

With 5 zeroes (like above), MINVERSE on the matrix, then MINVERSE on the inverse, gives you the original matrix back, but with six or more zeroes it does not, getting worse and worse as you add zeroes. Would this help?
---
You are sort of just restating my problem: it degrades. McCullough's position is that any result from a program whose LRE value for a computed coefficient is less than the corresponding LRE value computed by Stata is absolutely to be rejected. If my data has only 4 significant figures, all I need is a program that will give 5-figure coefficient accuracy. My objective is to come up with something, derived from the data and the program, that tells me I have at least 5-figure accuracy. DAH
---
If so, now you have 3 tests that must be passed.

"3. What is the appropriate screen to use with EXCEL (without additional macros) to indicate that the results are wrong in a regression? With the complicated data sets now being fitted, singularity is not obvious."

Please send me the Excel workbook with the data. I'd like to try a few ideas. I'm thinking a chart of y and predicted y might show some obvious problems.
---
I will send you by separate message copies of the EXCEL files that I have been working with, so you can try out your ideas. DAH
---
"4. Telling students that EXCEL does not properly compute multivariate regression is obviously overkill."

I agree, although LINEST's limitation of 16 X variables is pretty bad, don't you think?
---
No, I don't. Any regression program that uses/depends on the basic IEEE 64-bit floating point instruction set, as implemented in the Intel (and other) chips, has very severe problems with some matrix problems. The problem is the inherent inaccuracy of the inversion operation on a matrix. Although the rounding unit error is about E-16, the error propagates in subsequent computations. In any summation series, the errors are bounded by n-1 times the adjusted rounding unit (Stewart), where n is the number of additions. All matrix inversion methods that I am aware of depend on the Schur complement operation, S = A22 - A21*A11^-1*A12 (after Stewart), which involves subtraction.
In any floating point subtraction, the result can have fewer significant figures than the two original terms. The floating point operation then in effect adds zeros on the right to fill out the mantissa to a base length of 52 bits. This is an inherent loss of significant figures. Consequently, a matrix inversion is the most inaccurate computation in any software package.

My position is that any commercial software package that uses IEEE 64-bit arithmetic will have significant errors in the X'X inverse matrix, and this is the primary source of uncorrectable errors in regression coefficients. I would say that any attempt to do multivariate work with more than 16 variables should be avoided. Now if you have a Fortran package and can do 128-bit work in software, then you can look at many more variables and the values have a higher trust level. The future of computations depends on having 128- and 256-bit floating point chip sets within the next 5 years. DAH
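The two checks discussed in the exchange above -- multiplying (X'X)^-1 back against X'X, and inverting the inverse -- are easy to try in any environment. Below is a small numpy sketch added as an illustration (a made-up example, not the dental data) in which the third column of X is made progressively more collinear with the second, so both checks can be watched degrading:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 54
    x1 = rng.normal(size=n)

    for eps in [1e-3, 1e-5, 1e-7, 1e-9]:
        x2 = x1 + eps * rng.normal(size=n)       # an ever more nearly collinear column
        X = np.column_stack([np.ones(n), x1, x2])
        XtX = X.T @ X
        inv = np.linalg.inv(XtX)
        dev_I = np.abs(inv @ XtX - np.eye(3)).max()          # check 1: identity test
        dev_rt = np.abs(np.linalg.inv(inv) - XtX).max()      # check 2: invert the inverse
        print(f"eps={eps:.0e}  det={np.linalg.det(XtX):.2e}  "
              f"|inv*XtX - I|={dev_I:.1e}  |inv(inv) - XtX|={dev_rt:.1e}")

As the perturbation shrinks, the determinant quietly goes toward zero while both deviation measures typically grow by orders of magnitude, which is the "getting worse and worse as you add zeroes" behavior described above.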
Faults and Errors in EXCEL
I have gone through McCullough and Wilson, two of the papers from the Stern school (Simon and Simonoff), Cryer, and Goldwater, and have workarounds or fixes for each of the problems that they raise (with data sets) about EXCEL. There is one, however, that I don't have a fix for. For example, using EXCEL I can beat Stata (LRE(coefficients) = 11.1 to 15.1) on a 10th order polynomial fit to the NIST Filip data set. The one that I can't fix is the problem of a multivariate linear regression fit in which there is a singular or near-singular X matrix. Given N columns of X, the rank of X here is N-1 or less.

Let's assume we have the case of the dental data from Gary Simon. Here we have 54 observations with 5 variables: 1 for the intercept, 1 for a set of different values from 0 to 8 (Molar Numb), 2 indicator variables (Alc and Drg), and one variable (Drg Grp) that is a function of the two indicator variables (actually 4 minus their sum). X clearly is singular, but EXCEL gives an answer with LINEST.

I have my own set of matrix and regression subroutines in QB, which I used to clarify why this occurs. When I do a Gaussian triangularization with pivoting on X, I get a very clear true zero on the pivoting, indicating singularity. Going to the 5x5 X'X matrix, I end up with reasonable values, but it still is a singular matrix. Doing the Gaussian again, I get a diagonal value of 8.4E-15, which is close to zero but not a true zero. Therefore I can invert X'X in EXCEL and it works, but the values are screwy. I get values of about E+14 with one row and one column of E-03 values. No divide-check errors. With finite values of X'Y and (X'X)^-1, you get a reasonable-looking set of coefficient values.

These are my questions:

1. Given that X is singular, how can I test this in EXCEL, given that I only have the MMULT, MINVERSE and MDETERM matrix functions available in EXCEL? One test is to calculate MDETERM on the X'X matrix. For this data set, the determinant of X'X is ~E-08, which is not particularly small. Another test is to multiply (X'X)^-1 times X'X and look for something clearly not an identity matrix. In this case it very clearly is nothing like 'I'. I tend to favor the latter.

2. Do we or do we not teach accountants and business majors, in their introductory stat class based on EXCEL, about matrices and singularities, rank and eigenvalues? Assuming that this is never covered or taught, what supplementary material should be passed out along with the training on the use of EXCEL?

3. What is the appropriate screen to use with EXCEL (without additional macros) to indicate that the results are wrong in a regression? With the complicated data sets now being fitted, singularity is not obvious.

4. Telling students that EXCEL does not properly compute multivariate regression is obviously overkill.

Any thoughts?

DAHeiser
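For readers who do not have the dental data, here is a small numpy stand-in added for illustration. It is a synthetic design with the same structure described above (intercept, a 0-8 variable, two indicators, and a column equal to 4 minus the sum of the indicators), and it uses only determinant, inverse and multiply operations, mirroring what MDETERM, MINVERSE and MMULT allow; the rank call at the end uses an SVD, which is outside that toolkit but confirms what is going on.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 54
    molar = rng.integers(0, 9, size=n).astype(float)     # values 0 to 8
    alc = rng.integers(0, 2, size=n).astype(float)       # indicator variable
    drg = rng.integers(0, 2, size=n).astype(float)       # indicator variable
    drg_grp = 4.0 - (alc + drg)                           # exact linear combination of columns

    X = np.column_stack([np.ones(n), molar, alc, drg, drg_grp])
    XtX = X.T @ X                                         # 5x5, singular in exact arithmetic

    print("rank(X)   =", np.linalg.matrix_rank(X))        # 4, not 5
    print("det(X'X)  =", np.linalg.det(XtX))              # tiny, but rarely an exact zero
    try:
        inv = np.linalg.inv(XtX)                          # may "work", with huge entries
        print("max entry of (X'X)^-1      =", np.abs(inv).max())
        print("max |(X'X)^-1(X'X) - I|    =", np.abs(inv @ XtX - np.eye(5)).max())
    except np.linalg.LinAlgError:
        print("inverse refused: matrix reported as exactly singular")

The determinant alone is unconvincing, while the (X'X)^-1 times X'X check gives itself away immediately, which matches the preference expressed in question 1.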
RE: How to compute Beta variates
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Michael Bals
Sent: Friday, January 11, 2002 7:15 AM
To: [EMAIL PROTECTED]
Subject: How to compute Beta variates

Hi! I am new to this group, so I hope you haven't been bothered too often with such questions. I looked in the group but didn't really find anything. I want to compute the inverse of the beta distribution in VB. I know that there is no closed form for it. I tried to translate the AS109 algorithm from StatLib but don't know how to get the log of the complete beta distribution that is needed for it. Perhaps someone can help me out here. Are there any other ways to get variates from the beta distribution for ONE specific x ~ U(0,1)? I need them for Latin Hypercube sampling, and as far as I understood, I need the inverse of any distribution I am sampling in order to use LHS. I am also looking for inverse methods for the gamma, Poisson and binomial distributions, or any approximations to them. A lot of questions... hope someone can help.

Bye, Michael Bals
---
I was waiting for all the experts who know a lot more than me about doing this.

1. The basic distributions in mathematical form are given in Abramowitz. Just about everybody who provides packaged programs to calculate distributions relies on this source. There are also a number of approximation (polynomial) methods developed in Fortran from the 60's on, when computers were large and slow. With the advent of small fast computers, the solutions using infinite series (i.e. Abramowitz) are fast enough to give accurate results, if special algorithms are used for p values close to zero or one. It is the tails of these distributions that present underflow/overflow/error problems using IEEE 64-bit double precision floating point numbers. Fortran 90 has a method of doing computations in 128-bit floating point numbers, but this is in software, and as a result is very slow. The 128-bit machine language arithmetic functions were never incorporated in VB, even up to the .NET version. I tried building some 128-bit arithmetic functions, but did not get anywhere. One of the problems was that VB does not have the assembly-level building capability that C++ has. Fortran programs can be easily translated to VB, if you know enough about both languages.

2. The inverse has traditionally been computed by looping, using Newton's method, since the density of the distribution is the derivative of the cumulative function. It is fast and works if the density is smooth. Some of the software packages use different infinite series (or approximations) depending on the values of the parameters. When these are combined, the discontinuities will give closure problems and inaccurate results. The beta distribution in EXCEL is like this.

3. A common element in these distributions is the factorial function. The general approach is to use an infinite series to calculate the log of the factorial function. Equations 6.1.34 and 6.1.48 in Abramowitz are examples. Equation 6.2.2 defines the Beta function, and log(Beta) can be directly calculated from the logs of the factorials. These equations work for real, floating point numbers. To obtain the beta function, test the log values for the allowable exponential range and do EXP(logBeta). This will give you values within the 10^+308 to 10^-308 range of a double precision number.
4. In general, random numbers from these distributions are obtained by getting a good random number from a generator (one that passes the Diehard tests) and computing the inverse of the cumulative distribution to get a true random deviate. This is slow, and there have been several approaches to directly calculating a random deviate that is a close approximation to a true random deviate. Methods for the normal and gamma distributions have been developed to give fast values suitable for Monte Carlo studies.

5. I did most of my work on distributions and inverses in the mid 1980's, starting on a Honeywell mainframe back in the late 70's, so I may not be familiar with recent developments in this century.

DAHeiser
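As a concrete picture of point 2, the sketch below (Python rather than VB, and only a rough stand-in for AS109, added here as an illustration) computes log Beta(a,b) from log-gamma as in point 3, gets the beta CDF by crude numerical integration of the density, and then inverts it with Newton's method. The parameters and the u value are arbitrary, and the integration step is far too slow and too rough for production use.

    import math

    def log_beta(a, b):
        # log B(a, b) from log-gamma, in the spirit of Abramowitz & Stegun 6.2.2
        return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

    def beta_pdf(x, a, b):
        return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_beta(a, b))

    def beta_cdf(x, a, b, steps=2000):
        # crude trapezoidal integration of the density (adequate for a, b > 1);
        # a real implementation would use an incomplete-beta series such as AS109
        h = x / steps
        s = 0.5 * beta_pdf(x, a, b)            # the density at 0 is 0 when a, b > 1
        for i in range(1, steps):
            s += beta_pdf(i * h, a, b)
        return s * h

    def beta_inv(u, a, b, tol=1e-6):
        # Newton's method: the density is the derivative of the cumulative function
        x = a / (a + b)                        # start at the mean
        for _ in range(100):
            err = beta_cdf(x, a, b) - u
            if abs(err) < tol:
                break
            x -= err / beta_pdf(x, a, b)
            x = min(max(x, 1e-9), 1 - 1e-9)    # keep the iterate inside (0, 1)
        return x

    a, b = 2.5, 4.0
    u = 0.3                                    # a U(0,1) draw, e.g. for Latin Hypercube sampling
    x = beta_inv(u, a, b)
    print(x, beta_cdf(x, a, b))                # the cdf at x should come back close to 0.3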
RE: Excel vs Quattro Pro
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Jay Warner
Sent: Friday, January 11, 2002 6:45 PM
Cc: [EMAIL PROTECTED]
Subject: Re: Excel vs Quattro Pro
---
Clap, clap, clap (sound of applause).

DAHeiser
EXCEL 2000 Statistical ToolPak Faults and Problems
I have started going through McCullough and Wilson's paper "On the Accuracy of Statistical Procedures in Microsoft Excel 2000". I have found one error, and may find more. However, I need to put them all together, and that will take some time.

My point is that the Excel 2000 faults are not that severe when Excel is used in the intended environment. The NIST tests are pretty severe and represent primarily invented data sets or unusual data-fitting situations. I can see workarounds to bypass some of the Excel limitations. I need, however, to test them for validity against the NIST data sets first. Again, this is going to take some time. What I intended to say is, don't jump to conclusions yet.

DAHeiser
RE: Stat Requirement (was Excel2000)
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of David Firth
Sent: Sunday, January 06, 2002 1:22 PM
To: [EMAIL PROTECTED]
Subject: Re: Stat Requirement (was Excel2000)
---
Very good points. Appreciated.

DAHeiser
RE: Excel2000- the same errors in stat. computations and graphics
-----Original Message-----
From: Jon Cryer [mailto:[EMAIL PROTECTED]]
Sent: Saturday, January 05, 2002 9:14 AM
To: David Heiser
Cc: [EMAIL PROTECTED]
Subject: RE: Excel2000- the same errors in stat. computations and graphics

David: I have certainly never said nor implied that Excel cannot produce reasonably good graphics. My concern is that it makes it so easy to produce poor graphics. The defaults are absurd and should never be used. It seems to me that defaults should produce at least something useful. The default graphs are certainly not good business graphs if the intent is to produce a good visual display of quantitative information! Isn't that what graphs are for?

[David Heiser] The EXCEL chart defaults are, as you say, poor for an audience/users of statisticians/scientists/engineers. Even going to the effort to make a lot of changes to the defaults, you never really get an outstanding graph, like you would find in a professional publication. The default charts are specifically set up for the type of business applications (i.e. sales, gross income, expenses, operating costs, salesmen performances, product distributions, etc.) in which bar and column charts appear to be meaningful and the presentation of 3D graphs impressive (at least for a late 1980's audience). The Microsoft User's Guide clearly identifies the audience for which EXCEL charts were developed. This user/audience is essentially the same one that ACCESS was written for. The chart defaults generate useful charts for management and for tracking product sales and sales efforts.

What we have now is an entirely different user/audience using EXCEL to do something it was never designed to do. The EXCEL ToolPak represents the way things were done and viewed in the 1980's. Essentially it is a 20-year-old package that has never been updated. In 2002 we have been exposed to the tremendously impressive game displays and the capability of the many (new) statistical software programs, and now want better graphics. I don't think that Microsoft will improve the graphics, leaving it to the developers to create and market separate programs and add-ons to EXCEL to give better graphics. I don't think Office XP (EXCEL 2002) is any different from EXCEL version 5.0.

What would be helpful would be an EXCEL front end for a developer computation package that would in turn generate a standard interface to separate graphics packages. In a competitive world there would be interface standards, so that one could buy separate software packages. However, again cost enters in, and EXCEL as a stand-alone package is for most users the economic choice. So we have to accept the poor EXCEL graphics and computational limitations, because many businesses only have the Microsoft Office application, and under their ground rules (and usage/licensing requirements) you have to use their EXCEL for any computations and presentations. This is the environment I have had to work in. Then on top of it you have such schools as the University of Phoenix who teach undergraduate stat as a course in using EXCEL only.
RE: Stat Requirement (was Excel2000)
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of David Firth
Sent: Saturday, January 05, 2002 3:28 PM
To: [EMAIL PROTECTED]
Subject: Stat Requirement (was Excel2000)

thank you for your response

Thanks for the wealth of Excel trivia. Use the right tool for the job, I say. Excel might not be it. I do have to take a little offense to the accuracy remark regarding business calcs -- I learned early on that BCD reals or integer-based math libs were the only appropriate mechanisms for business calcs. I prefer not to use regular real/float types if I have alternatives. But I'm a measurement/data acq man. We can be a bit, well, anal about accuracy.
---
What I had in mind was that the variant form of data input in each cell (in EXCEL, in VB and in VBA) accepts the following:

 0   Empty      No data/entry (once held something but now is blank).
 1   Null       No value. Unknown data.
 2   Integer    Whole number, -32,768 to +32,767, 2 bytes.
 3   Long       Whole number (integer), -2,147,483,648 to +2,147,483,647, 4 bytes.
 4   Single     Floating point decimal number, approx 7 decimal digits, 4 bytes.
 5   Double     Floating point decimal number, approx 15 decimal digits, 8 bytes.
                -1.79769313486231E308 to -4.94065645841247E-324 and
                +4.94065645841247E-324 to +1.79769313486232E308.
 6   Currency   Decimal number with 4 decimal places, 8 bytes, 19 digits max.
                Use it to minimize rounding errors.
                -922,337,203,685,477.5808 to +922,337,203,685,477.5807.
 7   Date/Time  A number, the integer portion representing days and the decimal
                portion representing time as a fraction of a day, 8 bytes. There are
                a number of functions that will extract calendar and time information from it.
 8   String     Text. 10 bytes + string length, up to 2 billion characters. However,
                EXCEL limits cell contents to a maximum of 256 characters.
 9   OLE Object 4 bytes.
10   Error      Code number returned if an error occurred in a computation, 2 bytes.
11   Boolean    True or false, 2 bytes. Integer; 0 is false, -1 is true.
12   Variant    An array of variants; 16 bytes for numbers, 22 bytes + string length.
13   Non-OLE Object
14   Decimal    14 bytes. +/-79,228,162,514,264,337,593,543,950,335 with no decimal point;
                +/-7.9228162514264337593543950335 with 28 places to the right of the decimal.
                Smallest non-zero number is +/-1E-28 (27 zeros after the decimal point, then a 1).
17   Byte       1 byte. 0 to 255.
8192 Array      An ordered table of values.

The number in the left column comes from the VarType() function. EXCEL uses a separate formatting code to indicate how the number appears in the cell (i.e. number of decimal places and % conversion). If I do calculations with Currency, I have up to 19 accurate digits, whereas with Double, only 15. If I do integer arithmetic with Long integers, I only have up to 10 digits. When you format a cell, EXCEL allows only 0, 1, 5, 6, 7, 8 and 11. Macros can use all the other types. I have no experience with using Decimal numbers.

DAHeiser
---
A CS program more aligned with the needs of science, business, or econ might be found, but CS is a general thing. The idea is that with the help of content experts or reference books the capable CS grad could do good work. If the programmers involved slapped out some code and then went roller skating (apologies to Dilbert) then they didn't do a good job. My 11 years in the software side of embedded systems has contained too many recommendations to peers about spending time to understand the customer's needs.
---
Bravo. Applause. (We will overlook the fact that the customer usually doesn't know his needs until 2 weeks before product delivery date.
This is why software development is an involved, time-consuming, interactive process.) DAHeiser
---
IMHO it is more of a process issue than a knowledge-base problem. I took stats as an elective when I was an electronics engr tech undergrad because I thought it would be handy.
---
Great. Good. Applause. DAHeiser
---
"From what I have observed, many business types have a very limited math background, and even learning simple business stat is a major problem. For example, try getting them to understand the difference between using z and t tests."

Yes, but the idea again is to be a generalist who makes use of content experts. I am now in an MBA program and have
RE: Excel2000- the same errors in stat. computations and graphics
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Shareef Siddeek
Sent: Friday, January 04, 2002 1:22 PM
To: [EMAIL PROTECTED]
Subject: Excel2000- the same errors in stat. computations and graphics

Happy new year to all. I frequently use Excel2000 for graphic presentation, spreadsheet maths, simple nonlinear model fitting (using the Excel solver) with one or two parameters, and simulations. I thought Excel2000 corrected those errors found in the Analysis ToolPak and other built-in computational procedures in the older 97 version. However, the following articles point out that the developers have done nothing to correct those errors. I would like your comments on this. Thanks. Siddeek
---
1. I appreciate receiving your note and the URLs.

2. One really can't effectively use EXCEL without making the effort to learn it from the books. Some of the complaints from Cryer have to do with the fact that he never learned how to build charts in EXCEL. This includes chart layouts, legends, scales, axes, labels, etc. One can use the drawing overlay features to build up text on the charts. I always recommend spending time reading the big commercial manuals available on EXCEL 2000. I have several. EXCEL HELP is lousy for finding the information you really need.

3. The EXCEL stat package was an add-on developer package by GreyMatter International Inc., Cambridge, MA, back in the early 90's. Microsoft did not write it. Being familiar with developers, the people writing the software have to be familiar with an enormous lexicon of object links and protocols. Stat is not one of the courses toward a degree in computer science. Consequently much of the formula building comes from a convenient textbook. I really am surprised at the developers/programmers out there who have no knowledge of basic math, or of how time works (calendar-time linkage). Much of the problem has to do with the assumption that software built-in functions work as the programmer thinks they work, not how they actually work. It is obvious that Bill Gates has no interest in fixing EXCEL accuracy, only in its appearance and ability to fit in as a part of larger program packages. His only interest now is .NET and the ability to pull off company data in spreadsheet format using the internet as the company's internal network.

4. There is a problem with EXCEL histograms. This has been commented on in previous edstat e-mails. In general EXCEL produces simple graphs, primarily for business purposes. It does not produce good scientific graphics. All it does is get you a quick graph with a minimum of effort.

5. Part of the inaccuracy problem has to do with the fact that each EXCEL cell by default is treated as a variant variable. Unless you format all the numerical cells properly (as decimal or integer), you are likely to have problems. I always format all my cells properly, declaring the type of cell contents. If, for example, you precede a number by a space, EXCEL may interpret the number as text. By use of the variant, empty cells can be handled and do not cause computational halts.

6. The primary use of EXCEL is in business, doing the type of calculations and reports described in the Microsoft EXCEL User's Guide. In business applications, accuracy is not that important, except when money is involved. If, for example, McCullough were to declare his numbers as Currency instead of Variant, his accuracy would probably improve.
Considering the type of business applications for stat (for example, see "The Complete Idiot's Guide to Business Statistics"), what EXCEL does is fine. From what I have observed, many business types have a very limited math background, and even learning simple business stat is a major problem. For example, try getting them to understand the difference between using z and t tests, and to understand confidence intervals. Business people expect the computer to give them a number. The statement by McCullough that "...it is important for the package to determine whether the answer is likely to be so corrupted by cumulated rounding errors as to be worthless and, if so, not to display the answer" describes a policy that is not acceptable to business types, and this is one of the ongoing problems on the nets. They would rather get a wrong number than none. In most cases, the computed result is not the sole basis for a business decision. (Please note here that these comments do not apply to those in quality control, research or product improvement/development.)

7. The EXCEL solver was developed by Frontline Systems, at Incline Village, NV. (Incline Village is an expensive skiing/condominium/housing area up at the north end of Lake Tahoe. It was named for a huge inclined water 'trough' that was used in the past to bring logs from the mountains down to sawmills at the lake.) The solver algorithm has not been divulged, and obviously it doesn't compare to
RE: Question about concatenating probability distributions
RE: The Poisson process and lognormal action time.

"This kind of problem arises a lot in the actuarial literature (a process for the number of claims and a process for the claim size), and the Poisson and the lognormal have been used in this context - it might be worth your while to look there for results. Glen"
---
This is a very general and important event process. It is also used to describe the general failure-repair process that occurs at any repair shop. The Poisson is a good approximation of the arrival times of equipment to be repaired, and the log-normal is a good approximation of the time it takes to repair it. From an operations standpoint, downtime is approximated by an exponential distribution (occurrence) and a log-normal repair time, which includes diagnosis, replacement and validation.

In the Air Force (1982-1995), where the reliability and maintainability of equipment has to be characterized, the means are determined and used in a form called availability. We never got beyond the use of availability; they never got into the distribution and confidence interval aspects.

As a general approximation, the log-normal distribution approximates human reaction times to events.

DAHeiser
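Here is a small simulation sketch of the process described above, added as an illustration (the failure rate and repair-time parameters are arbitrary, not Air Force figures): exponential times between failures, log-normal repair times, and the resulting steady-state availability MTBF/(MTBF + MTTR).

    import random, math

    random.seed(3)
    mtbf = 200.0                       # mean hours between failures (exponential)
    mu, sigma = 1.5, 0.8               # log-normal repair-time parameters (hours)

    up, down = 0.0, 0.0
    for _ in range(10000):             # simulate 10,000 failure/repair cycles
        up += random.expovariate(1.0 / mtbf)
        down += random.lognormvariate(mu, sigma)

    availability = up / (up + down)
    mttr = math.exp(mu + sigma**2 / 2)             # mean of the log-normal repair time
    print("simulated availability:", round(availability, 4))
    print("MTBF/(MTBF+MTTR)      :", round(mtbf / (mtbf + mttr), 4))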
RE: What is a confidence interval?
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Gordon D. Pusch
Sent: Thursday, September 27, 2001 7:33 PM
To: [EMAIL PROTECTED]
Subject: Re: What is a confidence interval?

John Jackson [EMAIL PROTECTED] writes: "this is the second time I have seen this word used: frequentist? What does it mean?"

``Frequentist'' is the term used by Bayesians to describe partisans of Fisher et al.'s revisionist edict that ``probability'' shall be declared to be semantically equivalent to ``frequency of events'' in some mythical ensemble. Bayesians instead hold to the original Laplace-Bernoulli concept that probability is a measure of one's degree of confidence in a hypothesis, whereas the frequency of occurrence of an outcome in a set of trials is a totally independent concept that does not even live in the same space as a probability.

-- Gordon D. Pusch
---
I disagree with Pusch. Bayesians have a way of modifying definitions to support their arguments. Bayesians are those people who have to invent loss functions in order to make a decision.

A frequentist defines the concept of probability in terms of gaming, where the probability is defined as the ratio of the number of times an event occurs (such as a one showing on a die) to the number of times all possible events occur (all 6 sides of the die), as the number of repeats (identically distributed independent random events) becomes very, very large. This was very difficult to define mathematically, since what constitutes a repetition could not be adequately defined. Von Mises is usually taken as the main source of this concept.

There is a fundamental problem in defining probability without involving circular references. The terms "identically distributed" and "independent" (random) events depend on the term "equi-probable", and then we are right back at square 1. The definition of "random" involves something that can't be defined, except by saying that the next random event can't be predicted. When, with my die, a 2 keeps coming up by chance, then what? Bayesians say it is just a matter of belief, whatever that is. This leaves probability undefined, as a mathematical property with values between 0 and 1. Whether or not there is such a real thing as a zero probability or a probability of 1, for values between 0 and 1 statisticians have to resort to a frequentist viewpoint in order to establish limiting values as the number of repetitions approaches infinity.

This is why it is so hard to teach statistics. It all depends on the student's internal understanding of what probability means. If you are comfortable with "belief", then fine. Now tell me what the difference is between a p value of 0.05 and 0.06 in real-world terms? If my study has a lot of sizzle and has important ramifications for what we believe about our universe, a p value may not be important. After all, proof of Einstein's theory of relativity was based on a pretty sloppy single observation of the deflection of starlight at an eclipse.

Fisher, in his reflective later life, took great pains to avoid making a hard and fast decision based on probability values. He always said that it was up to the investigator, not the publication editor, to determine whether a p value of 0.06 meant that there was an improbable chance that random events could have determined the outcome of his experiment. Nowadays it is determined by the stupid peer review system.
Also by editors that are looking hard at the best way to determine belief in the claims of the experimenter when they haven't the foggiest idea of what the investigation was about, and corporate profits or the status quo is the most important issue. This was very probably the situation in England in the 1950's which pushed Fisher to go to Australia. Joseph F. Lucke said this in a recent post: -- I saw the same show on Nova. Flower had a different definition of randomness than we now use. We now define randomness as (probabilistic) independence, but that was not always the case. In the 1930s or so, the mathematician-philosopher-statistician von Mises developed a theory of probability based on frequencies. This was not the Kolmogorov version in which the axioms are interpreted as frequencies, but an axiomatic system derived from the properties of repeated events. Von Mises introduced the notion of a collective or sequence of potentially infinitely repeatable events. Probability was defined as the limiting relative frequency in this collective. One of his axioms was that the events within the collective were random. But because he had not yet developed the concept of independence in his system, he could not define randomness as mutual independence among the events within the collective. Indeed, randomness was a primitive concept in his axiomatic system. Von Mises defined randomness
RE: what type of distribution on this sampling
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Joe Galenko
Sent: Friday, September 21, 2001 12:30 PM
To: [EMAIL PROTECTED]
Subject: Re: what type of distribution on this sampling

Just out of curiosity, I'd like to know what kind of population you could have such that a sample mean with N = 200 wouldn't be approximately Normally distributed. That would have to be a very, very strange distribution indeed.

On Fri, 21 Sep 2001, Gus Gassmann wrote:

Joe Galenko wrote: "The mean of a random sample of size 81 from a population of size 1 billion is going to be Normally distributed regardless of the distribution of the overall population (i.e., the 1 billion). Oftentimes the magic number of 30 is used to say that the mean will have a Normal distribution, although that is when we're drawing from an infinitely large population. But for the purposes of determining the distribution of a mean, 1 billion is effectively infinite. And so, 81 is plenty."
---
Certain generating processes, such as those generating particles by mechanical grinders, produce products with multimode distributions, with very evident density peaks. I can take 10^23 particles and measure them and I would still not have a normal distribution. I encountered this 40 years ago when trying to identify the effect of grinding process variables on grinding ammonium perchlorate, when the criterion is propellant burning rate.

DAHeiser
Comment on Powerball---Expected Value
Jordan Ellenberg has an interesting comment on the statistical aspects of the expected value of a Powerball ticket. It was posted Saturday, Sept. 1st, on Slate (accessible through Microsoft Internet Explorer). If you can't get it, I saved the text version and can send it as an e-mail attachment upon request. Try Internet Explorer first.
RE: Bayesian analyses in education
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of KKARMA
Sent: Wednesday, July 11, 2001 2:04 AM
To: [EMAIL PROTECTED]
Subject: Bayesian analyses in education

As a teacher of research methodology in (music) education I am interested in the relation between traditional statistics and the Bayesian approach. Bayesians claim that their approach is superior compared with the traditional, for instance because it does not assume normal distributions, is intuitively understandable, works with small samples, predicts better in the long run, etc. If this is so, why is it so rare in educational research? Are there some hidden flaws in the approach, or are the researchers just ignorant? Comments?
---
S.F. Thomas gave a good reply. I might add that probably most statisticians use methods that are appropriate to the problem. In some problems, only a Bayesian approach works. There are other problems in which a Bayesian approach is not appropriate or is not a part of the problem.

In much of educational research, the focus is on the application of measurement theory to develop relationships between factors, and on ways to measure concepts. The issue of the probability of a parameter value is not of major concern, since general concepts of means, normality (multivariate) and chi-square distributions are considered adequate in presenting results. One example is the use of structural equation modeling (SEM), where the focus is on model fit, and on the cause and effect relationships that the model implies.

A more recent interest is in the application of Bayesian concepts to causality, such as: "We will adhere to the Bayesian interpretation of probability, according to which probabilities encode degrees of belief about events in the world and data are used to strengthen, update, or weaken those degrees of belief. In this formalism, degrees of belief are assigned to propositions in some language, and these degrees of belief are combined and manipulated according to the rules of probability calculus." (Judea Pearl, Causality, Cambridge University Press, 2000). The SEM modeling is non-Bayesian, but the nature of the conclusions may be expressed in the form of Bayesian networks. I would expect to see more of these concepts show up in educational research. It allows one to express a degree of uncertainty in conclusions in a highly technical language that very few understand.

DAHeiser
FW: Diagnosing and addressing collinearity in Survival Analysis
-----Original Message-----
From: David Heiser [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 06, 2001 1:55 PM
To: ELANMEL
Subject: RE: Diagnosing and addressing collinearity in Survival Analysis

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of ELANMEL
Sent: Tuesday, June 05, 2001 11:47 PM
To: [EMAIL PROTECTED]
Subject: Diagnosing and addressing collinearity in Survival Analysis

Any assistance would be appreciated: I am attempting to run some survival analyses using Stata STCOX, and am getting messages that certain variables are collinear and have been dropped. Unfortunately, these variables are the ones I am testing in my analysis! I would appreciate any information or recommendations on how best to diagnose and explore solutions to this problem. Thanks! Elan
---
I see this as a deficiency in your software product, Stata STCOX. You should be climbing down the neck of the company that you bought the software from. Their manual should describe how their software arrived at that declaration, what logic selected the particular variables that were dropped, and how to work around the problem.

These software companies are strictly for-profit companies. We should hold them responsible for bad products, just as we hold Ford and Firestone responsible for faulty products. These companies hire software programmers primarily to develop software that has a lot of flash, just like a computer game. These are the selling features. Once the product is sold, they have no interest. It is up to the user to challenge the company and get the problems solved.

We are seeing more and more users getting their advanced statistical training from software manuals. We (the statistical community) should be putting pressure on these developers to put into their manuals all the text that would normally be found in a textbook on the subject.

David A. Heiser
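The reply above does not give a diagnostic recipe, so here is a minimal numpy sketch, added as an illustration with entirely made-up covariates, of two quick screens one can run before fitting the Cox model: a pairwise correlation matrix and the condition number of the standardized covariates.

    import numpy as np

    # Hypothetical covariate matrix (columns = variables entering the Cox model).
    rng = np.random.default_rng(4)
    age = rng.normal(60, 10, size=200)
    dose = rng.normal(5, 1, size=200)
    dose_mg = dose * 1000.0 + rng.normal(scale=0.5, size=200)  # same quantity re-recorded in mg

    Z = np.column_stack([age, dose, dose_mg])

    # Pairwise correlations: values near +/-1 flag redundant variables.
    print(np.corrcoef(Z, rowvar=False).round(3))

    # Condition number of the standardized covariates: very large values
    # indicate the near-collinearity that leads the software to drop terms.
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0)
    print("condition number:", np.linalg.cond(Zs))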
The False Placebo Effect
Be careful about the assumptions in your models and studies!
---
Placebo Effect An Illusion, Study Says
By Gina Kolata, New York Times
(Published in the Sacramento Bee, Thursday, May 24, 2001)

In a new report that is being met with a mixture of astonishment and some disbelief, two Danish researchers say that the placebo effect is a myth. The investigators analyzed 114 published studies involving about 7,500 patients with 40 different conditions. They found no support for the common notion that, in general, about one-third of patients will improve if they are given a dummy pill and told it is real. Instead, they theorize, patients seem to improve after taking placebos because most diseases have uneven courses in which their severity waxes and wanes. In studies in which treatments are compared not just to placebos but also to no treatment at all, they said, participants given no treatment improve at about the same rate as participants given placebos.

The paper appears today in the New England Journal of Medicine. Both authors, Dr. Asbjorn Hrobjartsson and Dr. Peter C. Gotzsche, are with the University of Copenhagen and the Nordic Cochrane Center, an international organization of medical researchers who review randomized clinical trials.

Reaction to the report covers the spectrum. Dr. Donald Berry, a statistician at the M.D. Anderson Cancer Center in Houston, said: "I believe it. In fact, I have long believed that the placebo effect is nothing more than a regression effect," referring to a statistical observation that patients who feel terrible one day will almost invariably feel better the next day, no matter what is done for them. But others, like David Freedman, a statistician at the University of California, Berkeley, said he was not convinced. He said that the statistical method the researchers used - pooling data from many studies and using a statistical tool called meta-analysis to analyze them - could give results that were misleading. "I just don't find this report to be incredibly persuasive," Freedman said.

The researchers said they saw a slight effect of placebos on subjective outcomes reported by patients, like their descriptions of how much pain they experienced. But Hrobjartsson said he questioned that effect. "It could be a true effect, but it also could be a reporting bias," he said. "The patient wants to please the investigator and tells the investigator, 'I feel slightly better.'" Placebos still are needed in clinical research, Hrobjartsson said, to prevent researchers from knowing who is getting a real treatment.

Curiosity prompted Hrobjartsson and Gotzsche to act. Over and over, medical journals and textbooks asserted that placebo effects were so powerful that, on average, 35 percent of patients would improve if they were told a dummy treatment was real. They began asking where this assessment came from. Every paper, Hrobjartsson said, seemed to refer back to other papers. He began peeling back the onion, finally coming to the original paper. It was written by a Boston doctor, Henry Beecher, who had been chief of anesthesiology at Massachusetts General Hospital in Boston and published a paper in the Journal of the American Medical Association in 1955 titled "The Powerful Placebo." In it, Beecher, who died in 1976, reviewed about a dozen studies that compared placebos to active treatments and concluded that placebos had medical effects. "He came up with the magical 35 percent number that has entered placebo mythology," Hrobjartsson said.
But, Hrobjartsson said, diseases naturally wax and wane. "Of the many articles I looked through, no article distinguished between a placebo effect and the natural course of a disease," Hrobjartsson said. He and Gotzsche began looking for well-conducted studies that divided patients into three groups, giving one a real medical treatment, one a placebo and one nothing at all. That was the only way, they reasoned, to decide whether placebos had any medical effect. They found 114, published between 1946 and 1998. When they analyzed the data, they could detect no effects of placebos on objective measurements, like cholesterol levels or blood pressure.

The Washington Post contributed to this report.

-end of article-
Re: [Q] Generate a Simple Linear Model
well, if you have access to a routine that will generate two variables with specified r ... then, you can do it ... i have one that runs in minitab ... it is a macro ... and i know that jon cryer has one too ... http://roberts.ed.psu.edu/users/droberts/macro.htm ... check #1 ...
---
I am not familiar with the Minitab coding schemes; I have never used Minitab. To get some insight, I went through "1. Generate X,Y Data With Desired n and r". I see some basic FORTRAN in the I/O and early BASIC (dark-ages BASIC) in the statements. I noted an error: the line "cent c3 c6, c12 13" should be "cent c5 c6, c12 c13". Otherwise, the lines seem pretty obvious as to what is going on.

I don't see extensive use of functions and subroutines, which are the basis for modern programming. They would really reduce much of your coding. Are you doing all calculations in double precision or single precision? To go through and find other errors would be a prohibitive task for me.

I am glad that I don't have to teach in your "summer cottage". The lecture halls must be dreadful. I remember my Psych 1 course in a huge (1000-seat capacity, and it was filled) auditorium.

DAHeiser
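Since the thread is about generating X, Y data with a desired n and r, here is a minimal Python sketch, added as an illustration (it is not the Minitab macro), that does it by residualizing one random vector against another and mixing them so the sample correlation is exactly r:

    import numpy as np

    def xy_with_r(n, r, seed=0):
        # Generate x and y whose sample correlation is exactly r.
        rng = np.random.default_rng(seed)
        x = rng.normal(size=n)
        e = rng.normal(size=n)
        x = (x - x.mean()) / x.std()
        e = e - e.mean()
        e = e - (e @ x) / (x @ x) * x        # make e orthogonal to x
        e = e / e.std()
        y = r * x + np.sqrt(1 - r * r) * e   # mix to get the desired correlation
        return x, y

    x, y = xy_with_r(n=100, r=0.7)
    print(np.corrcoef(x, y)[0, 1])           # 0.7 up to rounding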
Re: multivariate normality
----- Original Message -----
From: "yogab" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, March 08, 2001 12:23 PM
Subject: multivariate normality

Hello all. I am using the multivariate normality test (Mardia 1974) to check for normality for my sample set of size 1000 (multivariate data).
---
Please give me the details on your reference. I have Mardia (1970), in which he developed a multivariate skewness and kurtosis relationship. His eq. 2.23 was a brilliant insight on multivariate skewness. His convergent form, eq. 2.24, is a poor measure of b1,p and requires calculations from the correlation matrix. I have no idea what SAS does inside their black box to arrive at b1,p. Mardia also went on to a different approach in his later papers, ignoring eq. 2.23.
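For readers who do not have the paper at hand, here is a small numpy sketch of the usual sample versions of Mardia's (1970) multivariate skewness b1,p and kurtosis b2,p; it is an illustrative, unvalidated implementation added here, not a reconstruction of eq. 2.23 or 2.24, and the simulated data are arbitrary.

    import numpy as np

    def mardia(X):
        # Sample versions of Mardia's multivariate skewness b1,p and kurtosis b2,p.
        n, p = X.shape
        Xc = X - X.mean(axis=0)
        S = (Xc.T @ Xc) / n                   # biased sample covariance
        D = Xc @ np.linalg.inv(S) @ Xc.T      # matrix of Mahalanobis cross-products
        b1 = (D ** 3).sum() / n ** 2
        b2 = (np.diag(D) ** 2).sum() / n
        return b1, b2

    rng = np.random.default_rng(5)
    X = rng.normal(size=(1000, 3))            # 1000 cases, 3 variables
    b1, b2 = mardia(X)
    print("b1,p =", round(b1, 3))             # near 0 under multivariate normality
    print("b2,p =", round(b2, 3))             # near p(p+2) = 15 under normality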
Re: probability definition
----- Original Message -----
From: "Alex Yu" [EMAIL PROTECTED]
To: "Shareef Siddeek" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, March 01, 2001 5:01 PM
Subject: Re: probability definition

1. Very interesting. Would it be possible to get a copy of your paper on probability? I gave a review of "The Meanings of 'Probability'" several years ago for our ASA chapter, and would like to redo it.

"For a quick walk-through of various probability theories, you may consult "The Cambridge Dictionary of Philosophy," pp. 649-651. Basically, propensity theory is meant to deal with the problem that frequentist probability cannot be applied to a single case. Propensity theory defines probability as the disposition of a given kind of physical situation to yield an outcome of a given type. The following is extracted from one of my papers. It briefly talks about the history of classical theory, Reichenbach's frequentism and the Fisherian school: Fisherian hypothesis testing is based upon relative frequency in the long run. Since a version of the frequentist view of probability was developed by the positivists Reichenbach (1938) and von Mises (1964), the two schools of thought seem to share a common thread."

2. Von Mises (1957) quotes Johannes von Kries and goes on to address the usage "I shall assume therefore a definite probability of the death of Caius, Sempronius or Titus in the course of the next year" as support of his concept of "probability in a collective". He does include the "single event" as part of his "collective". He then states, "The term 'probability' will be reserved for the limiting value of the relative frequency in a true collective which satisfies the condition of randomness." With respect to Caius, Sempronius and Titus, he was considering the collective of aged rulers of Rome. However, it is not necessarily true.

"Both the Fisherian and the positivist's frequency theory were proposed in opposition to the classical Laplacean theory of probability."

3. My reading of Fisher was that he opposed the Laplacean view because it had no mathematical basis; Bayes, however, did, and was fully accepted.

"In the Laplacean perspective, probability is deductive, theoretical, and subjective. To be specific, this probability is subjectively deduced from theoretical principles and assumptions in the absence of objective verification with empirical data. Assuming that every member of a set has equal probability to occur (the principle of indifference), probability is treated as a ratio between the desired event and all possible events. This probability, derived from the fairness assumption, is made before any events occur. Positivists such as Reichenbach and von Mises maintained that a very large number of empirical outcomes should be observed to form a reference class. Probability is the ratio between the frequency of the desired outcome and the reference class. Indeed, the empirical probability hardly concurs with the theoretical probability. For example, when a die is thrown, in theory the probability of the occurrence of the number "one" should be 1/6. But even in a million simulations, the actual probability of the occurrence of "one" is not exactly one out of six times. It appears that the positivist's frequency theory is more valid than the classical one. However, the usefulness of this actual, finite, relative frequency theory is limited, for it is difficult to tell how large a reference class is considered large enough."

4. The idea of a "limiting condition" is based on the same understanding used in differential calculus and infinite series.
That is, a limit is reached not at any particular value of N; rather, some value converges to a limit as N increases. This does not depend on any arbitrarily large value of N.

5. Von Mises also based his position on the laws of probability, which can be derived as a "natural" outcome of the frequentist view, where the limiting occurs as the converging value of a ratio. He differentiated between "actual occurrences" and a "thought experiment"; it was the latter that he was using.

Fisher (1930) criticized Laplace's theory as subjective and incompatible with the inductive nature of science. However, unlike the positivists' empirically based theory, Fisher's is a hypothetical infinite relative-frequency theory. In the Fisherian school, various theoretical sampling distributions are constructed as references against which the observed data are compared.

6. My reading of the historical record is that it was the K. Pearson school that did this. Fisher stuck to the uniform, normal and Poisson.

Since Fisher did not mention Reichenbach or von Mises, it is reasonable to believe that Fisher developed his frequency theory independently.

7. Fisher was aware of many who challenged his views, but chose not to respond, except to K. and E. Pearson. Anyone who challenged his views, as von Mises did on likelihood (back in 1930), was either ignored or challenged by an exchange of
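As an aside on the empirical-versus-theoretical point in comment 3 above (the die whose "one" does not come up exactly one time in six, even in a million throws) and on the limiting-value idea in comment 4, here is a minimal simulation sketch in Python. The seed and sample sizes are arbitrary choices made only for illustration; nothing here comes from the original exchange.

# Minimal sketch: empirical relative frequency of rolling a "one" versus the
# theoretical value 1/6, for increasing N.  The von Mises claim is that the
# ratio converges to a limit as N grows, not that it equals 1/6 at any
# particular finite N.
import random

random.seed(1)  # arbitrary seed so the run is repeatable

for n in (1_000, 100_000, 1_000_000):
    ones = sum(1 for _ in range(n) if random.randint(1, 6) == 1)
    print(f"N = {n:>9}: relative frequency = {ones / n:.6f}, theory = {1 / 6:.6f}")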
Re: stat question
- Original Message -
From: Herman Rubin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, November 23, 2000 4:55 PM
Subject: Re: stat question

Herman Rubin wrote: anyone wanting to learn good statistics should not even consider taking an "undergraduate" statistics course

Nonsense. (Reply by this anonymous lion soaking in oil.)

Not only is that not nonsense, but it is quite difficult to get students who have learned techniques to consider what, if any, basis was behind those techniques. Meaningful statistics is based on the concept of probability, not the computation of probabilities, and on consideration of the totality of consequences.

Wow, Herman, this is deep stuff.

There is a huge literature on the attempt to understand what probability is. Even Fisher had problems trying to understand it outside of the frequentist viewpoint.

There is a lot of statistical work involving maximum likelihood estimates, where there is no probability support unless you take a Bayesian approach (which is infrequent). Just look at the extent of the literature on the 2x2 table, and the difficulty there is in understanding the concepts behind an analysis for effects.

I have been reading the absolutely wonderful discussion and raging arguments on SEMNET between Mulaik, Pearl, Hayduk, Shipley and others on the meaning of b in the simple linear equation Y = bX + e1, where X is one variable and e1 is a combination of the effects of all other variables and random effects. When e1 is large with respect to Y, it becomes very difficult to define a simple meaning of b in terms of a quantitative causality. This is deep stuff, which even the professors have difficulty understanding.

Considering that most of the important work involving statistics is in psychology, marketing, medicine, economics, physics, social studies, and every other hard or soft science out there, we cannot assume that all these PhD practitioners understand probability or really understand the nuances of the models, equations and conclusions they arrive at.

(I did it. One sentence per paragraph. Does this put me on Fisher's level?)

It would be nice if all these practitioners had graduate courses in statistics, but more than likely it is an undergraduate-level course taught at the graduate level to a student in, say, psychology or medicine or... (E.g., Abelson, in his book "Statistics as Principled Argument," says in the first sentence of the introduction, "This book arises from 35 years of teaching a first-year graduate statistics course in the Yale Psychology Department." This is typical of most graduate schools, where the first-year statistics course is all that the student gets.)

People in these fields will be exposed to huge databases with large numbers of variables and find it impossible to assess all the implications of any model, hypothesis or set of conclusions made. It is very clear that an education in statistics never stops. The undergraduate level exposes you to the concepts, and the understanding comes with continued education and experience.

There has previously been a long discussion on EDSTAT about the 0.05 probability value and its use. There was no common agreement, which is typical of most of the basic, fundamental things we use in statistics. Since we as statisticians can't agree on what is significant (in terms of probability), how can we expect practitioners to fully understand what probability is?
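For what it is worth, the difficulty with b when e1 dominates can be seen in a few lines of simulation. The sketch below is mine, with made-up numbers, and is not anything from the SEMNET thread: it only shows that when the error term swamps b*X, the least-squares estimate of b comes with a standard error so large that any simple quantitative reading of b is shaky.

# Minimal sketch (invented data): least-squares estimate of b in Y = b*X + e1
# when e1 is small versus when e1 is large relative to Y.
import numpy as np

rng = np.random.default_rng(0)
n, b_true = 50, 0.5
x = rng.normal(size=n)

for noise_sd in (0.5, 5.0):  # modest error term vs. one that swamps b*X
    y = b_true * x + rng.normal(scale=noise_sd, size=n)
    sxx = np.sum((x - x.mean()) ** 2)
    b_hat = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    resid = y - y.mean() - b_hat * (x - x.mean())
    se = np.sqrt(np.sum(resid ** 2) / (n - 2) / sxx)
    print(f"noise sd = {noise_sd}: b_hat = {b_hat:.3f}, std. error = {se:.3f}")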
DAHeiser = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Help needed ... :-(
Based on the problems we have in answering vague questions on edstat, I can say that any requestor must be able to state the question so that we here (using American English) can understand what he is saying and give a helpful answer. It is obvious that all of us have problems understanding the questions in English. The complex field of statistics involves so many variations and such a large body of knowledge that giving a helpful answer is not easy. I just don't reply in areas that I am weak in.

The issue is not an individual being from Germany, or international relations, or responding to people from different cultures and countries, etc.; the issue is that the requestor should be able to state the question so we can understand it. If he can state the question in German, do it. Requestors post questions in Spanish, Italian, Swedish and in other languages that I can't recognise, and get answers. Let us encourage those on edstat from foreign countries to answer the questions in their own languages and to use their own references. If edstat is to be truly international, we need a lot more questions and responses in other languages.

DAHeiser = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Two t tests
- Original Message -
From: Robert J. MacG. Dawson [EMAIL PROTECTED]
To: Richard Lehman [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, November 03, 2000 1:16 PM
Subject: Re: Two t tests

Richard Lehman wrote: A colleague sent me this note. A statistics question. Temperatures taken from different portions of a stream:

Portion 1: 16.9, 17, 15.8, 17.1, 18.7, 18 (mean = 17.25, variance = 0.995)
Portion 2: 18.3, 18.5 (mean = 18.4, variance = 0.02)

Do these portions have different temperatures? Obviously the variances are unequal and a 2-sample [unequal variance]

From knowing the characteristics of flowing streams, one should expect considerable variance. The concept of experimental design would enable one to plan where to measure in order to obtain a reasonable mean value. The bottom, if after a pool, will be colder, and a cross-section of the stream will have high variance. Side-stream entries will result in one side being a different temperature than the other side. A turbulent area will have less variance. You have to determine where the measurements were made in terms of the flow characteristics of the locality. Therefore, from a practical standpoint, you can't conclude that the two groups have different means.

DAHeiser. = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
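For readers who do want to carry out the mechanical calculation the original question was heading toward, here is a minimal sketch using Python and scipy; equal_var=False requests the unequal-variance (Welch) form of the two-sample t test. The practical caveats above about where the measurements were taken still apply, and with only two observations in Portion 2 the test has very few degrees of freedom.

# Minimal sketch: Welch's unequal-variance two-sample t test on the
# temperatures quoted in the message above.
from scipy import stats

portion1 = [16.9, 17.0, 15.8, 17.1, 18.7, 18.0]
portion2 = [18.3, 18.5]

t, p = stats.ttest_ind(portion1, portion2, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.3f}")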
Re: November Issue of SecondMoment
- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, November 01, 2000 10:32 AM
Subject: November Issue of SecondMoment

Now I know why data mining is considered so much hogwash. No attempt to deal with uncertainty or to quantify it. Everything is gospel from a prime authority. Believe everything that is found in the data.

DAHeiser = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: questions on hypothesis
- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, October 16, 2000 4:24 PM
Subject: Re: questions on hypothesis

In article [EMAIL PROTECTED], Chris:

That's not what Jerry means. What he's saying is that if your sample size is large enough, a difference may be statistically significant (a term which has a very precise meaning, especially to the Apostles of the Holy 5%) but not large enough to be practically important. [A hypothetical very large sample might show, let us say, that a very expensive diet supplement reduced one's chances of a heart attack by 1/10 of 1%.]

Firstly, I think we can thank publication pressures for the church of the Holy 5%. I go with Keppel's approach of suspending judgement for mid-range significance levels (although we should do this for nonsignificant results anyway, as they are inherently indeterminate).

The 5% is a historical artifact, the result of statistics being developed before electronic computers were invented. The work in the early 1900s was severely restricted by the fact that computing the cumulative probability distribution involved tedious paper-and-pencil calculations, and later on the use of mechanical calculators. Available tables only gave the values for 5% and in some cases 1%. R. A. Fisher in his publications consistently referred to values well below 1% as being "convincing". To illustrate the fundamental test methods, he had to rely on available tables and chose 5% in most of his examples. However, he did not consider 5% to be "scientifically convincing".

DAH = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
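The bracketed diet-supplement example above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the 1.0% and 0.9% risks and the sample sizes are invented, and the two-proportion z test is just one convenient way to show that the same trivially small effect moves from "not significant" to "overwhelmingly significant" as the sample grows.

# Minimal sketch (invented numbers): a fixed absolute risk reduction of
# 1/10 of 1% becomes "statistically significant" once n is large enough,
# even though its practical importance is unchanged.
from math import sqrt
from scipy.stats import norm

p_control, p_treated = 0.010, 0.009   # hypothetical heart-attack risks

for n in (10_000, 100_000, 5_000_000):           # subjects per group
    p_pool = (p_control + p_treated) / 2
    se = sqrt(p_pool * (1 - p_pool) * 2 / n)
    z = (p_control - p_treated) / se
    p_value = 2 * norm.sf(z)                     # two-sided p value
    print(f"n per group = {n:>9}: z = {z:5.2f}, p = {p_value:.4f}")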
Re: questions on hypothesis
- Original Message -
From: Ting Ting [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, October 13, 2000 10:57 PM
Subject: Re: questions on hypothesis

A good example of a simple situation for which exact P values are unavailable is the Behrens-Fisher problem (testing the equality of normal means from normal populations with unequal variances). Some might say we have approximate solutions that are good enough.

I see this as an imprecise statement of a hypothesis. From set theory, I can see several different logical constructs, each of which would arrive at a different probability distribution, and consequently different p values. It boils down to just what the hypothesis on the generator of the data is. Is it a statement of logical equality, or of the value of a difference function? Does sample "A" come from process "a" and sample "B" come from process "b", or do both samples come from process "c"?

The problem is simplified when process "a" and process "b" are known. When processes "a" and "b" are not known, we have the Fisher problem of defining the set of all "a" parameter values equal to a given p1 value and the set of all "b" parameter values equal to a given p2 value. When the processes are one-parameter processes, everything is straightforward. (Fisher, in his books, very nicely used one-parameter distributions to illustrate his ideas.) However, for a two-parameter process, the Behrens-Fisher problem states an equality (intersection) of mean values and disjoint variance values, which cannot be analytically combined (given the normal distribution function) into a single p value.

Consequently, one finds in the textbooks all the different approaches to establishing a "c" process, for which tests can be constructed to determine whether "A" and "B" come from the process "c" or not. The hypothesis being tested is then based on process "c", not on the original idea.

DAH = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
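One way to see the "different constructs, different p values" point is simply to run two of the textbook approaches side by side on the same data. The sketch below is mine, with invented numbers; the pooled-variance t test corresponds roughly to assuming a single common process "c", while Welch's test allows the two variances to differ, and the two give different p values for the same samples.

# Minimal sketch (invented data): two formalizations of "do A and B have the
# same mean?" yield different p values from the same observations.
from scipy import stats

a = [10.1, 10.4, 9.8, 10.6, 10.2, 9.9, 10.5, 10.3]
b = [11.0, 12.4, 9.6, 13.1, 8.9, 12.7]

t_pooled, p_pooled = stats.ttest_ind(a, b, equal_var=True)
t_welch, p_welch = stats.ttest_ind(a, b, equal_var=False)
print(f"pooled-variance t test:   t = {t_pooled:.2f}, p = {p_pooled:.3f}")
print(f"Welch (unequal variance): t = {t_welch:.2f}, p = {p_welch:.3f}")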