RE: Excel2000- the same errors in stat. computations and graphics
David: I have certainly never said nor implied that Excel cannot produce reasonably good graphics. My concern is that it makes it so easy to produce poor graphics. The defaults are absurd and should never be used. It seems to me that defaults should produce at least something useful. The default graphs are certainly not good business graphs if the intent is to produce good visual display of quantitative information! Isn't that what graphs are for? "In business applications, accuracy is not that important, except when money is involved." Huh? Jon

At 09:39 PM 1/4/2002 -0800, you wrote:

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Shareef Siddeek
Sent: Friday, January 04, 2002 1:22 PM
To: [EMAIL PROTECTED]
Subject: Excel2000- the same errors in stat. computations and graphics

Happy new year to all. I frequently use Excel2000 for graphic presentation, spreadsheet maths, simple nonlinear model fitting (using the Excel solver) with one or two parameters, and simulations. I thought Excel2000 had corrected those errors found in the Analysis ToolPak and other built-in computational procedures in the older 97 version. However, the following articles point out that the developers have done nothing to correct those errors. I would like your comments on this. Thanks. Siddeek

---

1. I appreciate receiving your note and the URLs.

2. One really can't use EXCEL effectively without making the effort of learning it from the books. Some of the complaints from Cryer have to do with the fact that he never learned how to build charts in EXCEL. This includes chart layouts, legends, scales, axes, labels, etc. One can use the drawing overlay features to build up text on the charts. I always recommend spending time reading the big commercial manuals available on EXCEL 2000. I have several. EXCEL HELP is lousy for finding the information you really need.

3. The EXCEL stat package was an add-on developer package by GreyMatter International Inc., Cambridge, MA, back in the early 90's. Microsoft did not write it. Being familiar with developers, I know the people writing the software have to be familiar with an enormous lexicon of object links and protocols. Stat is not one of the courses toward a degree in computer science; consequently, much of the formula building comes from a convenient textbook. I really am surprised at the developers/programmers out there who have no knowledge of basic math, or of how time works (the calendar-time linkage). Much of the problem has to do with the assumption that software built-in functions work as the programmer thinks they work, not how they actually work. It is obvious that Bill Gates has no interest in fixing EXCEL accuracy, only in its appearance and its ability to fit in as a part of larger program packages. His only interest now is .NET and the ability to pull off company data in spreadsheet format using the internet as the company's internal network.

4. There is a problem with EXCEL histograms. This has been commented on in previous edstat e-mails. In general EXCEL produces simple graphs, primarily for business purposes. It does not produce good scientific graphics. All it does is get you a quick graph with a minimum of effort.

5. Part of the inaccuracy problem has to do with the fact that each EXCEL cell by default is treated as a variant variable. Unless you format all the numerical cells properly (as decimal or integer), you are likely to have problems. I always format all my cells properly, declaring the type of cell contents. If, for example, you precede a number with a space, EXCEL may interpret the number as text. By use of the variant, empty cells can be handled without causing computational halts.

6. The primary use of EXCEL is in business, doing the type of calculations and reports described in the Microsoft EXCEL User's Guide. In business applications, accuracy is not that important, except when money is involved. If, for example, McCullough were to declare his numbers as currency instead of variant, his accuracy would probably improve. Considering the type of business applications for stat (for example, see The Complete Idiot's Guide to Business Statistics), what EXCEL does is fine. From what I have observed, many business types have a very limited math background, and even learning simple business stat is a major problem. For example, try getting them to understand the difference between using z and t tests, or to understand confidence intervals. Business people expect the computer to give them a number. McCullough's statement that "...it is important for the package to determine whether the answer is likely to be so corrupted by cumulated rounding errors as to be worthless and, if so, not to display the answer" describes a policy that is not acceptable to business types, and this is one of the ongoing problems on the nets. They would rather get a wrong number than none. In most cases, the computed
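The cumulated rounding error McCullough describes is easy to reproduce in any language. Below is a minimal sketch in Python -- not Excel's actual code, just the textbook one-pass "computational" formula that older spreadsheet STDEV implementations were widely reported to use -- showing how it collapses when the mean is large relative to the spread:

    import math

    def stdev_one_pass(xs):
        # One-pass formula: sqrt((sum(x^2) - (sum x)^2/n) / (n-1)).
        # Subtracting two nearly equal huge numbers loses almost all precision.
        n = len(xs)
        s = sum(xs)
        ss = sum(x * x for x in xs)
        return math.sqrt((ss - s * s / n) / (n - 1))

    def stdev_two_pass(xs):
        # Numerically stable: center the data first, then sum the squares.
        n = len(xs)
        m = sum(xs) / n
        return math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))

    data = [1e8 + d for d in (1.0, 2.0, 3.0, 4.0)]   # big mean, tiny spread
    print(stdev_one_pass(data))   # about 1.63 on IEEE doubles: badly wrong
    print(stdev_two_pass(data))   # 1.2909944..., the correct value

Declaring the cells as currency (i.e., fixed point) would mask rather than fix this; the cure is a better algorithm, which is McCullough's point.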
Re: When to Use t and When to Use z Revisited
But then you should use a binomial (or hypergeometric) distribution. Jon Cryer

p.s. Of course, you might approximate by an appropriate normal distribution.

At 11:39 AM 12/10/01 -0400, you wrote:

Dennis Roberts wrote: this is pure speculation ... i have yet to hear of any convincing case where the variance is known but, the mean is not

What about that other application used so prominently in texts of business statistics, testing for a proportion?
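For concreteness, here is the exact-versus-approximate comparison as a short Python sketch (scipy assumed; the counts are hypothetical):

    from scipy.stats import binom, norm

    n, p0, x = 50, 0.5, 32    # hypothetical: 32 successes in 50 trials, Ho: p = 0.5
    # Exact one-sided p-value straight from the binomial distribution
    p_exact = binom.sf(x - 1, n, p0)               # P(X >= 32 | n, p0)
    # Normal approximation using the null variance n*p0*(1-p0)
    z = (x - n * p0) / (n * p0 * (1 - p0)) ** 0.5
    p_approx = norm.sf(z)
    print(p_exact, p_approx)  # roughly 0.032 vs 0.024; a continuity
                              # correction moves the approximation closer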
Re: When to Use t and When to Use z Revisited
I always thought that the precision of a scale was proportional to the amount weighed. So don't you have to know the mean before you know the standard deviation? But wait a minute - we are trying to assess the size of the mean! Jon Cryer

At 03:42 PM 12/10/01, you wrote:

Dennis Roberts wrote: this is pure speculation ... i have yet to hear of any convincing case where the variance is known but, the mean is not

A scale (weighing device) with known precision.
Re: When to Use t and When to Use z Revisited
Only as an approximation.

At 12:57 PM 12/10/01 -0400, you wrote:

Art Kendall wrote (putting below the previous quotes for readability):

Gus Gassmann wrote:
Dennis Roberts wrote: this is pure speculation ... i have yet to hear of any convincing case where the variance is known but, the mean is not
What about that other application used so prominently in texts of business statistics, testing for a proportion?

the sample mean of the dichotomous (one_zero, dummy) variable is known; it is the proportion.

Sure. But when you test Ho: p = p0, you know (or pretend to know) the population variance. So if the CLT applies, you should use a z-table, no?
Re: What is a confidence interval?
Dennis: Example A is a mistaken interpretation of a confidence interval for a mean. Unfortunately, this is a very common misinterpretation. What you have described in Example A is a _prediction_ interval for an individual observation. Prediction intervals rarely get taught except (maybe) in the context of a regression model. Jon

At 03:11 PM 9/26/01 -0400, you wrote:

as a start, you could relate everyday examples where the notion of CI seems to make sense

A. you observe a friend in terms of his/her lateness when planning to meet you somewhere ... over time, you take 'samples' of late values ... in a sense you have means ... and then you form a rubric like ... for sam ... if we plan on meeting at noon ... you can expect him at noon + or - 10 minutes ... you won't always be right but, maybe about 95% of the time you will?

B. from real estate ads in a community, looking at sunday newspapers, you find that several samples of average house prices for a 3 bedroom, 2 bath place are certain values ... so, again, this is like having a bunch of means ... then, if someone (a visitor) asks you about average prices of a 3 bedroom, 2 bath house ... you might say ... 134,000 +/- 21,000 ... of course, you won't always be right but perhaps about 95% of the time?

but, more specifically, there are a number of things you can do

1. students certainly have to know something about sampling error ... and the notion of a sampling distribution
2. they have to realize that when taking a sample, say using the sample mean, the mean they get could fall anywhere within that sampling distribution
3. if we know something about #1 AND we have a sample mean ... then #1 sets sort of a limit on how far away the truth can be GIVEN that sample mean or statistic ...
4. thus, we use the statistic (ie, the sample mean) and add and subtract some error (based on #1) ... in such a way that we will be correct (in saying that the parameter will fall within the CI) some % of the time ... say, 95%?

it is easy to show this via simulation ... minitab for example can help you do this. here is an example ... let's say we are taking samples of size 100 from a population of SAT M scores ... where we assume the mu is 500 and sigma is 100 ... i will take 1000 SRS samples ... and summarize the results of building 1000 CIs

MTB > rand 1000 c1-c100;     # made 1000 rows and 100 columns ... each ROW will be a sample
SUBC> norm 500 100.          # sampled from population with mu = 500 and sigma = 100
MTB > rmean c1-c100 c101     # got means for the 1000 samples and put them in c101
MTB > name c101='sampmean'
MTB > let c102=c101-2*10     # found lower point of 95% CI (SE = sigma/sqrt(n) = 10)
MTB > let c103=c101+2*10     # found upper point of 95% CI
MTB > name c102='lowerpt' c103='upperpt'
MTB > let c104=(c102 lt 500) and (c103 gt 500)   # evaluates if the intervals capture 500 or not
MTB > sum c104

Sum of C104 = 954.00         # 954 of the 1000 intervals captured 500

MTB > let k1=954/1000
MTB > prin k1

K1    0.954000               # pretty close to 95%

MTB > prin c102 c103 c104    # a few of the 1000 intervals are shown below

Data Display
Row   lowerpt   upperpt   C104
  1   477.365   517.365      1
  2   500.448   540.448      0   # here is one that missed 500 ... the other 9 captured 500
  3   480.304   520.304      1
  4   480.457   520.457      1
  5   485.006   525.006      1
  6   479.585   519.585      1
  7   480.382   520.382      1
  8   481.189   521.189      1
  9   486.166   526.166      1
 10   494.388   534.388      1

dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:[EMAIL PROTECTED]
http://roberts.ed.psu.edu/users/droberts/drober~1.htm
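For anyone without Minitab, the same demonstration is a few lines of Python (numpy assumed); this is just a transcription of Dennis's commands above:

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, reps = 500, 100, 100, 1000
    samples = rng.normal(mu, sigma, size=(reps, n))   # each row is one SRS
    means = samples.mean(axis=1)
    se = sigma / np.sqrt(n)                           # known sigma, so SE = 10
    lower, upper = means - 2 * se, means + 2 * se     # the same +/- 2*10 intervals
    coverage = np.mean((lower < mu) & (mu < upper))
    print(coverage)     # about 0.95, matching the 954/1000 seen in Minitab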
Re: Free program to generate random samples
I wouldn't call bootstrapping sampling from a population. Would you? Jon Cryer

At 06:03 PM 9/21/01 GMT, you wrote:

Jon Cryer wrote: But it would be bad statistics to sample with replacement.

Whew! saves me from having to learn about all that bootstrap stuff! :-)
Re: Free program to generate random samples
But it would be bad statistics to sample with replacement. Jon Cryer

At 08:35 AM 9/21/01 -0300, you wrote:

>"@Home" wrote:
>> Is there any downloadable freeware that can generate, let's say, 2000 random
>> samples of size n=100 from a population of 100 numbers?
>
>and Randy Poe responded:
>> Um.
>>
>> A sample of 100 from a population of 100 is going to
>> give you the entire population.
>
> Depends whether you sample with or without replacement.
>
> -Robert Dawson
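Generating the resamples themselves is one line in most modern packages; a sketch in Python with numpy, under the with-replacement reading:

    import numpy as np

    rng = np.random.default_rng(0)
    population = np.arange(1, 101)     # the population of 100 numbers
    # 2000 samples of size n=100 WITH replacement; without replacement each
    # "sample" would just be a permutation of the whole population
    samples = rng.choice(population, size=(2000, 100), replace=True)
    print(samples.shape, samples.mean())   # (2000, 100), grand mean near 50.5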
Re: how to compare generated values with the specified distribution basis
Robert: "even when N=20, a uniform distribution can be treated as normal for most purposes." I assume you meant to say that for N=20, the sample mean based on a random sample from a uniform distribution can be assumed to have a normal distribution for most purposes. Right? Jon Cryer

At 01:16 PM 9/20/01 -0300, you wrote:

JHWB wrote: Hm, hope I didn't make that subject too complex, resulting in zero replies. But hopefully you can answer this: snip

The gotcha is that while these may be roughly equivalent questions for (say) N=20, for small N deviations from normality are important and the test is poor at detecting them; for large N, deviations from normality do not matter very much but the test is hypersensitive. For instance: even when N=20, a uniform distribution can be treated as normal for most purposes. However, it will generally fail the Ryan-Joiner test at a 5% level! -Robert Dawson
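Robert's two claims are easy to check by simulation; a sketch in Python (numpy and scipy assumed, with Shapiro-Wilk standing in for Ryan-Joiner, which scipy does not provide):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    reps, n = 2000, 20
    u = rng.uniform(0, 1, size=(reps, n))

    # 1. Means of 20 uniforms are already very close to N(0.5, 1/(12*20)):
    means = u.mean(axis=1)
    print(stats.kstest(means, 'norm', args=(0.5, np.sqrt(1 / 12 / n))).pvalue)
    # typically a large p-value: the means look normal

    # 2. Yet a formal normality test rejects the raw uniform samples well
    # above the nominal 5% of the time:
    rej = np.mean([stats.shapiro(row).pvalue < 0.05 for row in u])
    print(rej)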
Re: Presenting results of categorical data?
I do not see how (probabilistic) inference is appropriate here at all. I assume that _all_ employees are rated. There is no sampling, random or otherwise. Jon Cryer

At 11:14 AM 8/15/01 -0300, you wrote:

Silvert, Henry wrote: I would like to add that with this kind of data [three-level ordinal] we use the median instead of the average.

Might I suggest that *neither* is appropriate for most purposes? In many ways, three-level ordinal data is like dichotomous data - though there are a couple of critical differences. Nobody would use the median (which essentially coincides with the mode) for dichotomous data unless they had a very specific reason for wanting that specific bit of information (and I use the word bit in its technical sense). By contrast, the mean (= proportion) is a lossless summary of the data up to permutation (and hence a sufficient statistic for any inference that assumes an IID model) - about as good as you can get.

With three levels, the mean is of course hopelessly uninterpretable without some way to establish the relative distances between the levels. However, the median is still almost information-free (total calorie content per 100-gram serving = log_2(3) < 2 bits). I would suggest that unless there is an extremely good reason to summarize the data as ONE number, three-level ordinal data should be presented as a frequency table. Technically one row could be omitted but there is no strong reason to do so.

What about inference? Well, one could create various nice modifications of a confidence interval; most informative might be a confidence (or likelihood) region within a homogeneous triangle plot, but a double confidence interval for the two cutoff points would be easier. As for testing - first decide what your question is. If it *is* really "are the employees in state X better than those in state Y?" you must then decide what you mean by better. *Do* you give any weight to the number of "exceeded expectations" responses? Do you find 30-40-30 to be better than 20-60-20, equal, or worse? What about 20-50-30? If you can answer all questions of this type, by the way, you may be ready to establish a scale to convert your data to ratio. If you can't, you will have to forego your hopes of One Big Hypothesis Test.

I do realize that we have a cultural belief in total ordering and single parameters, and we tend to take things like stock-market and cost-of-living indices, championships and MVP awards, and quality-of-living indices more seriously than we should. We tend to prefer events not to end in draws; sports that can end in a draw tend to have (sometimes rather silly) tiebreaking mechanisms added to them. Even in sports (chess, boxing) in which the outcomes of (one-on-one) events are known to be sometimes intransitive, we insist on finding a champion. But perhaps the statistical community ought to take the lead in opposing this bad habit!

To say that "75% of all respondents ranked Ohio employees as having 'Met Expectations' or 'Exceeded Expectations'", as a single measure, is not a great deal better than taking the mean in terms of information content *or* arbitrariness. Pooling two levels and taking the proportion is just taking the mean with a 0-1-1 coding. It says, in effect, that we will consider (Exceed - Meet)/(Meet - Fail) = 0, while taking the mean with a 0-1-2 coding says that we will consider (Exceed - Meet)/(Meet - Fail) = 1. One is no less arbitrary than the other. (An amusing analogy can be drawn with regression, when users of OLS regression, implicitly assuming all the variation to be in the dependent variable, sometimes criticise the users of neutral regression for being arbitrary in assuming the variance to be equally divided.) -Robert Dawson
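Robert's 0-1-1 versus 0-1-2 point can be made concrete with his own hypothetical splits; a quick sketch in Python:

    # Percentage splits (Fail / Meet / Exceed) from the post
    splits = {"30-40-30": (30, 40, 30), "20-60-20": (20, 60, 20)}
    for name, (fail, meet, exceed) in splits.items():
        total = fail + meet + exceed
        mean_011 = (0*fail + 1*meet + 1*exceed) / total  # "met or exceeded" proportion
        mean_012 = (0*fail + 1*meet + 2*exceed) / total  # equally spaced scores
        print(name, mean_011, mean_012)
    # 30-40-30 gives 0.70 and 1.00; 20-60-20 gives 0.80 and 1.00.
    # The 0-1-1 coding ranks 20-60-20 higher; the 0-1-2 coding calls them equal.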
Re: Student's t vs. z tests
These examples come the closest I have seen to having a known variance. However, measuring instruments, such as micrometers, often quote their accuracy as a percentage of the size of the measurement. Thus, if you don't know the mean you also don't know the variance. Jon Cryer

At 09:28 AM 4/23/01 -0400, you wrote:

Date: Fri, 20 Apr 2001 13:02:57 -0500
From: Jon Cryer [EMAIL PROTECTED]

Could you please give us an example of such a situation? "Consider first a set of measurements taken with a measuring instrument whose sampling errors have a known standard deviation (and approximately normal distribution)."

Sure. Suppose we use an instrument such as a micrometer, electronic balance or ohmmeter to measure a series of similar items. (For concreteness, suppose they are components coming off a mass production machine such as a screw machine.) As long as the measuring instrument isn't broken, we don't have to conduct an extensive series of repeated measurements every time we use it to determine its error variance with a part of the given conformation. Normality is also reasonably likely under those circumstances.

A slightly more sophisticated version of the same: Suppose the operating characteristics of such a machine can be characterized by slow drift (due to tool wear, heat expansion of machine parts, settings that gradually shift, etc.) plus independent random noise that is approximately normal. It is plausible in that setting that the variance of measurements on a short series of parts would be fairly constant. (I'm not just making this up; it's consistent with my own experience in my former career as a machinist.) Again, you don't have to calibrate the error variance of the measurement (in this case, the average measurement of several successive parts to estimate the current system mean) every time you do it.
Re: Student's t vs. z tests
Alan: Could you please give us an example of such a situation? "Consider first a set of measurements taken with a measuring instrument whose sampling errors have a known standard deviation (and approximately normal distribution)." Jon

At 01:10 PM 4/20/01 -0400, you wrote:

(This note is largely in support of points made by Rich Ulrich and Paul Swank.)

I disagree with the claim (expressed in several recent postings) that z-tests are in general superseded by t-tests. The t-test (in simple one-sample problems) is developed under the assumption that independent observations are drawn from a normal distribution (and hence the mean and sample SD are independent and have specific distributional forms). It is widely applicable because it is fairly robust against violations of these assumptions. However, there are also situations in which the t-test is clearly inferior to a z-test.

Consider first a set of measurements taken with a measuring instrument whose sampling errors have a known standard deviation (and approximately normal distribution). In this case, with a few observations (let's say 1 or 2, if you want to make it very clear), the z-based procedure that uses the known SD will give much more useful tests or intervals than a t-based procedure (which estimates the SD from the data at hand). snip

Alan Zaslavsky
Harvard Med School
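How much the known-sigma z procedure buys at tiny n is easy to quantify; a sketch in Python (scipy assumed) comparing the two 95% interval half-widths at n = 2:

    from scipy import stats

    n, sigma = 2, 1.0                            # two measurements, sigma known
    se = sigma / n ** 0.5
    z_half = stats.norm.ppf(0.975) * se          # +/- 1.96 * SE with known sigma
    t_half = stats.t.ppf(0.975, df=n - 1) * se   # t multiplier with 1 df is 12.71
    print(z_half, t_half)                        # about 1.39 vs 8.98: ~6.5x wider
    # (and the t interval must also replace sigma by a 2-point estimate s)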
Re: Student's t vs. z tests
Alan: I don't understand your comments about the estimation of a proportion. It sounds to me as if you are using the estimated standard error. (Surely you are not assuming a known standard error.) You are presumably also using the normal approximation to the binomial (or perhaps the hypergeometric). To do so requires a "large" sample size, in which case it doesn't matter whether you use the normal or t distribution. Both would be acceptable approximations (and both would be approximations). So what is your point? Once more I think you need to separate the issues of what statistic to use and what distribution to use. Jon

At 01:10 PM 4/20/01 -0400, you wrote:

(This note is largely in support of points made by Rich Ulrich and Paul Swank.)

snip

Now consider estimation of a proportion. Using the information that the data consist only of 0's and 1's, and an approximate value of the proportion, we can calculate an approximate standard error more accurately (for p near 1/2) than we could without this information. The interval based on the usual variance formula p(1-p) and the z distribution is therefore better than the one based on the t distribution. This is why (as Paul pointed out) everybody uses z tests in comparing proportions, not t tests. The same applies to generalizations of tests of proportions as in logistic regression.

snip

Alan Zaslavsky
Harvard Med School
Re: Student's t vs. z tests
Why not introduce hypothesis testing in a binomial setting, where there are no nuisance parameters and p-values, power, alpha, beta, ... may be obtained easily and exactly from the binomial distribution? Jon Cryer

At 01:48 AM 4/20/01 -0400, you wrote:

At 11:47 AM 4/19/01 -0500, Christopher J. Mecklin wrote:
As a reply to Dennis' comments: If we deleted the z-test and went right to the t-test, I believe that students' understanding of p-value would be even worse...

i don't follow the logic here ... are you saying that instead of their understanding being "bad" it will be worse? if so, not sure that this is a decrement other than trivial

what makes using a normal model ... and say zs of +/- 1.96 ... any "more meaningful" to understand p values ...? is it that they only learn ONE critical value? and that is simpler to keep neatly arranged in their mind?

as i see it, until we talk to students about the normal distribution ... being some probability distribution where you can find subpart areas at various baseline values and out (or in between) ... there is nothing inherently sensible about a normal distribution either ... and certainly i don't see anything that makes this discussion based on a normal distribution more inherently understandable than using a probability distribution based on t ... you still have to look for subpart areas ... beyond some baseline values ... or between baseline values ...

since t distributions and unit normal distributions look very similar ... except when df is really small (and even there, they LOOK the same, it is just that ts are somewhat wider) ... seems like whatever applies to one ... for good or for bad ... applies about the same to the other ...

i would be appreciative of ANY good logical argument or empirical data that suggests that if we use unit normal distributions and z values ... z intervals and z tests ... to INTRODUCE the notions of confidence intervals and/or simple hypothesis testing ... that students somehow UNDERSTAND these notions better ... i contend that we have no evidence of this ... it is just something that we think ... and thus we do it that way
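Jon's suggestion in code form: an exact binomial test whose alpha and power come straight from the Binomial distribution, with no approximations and no nuisance parameters (a sketch in Python, scipy assumed, hypothetical numbers):

    from scipy.stats import binom

    n, p0, p1 = 20, 0.5, 0.75        # Ho: p = .5 vs Ha: p = .75, n = 20
    c = 15                           # decision rule: reject Ho if X >= 15
    alpha = binom.sf(c - 1, n, p0)   # exact P(X >= 15 | p = .50)
    power = binom.sf(c - 1, n, p1)   # exact P(X >= 15 | p = .75)
    print(alpha, power)              # about 0.021 and 0.617, both exact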
Re: Excel Graphics
The absolute best advice concerning the use of Excel for graphics (or for statistics, for that matter) is: DON'T! The _majority_ of graph types available in Excel should never be used for any purpose, as they produce misleading graphs - mainly false third dimensions that can only serve to hide important features in the graph. Jon Cryer

At 02:26 PM 1/27/01 GMT, you wrote:

Not sure if this is the best place to ask, but can anyone point me towards any web sites that provide advice on using Excel for technical/scientific graphing? I am not sure why exactly, but I find the graphs produced by Excel, compared to S-Plus or Statistica, to look out of place in a technical report. As I know others feel the same way, I was hoping that there might be some advice out there on how to improve their appearance. Many thanks, Graham S.
Re: Number of classes.
I asked Minitab support how they did it. Here is their answer:

Date: Fri, 26 Sep 1997 15:07:50 -0400
To: [EMAIL PROTECTED]
From: Tech Support
Subject: number of bars in MINITAB histogram

Jonathan, I finally found an answer for you. Here's the algorithm. There are upper and lower bounds on the number of bars:

Lower bound = Round( (16.0*N)**(1.0/3.0) + 0.5 )
Upper bound = Lower bound + Round(0.5*N)

After you find the bounds, MINITAB will always try to get as close to the lower bound as it can. Then we have a "nice numbers" algorithm that finds interval midpoints, given the constraints on the number of intervals.

But there is special code for date/time data and for highly granular data (e.g., all 1's and 2's). Find the largest integer p such that each data value can be written (within fuzz) as an integer times 10**p. Let BinWidth = 10**p. Let BinCount = 1 + Round( ( range of data ) / BinWidth ). If BinCount is <= 10, then let the bin midpoints run from the data min to the data max in increments of BinWidth. Otherwise, use the "nice numbers" algorithm.

Hope this helps.
Andy Haines
Minitab, Inc.

At 11:01 PM 1/4/01 -0500, you wrote:

To determine the number of classes for a histogram, Excel uses the square root of the number of observations. Is that also true for numbers of observations greater than 200 - say, for 2000? Does MINITAB use the same rule for determining the number of classes for a histogram? Any help would be appreciated.
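Transcribed into Python for anyone who wants to experiment (only the bound calculation; the "nice numbers" step is not spelled out in Andy's note, and Minitab's tie-breaking rule for Round is not specified, so Python's round stands in):

    def minitab_bar_bounds(n):
        # Per Minitab tech support:
        #   lower = Round((16*N)**(1/3) + 0.5); upper = lower + Round(0.5*N)
        lower = round((16.0 * n) ** (1.0 / 3.0) + 0.5)
        upper = lower + round(0.5 * n)
        return lower, upper

    for n in (50, 200, 2000):
        print(n, minitab_bar_bounds(n))   # e.g. n=200 -> lower bound of 15 bars
    # Minitab then tries to stay as close to the lower bound as the
    # "nice numbers" midpoints allow.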
Re: urgent problem (statistics for management)
This is quite a silly problem. No wonder statistics (for business) gets so little respect. This is time series or process data - not a random sample from some fixed population. There is no information about the stability of the process over time. Very few business processes are stable over five years. Why can't we teach meaningful statistics? Jon Cryer

At 05:14 PM 12/13/00 +0100, you wrote:

I have some difficulties with the following problem (I need the solution urgently for tomorrow):

Production levels for Giles Fashion vary greatly according to consumer acceptance of the latest styles. Therefore, the company's weekly orders of wool cloth are difficult to predict in advance. On the basis of 5 years of data, the following probability distribution for the company's weekly demand for wool has been computed:

Amount of wool (lb)   Probability
2500                  0.30
3500                  0.45
4500                  0.20
5500                  0.05

From these data, the raw-materials purchaser computed the expected number of pounds required. Recently, she noticed that the company's sales were lower in the last year than in years before. Extrapolating, she observed that the company will be lucky if its weekly demand averages 2,500 this year.

(a) What was the expected weekly demand for wool based on the distribution from past data?
(b) If each pound of wool generates $5 in revenue and costs $4 to purchase, ship, and handle, how much would Giles Fashion stand to gain or lose each week if it orders wool based on the past expected value and the company's demand is only 2,500?

(End of the text of the problem.) Possible solution (in my opinion):

(a) I think it is obvious: If X means the company's weekly demand for wool (lb), then the expected weekly demand for wool based on the distribution from past data is

E(X) = 0.3*2500 + 0.45*3500 + 0.20*4500 + 0.05*5500 = 3500.

Am I right?

(b) Actually I am not sure what "company's weekly demand for wool" in the past data (the table of the probability distribution) means. Is it the amount of wool which the company bought weekly, or the amount of wool which the company sold (in its products) weekly? The last sentence makes a difference between the company's orders ("it orders wool based...") and the company's demand ("and company's demand is only 2,500"). So in my opinion "company's weekly demand for wool" means the amount of wool which the company sold (in its products) weekly. Am I right?

I am not sure what the last sentence means. Does it mean that the company orders 3500 lb of wool weekly ("it orders wool based on the past expected value", and the past expected value = 3500 from (a)) and it sells 2500 lb weekly in its products ("and company's demand is only 2,500")? If so, the solution seems to be: The company should expect to gain weekly 2500*$1 - 1000*$4 = -$1500, so in fact it should expect to lose $1500 weekly. Am I right?

Maybe I should instead consider that the company's weekly demand is 2500 lb but its orders are:

Amount of wool (lb)   Probability
2500                  0.30
3500                  0.45
4500                  0.20
5500                  0.05

(Loss | Orders=2500)  $0     -$1500   ...
probability           0.30    0.45    ...

E(Loss | Orders=2500) = 0*0.3 + (-1500)*0.45 + ...

Please somebody correct me if I am wrong. Jan
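Whatever one thinks of the problem, Jan's arithmetic checks out; a quick verification in Python:

    demand = {2500: 0.30, 3500: 0.45, 4500: 0.20, 5500: 0.05}
    expected = sum(x * p for x, p in demand.items())
    print(expected)               # (a) 3500.0 lb, as computed above

    order, sold = expected, 2500  # (b) order on the old E(X); sell only 2500 lb
    revenue = 5 * sold            # $5 per pound actually sold
    cost = 4 * order              # $4 per pound ordered, shipped, and handled
    print(revenue - cost)         # -1500.0: a weekly loss of $1500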
Re: Density Function in Minitab
Olympio: I used the Minitab menus to produce the following code and graph the standard normal density. To do other densities you need to change the range of values appropriately and change the density calculated and stored. Hope this helps. Jon Cryer

MTB > Name c1 = 'z'
MTB > Set 'z'
DATA> 1( -4 : 4 / .01 )1      # could shorten to -4:4/.01; change for other densities
DATA> End.
MTB > Name c2 = 'Density'
MTB > PDF 'z' 'Density';      # change for other densities
SUBC>   Normal 0.0 1.0.       # change for other densities
MTB > Plot 'Density'*'z';
SUBC>   Connect;              # connect as a smooth curve
SUBC>   ScFrame;              # not needed
SUBC>   ScAnnotation;         # not needed
SUBC>   Reference 2 0.        # added to put a nice base on the curve

At 04:47 AM 7/6/00 GMT, you wrote:

Friends: How can Minitab show the density function of a variable? Can the program calculate it and show the formula? Thanks, Olympio
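The same plot in Python takes about as many lines (numpy, scipy, and matplotlib assumed); as with the Minitab macro, change the grid and the distribution for other densities:

    import numpy as np
    from scipy import stats
    import matplotlib.pyplot as plt

    z = np.arange(-4, 4.01, 0.01)        # same grid as Set 'z' -4:4/.01
    density = stats.norm.pdf(z, 0, 1)    # change for other densities
    plt.plot(z, density)                 # connected smooth curve
    plt.axhline(0)                       # base line, like Reference 2 0
    plt.xlabel("z"); plt.ylabel("Density")
    plt.show()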
Re: normal distribution table online for download??
If you think you need more precision than given in the usual tables or with a calculator, think again. You are probably fooling yourself, since no distribution in the real world is _exactly_ normal. Jon Cryer

At 03:55 PM 7/5/00 GMT, you wrote:

Trying to use in financial calcs. Hardcoded one to four decimals. Prefer more precision. Thanks. [EMAIL PROTECTED]
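That said, if all the poster wants is more digits than a printed table, no download is needed; any package with an erf/normal-CDF routine gives full double precision. A sketch in Python (scipy assumed):

    from scipy.special import ndtr   # standard normal CDF

    print(ndtr(1.96))    # 0.9750021...
    print(ndtr(-4.0))    # 3.16712e-05
    # Good to ~15 significant digits, far past the point where, as Jon
    # notes, the exact normality of any real-world quantity has given out.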
Re: normality and regression analysis
Mike: It's really the error terms in the regression model that are required to have normal distributions with constant variance. We check this by looking at the properties of the residuals from the regression. You shouldn't expect the response (dependent) variable to have a normal distribution with a fixed mean, since then you wouldn't be doing regression. By the way, you have a fine Statistics Department at VPI. I am sure they do excellent consulting. Jon Cryer

At 06:39 PM 5/11/00 -0400, you wrote:

I would like to obtain a prediction equation using linear regression for some data that I have collected. I have read in some stats books that linear regression has 4 assumptions, 2 of them being that 1) the data are normally distributed and 2) the variance is constant. In SAS, I have run univariate analysis testing for normality on both my dependent and independent variables (n=147). Both variables have distributions that are skewed. For the dependent variable: skewness=0.69 and kurtosis=0.25. For the independent variable: skewness=0.52 and kurtosis=-0.47. The normality test (Shapiro-Wilk statistic) states that both the dependent and independent variables are not normally distributed.

I have also transformed the data (both dependent and independent variables) using log, arcsine, and square root transformations. When I run the normality tests on the transformed data, the test shows that even the transformed data are not normally distributed. I realize that I can use nonparametric tests for correlation (I will use Spearman), but is there a nonparametric linear regression? If not, is it acceptable to use linear regression analysis on data that is not normally distributed as a way to show there is a linear relationship? thanks in advance..Mike
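In practice Jon's advice means: fit first, then test the residuals rather than the raw variables. A sketch in Python (statsmodels and scipy assumed; the data here are made-up stand-ins for Mike's 147 observations):

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 10, 147)                  # hypothetical predictor
    y = 2.0 + 0.5 * x + rng.normal(0, 1, 147)    # hypothetical response

    fit = sm.OLS(y, sm.add_constant(x)).fit()
    print(stats.shapiro(fit.resid))              # normality test on the RESIDUALS
    # Also plot fit.resid against fit.fittedvalues to check constant variance.
    # Note that x and y themselves need not look normal for the model to be fine.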
Re: hyp testing -Reply
I thought everyone knew there was a difference in Anatomy between male and female professors! ;)

At 12:19 PM 4/20/00 +0100, you wrote:

dennis roberts wrote:

At 10:32 AM 4/17/00 -0300, Robert Dawson wrote: There's a chapter in J. Utts' mostly wonderful but flawed low-math intro text "Seeing Through Statistics", in which she does much the same. She presents a case study based on some of her own work in which she looked at the question of gender discrimination in pay at her own university, and fails to reject the null hypothesis [no systemic difference in pay between male and female faculty]. She heads the example "Important, but not significant, differences in salaries"; comments (_perhaps_ technically correctly but misleadingly) that "a statistically naive reader could conclude that there is no problem" and in closing states:

the flaw here is that ... she has population data i presume ... or about as close as one can come to it ... within the institution ... via the budget or comptroller's office ... THE salary data are known ... so, whatever differences are found ... DEMS are it! the notion of statistical significance in this case seems IRRELEVANT ... the real issue is ... given that there are a variety of factors that might account for such differences (numbers in ranks, time in ranks, etc. etc.) is the remaining difference (if there is one) IMPORTANT TO DEAL WITH ...

Yes! This reminds me of a newspaper article and radio news item in the UK this year about female and male professors. They had data to show that there was a large salary difference. However, they went on to say that the largest difference was in Anatomy. I mentioned this to a female colleague of mine (who works in that area) who pointed out there was only one female professor of Anatomy in the UK. Thom
density of integral(RV(t)~f(t), 0..T, dt)
Can't be done without knowledge of the joint distributions of Y(t1), Y(t2), ..., Y(tn). Jon Cryer

--- Text of forwarded message ---
Date: Wed, 19 Apr 2000 16:46:32 +0200
From: Thomas Peter Burg [EMAIL PROTECTED]
Organization: University of Illinois at Urbana-Champaign
Subject: density of integral(RV(t)~f(t), 0..T, dt)

Does anyone know if there's an answer to the following problem: I'm given a function of time Y(t), with the property that all values of Y are random variables which are drawn from a time-dependent distribution with known time-dependent density f(t). I.e., the probability that Y(t) < x is Integral(f(t), -inf..x, dt):

d/dx P( Y(t) < x ) = f(t)

With these facts given, is there anything that can be said about the distribution of Integral(Y(tau), 0..t, dtau)? Or its density function? Is there a nice expression for that in terms of the known density f(t) in general? Or maybe with specific assumptions about f? (E.g. Gaussian with mean(t) and var(t).)

I'd greatly appreciate answers to any of these questions or any references that might deal with this problem. Thanks, Thomas Burg
Dept. of Physics, Swiss Federal Institute of Technology [EMAIL PROTECTED]
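To spell out why the marginals cannot suffice: even the first two moments of the integral already involve the joint structure. Assuming the process is regular enough to interchange expectation and integration (Fubini), and writing m(t) = E[Y(t)]:

    E\left[\int_0^T Y(t)\,dt\right] = \int_0^T m(t)\,dt,
    \qquad
    \operatorname{Var}\left(\int_0^T Y(t)\,dt\right)
        = \int_0^T \!\! \int_0^T \operatorname{Cov}\bigl(Y(s), Y(t)\bigr)\,ds\,dt.

The marginal density at each t pins down m(t) and Var(Y(t)) but says nothing about Cov(Y(s), Y(t)) for s != t, which is Jon's point. In the Gaussian case the integral is again Gaussian once the covariance kernel is specified, which is about the nicest answer available.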
Re: 3-D regression planes graph,
The free ARC software from the University of Minnesota will do some of this. Look at http://stat.umn.edu/ARCHIVES/archives.html Jon Cryer

At 01:59 PM 4/10/00 -0500, you wrote:

Hello all, I'm looking for software that can display a 3-D regression environment (x, y, and z variables) and draw a regression plane for each of two subgroups. So far, Minitab does a good job of the 3-D scatterplots (regular, wireframe, and surface (plane) plots), but there's no option (as in the regular scatterplots) to either code data points according to categorical variables or to overlay two graphs on the same set of axes. I'm saving the data in both Minitab and SPSS files, and I can easily convert to Excel (as a standard go-between spreadsheet file). Any help will be greatly appreciated.

The effect in my research that I'm finding so far is that my two groups look similar in univariate and bivariate settings, but the trivariate regression planes are different. I know I could do what I needed to with regression equations (and will do so), but I'd l-o-v-e to have some graphs to go with it. SPSS will be fine for the actual regression equations - it can deal with subgroups like that. Thank you very much in advance, Cherilyn Young
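If stepping outside Minitab/SPSS is an option, this is also a short job in Python with matplotlib; a sketch with made-up data standing in for the two subgroups (one least-squares plane per group):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    ax = plt.figure().add_subplot(projection="3d")

    for color, (b0, b1, b2) in (("C0", (1, 2, -1)), ("C1", (4, -1, 2))):
        # Hypothetical subgroup data; replace with the two real subgroups
        x = rng.uniform(0, 5, 40)
        y = rng.uniform(0, 5, 40)
        z = b0 + b1 * x + b2 * y + rng.normal(0, 1, 40)
        ax.scatter(x, y, z, color=color)
        # Fit and draw this subgroup's least-squares plane
        X = np.column_stack([np.ones_like(x), x, y])
        coef, *_ = np.linalg.lstsq(X, z, rcond=None)
        gx, gy = np.meshgrid(np.linspace(0, 5, 10), np.linspace(0, 5, 10))
        ax.plot_surface(gx, gy, coef[0] + coef[1] * gx + coef[2] * gy,
                        color=color, alpha=0.3)
    plt.show()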