Re: variance
Duane Allen [EMAIL PROTECTED] wrote:

> Kelly [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]...
> > Hi, I'm trying to calculate sample size using the sample size formula
> > for simple random sampling, which requires an estimate of the variance.
> > But I don't know the variance; instead I want to use the maximum
> > variance the random variable can take. I know the range of the
> > variable, say a to b. How can I use the range to calculate the maximum
> > variance the random variable can possibly take? Thanks in advance for
> > your help.
>
> The bounding formulas are dependent on sample size. The upper bounding
> formulas are
>
> s^2 = [n/(4(n-1))]*R^2 for n even

OK, when applying the sample formula using (n-1). The first part of your formula, in square brackets, can be rewritten as

{ n/(n-1) } * 1/4

> and s^2 = [(n+1)/(4n)]*R^2 for n odd

Here the first part can be rewritten as

{ (n+1)/n } * 1/4

Forget about the 1/4: that has to do with the denominator of the deviation scores and is perfectly all right. I'm concerned about the first part of the two rewritten formulas, the part in braces. Both the numerator and the denominator are different. Why is that? The denominator in the second one should still be (n-1), as always in the sample variance calculation. The numerator, however, should be (n-1) too, instead of (n+1).

With an even sample size, the variance is maximal when half of the scores (n/2) equal the minimum and the other half (n/2) equal the maximum. The mean then of course equals (a+b)/2 [a is the minimum, b the maximum]. All deviation scores are equal in magnitude (half on the negative side, half on the positive side) and equal half the range, (b-a)/2. Summing the squared deviation scores over n scores yields

n * {(b-a)/2}^2 == n * (R^2)/4

where R is the range (b-a). Subsequent division by (n-1) yields

n/(n-1) * 1/4 * R^2

So far, so good.

With an odd sample size, the variance is maximal when _one_ score is exactly in the middle, half of the remaining scores ((n-1)/2) equal the minimum, and the rest (also (n-1)/2) equal the maximum. The deviation scores of the scores at the extremes are exactly the same as in the even-sized-sample case. The one in the middle is by definition at the mean and consequently has no deviation score. To calculate the variance, we have to sum the squared deviation scores. Since the one in the middle has no deviation score, and adds nothing to the variance, we now sum over (n-1) scores, yielding

(n-1) * {(b-a)/2}^2 == (n-1) * (R^2)/4

Subsequent division by (n-1) gives

(n-1)/(n-1) * 1/4 * R^2 == 1/4 * R^2

Where in your derivation does the part (n+1)/n come from?

Chris
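A quick numerical check (plain Python; the setup and names are mine, not from the thread) points to where (n+1)/n comes from: for odd n the sample variance is maximized not by putting one score at the midpoint, but by splitting the scores (n-1)/2 against (n+1)/2 between the two extremes, which reproduces Duane's (n+1)/(4n)*R^2 bound.

# Compare the two candidate "maximum variance" configurations for odd n
# on the range [a, b]. Illustrative sketch only.

def sample_variance(xs):
    n = len(xs)
    m = sum(xs) / n
    return sum((x - m) ** 2 for x in xs) / (n - 1)

a, b = 0.0, 1.0                      # so R = b - a = 1
for n in (5, 7, 9):                  # odd sample sizes
    half = (n - 1) // 2
    # One score at the midpoint, the rest split evenly over the extremes:
    midpoint = [(a + b) / 2] + [a] * half + [b] * half
    # No midpoint score: (n-1)/2 at one extreme, (n+1)/2 at the other:
    split = [a] * half + [b] * (half + 1)
    print(n,
          sample_variance(midpoint),          # = R^2/4 = 0.25 for every n
          sample_variance(split),             # strictly larger
          (n + 1) / (4 * n) * (b - a) ** 2)   # Duane's bound: matches split

For n = 5 this prints 0.25, 0.3, 0.3: the lopsided split beats the midpoint configuration, so the (n+1)/n factor comes from that arrangement rather than from the one-in-the-middle picture.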
Re: multivariate techniques for large datasets
On 11 Jun 2001, srinivas wrote:

> I have a problem in identifying the right multivariate tools to handle
> a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data.

So far, you have not described the problem you want to address, nor the models you think may be appropriate to the situation. Consequently, no one will be able to offer you much assistance.

> Can anyone suggest a way out to reduce the data set and also to
> estimate the missing values?

There are a variety of ways of estimating missing values, all of which depend on the model you have in mind for the data, and the reason(s) you think you have for substituting estimates for the missing data.

> I need to know which clustering tool is appropriate for grouping the
> observations (based on 500 variables).

No answer is possible without context. No context has been supplied.

Donald F. Burrill [EMAIL PROTECTED]
184 Nashua Road, Bedford, NH 03110  603-471-7128
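To make "a variety of ways" slightly more concrete: the simplest such method is unconditional column-mean imputation, sketched below in NumPy (purely illustrative and my own example; whether any imputation is sensible depends on the unstated model and missingness mechanism, exactly as the reply above says).

import numpy as np

# Minimal sketch: replace each NaN with its variable's (column's) mean.
# This assumes the missingness is ignorable, which must be argued, not assumed.
def impute_column_means(data):
    data = data.astype(float).copy()
    col_means = np.nanmean(data, axis=0)          # per-variable means, NaNs ignored
    missing = np.isnan(data)
    data[missing] = np.take(col_means, np.where(missing)[1])
    return data

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])
print(impute_column_means(X))   # the NaNs become 2.0 and 3.0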
About Kendall
When I apply Kendall's tau or Kendall's partial tau to a time series, do I have to calculate ranks or not? After all, a time series has a natural temporal order.

Thanks,
Monica De Stefani.
3 Biopharmaceutical Statistics Conferences
Dear Colleagues,

Please note that the following statistical conferences will take place in Washington DC.

Title: Statistical Issues in Drug Development
Content: A two-day intensive course avoiding technical detail and concentrating on PRACTICAL and PHILOSOPHICAL issues that determine which data are collected, how the study should be organized, and what the outcome means. It will ensure that participants acquire a well-grounded appreciation of all the important statistical issues surrounding drug development.
Date: 14-15 June 2001
Venue: Georgetown University Conference Center, Washington DC
Course Leader: Professor Stephen Senn, University College London, UK
Speakers: Professor Peter Lachenbruch, Director of Biostatistics, CBER, US Food and Drug Administration; Dr Richard Simon, Head of Molecular Statistics and Bioinformatics, National Cancer Institute, NIH
ALL DELEGATES RECEIVE A FREE HARDBACK COPY OF STATISTICAL ISSUES IN DRUG DEVELOPMENT BY AUTHOR AND COURSE DIRECTOR PROFESSOR STEPHEN SENN

Title: Statistics of Optimal Dosing
Content: Many statisticians feel that dose selection is very poorly done. This briefing will cover methods appropriate for establishing the optimal dose. It will examine the problems of dosing and suggest practical solutions. Dose assessment must be done correctly: if the dose is assessed incorrectly, the cost of project delay and the extra expense can be very high.
Venue: Washington DC, USA
Date: 26th July 2001

Title: Statistics of Multi-center Trials
Content: Views differ enormously on how multi-center trials should be designed and how data from them should be analyzed. Section 3.2 of the ICH guideline, which has now been adopted in the US, Europe and Japan, is open to interpretation. The FDA provides specific guidelines for a strong statistical basis for the design and analysis of clinical trials. Clare Gnecco, Senior Biostatistician, CBER, US FDA, will discuss the statistical issues surrounding multi-center trials along with other experts.
Venue: Washington DC, USA
Date: 14th September 2001

For more information, please go to our website at www.henrystewart.co.uk or contact:
Dr Carlos Horkan
Henry Stewart Conference Studies
Russell House
28-30 Little Russell Street
London WC1A 2HN
Tel: +44 207 404 3040
Fax: +44 207 404 2081
Email: [EMAIL PROTECTED]

--
Frank E Harrell Jr
Prof. of Biostatistics, Div. of Biostatistics & Epidem.
Dept. of Health Evaluation Sciences
U. Virginia School of Medicine
http://hesweb1.med.virginia.edu/biostat
Re: Need Good book on foundations of statistics
I just posted this to sci.stat.edu. With apologies to those who would see it twice, I post it again, this time cross-posted to comp.ai.fuzzy, where it may also be of interest.

Neville X. Elliven wrote:

> R. Jones wrote:
> > Can anyone refer me to a good book on the foundations of statistics?
> > I want to know of the limitations, assumptions, and philosophy behind
> > statistics.
>
> Probability, Statistics, and Truth by Richard von Mises is available in
> paperback [ISBN 0-486-24214-5] and might be just what you seek.

May I suggest my own Fuzziness and Probability (ACG Press, 1995). In attempting to reconcile the competing paradigmatic claims to representing uncertainty of fuzzy set theory (FST) on the one hand and probability/statistical inference theory (PST) on the other, I was driven to look deeply into the foundations not only of these two, but also of measurement theory, deductive and inductive logic, decision analysis, and the relevant aspects of semantics. I also found it necessary to be clear as to what constitutes a model, and, logically prior to that, what constitutes a phenomenon, which competing models seek in some way to represent.

I think I have succeeded, not only in reconciling the competing claims of FST and PST, but also in finding the extended likelihood calculus which eluded Fisher and the generations of statisticians since. Likelihood theory has thus far been considered inadequate because simple maximization rules of marginalization and set evaluation fail in significant cases, which may in part have temptingly led Bayesians to substitute a probabilistic model, now necessarily subjectivist, for what is in actuality a possibilistic sort of uncertainty. Classicists, quite rightly, have never accepted this insistent Bayesian subjectivism, while Bayesians, quite understandably, have been impatient with the cautious, indirect characterizations of statistical uncertainty that are the hallmark of classical (Neyman-Pearson) statistical method. An extended likelihood calculus which is as easy to manipulate as the probability calculus, but without the injection of subjective priors, seems to me to offer a solution to the disagreements that beset the foundations of statistical inference. At any rate, the original poster may want to take a look.

Be all that as it may, I would also commend to the original poster the following two sources, which I found very helpful when I was asking the sorts of questions the original poster now poses:

1) Sir Ronald A. Fisher. (1951). Statistical Methods and Scientific Inference. Collier MacMillan, 1973 (third edition).

2) V.P. Godambe and D.A. Sprott (eds.). (1971). Foundations of Statistical Inference: A Symposium. Toronto, Montreal: Holt, Rinehart and Winston.

There are many other worthwhile references, but these two helped me enormously in framing the core issues. The latter was especially useful for the informal commentaries and rejoinders which saw respective champions of the three main schools of thought -- classical, Bayesian, and likelihood -- going at each other in vigorous debate.

> > A discussion of how the quantum world may have different laws of
> > statistics might be a plus.
>
> The statistical portion of statistical mechanics is fairly simple, and
> no different conceptually from other statistics.

But from the standpoint of one whose interest is in the quantum-theoretic application domain, there is a very real question of where fuzziness ends and probability begins. I remember reading Penrose's The Emperor's New Mind and thinking -- idly, it's not my field and I haven't tried to follow up -- that at least some of the uncertainty in the quantum world is of the fuzzy rather than the probabilistic sort. The original poster is certainly well-advised to explore the foundations of uncertainty, period, as distinct from purely statistical uncertainty.

Hope this is of some help.

Regards,
S. F. Thomas
Re: About Kendall
On 12 Jun 2001 08:43:53 -0700, [EMAIL PROTECTED] (Monica De Stefani) wrote:

> When I apply Kendall's tau or Kendall's partial tau to a time series,
> do I have to calculate ranks or not? After all, a time series has a
> natural temporal order.

... but you are not partialing out time, surely.

Your program that does the Kendall tau must do some ranking as part of the algorithm. Why do you think you might have to calculate ranks?

--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
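As a side note on why pre-ranking is unnecessary: Kendall's tau depends only on the ordering of the observations within each series, so computing it on the raw values and on their ranks gives the same answer (for tie-free data). A minimal sketch in plain Python (names and data are mine):

from itertools import combinations

def kendall_tau(x, y):
    # Kendall's tau-a from concordant/discordant pairs; no tie correction.
    n = len(x)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def ranks(v):
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

t = [1, 2, 3, 4, 5]                      # time index
x = [2.3, 0.7, 4.1, 3.8, 5.0]            # raw series
print(kendall_tau(t, x))                 # 0.6, tau on raw values
print(kendall_tau(ranks(t), ranks(x)))   # 0.6, identical: tau uses only order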
MDS algorithm needed
I need an algorithm for MDS (multidimensional scaling) with real city distances. So it must be metric, and absolute or ratio, and work on symmetric matrices with missing entries. What is the best algorithm for that? Somewhere I read about SMACOF; is it good?

Thanks!
Jens (Software Developer)

f'up2 sci.stat.math
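For what it's worth, SMACOF does fit this description: it is a metric MDS method based on stress majorization, and because every distance pair carries its own weight, a missing distance can simply be given weight zero. Below is a minimal NumPy sketch of the basic iteration (the function name, defaults, and convergence rule are my own choices; a mature implementation should be preferred in practice):

import numpy as np

def smacof(delta, weights, dim=2, n_iter=500, tol=1e-9, seed=0):
    # delta:   (n, n) symmetric matrix of target distances
    # weights: (n, n) symmetric matrix; set weights[i, j] = 0 where the
    #          distance is missing
    n = delta.shape[0]
    w = np.asarray(weights, dtype=float).copy()
    np.fill_diagonal(w, 0.0)

    # V matrix of the majorizing function, and its Moore-Penrose inverse
    V = -w.copy()
    np.fill_diagonal(V, w.sum(axis=1))
    Vp = np.linalg.pinv(V)

    X = np.random.default_rng(seed).standard_normal((n, dim))
    last = np.inf
    for _ in range(n_iter):
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        stress = np.sum(w * (D - delta) ** 2) / 2.0   # weighted raw stress
        if last - stress < tol:
            break
        last = stress
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(D > 0, delta / D, 0.0)
        B = -w * ratio
        np.fill_diagonal(B, -B.sum(axis=1))
        X = Vp @ (B @ X)                              # Guttman transform
    return X

With city distances, the recovered configuration is determined only up to translation, rotation, and reflection, so the output may need to be aligned to a map afterwards.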
Re: multivariate techniques for large datasets
srinivas wrote:

> Hi, I have a problem in identifying the right multivariate tools to
> handle a dataset of dimension 100,000 x 500. The problem is further
> complicated by a lot of missing data. Can anyone suggest a way out to
> reduce the data set and also to estimate the missing values? I need to
> know which clustering tool is appropriate for grouping the observations
> (based on 500 variables).

This may not be the answer to your question, but clearly you need a good statistical package that would allow you to manipulate the data in ways that make sense and to devise simplification strategies appropriate in context.

I recently went through a similar exercise, smaller than yours but still complex: approx. 5,000 cases by 65 variables. I used the statistical package R, and I can tell you it was a godsend. In previous incarnations (more than 10 years ago) I had used at various times (which varied with employer) BMDP, SAS, SPSS, and S. I had liked S best of the lot because of the advantages I found in the Unix environment. Nowadays I have Linux on the desktop, and I looked for the package closest to S in spirit, which turned out to be R. That it is freeware was a bonus. That it is a fully extensible programming language in its own right gave me everything I needed, as I tend to roll my own when I do statistical analysis, combining elements of possibilistic analysis of the likelihood function derived from fuzzy set theory.

At any rate, if that was indeed your question, and if you're on a tight budget, I would say get a Linux box (a fast one, with lots of RAM and hard disk space), download a copy of R, and start with the graphing tools that allow you, as a first step, to look at the data. Sensible ways of grouping and simplifying will suggest themselves to you, and inevitably thereafter you'll want to fit some regression models and/or do some analysis of variance.

If you're *not* on a tight budget, and/or you have access to a fancy workstation, then you might also have access to your choice of expensive stats packages. If I were you, I would still opt for R, essentially because of its programmability, which in my recent work I found to be indispensable.

Hope this is of help. Good luck.

S. F. Thomas