Re: Question on Conditional PDF
Chia C Chong wrote: Glen wrote: Do you want to make any assumptions about the form of the conditional, or the joint, or any of the marginals? Well, X and Y are dependent and hence are described by a joint PDF. This much is clear. I am not sure what other assumptions I can make, though.. I merely thought you may have domain-specific knowledge of the variables and their likely relationships which might inform the choice a bit (cut down the space of possibilities). Can you at least indicate whether any of them are restricted to be positive? Glen = Instructions for joining and leaving this list, remarks about the problem of INAPPROPRIATE MESSAGES, and archives are available at http://jse.stat.ncsu.edu/ =
Re: detecting outliers in NON normal data ?
Voltolini wrote: Hi, I would like to know if methods for detecting outliers using interquartile ranges are indicated for data with a NON-normal distribution. The software Statistica presents this method: data point value > UBV + o.c.*(UBV - LBV), or data point value < LBV - o.c.*(UBV - LBV), where UBV is the 75th percentile, LBV is the 25th percentile, and o.c. is the outlier coefficient. The values of the outlier coefficient are traditionally chosen by reference to some percentile of the normal distribution. (If anyone didn't recognise it, this is just the outliers on a boxplot.) If you choose that coefficient in some appropriate way, then it may be reasonable for non-normal data. Glen
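In case it helps to see the rule concretely, here is a small Python sketch of the boxplot fences described above (the data and the coefficient 1.5 are purely illustrative; Statistica's default coefficient may differ):

```python
import numpy as np

def boxplot_outliers(x, oc=1.5):
    """Flag points beyond the quartile-based fences described above."""
    x = np.asarray(x, dtype=float)
    lbv, ubv = np.percentile(x, [25, 75])   # LBV = 25th, UBV = 75th percentile
    iqr = ubv - lbv                          # UBV - LBV
    lower, upper = lbv - oc * iqr, ubv + oc * iqr
    return x[(x < lower) | (x > upper)]

data = [1.0, 2.0, 2.5, 3.0, 3.1, 3.4, 3.8, 4.0, 12.0]
print(boxplot_outliers(data))   # only the 12.0 is beyond the fences
```

Note the fences make no normality assumption by themselves; only the conventional choice oc = 1.5 is calibrated against the normal.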
Re: Cauchy PDF + Parameter Estimate
Herman Rubin wrote: Chia C Chong wrote: Hi! Has anyone come across some Matlab code to estimate the parameters of the Cauchy PDF? Or some other sources about methods to estimate these parameters? What is so difficult about maximum likelihood? Start with a reasonable estimator, and use Newton's method. There are difficulties with Newton's method (and many other hill-climbing techniques) because the Cauchy likelihood function is generally multimodal. You can end up somewhere other than the MLE unless you use a somewhat more sophisticated starting point than "a reasonable estimator". There are good estimators that can start you off very close to the true maximum, but it's a long time since I've seen that literature, so I can't name names right now. Glen
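For what it's worth, a sketch of the idea in Python rather than Matlab: maximise the Cauchy likelihood numerically, starting from the median and half the interquartile range, which for a Cauchy are consistent estimates of location and scale and usually land you near the right mode. The simulated data and the scipy-based optimiser are illustrative choices, not from the thread:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import cauchy

# Simulated Cauchy data with known parameters, for illustration
x = cauchy.rvs(loc=3.0, scale=2.0, size=2000,
               random_state=np.random.default_rng(0))

def negloglik(theta):
    loc, log_scale = theta   # optimise log(scale) to keep scale positive
    return -cauchy.logpdf(x, loc=loc, scale=np.exp(log_scale)).sum()

# Robust starting values: median, and half-IQR (the Cauchy quartiles
# sit at loc +/- scale, so half the IQR estimates the scale)
loc0 = np.median(x)
scale0 = 0.5 * (np.percentile(x, 75) - np.percentile(x, 25))

res = minimize(negloglik, [loc0, np.log(scale0)], method="Nelder-Mead")
loc_hat, scale_hat = res.x[0], np.exp(res.x[1])
print(loc_hat, scale_hat)   # should land near 3 and 2
```

Starting this close to the truth is what keeps a hill-climber away from the spurious local modes Glen mentions.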
Re: What is an outlier ?
Voltolini wrote: Hi, my doubt is: can an outlier be a LOW data value in the sample (and not just the highest)? Several textbooks don't make this clear! What makes an outlier an outlier is your model. If your model accounts for all the observations, you can't really call any of them an outlier. If your model adequately accounts for all but one or two unusual observations, you might regard those as coming from some process other than the one that generated the data your model accounts for, and call them outliers. Such not-adequately-accounted-for observations may be low observations, or high observations, or they may actually turn out to be somewhere in the middle of the range of your data - as I have seen with time series, for example, where in some applications an autoregressive model was a very good description of a long series, apart from a few outliers in the first quarter or so of the time period (which did in the end turn out to have come from a different process, because the protocol wasn't always being properly followed early on). Two of those outliers - in the sense that the model didn't adequately account for them - turned out to be neither particularly high nor low observations - but they were substantially higher or lower than expected from the model. Another case where you might have outliers in the middle of your data is in a regression context, where a generally increasing relationship shows a tight, gaussian-looking random scatter about the relationship, but with a couple of relatively low y-values at some of the higher x-values. The observations themselves may actually be very close to the mean of the y's, but the model of the relationship makes them unusual. A different model - for example, one where the observations come from a distribution which has the same expectation as a function of x, but which has a heavier tail to the left around that - might account for all the data and not find any outliers.
Glen
Re: Question on CDF
Henry wrote: On Fri, 22 Feb 2002 08:55:42 +1100, Glen Barnett wrote: Bob wrote: A straight-line CDF would imply the data is uniformly distributed, that is, the probability of one event is the same as the probability of any other event. The slope of the line would be the probability of an event. I doubt that - if the data were distributed uniformly on [0,1/2), say, then the slope of the line would be 2! I suspect he meant probability density. I guess that's actually correct - the slope of the pdf is zero. However, I'm fairly certain that's not what he meant. Glen
Re: Question on CDF
Henry wrote: I was trying to suggest that he meant the slope of the CDF was the height of the PDF. Oh, okay. Yes, that would be correct, but it shouldn't be called probability! Glen
Re: Question on CDF
Bob wrote: Linda wrote: Hi! If I plot the CDF of some sample data and this CDF looks like a straight line crossing through 0, what does this imply? Normally, a CDF will not look like a straight line but something like an S shape, won't it? Linda. A straight-line CDF would imply the data is uniformly distributed, that is, the probability of one event is the same as the probability of any other event. The slope of the line would be the probability of an event. I doubt that - if the data were distributed uniformly on [0,1/2), say, then the slope of the line would be 2! Glen
Re: How to test whether f(X,Y)=f(X)f(Y) is true??
Linda wrote: Hi! I have some experimental data collected which can be grouped into 2 variables, X and Y. One is the dependent variable (Y) and the other is an independent variable (X). What test should I use to check whether they can be treated as independent or not? There are so many ways variables can fail to be independent that a truly general test usually won't have good power against specific alternatives. Essentially you'd need to estimate f(Y|X) somehow and compare it to f(Y) (also estimated somehow). I have no advice on the best way to tackle the test, since it depends on how you do the estimation (and you need to keep in mind that since the two distributions are estimated from the same data, they are not independent). If X and Y are categorical, there are a number of general tests of independence, of which the usual Pearson chi-squared test of independence is the best known. It's much better if you can specify the kind of alternatives you care about most, and the more specific the better. For example, one thing that would help to nail it down a little would be to say you only care about a relationship in the mean - i.e. you need to detect whether E(Y|X) differs from E(Y). This is still very general, but it's better. If you're only interested in monotonic relationships, it's easier still. But you need to clarify what you require. Glen
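For the categorical case mentioned above, the Pearson chi-squared test of independence is essentially a one-liner in, say, Python (the counts here are made up purely for illustration):

```python
import numpy as np
from scipy.stats import chi2_contingency

# rows = categories of X, columns = categories of Y (hypothetical counts)
table = np.array([[30, 10],
                  [20, 40]])

stat, p, dof, expected = chi2_contingency(table)
print(f"chi2={stat:.2f}, p={p:.4f}, dof={dof}")
```

A small p-value is evidence against independence; the `expected` array holds the counts you would expect if X and Y really were independent.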
Re: Chi-square chart in Excel
Ronny Richardson wrote: Can anyone tell me how to produce a chart of the chi-square distribution in Excel? (I know how to find chi-square values but not how to turn those into a chart of the chi-square curve.) Ronny Richardson. I assume you want the pdf, not the cdf. Set up a column of x's (e.g. 0, 0.2, 0.4, ...), and beside it set up a column of pdf values (type in the pdf for the chi-square you're after as a function of x). For m d.f.: 1/[Gamma(m/2)*2^(m/2)] * x^(m/2-1) * exp(-x/2). (In Excel you'll need exp(gammaln()) because it doesn't have a Gamma function.) Note that you can set up m in a cell, so you can play around with the d.f. and see what it does to the curve. So now you have two columns you can plot. Click on the chart icon, choose the XY (scatter) plot option, and pick either the joined-with-lines or joined-with-a-curve picture (without the points marked - either of the rightmost plot options there). Choose any other options you need, and there you go. Glen
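If you want to sanity-check the Excel formula against another tool, the same density is easy to compute elsewhere; a small Python sketch (m = 4 is an arbitrary choice):

```python
import math

def chisq_pdf(x, m):
    """Chi-square density with m degrees of freedom, per the formula above."""
    return x ** (m / 2 - 1) * math.exp(-x / 2) / (math.gamma(m / 2) * 2 ** (m / 2))

# Same x-grid idea as the Excel column: 0.2, 0.4, ..., 10.0
xs = [0.2 * i for i in range(1, 51)]
ys = [chisq_pdf(x, m=4) for x in xs]   # m set in one place, like the cell
```

Plotting xs against ys should reproduce the Excel chart.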
Re: Normalization procedures
Niko Tiliopoulos wrote: Hello everybody, has anybody heard of the Bell-Doksum test? IIRC it's like a Wilcoxon 2-sample test, except that the ranks are transformed to normal scores. If that's the right test, it has ARE 1 vs the t-test (it has good power for small deviations), but as you move to larger deviations, its power curve flattens out short of 1. Checking the internet: ... 9. Bell, C. B.; Doksum, K. A. Some new distribution-free statistics. Ann. Math. Statist. 36 (1965), 203-214. ... 12. Bell, C. B.; Doksum, K. A. Optimal one-sample distribution-free tests and their two-sample extensions. Ann. Math. Statist. 37 (1966), 120-132. ... it would just about have to be one of these two papers. Glen
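If it's useful, here is a sketch of a closely related normal-scores procedure (van der Waerden-type, using the deterministic quantiles Phi^{-1}(R/(N+1)) rather than Bell and Doksum's randomised scores); treat it as an illustration of the ranks-to-normal-scores idea, not as an implementation of the Bell-Doksum test itself:

```python
import numpy as np
from scipy.stats import rankdata, norm

def normal_scores_stat(x, y):
    """Two-sample normal-scores (van der Waerden-type) statistic,
    approximately N(0,1) under the null of identical distributions."""
    combined = np.concatenate([x, y])
    n = len(combined)
    # replace each observation's rank R by the normal score Phi^{-1}(R/(n+1))
    scores = norm.ppf(rankdata(combined) / (n + 1))
    s1 = scores[:len(x)].sum()                 # sum of scores in sample 1
    # null variance of s1 (the scores have mean zero by symmetry)
    var = len(x) * len(y) * (scores ** 2).sum() / (n * (n - 1))
    return s1 / np.sqrt(var)

z = normal_scores_stat(np.arange(20.0), np.arange(20.0) + 30.0)
print(z)   # strongly negative: the first sample sits well below the second
```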
Re: Which is faster? ziggurat or Monty Python (or maybe something else?)
Ian Buckner wrote: Glen Barnett wrote: Ian Buckner wrote: We generate pairs of properly distributed Gaussian variables at down to 10ns intervals, essential in the application. Speed can be an issue, particularly in real-time situations. Generated on what? (On a fast enough machine, even clunky old Box-Muller can probably give you that rate.) Generated on custom silicon (surprise). Box-Muller does not work for real-time requirements. Of course it does, if the machine is fast enough that you're getting them at the rate you need. And the reason you're getting them fast is that you have a fast machine - which is not much help if the machine is a given. Glen
Re: Numerical recipes in statistics ???
Charles Metz wrote: The Truth wrote: I suppose I should have been more clear with my question. What I essentially require is a textbook which presents algorithms like Monte Carlo, Principal Component Analysis, clustering methods, MANOVA/MANCOVA methods etc. and provides source code (in C, C++ or Fortran) or pseudocode together with short explanations of the algorithms. Although it doesn't contain much code/pseudocode, I highly recommend 'Elements of Statistical Computing: Numerical Computation,' by Ronald A. Thisted (New York and London: Chapman and Hall, 1988). To the best of my knowledge, this is as close to a statistics version of 'Numerical Recipes' as you'll find. Thisted's book is quite good. Glen
Re: Numerical recipes in statistics ???
The Truth wrote: Glen Barnett wrote: The Truth wrote: Are there any Numerical Recipes-like textbooks on statistics and probability? Just wondering.. What do you mean, a book with algorithms for statistics and probability, or a handbook/cookbook list of techniques with some basic explanation? Glen. I suppose I should have been more clear with my question. What I essentially require is a textbook which presents algorithms like Monte Carlo, Principal Component Analysis, clustering methods, MANOVA/MANCOVA methods etc. and provides source code (in C, C++ or Fortran) or pseudocode together with short explanations of the algorithms. There are books on statistical computing that cover some algorithms (usually with pseudocode rather than actual source code), but to cover all of statistics is not possible. The particular subset you suggest above is not all covered in any one book I have seen. You should be able to find books that cover some Monte Carlo techniques and regression and maybe bootstrapping and a few other basic techniques - stuff that goes somewhat beyond what's in NR, but not nearly as far as you seem to be after. You can find code for many of these things (and much more besides) in journals like JRSS C (Applied Statistics) and a few others (e.g. ACM Transactions on Mathematical Software). A lot of these algorithms are on the Internet. Glen
Re: Which is faster? ziggurat or Monty Python (or maybe something else?)
Herman Rubin wrote: Radford Neal wrote: Box-Muller does not work for real time requirements. This isn't true, of course. A real-time application is one where one must guarantee that an operation takes no more than some specified maximum time. The Box-Muller method for generating normal random variates does not involve any operations that could take arbitrary amounts of time, and so is suitable for real-time applications. This assumes that the time needed for Box-Muller is small enough, which will surely often be true. If the time allowed is very small, then of course one might need to use some other method. Rejection sampling methods would not be suitable for real-time applications, since there is no bound on how many points may be rejected before one is accepted, and hence no bound on the time required to generate a random normal variate. Radford Neal. Acceptance-rejection, or the usually faster acceptance-replacement, methods are, strictly speaking, not real time. However, they may be much faster 99.99% of the time. In that circumstance, could one not generate more values than required each call (say an extra one, assuming there's time), and store the extras up for the rare case where it looks like it will take too long? You could take enough that the probability you exhaust them is smaller than, say, the probability a cosmic ray will flip a crucial bit in your hardware. You'd need a few generated at the start, of course. Glen
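The buffering idea above, sketched in illustrative Python (a real-time system would do this in hardware or with a bounded-time loop; this only shows the bookkeeping, using Marsaglia's polar method as the rejection-based generator):

```python
import collections
import math
import random

class BufferedNormal:
    """Normal generator using a rejection method (Marsaglia's polar
    method) plus a reserve buffer, so each call can return a stored
    value immediately even if the rejection loop happens to run long."""

    def __init__(self, reserve=64, rng=random.random):
        self.rng = rng
        self.reserve = reserve
        self.buf = collections.deque()
        while len(self.buf) < reserve:   # pre-fill a few at start-up
            self._generate_pair()

    def _generate_pair(self):
        # The rejection step: may loop an unbounded number of times,
        # which is exactly why the reserve buffer exists.
        while True:
            u = 2.0 * self.rng() - 1.0
            v = 2.0 * self.rng() - 1.0
            s = u * u + v * v
            if 0.0 < s < 1.0:
                f = math.sqrt(-2.0 * math.log(s) / s)
                self.buf.append(u * f)
                self.buf.append(v * f)
                return

    def next(self):
        x = self.buf.popleft()           # always available immediately
        if len(self.buf) < self.reserve:
            self._generate_pair()        # top up during spare time
        return x
```

Making the reserve large enough bounds the probability of ever exhausting it, as the post suggests.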
Re: Numerical recipes in statistics ???
The Truth wrote: Are there any Numerical Recipes-like textbooks on statistics and probability? Just wondering.. What do you mean, a book with algorithms for statistics and probability, or a handbook/cookbook list of techniques with some basic explanation? Glen
Re: Which is faster? ziggurat or Monty Python (or maybe something else?)
Ian Buckner wrote: We generate pairs of properly distributed Gaussian variables at down to 10ns intervals, essential in the application. Speed can be an issue, particularly in real-time situations. Generated on what? (On a fast enough machine, even clunky old Box-Muller can probably give you that rate.) How generated? Glen
Re: Which is faster? ziggurat or Monty Python (or maybe something else?)
Alan Miller wrote: First - the reference to George's paper on the ziggurat, and the code: The Journal of Statistical Software (2000) at: http://www.jstatsoft.org/v05/i08 That I already have, thanks. Glen
Re: Which is faster? ziggurat or Monty Python (or maybe something else?)
Bob Wheeler wrote: Marsaglia's ziggurat and MWC1019 generators are available in the R package SuppDists. The gcc compiler was used. Thanks Bob. Glen
Re: Which is faster? ziggurat or Monty Python (or maybe something else?)
George Marsaglia wrote: (3-year-old) timings, in nanoseconds, using Microsoft Visual C++ and gcc under DOS on a 400MHz PC. Comparisons are with methods by Leva and by Ahrens-Dieter, both said to be fast, using the same uniform RNG:

                  MS   gcc
  Leva           307   384
  Ahrens-Dieter  161   193
  RNOR            55    65   (Ziggurat)
  REXP            77    40   (Ziggurat)

The Monty Python method is not quite as fast as the Ziggurat. Thanks for the information. Could you give a rough idea about the relativities? Roughly 5% slower? 10%? 30%? I realise it's machine-dependent, but I'm only after a rough picture. Some may think that Alan Miller's somewhat vague reference to a source for the ziggurat article suggests disdain. I didn't get that impression. (I don't have a web page, so the above can be considered my way to play Ozymandias.) I wish you did! Glen
Re: Which is faster? ziggurat or Monty Python (or maybe something else?)
Art Kendall wrote: I tend to be more concerned with the apparent randomness of the results than with the speed of the algorithm. This will be mainly a function of the randomness of the uniform generator. If we assume the same uniform generator for both, and assuming it's a pretty good one (our current one is reasonable, though I want to go back and update it soon), there shouldn't be a huge difference in the apparent randomness of the resulting gaussians. As a thought experiment, what is the cumulative time difference in a run using the fastest vs the slowest algorithm? A whole minute? A second? A fractional second? When you need millions of them (as we do; a run of 10,000 simulations could need as many as 500 million gaussians, and we sometimes want to do more than 10,000), and you also want your program to be interactive (in the sense that the user doesn't have to wander off and have coffee just to do one simulation run), knowing that one algorithm is, say, 30% faster is kind of important. Particularly if the user may want to do hundreds of simulations... A whole minute extra on a simulation run is a big difference if the user is doing simulations all day. Glen
Re: test differences between proportions
Rich Ulrich wrote: On Mon, 11 Feb 2002 13:56:46 +0100, nikolov wrote: hello, I want to test the difference between two proportions. The problem is that some elements of these proportions are dependent (I cannot isolate them). That is, the t-statistic does not work. What could I do? Do other kinds of tests exist? Is there a book or a paper on the subject? Taking your questions in reverse order -- I don't know of a book or paper about general dependencies, but those concerns are implicit in estimation theory. If dependency is what makes the t-test hard to use, you will have trouble with everything else that is common, too. What you could do is -- (a) Use the t-test anyway, if the correlations are positive: because the bias would just reduce the power of the test. And the level... Glen
Re: Ansari-Bradley dispersion test.
Rich Ulrich wrote: On Sat, 09 Feb 2002 16:59:34 GMT, Johannes Fichtinger wrote: Dear NG! I have been searching for a description of the Ansari-Bradley dispersion test for analysing some psychological research - especially a description of how to use the test. Please, can you tell me how to use it, or show me a link where it is described? Thank you very much in advance. I plugged Ansari-Bradley into a search at www.google.com and there were 287 hits. The first page contained the (aptly named) http://franz.stat.wisc.edu/~rossini/courses/intro-nonpar/text/Specifications_for_the_Ansari_Bradley_Test.html I suggest repeating the search. That also eliminates the pasting problem if your reader has broken the long URL into two lines. A warning, however; the Ansari-Bradley test (and similar tests like the Siegel-Tukey) has some drawbacks: (i) it assumes the locations are identical; (ii) it is less powerful than some alternative tests. If assumption (i) is false, the A-B test may have very little power to detect a difference in variance. Glen
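For what it's worth, the test is available in scipy these days; a minimal sketch, including the common (approximate) workaround for drawback (i) of centring each sample at its own median first (the sample sizes and parameters here are made up):

```python
import numpy as np
from scipy.stats import ansari

rng = np.random.default_rng(1)
x = rng.normal(loc=0.0, scale=1.0, size=40)
y = rng.normal(loc=5.0, scale=3.0, size=40)   # different location AND spread

# Centre each sample at its own median so the equal-location
# assumption is (roughly) satisfied before testing dispersion
stat, p = ansari(x - np.median(x), y - np.median(y))
print(p)   # a small p suggests a real dispersion difference
```

Note the median-centring makes the test only approximately distribution-free, which is a known trade-off.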
Re: Method for determining gaussian distribution
Jennifer Golbeck wrote: I hope someone can help me with this. I have finished a computer science study that examines swarming behavior. My claim is that the swarming algorithm I use produces a gaussian distribution - on a grid, the frequency with which each area is visited is recorded. Graphs of my data look like there is a normal distribution around the center of the area. I'd like to statistically show that it is a gaussian distribution, and I'm not sure how I would do this. I could imagine doing a test on each row and each column to show that all of those are normal. Even for that, I'm not sure what test to use to show that data follows a normal distribution. I feel like this is incredibly basic and I'm just overlooking something I should know... but I need help. Any advice would be really appreciated. It's impossible to do this. You may be able to show it is a (discretised) gaussian analytically, by deriving that from the problem set-up, but you can't demonstrate that it is gaussian just from the output. You can demonstrate that the gaussian is a reasonable model for it. You can demonstrate that the deviations from the gaussian are small. You can demonstrate that the gaussian is in some sense a better model than a variety of plausible alternatives. But you cannot demonstrate that it *is* gaussian from the output. Glen
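To illustrate the "reasonable model" alternative: a goodness-of-fit test can show the output is consistent with a gaussian (or not), which is the most the data alone can give you. A Python sketch with stand-in data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
visits = rng.normal(loc=0.0, scale=1.0, size=500)  # stand-in for the grid counts

# D'Agostino-Pearson omnibus test of normality
stat, p = stats.normaltest(visits)
# A large p means we fail to reject normality: the gaussian is a
# reasonable model. That is evidence of consistency, not proof.
print(p)
```

Run on visibly non-normal data (say, exponential counts), the same test would produce a tiny p-value, demonstrating the deviations rather than ruling them out.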
Re: area under the curve
Dennis Roberts wrote: unless you had a table comparable to the z table for area under the normal distribution ... for EACH different level of skewness ... an exact answer is not possible in a way that would be explainable. Even if you specify the level of skewness, an exact answer is still not possible without specifying more about the distribution. Specifying up to third moments (for example) doesn't pin distributions down very well at all. Glen
Re: How to test f(X , Y)=f(X)f(Y)
Linda wrote: I have 1000 observations of 2 RVs from an experiment. X is the independent variable and Y is the dependent variable. How do I test whether the following statement is true or not? f(X,Y)=f(X)f(Y) You'll probably want to make a few more assumptions than given here. A general approach would be to calculate estimates of f(X) and f(Y), or (more generally still) of F(X) and F(Y). Exactly how you might calculate the estimates of these depends in part on the assumptions you make and the knowledge you have about X and Y. Then some comparison of F(X)F(Y) with F(X,Y) (or f(X)f(Y) with f(X,Y)) would be made over the ranges of X and Y, but again, precisely how you evaluate these depends on the assumptions you make and the knowledge you have about X and Y. For example, if X and Y are nominal categories, you'd use a chi-square test. If there was further information (such as that found in ordered categories, or in continuous variables), you'd want to do other things. Glen
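One way to make the F(X,Y) vs F(X)F(Y) comparison concrete, if X and Y are continuous, is a sup-distance statistic calibrated by permuting Y (permutation enforces independence while keeping both marginals fixed). This is only an illustrative sketch, not a claim that it is the best such test:

```python
import numpy as np

def indep_stat(x, y):
    """Max gap between the empirical joint CDF and the product of the
    empirical marginal CDFs, evaluated at the sample points."""
    fj = np.array([np.mean((x <= xi) & (y <= yi)) for xi, yi in zip(x, y)])
    fx = np.array([np.mean(x <= xi) for xi in x])
    fy = np.array([np.mean(y <= yi) for yi in y])
    return np.max(np.abs(fj - fx * fy))

def perm_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value: how often does shuffled-Y data look as
    dependent as the observed data?"""
    rng = np.random.default_rng(seed)
    obs = indep_stat(x, y)
    perms = [indep_stat(x, rng.permutation(y)) for _ in range(n_perm)]
    return (1 + sum(s >= obs for s in perms)) / (n_perm + 1)

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = x + 0.3 * rng.normal(size=100)   # strongly dependent, for illustration
p_dep = perm_pvalue(x, y)
print(p_dep)   # small: independence clearly rejected
```

As the post says, the joint and marginal estimates come from the same data; the permutation calibration sidesteps that by building the null distribution empirically.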
Re: Unique Root Test - Statistics
Shakti Sankhla wrote: Hi All: This is basically not a SAS problem, but I believe that many of the list members could help. I am looking for information on a statistical topic called the Unique Root Test. Do you mean unit root test? Glen
Re: 95% CI for a sum
Scheltema, Karen wrote: I have 2 independent samples and the standard errors and n's associated with each of them. If a and b are constants, what is the formula for the 95% confidence interval for a(xbar1) + b(xbar2)? Are the sample sizes big enough that you'd be prepared to use the CLT? Glen
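Assuming the samples are independent and large enough for the CLT to apply, the interval is a*xbar1 + b*xbar2 +/- z*sqrt(a^2*se1^2 + b^2*se2^2) with z = 1.96 for 95%. A small sketch with made-up numbers:

```python
import math

def ci_linear_combo(a, xbar1, se1, b, xbar2, se2, z=1.96):
    """CI for a*xbar1 + b*xbar2, assuming independent samples and the CLT."""
    est = a * xbar1 + b * xbar2
    se = math.sqrt(a ** 2 * se1 ** 2 + b ** 2 * se2 ** 2)  # SEs combine in quadrature
    return est - z * se, est + z * se

lo, hi = ci_linear_combo(a=2, xbar1=10.0, se1=0.5, b=3, xbar2=4.0, se2=0.4)
print(lo, hi)   # interval centred on 2*10 + 3*4 = 32
```

With small samples you would swap z for a t critical value (with some approximate degrees of freedom), which is part of why the CLT question matters.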
Re: Buy Book on Probability and statistical inference
Chia C Chong wrote: Hi! I wish to get a book on probability and statistical inference. I wish to get some advice first.. any good suggestions? (i) What do you know already? (ii) What do you need to know about? (iii) What level of mathematics (e.g. how much calculus, linear algebra, etc.) do you have? Glen
Re: Modelling Problem
Alexander Hener wrote: I have a modelling problem where any help would be appreciated. Assume that I want to model a fraction, where the nominator is a sum of, (Do you mean numerator?) say, four continuous random variables. I am thinking of using some parameter-additive distribution there, e.g. the gamma, since the sum in the numerator must not be negative. The denominator should be continuous and positive. Now my questions are: 1. Is anyone aware of distributions which lend themselves to such a model? If the fractions are between zero and one, you may wish to consider the beta distribution for the fraction - if X and Y are independent gamma r.v.s with the same scale parameter, then X/(X+Y) is beta. If X = X1 + X2 + X3 + X4 is your numerator, that would seem to suggest something like a beta at first glance. Glen
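The gamma-to-beta fact is easy to check by simulation; a quick sketch (the shape parameters are chosen arbitrarily):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a, b = 2.0, 5.0
x = rng.gamma(shape=a, scale=1.0, size=100_000)   # same scale for both
y = rng.gamma(shape=b, scale=1.0, size=100_000)
frac = x / (x + y)

# Compare with Beta(a, b): the KS distance should be tiny
d, p = stats.kstest(frac, stats.beta(a, b).cdf)
print(d)
```

And if the four numerator variables were independent gammas with a common scale, their sum would itself be gamma with the shapes added, which is what makes the beta suggestion natural.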
Re: Which one fit better??
Chia C Chong wrote: Glen wrote: Chia C Chong wrote: I plotted a histogram density of my data and its smooth version using the normal kernel function. I tried to plot the estimated PDF (Laplacian / Generalised Gaussian), estimated using the maximum likelihood method, on top as well. Graphically, it seems that the Laplacian will fit the histogram density better, while the Generalised Gaussian will fit the smooth version (i.e. the kernel density version). Imagine that you began with a sample from a Laplacian (double exponential) distribution. What will happen to the central peak after you smooth it with a KDE? The peak does not change significantly... maybe shifted to the left a bit... not too much!! No, I was not talking about your data, since you don't necessarily have a Laplacian - that's what you're trying to decide! Imagine you have data actually from a Laplacian distribution. (It has a sharp peak in the middle, and exponential tails.) Now you smooth it (KDE via gaussian kernel). What happens to the peak? (Assume a typical window width.) [Answer: it gets smoothed, so it no longer looks like a sharp peak.] That's where your impression of a gaussian-looking KDE is probably coming from. Note that the tails of a normal and a Laplace are different, so if those are the two choices, that may help. Glen
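The smoothing effect is easy to see in a quick simulation: the KDE's height at the mode falls well short of the true Laplace peak of 0.5. An illustrative sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.laplace(loc=0.0, scale=1.0, size=5000)

kde = stats.gaussian_kde(sample)          # gaussian kernel, default bandwidth
peak_true = stats.laplace(0, 1).pdf(0)    # = 0.5, the sharp Laplace peak
peak_kde = kde(0.0)[0]
print(peak_true, peak_kde)   # KDE peak noticeably below 0.5
```

The gaussian kernel convolves the sharp peak away, which is exactly Glen's point about why the smoothed version looks more gaussian than the histogram.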
Re: Question on 2-D joint distribution...
Chia C Chong wrote: Hi! I have a series of observations of 2 random variables (say X and Y) from my measurement data. These 2 RVs are not independent and hence f(X,Y) ~= f(X)f(Y), so I can't investigate f(X) and f(Y) separately. I plotted the 2-D kernel density estimate of these 2 RVs, and it looks like a Laplacian/Gaussian/Generalised Gaussian shape on one side, while the other side looks like a Gamma/Weibull/Exponential shape. My intention is to find the joint 2-D distribution of these 2 RVs so that I can represent it by an equation (so that I can regenerate this plot by simulation later on). I wonder whether anyone has come across this kind of problem, and what method I should use?
Re: Standardizing evaluation scores
Stan Brown wrote: But is it worth it? Don't the easy graders and tough graders pretty much cancel each other out anyway? Not if some students only get hard graders and some only get easy graders. If all students got all graders equally often, it probably wouldn't matter at all. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Basics
colsul wrote: Does anyone know of a website that deals with basic statistic formulae and/or business math? Also, I am looking for a text book that could give me a grounding in the basics of statistics, stat. analysis and business maths. I need to cram so I have some idea for a job interview I have coming up. Any help or advice would be very much appreciated. Beware. Spouting crammed-but-not-understood knowledge can make you look like an idiot, which isn't a good thing to appear to be in an interview. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: 10 envelopes, 10 persons
Stan Brown wrote: Problem posed me by a student: ten persons (A through J) and ten envelopes containing cards marked with letters A through J. (Each letter is in one and only one envelope.) The random variable x is the number of people who get the right envelope when the envelopes are handed out randomly. Obviously 0 <= x <= 10. Question: How do we express the probability distribution P(x)? I've done some work on this, and I _must_ be missing something obvious. Here's part of what I've got so far. 10! = number of possible arrangements. Only one of them assigns all ten envelopes to the right people, so P(10) = 1/10! If nine people get the right envelopes, the tenth must also get the right envelope. So P(9) = 0. I bogged down on figuring P(8), though. Then I tried to look at P(0) and got even more bogged down. Am I missing something here? Is there an elegant way to write expressions for the probabilities of the various x's? This is sometimes called the matching problem or the matching experiment. Type "letters envelopes random" into google and you get several relevant hits - e.g. http://www.math.uah.edu/stat/urn/urn6.html http://www.wku.edu/~neal/probability/matching.html either of these might be enough for you to see how to do it. If you put "matching" in there as well it would probably target your search better, but I just wanted to show you that you didn't need more than your own description of the problem to find out lots just with google. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
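[Editor's sketch, not from the original thread: the full distribution follows from counting derangements. P(x = k) = C(n,k) * D(n-k) / n!, where D(m) is the number of permutations of m items with no fixed point.]

```python
from math import comb, factorial

def derangements(m):
    # D(0)=1, D(1)=0, D(m) = (m-1) * (D(m-1) + D(m-2))
    d = [1, 0]
    for i in range(2, m + 1):
        d.append((i - 1) * (d[i - 1] + d[i - 2]))
    return d[m]

def p_matches(n, k):
    # P(exactly k of n envelopes go to the right person):
    # choose which k match, derange the remaining n-k
    return comb(n, k) * derangements(n - k) / factorial(n)

dist = [p_matches(10, k) for k in range(11)]
print(dist)  # note P(9) = 0 and P(10) = 1/10!, as derived in the post
```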
Re: Testing for joint probability between 2 variables
Chia C Chong [EMAIL PROTECTED] wrote in message news:9rn4vc$8v2$[EMAIL PROTECTED]... Glen [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... Chia C Chong [EMAIL PROTECTED] wrote in message news:9rjs94$lht$[EMAIL PROTECTED]... I have 2 variables and would like to test whether these 2 variables are correlated or not. What statistical test should I use? I would guess something like joint pdf tests, but can somebody give me some suggestions to start with? Are the observations numbers or categories, or something else? If they are categorical, are the categories ordered? Are we talking linear correlation or some more general association (e.g. a monotonic relationship)? Are the variables observed over time or space (or otherwise likely to be correlated with themselves)? In essence, what's your model for the variables (if any)? Glen The observations were numbers. To be specific, the 2 variables are DELAY and ANGLE. So, basically I am looking into some raw measurement data captured in the real environment, and after post-processing these data I will have information in these two domains. I do not know whether they are linearly correlated or something else but, by physical mechanisms, there should be some kind of correlation between them. They are observed over the TIME domain. If you're wanting to measure monotonic association, the Spearman correlation has much to recommend it (including high efficiency against the Pearson when the data are bivariate normal - with resulting linear association). If you want to measure linear association, then the Pearson is generally the way to go, though the Spearman is less influenced by extreme observations, so even here it has something to recommend it. 
If you want to measure some more general dependence, then I'm no expert on it, but you may be on the right track trying to estimate the bivariate distribution - perhaps with kernel density estimation, unless you have some more knowledge about the process (the more outside information you can put in, the easier it should be to identify if something is happening). I'd probably suggest not trying to group the data and do a chi-squared measure of association (you're throwing away the ordering, where most of the information will be), except perhaps just as an exploratory technique that's fast. If one of the variables is more like a predictor and the other more like a response, you might consider looking at nonparametric regression approaches (smoothing, basically). Most packages will at least do loess these days. If the variables aren't expected to reasonably fall into a functional-type relationship (maybe all the points lie on an arc that's 3/4 of a circle or something), then you could look at some of the methods that find principal curves. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
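[Editor's sketch, not from the original thread: the Pearson/Spearman distinction above in a few lines. The delay/angle-style data here are invented, with a monotonic but nonlinear link, where Spearman registers the association more strongly than Pearson.]

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(2)
# Hypothetical data: a monotonic but strongly nonlinear relationship
delay = rng.uniform(0, 1, 500)
angle = np.exp(5 * delay) + rng.normal(scale=0.5, size=500)

r, _ = pearsonr(delay, angle)     # measures linear association
rho, _ = spearmanr(delay, angle)  # measures monotonic association

print(r, rho)  # Spearman exceeds Pearson here: the trend is monotone, not linear
```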
Re: Testing for joint probability between 2 variables
Glen Barnett [EMAIL PROTECTED] wrote in message news:9rndu1$gqq$[EMAIL PROTECTED]... I'd probably suggest not trying to group the data and do a chi-squared measure of association (you're throwing away the ordering, where most of the information will be), except perhaps just as an exploratory technique that's fast. Actually, one of the approaches where the chi-square is split into orthogonal components might be a reasonable idea, since that can work quite well: you pick out and test only the components relevant to you (a bit like testing a contrast in ANOVA), so you're not spreading your power over alternatives you don't want power against anyway. I think Rayner and Best's book has some of it, but I believe they have done more on that since. (It relates to the Neyman and Barton smooth tests, which can be shown to partition the chi-square statistic, but that's not the only way to partition it.) But you're still probably better off using some model for the continuous data if you have something appropriate. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Comparing percent correct to correct by chance
Donald Burrill [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... On Sun, 28 Oct 2001, Melady Preece wrote: Hi. I want to compare the percentage of correct identifications (taste test) to the percentage that would be correct by chance, 50% (only two items being tasted). Can I use a t-test to compare the percentages? What would I use for the s.d. for the "by chance" percentage? (0?) Standard comparison would be the formal Z-test for a proportion; see any elementary stats text. If you have a reasonably large sample size, use the normal approximation to the binomial; if you have a small sample, it may be necessary to use the binomial distribution itself, which is considerably more tedious unless you have comprehensive tables. Sounds as though you'd wish to test H0: P = .50 vs. H1: P != .50. I'd kind of expect them to want this one to be one tailed - it would seem strange to be interested in the circumstance where tastebuds do worse than chance (well, it'd be kinky and fun, but would it change your action from "no difference"? I can conceive of it, but I'd bet not.) Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
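[Editor's sketch, not from the original thread: both routes mentioned above - the exact binomial test and the large-sample Z approximation - in one place. The counts (32 correct out of 50) are made up.]

```python
import math
from scipy.stats import binomtest, norm

# Hypothetical taste test: 32 correct identifications out of 50 tastings
n, correct, p0 = 50, 32, 0.5

# Exact one-sided binomial test of H0: p = 0.5 vs H1: p > 0.5
exact = binomtest(correct, n, p0, alternative="greater")

# Large-sample normal approximation: Z = (phat - p0) / sqrt(p0(1-p0)/n)
phat = correct / n
z = (phat - p0) / math.sqrt(p0 * (1 - p0) / n)
approx_p = 1 - norm.cdf(z)

print(exact.pvalue, z, approx_p)
```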
Re: Transformation function for proportions
Rich Ulrich [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... On Wed, 17 Oct 2001 15:50:35 +0200, Tobias Richter [EMAIL PROTECTED] wrote: We have collected variables that represent proportions (i.e., the proportion of sentences in a number of texts that belong to a certain category). The distributions of these variables are highly skewed (the proportions for most of the texts are zero or rather low). So my ... Low proportions, and a lot at zero? I missed the "lot at zero" on first reading - so my other post is nonsense. Rich is right - you can't do anything much about symmetry if you have a large clump at zero. *Any* monotonic increasing transformation will still leave you with a large clump at the bottom. No matter what you do, all those values have to end up at the same place as all the other zeros, right? Why would symmetry be necessary? Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Transformation function for proportions
Rich Strauss [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... However, the arcsin transformation is for proportions (with fixed It's also designed for stabilising variance rather than specifically inducing symmetry. Does it actually produce symmetry as well? denominator), not for ratios (with variable denominator). The "proportion of sentences in a number of texts that belong to a certain category" sounds like a problem in ratios, since the total number of sentences undoubtedly varies among texts. Log transformations work well because they linearize such ratios. Additionally, for small proportions logs are close to logits, so logs are sometimes helpful even if the data really are proportions. Logs also go some way to reducing the skewness and stabilising the variance, though they don't stabilise it as well as the arcsin square root that's specifically designed for it. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Are parametric assumptions important?
Yes [EMAIL PROTECTED] wrote in message [EMAIL PROTECTED]">news:[EMAIL PROTECTED]... Glenn Barnett wrote: One n in Glen. OK, I see what you were getting at - but I still disagree, if it is understood that we are talking about large samples. Your original comment that I was replying to was: (1) normality is rarely important, provided the sample sizes are largish. The larger the less important. And I take some issue with that. I guess it depends on what we mean by large. For large effects, and large samples, you have far more power than you need; the goal is not to get a p-value so small that you need scientific notation to express it! Correct. If the effect is not so large - and many of the people I help deal with pretty modest effects. Large samples don't always save you - even with the distribution under the null hypothesis, let alone power. If the effect is small, efficiency matters; but a fairly small deviation from normality will not have a large effect on efficiency either. Agreed. With an effect small enough to be marginally detectable even with a large sample, it is likely that a *large* deviation from normality will raise much more important questions about which measure of location is appropriate. Yes. For smaller samples, your point holds - with the cynical observation that the times when it would most benefit us to assume normality are precisely the times when we have not got the information that would allow us to do so! I might however quibble that for smaller samples it is risky to assume that asymptotic relative efficiency will be a good indication of relative efficiency for small N. In many cases it is. And if the samples are nice and small, even when it's difficult to do the computations algebraically, we can simulate from some plausible distributions to look at the properties. Or do something nonparametric that has good power properties when the population distribution happens to be close to normal. Permutation tests, for example. 
Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Are parametric assumptions important?
Robert J. MacG. Dawson wrote: Voltolini wrote: Hi, I am a Biologist preparing a class on experiments in ecology, including a short and simple text about how to use and to choose the most common statistical tests (chi-square, t tests, ANOVA, correlation and regression). I am planning to include the idea that testing the assumptions for parametric tests (normality and homoscedasticity) is very important to decide between a parametric test (e.g., ANOVA) or the nonparametric test (e.g., Kruskal-Wallis). I am using the Shapiro-Wilk and the Levene test for the assumption testing but.. It's not that simple. Some points: (1) normality is rarely important, provided the sample sizes are largish. The larger the less important. The a.r.e. won't change with larger samples, so I disagree here. (2) The Shapiro-Wilk test is far too sensitive with large samples and not sensitive enough for small samples. This is not the fault of Shapiro and Wilk; it's a flaw in the idea of testing for normality. The question that such a test answers is "is there enough evidence to conclude that the population is even slightly non-normal?" whereas what we *ought* to be asking is "do we have reason to believe that the population is approximately normal?" Almost. I'd say "Is the deviation from normality so large as to appreciably affect the inferences we're making?", which largely boils down to things like - are our estimates consistent? (the answer will be yes in any reasonable situation) are our standard errors approximately correct? is our significance level something like what we think it is? are our power properties reasonable? You want a measure of the degree of deviation from normality. For example, the Shapiro-Francia test is based on the squared correlation in the normal scores plot, and as n increases, the test detects smaller deviations from normality (which isn't what we want) - but the squared correlation itself is a measure of the degree of deviation from normality, and may be a somewhat helpful guide. 
As the sample size gets moderate to large, you can more easily assess the kind of deviation from normality and make some better assessment of the likely effect. Generally speaking, things like one-way ANOVA aren't affected much by moderate skewness or thin or somewhat thickish tails. With heavy skewness or extremely heavy tails you'd be better off with a Kruskal-Wallis. Levene's test has the same problem, as fairly severe heteroscedasticity can be worked around with a conservative assumption of degrees of freedom - which is essentially costless if the samples are large. In each case, the criterion of "detectability at p=0.05" simply does not coincide with the criterion "far enough off assumption to matter". Correct. (3) Approximate symmetry is usually important to the *relevance* of mean-based testing, no matter how big the sample size is. Unless the sum of the data (or of population elements) is of primary importance, or unless the distribution is symmetric (so that almost all measures of location coincide), you should not assume that the mean is a good measure of location. The median need not be either! (4) Most nonparametric tests make assumptions too. The rank-sum test assumes symmetry; You mean the signed rank test. The rank-sum is the W-M-W... the Wilcoxon-Mann-Whitney and Kruskal-Wallis tests are usually taken to assume a pure shift alternative (which is actually rather unlikely for an asymmetric distribution.) In fact symmetry will do instead; Potthoff has shown that the WMW is a test for the median if distributions are symmetric. If there exists a transformation that renders the populations equally-distributed or symmetric (e.g., the lognormal family) they will work, too. e.g., the test will work for scale shift alternatives (since the - monotonic - log transform would render that as a location shift alternative, but of course a monotonic transformation won't affect the rank structure, so it works with the original data). 
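[Editor's sketch, not from the original thread: the two ideas above side by side - a formal normality test's p-value at large n, versus the Shapiro-Francia-style squared correlation in the normal scores plot as a *measure* of deviation. The t(15) data and sample size are illustrative.]

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Mildly non-normal data: t-distribution with 15 df, largish sample
x = stats.t.rvs(df=15, size=4000, random_state=rng)

# Formal test: with n this large, even a mild deviation tends to be detected
w, pval = stats.shapiro(x)

# Measure of degree: squared correlation in the normal scores (QQ) plot;
# values very close to 1 mean "nearly normal" regardless of the p-value
osm, osr = stats.probplot(x, dist="norm", fit=False)
r2 = np.corrcoef(osm, osr)[0, 1] ** 2

print(pval, r2)  # p may be small while r2 stays near 1
```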
Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Mean and Standard Deviation
Edward Dreyer wrote: A colleague of mine - not a subscriber to this helpful list - asked me if it is possible for the standard deviation to be larger than the mean. If so, under what conditions? Of course - for example, if you analyse mean-corrected data... It can even happen with data that are strictly positive. The log-normal distribution with sigma-squared > ln(2) is an example that has standard deviation larger than the mean; e.g. with sigma-squared = 1, the standard deviation will be about 130% of the mean. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
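[Editor's sketch, not from the original post: the lognormal claim follows from mean = exp(mu + sigma^2/2) and sd = mean * sqrt(exp(sigma^2) - 1), so the sd/mean ratio depends only on sigma^2.]

```python
import math

def lognormal_sd_over_mean(sigma2):
    # For LogNormal(mu, sigma^2) the coefficient of variation is
    # sqrt(exp(sigma^2) - 1), free of mu; it exceeds 1 iff sigma^2 > ln 2.
    return math.sqrt(math.exp(sigma2) - 1.0)

print(lognormal_sd_over_mean(math.log(2)))  # exactly 1: sd equals the mean
print(lognormal_sd_over_mean(1.0))          # about 1.31, i.e. roughly 130%
```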
Re: semi-studentized residual
James Ankeny wrote: Hello, I have a question regarding the so-called semi-studentized residual, which is of the form (e_i)* = ( e_i - 0 ) / sqrt(MSE). Here, e_i is the ith residual, 0 is the mean of the residuals, and sqrt(MSE) means the square root of MSE. Now, if I understand correctly, the population simple linear regression model assumes that the E_i, the error terms, are independent and identically distributed random variables with N(0, sigma^2). My question is, are semi-studentized residuals not fully studentized because MSE is not the variance of all the residuals? Correct. In fact, it probably isn't the variance of any of them, though it will often be reasonably close. It seems like MSE would be the variance of the residuals, unless of course the residuals from the sample data are not independent and identically distributed random variables. Don't confuse errors with residuals. In the model, the error term may be i.i.d., but the residuals (which estimate them) are neither independent nor identically distributed. If not, each residual may have its own variance, in which case we would have to find this and studentize each residual by its own standard error? I am not sure if I am thinking about this in the right way. Also, if the E_i are iid random variables, does this mean that the observations Y_i are iid random variables within a particular level of X? Yes. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
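[Editor's sketch, not from the original post: the point that Var(e_i) = sigma^2 (1 - h_ii), so residuals are not identically distributed, and why the fully studentized residual divides by sqrt(MSE (1 - h_ii)). The regression data are simulated for illustration.]

```python
import numpy as np

rng = np.random.default_rng(4)
n = 30
x = rng.uniform(0, 10, n)
y = 2 + 0.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta                        # residuals (estimates of the errors)
mse = e @ e / (n - 2)

H = X @ np.linalg.inv(X.T @ X) @ X.T    # hat matrix; Var(e_i) = sigma^2 (1 - h_ii)
h = np.diag(H)

semi = e / np.sqrt(mse)                 # semi-studentized: same denominator for all
internal = e / np.sqrt(mse * (1 - h))   # (internally) studentized: each its own s.e.

# The leverages h_ii differ across observations, so the residuals
# are not identically distributed even when the errors are.
print(h.min(), h.max())
```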
Re: E as a % of a standard deviation
John Jackson [EMAIL PROTECTED] wrote in message MGns7.49824$[EMAIL PROTECTED]">news:MGns7.49824$[EMAIL PROTECTED]... re: the formula: n = (Z?/e)2 This formula hasn't come over at all well. Please note that newsgroups work in ascii. What's it supposed to look like? What's it a formula for? could you express E as a % of a standard deviation . What's E? The above formula doesn't have a (capital) E. What is Z? n? e? In other words does a .02 error translate into .02/1 standard deviations, assuming you are dealing w/a normal distribution? ? How does this relate to the formula above? Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Difference between BOX and JENKIN TRANSFER FUNCTION model and
Marg wrote: Greetings. Can anyone tell me the differences between the Box and Jenkins transfer function model and the multiple regression model? Are there any good tutorials or freeware that deal with the Box and Jenkins transfer function model? The basic difference is that the TF model deals both with a) lags in the variables - not just "how is y related to x?" but "how is y(t) related to x(t-k), for various k?" - and b) autocorrelation in the variables. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Normality in Factor Analysis
Robert Ehrlich [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... Calculation of eigenvalues and eigenvectors requires no assumptions. However, evaluation of the results IMHO implicitly assumes at least a unimodal distribution and reasonably homogeneous variance, for the same reasons as ANOVA or regression. So think of the consequences of calculating means and variances of a strongly bimodal distribution where no sample occurs near the mean and all samples are tens of standard deviations from the mean. The largest number of standard deviations all data can be from the mean is 1. To get some data further away than that, some of it has to be less than 1 s.d. from the mean. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
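[Editor's sketch, not from the original post: the extreme case behind Glen's remark. If every point is the same distance from the mean - e.g. a perfectly bimodal two-point sample - that common distance is exactly one (population) standard deviation, never more.]

```python
import numpy as np

# Strongly bimodal sample with nothing near the mean
x = np.array([-5.0] * 50 + [5.0] * 50)
z = (x - x.mean()) / x.std()  # population sd (ddof=0)

# Every point sits exactly 1 sd from the mean; no configuration can
# put *all* the data further out than that.
print(np.abs(z))
```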
Re: Help me, please!
Monica De Stefani [EMAIL PROTECTED] wrote in message [EMAIL PROTECTED]">news:[EMAIL PROTECTED]... 2) Can Kendall discover nonlinear dependence? He used to be able to, but he died. (Look at how Kendall's tau is calculated. Notice that it is not affected by any monotonic increasing transformation. So Kendall's tau measures monotonic association - the tendency of two variables to be in the same order.) Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
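[Editor's sketch, not from the original post: Kendall's tau is unchanged by any monotonic increasing transformation, which is why it measures monotonic (not just linear) association. The data below are invented, with a monotone nonlinear link.]

```python
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = x ** 3 + rng.normal(scale=0.5, size=200)  # monotonic, nonlinear link

tau_raw, _ = kendalltau(x, y)
tau_exp, _ = kendalltau(np.exp(x), y)  # exp() is monotone: same rank order, same tau

print(tau_raw, tau_exp)
```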
Re: Combinometrics
David Heiser [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... We seem to have a lot of recent questions involving combinations, and probabilities of combinations. I am puzzled. Are these concepts no longer taught as a fundamental starting point in stat? I remember all the urn problems and combinations of n taken m at a time, with and without replacement, the lot sampling problems, gaming problems, etc. These were all preliminary, early in the semester (fall). Now to see these questions popping up late in spring? Times may have changed since the 1940s, and perhaps there is more important stuff to teach. Even if times hadn't changed, perhaps some of the posters aren't studying in the US, so their timetable may not match yours. (Right now it's late autumn where I am sitting.) Here in Australia, for example, the school year is the same as the calendar year - high schools will start in early February, universities will mostly start in early March (though it varies some from institution to institution). And not all posters are necessarily at university. However, I'd guess that many stats courses no longer do much combinatorial probability. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: A disarmingly simple conjecture
Giuseppe Andrea Paleologo wrote: I am dealing with a simple conjecture. Given two generic positive random variables, is it always true that the sum of the quantiles (for a given value p) is greater than or equal to the quantile of the sum? In other words, let X, Y be positive random variables with continuous but arbitrary joint CDF F(x,y), and let Z = X + Y, with CDF Fz(z). Let Fx(x) and Fy(y) be the marginal CDFs for X and Y respectively. Is it true that Fx^-1(p) + Fy^-1(p) >= Fz^-1(p) for 0 < p < 1? Any insight or counterexample is greatly appreciated. I am sure this is proved in some textbook, but independently of that, I think this should be doable via elementary methods... I'm sure I've seen it somewhere. It seems obvious for well-behaved cases, and I assume it is true in general, but I must admit my brain is completely not in gear at the moment. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
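[Editor's sketch, not from the original thread: a numerical check of one well-behaved case only - independent Exp(1) variables, whose sum is Gamma(2, 1). This illustrates the inequality in that case; it is not a proof of the general conjecture.]

```python
from scipy import stats

p = 0.9
qx = stats.expon.ppf(p)        # X and Y are i.i.d. Exp(1), so qy = qx
qz = stats.gamma.ppf(p, a=2)   # X + Y ~ Gamma(shape=2, scale=1)

# Sum of the p-quantiles (about 4.61) versus the p-quantile of the sum (about 3.89)
print(qx + qx, qz)
```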
Re: Homework Problem
Michael Scheltgen [EMAIL PROTECTED] wrote in message [EMAIL PROTECTED]">news:[EMAIL PROTECTED]... Suppose X1, X2, X3, and X4 have a multivariate Normal Dist'n with mean vector u, and Covariance matrix, sigma. (a) Suppose it is known that X3 = x3 and X4 = x4. What is: 1)The expected value of X1 2)The expected value of X2 3)The variance of X1 4)The variance of X2 5)The correlation of X1 and X2 My approach was to find the conditional distribution, then designate E[X1] = u1 from the mean vector of the conditional dist'n E[X2] = u2 from the mean vector of the conditional dist'n same with the variance, etc... Is this the correct approach? Thank you very much for your comments :) Looks right to me. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
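[Editor's sketch, not from the original thread: the approach the poster describes, written out. For a multivariate normal partitioned into free and observed blocks, the conditional distribution has mean mu1 + S12 S22^-1 (x2 - mu2) and covariance S11 - S12 S22^-1 S21. The 4x4 numbers below are made up.]

```python
import numpy as np

def mvn_conditional(mu, Sigma, idx_obs, x_obs):
    """Conditional mean and covariance of the remaining components of a
    multivariate normal, given the components in idx_obs equal x_obs."""
    idx_obs = np.asarray(idx_obs)
    idx_free = np.array([i for i in range(len(mu)) if i not in idx_obs])
    S11 = Sigma[np.ix_(idx_free, idx_free)]
    S12 = Sigma[np.ix_(idx_free, idx_obs)]
    S22 = Sigma[np.ix_(idx_obs, idx_obs)]
    K = S12 @ np.linalg.inv(S22)
    cond_mean = mu[idx_free] + K @ (x_obs - mu[idx_obs])
    cond_cov = S11 - K @ S12.T
    return cond_mean, cond_cov

# Hypothetical 4-variate example: condition on X3 = 1.0, X4 = -0.5
mu = np.zeros(4)
Sigma = np.array([[2.0, 0.5, 0.3, 0.1],
                  [0.5, 1.5, 0.2, 0.4],
                  [0.3, 0.2, 1.0, 0.3],
                  [0.1, 0.4, 0.3, 1.0]])
m, C = mvn_conditional(mu, Sigma, [2, 3], np.array([1.0, -0.5]))

# The requested quantities: E[X1], E[X2], Var(X1), Var(X2), Corr(X1, X2)
corr12 = C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])
print(m, C.diagonal(), corr12)
```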
Re: Most Common Mistake In Statistical Inference
W. D. Allen Sr. [EMAIL PROTECTED] wrote in message nH9u6.6370$[EMAIL PROTECTED]">news:nH9u6.6370$[EMAIL PROTECTED]... A common mistake made in statistical inference is to assume every data set is normally distributed. This seems to be the rule rather than the exception, even among professional statisticians. The most common mistake to me seems to be the one where people use the data to answer a question other than the one in which they were interested. Either the Chi Square or S-K test, as appropriate, should be conducted to determine normality before interpreting population percentages using standard deviations. 1) The Chi-square test is effectively useless as a test of normality, since it ignores the ordering in the bins (the binning itself is an additional but relatively smaller effect). 2) A common mistake in inference is to assume, without checking, that a formal hypothesis test of normality followed by a normal-theory procedure will have desirable properties. In practice the first thing to do is to find out how big a deviation from normality you can tolerate with the procedure you have in mind, taking into account not just level but power (if you're testing) or size of confidence intervals (if you're doing point estimation), and so on. If it's large, you are probably safe unless it's obvious your data are drastically non-normal (extreme skewness can be a problem). If it's small, then you should look at a different procedure - either a robust or a nonparametric procedure, for example - or a different assumption. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: help with modelling
Debraj [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... hi, I have a set of data which indicates the number of correct responses on a test (score) for 20 persons. I wanted to know if I can model the same mathematically based on certain factors, say Score = f(factor1, factor2, factor3, factor4), so that I can simulate similar data with different values of the factors. How should I go about this? There are a whole variety of models you might consider. Since the response is the number of correct responses out of 20, you will want some kind of discrete distribution on the range 0-20, presumably with one or more free parameters, at least some of which relate to the factors. For example, one simple model would be the Binomial(20,p), where the probability parameter, p, depends on the factors. It makes some assumptions that may be okay as a first approximation for some kinds of tests, and largely useless in other situations (and I can't tell you which case we have here). Read up firstly on discrete distributions, and then on GLMs; this may give you one starting point. Going back to that binomial model, the way that p depends on the factors is another choice you need to make. If you read about GLMs, look at typical link functions for the binomial. I'm not saying this would be a good model in your case, but it might be a good place to start thinking about the issues. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
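[Editor's sketch, not from the original thread: the Binomial(20, p) model suggested above, with p tied to the factors through a logit link as in a binomial GLM. All names, coefficients, and factor values are hypothetical.]

```python
import numpy as np

rng = np.random.default_rng(6)

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def simulate_scores(factors, coefs, intercept=0.0, n_items=20):
    """Simulate scores out of n_items under a Binomial(n_items, p) model,
    where p depends on the factors through a logit link."""
    eta = intercept + factors @ coefs  # linear predictor
    p = logistic(eta)                  # logit link, as in a binomial GLM
    return rng.binomial(n_items, p)

# 20 hypothetical persons, 4 hypothetical factors
factors = rng.normal(size=(20, 4))
coefs = np.array([0.8, -0.4, 0.2, 0.1])
scores = simulate_scores(factors, coefs, intercept=0.3)
print(scores)
```

Fitting the model to real data (rather than simulating from it) would go the other way: estimate the intercept and coefficients by maximum likelihood, e.g. with a binomial GLM routine.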
Re: accuracy, median or mean
Paul Foran wrote: Is Accuracy measured as sample mean or sample median distance from true value You could define something called accuracy as either of these, or indeed as something else. Is there a particular context you're asking about? It may be that in some areas the term has an accepted definition. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Tests of Statistical Significance
Rich Ulrich [EMAIL PROTECTED] wrote in message [EMAIL PROTECTED]">news:[EMAIL PROTECTED]... Sorry, I am missing it - -- I couldn't quite work it out either. I often have that problem though. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: ANOVA with dichotomous dependent variable
Gerhard Luecke wrote: Can anyone name some references where the problem of using a DICHOTOMOUS variable as a DEPENDENT variable in an ANOVA is discussed? Many thanks in advance, Gerhard Luecke I'd first try logistic regression. If all your variables are categorical, you can look at some of the categorical (contingency table-type) analyses (e.g. loglinear models). Most stats packages will do logistic regression. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: Which book do you recommend?
[EMAIL PROTECTED] wrote: Comments, please, on the relative merits of the standard textbooks: Bickel Doksum Casella Berger Cox Hinkley Or is there some other book that you prefer? This question has been posted before, but nobody responded, so I'm asking again. Surely someone out there has an opinion! Depends on what you want to do with it. I somewhat prefer Casella and Berger to Cox and Hinkley for my purposes, but that's not going to be the same as what you want to use it for. Both are reasonable. I'm not familiar with Bickel and Doksum's book, though if their papers are anything to go by, it should be fairly readable. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
Re: question about binomial distribution
Thomas Souers [EMAIL PROTECTED] wrote in message 17920451.972429742277.JavaMail.imail@slippery">news:17920451.972429742277.JavaMail.imail@slippery... I have a question regarding basic statistics, and while it might seem foolish to some of you, I would greatly appreciate any help: Suppose a variable can assume two values, success ( 1, probability p ) or failure ( 0, probability 1-p ). If n trials are independent and the probability of success remains the same for each trial, then obviously the count of successes in n trials is binomial with E(Y)=np and V(Y)=np(1-p). What I do not understand, and perhaps this doesn't make any sense, but, what distribution does the original binary variable have? Here, we don't consider just the count of successes, Actually, we do; that's where the zero and one come from! With a single trial there can be 0 successes (prob 1-p) or 1 success (prob. p). but rather, the variable with two values. Obviously, you can derive that the original binary variable has mean p and variance p(1-p). But does it make any sense to say that it has a distribution? Yes, of course it makes sense - it's a perfectly ordinary random variable. Sometimes called the Bernoulli distribution, because it's the distribution of the number of successes of a single trial from a Bernoulli process. Obviously it's also a binomial with n=1. Glen = Instructions for joining and leaving this list and remarks about the problem of INAPPROPRIATE MESSAGES are available at http://jse.stat.ncsu.edu/ =
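[Editor's check, not from the original post: the identity stated above - the Bernoulli(p) distribution is exactly Binomial(n=1, p), with mean p and variance p(1-p).]

```python
from scipy.stats import bernoulli, binom

p = 0.3
# Bernoulli(p) and Binomial(1, p) assign the same probability to 0 and 1
for k in (0, 1):
    print(k, bernoulli.pmf(k, p), binom.pmf(k, 1, p))

print(bernoulli.mean(p), bernoulli.var(p))  # p and p(1-p)
```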
Re: probability questions
[EMAIL PROTECTED] wrote: Two probability questions... If X has chi-square distribution with 5 degrees of freedom 1. what is the probability of X > 3 2. what is the probability of X > 3 given that X > 1.1 Homework, right? Glen
Re: More probability
[EMAIL PROTECTED] wrote: A random variable, X, has the Uniform distribution f(x) = 0.4 for a < x < 2.5, 0 otherwise 1. what is a 2. what is the probability 1 < x < 2 given that x > .5 3. what is the median 4. what is c such that P(x > c) = .05 More homework. Glen
Re: consistent statistic
Chuck Cleland wrote: Hello: If I understand the concept correctly, a consistent statistic is one whose value approaches the population value as the sample size increases. I am looking for examples of statistics that are _not_ consistent. The best examples would be statistics that are not computationally complex and could be understood by large and diverse audiences. Also, how can one go about demonstrating the statistic is not consistent thru simulation? thanks for any suggestions, Chuck I've always been fond of this statistic: "7". It is only consistent if the population value also happens to be 7, and it bears no relation whatever to the data, so it isn't affected by sample size. It makes a reasonable second or third example - I wouldn't lead with it. Glen
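One way to demonstrate inconsistency through simulation, as Chuck asks. This sketch (my own setup, not from the thread) uses an estimator slightly less degenerate than "7": the first observation alone as an estimator of the mean. It is unbiased, but its sampling spread never shrinks as n grows, while the sample mean's does.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 5.0

# Two estimators of the population mean mu:
#   consistent:   the sample mean
#   inconsistent: the first observation alone (unbiased, but its
#                 sampling variability does not shrink with n)
for n in (10, 1000):
    samples = rng.normal(mu, 1.0, size=(20000, n))
    mean_spread = samples.mean(axis=1).std()   # shrinks like 1/sqrt(n)
    first_spread = samples[:, 0].std()         # stays near 1 regardless of n
    print(n, round(mean_spread, 3), round(first_spread, 3))
```

Plotting the spreads against n makes the point visually for a diverse audience: one curve heads to zero, the other is flat.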
Re: Skewness and Kurtosis Questions
- Original Message - From: David A. Heiser [EMAIL PROTECTED] To: [EMAIL PROTECTED]; Glen Barnett [EMAIL PROTECTED] Sent: Friday, September 01, 2000 1:13 PM Subject: Re: Skewness and Kurtosis Questions Barnett then goes on... Now, if I delete the two 150's on the end of data set #1 and change the ranges on the formulae, I get a mean of 7.28 and I still get a median of 0. Again, the mean is larger than the median so this should be positively skewed but Excel returns a value of -0.370. It looks like you've constructed just such an example as I mentioned. I have verified Excel's calculations manually and they appear to be correct so it would appear that the commonly used statement that: mean > median: positive, or right-skewness mean = median: symmetry, or zero-skewness mean < median: negative, or left-skewness is incorrect, or, am I overlooking something? It is correct if you measure skewness in terms of mean-median. If you measure it some other way, it is no longer true. Note in particular that zero third central moment does not imply symmetry (contrary to what some books assert). If you use form 1) or form 3) then a zero value represents complete symmetry. (I snipped them, but both forms were moment/cumulant based measures) I'm sorry, but this is wrong. Counterexamples are easy to construct and can be found in the literature. You can even set *all* odd moments to zero and still have non-symmetry. See, for example, Kendall and Stuart. Glen
Re: Skewness and Kurtosis Questions
christopher.mecklin [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... And as far as using EXCEL's help menus as a stat reference, well EXCEL 2000 also claims the following about the two-sample t-test: "You can use t-tests to determine whether two sample means are equal." Just in case any students are reading this and don't realise it, Chris is pointing out that that statement in Excel is nonsense (so other things it tells you are suspect). You can tell when sample means differ just by looking at them. It is for making inferences about populations that some people might use t-tests. Never use Excel help as a source of statistical knowledge! It is worse than nothing in that respect. Glen
Re: Skewness and Kurtosis Questions
Ronny Richardson [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... Several references I have looked at define skewness as follows: mean > median: positive, or right-skewness mean = median: symmetry, or zero-skewness mean < median: negative, or left-skewness You see these kinds of statements quite often in books. They are okay if you *define* skewness as some scaled version of mean-median. Now, if I enter the following data into Excel: -125, -100, -50, -25, -1, 0, 0, 0, 0, 0, 0, 0, 25, 50, 75, 75, 100, 107, 150, 150 You get a mean of 21.55 and a median of 0 so the mean is larger than the median and the data is right-skewed. Excel returns a skewness of 0.028, which is positive but barely so. If I enter the second data set of: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 25, 50, 75, 100, 125 Excel returns a mean of 23.50 and a median of 8.00 so the mean and median are closer together than data set #1 but the skewness value is 2.035, much larger than #1. Why should a mean and median that are closer together generate a skewness measure that is so much larger? Does this mean that the magnitude of the skewness number has no meaning? There are several problems. (i) mean-median is measured in the units of the original data. A skewness measure based on the standardised third central moment (as is commonly used) is unit-free. Double all the numbers in a data set and you double "mean-median", but skewness is unchanged. (ii) there is not necessarily any relationship between the standardised third central moment measure of skewness and a (standardised) mean-median measure of skewness (e.g. [mean-median]/std.dev). It is easy to construct data sets where the third-moment skewness measure has one sign while the mean-median skewness measure has the opposite sign. Now, if I delete the two 150's on the end of data set #1 and change the ranges on the formulae, I get a mean of 7.28 and I still get a median of 0. 
Again, the mean is larger than the median so this should be positively skewed but Excel returns a value of -0.370. It looks like you've constructed just such an example as I mentioned. I have verified Excel's calculations manually and they appear to be correct so it would appear that the commonly used statement that: mean > median: positive, or right-skewness mean = median: symmetry, or zero-skewness mean < median: negative, or left-skewness is incorrect, or, am I overlooking something? It is correct if you measure skewness in terms of mean-median. If you measure it some other way, it is no longer true. Note in particular that zero third central moment does not imply symmetry (contrary to what some books assert). Excel, and another reference I looked at, state that "The peakedness of a distribution is measured by its kurtosis. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution." These are relative to a normal distribution. This statement is also wrong (as pointed out in Kendall and Stuart). Kurtosis (as measured by the standardized fourth central moment, sometimes with 3 subtracted, as would have been intended by the above reference) is a *combination* of peakedness and heavy-tailedness; more specifically, it is a tendency to vary away from the mean +/- 1 std. deviation. If that is the case, what does it mean that data set #1 above has a kurtosis value of zero? It is supposedly of similar peakedness and heavy-tailedness as a normal distribution. I appreciate any comments you can supply. Beware those books! If they get that wrong, what else have they not understood? Fortunately you have had the sense to verify these things for yourself rather than just accept what some book tells you. Kendall and Stuart Vol I may help to clear up some of these issues for you. (Advanced Theory of Statistics. 
Don't be put off by the title - it is quite readable; more so than many books with the word "Introduction" or "Introductory" in the title!) Glen
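The sign disagreement discussed above is easy to check numerically. A sketch (using Python/scipy, which of course the original thread did not) on the exact 18-point data set Ronny describes - data set #1 with the two 150's removed:

```python
import numpy as np
from scipy import stats

# The 18-point data set from the post (data set #1 minus the two 150's):
x = np.array([-125, -100, -50, -25, -1, 0, 0, 0, 0, 0, 0, 0,
              25, 50, 75, 75, 100, 107], dtype=float)

mean, median = x.mean(), np.median(x)
# bias=False gives the adjusted Fisher-Pearson coefficient,
# the same formula Excel's SKEW() uses:
g1 = stats.skew(x, bias=False)

print(round(mean, 2), median, round(g1, 3))
# mean > median, yet the moment-based skewness is negative
```

So the two notions of skewness point in opposite directions on the same data, exactly as claimed.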
Re: transforming ratios
Jeff E. Houlahan [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... A colleague is looking at the relative amounts of two different types of fatty acids (say, fatty acids A and B) that are incorporated in two different types of tissues. He is comparing the ratio of A:B in the two tissues but the data are heteroscedastic. He has tried several transformations but nothing is stabilizing the variance. Is there a transformation that is specifically for ratios (the ratios range from 1:5 to 5:1)? Thanks a lot. The obvious transformation with ratios is logs, but presumably that was already considered. Glen
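A small illustration of why logs are the natural first try for ratio data (my addition, not from the thread): ratios of 1:5 and 5:1 are asymmetric around 1 on the raw scale, but become symmetric around 0 on the log scale, and multiplicative effects become additive.

```python
import math

# 1:5 and 5:1 sit asymmetrically around 1 (0.2 and 5.0),
# but their logs are symmetric around 0:
lo, hi = math.log(1 / 5), math.log(5 / 1)
print(lo, hi)  # equal magnitude, opposite sign
assert abs(lo + hi) < 1e-12

# Multiplicative changes in a ratio become additive shifts in log-ratio:
assert abs(math.log(2 * 1.5) - (math.log(2) + math.log(1.5))) < 1e-12
```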
Re: t-test normality assumption
Bob Hayden wrote: In addition to the approximation involved in using the CLT, most (possibly all) practical situations require that you estimate the population standard deviation with the sample standard deviation in calculating a standard error for use in constructing a confidence interval or doing a hypothesis test. This introduces additional error. Again, the error is small for large samples. For smaller samples, it can be fairly large. The usual way around that problem is to use the t distribution, which you can think of as a modified normal distribution -- the modifications being those needed to exactly offset this source of error. The trouble is, in order to calculate those corrections, we need to know the shape of the population distribution. The corrections incorporated into the t-distribution are those appropriate for a normal distribution. So, when we use the t-distribution, we need to have the population close to normally distributed in order for the usual test statistic to have a t(not z)-distribution. Yes. A lot of people miss the fact that the t-statistic has both a numerator and denominator. The numerator will go to the normal when the CLT holds (but how quickly depends on the distribution). However, the denominator needs to: 1) go to a multiple of the square root of (a chi-squared r.v. / d.f.) 2) be independent of the numerator to give you a t-distribution. In practice these only need to hold closely enough to yield something close to a t-distribution at the sample size you're interested in. This isn't all - even if you get this, you are only getting robustness to the /significance level/. You also want decent power-robustness. That may be a problem for the t in some circumstances; there's not much point in keeping close to the right Type I error rate if you take no account of the Type II error rate. 
There are times when a test of location for which the normality assumption is not required may be less of a risk; the small amount of power you give up when the data are exactly normal (the relative efficiencies are very close to 1) is a tiny price to pay to maintain good efficiency when you move away from the normal. This may be a more-robust version of the t-test, it may be a randomization/permutation test or it may be a rank-based equivalent. Glen
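The power-robustness point can be seen in a rough simulation. This sketch compares the two-sample t-test with the Wilcoxon rank-sum (Mann-Whitney) test under heavy tails; the Cauchy distribution, the shift of 1.0, and the sample sizes are arbitrary choices of mine, not taken from the thread.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, reps, shift = 30, 2000, 1.0
t_rej = w_rej = 0
for _ in range(reps):
    a = rng.standard_cauchy(n)
    b = rng.standard_cauchy(n) + shift  # genuine location difference
    if stats.ttest_ind(a, b).pvalue < 0.05:
        t_rej += 1
    if stats.mannwhitneyu(a, b).pvalue < 0.05:
        w_rej += 1
print("t power:", t_rej / reps, " Wilcoxon power:", w_rej / reps)
```

Under a distribution this heavy-tailed the rank test rejects a real shift far more often than the t-test, while at exactly normal data it would give up only a few percent - the trade-off described above.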
Re: summarizing p-values
[EMAIL PROTECTED] wrote in message news:8mbhrh$fuk$[EMAIL PROTECTED]... Hello from Germany, as a part of my dissertation in medicine, I have to summarize some results of clinical trials. My question: by summarizing the results (percentage differences of certain parameters), how can I account for the different p-values (which are calculated with different tests in the trials)? Is it possible to form something like a weighted mean with the p-values and the sample sizes in the trials to generate a common effect size of the different results in the trials? Thanks for your comments, Marc There exist ways of combining p-values from independent tests. However, they don't generally weight by n, because that's already taken into account in the p-value. (e.g. Fisher's technique of summing -2 log p_i and comparing with a chi-squared distribution with df equal to twice the number of tests.) It sounds, however, like you're attempting a meta-analysis, which other people would be more qualified than I am to explain the various pitfalls and problems of. Glen
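Fisher's technique as described above can be sketched in a few lines. The p-values here are made up purely for illustration; scipy's `combine_pvalues` (not mentioned in the thread) implements the same method and is used as a cross-check.

```python
import math
from scipy import stats

# Fisher's method: for k independent tests,
# X = -2 * sum(log p_i) ~ chi-squared with 2k df under the global null.
pvals = [0.08, 0.12, 0.03]  # invented p-values for illustration
x = -2 * sum(math.log(p) for p in pvals)
combined_p = stats.chi2.sf(x, df=2 * len(pvals))
print(round(combined_p, 4))

# scipy implements the same technique directly:
stat, p2 = stats.combine_pvalues(pvals, method='fisher')
assert abs(p2 - combined_p) < 1e-12
```

Note that each p_i enters unweighted: the sample size of each trial is already reflected in its p-value.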
Re: skewness Kurtosis
jagan mohan [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... Respected Members, Coefficient of Skewness (beta-1) = (3rd moment)^2/(2nd moment)^3 Coefficient of Kurtosis (beta-2) = (4th moment)/(2nd moment)^2. Where do I get proofs for these two? Please let me know about this. You don't prove definitions. Normally people look at the (signed) square root of beta_1 and call that skewness. beta_1 itself doesn't tell you about the direction the data is skewed in. Glen
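A quick demonstration of why beta_1 loses the direction of skew (my own example data, not from the post): squaring the third moment discards its sign, while the signed square root g1 = m3/m2^1.5 keeps it.

```python
import numpy as np

# A clearly left-skewed sample (invented for illustration):
x = np.array([0.0, 6, 7, 8, 9, 10])
m2 = np.mean((x - x.mean()) ** 2)  # 2nd central moment
m3 = np.mean((x - x.mean()) ** 3)  # 3rd central moment (negative here)

beta1 = m3**2 / m2**3   # always >= 0, so direction is lost
g1 = m3 / m2**1.5       # signed square root of beta1: negative, left skew
print(beta1, g1)
```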
Re: Power Function Negative Intercept
Dr. N.S. Gandhi Prasad [EMAIL PROTECTED] wrote in message news:013501bff62b$f871d6e0$[EMAIL PROTECTED]... I have fitted a power function Y = a (X1^b1)*(X2^b2)*(X3^b3) by transforming Y as well as the Xs into logs and following a least squares procedure. However, the estimate of 'a' is found to be negative. Can we accept the results? What Not possible. Perhaps your estimate of *log(a)* is negative? This simply implies that your (median) estimate of a is less than 1. meaning can be attached to 'a'. Here Y is output and Xs are input variables Where? Glen
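The point about the intercept can be illustrated with a tiny synthetic fit (everything here - the data, the true a = 0.5, a single X for brevity - is invented for illustration): the fitted intercept is log(a), which is negative whenever a < 1, yet the implied a = exp(intercept) is always positive.

```python
import numpy as np

# Fit Y = a * X1^b1 by least squares on logs, with true a = 0.5 < 1.
rng = np.random.default_rng(2)
n = 200
x1 = rng.uniform(1, 10, n)
y = 0.5 * x1**1.3 * rng.lognormal(0, 0.1, n)  # multiplicative noise

# log y = log(a) + b1 * log(x1) + error
X = np.column_stack([np.ones(n), np.log(x1)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
log_a, b1 = coef
print(log_a, np.exp(log_a), b1)  # negative intercept, but a-hat > 0
```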
Re: extrapolation
Veeral Patel [EMAIL PROTECTED] wrote in message news:397cfc9a$[EMAIL PROTECTED]... Hi, I have a set of data (25000 samples), i have plotted a histogram, the Wow! How many observations in each sample? Glen
Re: contrasts for Kruskal Wallis
Richard M. Barton [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... Suppose I have 4 groups, and want to compare means. I do a one-way ANOVA using Bonferroni (my choice) contrasts to get at pairwise differences. Suppose I decide that I have non-normality problems and decide to treat the dependent variable as ranks. I can do a Kruskal-Wallis test, or equivalently (I'm 99.9% sure) do a one-way ANOVA Equivalent if you take proper account of the distribution of ranks, yes. on the ranks. Can I then look at the Bonferroni pairwise tests as a reasonable follow-up for looking at where the differences lie (I'm only 75% sure I can)??? Only in a rough sense. There are multiple comparison procedures specifically for the Kruskal-Wallis. See, for example, Neave and Worthington's "Distribution-Free Tests". You might also find something in Conover, but I don't have it to hand, or I'd check. Procedures are given in several books. Glen
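The "Kruskal-Wallis is one-way ANOVA on ranks" equivalence can be checked numerically: the KW statistic H equals (N-1) * SS_between / SS_total computed on the ranks. A sketch with invented data (scipy is my choice of tool, not the thread's):

```python
import numpy as np
from scipy import stats

# Three invented groups, no ties:
groups = [[2.1, 3.4, 5.6, 7.8], [1.2, 4.5, 6.7], [8.9, 9.1, 10.2, 11.3]]

h, p = stats.kruskal(*groups)

# Rebuild H as an ANOVA decomposition on the pooled ranks:
pooled = np.concatenate(groups)
ranks = stats.rankdata(pooled)
N = len(pooled)
grand = ranks.mean()
ss_total = ((ranks - grand) ** 2).sum()
ss_between, i = 0.0, 0
for g in groups:
    r = ranks[i:i + len(g)]
    ss_between += len(g) * (r.mean() - grand) ** 2
    i += len(g)

assert abs(h - (N - 1) * ss_between / ss_total) < 1e-10
print(h, p)
```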
Re: Recommendation?
Michael Atherton [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... I will be applying for faculty positions in Education this year and I was wondering if any one can recommend departments where alternative views on education (i.e., non-constructivist) are encouraged or supported. This is a stats newsgroup - sci.STAT.edu This is not a group for discussion of education in general, but of statistical education. Glen
Re: Skewness: is < 1 Normal? Says Who?
Donald Burrill [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED]... On Thu, 6 Jul 2000, John Nash wrote (to the AERA-D list): Many of us operate under the following assumption: For |skewness coefficient| < 1, data is considered to be normally distributed. Well. A normal distribution has skewness = 0; but I presume you know that. Skewness only addresses the issue of symmetry, not other aspects of the shape of a distribution. Presumably the rule-of-thumb you state must be invoked along with some other rules, since (as other respondents have pointed out) skewness < 1 (or any other arbitrary value) will not filter out U-shaped or rectangular or triangular or multimodal distributions, none of which could be reasonably described as "normal". I take it then that you do not really mean to claim that "If |skewness| < 1, the data are normally distributed.", since the antecedent is not sufficient for the consequent. Probably the "rule" in its original form was more like this: "If |skewness| > 1, the data are NOT normally distributed." Or, somewhat more precisely, "If |skewness| > 1, the null hypothesis that the data are a random sample from a normally distributed population can be rejected." In that form, the rule presented can be investigated a bit further. Using one or more of the techniques mentioned in other responses, under what conditions (for openers, how large must the sample be?) would that null hypothesis be rejected when |skewness| > 1? Indeed - for small samples from a normal distribution, sample skewness (based on the standardized 3rd central moment, I am assuming) can easily exceed 1 in absolute value. This means that without bringing sample size into your rule, you aren't controlling your significance level. 
If you are only interested in skewed alternatives, the sample skewness can be a pretty powerful test of normality (the idea effectively dates back to Karl Pearson in the 19th century), but - even if we choose our rejection rule so we have some idea of our significance level - it is useless at picking up any non-normal distribution with low third central moment. Even some non-symmetric distributions have zero third central moment! A good place to pursue this is the book on goodness of fit tests by D'Agostino and Stephens. IIRC Kendall and Stuart (vol II) has some stuff on it as well. Glen
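How often does |sample skewness| exceed 1 when the data really are normal? A quick simulation sketch (sample size n=10 and the rep count are arbitrary choices of mine) shows the "rule" rejects genuinely normal samples a non-trivial fraction of the time at small n, which is the significance-level problem noted above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 10, 20000
samples = rng.normal(size=(reps, n))      # every sample IS normal
skews = stats.skew(samples, axis=1)       # sample skewness of each
frac = np.mean(np.abs(skews) > 1)         # how often the "rule" rejects
print(frac)
```

Rerunning with larger n shows the fraction dropping, which is exactly why the rule cannot work without bringing sample size into it.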
Re: this list
Rich Ulrich wrote: [...] - I agree with that. - and here is something that I read today on another group, which is directly about the problem of protesting about posters who annoy you. Dealing with Chambers is easy - people like that infest most of usenet. If you have killfiles, *plonk*. Even if you don't have a newsreader with killfiles, you don't *have* to read a post when you see who it is from. His posts don't change, just the people he chooses to insult - why keep reading? Nobody makes you read every post. On any newsgroup with a person like that, I just ignore any thread from the moment they post to it (Terry Austin is a prime example on some other groups I read) - all posts to that thread after the person in question has posted are contaminated and will contain no useful information. I have more important stuff to read. Usenet is much more of a joy these days. Glen
Re: Disadvantage of Non-parametric vs. Parametric Test
Frank E Harrell Jr wrote: Alex Yu wrote: Disadvantages of non-parametric tests: Losing precision: Edgington (1995) asserted that when more precise measurements are available, it is unwise to degrade the precision by transforming the measurements into ranked data. Edgington's comment is off the mark in most cases. The efficiency of the Wilcoxon-Mann-Whitney test is 3/pi (0.96) with respect to the t-test IF THE DATA ARE NORMAL. If they are non-normal, the relative efficiency of the Wilcoxon test can be arbitrarily better than the t-test. Likewise, Spearman's correlation test is quite efficient (I think the efficiency is 9/pi^2) relative to the Pearson r test if the data are bivariate normal. Where you lose efficiency with nonparametric methods is with estimation of absolute quantities, not with comparing groups or testing correlations. The sample median has efficiency of only 2/pi against the sample mean if the data are from a normal distribution. Yes, the median is inefficient at the normal. This is the location estimator corresponding to the sign test in the one-sample case. But if you use the location estimator corresponding to the signed-rank test (say) instead, the efficiency improves substantially. Glen
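The 2/pi figure for the median's efficiency at the normal is easy to check by simulation. This sketch (my own, not from the thread; sample size and rep count are arbitrary) compares the sampling variances of the mean and median for normal data.

```python
import numpy as np

# Asymptotically, Var(median) ~ (pi/2) * sigma^2 / n for normal data,
# so Var(mean)/Var(median) should be near 2/pi ~ 0.637.
rng = np.random.default_rng(4)
n, reps = 200, 40000
samples = rng.normal(size=(reps, n))
var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()
are = var_mean / var_median
print(are)  # should land close to 2/pi
```

Swapping the normal generator for a heavy-tailed one (e.g. Laplace or t with few df) reverses the comparison, which is the other half of Frank's point.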
Re: Disadvantage of Non-parametric vs. Parametric Test
Rich Ulrich wrote: - In my vocabulary, these days, "nonparametric" starts out with data being ranked, or otherwise being placed into categories -- it is the infinite parameters involved in that sort of non-reversible re-scoring which earns the label, nonparametric. (I am still trying to get my definition to be complete and concise.) Well, I am happy for you to use this definition of nonparametric now that you've said what you want it to mean, but it isn't exactly what most statisticians - including those of us that distinguish between the terms "distribution-free" and "nonparametric" - mean by "nonparametric", so you'll have to excuse my earlier ignorance of your definition. If my recollection is correct, a parametric procedure is where the entire distribution is specified up to a finite number of parameters, whereas a nonparametric procedure is one where the distribution can't be/isn't specified with only a finite number of unspecified parameters. This typically includes the usual distribution-free procedures, including many rank-based procedures, but it also includes many other things - including some that don't transform the data in any way, and even some based on means. So, for example, ordinary simple linear regression is parametric, because the distribution of y|x is specified, up to the value of the parameters specifying the intercept and slope of the line, and the variance about the line. Nonparametric regression (as the term is typically used in the literature), by contrast, is effectively infinite-parametric, because the distribution of y|x doesn't depend only on a finite number of parameters (often the distribution *about* E[y|x] is parametric - typically gaussian - but E[y|x] itself is where the infinite-parametric part comes from). Nonparametric regression would not seem to fit your definition of "nonparametric", since your usage seems to require some loss of information through ranking or categorisation. 
Once we start using the same terminology, we tend to find the disagreements die down a bit. Glen
Re: Disadvantage of Non-parametric vs. Parametric Test
Alex Yu wrote: Disadvantages of non-parametric tests: Losing precision: Edgington (1995) asserted that when more precise measurements are available, it is unwise to degrade the precision by transforming the measurements into ranked data. So this is an argument against rank-based nonparametric tests rather than nonparametric tests in general. In fact, I think you'll find Edgington highly supportive of randomization procedures, which are nonparametric. Indeed, surprising as it may seem, a lot of the location information in a two-sample problem is in the ranks. Where you really start to lose information is in ignoring ordering when it is present. Low power: Generally speaking, the statistical power of non-parametric tests is lower than that of their parametric counterparts except on a few occasions (Hodges & Lehmann, 1956; Tanizaki, 1997). When the parametric assumptions hold, yes. e.g. if you assume normality and the data really *are* normal. When the parametric assumptions are violated, it isn't hard to beat the standard parametric techniques. However, frequently that loss is remarkably small when the parametric assumption holds exactly. In cases where they both do badly, the parametric may outperform the nonparametric by a more substantial margin (that is, when you should use something else anyway - for example, a t-test outperforms a WMW when the distributions are uniform). Inaccuracy in multiple violations: Non-parametric tests tend to produce biased results when multiple assumptions are violated (Glass, 1996; Zimmerman, 1998). Sometimes you only need one violation: Some nonparametric procedures are even more badly affected by some forms of non-independence than their parametric equivalents. Testing distributions only: Further, non-parametric tests are criticized for being incapable of answering the focused question. 
For example, the WMW procedure tests whether the two distributions are different in some way but does not show how they differ in mean, variance, or shape. Based on this limitation, Johnson (1995) preferred robust procedures and data transformation to non-parametric tests. But since WMW is completely insensitive to a change in spread without a change in location, if either were possible, a rejection would imply that there was indeed a location difference of some kind. This objection strikes me as strange indeed. Does Johnson not understand what WMW is doing? Why on earth does he think that a t-test suffers any less from these problems than WMW? Similarly, a change in shape sufficient to get a rejection of a WMW test would imply a change in location (in the sense that the "middle" had moved, though the term 'location' becomes somewhat harder to pin down precisely in this case). e.g. (use a monospaced font to see this): :. .: ::. = .:: ... ... a b a b would imply a different 'location' in some sense, which WMW will pick up. I don't understand the problem - a t-test will also reject in this case; it suffers from this drawback as well (i.e. they are *both* tests that are sensitive to location differences, insensitive to spread differences without a corresponding location change, and both pick up a shape change that moves the "middle" of the data). However, if such a change in shape were anticipated, simply testing for a location difference (whether by t-test or not) would be silly. Nonparametric (notably rank-based) tests do have some problems, but making progress on understanding just what they are is difficult when such seemingly spurious objections are thrown in. His preference for robust procedures makes some sense, but the preference for (presumably monotonic) transformation I would see as an argument for a rank-based procedure. e.g. 
Let's say we are in a two-sample situation, and we decide to use a t-test after taking logs, because the data are then reasonably normal... in that situation, the WMW procedure gives the same p-value as for the untransformed data. However, let's assume that the log-transform wasn't quite right... maybe not strong enough. When you finally find the "right" transformation to normality, you finally get an extra 5% (roughly) efficiency over the WMW you started with. Except of course, you never know you have the right transformation - and if the distribution the data are from is still skewed/heavy-tailed after transformation (maybe they were log-gamma to begin with or something), then you still may be better off using WMW. Do you have a full reference for Johnson? I'd like to read what the reference actually says. Glen
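The claim above - that WMW gives the same p-value before and after taking logs - follows because the test depends only on the ranks, which any monotone transformation preserves. A quick check with invented lognormal data (scipy is my choice of tool):

```python
import numpy as np
from scipy import stats

# Two invented lognormal samples with a genuine shift on the log scale:
rng = np.random.default_rng(5)
a = rng.lognormal(0.0, 1.0, 25)
b = rng.lognormal(0.5, 1.0, 25)

# WMW depends only on ranks, and log() preserves ranks,
# so the p-values are identical, not merely close:
p_raw = stats.mannwhitneyu(a, b).pvalue
p_log = stats.mannwhitneyu(np.log(a), np.log(b)).pvalue
assert p_raw == p_log
print(p_raw)
```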
Re: Sample size and non-parametric test
boonlert wrote: Dear All Can I use a non-parametric test for a sample size less than 30 (central limit theorem) Sorry, but (i) what does the central limit theorem have to do with any of this? (ii) for that matter, what does a sample size of 30 really have to do with the central limit theorem in any case? The rate at which the CLT can be regarded as having kicked in sufficiently depends on what the sampling distribution is (sometimes a very small n is enough, sometimes a very large n isn't enough), and what purpose you're wanting to use the theorem for. regardless the scale, nominal or ordinal scale, requirement? Not sure what this sentence is asking. If I can, what is the priority concern for using non-parametric test whether sample size or measurement scales? I'm not sure what you are asking about. Could you please write in shorter sentences, because you seem to be assuming stuff that isn't necessarily true. About all I can glean from what you've written is that you have some concern about sample size and measurement scale for some (unspecified) nonparametric procedure or procedures. However it isn't at all clear which ones you care about, nor what the concern actually is. I will say that for almost all nonparametric procedures in common use, the tables usually go down to very small numbers - there is generally no minimum sample size (except as required to actually calculate the quantities involved). Note also that some nonparametric procedures may only be suitable for some measurement scales, but this has nothing to do with the sample size AFAIK. Glen