Re: Question on Conditional PDF

2002-02-25 Thread Glen Barnett


Chia C Chong [EMAIL PROTECTED] wrote in message
news:a5d38d$63e$[EMAIL PROTECTED]...


 Glen [EMAIL PROTECTED] wrote in message
 [EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
  Do you want to make any assumptions about the form of the conditional,
  or the joint, or any of the marginals?

 Well, X & Y are dependent and hence they are being described by a joint
 PDF.

This much is clear.

 I am not sure what other assumption I can make though..

I merely thought you might have domain-specific knowledge of the variables and
their likely relationships which might inform the choice a bit (cut down the
space of possibilities).

Can you at least indicate whether any of them are restricted to be positive?

Glen




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: detecting outliers in NON normal data ?

2002-02-25 Thread Glen Barnett

Voltolini wrote:
 
 Hi,
 
 I would like to know if methods for detecting outliers
 using interquartile ranges are indicated for data with
 a NON normal distribution.
 
 The software Statistica presents this method:
 data point value > UBV + o.c.*(UBV - LBV)
 data point value < LBV - o.c.*(UBV - LBV)
 
 where UBV is the 75th percentile, LBV is the 25th percentile, and o.c. is
 the outlier coefficient.

The values of the outlier coefficient are traditionally chosen by reference
to some percentile of the normal distribution. (If anyone didn't recognise it,
this is just the outliers on a boxplot.)

If you choose that coefficient in some appropriate way, then it may be
reasonable for non-normal data.
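
As a concrete illustration of the rule quoted above, here is a minimal sketch
in Python with numpy (the coefficient of 1.5 is the conventional boxplot
default, not a value taken from the post):

import numpy as np

def iqr_outliers(x, oc=1.5):
    """Flag points outside the boxplot-style fences.

    UBV/LBV are the 75th/25th percentiles; oc is the outlier
    coefficient (1.5 is the conventional boxplot choice).
    """
    x = np.asarray(x, dtype=float)
    lbv, ubv = np.percentile(x, [25, 75])
    spread = ubv - lbv                      # the interquartile range
    lower = lbv - oc * spread
    upper = ubv + oc * spread
    return (x < lower) | (x > upper)

# Example: a skewed (non-normal) sample with one wild value added
rng = np.random.default_rng(0)
sample = np.append(rng.exponential(scale=1.0, size=200), 25.0)
print(sample[iqr_outliers(sample)])

With skewed data like the exponential sample above, the default coefficient
flags perfectly legitimate tail points, which is exactly why the choice of
coefficient matters for non-normal data.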

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Cauchy PDF + Parameter Estimate

2002-02-25 Thread Glen Barnett

Herman Rubin wrote:
 
 In article a5daqb$72k$[EMAIL PROTECTED],
 Chia C Chong [EMAIL PROTECTED] wrote:
 Hi!
 
 Does anyone come across some Matlab code to estimate the parameters for the
 Cauchy PDF?? Or some other sources about the method to estimate their
 parameters??
 
 What is so difficult about maximum likelihood?  Start with a
 reasonable estimator, and use Newton's method.

There are difficulties with Newton's method (and many other hill-climbing
techniques) because the Cauchy likelihood function is generally multimodal.

You can end up somewhere other than the MLE unless you use a somewhat more
sophisticated starting point than a reasonable estimator. There are good
estimators that can start you off very close to the true maximum, but it's
a long time since I've seen that literature, so I can't name names right now.
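
For what it's worth, here is a rough sketch of the sort of thing being
discussed, in Python with scipy. Starting from the sample median and half the
interquartile range, and using a derivative-free optimiser rather than
Newton's method, are choices made purely for illustration, not recommendations
from the thread:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import cauchy

def fit_cauchy(x):
    """Maximise the Cauchy likelihood from a robust starting point."""
    x = np.asarray(x, dtype=float)
    loc0 = np.median(x)                                       # robust location start
    scale0 = 0.5 * np.subtract(*np.percentile(x, [75, 25]))   # half the IQR
    def negloglik(params):
        loc, log_scale = params
        return -np.sum(cauchy.logpdf(x, loc=loc, scale=np.exp(log_scale)))
    res = minimize(negloglik, x0=[loc0, np.log(scale0)], method="Nelder-Mead")
    loc, log_scale = res.x
    return loc, np.exp(log_scale)

rng = np.random.default_rng(1)
data = cauchy.rvs(loc=3.0, scale=2.0, size=500, random_state=rng)
print(fit_cauchy(data))   # should land near (3, 2)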

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: What is an outlier ?

2002-02-25 Thread Glen Barnett

Voltolini wrote:
 
 Hi,
 
 My doubt is: can an outlier be a LOW data value in the sample (and not
 just the highest)?
 
 Several textbooks don't make this clear!!!

What makes an outlier an outlier is your model. If your model accounts
for all the observations, you can't really call any of them an outlier.
If your model adequately accounts for all but one or two unusual
observations, you might regard those as coming from some process other
than the one that generated the data your model accounts for, and call them
outliers.

Such not-adequately-accounted-for observations may be low observations,
or high observations, or they may actually turn out to be somewhere in the
middle of the range of your data - as I have seen with time series, for
example, where in some applications an autoregressive model was a very good
description of a long series, apart from a few outliers in the first
quarter or so of the time period (which did in the end turn out to have
come from a different process, because the protocol wasn't always being
properly followed early on). Two of those outliers - in the sense that
the model didn't adequately account for them - turned out to be neither
particularly high nor low observations - but they were substantially
higher or lower than expected from the model.

Another case where you might have outliers in the middle of your data
is in a regression context, where a generally increasing relationship
shows a tight, gaussian-looking random scatter about the relationship,
but with a couple of relatively low y-values at some of the higher
x-values. The observations themselves may actually be very close to the
mean of the y's, but the model of the relationship makes them unusual.
A different model - for example, one where the observations come from a
distribution which has the same expectation as a function of x, but
which has a heavier tail to the left around that - might account for all
the data and not find any outliers.

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Question on CDF

2002-02-22 Thread Glen Barnett


Henry [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 On Fri, 22 Feb 2002 08:55:42 +1100, Glen Barnett
 [EMAIL PROTECTED] wrote:

 Bob [EMAIL PROTECTED] wrote in message
 [EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
  A straight line CDF would imply the data is uniformly distributed,
  that is, the probability of one event is the same as the probability
  of any other event.  The slope of the line would be the probability of
  an event.
 
 I doubt that - if the data were distributed uniformly on [0,1/2), say, then
 the slope of the line would be 2!

 I suspect he meant probability density.

I guess that's actually correct - the slope of the pdf is zero. However, I'm
fairly certain that's not what he meant.

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Question on CDF

2002-02-22 Thread Glen Barnett


Henry [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 I was trying to suggest that he meant the slope of the CDF was the
 height of the PDF.

Oh, okay. Yes, that would be correct, but it shouldn't be called probability!

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Question on CDF

2002-02-21 Thread Glen Barnett


Bob [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 [EMAIL PROTECTED] (Linda) wrote in message
news:[EMAIL PROTECTED]...
  Hi!
 
  If I plot the CDF of some sample data and this CDF looks like a straight line
  crossing through 0, what does this imply?? Normally, a CDF will not look
  like a straight line but something like an S shape, isn't it??
 
  Linda

 A straight line CDF would imply the data is uniformly distributed,
 that is, the probability of one event is the same as the probability
 of any other event.  The slope of the line would be the probability of
 an event.

I doubt that - if the data were distributed uniformly on [0,1/2), say, then
the slope of the line would be 2!
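
A quick numeric check of that point, in Python with numpy (purely
illustrative):

import numpy as np

rng = np.random.default_rng(0)
u = np.sort(rng.uniform(0.0, 0.5, size=10_000))   # uniform on [0, 1/2)
ecdf = np.arange(1, u.size + 1) / u.size          # empirical CDF heights

# The (straight-line) empirical CDF has slope roughly 1/0.5 = 2,
# which is the density, not a probability.
print(round(np.polyfit(u, ecdf, 1)[0], 2))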

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: How to test whether f(X,Y)=f(X)f(Y) is true??

2002-02-20 Thread Glen Barnett


Linda [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Hi!

 I have some experimental data collected which can be grouped into 2
 variables, X and Y. One is the dependent variable (Y) and the other is
 an independent variable (X). What test should I do to check whether
 they can be treated as independent or not??


There are so many ways variables can fail to be independent that a truly
general test usually won't have good power against specific alternatives.

Essentially you'd need to estimate f(Y|X) somehow and compare it to f(Y) (also
estimated somehow). I have no advice on the best way to tackle the test, since
it depends on how you do the estimation (and you need to keep in mind that
since the two distributions are estimated from the same data, they are not
independent).

If X & Y are categorical, there are a number of general tests of independence, of
which the usual Pearson chi-squared test of independence is the best known.

It's much better if you can specify the kind of alternatives you care about
most, and the more specific the better. For example, one thing that would help
to nail it down a little would be to say you only care about a relationship in
the mean - i.e. you need to detect if E(Y|X) differs from E(Y). This is still
very general, but it's better. If you're only interested in monotonic
relationships, it's easier still.

But you need to clarify what you require.
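
For the categorical case mentioned above, here is a minimal sketch in Python
(assuming scipy is available; the table of counts is made up purely for
illustration):

import numpy as np
from scipy.stats import chi2_contingency

# Rows = levels of X, columns = levels of Y (illustrative counts only)
table = np.array([[30, 10,  5],
                  [20, 25, 10],
                  [10, 15, 25]])

stat, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {stat:.2f}, df = {dof}, p = {p_value:.4f}")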

Glen




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Chi-square chart in Excel

2002-02-20 Thread Glen Barnett

Ronny Richardson wrote:
 
 Can anyone tell me how to produce a chart of the chi-square distribution in
 Excel? (I know how to find chi-square values but not how to turn those into
 a chart of the chi-square curve.)
 
 Ronny Richardson
 

I assume you want the pdf, not the cdf.

Set up a column of x's (e.g. 0,0.2, 0.4, ...), and beside it set up a
column of pdf values (type in the pdf for the chisq you're after as a
function of x):

For m d.f.:
1/[Gamma(m/2)*2^(m/2)]*x^(m/2-1)*exp(-x/2)


(In Excel you'll need exp(gammaln()) because it doesn't have a Gamma
function.)

Note that you can set up m in a cell, so you can play around with the
d.f. and see what it does to the curve.

So now you have 2 columns you can plot. Click on the chart icon, choose
the XY(scatter) plot option, pick either the joined with lines or joined
with a curve pictures (without the points marked - either of the
rightmost plots there).

Choose any other options you need, and there you go.
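
If you want to check the column of pdf values outside Excel, here is a small
sketch in Python with numpy/scipy (5 d.f. is just an arbitrary example):

import numpy as np
from scipy.special import gammaln
from scipy.stats import chi2

m = 5                                   # degrees of freedom
x = np.arange(0.2, 15.0, 0.2)

# The same formula as the spreadsheet, using exp(gammaln()) for Gamma
manual = np.exp(-gammaln(m / 2) - (m / 2) * np.log(2)
                + (m / 2 - 1) * np.log(x) - x / 2)

print(np.allclose(manual, chi2.pdf(x, df=m)))   # True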

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Normalization procedures

2002-02-20 Thread Glen Barnett

Niko Tiliopoulos wrote:
 
 Hello everybody,
 
 Has anybody heard of the Bell-Doksum test? 

IIRC it's like a Wilcoxon 2-sample test, except that the ranks are
transformed to normal scores. If that's the right test, it has ARE 1 vs
the t-test (it has good power for small deviations), but as you move to
larger deviations, its power curve flattens out short of 1.

Checking the internet:
...
9. Bell, C. B.; Doksum, K. A. Some new distribution-free statistics.
Ann. Math. Statist 36 1965 203--214.
...
12. Bell, C. B.; Doksum, K. A. Optimal one-sample distribution-free
tests and their two-sample extensions. Ann. Math. Statist. 37 1966
120--132.
...
it would just about have to be one of these two papers.
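
If it is indeed the normal-scores idea, here is a rough sketch of that flavour
of test in Python with scipy. It transforms the pooled ranks to van der
Waerden-type normal scores and compares the group means using a normal
approximation; it is only meant to illustrate the idea, not to reproduce Bell
and Doksum's exact procedure:

import numpy as np
from scipy.stats import norm, rankdata

def normal_scores_test(x, y):
    """Two-sample test on van der Waerden-type normal scores."""
    pooled = np.concatenate([x, y])
    n = pooled.size
    scores = norm.ppf(rankdata(pooled) / (n + 1))   # ranks -> normal scores
    sx = scores[:len(x)]
    # Finite-population normal approximation for the mean score in group x
    mean_all, var_all = scores.mean(), scores.var(ddof=1)
    se = np.sqrt(var_all * len(y) / (len(x) * n))
    z = (sx.mean() - mean_all) / se
    return z, 2 * norm.sf(abs(z))

rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, size=30)
b = rng.normal(0.5, 1.0, size=35)
print(normal_scores_test(a, b))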


Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which is faster? ziggurat or Monty Python (or maybe something else?)

2002-02-19 Thread Glen Barnett


Ian Buckner [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Glen Barnett [EMAIL PROTECTED] wrote in message
 [EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
  Ian Buckner wrote:
  
   We generate pairs of properly distributed Gaussian variables at
   down to 10nsec intervals, essential in the application. Speed can
   be an issue, particularly in real time situations.
 
  Generated on what? (On a fast enough machine, even clunky old
  Box-Muller can probably give you that rate.)

 Generated on custom silicon (surprise).
 Box-Muller does not work for real time requirements.

Of course it does, if the machine is fast enough that you're getting them at
the rate you need.

And the reason you're getting them fast is that you have a fast machine - which
is not much help if the machine is a given.

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Numerical recipes in statistics ???

2002-02-19 Thread Glen Barnett


Charles Metz [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 The Truth wrote:

   I suppose I should have been more clear with my question. What
   I essentially require is a textbook which presents algorithms
   like Monte Carlo, Principal Component Analysis, Clustering
   methods, MANOVA/MANACOVA methods etc. and provides source code
   (in C , C++ or Fortran) or pseudocode together with short
   explanations of the algorithms.

 Although it doesn't contain much code/pseudocode, I highly recommend
 'Elements of Statistical Computing: Numerical Computation,' by Ronald A.
 Thisted (New York and London: Chapman and Hall, 1988).  To the best of
 my knowledge, this is as close to a statistics version of 'Numerical
 Recipes' as you'll find.

Thisted's book is quite good.

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Numerical recipes in statistics ???

2002-02-19 Thread Glen Barnett


The Truth [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Glen Barnett [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
  The Truth wrote:
  
   Are there any Numerical Recipes like textbook on statistics and
probability ?
   Just wondering..
 
  What do you mean, a book with algorithms for statistics and probability
  or a handbook/cookbook list of techniques with some basic explanation?
 
  Glen


 I suppose I should have been more clear with my question. What I
 essentially require is a textbook which presents algorithms like Monte
 Carlo, Principal Component Analysis, Clustering methods,
 MANOVA/MANACOVA methods etc. and provides source code (in C , C++ or
 Fortran) or pseudocode together with short explanations of the
 algorithms.

There are books on statistical computing that cover some algorithms (usually
with pseudocode rather than actual source code), but to cover all of statistics
is not possible. The particular subset you suggest above is not all covered in
any one book I have seen.

You should be able to find books that cover some Monte Carlo techniques and
regression and maybe bootstrapping and a few other basic techniques - stuff
that goes somewhat beyond what's in NR, but not nearly as far as you seem to be
after.

You can find code for many of these things (and much more besides) in journals
like JRSS C (Applied Statistics), and a few others (e.g. ACM Transactions on
Mathematical Software). A lot of these algorithms are on the Internet.

Glen




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which is faster? ziggurat or Monty Python (or maybe something else?)

2002-02-19 Thread Glen Barnett


Herman Rubin [EMAIL PROTECTED] wrote in message
news:a4u99j$[EMAIL PROTECTED]...
 In article [EMAIL PROTECTED],
 Radford Neal [EMAIL PROTECTED] wrote:
 Box-Muller does not work for real time requirements.

 This isn't true, of course.  A real time application is one where
 one must guarantee that an operation takes no more than some specified
 maximum time.  The Box-Muller method for generating normal random
 variates does not involve any operations that could take arbitrary
 amounts of time, and so is suitable for real-time applications.

 This assumes that the time needed for Box-Muller is small enough,
 which will surely often be true.  If the time allowed is very small,
 then of course one might need to use some other method.

 Rejection sampling methods would not be suitable for real-time
 applications, since there is no bound on how many points may be
 rejected before one is accepted, and hence no bound on the time
 required to generate a random normal variate.

Radford Neal

 Acceptance-rejection, or the usually faster acceptance-replacement,
 methods are, strictly speaking, not real time.  However, they
 may be much faster 99.99% of the time.

In that circumstance, could one not generate more values than required each
call (say an extra one, assuming there's time), and store the extras up for the
rare case where it's looking like it will take too long? You could take enough
that the probability you exhaust them is smaller than say the probability a
cosmic ray will flip a crucial bit in your hardware. You'd need a few generated
at the start, of course.
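
A toy sketch of that buffering idea in Python (the accept-reject target here
is a stand-in, not any particular generator from the thread):

import random
from collections import deque

class BufferedRejectionSampler:
    """Accept-reject sampler that keeps a reserve of spare values.

    Each call to next() runs the rejection loop only a few times; if it
    has not accepted by then, it hands out a spare instead, so no single
    call runs arbitrarily long.  top_up() refills the reserve whenever
    there is idle time.
    """

    def __init__(self, reserve_size=32, attempts_per_call=4):
        self.reserve_size = reserve_size
        self.attempts_per_call = attempts_per_call
        self.reserve = deque(self._draw_unbounded() for _ in range(reserve_size))

    @staticmethod
    def _accept():
        # Toy target: density proportional to x on (0, 1)
        x, u = random.random(), random.random()
        return x if u <= x else None

    def _draw_unbounded(self):
        while True:
            value = self._accept()
            if value is not None:
                return value

    def next(self):
        for _ in range(self.attempts_per_call):
            value = self._accept()
            if value is not None:
                return value
        return self.reserve.popleft()   # rare slow call: hand out a banked spare

    def top_up(self):
        while len(self.reserve) < self.reserve_size:
            self.reserve.append(self._draw_unbounded())

gen = BufferedRejectionSampler()
print([round(gen.next(), 3) for _ in range(5)])
gen.top_up()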

Glen




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Numerical recipes in statistics ???

2002-02-18 Thread Glen Barnett

The Truth wrote:
 
 Are there any Numerical Recipes like textbook on statistics and probability ?
 Just wondering..

What do you mean, a book with algorithms for statistics and probability
or a handbook/cookbook list of techniques with some basic explanation?

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which is faster? ziggurat or Monty Python (or maybe something else?)

2002-02-18 Thread Glen Barnett

Ian Buckner wrote:
 
 We generate pairs of properly distributed Gaussian variables at
 down to 10nsec intervals, essential in the application. Speed can
 be an issue, particularly in real time situations.

Generated on what? (On a fast enough machine, even clunky old Box-Muller
can probably give you that rate.)

How generated?

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which is faster? ziggurat or Monty Python (or maybe something else?)

2002-02-17 Thread Glen Barnett


Alan Miller [EMAIL PROTECTED] wrote in message
news:OC2b8.28457$[EMAIL PROTECTED]...
 First - the reference to George's paper on the ziggurat, and the code:
 The Journal of Statistical Software (2000) at:
 http://www.jstatsoft.org/v05/i08

That I already have, thanks.

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which is faster? ziggurat or Monty Python (or maybe something else?)

2002-02-17 Thread Glen Barnett


Bob Wheeler [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Marsaglia's ziggurat and MCW1019 generators are
 available in the R package SuppDists. The gcc
 compiler was used.

Thanks Bob.

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which is faster? ziggurat or Monty Python (or maybe something else?)

2002-02-17 Thread Glen Barnett


George Marsaglia [EMAIL PROTECTED] wrote in message
news:0l7b8.42092$[EMAIL PROTECTED]...
 (3-year old) Timings, in nanoseconds,  using Microsoft Visual C++
  and gcc under DOS on a 400MHz PC.   Comparisons are with
 methods by Leva and by Ahrens-Dieter, both said to be fast,
 using the same uniform RNG.

                   MS    gcc
  Leva            307    384
  Ahrens-Dieter   161    193
  RNOR             55     65   (Ziggurat)
  REXP             77     40   (Ziggurat)


 The Monty Python method is not quite as fast as the Ziggurat.

Thanks for the information. Could you give a rough idea of the relativities?
Roughly 5% slower? 10%? 30%?

I realise it's machine-dependent, but I'm only after a rough picture.

 Some may think that Alan Miller's somewhat vague reference to
 a source for the ziggurat article suggests disdain.

I didn't get that impression.

 (I don't have a web page, so the above can be considered
  my way to play Ozymandias.)

I wish you did!

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which is faster? ziggurat or Monty Python (or maybe something else?)

2002-02-17 Thread Glen Barnett


Art Kendall [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 I tend to be more concerned with the apparent randomness of the results
than with the speed of the algorithm.

This will be mainly a function of the randomness of the uniform generator. If
we assume the same uniform generator for both, and assuming it's a pretty good
one (our current one is reasonable, though I want to go back and update it
soon), there shouldn't be a huge difference in the apparent randomness of the
resulting gaussians.

 As a thought experiment,  what is the cumulative time difference in a run
using the fastest vs the slowest algorithm? A
 whole minute? A second? A fractional second?

When you need millions of them (as we do; a run of 10,000 simulations could
need as many as 500 million gaussians, and we sometimes want to do more than
10,000), and you also want your program to be interactive (in the sense that
the user doesn't have to wander off and have coffee just to do one simulation
run), knowing that one algorithm is, say, 30% faster is kind of important.
Particularly if the user may want to do hundreds of simulations...

A whole minute extra on a simulation run is a big difference, if the user is
doing simulations all day.

Glen




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: test differences between proportions

2002-02-13 Thread Glen Barnett

Rich Ulrich wrote:
 
 On Mon, 11 Feb 2002 13:56:46 +0100, nikolov
 [EMAIL PROTECTED] wrote:
 
  hello,
 
  i want to test the difference between two proportions. The problem is that
  some elements of these proportions are dependent (i can not isolate them).
  That is, the t-statistics does not work. How could i do? Do other kind of
  tests exist? Is there a book or a paper on the subject?
 
 Taking your questions in reverse order --
 
 I don't know of a book or paper about general dependencies,
 but those concerns are implicit in estimation theory.
 
 If dependency is what makes the t-test hard to use, you will
 have trouble with everything else that is common, too.
 
 What you could do is  --
 (a) Use the t-test anyway, if the correlations are positive:
 because the bias would just reduce the power of the test.

and the level...

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Ansari-Bradley dispersion test.

2002-02-10 Thread Glen Barnett


Rich Ulrich [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 On Sat, 09 Feb 2002 16:59:34 GMT, Johannes Fichtinger
 [EMAIL PROTECTED] wrote:

  Dear NG!
  I have been searching up to now for a description of the Ansari-Bradley
  dispersion test, for analysing a piece of psychological research. I am
  searching for a description of this test, especially a description of how
  to use the test.
 
  Please, can you tell me how to use the test, or show me a link where it
  is described?
  Thank you very much in advance,

 I plugged Ansari-Bradley  into a search by  www.google.com  and
 there were 287  hits.  The first page contained the (aptly named)


http://franz.stat.wisc.edu/~rossini/courses/intro-nonpar/text/Specifications_for_the_Ansari_Bradley_Test.html

 I suggest repeating the search.  That also eliminates the pasting
 problem if your reader has broken  the long URL into two lines.

A warning, however: the Ansari-Bradley test (and similar tests like the
Siegel-Tukey) has some drawbacks:
i) it assumes the locations are identical
ii) it is less powerful than some alternative tests

If assumption (i) is false, the A-B test may have very little power to detect a
difference in variance.

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Method for determining gaussian distribution

2002-02-04 Thread Glen Barnett

Jennifer Golbeck wrote:
 
 i hope someone can help me with this. i have finished a computer science
 study that examines swarming behavior. my claim is that the swarming
 algorithm that i use produces a gaussian distribution - on a grid, the
 frequency that each area is visited is recorded. graphs of my data looks
 like there is a normal distribution around the center of the area. i'd
 like to statistically show that it is a gaussian distribution.
 
 i'm not sure how i would do this. i could imagine doing a test on each row
 and each column to show that all of those are normal. even for that, i'm
 not sure what test to use to show that data follows a normal distribution.
 i feel like this is incredibly basic and i'm just overlooking something i
 should know...but i need help. any advice would be really appreciated.


It's impossible to do this.

You may be able to show it is a (discretised) gaussian analytically, by
deriving that from the problem set up, but you can't demonstrate that it
is gaussian just from the output. You can demonstrate that the gaussian
is a reasonable model for it. You can demonstrate that the deviations
from the gaussian are small. You can demonstrate that the gaussian is in
some sense a better model than a variety of plausible alternatives. But
you cannot demonstrate that it *is* gaussian from the output.

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: area under the curve

2002-01-31 Thread Glen Barnett

Dennis Roberts wrote:
 
 unless you had a table comparable to the z table for area under the normal
 distribution ... for EACH different level of skewness ... an exact answer
 is not possible in a way that would be explainable

Even if you specify the level of skewness, an exact answer is still not
possible without specifying more about the distribution. Specifying up to
third moments (for example) doesn't pin distributions down very well at all.

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: How to test f(X , Y)=f(X)f(Y)

2002-01-28 Thread Glen Barnett


Linda [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 I have 1000 observations of 2 RVs from an experiment. X is the
 independent variable and Y is the dependent variable. How do I perform
 a test of whether the following statement is true or not??

 f(X,Y)=f(X)f(Y)

You'll probably want to make a few more assumptions than given here.

A general approach would be to calculate estimates of f(X) and f(Y) or
(more generally still) of F(X) and F(Y). Exactly how you might calculate
the estimates of these depends in part on the assumptions you make, and
the knowledge you have about X and Y.

Then some comparison of F(X)F(Y) with F(X,Y) (or of f(X)f(Y) with f(X,Y))
would be made over the ranges of X and Y, but again, precisely how you
evaluate these depends on the assumptions you make and the knowledge
you have about X and Y.

For example, if X and Y are nominal categories, you'd use a chi-square test.
If there was further information (such as that found in ordered categories, or
in continuous variables), you'd want to do other things.
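
One concrete, if crude, way to make the comparison described above for
continuous data is to compare the empirical joint CDF with the product of the
empirical marginals, and calibrate the statistic by permuting one variable.
Here is a minimal sketch in Python; the statistic and the permutation
calibration are just one possible choice, not a recommendation from the
thread:

import numpy as np

def independence_permutation_test(x, y, n_perm=999, seed=None):
    """Permutation test of f(X,Y) = f(X) f(Y) for continuous data.

    Statistic: max |F_n(x, y) - F_n(x) F_n(y)| over the observed points.
    """
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = x.size

    def stat(xv, yv):
        # Empirical marginal CDFs evaluated at each observed point
        fx = np.searchsorted(np.sort(xv), xv, side="right") / n
        fy = np.searchsorted(np.sort(yv), yv, side="right") / n
        # Empirical joint CDF at each observed point
        fxy = np.mean((xv[None, :] <= xv[:, None]) & (yv[None, :] <= yv[:, None]),
                      axis=1)
        return np.max(np.abs(fxy - fx * fy))

    observed = stat(x, y)
    perm_stats = [stat(x, rng.permutation(y)) for _ in range(n_perm)]
    p_value = (1 + sum(s >= observed for s in perm_stats)) / (n_perm + 1)
    return observed, p_value

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)     # dependent by construction
print(independence_permutation_test(x, y, n_perm=199, seed=0))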

Glen




=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Unique Root Test - Statistics

2002-01-22 Thread Glen Barnett


Shakti Sankhla [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Hi All:

 This is basically not a SAS problem but I believe that many of the list
 members could help.

 I am looking for information on Statistical topic called Unique Root
 Test.


Do you mean unit root test?

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: 95% CI for a sum

2002-01-21 Thread Glen Barnett

Scheltema, Karen wrote:
 
 I have 2 independent samples and the standard errors and n's associated with
 each of them.  If a and b are constants, what is the formula for the 95%
 confidence interval for
 (a(Xbar1)+b(xbar2))?

Are the sample sizes big enough that you'd be prepared to use the CLT?

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Buy Book on Probability and statistical inference

2002-01-14 Thread Glen Barnett


Chia C Chong [EMAIL PROTECTED] wrote in message
news:a1phfd$36e$[EMAIL PROTECTED]...
 Hi!

 I wish to get a book on probability and statistical inference. I wish to
 get some advice first... Any good suggestions??

(i) What do you know already?
(ii) What do you need to know about?
(iii) What level of mathematics (e.g. how much calculus, linear algebra, etc.)
do you have?

Glen



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Modelling Problem

2002-01-14 Thread Glen Barnett

Alexander Hener wrote:
 I have a modelling problem where any help would be appreciated.
 Assume that I want to model a fraction, where the nominator is a sum of,

Do you mean numerator?

 say, four continous random variables.  I am thinking of using some
 parameter-additive distribution there, e.g. the gamma, since the sum in
 the nominator needs not be negative. The denominator should be continous
 and positive. Now my questions are :
 
 1. Is anyone aware of  distributions which lend themselves to such a
 model ?

If the fractions are between zero and one, you may wish to consider the
beta distribution for the fraction - if X and Y are independent gamma
r.vs, then X/(X+Y) is beta. If X = X1 + X2 + X3 + X4 is your numerator,
that would seem to suggest something like a beta at first glance.
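
A quick numeric check of that fact, in Python with numpy/scipy (the shape
parameters are arbitrary):

import numpy as np
from scipy.stats import beta, kstest

rng = np.random.default_rng(4)
a_shape, b_shape = 2.0, 5.0
x = rng.gamma(a_shape, size=100_000)     # X ~ Gamma(a), unit scale
y = rng.gamma(b_shape, size=100_000)     # Y ~ Gamma(b), same scale

ratio = x / (x + y)                      # should be Beta(a, b)
print(kstest(ratio, beta(a_shape, b_shape).cdf))   # large p-value expected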

Glen


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Which one fit better??

2002-01-07 Thread Glen Barnett


Chia C Chong [EMAIL PROTECTED] wrote in message
news:a1bpk5$62b$[EMAIL PROTECTED]...

 Glen [EMAIL PROTECTED] wrote in message
 [EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
  Chia C Chong [EMAIL PROTECTED] wrote in message
 news:a0n001$b7v$[EMAIL PROTECTED]...
   I plotted a histogram density of my data and its smooth version using the
   normal kernel function. I tried to plot the estimated PDFs (Laplacian &
   Generalised Gaussian), estimated using the maximum likelihood method, on
   top as well. Graphically, it seems that the Laplacian will fit the histogram
   density better while the Generalised Gaussian will fit the smooth version
   (i.e. the kernel density version).
  
 
  Imagine that you began with a sample from a Laplacian (double
  exponential) distribution. What will happen to the central peak after
  you smooth it with a KDE?

 The peak does not change significantly... Maybe shifted to the left a
 bit... not too much!!

No, I was not talking about your data, since you don't necessarily have
Laplacian - that's what you're trying to decide!

Imagine you have data actually from a Laplacian distribution.
(It has a sharp peak in the middle, and exponential tails.)

Now you smooth it (KDE via gaussian kernel).

What happens to the peak?  (assume a typical window width)

[Answer? It gets smoothed, so it no longer looks like a sharp peak.]

That's where your impression of a gaussian-looking KDE is probably coming from.

Note that the tails of a normal and a Laplace are different, so if those are
the two choices, that may help.
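
To see the effect being described, a small sketch in Python with numpy/scipy
(the sample size is arbitrary and the bandwidth is scipy's default):

import numpy as np
from scipy.stats import gaussian_kde, laplace

rng = np.random.default_rng(5)
sample = laplace.rvs(size=2000, random_state=rng)

kde = gaussian_kde(sample)               # default (normal-reference) bandwidth

# The true Laplace density has a sharp peak of 0.5 at zero;
# the gaussian-kernel KDE flattens and lowers that peak noticeably.
print(laplace.pdf(0.0), kde(np.array([0.0]))[0])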

Glen





=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Question on 2-D joint distribution...

2002-01-05 Thread Glen Barnett


Chia C Chong [EMAIL PROTECTED] wrote in message
news:a145qk$qfq$[EMAIL PROTECTED]...
 Hi!

 I have a series of observations of 2 random variables (say X and Y) from my
 measurement data. These 2 RVs are not independent and hence f(X,Y) ~=
 f(X)f(Y). Hence, I can't investigate f(X) and f(Y) separately. I tried to
 plot the 2-D kernel density estimates of these 2 RVs, and from that it looks
 like a Laplacian/Gaussian/Generalised Gaussian shape on one side while the
 other side looks like a Gamma/Weibull/Exponential shape.

 My intention is to find the joint 2-D distribution of these 2 RVs so that I
 can represent this by an equation (so that I could regenerate this plot
 using simulation later on). I wonder whether anyone has come across this
 kind of problem and what method that I should use??






=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



Re: Standardizing evaluation scores

2001-12-19 Thread Glen Barnett

Stan Brown wrote:
 But is it worth it? Don't the easy graders and tough graders
 pretty much cancel each other out anyway?

Not if some students only get hard graders and some only get easy
graders.

If all students got all graders an equal amount of the time, it probably
wouldn't matter at all.

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Basics

2001-12-12 Thread Glen Barnett

colsul wrote:
 
 Does anyone know of a website that deals with basic statistic formulae
 and/or business math? Also, I am looking for a text book that could give me
 a grounding in the basics of statistics, stat. analysis and business maths.
 I need to cram so I have some idea for a job interview I have coming up. Any
 help or advice would be very much appreciated.

Beware. Spouting crammed-but-not-understood knowledge can make you
look like an idiot, which isn't a good thing to appear to be in an
interview. 

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: 10 envelopes, 10 persons

2001-11-19 Thread Glen Barnett

Stan Brown wrote:
 
 Problem posed me by a student: ten persons (A through J) and ten
 envelopes containing cards marked with letters A through J. (Each
 letter is in one and only one envelope.)
 
 The random variable x is the number of people who get the right
 envelope when the envelopes are handed out randomly. Obviously
  0 <= x <= 10.
 
 Question: How do we express the probability distribution P(x)?
 
 I've done some work on this, and I _must_ be missing something
 obvious. Here's part what I've got so far.
 
 10! = number of possible arrangements. Only one of them assigns all
 ten envelopes to the right people, so
 P(10) = 1/10!
 
 If nine people get the right envelopes, the tenth must also get the
 right envelope. So
 P(9) = 0
 
 I bogged down on figuring P(8), though. Then I tried to look at P(0)
 and got even more bogged down.
 
 Am I missing something here? Is there an elegant way to write
 expressions for the probabilities of the various x's?

This is sometimes called the matching problem or the matching
experiment.

Type "letters envelopes random" into Google and you get several
relevant hits - e.g.
http://www.math.uah.edu/stat/urn/urn6.html
http://www.wku.edu/~neal/probability/matching.html 

either of these might be enough for you to see how to do it.

If you put "matching" in there as well it would probably
target your search better, but I just wanted to show you that
you don't need more than your own description of the
problem to find out lots just with Google.
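
For reference, the standard answer those pages lead to, sketched in Python;
this is the classical derangement-based formula, stated from memory rather
than taken from the thread:

from math import comb, factorial

def derangements(m):
    """Number of permutations of m items with no fixed point."""
    return sum((-1) ** j * factorial(m) // factorial(j) for j in range(m + 1))

def matching_pmf(n, k):
    """P(exactly k of n envelopes reach the right person)."""
    return comb(n, k) * derangements(n - k) / factorial(n)

print(matching_pmf(10, 10))                           # 1/10!
print(matching_pmf(10, 9))                            # 0
print(matching_pmf(10, 0))                            # close to 1/e
print(sum(matching_pmf(10, k) for k in range(11)))    # 1.0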

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Testing for joint probability between 2 variables

2001-10-30 Thread Glen Barnett


Chia C Chong [EMAIL PROTECTED] wrote in message
news:9rn4vc$8v2$[EMAIL PROTECTED]...

 Glen [EMAIL PROTECTED] wrote in message
 [EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
  Chia C Chong [EMAIL PROTECTED] wrote in message
 news:9rjs94$lht$[EMAIL PROTECTED]...
   I have 2 variables and would like to test whether these 2 variables are
   correlated or not. What statistical test should I use?? I would guess
   something like joint pdf tests, but can somebody give me some suggestions
   to start with??
 
  Are the observations numbers or categories, or something else?
  If they are categorical, are the categories ordered?
 
  Are we talking linear correlation or some more general association
  (e.g. a monotonic relationship)?  Are the variables observed over
  time or space (or otherwise likely to be correlated with themselves)?
 
  In essence, what's your model for the variables (if any)?
 
  Glen

 The observations were numbers. To be specific, the 2 variables are DELAY
 and ANGLE. So, basically I am looking into some raw measurement data
 captured in the real environment, and after post-processing these data, I
 will have information in these two domains.

 I do not know whether they are linearly correlated or something else but, by
 physical mechanisms, there should be some kind of correlation between them.
 They are observed over the TIME domain.

If you're wanting to measure monotonic association, the Spearman correlation
has much to recommend it (including high efficiency against the Pearson when
the data are bivariate normal - with resulting linear association).

If you want to measure linear association, then the Pearson is generally the
way to go, though the Spearman is less influenced by extreme observations, so
even here it has something to recommend it.
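
For what it's worth, both are one-liners in Python's scipy (the data below are
made up purely to illustrate a monotonic but non-linear relationship, and are
not meant to resemble the actual DELAY and ANGLE measurements):

import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(6)
delay = rng.exponential(scale=1.0, size=200)
angle = np.sqrt(delay) + rng.normal(scale=0.2, size=200)   # monotonic, not linear

print("Pearson ", pearsonr(delay, angle))
print("Spearman", spearmanr(delay, angle))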

If you want to measure some more general dependence, then I'm no expert on it,
but you may be on the right track trying to estimate the bivariate
distribution - perhaps with kernel density estimation, unless you have some
more knowledge about the process (the more outside information you can put in,
the easier it should be to identify if something is happening).

I'd probably suggest not trying to group the data and do a chi-squared measure
of association (you're throwing away the ordering, where most of the
information will be), except perhaps just as an exploratory technique that's
fast.

If one of the variables is more like a predictor and the other more like a
response, you might consider looking at nonparametric regression approaches
(smoothing, basically). Most packages will at least do loess these days.

If the variables aren't expected to reasonably fall into a functional-type
relationship (maybe all the points lie on an arc that's 3/4 of a circle or
something), then you could look at some of the methods that find principal
curves.

Glen




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Testing for joint probability between 2 variables

2001-10-30 Thread Glen Barnett


Glen Barnett [EMAIL PROTECTED] wrote in message
news:9rndu1$gqq$[EMAIL PROTECTED]...
 I'd probably suggest not trying to group the data and do a chi-squared measure
 of association (you're throwing away the ordering, where most of the
 information will be), except perhaps just as an exploratory technique that's
 fast.

Actually, one of the approaches where the chi-square is split into orthogonal
components, and you pick the components relevant to you to test (a bit like
testing a contrast in ANOVA), so you're not spreading your power over
alternatives you don't want power against anyway, might be a reasonable idea,
since that can work quite well. I think Rayner and Best's book has some of it,
but I believe they have done more on that since. (It relates to the Neyman
and Barton smooth tests, which can be shown to partition the chi-square
statistic, but that's not the only way to partition it.)

But you're still probably better off using some model for the continuous
data if you have something appropriate.

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Comparing percent correct to correct by chance

2001-10-30 Thread Glen Barnett


Donald Burrill [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 On Sun, 28 Oct 2001, Melady Preece wrote:

  Hi.  I want to compare the percentage of correct identifications (taste
  test) to the percentage that would be correct by chance 50%?  (only two
  items being tasted).  Can I use a t-test to compare the percentages?
  What would I use for the s.d. for by chance percentage?  (0?)

 Standard comparison would be the formal Z-test for a proportion;  see
 any elementary stats text.  If you have a reasonably large sample size,
 use the normal approximation to the binomial;  if you have a small
 sample, it may be necessary to use the binomial distribution itself,
 which is considerably more tedious unless you have comprehensive tables.

 Sounds as though you'd wish to test  H0: P = .50  vs.  H1:  P  .50.

I'd kind of expect them to want this one to be one-tailed - it would
seem strange to be interested in the circumstance where the tastebuds do
worse than chance (well, it'd be kinky and fun, but would it change
your action from "no difference"? I can conceive of it, but I'd bet not.)
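
A minimal sketch of that Z-test, plus the exact binomial version, in Python
with scipy (the counts are made up; binomtest needs a reasonably recent
scipy):

import numpy as np
from scipy.stats import binomtest, norm

correct, n, p0 = 34, 50, 0.5                 # illustrative counts only

# Normal-approximation Z test of H0: P = 0.5 against H1: P > 0.5
p_hat = correct / n
z = (p_hat - p0) / np.sqrt(p0 * (1 - p0) / n)
print("z =", round(z, 3), " one-sided p =", round(norm.sf(z), 4))

# Exact binomial test, preferable for small samples
print(binomtest(correct, n, p0, alternative="greater"))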

Glen





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Transformation function for proportions

2001-10-18 Thread Glen Barnett


Rich Ulrich [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 On Wed, 17 Oct 2001 15:50:35 +0200, Tobias Richter
 [EMAIL PROTECTED] wrote:

  We have collected variables that represent proportions (i. e., the
  proportion of sentences in a number of texts that belong to a certain
  category). The distributions of these variables are highly skewed (the
  proportions  for most of the texts are zero or rather low). So my

 Low proportions, and a lot at  zero?

I missed the lot at zero on first reading - so my other post is nonsense.
Rich is right - you can't do anything much about symmetry if you have a
large clump at zero.

*Any* monotonic increasing transformation will still leave you with a large
clump at the bottom. No matter what you do, all those values have
to end up at the same place as all the other zeros, right?

Why would symmetry be necessary?

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Transformation function for proportions

2001-10-18 Thread Glen Barnett


Rich Strauss [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...
 However, the arcsin transformation is for proportions (with fixed

It's also designed for stabilising variance rather than specifically inducing
symmetry.
 Does it actually produce symmetry as well?

 denominator), not for ratios (with variable denominator).  The proportion
 of sentences in a number of texts that belong to a certain category sounds
 like a problem in ratios, since the total number of sentences undoubtedly
 vary among texts.  Log transformations work well because they linearize
 such ratios.

Additionally, for small proportions logs are close to logits, so logs are
sometimes helpful even if the data really are proportions. Logs also go some
way to reducing the skewness and stabilising the variance, though they don't
stabilise it as well as the arcsin square root that's specifically designed
for it.
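
The transformations being compared, side by side, in Python (illustrative
proportions only):

import numpy as np

p = np.array([0.01, 0.02, 0.05, 0.10, 0.20])

arcsine = np.arcsin(np.sqrt(p))    # variance-stabilising for binomial proportions
logs = np.log(p)                   # close to the logit when p is small
logits = np.log(p / (1 - p))

print(np.round(arcsine, 3))
print(np.round(logs, 3))
print(np.round(logits, 3))         # note how close log and logit are here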

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Are parametric assumptions importat ?

2001-10-18 Thread Glen Barnett


Yes [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...

 Glenn Barnett wrote:

One n in Glen.
 OK, I see what you were getting at - but I still disagree, if it is
 understood that we are talking about large samples.

Your original comment that I was replying to was:
 (1)  normality is rarely important, provided the sample sizes are
 largish. The larger the less important.

And I take some issue with that. I guess it depends on what we mean by large.

 For large effects,
 and large samples, you have far more power than you need; the goal is
 not to get a p-value so small that you need scientific notation to
 express it!

Correct. But the effect is not always so large - many of the people I help deal
with pretty modest effects. Large samples don't always save you - even
with the distribution under the null hypothesis, let alone power.


 If the effect is small, efficiency matters; but a fairly small
 deviation from normality will not have a large effect on efficiency
 either.

Agreed.

 With an effect small enough to be marginally detectable even
 with a large sample, it is likely that a *large* deviation from
 normality will raise much more important questions about which measure
 of location is appropriate.

Yes.

 For smaller samples, your point holds - with the cynical observation
 that the times when it would most benefit us to assume normality are
 precisely the times when we have not got the information that would
 allow us to do so!  I might however quibble that for smaller samples it
 is risky to assume that asymptotic relative efficiency will be a good
 indication of relative efficiency for small N.

In many cases it is. And if the samples are nice and small, even when
it's difficult to do the computations algebraically, we can simulate
from some plausible distributions to look at the properties.

Or do something nonparametric that has good power properties when
the population distribution happens to be close to normal. Permutation
tests, for example.
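
A bare-bones example of the sort of permutation test mentioned, in Python
(the difference in means as the statistic is just one choice):

import numpy as np

def permutation_test_means(x, y, n_perm=9999, seed=None):
    """Two-sample permutation test using the difference in means."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    observed = np.mean(x) - np.mean(y)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(x)].mean() - perm[len(x):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)          # two-sided p-value

rng = np.random.default_rng(7)
a = rng.normal(0.0, 1.0, size=15)
b = rng.normal(0.8, 1.0, size=15)
print(permutation_test_means(a, b, seed=0))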

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Are parametric assumptions importat ?

2001-10-16 Thread Glen Barnett

Robert J. MacG. Dawson wrote:
 
 Voltolini wrote:
 
  Hi, I am Biologist preparing a class on experiments in ecology including
  a short and simple text about how to use and to choose the most commom
  statistical tests (chi-square, t tests, ANOVA, correlation and regression).
 
  I am planning to include the idea that testing the assumptions for
  parametric tests (normality and homocedasticity) is very important
  to decide between a parametric (e.g., ANOVA) or the non parametric
  test (e. g. Kruskal-Wallis). I am using the Shapiro-Wilk and the Levene
  test for the assumption testing  but..
 
 It's not that simple.  Some points:
 
 (1)  normality is rarely important, provided the sample sizes are
 largish. The larger the less important.

The a.r.e. won't change with larger samples, so I disagree here.

 (2)  The Shapiro-Wilk test is far too sensitive with large samples and
 not sensitive enough for small samples. This is not the fault of Shapiro
 and Wilk, it's a flaw in the idea of testing for normality.  The
 question that such a test answers is is there enough evidence to
 conclude that population is even slightly non-normal? whereas what we
 *ought* to be asking  is do we have reason to believe that the
 population is approximately normal?  

Almost. I'd say "Is the deviation from normality so large as to appreciably
affect the inferences we're making?", which largely boils down to things like:
are our estimates consistent? (the answer will be yes in any reasonable
situation)
are our standard errors approximately correct?
is our significance level something like what we think it is?
are our power properties reasonable?

You want a measure of the degree of deviation from normality. For example,
the Shapiro-Francia test is based on the squared correlation in the normal
scores plot, and as n increases, the test detects smaller deviations from
normality (which isn't what we want) - but the squared correlation itself
is a measure of the degree of deviation from normality, and may be a somewhat
helpful guide. As the sample size gets moderate to large, you can more
easily assess the kind of deviation from normality and make a better
assessment of the likely effect.

Generally speaking, things like one-way ANOVA aren't affected much by
moderate skewness or thin or somewhat thickish tails. With heavy skewness
or extremely heavy tails you'd be better off with a Kruskal-Wallis.

 Levene's test has the same
 problem, as fairly severe heteroscedasticity can be worked around with a
 conservative assumption of degrees of freedom - which is essentially
 costless if the samples are large.



 In each case, the criterion of detectability at p=0.05 simply does
 not coincide withthe criterion far enough off assumption to matter

Correct

 
 (3) Approximate symmetry is usually important to the *relevance* of
 mean-based testing, no matter how big the sample size is.  Unless the
 sum of the data (or of population elements) is of primary importance, or
  unless the distribution is symmetric (so that almost all measures of
 location coincide) you should not assume that the mean is a good measure
 of location.  The median need not be either!
 
 (4) Most nonparametric tests make assumptions too. The rank-sum test
 assumes symmetry;

You mean the signed rank test. The rank-sum is the W-M-W...

  the Wilcoxon-Mann-Whitney and Kruskal-Wallis tests
 are usually taken to assume a pure shift alternative (which is actually
 rather unlikely for an asymmetric distribution.)  In fact symmetry will
 do instead; Potthoff has shown that the WMW is a test for the median if
 distributions are symmetric. If there exists a transformation that
 renders the populations equally-distributed or symmetric (eg, lognormal
 family) they will work, too.

e.g., the test will work for scale shift alternatives (since the - monotonic -
log transform would render that as a location shift alternative, but of course
the monotonic transformation won't affect the rank structure, so it works
with the original data).

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Mean and Standard Deviation

2001-10-16 Thread Glen Barnett

Edward Dreyer wrote:
 
 A colleague of mine - not a subscriber to this helpful list - asked me if
 it is possible for the standard deviation
 to be larger than the mean.  If so, under what conditions?

Of course - for example, if you analyse mean-corrected data...

It can even happen with data that are strictly positive.

The log-normal distribution with sigma-squared > ln(2) is an example
that has standard deviation larger than the mean; e.g. with sigma-squared = 1,
the standard deviation will be about 130% of the mean.
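
A quick check of that figure in Python, using the standard lognormal moment
formulas:

import numpy as np

sigma2 = 1.0
mean = np.exp(sigma2 / 2)                        # lognormal mean, with mu = 0
sd = np.sqrt((np.exp(sigma2) - 1) * np.exp(sigma2))

print(round(sd / mean, 3))   # about 1.31, i.e. sd is roughly 130% of the mean
print(np.log(2))             # the threshold: sd exceeds the mean once sigma^2 > ln 2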

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: semi-studentized residual

2001-10-09 Thread Glen Barnett

James Ankeny wrote:
 
  Hello,
   I have a question regarding the so-called semi-studentized residual,
 which is of the form (e_i)* = ( e_i - 0 ) / sqrt(MSE). Here, e_i is the ith
 residual, 0 is the mean of the residuals, and sqrt(MSE) means the square
 root of MSE. Now, if I understand correctly, the population simple linear
 regression model assumes that the E_i, the error terms, are independent and
 identically distributed random variables with N(0, sigma^2). My question is,
 are semi-studentized residuals not fully studentized because MSE is not the
 variance of all the residuals? 

Correct. In fact, it probably isn't the variance of any of them, though it
will often be reasonably close.

 It seems like MSE would be the variance of
 the residuals, unless of course the residuals from the sample data are not
 independent and identically distributed random variables. 

Don't confuse errors with residuals. In the model, the error terms
may be i.i.d., but the residuals (which estimate them) are neither
independent nor identically distributed.

 If not, each
 residual may have its own variance, in which case we would have to find this
 and studentize each residual by its own standard error? I am not sure if I
 am thinking about this in the right way.
  Also, if the E_i are iid random variables, does this mean that the
 observations Y_i are iid random variables within a particular level of X? 

Yes.
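
As a sketch of the fully studentized version being asked about, in Python with
numpy: internal studentization using the hat-matrix leverages, which is a
standard construction rather than anything specific to this thread:

import numpy as np

def studentized_residuals(x, y):
    """Simple linear regression residuals scaled two ways."""
    X = np.column_stack([np.ones_like(x), x])          # design matrix
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)      # leverages
    mse = resid @ resid / (len(y) - X.shape[1])
    semi = resid / np.sqrt(mse)                        # semi-studentized
    full = resid / np.sqrt(mse * (1 - h))              # internally studentized
    return semi, full

rng = np.random.default_rng(8)
x = np.linspace(0, 10, 30)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=x.size)
semi, full = studentized_residuals(x, y)
print(np.round(semi[:3], 3), np.round(full[:3], 3))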

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: E as a % of a standard deviation

2001-09-27 Thread Glen Barnett


John Jackson [EMAIL PROTECTED] wrote in message
news:MGns7.49824$[EMAIL PROTECTED]...
 re: the formula:

   n   = (Z?/e)2

This formula hasn't come over at all well.  Please note that newsgroups
work in ascii. What's it supposed to look like? What's it a formula for?

 could you express E as a  % of a standard deviation .

What's E? The above formula doesn't have a (capital) E.

What is Z? n? e?

 In other words does a .02 error translate into .02/1 standard deviations,
 assuming you are dealing w/a normal distribution?

? How does this relate to the formula above?

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Difference between BOX and JENKIN TRANSFER FUNCTION model and

2001-08-28 Thread Glen Barnett

Marg wrote:
 
 Greetings..
 Can anyone suggest me what are the differences between Box and Jenkin
 Transfer function model and multiple regression model?
 Are there any good tutorials or freewares that deal with the Box and
 Jenkin Transfer function model?

The basic difference is that the TF model is dealing both with 

a) lags in the variables - not just "how is y related to x?" but
"how is y(t) related to x(t-k)?" for various k, and

b) autocorrelation in the variables.

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Normality in Factor Analysis

2001-06-25 Thread Glen Barnett


Robert Ehrlich [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Calculation of eigenvalues and eigenvectors requires no assumption.
 However evaluation of the results IMHO implicitly assumes at least a
 unimodal distribution and reasonably homogeneous variance, for the same
 reasons as ANOVA or regression.  So think of the consequences of calculating
 means and variances of a strongly bimodal distribution where no sample
 occurs near the mean and all samples are tens of standard deviations
 from the mean.

The largest number of standard deviations all data can be from the mean is 1.

To get some data further away than that, some of it has to be less than 1 s.d.
from the mean.

Glen





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Help me, please!

2001-06-18 Thread Glen Barnett


Monica De Stefani [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 2) Can Kendall discover nonlinear dependence?

He used to be able to, but he died.

(Look at how Kendall's tau is calculated. Notice that it is
not affected by any monotonic increasing transformation. So
Kendall's tau measures monotonic association - the tendency
of two variables to be in the same order.)
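
A small scipy check (my own example) of both points: tau is 1 for a monotonic
but nonlinear relationship, and it is unchanged by a further monotonic
transformation.

import numpy as np
from scipy.stats import kendalltau

x = np.arange(1, 21, dtype=float)
y = x ** 3                          # nonlinear but monotonic in x

tau1, _ = kendalltau(x, y)
tau2, _ = kendalltau(x, np.log(y))  # a further monotonic transformation
print(tau1, tau2)                   # both 1.0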

Glen





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Combinometrics

2001-05-07 Thread Glen Barnett


David Heiser [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 We seem to have a lot of recent questions involving combinations, and
 probabilities of combinations.

 I am puzzled.

 Are these concepts no longer taught as a fundamental starting point in stat?
 I remember all the urn problems and combinations of n taken m times, with
 and without replacements, the lot sampling problems, gaming problems, etc.
 These were all preliminary, early in the semester (fall). Now to see these
 questions popping up late in spring?

 Times may have changed, since the 1940's, and perhaps there is more
 important stuff to teach.

Even if times hadn't changed, perhaps some of the posters aren't
studying in the US, so their timetable may not match yours. (Right
now it's late autumn where I am sitting.)

Here in Australia, for example, the school year is the same as the
calendar year - high schools will start in early February, universities
will mostly start in early March (though it varies some from institution
to institution).

And not all posters are necessarily at university.

However, I'd guess that many stats courses no longer do much
combinatorial probability.

Glen




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: A disarmingly simple conjecture

2001-04-26 Thread Glen Barnett

Giuseppe Andrea Paleologo wrote:
 
 I am dealing with a simple conjecture. Given two generic positive random
 variables, is it always true that the sum of the quantiles (for a given
 value p) is greater or equal than the quantile of the sum?
 
 In other words, let X, Y be positive random variables with continuous
 but arbitrary joint CDF F(x,y), and let Z = X + Y, with CDF Fz(z). Let
 Fx(x) and Fy(y) are the marginal CDFs for X and Y respectively. Is it
 true that
 
 Fx^-1 (p) + Fy^-1 (p) >= Fz^-1(p)
 
 with 0 < p < 1 ?
 
 Any insight or counterexample is greatly appreciated. I am sure this is
 proved in some textbook, but independently from that, I think this
 should be doable via elementary methods...
 

I'm sure I've seen it somewhere. 

It seems obvious for well-behaved cases, and I assume it is true in
general, but I must admit my brain is completely not in gear at the moment.

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Homework Problem

2001-04-02 Thread Glen Barnett


Michael Scheltgen [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Suppose X1, X2, X3, and X4 have a multivariate Normal Dist'n
 with mean vector u,
 and Covariance matrix, sigma.

 (a) Suppose it is known that X3 = x3 and X4 = x4.  What is:

 1)The expected value of X1
 2)The expected value of X2
 3)The variance of X1
 4)The variance of X2
 5)The correlation of X1 and X2

 My approach was to find the conditional distribution, then
 designate

 E[X1] = u1 from the mean vector of the conditional dist'n
 E[X2] = u2 from the mean vector of the conditional dist'n
 same with the variance, etc...

 Is this the correct approach?  Thank you very much for your
 comments :)

Looks right to me.
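
For anyone wanting to check their algebra numerically, here is a numpy sketch
of the standard partitioned-normal formulas (the mean vector and covariance
matrix below are made up purely for illustration).

import numpy as np

# made-up parameters for a 4-dimensional multivariate normal
mu = np.array([1.0, 2.0, 0.0, 3.0])
Sigma = np.array([[4.0, 1.0, 0.5, 0.2],
                  [1.0, 3.0, 0.3, 0.4],
                  [0.5, 0.3, 2.0, 0.6],
                  [0.2, 0.4, 0.6, 1.5]])
x34 = np.array([0.5, 2.5])                    # observed values of X3, X4

mu1, mu2 = mu[:2], mu[2:]
S11, S12 = Sigma[:2, :2], Sigma[:2, 2:]
S21, S22 = Sigma[2:, :2], Sigma[2:, 2:]

# conditional distribution of (X1, X2) given (X3, X4) = x34
mu_cond = mu1 + S12 @ np.linalg.solve(S22, x34 - mu2)
S_cond = S11 - S12 @ np.linalg.solve(S22, S21)

var1, var2 = np.diag(S_cond)
corr12 = S_cond[0, 1] / np.sqrt(var1 * var2)
print(mu_cond, var1, var2, corr12)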

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Most Common Mistake In Statistical Inference

2001-03-22 Thread Glen Barnett


W. D. Allen Sr. [EMAIL PROTECTED] wrote in message
nH9u6.6370$[EMAIL PROTECTED]">news:nH9u6.6370$[EMAIL PROTECTED]...
 A common mistake made in statistical inference is to assume every data set
 is normally distributed. This seems to be the rule rather than the
 exception, even among professional statisticians.

The most common mistake to me seems to be the one where
people use the data to answer a question other than the
one in which they were interested.

 Either the Chi Square or S-K test, as appropriate, should be conducted to
 determine normality before interpreting population percentages using
 standard deviations.

1) The Chi-square test is effectively useless as a test of normality, since
 it ignores the ordering in the bins (the binning itself is an additional
 but relatively smaller effect).

2) A common mistake in inference is to assume, without checking, that
a formal hypothesis test of normality followed by a normal-theory
procedure will have desirable properties.

In practice the first thing to do is to find out how big a deviation from
normality you can tolerate with the procedure you have in mind, taking
into account not just level but power (if you're testing) or size of
confidence intervals (if you're doing point estimation), and so on.

If it's large, you are probably safe unless it's obvious your data are
drastically non-normal (extreme skewness can be a problem). If it's
small, then you should look at a different procedure - either a robust
or a nonparametric procedure, for example - or a different assumption.
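
One way to "find out how big a deviation you can tolerate" is a quick Monte
Carlo sketch like the one below (my own, with made-up settings): estimate the
actual Type I error rate of a two-sample t-test when the data are skewed
(exponential) rather than normal. The same loop with a shifted alternative
gives a feel for power.

import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
n, reps, alpha = 15, 20000, 0.05
rejections = 0
for _ in range(reps):
    a = rng.exponential(scale=1.0, size=n)
    b = rng.exponential(scale=1.0, size=n)    # same distribution, so H0 is true
    rejections += ttest_ind(a, b).pvalue < alpha

print(rejections / reps)    # compare with the nominal 0.05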

Glen




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: help with modelling

2000-12-18 Thread Glen Barnett


Debraj [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 hi,

 I have a set of data which indicates number of correct responses on a
 test (score)  for 20 persons. I wanted to know if I can model the same
 mathematically based on certain factors, say Score = f(factor1,
 factor2, factor3, factor4), so that I can simulate similar data with
 different values of the factors. How should I go about this ?

There are a whole variety of models you might consider.

Since the response is the number of correct responses out of 20, you
will want some kind of discrete distribution on the range 0-20, presumably
with one or more free parameters, at least some of which relate to the
factors.

For example, one simple model would be the Binomial(20,p), where the
probability parameter, p, depends on the factors. It makes some assumptions
that may be okay as a first approximation for some kinds of tests, and largely
useless in other situations (and I can't tell you which case we have here).
Read up firstly on discrete distributions, and then on GLMs, this may
give you one starting point.

Going back to that binomial model, the way that p depends on the factors
is another choice you need to make. If you read about GLMs, look at
typical link functions for the binomial.

I'm not saying this would be a good model in your case, but it might be
a good place to start thinking about the issues.
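
As a concrete sketch of that starting point (assuming the statsmodels package
is available; the factor names, coefficients and data below are invented), a
Binomial(20, p) GLM with a logit link might look like this.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
factor1 = rng.normal(size=20)
factor2 = rng.normal(size=20)
p_true = 1 / (1 + np.exp(-(0.3 + 0.8 * factor1 - 0.5 * factor2)))
correct = rng.binomial(20, p_true)                 # scores out of 20

endog = np.column_stack([correct, 20 - correct])   # (successes, failures)
exog = sm.add_constant(np.column_stack([factor1, factor2]))

fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.summary())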

Glen




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: accuracy, median or mean

2000-11-20 Thread Glen Barnett

Paul Foran wrote:
 
 Is accuracy measured as sample mean or sample median distance from the true
 value?

You could define something called accuracy as either of these, or indeed
as something else. Is there a particular context you're asking about?
It may be that in some areas the term has an accepted definition.

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Tests of Statistical Significance

2000-11-07 Thread Glen Barnett


Rich Ulrich [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Sorry, I am missing it -
 --

I couldn't quite work it out either. I often have that problem though.

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: ANOVA with dichotomous dependent variable

2000-11-02 Thread Glen Barnett

Gerhard Luecke wrote:
 
 Can anyone name some references where the problem of using a DICHOTOMOUS
 variable as a DEPENDENT variable in an ANOVA is discussed?
 
 Many thanks in advance,
 Gerhard Luecke


I'd first try logistic regression. If all your variables
are categorical, you can look at some of the categorical 
(contingency table-type) analyses (e.g. loglinear models). 

Most stats packages will do logistic regression.
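
A minimal sketch of that first suggestion (statsmodels again, with invented
data): a dichotomous response regressed on a categorical factor, which plays
the role the ANOVA factor would have played.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "group": np.repeat(["A", "B", "C"], 30),
    "y": rng.binomial(1, np.repeat([0.2, 0.5, 0.7], 30)),
})

fit = smf.logit("y ~ C(group)", data=df).fit()
print(fit.summary())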

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Which book do you recommend?

2000-11-02 Thread Glen Barnett

[EMAIL PROTECTED] wrote:
 
 Comments, please, on the relative merits of the standard textbooks:
  Bickel & Doksum
  Casella & Berger
  Cox & Hinkley
 Or is there some other book that you prefer? This question has been
 posted before, but nobody responded, so I'm asking again. Surely someone
 out there has an opinion!

Depends on what you want to do with it. I somewhat prefer Casella 
and Berger to Cox and Hinkley for my purposes, but that's not going
to be the same as what you want to use it for. Both are reasonable. 
I'm not familiar with Bickel and Doksum's book, though if their papers 
are anything to go by, it should be fairly readable.

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: question about binomial distribution

2000-10-24 Thread Glen Barnett


Thomas Souers [EMAIL PROTECTED] wrote in message
17920451.972429742277.JavaMail.imail@slippery">news:17920451.972429742277.JavaMail.imail@slippery...
 I have a question regarding basic statistics, and while it might seem
 foolish to some of you, I would greatly appreciate any help:

 Suppose a variable can assume two values, success ( 1, probability p ) or
 failure ( 0, probability 1-p ). If n trials are independent and the
 probability of success remains the same for each trial, then obviously the
 count of successes in n trials is binomial with E(Y)=np and V(Y)=np(1-p).
 What I do not understand, and perhaps this doesn't make any sense, but, what
 distribution does the original binary variable have? Here, we don't consider
 just the count of successes,

Actually, we do; that's where the zero and one come from!
With a single trial there can be 0 successes (prob 1-p) or
1 success (prob. p).

 but rather, the variable with two values.
 Obviously, you can derive that the original binary variable has mean p and
 variance p(1-p). But does it make any sense to say that it has a
 distribution?

Yes, of course it makes sense - it's a perfectly ordinary random variable.

Sometimes called the Bernoulli distribution, because it's the distribution
of the number of successes of a single trial from a Bernoulli process.

Obviously it's also a binomial with n=1.
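
A two-line scipy check (my own) that the Bernoulli really is the Binomial
with n = 1:

from scipy.stats import bernoulli, binom

p = 0.3
print(bernoulli(p).pmf([0, 1]))    # [0.7 0.3]
print(binom(1, p).pmf([0, 1]))     # the same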

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: probability questions

2000-10-19 Thread Glen Barnett

[EMAIL PROTECTED] wrote:
 
 Two probability questions...
 If X has chi-square distribution with 5 degrees of freedom
 
 1.  what is the probability of X < 3
 2.  what is the probability of X < 3 given that X > 1.1
 

Homework, right?

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: More probability

2000-10-19 Thread Glen Barnett

[EMAIL PROTECTED] wrote:
 
 A random variable, X, has the Uniform distribution
 f(x) = 0.4 for a < x < 2.5, and 0 otherwise
 
 1. what is a
 2. what is the probability 1 < x < 2 given that x > .5
 3. what is the median
 4. what is c such that P(x > c) = .05

More homework.

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: consistent statistic

2000-09-12 Thread Glen Barnett

Chuck Cleland wrote:
 
 Hello:
   If I understand the concept correctly, a consistent statistic is one
 whose value approaches the population value as the sample size
 increases.  I am looking for examples of statistics that are _not_
 consistent.  The best examples would be statistics that are not
 computationally complex and could be understood by large and diverse
 audiences.  Also, how can one go about demonstrating the statistic is
 not consistent thru simulation?
 
 thanks for any suggestions,
 
 Chuck

I've always been fond of this statistic: "7".

It is only consistent if the population value also
happens to be 7, and it bears no relation whatever
to the data, so it isn't affected by sample size.

It makes a reasonable second or third example -
I wouldn't lead with it.
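
On the simulation question, a sketch along these lines works (my own, with
invented settings): watch how the sampling distribution of each estimator
behaves as n grows. The sample mean concentrates on the true value; the
constant "7" and the first observation do not.

import numpy as np

rng = np.random.default_rng(11)
true_mean = 5.0
for n in (10, 100, 1000):
    samples = rng.normal(true_mean, 2.0, size=(2000, n))
    xbar = samples.mean(axis=1)     # consistent: concentrates near 5 as n grows
    first = samples[:, 0]           # not consistent: its spread never shrinks
    seven = np.full(2000, 7.0)      # "7": not consistent unless the mean is 7
    print(n, round(xbar.std(), 3), round(first.std(), 3), round(seven.std(), 3))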

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Skewness and Kurtosis Questions

2000-09-01 Thread Glen Barnett


- Original Message - 
From: David A. Heiser [EMAIL PROTECTED]
To: [EMAIL PROTECTED]; Glen Barnett [EMAIL PROTECTED]
Sent: Friday, September 01, 2000 1:13 PM
Subject: Re: Skewness and Kurtosis Questions


 Barnett then goes on...
 
   Now, if I delete the two 150's on the end of data set #1 and change the
   ranges on the formulae, I get a mean of 7.28 and I still get a median of
 0.
   Again, the mean is larger than the median so this should be positively
   skewed but Excel returns a value of -0.370.
 
  It looks like you've just constructed just such an example as I mentioned.
 
   I have verified Excel's calculations manually and they appear to be
 correct
   so it would appear that the commonly used statement that:
  
   mean > median: positive, or right-skewness
   mean = median: symmetry, or zero-skewness
   mean < median: negative, or left-skewness
  
   is incorrect, or, am I overlooking something?
 
 It is correct if you measure skewness in terms of mean-median. If you
 measure it some other way, it is no longer true.  Note in particular
 that zero third central moment does not imply symmetry (contrary
 to what some books assert).
 
 If you use form 1) or form 3) then a zero value represents complete
 symmetry. 

(I snipped them, but both forms were moment/cumulant based
measures)

I'm sorry, but this is wrong.
Counterexamples are easy to construct and can be found
in the literature. You can even set *all* odd moments to zero
and still have non-symmetry. See, for example, Kendall and Stuart.

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Skewness and Kurtosis Questions

2000-08-30 Thread Glen Barnett


christopher.mecklin [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 And as far as using EXCEL's help menus as a stat reference, well EXCEL 2000
 also claims the following about the two-sample t-test:  "You can use t-tests
 to determine whether two sample means are equal."

Just in case any students are reading this and don't realise it, Chris is
pointing out that that statement in Excel is nonsense (so other things it
tells you are suspect). You can tell when sample means differ just by looking
at them. It is for making inferences about populations that some people might
use t-tests.

Never use Excel help as a source of statistical knowledge! It is worse than
nothing in that respect.

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Skewness and Kurtosis Questions

2000-08-30 Thread Glen Barnett


Ronny Richardson [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Several references I have looked at define skewness as follows:

 mean > median: positive, or right-skewness
 mean = median: symmetry, or zero-skewness
 mean < median: negative, or left-skewness

You see these kinds of statements quite often in books.
They are okay if you *define* skewness as some scaled
version of mean-median.

 Now, if I enter the following data into Excel:

 -125, -100, -50, -25, -1, 0, 0, 0, 0, 0, 0, 0, 25, 50, 75, 75, 100, 107,
 150, 150

 You get a mean of 21.55 and a median of 0 so the mean is larger than the
 median and the data is right-skewed. Excel returns a skewness of 0.028,
 which is positive but barely so.

 If I enter the second data set of:

 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 9, 8, 7, 6, 25, 50, 75, 100, 125

 Excel returns a mean of 23.50 and a median of 8.00 so the mean and median
 are closer together than data set #1 but the skewness value is 2.035, much
 larger than #1. Why should a mean and median that are closer together
 generate a skewness measure that is so much larger? Does this mean that the
 magnitude of the skewness number has no meaning?

There are several problems.
(i) mean-median is measured in the units of the original data.
A skewness measure based on the standardised third central moment
(as is commonly used) is unit-free. Double all your numbers in a
data set and you double "mean-median", but skewness is unchanged.
(ii) there is not necessarily any relationship between the standardised
third central moment measure of skewness and a (standardised)
mean-median measure of skewness (e.g. [mean-median]/std.dev).
It is easy to construct data sets where the third-moment skewness
measure has one sign while the mean-median skewness measure has
the opposite sign - a quick check on your own modified data set appears
below.
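
Here is that check (using scipy; its bias-corrected skewness uses the adjusted
Fisher-Pearson formula, which as far as I know is what Excel's SKEW() computes,
so it should roughly reproduce the figures you quote below): data set #1 with
the two 150's deleted has mean > median, yet a negative moment-based skewness.

import numpy as np
from scipy.stats import skew

x = np.array([-125, -100, -50, -25, -1, 0, 0, 0, 0, 0,
              0, 0, 25, 50, 75, 75, 100, 107], dtype=float)

print(np.mean(x), np.median(x))    # roughly 7.28 and 0: mean > median
print(skew(x, bias=False))         # roughly -0.37: negative moment skewness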

 Now, if I delete the two 150's on the end of data set #1 and change the
 ranges on the formulae, I get a mean of 7.28 and I still get a median of 0.
 Again, the mean is larger than the median so this should be positively
 skewed but Excel returns a value of -0.370.

It looks like you've just constructed just such an example as I mentioned.

 I have verified Excel's calculations manually and they appear to be correct
 so it would appear that the commonly used statement that:

 mean > median: positive, or right-skewness
 mean = median: symmetry, or zero-skewness
 mean < median: negative, or left-skewness

 is incorrect, or, am I overlooking something?

It is correct if you measure skewness in terms of mean-median. If you
measure it some other way, it is no longer true.  Note in particular
that zero third central moment does not imply symmetry (contrary
to what some books assert).

 Excel, and another reference I looked at, state that "The peakedness of a
 distribution is measured by its kurtosis. Positive kurtosis indicates a
 relatively peaked distribution. Negative kurtosis indicates a relatively
 flat distribution."

These are relative to a normal distribution.

This statement is also wrong (as pointed out in Kendall and Stuart). Kurtosis
(as measured by the standardized fourth central moment, sometimes with 3
subtracted, as would have been intended by the above reference) is a
*combination* of peakedness and heavy-tailedness; more specifically, it
reflects the tendency of observations to fall away from the region
mean +/- 1 std. deviation.


 If that is the case, what does it mean that data set #1 above has a
 kurtosis value of zero?

It is supposedly of similar peakedness and heavy-tailedness as a normal
distribution.


 I appreciate any comments you can supply.


Beware those books! If they get that wrong, what else have they not understood?

Fortunately you have had the sense to verify these things for yourself rather
than just accept what some book tells you.

Kendall and Stuart Vol I may help to clear up some of these issues for you.
(Advanced Theory of Statistics. Don't be put off by the title - it is quite
readable; more so than many books with the word "Introduction" or
"Introductory" in the title!)

Glen




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: transforming ratios

2000-08-27 Thread Glen Barnett


Jeff E. Houlahan [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 A colleague is looking at the relative amounts of two different types
 of fatty acids (say, fatty acids A and B) that are incorporated in two
 different types of tissues.  He is comparing the ratio of A:B in the
 two tissues, but the data are heteroscedastic. He has tried
 several transformations but nothing is stabilizing the variance.  Is
 there a transformation that is specifically for ratios (the ratios range
 from 1:5 to 5:1)?  Thanks a lot.

The obvious transformation with ratios is logs, but presumably
that was already considered.
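
For what it's worth, a tiny illustration (my own) of why logs are the natural
first try with ratios: on the log scale, 1:5 and 5:1 become symmetric about
zero.

import numpy as np

ratios = np.array([1 / 5, 1 / 2, 1, 2, 5])
print(np.log(ratios))    # symmetric about 0: [-1.61 -0.69  0.    0.69  1.61]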

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: t-test normality assumption

2000-08-07 Thread Glen Barnett

Bob Hayden wrote:
 In addition to the approximation involved in using the CLT, most
 (possibly all) practical situations require that you estimate the
 population standard deviation with the sample standard deviation in
 calculating a standard error for use in constructing a confidence
 interval or doing a hypothesis test.  This introduces additional
 error.  Again, the error is small for large samples.  For smaller
 samples, it can be fairly large.  The usual way around that problem is
 to use the t distribution, which you can think of as a modified normal
 distribution -- the modifications being those needed to exactly offset
 this source of error.  The trouble is, in order to calculate those
 corrections, we need to know the shape of the population
 distribution.  The corrections incorporated into the t-distribution
 are those appropriate for a normal distribution.  So, when we use the
 t-distribution, we need to have the population close to normally
 distributed in order for the usual test statistic to have a
 t(not z)-distribution.

Yes.

A lot of people miss the fact that the t-statistic has both
a numerator and denominator. The numerator will go to the
normal when the CLT holds (but how quickly depends on the
distribution). 

However, the denominator needs to:
1) go to a multiple of the square root of (a chi-squared r.v. / d.f.)
2) be independent of the numerator

to give you a t-distribution. In practice these only need 
to hold closely enough to yield something close to a 
t-distribution at the sample size you're interested in.

This isn't all - even if you get this, you are only getting
robustness to the /significance level/. You also want decent 
power-robustness. That may be a problem for the t in some
circumstances; there's not much point in keeping close to
the right Type I error rate if you take no account of the
Type II error rate.

There are times when a test of location for which the 
normality assumption is not required may be less of a risk;
the tiny amount of power (the relative efficiencies are very 
close to 1) you give up when the data are exactly normal is
a tiny price to pay to maintain good efficiency when you
move away from the normal. This may be a more-robust version
of the t-test, it may be a randomization/permutation test
or it may be a rank-based equivalent.

Glen


=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: summarizing p-values

2000-08-04 Thread Glen Barnett


[EMAIL PROTECTED] wrote in message 8mbhrh$fuk$[EMAIL PROTECTED]">news:8mbhrh$fuk$[EMAIL PROTECTED]...
 Hello from Germany,
 as a part of my dissertation in medicine, I have
 to summarize some results of clinical trials.
 My question: when summarizing the results
 (percentage differences of certain parameters),
 how can I take account of the different p-values
 (which are calculated with different tests in the
 trials)? Is it possible to form something like a
 weighted mean of the p-values and the sample
 sizes in the trials to generate a common effect
 size for the different results in the trials?
 
 Thanks for your comments,
 Marc

There exist ways of combining p-values from independent 
tests. However, they don't generally weight by n, because 
that's already taken into account in the p-value. (e.g. Fisher's 
technique of summing -2log p_i and comparing with a chi-
squared distribution with df equal to twice the number of tests.)
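
A small sketch of Fisher's combination method (the p-values here are invented,
purely for illustration):

import numpy as np
from scipy.stats import chi2

p = np.array([0.03, 0.20, 0.11, 0.41])      # p-values from independent tests
stat = -2 * np.sum(np.log(p))
combined_p = chi2.sf(stat, df=2 * len(p))   # df = twice the number of tests
print(stat, combined_p)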

It sounds, however, like you're attempting a meta-analysis, whose
various pitfalls and problems other people would be better qualified
than I am to explain.

Glen



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: skewness Kurtosis

2000-07-30 Thread Glen Barnett


jagan mohan [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Respected Members,

 Coefficient of Skewness (beta-1) = (3rd moment)^2/ (2nd moment)^3

 Coefficient of Kurtosis (beta-2) = (4th moment)/(2nd moment)^2.

 where do I get proofs for these two? Please let me know about
 this.

You don't prove definitions.

Normally people look at the (signed) square root of beta_1 and call that
skewness. beta_1 itself doesn't tell you about the direction the data is
skewed in.

Glen




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Power Function Neagtive Intercept

2000-07-26 Thread Glen Barnett


Dr. N.S. Gandhi Prasad [EMAIL PROTECTED] wrote in message
013501bff62b$f871d6e0$[EMAIL PROTECTED]">news:013501bff62b$f871d6e0$[EMAIL PROTECTED]...

I have fitted a power function

Y = a*(X1^b1)*(X2^b2)*(X3^b3)

by transforming Y as well as the Xs into logs and following a
least squares procedure. However, the estimate of 'a' is
found to be negative. Can we accept the results? What

Not possible. Perhaps your estimate of *log(a)* is
negative? This simply implies that your (median)
estimate of a is less than 1.
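
A numpy sketch (with invented data) of the log-log fit: the intercept
estimates log(a), so a negative intercept simply means the back-transformed
estimate exp(intercept) of a is below 1.

import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(1, 10, size=(50, 3))
a, b = 0.4, np.array([0.6, 0.3, 0.8])              # note a < 1 here
Y = a * np.prod(X ** b, axis=1) * rng.lognormal(0, 0.1, 50)

Z = np.column_stack([np.ones(50), np.log(X)])      # regressors: 1, log X1..X3
coef, *_ = np.linalg.lstsq(Z, np.log(Y), rcond=None)
print(coef[0])             # negative, because log(0.4) < 0
print(np.exp(coef[0]))     # back-transformed (median-type) estimate of a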

meaning can be attached to 'a'. Here Y is output and Xs
are input variables

Where?

Glen





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: extrapolation

2000-07-26 Thread Glen Barnett


Veeral Patel [EMAIL PROTECTED] wrote in message
news:397cfc9a$[EMAIL PROTECTED]...
 Hi,

 I have a set of data (25000 samples), i have plotted a histogram , the

Wow! How many observations in each sample?

Glen




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: contrasts for Kruskal Wallis

2000-07-25 Thread Glen Barnett


Richard M. Barton [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 Suppose I have 4 groups, and want to compare means. I do a
 one-way ANOVA using Bonferroni (my choice) contrasts to get
 at pairwise differences.

 Suppose I decide that I have non-normality problems and decide
 to treat dependent variable as ranks. I can do a Kruskal-Wallis
 test, or equivalently (I'm 99.9% sure) do a one-way ANOVA

Equivalent if you take proper account of the distribution of
ranks, yes.

 on the ranks. Can I then look at the Bonferroni pairwise tests as
 a reasonable follow-up for looking at where the differences lie
 (I'm only 75% sure I can)???

Only in a rough sense. There are multiple comparison procedures
specifically for the Kruskal-Wallis. See, for example, Neave and
Worthington's "Distribution-Free Tests". You might also find
something in Conover, but I don't have it to hand, or I'd check.

Procedures are given in several books.
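
If it helps, here is a sketch (invented data) of the "rough sense" version: a
Kruskal-Wallis test followed by pairwise Mann-Whitney comparisons at a
Bonferroni-adjusted level. A dedicated Kruskal-Wallis multiple comparison
procedure from the books above is preferable where available.

import itertools
import numpy as np
from scipy.stats import kruskal, mannwhitneyu

rng = np.random.default_rng(2)
groups = {g: rng.gamma(2, 1, 12) + shift
          for g, shift in zip("ABCD", (0.0, 0.0, 0.5, 1.0))}

print(kruskal(*groups.values()))

pairs = list(itertools.combinations(groups, 2))
alpha = 0.05 / len(pairs)                  # Bonferroni over the 6 pairs
for g1, g2 in pairs:
    p = mannwhitneyu(groups[g1], groups[g2]).pvalue
    print(g1, g2, round(p, 4), "reject" if p < alpha else "n.s.")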

Glen





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Recommendation?

2000-07-12 Thread Glen Barnett


Michael Atherton [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...


 I will be applying for faculty positions in Education
 this year and I was wondering if any one can
 recommend departments where alternative
 views on education (i.e., non-constructivist)
 are encouraged or supported.

This is a stats newsgroup - sci.STAT.edu

This is not a group for discussion of education in general,
but of statistical education.

Glen





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Skewness: is 1 Normal? Says Who?

2000-07-09 Thread Glen Barnett


Donald Burrill [EMAIL PROTECTED] wrote in message
[EMAIL PROTECTED]">news:[EMAIL PROTECTED]...
 On Thu, 6 Jul 2000, John Nash wrote (to the AERA-D list):

  Many of us operate under the following assumption:
 
  For |skewness coefficient| < 1, data is considered to be normally
  distributed.

 Well.
 A normal distribution has skewness = 0;  but I presume you know that.
 Skewness only addresses the issue of symmetry, not other aspects of the
 shape of a distribution.  Presumably the rule-of-thumb you state must be
 invoked along with some other rules, since (as other respondents have
 pointed out) skewness < 1 (or any other arbitrary value) will not filter
 out U-shaped or rectangular or triangular or multimodal distributions,
 none of which could be reasonably described as "normal".

 I take it then that you do not really mean to claim that
   "If |skewness|  1, the data are normally distributed.", since the
 antecedent is not sufficient for the consequent.  Probably the "rule" in
 its original form was more like this:
   "If |skewness|  1, the data are NOT normally distributed."
  Or, somewhat more precisely,
   "If |skewness|  1, the null hypothesis that the data are a random
 sample from a normally distributed population can be rejected."

 In that form, the rule presented can be investigated a bit further.
 Using one or more of the techniques mentioned in other responses, under
 what conditions (for openers, how large must the sample be?) would that
 null hypothesis be rejected when |skewness| > 1?

Indeed - for small samples from a normal distribution, sample skewness
(based on standardized 3rd central moment I am assuming) can easily
exceed 1 in absolute value. This means that without bringing sample
size into your rule, you aren't controlling your significance level.
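
A quick simulation of that sample-size point (my own): the sample skewness of
small samples drawn from an exactly normal population exceeds 1 in absolute
value surprisingly often.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(8)
samples = rng.normal(size=(20000, 10))    # samples of n = 10 from a normal
g1 = skew(samples, axis=1)
print(np.mean(np.abs(g1) > 1))            # a non-trivial proportion exceed 1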

If you are only interested in skewed alternatives, the sample skewness
can be a pretty powerful test of normality (the idea effectively dates
back to Karl Pearson in the 19th century), but - even if we choose
our rejection rule so we have some idea of our significance level - it
is useless at picking up any non-normal distribution with low third
central moment. Even some non-symmetric distributions have zero
third central moment!

A good place to pursue this is the book on goodness of fit tests by
D'Agostino and Stephens. IIRC Kendall and Stuart (vol II) has
some stuff on it as well.

Glen






=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: this list

2000-03-01 Thread Glen Barnett

Rich Ulrich wrote:
[...]
  - I agree with that.
 
  - and here is something that I read today on another group, which is
 directly about the problem of protesting about posters who annoy you.

Dealing with Chambers is easy - people like that infest most of
usenet. If you have killfiles, *plonk*. Even if you don't have 
a newsreader with killfiles, you don't *have* to read a post when
you see who it is from. His posts don't change, just the people
he chooses to insult - why keep reading? Nobody makes you read
every post.

On any newsgroup with a person like that, I just ignore any thread
from the moment they post to it (Terry Austin is a prime example on
some other groups I read) - all posts to that thread after the person
in question has posted are contaminated and will contain no useful
information. I have more important stuff to read. Usenet is much
more of a joy these days. 

Glen


===
This list is open to everyone.  Occasionally, less thoughtful
people send inappropriate messages.  Please DO NOT COMPLAIN TO
THE POSTMASTER about these messages because the postmaster has no
way of controlling them, and excessive complaints will result in
termination of the list.

For information about this list, including information about the
problem of inappropriate messages and information about how to
unsubscribe, please see the web page at
http://jse.stat.ncsu.edu/
===



Re: Disadvantage of Non-parametric vs. Parametric Test

1999-12-08 Thread Glen Barnett

Frank E Harrell Jr wrote:
 
   Alex Yu wrote:
   
Disadvantages of non-parametric tests:
   
Losing precision: Edgington (1995) asserted that when more precise
measurements are available, it is unwise to degrade the precision by
transforming the measurements into ranked data.
 
 Edgington's comment is off the mark in most cases.  The efficiency of the
 Wilcoxon-Mann-Whitney test is 3/pi (0.96) with respect to the t-test
 IF THE DATA ARE NORMAL.  If they are non-normal, the relative
 efficiency of the Wilcoxon test can be arbitrarily better than the t-test.
 Likewise, Spearman's correlation test is quite efficient (I think the
 efficiency is 9/pi^2) relative to the Pearson r test if the data are
 bivariate normal.
 
 Where you lose efficiency with nonparametric methods is with estimation
 of absolute quantities, not with comparing groups or testing correlations.
 The sample median has efficiency of only 2/pi against the sample mean
 if the data are from a normal distribution.

Yes, the median is inefficient at the normal. This is the
location estimator corresponding to the sign test in the one-sample
case. But if you use the location estimator corresponding to the 
signed-rank test (say) instead, the efficiency improves substantially.
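
A simulation sketch of that comparison at the normal (my own), using the
Hodges-Lehmann estimator - the median of all pairwise averages, which is the
location estimator associated with the signed-rank test:

import numpy as np

def hodges_lehmann(x):
    i, j = np.triu_indices(len(x))
    return np.median((x[i] + x[j]) / 2)    # median of the Walsh averages

rng = np.random.default_rng(13)
n, reps = 20, 5000
samples = rng.normal(size=(reps, n))

means = samples.mean(axis=1)
medians = np.median(samples, axis=1)
hl = np.array([hodges_lehmann(s) for s in samples])

# the median's variance is noticeably larger than the mean's;
# the Hodges-Lehmann estimator's is only slightly larger
print(means.var(), medians.var(), hl.var())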

Glen



Re: Disadvantage of Non-parametric vs. Parametric Test

1999-12-08 Thread Glen Barnett

Rich Ulrich wrote:
  - In my vocabulary, these days, "nonparametric"  starts out with data
 being ranked, or otherwise being placed into categories -- it is the
 infinite parameters involved in that sort of non-reversible re-scoring
 which earns the label, nonparametric.  (I am still trying to get my
 definition to be complete and concise.)

Well, I am happy for you to use this definition of nonparametric now 
that you've said what you want it to mean, but it isn't exactly
what most statisticians - including those of us that distinguish
between the terms "distribution-free" and "nonparametric" - mean 
by "nonparametric", so you'll have to excuse my earlier ignorance 
of your definition.

If my recollection is correct, a parametric procedure is where the
entire distribution is specified up to a finite number of parameters,
whereas a nonparametric procedure is one where the distribution 
can't be/isn't specified with only a finite number of unspecified
parameters. This typically includes the usual distribution-free 
procedures, including many rank-based procedures, but it also 
includes many other things - including some that don't transform 
the data in any way, and even some based on means.

So, for example, ordinary simple linear regression is parametric,
because the distribution of y|x is specified, up to the value of 
the parameters specifying the intercept and slope of the line, and
the variance about the line.

Nonparametric regression (as the term is typically  
used in the literature), by contrast, is effectively
infinite-parametric, because the distribution of y|x
doesn't depend only on a finite number of parameters 
(often the distribution *about* E[y|x] is parametric 
- typically gaussian - but E[y|x] itself is where the 
infinite-parametric part comes from).

Nonparametric regression would not seem to fit your definition 
of "nonparametric", since your usage seems to require some
loss of information through ranking or categorisation. 
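
A bare-bones sketch (my own) of nonparametric regression in this sense:
E[y|x] is estimated by local averaging rather than by a finite set of
parameters, while the noise about the curve is still gaussian and nothing
is ranked or categorised.

import numpy as np

rng = np.random.default_rng(21)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(0, 0.3, 200)    # E[y|x] is a curve, not a line

def local_mean(x0, x, y, h=0.8):
    # crude Nadaraya-Watson-style estimate of E[y | x = x0], box kernel
    w = np.abs(x - x0) < h
    return y[w].mean()

grid = np.linspace(0.5, 9.5, 10)
print([round(local_mean(g, x, y), 2) for g in grid])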

Once we start using the same terminology, we tend to find the
disagreements die down a bit. 

Glen



Re: Disadvantage of Non-parametric vs. Parametric Test

1999-12-07 Thread Glen Barnett

Alex Yu wrote:
 
 Disadvantages of non-parametric tests:
 
 Losing precision: Edgington (1995) asserted that when more precise
 measurements are available, it is unwise to degrade the precision by
 transforming the measurements into ranked data.

So this is an argument against rank-based nonparametric tests
rather than nonparametric tests in general. In fact, I think
you'll find Edgington highly supportive of randomization procedures,
which are nonparametric.

In fact, surprising as it may seem, a lot of the location 
information in a two sample problem is in the ranks. Where
you really start to lose information is in ignoring ordering
when it is present.
 
 Low power: Generally speaking, the statistical power of non-parametric
 tests is lower than that of their parametric counterparts except on a few
 occasions (Hodges & Lehmann, 1956; Tanizaki, 1997).

When the parametric assumptions hold, yes. e.g. if you assume normality
and the data really *are* normal. When the parametric assumptions are
violated, it isn't hard to beat the standard parametric techniques.

However, frequently that loss is remarkably small when the parametric
assumption holds exactly. In cases where they both do badly, the
parametric may outperform the nonparametric by a more substantial
margin (that is, when you should use something else anyway - for
example, a t-test outperforms a WMW when the distributions are
uniform).

 Inaccuracy in multiple violations: Non-parametric tests tend to produce
 biased results when multiple assumptions are violated (Glass, 1996;
 Zimmerman, 1998).

Sometimes you only need one violation:
Some nonparametric procedures are even more badly affected by
some forms of non-independence than their parametric equivalents.
 
 Testing distributions only: Further, non-parametric tests are criticized
 for being incapable of answering the focused question. For example, the
 WMW procedure tests whether the two distributions are different in some
 way but does not show how they differ in mean, variance, or shape. Based
 on this limitation, Johnson (1995) preferred robust procedures and data
 transformation to non-parametric tests.

But since WMW is completely insensitive to a change in spread without
a change in location, if either were possible, a rejection would 
imply that there was indeed a location difference of some kind. This
objection strikes me as strange indeed. Does Johnson not understand
what WMW is doing? Why on earth does he think that a t-test suffers
any less from these problems than WMW?
 
Similarly, a change in shape sufficient to get a rejection of a WMW
test would imply a change in location (in the sense that the "middle"
had moved, though the term 'location' becomes somewhat harder to pin
down precisely in this case).  e.g. (use a monospaced font to see this):

:. .:
::.   =  .::
...   ...
a b   a b
 
would imply a different 'location' in some sense, which WMW will
pick up. I don't understand the problem - a t-test will also reject
in this case; it suffers from this drawback as well (i.e. they are
*both* tests that are sensitive to location differences, insensitive
to spread differences without a corresponding location change, and
both pick up a shape change that moves the "middle" of the data).

However, if such a change in shape were anticipated, simply testing
for a location difference (whether by t-test or not) would be silly. 

Nonparametric (notably rank-based) tests do have some problems,
but making progress on understanding just what they are is 
difficult when such seemingly spurious objections are thrown in.

His preference for robust procedures makes some sense, but the
preference for (presumably monotonic) transformation I would 
see as an argument for a rank-based procedure. e.g. lets say
we are in a two-sample situation, and we decide to use a t-test
after taking logs, because the data are then reasonably normal...
in that situation, the WMW procedure gives the same p-value as 
for the untransformed data. However, let's assume that the 
log-transform wasn't quite right... maybe not strong enough. When 
you finally find the "right" transformation to normality, there
you finally get an extra 5% (roughly) efficiency over the WMW you
started with. Except of course, you never know you have the right
transformation - and if the distribution the data are from are
still skewed/heavy-tailed after transformation (maybe they were
log-gamma to begin with or something), then you still may be better
off using WMW.
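
A quick check of that invariance (with invented data): the WMW statistic and
p-value are identical for the raw data and for the log-transformed data.

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(17)
a = rng.lognormal(0.0, 1.0, 25)
b = rng.lognormal(0.5, 1.0, 25)

print(mannwhitneyu(a, b))                    # on the raw (skewed) scale
print(mannwhitneyu(np.log(a), np.log(b)))    # identical statistic and p-value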

Do you have a full reference for Johnson? I'd like to read what
the reference actually says.

Glen



Re: Sample size and non-parametric test

1999-11-21 Thread Glen Barnett

boonlert wrote:
 
 Dear All
 Can I use a non-parametric test for a sample size less than 30 (central
 limit theorem) 


Sorry, but
(i) what has the central limit theorem have to do with any of this?
(ii) for that matter, what does a sample size of 30 really have to 
 do with the central limit theorem in any case? The rate at which
 the CLT can be regarded as having kicked in sufficiently depends on
 what the sampling distribution is (sometimes n=1 is enough, sometimes
 n=1 isn't enough), and what purpose you're wanting to use the
 theorem for.


 regardless the scale, nominal or ordinal scale, requirement?

Not sure what this sentence is asking. 

 If I can, what is the priority concern for using non-parametric test whether
 sample size or measurement scales?

I'm not sure what you are asking about. Could you please write
in shorter sentences, because you seem to be assuming stuff that
isn't necessarily true.

About all I can glean from what you've written is that you have
some concern about sample size and measurement scale for some
(unspecified) nonparametric procedure or procedures. However
it isn't at all clear which ones you care about, nor what the
concern actually is.

I will say that for almost all nonparametric procedures in
common use, the tables usually go down to very small numbers - 
there is generally no minimum sample size (except as required
to actually calculate the quantities involved). Note also that
some nonparametric procedures may only be suitable for some 
measurement scales, but this has nothing to do with the sample
size AFAIK.

Glen