Re: Logarithms (was: When to Use t and When to Use z Revisited)

2001-12-25 Thread Donald Burrill

On Tue, 11 Dec 2001, Vadim and Oxana Marmer wrote:

 besides, who needs those tables? we have computers now, don't we?
 I was told that there were tables for logarithms once. I have not seen 
 one in my life. Is not it the same kind of stuff?

If you _want_ to see one, you have no farther to go than to Sterling 
Library and look up what there is under mathematical tables.  (Unless, 
in the years since I worked there as an undergraduate, they've thrown 
them all out, which I would hope to be unlikely.)

-- DFB.
 
 Donald F. Burrill [EMAIL PROTECTED]
 184 Nashua Road, Bedford, NH 03110  603-471-7128



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: When to Use t and When to Use z Revisited

2001-12-10 Thread kjetil halvorsen



Ronny Richardson wrote:
 
 A few weeks ago, I posted a message about when to use t and when to use z.
 In reviewing the responses, it seems to me that I did a poor job of
 explaining my question/concern so I am going to try again.
 
 I have included a few references this time since one responder doubted the
 items to which I was referring. The specific references are listed at the
 end of this message.
 
 Bluman has a figure (2, page 333) that is supposed to show the student "When
 to Use the z or t Distribution." I have seen a similar figure in several
 different textbooks. The figure is a logic diagram and the first question
 is "Is sigma known?" If the answer is yes, the diagram says to use z. I do
 not question this; however, I doubt that sigma is ever known in a business
 situation and I only have experience with business statistics books.
 
 If the answer is no, the next question is "Is n>=30?" If the answer is yes,
 the diagram says to use z and estimate sigma with s. This is the option I
 question and I will return to it briefly.
 
 In the diagram, if the answer is no to the question about n>=30, you are to
 use t. I do not question this either.
 
 Now, regarding using z when n>=30. If we always use z when n>=30, then you
 would never need a t table with greater than 28 degrees of freedom. (n=29
 would always yield df=28.) Bluman cuts his off at 28 except for the
 infinity row so he is consistent. (The infinity row shows that t becomes z
 at infinity.)
 
 However, other authors go well beyond 30. Aczel (3, inside cover) has
 values for 29, 30, 40, 60, and 120, in addition to infinity. Levine (4,
 pages E7-E8) has values for 29-100 and then 110 and 112, along with
 infinity. I could go on, but you get the point. If you always switch to z
 at 30, then why have t tables that go above 28? Again, the infinity entry I
 understand, just not the others.
 
 Berenson states (1, page 373), "However, the t distribution has more area
 in the tails and less in the center than does the normal distribution. This
 is because sigma is unknown and we are using s to estimate it. Because we
 are uncertain of the value of sigma, the values of t that we observe will
 be more variable than for Z." So, Berenson seems to me to be saying that
 you always use t when you must estimate sigma using s.
 
 Levine (4, page 424) says roughly the same thing: "However, the t
 distribution has more area in the tails and less in the center than does
 the normal distribution. This is because sigma is unknown and we are using
 s to estimate it. Because we are uncertain of the value of sigma, the values
 of t that we observe will be more variable than for Z."
 
 So, I conclude 1) we use z when we know the sigma and either the data is
 normally distributed or the sample size is greater than 30 so we can use
 the central limit theorem.
 
 2) When n<30 and the data is normally distributed, we use t.
 
 3) When n is greater than 30 and we do not know sigma, we must estimate
 sigma using s so we really should be using t rather than z.
 
 Now, every single business statistics book I have examined, including the
 four referenced below, uses z values when performing hypothesis testing or
 computing confidence intervals when n>30.
 
 Are they
 
 1. Wrong
 2. Just oversimplifying it without telling the reader 

They are not oversimplifying, they are complexifying. To quote Polya,
"How to Solve It": "If you need rules, use this one first: 1) Use your
own brains first."

Sigma is hardly ever known, so you must use t. Then why not simply tell
the students: use the t table as far as it goes, (usually around
n=120), and after that, use the n=\infty line (which corresponds to the
normal distribution). Then there is no need for a rule for when to use
z, when to use t.

Kjetil Halvorsen
 
 or am I overlooking something?
 
 Ronny Richardson
 
 References
 --
 (1) Basic Business Statistics, Seventh Edition, Berenson and Levine.
 
 (2) Elementary Statistics: A Step by Step Approach, Third Edition, Bluman.
 
 (3) Complete Business Statistics, Fourth Edition, Aczel.
 
 (4) Statistics for Managers Using Microsoft Excel, Second Edition, Levine,
 Berenson, Stephan.
 





Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Dennis Roberts

At 04:14 AM 12/10/01 +, Jim Snow wrote:
Ronny Richardson [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...

  A few weeks ago, I posted a message about when to use t and when to use z.

I did not see the earlier postings, so forgive me if I repeat advice already
given.:-)

 1. The consequences of using the t distribution instead of the normal
distribution for sample sizes greater than 30 are of no importance in
practice.

what's magical about 30? i say 33 ... no actually, i amend that to 28

 2. There is no good reason for statistical tables for use in practical
analysis of data to give figures for t on numbers of degrees of freedom over
30 except that it makes it simple to routinely use one set of tables when
the variance is estimated from the sample.

with software, there is no need for tables ... period!


 3. There are situations where the error variance is known. They
generally arise when the errors in the data arise from the use of a
measuring instrument with known accuracy or when the figures available are
known to be truncated to a certain number of decimal places. For example:
 Several drivers use cars in a car pool. The distance travelled on each
trip by a driver is recorded, based on the odometer reading. Each
observation has an error which is uniformly distributed in (0,0.2). The
variance of this error is (0.2)^2/12 = 0.0033 and the standard deviation is
0.0577. To calculate confidence limits for the average distance travelled
by each driver, the z statistic should be used.

this is pure speculation ... i have yet to hear of any convincing case 
where the variance is known but, the mean is not


_
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:[EMAIL PROTECTED]
http://roberts.ed.psu.edu/users/droberts/drober~1.htm






Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Gus Gassmann

Dennis Roberts wrote:

 this is pure speculation ... i have yet to hear of any convincing case
 where the variance is known but, the mean is not

What about that other application used so prominently in texts of
business statistics, testing for a proportion?








Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Jon Cryer

But then you should use a binomial (or hypergeometric)
distribution.
Jon Cryer
p.s. Of course, you might approximate
by an appropriate normal distribution.
At 11:39 AM 12/10/01 -0400, you wrote:
Dennis Roberts wrote:
 this is pure speculation ... i have yet to hear of any convincing
case
 where the variance is known but, the mean is not
What about that other application used so prominently in texts of
business statistics, testing for a proportion?



Jon Cryer, Professor Emeritus
Dept. of Statistics and Actuarial Science
The University of Iowa
Iowa City, IA 52242
www.stat.uiowa.edu/~jcryer
office 319-335-0819, home 319-351-4639, FAX 319-335-3017
"It ain't so much the things we don't know that get us into trouble.
It's the things we do know that just ain't so." --Artemus Ward


Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Jerry Dallal

Dennis Roberts wrote:

 this is pure speculation ... i have yet to hear of any convincing case
 where the variance is known but, the mean is not

A scale (weighing device) with known precision.





Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Jon Cryer

I always thought that the precision of a scale was proportional
to the amount weighed. So don't you have to know the mean before you
know the standard deviation? But wait a minute - we are trying to assess
the size of the mean!
Jon Cryer
At 03:42 PM 12/10/01 +, you wrote:
Dennis Roberts wrote:
 this is pure speculation ... i have yet to hear of any convincing
case
 where the variance is known but, the mean is not
A scale (weighing device) with known precision.






Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Art Kendall

The sample mean of the dichotomous (one-zero, dummy) variable is known. It
is the proportion.

Gus Gassmann wrote:

 Dennis Roberts wrote:

  this is pure speculation ... i have yet to hear of any convincing case
  where the variance is known but, the mean is not

 What about that other application used so prominently in texts of
 business statistics, testing for a proportion?






Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Gus Gassmann

Art Kendall wrote:

(putting below the previous quotes for readability)

 Gus Gassmann wrote:

  Dennis Roberts wrote:
 
   this is pure speculation ... i have yet to hear of any convincing case
   where the variance is known but, the mean is not
 
  What about that other application used so prominently in texts of
  business statistics, testing for a proportion?

 the sample mean of the dichotomous (one_zero, dummy) variable is known, It
 is the proportion.

Sure. But when you test Ho: p = p0, you know (or pretend to  know) the
population variance. So if the CLT applies, you should use a z-table, no?








Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Art Kendall

Usually I would use software.  As I tried to show in the sample syntax I posted
earlier, it doesn't usually make much difference whether you use z or t.

Gus Gassmann wrote:

 Art Kendall wrote:

 (putting below the previous quotes for readability)

  Gus Gassmann wrote:
 
   Dennis Roberts wrote:
  
    this is pure speculation ... i have yet to hear of any convincing case
    where the variance is known but, the mean is not
  
   What about that other application used so prominently in texts of
   business statistics, testing for a proportion?

  the sample mean of the dichotomous (one_zero, dummy) variable is known, It
  is the proportion.

 Sure. But when you test Ho: p = p0, you know (or pretend to  know) the
 population variance. So if the CLT applies, you should use a z-table, no?






Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Jon Cryer

Only as an approximation.

At 12:57 PM 12/10/01 -0400, you wrote:
Art Kendall wrote:

(putting below the previous quotes for readability)

  Gus Gassmann wrote:
 
   Dennis Roberts wrote:
  
    this is pure speculation ... i have yet to hear of any convincing case
    where the variance is known but, the mean is not
  
   What about that other application used so prominently in texts of
   business statistics, testing for a proportion?

  the sample mean of the dichotomous (one_zero, dummy) variable is known, It
  is the proportion.

Sure. But when you test Ho: p = p0, you know (or pretend to  know) the
population variance. So if the CLT applies, you should use a z-table, no?











Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Dennis Roberts

At 03:42 PM 12/10/01 +, Jerry Dallal wrote:
Dennis Roberts wrote:

  this is pure speculation ... i have yet to hear of any convincing case
  where the variance is known but, the mean is not

A scale (weighing device) with known precision.

as far as i know ... knowing the precision is expressed in terms of ... 
'accurate to within' ... and if there is ANY 'within' attached ... then 
accuracy for SURE is not known






_
dennis roberts, educational psychology, penn state university
208 cedar, AC 8148632401, mailto:[EMAIL PROTECTED]
http://roberts.ed.psu.edu/users/droberts/drober~1.htm






Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Rich Ulrich

On Mon, 10 Dec 2001 12:57:29 -0400, Gus Gassmann
[EMAIL PROTECTED] wrote:

 Art Kendall wrote:
 
 (putting below the previous quotes for readability)
 
  Gus Gassmann wrote:
 
   Dennis Roberts wrote:
  
    this is pure speculation ... i have yet to hear of any convincing case
    where the variance is known but, the mean is not
  
   What about that other application used so prominently in texts of
   business statistics, testing for a proportion?
 
  the sample mean of the dichotomous (one_zero, dummy) variable is known, It
  is the proportion.
GG  
 Sure. But when you test Ho: p = p0, you know (or pretend to  know) the
 population variance. So if the CLT applies, you should use a z-table, no?
 

That is the textbook justification for chi-squared and z  tests
in the sets of 'nonparametric tests'  which are based on 
rank-order transformations and dichotomizing.

The variance is known, so the test statistic has the shorter tails.
(It works for ranks when you don't have ties.)

-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html





Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Vadim and Oxana Marmer

Besides, who needs those tables? We have computers now, don't we?
I was told that there were tables for logarithms once. I have not seen one
in my life. Isn't it the same kind of stuff?


   3.  Outdated.

 on the grounds that when sigma is unknown, the proper distribution is t
 (unless N is small and the parent population is screwy) regardless of how
 large the sample size may be.  The main (if not the only) reason for the
 apparent logical bifurcation at N = 30 or thereabouts was that, when
 one's only sources of information about critical values were printed
 tables, 30 lines was about what fit on one page (plus maybe a few extra
 lines for 40, 60, 120 d.f.) and one could not (or at any rate did not)
 expect one's business students to have convenient access to more
 extensive tables of the t distribution.  And, one suspects latterly,
 authors were skeptical that students would pay attention to (or perhaps
 be able to master?) the technique of interpolating by reciprocals between
 30 df and larger numbers of df (particularly including infinity).

 But currently, _I_ would not expect business students to carry out the
 calculations for hypothesis tests, or confidence intervals, by hand,
 except maybe half a dozen times in class for the good of their souls:
 I'd expect them to learn to invoke a statistical package, or else
 something like Excel that pretends to supply adequate statistical
 routines.  And for all the packages I know of, there is a built-in
 function for calculating, or approximating, the cumulative distribution
 of t for ANY number of df.  The advice in any _current_ business-
 statistics text ought to be, therefore, to use t _whenever_ sigma is not
 known.  And if the textbook isn't up to that standard, the instructor
 jolly well should be.







Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Vadim and Oxana Marmer

 3) When n is greater than 30 and we do not know sigma, we must estimate
 sigma using s so we really should be using t rather than z.


You are wrong. You use the t-distribution not because you don't know sigma,
but because your statistic has an EXACT t-distribution under certain
conditions. I know that the textbook says that if we knew sigma then the
distribution would be normal, but because we used s instead the
distribution turned out to be t. It does not say how exactly it becomes
t, so you draw the conclusion: use t instead of normal whenever you use s
instead of sigma. But that is wrong; it does not work like this.

When you don't know the underlying distribution of the sample, you may use
the normal distribution (under certain regularity conditions)
as an APPROXIMATION to the actual distribution of your statistic.
The approximate distribution in most cases is not parameter-free; it may
depend, for example, on the unknown sigma. In such a situation you may
replace the unknown parameter by its consistent estimator. The approximate
distribution is still normal. Think about it as an iterated approximation:
first you approximate the actual distribution by N(0,sigma^2), then you
approximate it by N(0,S^2), where S^2 is a consistent estimator for sigma^2.
There are formal theorems that allow you to do this kind of thing.
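The plug-in approximation described above can be checked with a small
simulation. This is only a sketch under stated assumptions: the data are
drawn from an exponential distribution with mean 1 (chosen here purely as
an example of a skewed, non-normal population), and the function name
z_interval_coverage is hypothetical. It counts how often the interval
"sample mean +/- z * s / sqrt(n)" covers the true mean.

```python
import math
import random
from statistics import NormalDist


def z_interval_coverage(n, reps, level=0.95, seed=1):
    """Share of plug-in z intervals that cover the true mean (assumed 1.0)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    true_mean = 1.0  # mean of the assumed Exponential(1) population
    hits = 0
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        m = math.fsum(sample) / n
        # s replaces the unknown sigma (a consistent estimator)
        s = math.sqrt(math.fsum((x - m) ** 2 for x in sample) / (n - 1))
        if abs(m - true_mean) <= z * s / math.sqrt(n):
            hits += 1
    return hits / reps


for n in (10, 50, 500):
    print(n, z_interval_coverage(n, reps=2000))
```

With skewed data like this, the coverage for small n would be expected to
fall short of the nominal 95% and to approach it as n grows, which is the
sense in which the normal distribution is an approximation here.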

The essential difference between the two approaches is that the first one
tries to derive the EXACT distribution, while the second says "I will use
an APPROXIMATION."

The number 30 has no importance at all; throw away all the tables you have. I
cannot believe they still teach you this stuff. I wish it was that
simple: 30!

Your confusion is the result of the oversimplification, and the desire to
provide students with simple strategies, that are present in basic statistics
textbooks. I guess it makes teaching very simple, but it misleads students.
Your confusion is an example. The problem is that there are no simple strategies,
and things are much, much more complicated than they appear in basic textbooks.
Basic textbooks don't tell you the whole story, and they don't even try,
because you simply cannot do this at their level. Don't draw any strong
conclusions after reading only basic textbooks.

In practice, in business and economics statistics, nobody uses
t-tests, but normal and chi-square approximations are used a lot. The
assumptions that you have to make for a t-test are too strong.










Re: When to Use t and When to Use z Revisited

2001-12-10 Thread Vadim and Oxana Marmer


 Sigma is hardly ever known, so you must use t. Then why not simply tell
 the students: use the t table as far as it goes, (usually around
 n=120), and after that, use the n=\infty line (which corresponds to the
 normal distribution). Then there is no need for a rule for when to use
 z, when to use t.


But the data is not normal either in 99.9(9)% of the cases. Furthermore,
the data that you see in economics/business is very often not an iid
sample either. So, one way or another you end up with normal or
chi-square.

Actually, there is an alternative to both approaches: the bootstrap. But
it does not always work and should not be used blindly.
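For readers who have not met the bootstrap, a minimal sketch of one common
variant (the percentile interval for a mean) might look like the following.
The data values and the function name bootstrap_mean_ci are made up for
illustration, and the resampling scheme assumes an iid sample, which, as
noted above, is often exactly what is in doubt.

```python
import random


def bootstrap_mean_ci(data, reps=5000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean (assumes iid data)."""
    rng = random.Random(seed)
    n = len(data)
    # Resample with replacement and record the mean of each resample
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(reps))
    alpha = (1 - level) / 2
    lo = means[int(alpha * reps)]
    hi = means[int((1 - alpha) * reps) - 1]
    return lo, hi


data = [2.1, 0.4, 5.9, 1.3, 0.8, 12.5, 3.3, 0.9, 1.7, 4.2]  # hypothetical
lo, hi = bootstrap_mean_ci(data)
print(lo, hi)
```

No distributional table is consulted at all; the interval comes from the
empirical spread of the resampled means.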






When to Use t and When to Use z Revisited

2001-12-09 Thread Ronny Richardson

A few weeks ago, I posted a message about when to use t and when to use z.
In reviewing the responses, it seems to me that I did a poor job of
explaining my question/concern so I am going to try again.

I have included a few references this time since one responder doubted the
items to which I was referring. The specific references are listed at the
end of this message.

Bluman has a figure (2, page 333) that is supposed to show the student "When
to Use the z or t Distribution." I have seen a similar figure in several
different textbooks. The figure is a logic diagram and the first question
is "Is sigma known?" If the answer is yes, the diagram says to use z. I do
not question this; however, I doubt that sigma is ever known in a business
situation and I only have experience with business statistics books.

If the answer is no, the next question is "Is n>=30?" If the answer is yes,
the diagram says to use z and estimate sigma with s. This is the option I
question and I will return to it briefly.

In the diagram, if the answer is no to the question about n>=30, you are to
use t. I do not question this either.

Now, regarding using z when n>=30. If we always use z when n>=30, then you
would never need a t table with greater than 28 degrees of freedom. (n=29
would always yield df=28.) Bluman cuts his off at 28 except for the
infinity row so he is consistent. (The infinity row shows that t becomes z
at infinity.)

However, other authors go well beyond 30. Aczel (3, inside cover) has
values for 29, 30, 40, 60, and 120, in addition to infinity. Levine (4,
pages E7-E8) has values for 29-100 and then 110 and 112, along with
infinity. I could go on, but you get the point. If you always switch to z
at 30, then why have t tables that go above 28? Again, the infinity entry I
understand, just not the others.

Berenson states (1, page 373), "However, the t distribution has more area
in the tails and less in the center than does the normal distribution. This
is because sigma is unknown and we are using s to estimate it. Because we
are uncertain of the value of sigma, the values of t that we observe will
be more variable than for Z." So, Berenson seems to me to be saying that
you always use t when you must estimate sigma using s.

Levine (4, page 424) says roughly the same thing: "However, the t
distribution has more area in the tails and less in the center than does
the normal distribution. This is because sigma is unknown and we are using
s to estimate it. Because we are uncertain of the value of sigma, the values
of t that we observe will be more variable than for Z."

So, I conclude 1) we use z when we know the sigma and either the data is
normally distributed or the sample size is greater than 30 so we can use
the central limit theorem.

2) When n<30 and the data is normally distributed, we use t.

3) When n is greater than 30 and we do not know sigma, we must estimate
sigma using s so we really should be using t rather than z.

Now, every single business statistics book I have examined, including the
four referenced below, uses z values when performing hypothesis testing or
computing confidence intervals when n>30.

Are they

1. Wrong
2. Just oversimplifying it without telling the reader

or am I overlooking something?

Ronny Richardson



References
--
(1) Basic Business Statistics, Seventh Edition, Berenson and Levine.

(2) Elementary Statistics: A Step by Step Approach, Third Edition, Bluman.

(3) Complete Business Statistics, Fourth Edition, Aczel.

(4) Statistics for Managers Using Microsoft Excel, Second Edition, Levine,
Berenson, Stephan.






Re: When to Use t and When to Use z Revisited

2001-12-09 Thread Donald Burrill

On Sun, 9 Dec 2001, Ronny Richardson wrote in part:

 Bluman has a figure (2, page 333) that is supposed to show the student
 "When to Use the z or t Distribution."  I have seen a similar figure in
 several different textbooks. 

So have I, sometimes as a diagram or flow chart, sometimes in paragraph 
or outline form.

 The figure is a logic diagram and the first question is "Is sigma
 known?" If the answer is yes, the diagram says to use z. I do not 
 question this;  however, I doubt that sigma is ever known in a business 
 situation and I only have experience with business statistics books. 

Depends partly on what parameter one is addressing (either as a 
hypothesis test or as a confidence interval).  For the mean of an unknown 
empirical distribution, I expect you're right.  But for the proportion of 
persons in a population who would want to purchase (for a currently 
topical example) a Segway, the population variance is a known function of 
the proportion (the underlying distribution being, presumably, binomial), 
and for this case the t distribution is simply inappropriate, and one 
ought to use either the proper binomial distribution function, or else 
the normal approximation to the binomial (perhaps after satisfying 
oneself that N is sufficiently large for the approximation to be credible 
with the hypothesized (or observed) value of the proportion;  various 
textbook authors offer assorted recipes for this purpose).
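To make the proportion case concrete, here is a small sketch comparing the
exact binomial tail with the normal approximation. The survey numbers (200
respondents, 30 would-be buyers, hypothesized proportion 0.10) and both
function names are invented for illustration; the point is only that the
standard error comes from the hypothesized proportion, not from a sample
standard deviation, so no t distribution enters.

```python
import math
from statistics import NormalDist


def binom_upper_tail(k, n, p0):
    """P(X >= k) for X ~ Binomial(n, p0), summed exactly."""
    return sum(math.comb(n, i) * p0**i * (1 - p0) ** (n - i)
               for i in range(k, n + 1))


def z_upper_tail(k, n, p0):
    """Normal approximation; the variance p0*(1-p0)/n is fixed by H0."""
    se = math.sqrt(p0 * (1 - p0) / n)
    z = (k / n - p0) / se
    return 1 - NormalDist().cdf(z)


n, k, p0 = 200, 30, 0.10  # hypothetical survey, testing H0: p = 0.10
exact = binom_upper_tail(k, n, p0)
approx = z_upper_tail(k, n, p0)
print(exact, approx)
```

The approximation here uses no continuity correction; comparing the two
numbers is one way of judging whether N is large enough for the
approximation to be credible at the hypothesized proportion.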

{  Snip, discourse on N = 30, although I'd 
   think it were rather on  df = 30.  }

 However, other authors go well beyond 30.  Aczel (3, inside cover) has
 values for 29, 30, 40, 60, and 120, in addition to infinity.  Levine 
 (4, pages E7-E8) has values for 29-100 and then 110 and 112, along with 
 infinity.  I could go on, but you get the point.  If you always switch 
 to z at 30, then why have t tables that go above 28?  Again, the 
 infinity entry I understand, just not the others. 

{  Snip, assorted quotes ...  }

 So, Berenson seems to me to be saying that you always use t when you
 must estimate sigma using s.  Levine (4, page 424) says roughly the 
 same thing, ...

 So, I conclude  {slightly edited -- DB}

 1) we use z when we know the sigma and either the data are normally
 distributed or the sample size is greater than 30 so we can use the
 central limit theorem. 

I would amend this to "the sample size is large enough that we can..." 
Whether 30 is in fact large enough or not depends rather heavily on what 
the true shape of the parent population actually is.  (If it's roughly 
symmetrical and bell-shaped, 30 may be O.K.)

 2) When n<30 and the data are normally distributed, we use t. 

 3) When n is greater than 30 and we do not know sigma, we must estimate 
 sigma using s so we really should be using t rather than z. 

 Now, every single business statistics book I have examined, including 
 the four referenced below, uses z values when performing hypothesis 
 testing or computing confidence intervals when n>30. 

 Are they 

 1. Wrong 
 2. Just oversimplifying it without telling the reader 

 or am I overlooking something? 

I vote for both 1. and 2., since 2. is in my view a subset of 1, although 
others may not share this opinion.  I would add 

  3.  Outdated.

on the grounds that when sigma is unknown, the proper distribution is t 
(unless N is small and the parent population is screwy) regardless of how 
large the sample size may be.  The main (if not the only) reason for the 
apparent logical bifurcation at N = 30 or thereabouts was that, when 
one's only sources of information about critical values were printed 
tables, 30 lines was about what fit on one page (plus maybe a few extra 
lines for 40, 60, 120 d.f.) and one could not (or at any rate did not) 
expect one's business students to have convenient access to more 
extensive tables of the t distribution.  And, one suspects latterly, 
authors were skeptical that students would pay attention to (or perhaps 
be able to master?) the technique of interpolating by reciprocals between 
30 df and larger numbers of df (particularly including infinity). 

But currently, _I_ would not expect business students to carry out the 
calculations for hypothesis tests, or confidence intervals, by hand, 
except maybe half a dozen times in class for the good of their souls:  
I'd expect them to learn to invoke a statistical package, or else 
something like Excel that pretends to supply adequate statistical 
routines.  And for all the packages I know of, there is a built-in 
function for calculating, or approximating, the cumulative distribution 
of t for ANY number of df.  The advice in any _current_ business-
statistics text ought to be, therefore, to use t _whenever_ sigma is not 
known.  And if the textbook isn't up to that standard, the instructor 
jolly well should be.
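As an illustration of how little machinery "t for ANY number of df"
requires, here is a rough sketch that uses only the Python standard
library. It integrates the t density numerically and inverts the result by
bisection; the function names are hypothetical, and in practice one would
normally call a package routine (for example scipy.stats.t.ppf) rather
than hand-roll this.

```python
import math
from statistics import NormalDist


def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """CDF via Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def t_quantile(p, df):
    """Invert the CDF by bisection (intended for moderate p, e.g. 0.975)."""
    if p < 0.5:
        return -t_quantile(1 - p, df)
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:  # widen the bracket until it contains p
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


z = NormalDist().inv_cdf(0.975)
for df in (10, 29, 60, 120, 1000):
    print(df, round(t_quantile(0.975, df), 4), round(z, 4))
```

The printed critical values shrink toward the normal value as df grows,
so nothing special happens at 30; the cut-off is a matter of table layout
rather than of the distribution itself.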

{  Snip, references.  See the original post for more details.  }

-- DFB.
 

Re: When to Use t and When to Use z Revisited

2001-12-09 Thread Jim Snow

Ronny Richardson [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...

 A few weeks ago, I posted a message about when to use t and when to use z.

I did not see the earlier postings, so forgive me if I repeat advice already
given.:-)

1. The consequences of using the t distribution instead of the normal
distribution for sample sizes greater than 30 are of no importance in
practice. The differences in the numbers given as confidence limits are so
small that no sensible person would change their course of action based on
that minuscule variation. In the case of a significance test, a result just
over or just under, say, the 5% level should always be examined in the
knowledge that the 5% is an arbitrary level and that a level of 4.9% or
5.1% could equally well have been chosen.

2. There is no good reason for statistical tables for use in practical
analysis of data to give figures for t on numbers of degrees of freedom over
30 except that it makes it simple to routinely use one set of tables when
the variance is estimated from the sample.
Another reason that books of tables do not include t values for degrees of
freedom between 30, 60, sometimes 120, and infinity is that there is no
need, even for the extreme tails of the distribution and when, for whatever
reason, high accuracy is required, because the intermediate values can be
obtained by harmonic interpolation. That is, the tail entries in the
distribution can be obtained by linear interpolation on 1/n.
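
Harmonic interpolation can be sketched in a few lines of Python. The function name harmonic_interpolate is invented for this example, and the table entries used are the usual two-sided 95% values (2.000 at 60 df, 1.980 at 120 df, 1.960 at infinity).

```python
import math

def harmonic_interpolate(df, df_lo, t_lo, df_hi, t_hi):
    """Interpolate a tabled t value linearly in 1/df.

    df_lo < df < df_hi; df_hi may be math.inf (1/inf is treated as 0).
    """
    x, x_lo, x_hi = 1.0 / df, 1.0 / df_lo, 1.0 / df_hi
    weight = (x - x_hi) / (x_lo - x_hi)
    return t_hi + weight * (t_lo - t_hi)

# 80 df lies halfway between 60 and 120 on the 1/df scale
print(round(harmonic_interpolate(80, 60, 2.000, 120, 1.980), 3))

# 240 df lies halfway between 120 and infinity on the 1/df scale
print(round(harmonic_interpolate(240, 120, 1.980, math.inf, 1.960), 3))
```

Both results agree with the exact critical values to about three decimal places, which is why a table with rows for 30, 60, 120 and infinity loses very little.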

3. There are situations where the error variance is known. They
generally arise when the errors in the data arise from the use of a
measuring instrument with known accuracy or when the figures available are
known to be truncated to a certain number of decimal places. For example:
Several drivers use cars in a car pool. The distance travelled on each
trip by a driver is recorded, based on the odometer reading. Each
observation has an error which is uniformly distributed in (0,0.2). The
variance of this error is ((0.2)^2)/12 = 0.0033 and the standard deviation
is 0.0577. To calculate confidence limits for the average distance travelled
by each driver, the z statistic should be used.
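
The arithmetic in the odometer example can be checked with a short Python sketch. The helper z_half_width and the choice of 25 trips are assumptions made purely for illustration; only the (0, 0.2) error range comes from the example above.

```python
import math
from statistics import NormalDist

width = 0.2                      # error is uniform on (0, 0.2)
error_var = width ** 2 / 12      # variance of a uniform distribution
error_sd = math.sqrt(error_var)

def z_half_width(sd, n, level=0.95):
    """Half-width of a z interval for a mean of n readings with known sd."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return z * sd / math.sqrt(n)

print(f"variance = {error_var:.4f}, sd = {error_sd:.4f}")

# Hypothetical: the contribution of reading error to the mean of 25 trips
print(f"95% half-width for n=25: {z_half_width(error_sd, 25):.4f}")
```

Because the standard deviation comes from the known properties of the instrument rather than from the sample, no degrees of freedom are involved and z is the appropriate reference distribution.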

A similar situation could arise in dealing with data in which the error
arises from the rounding of all numbers to the nearest thousand.

   This is an uncommon situation in a business context, but it arises
quite often in scientific work where the inherent accuracy of a measuring
instrument may be known from long experience and need not be estimated from
the small sample currently being examined.

4. You seem to think the Central Limit Theorem is behind the validity of
t vs z tables. This is not so. The CLT only bears on the Normal shape and
the relation of the variance of an average or sum to the population
variance.

Commenting specifically on points in your posting:

Ronny Richardson [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]...

 A few weeks ago, I posted a message about when to use t and when to use z.
(snip)
 So, I conclude 1) we use z when we know the sigma and either the data is
 normally distributed or the sample size is greater than 30

   Yes, but the difference if you use t is tiny and of no importance.

so we can use the central limit theorem.

No. The CLT is not the reason. The CLT ensures that the average and
sum are Normally distributed for large enough n. Unless the data is very
skewed or bimodal, n=5 is usually large enough in practice. This is a
separate issue to the choice of Normal or t distribution for inference.

 2) When n<30 and the data is normally distributed, we use t.

 3) When n is greater than 30 and we do not know sigma, we must estimate
 sigma using s so we really should be using t rather than z.

but the difference in the resulting numbers is minuscule and of no
importance.

 Now, every single business statistics book I have examined, including the
 four referenced below, use z values when performing hypothesis testing or
 computing confidence intervals when n>30.

 Are they

 1. Wrong
 2. Just oversimplifying it without telling the reader

 or am I overlooking something?

 Ronny Richardson

I hope that helps
Jim Snow







Re: When to Use t and When to Use z Revisited

2001-12-09 Thread Glen

[EMAIL PROTECTED] (Ronny Richardson) wrote in message 
news:[EMAIL PROTECTED]...
 A few weeks ago, I posted a message about when to use t and when to use z.
 In reviewing the responses, it seems to me that I did a poor job of
 explaining my question/concern so I am going to try again.
 
 I have included a few references this time since one responder doubted the
 items to which I was referring. The specific references are listed at the
 end of this message.
 
 Bluman has a figure (2, page 333) that is supposed to show the student When
 to Use the z or t Distribution. I have seen a similar figure in several
 different textbooks. The figure is a logic diagram and the first question
 is Is sigma known? If the answer is yes, the diagram says to use z. I do
 not question this; however, I doubt that sigma is ever known in a business
 situation and I only have experience with business statistics books.
 
 If the answer is no, the next question is Is n>=30? If the answer is yes,
 the diagram says to use z and estimate sigma with s. This is the option I
 question and I will return to it briefly.
 
 In the diagram, if the answer is no to the question about n>=30, you are to
 use t. I do not question this either.
 
 Now, regarding using z when n>=30. If we always use z when n>=30, then you
 would never need a t table with greater than 28 degrees of freedom. (n=29
 would always yield df=28.) Bluman cuts his off at 28 except for the
 infinity row so he is consistent. (The infinity row shows that t becomes z
 at infinity.)
 
 However, other authors go well beyond 30. Aczel (3, inside cover) has
 values for 29, 30, 40, 60, and 120, in addition to infinity. Levine (4,
 pages E7-E8) has values for 29-100 and then 110 and 112, along with
 infinity. I could go on, but you get the point. If you always switch to z
 at 30, then why have t tables that go above 28? Again, the infinity entry I
 understand, just not the others.
 
 Berenson states (1, page 373), However, the t distribution has more area
 in the tails and less in the center than does the normal distribution. This
 is because sigma is unknown and we are using s to estimate it. Because we
 are uncertain of the value of sigma, the values of t that we observe will
 be more variable than for Z. So, Berenson seems to me to be saying that
 you always use t when you must estimate sigma using s.

Yes, but as n becomes large the difference becomes extremely small.

The question is, when is small small enough?

 Levine (4, page 424) says roughly the same thing, However, the t
 distribution has more area in the tails and less in the center than does
 the normal distribution. This is because sigma is unknown and we are using
 s to estimate it. Because we are uncertain of the value sigma, the values
 of t that we observe will be more variable than for Z.
 
 So, I conclude 1) we use z when we know the sigma and either the data is
 normally distributed or the sample size is greater than 30 so we can use
 the central limit theorem.

 2) When n<30 and the data is normally distributed, we use t.
 
 3) When n is greater than 30 and we do not know sigma, we must estimate
 sigma using s so we really should be using t rather than z.


Uh, wait a sec. 

i) The CLT doesn't kick in at the same point for every distribution.
If the distribution is close to normal, you don't need anything like
n=30. If the distribution is (say) highly skew, then n=30 may not be
anywhere near close enough.
ii) Even for a given distribution, a sample size that's close enough
for one application won't necessarily be close enough for another
application.
iii) How much accuracy you get also depends on how far into the tails
you need precision. There's no point knowing the 2.5% points aren't
far out if you need it (for your application) to be accurate near the
0.25% points.
iv) the rate at which the variance approaches the appropriate multiple
of a chi-square depends on the sampling frequency. It's possible it
may never do so, but with large sample size you should generally still
get normality because of Slutzky's theorem. Even if n=30 was right
when we're talking about the mean, it won't in general also be just
right when we're dealing with what's happening with the variance (see
above).
v) the degree to which the dependence between the mean and variance
affects the distribution of the t statistic itself depends on the
distribution you're sampling from (but again, Slutzky should save you
eventually).


For these sorts of reasons, n=30 is oversimplistic. Sometimes it's far
too stringent, sometimes too weak. Better to make some assessment of
the effect of what you regard as possible situations and see if the
consequences are okay for your situation.
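
One way to make that sort of assessment is a quick simulation. The Python sketch below is only an illustration under assumptions of my own choosing: the function name coverage, the lognormal as the "highly skew" case, 4000 replications and a fixed seed are not from the thread. It estimates how often the usual interval (mean plus or minus 1.96 standard errors, with sigma estimated by s) covers the true mean at n=30.

```python
import math
import random

def coverage(draw, true_mean, n=30, reps=4000, z=1.96, seed=1):
    """Fraction of simulated samples whose z-style interval covers true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(n)]
        mean = sum(sample) / n
        var = sum((x - mean) ** 2 for x in sample) / (n - 1)
        se = math.sqrt(var / n)
        if abs(mean - true_mean) <= z * se:
            hits += 1
    return hits / reps

# Normal data: mean 0, sd 1
normal_cov = coverage(lambda rng: rng.gauss(0.0, 1.0), true_mean=0.0)

# Highly skewed data: lognormal(0, 1), whose true mean is exp(0.5)
skewed_cov = coverage(lambda rng: rng.lognormvariate(0.0, 1.0),
                      true_mean=math.exp(0.5))

print(f"normal data:    coverage {normal_cov:.3f}")
print(f"lognormal data: coverage {skewed_cov:.3f}")
```

With normal data the coverage should land a little under the nominal 95% (the price of using 1.96 instead of the t value), while for the skewed case it should be noticeably lower. Changing n, the distribution, or the tail level shows how far any single cutoff such as 30 can be from adequate.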



 Now, every single business statistics book I have examined, including the
 four referenced below, use z values when performing hypothesis testing or
 computing confidence intervals when n>30.
 
 Are they
 
 1. Wrong
 2. Just oversimplifying it without telling the reader
 
 or