RE: Chi-square chart in Excel

2002-02-21 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Ronny Richardson
Sent: Wednesday, February 20, 2002 7:29 PM
To: [EMAIL PROTECTED]
Subject: Chi-square chart in Excel


Can anyone tell me how to produce a chart of the chi-square distribution in
Excel? (I know how to find chi-square values but not how to turn those into
a chart of the chi-square curve.)


Ronny Richardson
---
Excel does not have a function that gives the Chi-Square density.

The following might be helpful regarding future graphs. It is a fraction of
a larger package I am preparing. It is awkward to present it in .txt
format.


DISTRIBUTION              DENSITY        CUMULATIVE     INVERSE
Beta                                     BETADIST       BETAINV
Binomial                  BINOMDIST      BINOMDIST      CRITBINOM
Chi-Square                               CHIDIST        CHINV
Exponential               EXPONDIST      EXPONDIST
F                                        FDIST          FINV
Gamma                     GAMMADIST      GAMMADIST      GAMMAINV
Hypergeometric            HYPGEOMDIST
Log Normal                               LOGNORMDIST    LOGINV
Negative Binomial         NEGBINOMDIST
Normal (with parameters)  NORMDIST       NORMDIST       NORMINV
Normal (z values)                        NORMSDIST      NORMSINV
Poisson                   POISSON        POISSON
t                                        TDIST          TINV
Weibull                   WEIBULL        WEIBULL

(BINOMDIST, EXPONDIST, GAMMADIST, NORMDIST, POISSON and WEIBULL take a
cumulative argument and return either the density or the cumulative value,
depending on that argument.)

You have to build a column (say B) of X values.

Build an expression for column C calculating the Chi-Square density, given
the x value in col B and the df value in A1.

It would be =EXP(-($A$1/2)*LN(2) - GAMMALN($A$1/2) + (($A$1/2)-1)*LN(B1) - B1/2),
with the degrees of freedom in cell A1.
You can drag this cell's formula down column C for each X value.

Now build a smoothed scatter (XY) chart, with column B as the X values and
column C as the Y values (series 1).
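
For anyone who wants to verify the column-C values outside of EXCEL, here is a
minimal Python sketch of the same log-density calculation (my addition, using
only the standard math module; the function name chi2_density is just an
illustrative choice):

import math

def chi2_density(x, df):
    # f(x) = exp(-(df/2)*ln(2) - lnGamma(df/2) + (df/2 - 1)*ln(x) - x/2)
    return math.exp(-(df / 2.0) * math.log(2.0)
                    - math.lgamma(df / 2.0)
                    + (df / 2.0 - 1.0) * math.log(x)
                    - x / 2.0)

# compare these against the values produced by the column-C formula
for x in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(x, chi2_density(x, 4))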

DAHeiser



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



An Interesting Student Problem

2002-02-04 Thread David Heiser

Was Darwin's statement "It has been experimentally proved that if a plot of
ground be sown with one species of grass, and a similar plot be sown with
several distinct genera of grasses, a greater number of plants and a greater
weight of dry herbage can be raised" a valid statement?

In the 25th January 2002 edition of Science on page 639, Andy Hector
([EMAIL PROTECTED]) and Rowan Hooper ([EMAIL PROTECTED]) describe the work
conducted at Woburn Abbey in 1824, which was the basis of Darwin's statement.
The article describes the plots used, and gives a full list of references.
The results described in the 1824 source have been put into an interesting
table at (www.sciencemag.org/cgi/content/full/295//639/DC1).

Basically there were 242 experimental plots set up at Woburn Abbey, each 4
square feet, enclosed by boards set in cast iron frames, with leaded tanks
for aquatic species. This work predates modern methods of experimental
design and statistical analysis.

Given the table, the problem given to students would be
1) What would be your hypothesis to test?
2) What statistical methods would you apply given the actual missing data
and unknowns (from the table)? Apply (them) and state your conclusions.
3) What would be a modern experimental design (for 242 plots) given the
variables suggested by the table (and the additional information in the
article and in (if available) the references)?

The interesting aspect is that there are many ways to do 1), 2) and 3). The
interest would be in the verbal presentations by each student.

As Hector and Hooper point out, how ecosystems operate is one of the
hottest and most active areas in ecology today. This really is a tremendous
area for the application of statistics.

DAHeiser



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



RE: Interpreting multiple regression Beta is only way?

2002-02-04 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Wuzzy
Sent: Monday, February 04, 2002 4:14 PM
To: [EMAIL PROTECTED]
Subject: Re: Interpreting multiple regression Beta is only way?

...
I've heard of ridge regression will try to investigate this area
more..

Ridge analysis perturbs the coefficient set obtained from ordinary least-squares
regression. It does so in a way that shrinks the main-effects coefficient
values. The result is no longer the minimum variance of the residuals, but a
more tolerant solution for the coefficients.

The method originated with Hoerl back in the 1950's. Basically its effectiveness
lies in its ability to make reasonable (and better) predictions of Y values from
X values outside of the data set used to obtain the coefficient values. The
other advantage is that it tends to bring coefficient values closer to reality
in a physical sense. The biggest disadvantage is that there is no logical
stopping point at some optimum set of coefficient values. This is why it is
seldom used except in industry, where valid prediction is more important
than the correct model.
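
To make the perturbation concrete, here is a minimal ridge sketch (my addition,
assuming NumPy is available): the coefficients solve (X'X + kI)b = X'y, and k
has to be chosen by the user, which is exactly the missing stopping rule noted
above.

import numpy as np

def ridge_coefficients(X, y, k):
    # adding k to the diagonal of X'X shrinks the least-squares coefficients
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=50)      # two nearly collinear columns
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=50)

print(ridge_coefficients(X, y, 0.0))                # ordinary least squares
print(ridge_coefficients(X, y, 1.0))                # shrunken, more stable values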

DAHeiser



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



RE: how to adjust for variables

2002-01-24 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Wuzzy
Sent: Thursday, January 24, 2002 3:30 PM
To: [EMAIL PROTECTED]
Subject: Re: how to adjust for variables



I find it extremely difficult to interpret multivariate equations.
Are there any good books on conceptualizing the equation?

For instance:
If you are assessing whether protein, fat, or carbohydrate is
important in obesity independent of calories, do you do the following
model:

Disease=carb+protein+fat+calories

and if so, isn't the word calories meaningless as it is equal to the
sum of the other three.
Perhaps it should not be included in the model.

I have read of studies where they will use everything except carb as
follows:

disease=protein+fat+calories

and from here you can determine what substituting carb with protein or
fat will have on the disease.

It is very difficult to conceptualize and very difficult to understand
what the word calories means anymore in a multivariate model..

It seems if you use univariate adjusted values it is easier to model,
I have very little experience in statistics as everyone can tell..

Just commenting, no real question here..  I will probably understand
it with time..

--
The points that Wuzzy makes here illustrate one of the difficulties of model
building.

There are two ways to adjust. I will illustrate below using totally
artificial numbers, because I don't have my handbooks in front of me that
have realistic values.

Assume we have a subject:
Assume the subject consumed 120 grams of fat (with a calorie content of 20
cals/gram), 211 grams of protein (with a calorie content of 15 cals/gram)
and 350 grams of starch (with a calorie content of 33 cals/gram). The total
intake is then 120*20 + 211*15 + 350*33 = 17,115 calories.

I. Adjust for a total calorie level of 10,000 calories per day: Multiply each
gram amount by 10,000/17,115, which gives an adjusted fat consumption of 70.1
grams, protein of 123.3 grams and starch of 204.5 grams. The three X values
then would be 70.1, 123.3 and 204.5 rather than the actual values of 120, 211
and 350 grams.

II. Adjust for individual (standard) calorie levels of (15, 10, 20) cals/gram:
Multiply 120 by 20/15 to give a standardized fat intake of 160 grams. Do the
same for the others to get (211*15/10 = 316.5) and (350*33/20 = 577.5). The
three X values would then be 160, 316.5 and 577.5 rather than the actual values
of 120, 211 and 350 grams.
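
The two adjustments reduce to a few lines of arithmetic; here is a minimal
Python sketch (my addition), using the artificial numbers above:

grams = {"fat": 120.0, "protein": 211.0, "starch": 350.0}
cals_per_gram = {"fat": 20.0, "protein": 15.0, "starch": 33.0}
standard_cals = {"fat": 15.0, "protein": 10.0, "starch": 20.0}

# I. scale all intakes so that total calories equal 10,000
total = sum(grams[k] * cals_per_gram[k] for k in grams)      # 17,115 calories
adj1 = {k: grams[k] * 10000.0 / total for k in grams}        # 70.1, 123.3, 204.5

# II. re-express each intake at the standard calorie level
adj2 = {k: grams[k] * cals_per_gram[k] / standard_cals[k] for k in grams}  # 160, 316.5, 577.5

print(adj1)
print(adj2)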

This of course ignores the really important attributes that may correlate to
the disease, such as these:

Fat
    Saturated
    Unsaturated
    Animal
    Vegetable
    Calories (heat of combustion, higher or lower)
    Thermally processed
    Chemical and physical modifications (i.e., hydrogenation, extraction,
        low-temperature filtering, etc.)
Protein
    Animal
    Vegetable
    Fraction of added chemicals (i.e., lysine)
    Thermally processed
    Calories
Carbohydrates
    Starch from cereals
    Cellulose
    Hydrolyzed cellulose
    Starches from animal sources
    Water-soluble sugars (sucrose)
    Polysaccharides
    Chemically generated sugars (i.e., corn-starch-derived sweeteners)
    Thermally processed starches
    Calories (heat of combustion)
    Chemically esterified starches

If you are going to measure the amount of fat by a calorie measure, this
excludes all the other attributes.

DAHeiser






=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=






RE: Faults and Errors in EXCEL

2002-01-21 Thread David Heiser


-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Humberto Barreto
Sent: Monday, January 21, 2002 6:29 AM
To: David Heiser; [EMAIL PROTECTED]
Subject: Re: Faults and Errors in EXCEL


At 05:41 PM 1/20/02 -0800, David Heiser wrote:
These are my questions:
 1. Given that X is singular, how can I test this in EXCEL given
 that I only
have the MMULT, MINVERSE and MDETERM matrix functions available in EXCEL?
One test is to calculate MDETERM on the X'X matrix. For this data set, the
determinant of X'X is ~E-08, which is not particularly small. Another test
is to multiply (X'X)^-1 times X'X and look for something clearly not an
identity matrix. In this case it very clearly is nothing like 'I'. I tend to
favor the latter.

If you are saying that you don't trust the latter ((X'X)^-1 times (X'X) should
equal the identity matrix) in some cases, then why not always do both, and if
either fails, you know there's a problem?
-
I don't see this as a 'trust' problem. I see it as a problem in making a
decision about the results of the regression. What is the (true/false)
decision on the validity of the values of the regression coefficients? Are
they good to 2 significant figures, 3, 4 or what?  DAH
-
How about taking the inverse of the (X'X)-1 result?  How does that do?

I was playing with

2   3
2   3.01

With 5 zeroes (like above), minverse on the matrix, then minverse on the
inverse gives you the original matrix back, but six or more zeroes and it
does not, getting worse and worse as you add zeroes.

Would this help?
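
The same minverse-twice experiment can be reproduced outside EXCEL; here is a
minimal NumPy sketch (my addition) of the round-trip error growing as the
matrix approaches singularity:

import numpy as np

for eps in (1e-2, 1e-5, 1e-8, 1e-11):
    A = np.array([[2.0, 3.0],
                  [2.0, 3.0 + eps]])            # second row approaches the first
    back = np.linalg.inv(np.linalg.inv(A))      # invert, then invert the inverse
    print(eps, np.max(np.abs(back - A)))        # round-trip error grows as eps shrinks
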
-
You are sort of just restating my problem. It degrades. McCullough's position
is that any result from a program whose LRE value for a computed
coefficient is less than the corresponding LRE value computed by STATA
is absolutely to be rejected.

If my data has only 4 significant figures, all I need is a program that will
give 5-figure coefficient accuracy. My objective is to be able to come up
with something, derived from the data and the program, that tells me I
have at least 5-figure accuracy. DAH

If so, now you have 3 tests that must be passed.

 3. What is the appropriate screen to use with EXCEL (without
 additional
macros) to indicate that the results are wrong in a regression. With the
complicated data sets now being fitted, singularity is not obvious.

Please send me the Excel workbook with the data. I'd like to try a few
ideas.  I'm thinking a chart of y and predicted y might show some obvious
problems.

I will send you by separate message copies of the EXCEL files that I have
been working with, so you can try out your ideas. DAH

 4. Telling students that EXCEL does not properly compute
multivariate
regression is obviously an over-kill.

I agree, although LINEST's limitation of 16 X variables is pretty bad,
don't you think?

No I don't.

Any regression program that uses/depends on the basic IEEE 64-bit floating-
point instruction set as implemented in the Intel (and other) chips has
very severe problems with some matrix problems. The problem is the inherent
inaccuracy of the inversion operation on a matrix. Although the rounding
unit error is about E-16, the error propagates in subsequent computations.
In any summation series, the errors are bounded by n-1 times the adjusted
rounding unit (Stewart), where n is the number of additions. All matrix
inversion methods that I am aware of depend on the Schur complement
operation, S = A22 - A21*A11^-1*A12 (after Stewart), which involves
subtraction. In any floating-point subtraction, the result has fewer
significant figures than the two original terms. The floating-point
operation then in effect adds zeros on the right to fill out the mantissa to
a base length of 52 bits. This is an inherent loss of significant figures.
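
As a one-line illustration of that loss (mine, not from the original message),
in IEEE 64-bit arithmetic:

a = 1.0 + 1e-13     # only about three significant figures of 1e-13 survive the addition
print(a - 1.0)      # 9.992007221626409e-14, not 1e-13: the lost bits come back as zeros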

Consequently, a matrix inversion is the most inaccurate computation in any
software package.

My position is that any commercial software package that uses IEEE 64 bit
arithmetic will have significant errors in the X'X inversion matrix, and
this is the primary source of uncorrectable errors in regression
coefficients. I would say that any attempt to do multivariate work with more
than 16 variables should be avoided.

Now if you have a Fortran package and can do 128-bit work in software, then
you can look at many more variables and the values have a higher trust
level.

The future of computations depends on having 128 and 256 bit floating point
chip sets within the next 5 years.

DAH

Faults and Errors in EXCEL

2002-01-20 Thread David Heiser

I have gone through McCullough and Wilson, two of the papers from the Stern
school (Simon and Simonoff), Cryer, and Goldwater and have workarounds or
fixes for each of the problems that they raise (with data sets) about EXCEL.
There is one, however, that I don't have a fix for. For example, using EXCEL, I
can beat Stata (LRE(coefficients) = 11.1 to 15.1) on a 10th-order polynomial
fit to the NIST Filip Data Set.

The one that I can't fix is the problem of a multivariate linear regression
fit. In this case there is a singular or near-singular X matrix. Given N
columns of X, the rank of X here is N-1 or less.

Let's assume we have the case of the dental data from Gary Simon. Here we
have 54 observations, with 5 variables, 1 for the intercept, 1 for a set of
different values from 0 to 8 (Molar Numb), 2 variables which are indicator
variables (Alc and Drg) and one variable (Drg Grp) is a sum (actually 4
minus the sum) of the two indicator variables. X clearly is singular, but
EXCEL gives an answer with LINEST. I have my own set of matrix and
regression subroutines in QB, which I used to clarify why this occurs.

When I do a Gaussian triangularization with pivoting on X, I get a very
clear true zero on the pivoting, indicating singularity. Going to the X'X
positive definite 5x5 matrix, I end up with reasonable values, but it still
is a singular matrix. Doing the Gaussian again, I get a diagonal value of
8.4E-15, which is close to zero, but not a true zero. Therefore I can invert
X'X in EXCEL and it works, but the values are screwy. I get values of about
E+14 with one row and one column of E-03 values. No divide check errors.
With finite values of X'Y and (X'X)^-1, you get a reasonable set of
coefficient values.

These are my questions:
1. Given that X is singular, how can I test this in EXCEL given that I only
have the MMULT, MINVERSE and MDETERM matrix functions available in EXCEL?
One test is to calculate MDETERM on the X'X matrix. For this data set, the
determinant of X'X is ~E-08, which is not particularly small. Another test
is to multiply (X'X)^-1 times X'X and look for something clearly not an
identity matrix. In this case it very clearly is nothing like 'I'. I tend to
favor the latter. (A small sketch of both screens follows the list of
questions below.)
2. Do we or do we not teach accountants and business majors in their
introductory stat class based on EXCEL about matrices and singularities,
rank and eigenvalues? Assuming that this is never covered or taught, what
supplementary material should be passed out along with the training on the
use of EXCEL?
3. What is the appropriate screen to use with EXCEL (without additional
macros) to indicate that the results of a regression are wrong? With the
complicated data sets now being fitted, singularity is not obvious.
4. Telling students that EXCEL does not properly compute multivariate
regression is obviously overkill.
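
The two screens in question 1 are easy to try outside EXCEL as well; below is a
minimal NumPy sketch (my own illustration, not the Simon dental data) with an
artificial X whose last column is nearly a linear combination of the two
indicator columns, roughly in the spirit of the Drg Grp variable.

import numpy as np

rng = np.random.default_rng(1)
n = 54
X = np.column_stack([
    np.ones(n),                                   # intercept
    rng.integers(0, 9, n).astype(float),          # values 0 to 8
    rng.integers(0, 2, n).astype(float),          # indicator variable
    rng.integers(0, 2, n).astype(float),          # indicator variable
])
X = np.column_stack([X, 4.0 - X[:, 2] - X[:, 3] + 1e-9 * rng.normal(size=n)])

XtX = X.T @ X
print(np.linalg.det(XtX))                         # screen 1: determinant of X'X

XtX_inv = np.linalg.inv(XtX)                      # the inversion "succeeds" anyway
print(np.max(np.abs(XtX_inv @ XtX - np.eye(5))))  # screen 2: far above round-off level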

Any thoughts?

DAHeiser



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



RE: How to compute Beta variates

2002-01-12 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Michael Bals
Sent: Friday, January 11, 2002 7:15 AM
To: [EMAIL PROTECTED]
Subject: How to compute Beta variates


Hi !

I am new to this group, so I hope you haven't been bothered too often
with such questions. I looked in the group but didn't really find
anything.

I want to compute the inverse of the beta distribution in VB. I know
that there is no closed form for it. I tried to translate the AS109
Algo. from StatLib but don't know how to get the log of the complete
beta distribution that is needed for it. perhaps someone can help me
out here.

Are there any other ways to get variates from beta distribution for
ONE specific x~U(0,1)? I need them for Latin Hypercube sampling and as
far as I understood I need the inverse of any distribution I am
sampling in order to use LHS.

I am also looking for inverse methods for the gamma, Poisson, and binomial
distributions, or any approximation to them

A lot of questions...hope someone can help.

Bye
Michael Bals
-
I was waiting for all the experts who know a lot more than me about doing
this.

1. The basic distributions in mathematical form are given in Abramowitz.
Just about everybody who provides packaged programs to calculate
distributions, relies on this source. There are also a number of
approximation (polynomials) methods developed in Fortran from the 60's on,
when computers were large and slow. With the advent of small fast computers,
the solutions using infinite series (i.e. Abramowitz) are fast enough to
give accurate results, if special algorithms are used for p values close to
zero or one. It is the tails of these distributions that present
underflow/overflow/error problems using IEEE 64 bit double precision
floating point numbers. Fortran 90 has a method of doing computations in 128
bit floating point numbers, but this is in software, and as a result is very
slow. The 128 bit machine language arithmetic functions were never
incorporated in VB, even up to the .NET version. I tried building some 128
bit arithmetic functions, but did not get anywhere. One of the problems was
that VB does not have the assembly level building capability that C++ has.
Fortran programs can be easily translated to VB, if you know enough about
both languages.

2. The inverse has traditionally been computed by looping, using Newton's
method, since the density of the distribution is the derivative of the
cumulative function. It is fast and works if the density is smooth. Some
of the software packages use different infinite series (or approximations)
depending on the values of the parameters. When these are combined, the
discontinuities will give closure problems and inaccurate results. The beta
distribution in EXCEL is like this. (A small sketch of this looping scheme, and
of the log-Beta calculation in point 3, follows point 5 below.)

3. A common element in these distributions is the factorial function. The
general approach is to use an infinite series to calculate the log of the
factorial function. Equations 6.1.34 and 6.1.48 in Abramowitz are examples.
Equation 6.2.2 defines the Beta function, and the log(Beta) can be directly
calculated from the logs of the factorials. These equations work for real,
floating point numbers. To obtain the beta function, test the log values for
the allowable exponential range and do an EXP(logBeta). This will give you
values within the 10^+308 to 10^-308 range of a double precision number.

4. In general, random numbers from these distributions are obtained by
getting a good uniform random number from a generator (one that passes the
Diehard tests) and computing the inverse of the cumulative distribution to
obtain a true random deviate. This is slow, and there have been several
approaches to directly calculating a random deviate that is a close
approximation to a true random deviate. Methods for the normal and gamma
distributions have been developed to give fast values suitable for Monte
Carlo studies.

5. I did most of my work on distributions and inverses in the mid-1980's,
starting on a Honeywell mainframe back in the late 70's, so I may not be
familiar with recent developments in this century.
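
As a small illustration of the looping scheme in point 2 and of the log-Beta
calculation in point 3 (my sketch, not the AS109 algorithm itself), the Python
fragment below uses math.lgamma for log B(a,b) and Newton's method on a
cumulative distribution; the standard normal is used for the inversion step
because its CDF is available through math.erf, whereas the beta would first
need an incomplete-beta routine for its CDF.

import math

def log_beta(a, b):
    # log B(a,b) = lnGamma(a) + lnGamma(b) - lnGamma(a+b)   (Abramowitz eq. 6.2.2)
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_fn(a, b):
    lb = log_beta(a, b)
    if abs(lb) > 700.0:      # approximate guard for the exponent range of a double
        raise OverflowError("Beta(a, b) is outside double-precision range")
    return math.exp(lb)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_inv(p, tol=1e-12, max_iter=50):
    # Newton's method: the density is the derivative of the cumulative function
    x = 0.0
    for _ in range(max_iter):
        step = (norm_cdf(x) - p) / norm_pdf(x)
        x -= step
        if abs(step) < tol:
            break
    return x

print(log_beta(250.0, 350.0))   # the log stays finite even when Beta itself underflows
print(beta_fn(2.5, 3.5))        # about 0.0368
print(norm_inv(0.975))          # about 1.959964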

DAHeiser



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=






RE: Excel vs Quattro Pro

2002-01-12 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Jay Warner
Sent: Friday, January 11, 2002 6:45 PM
Cc: [EMAIL PROTECTED]
Subject: Re: Excel vs Quattro Pro

-
Clap, clap, clap (sound of applause)

DAHeiser


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



EXCEL 2000 Statistical ToolPac Faults and Problems

2002-01-08 Thread David Heiser

I have started going through McCullough and Wilson's paper "On the Accuracy
of Statistical Procedures in Microsoft Excel 2000". I have found one error,
and may find more. However I need to put them all together, and that will
take some time.

My point is that the Excel 2000 faults are not that severe when Excel is
used in the intended environment. The NIST tests are pretty severe and
represent primarily invented data sets or unusual data fitting situations.
I can see workarounds to bypass some of the Excel limitations. I need
however to test them for validity against the NIST data sets first. Again
this is going to take some time.

What I intend to say is, don't jump to conclusions yet.

DAHeiser



=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



RE: Stat Requirement (was Excel2000)

2002-01-06 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of David Firth
Sent: Sunday, January 06, 2002 1:22 PM
To: [EMAIL PROTECTED]
Subject: Re: Stat Requirement (was Excel2000)

Very good points. Appreciated.

DAHeiser


=
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at
  http://jse.stat.ncsu.edu/
=



RE: Excel2000- the same errors in stat. computations and graphics

2002-01-05 Thread David Heiser





-Original Message-
From: Jon Cryer [mailto:[EMAIL PROTECTED]]
Sent: Saturday, January 05, 2002 9:14 AM
To: David Heiser
Cc: [EMAIL PROTECTED]
Subject: RE: Excel2000- the same errors in stat. computations and graphics

David:

I have certainly never said nor implied that Excel cannot produce reasonably
good graphics. My concern is that it makes it so easy to produce poor
graphics. The defaults are absurd and should never be used. It seems to me
that defaults should produce at least something useful. The default graphs
are certainly not good business graphs if the intent is to produce good
visual display of quantitative information! Isn't that what graphs are for?

[David Heiser]

The EXCEL chart defaults are, as you say, poor for an audience/users of
statisticians/scientists/engineers. Even going to the effort of making a lot
of changes to the defaults, you never really get an outstanding graph, like
you would find in a professional publication.

The default charts are specifically set up for the type of business
applications (i.e., sales, gross income, expenses, operating costs, salesman
performance, product distributions, etc.) in which bar and column charts
appear to be meaningful and the presentation of 3D graphs impressive (at
least for a late-1980's audience). The Microsoft User's manual clearly
identifies the audience for which EXCEL charts were developed. This
user/audience is essentially the same one that ACCESS was written for. The
chart defaults generate useful charts for management and for tracking
product sales and sales efforts.

What we have now is an entirely different user/audience using EXCEL to do
something it was never designed to do.

The EXCEL ToolPak represents the way things were done and viewed in the
1980's. Essentially it is a 20-year-old package that has never been updated.
In 2002 we have been exposed to the tremendously impressive game displays
and the capabilities of the many (new) statistical software programs, and
now we want better graphics.

I don't think that Microsoft will improve the graphics, leaving it to the
developers to create and market separate programs and add-ons to EXCEL to
give better graphics. I don't think Office XP (EXCEL 2002) is any different
from EXCEL version 5.0.

What would be helpful would be an EXCEL front end for a developer
computation package that would in turn generate a standard interface to
separate graphics packages. In a competitive world there would be interface
standards, so that one could buy separate software packages. However, cost
enters in again, and EXCEL as a stand-alone package is for most users the
economic choice. So we have to accept the poor EXCEL graphics and
computational limitations, because many businesses have only the Microsoft
Office application, and under their ground rules (and usage/licensing
requirements) you have to use their EXCEL for any computations and
presentations. This is the environment I have had to work in.

Then on top of it you have such schools as the University of Phoenix, which
teach undergraduate stat as a course in using EXCEL only.



RE: Stat Requirement (was Excel2000)

2002-01-05 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of David Firth
Sent: Saturday, January 05, 2002 3:28 PM
To: [EMAIL PROTECTED]

Subject: Stat Requirement (was Excel2000)
thank you for your response

Thanks for the wealth of Excel trivia. Use the right tool for the job, I
say.
Excel might not be it.

I do have to take a little offense to the accuracy remark regarding business
calcs -- I learned early on that bcd reals or integer-based math libs were
the only appropriate mechanisms for business calcs. I prefer not to use
regular
real/float types if I have alternatives. But I'm a measurement/data acq man.
We can be a bit, well, anal about accuracy.
---
What I had in mind was that the variant form of data input in each cell (in
EXCEL, in VB and in VBA) accepts the following:

Code  Type         Description
0     Empty        No data/entry (once held something but now is blank).
1     Null         No value; unknown data.
2     Integer      Whole number, -32,768 to +32,767; 2 bytes.
3     Long         Whole number (integer), -2,147,483,648 to +2,147,483,647; 4 bytes.
4     Single       Floating-point decimal number, approx. 7 decimal digits; 4 bytes.
5     Double       Floating-point decimal number, approx. 15 decimal digits; 8 bytes.
                   Range -1.79769313486231E308 to -4.94065645841247E-324 and
                   +4.94065645841247E-324 to +1.79769313486232E308.
6     Currency     Decimal number with 4 decimal places; 8 bytes; 19 digits max.
                   Use it to minimize rounding errors.
                   Range -922,337,203,685,477.5808 to +922,337,203,685,477.5807.
7     Date/Time    A number; the integer portion represents days and the decimal
                   portion represents time as a fraction of a day; 8 bytes. A number
                   of functions will extract calendar and time information from it.
8     String       Text; 10 bytes + string length, up to 2 billion characters.
                   However, EXCEL limits cell contents to a maximum of 256 characters.
9     OLE Object   4 bytes.
10    Error        Code number returned if an error occurred in a computation; 2 bytes.
11    Boolean      True or false; 2 bytes. Integer: 0 is false, -1 is true.
12    Variant      An array of variants; 16 bytes for numbers, 22 bytes + string length.
13    Non-OLE Object
14    Decimal      14 bytes. +/-79,228,162,514,264,337,593,543,950,335 with no decimal
                   point; +/-7.9228162514264337593543950335 with 28 places to the
                   right of the decimal. Smallest non-zero number is +/-0.(27 0's)1.
17    Byte         1 byte; 0 to 255.
8192  Array        An ordered table of values.

The number in the left column comes from the VarType() function. EXCEL uses
a separate formatting code to indicate how the number appears in the cell
(i.e. number of decimal points and % conversion). If I do calculations with
currency, I have up to 19 accurate digits, whereas with double, only 15. If
I do integer arithmetic with long integers, I only have up to 10 digits.
When you format a cell, EXCEL allows only 0, 1, 5, 6, 7, 8 and 11. Macros
can use all the other types. I have no experience with using decimal
numbers.
DAHeiser

-

A CS program more aligned with the needs of science, business, or econ might
be found, but CS is a general thing. The idea is that with the help of
content
experts or reference books the capable CS grad could do good work. If the
programmers involved slapped out some code and then went roller skating
(apologies to Dilbert) then they didn't do a good job. My 11 years in the
software side of embedded systems has contained too many recommendations to
peers about spending time to understand the customer's needs.
--
Bravo. Applause. (We will overlook the fact that the customer usually
doesn't know his needs until 2 weeks before product delivery date. This is
why software development is an involved time consuming interactive process.)
DAHeiser
--
IMHO it is more
of a process issue than a knowledge base problem. I took stats as an
elective
when I was an electronics engr tech undergrad because I thought it would be
handy.
---
Great, Good. Applause.
DAHeiser
--

 From what I have observed, many business types have a very limited math
background, and even learning simple business stat is a major problem. For
example, try getting them to understand the difference between using z and
t tests,

Yes, but the idea again is to be a generalist who makes use of content
experts. I am now in an MBA program and have 

RE: Excel2000- the same errors in stat. computations and graphics

2002-01-04 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Shareef Siddeek
Sent: Friday, January 04, 2002 1:22 PM
To: [EMAIL PROTECTED]
Subject: Excel2000- the same errors in stat. computations and graphics



Happy new year to all.

I frequently use Excel2000 for graphic presentation, spreadsheet maths,
simple nonlinear model fitting (using the Excel solver) with one or two
parameters, and simulations. I thought Excel2000 corrected those errors
found in the analysis tool pack and other in-built computational
procedures in the older 97 version. However, following articles point
out that the developers have done nothing to correct those errors. I
would like your comments on this. Thanks. Siddeek
---
1. I appreciate receiving your note and the URLs.

2. One really can't effectively use EXCEL without having to make the effort
of learning it from the books. Some of the complaints from Cryer have to do
with the fact that he never learned how to build charts in EXCEL. This
includes chart layouts, legends, scales, axis, labels, etc. One can use the
drawing overlay features to build up text on the charts. I always recommend
spending time reading the big commercial manuals available on EXCEL 2000. I
have several. EXCEL HELP is lousy for finding the information you really
need.

3. The EXCEL stat package was an add-on developer package by GreyMatter
International Inc, Cambridge, MA. back in the early 90's. Microsoft did not
write it. Being familiar with developers, the people writing the software
have to be familiar with an enormous lexicon of object links and protocols.
Stat is not one of the courses toward a degree in computer science.
Consequently much of the formula building comes from a convenient textbook.
I really am surprised at the developers/programmers out there that have no
knowledge of basic math, or how time works (calendar-time linkage). Much of
the problem has to do with the assumption that software built-in functions
work as the programmer thinks they work, not how they actually work. It is
obvious that Bill Gates has no interest in fixing EXCEL accuracy, only in
its appearance and ability to fit in as a part of larger program packages.
His only interest now is .NET and the ability to pull off company data in
spreadsheet format using the internet as the company's internal network.

4. There is a problem with EXCEL histograms. This has been commented on in
previous edstat e-mails. In general EXCEL produces simple graphs, primarily
for business purposes. It does not produce good scientific graphics. All it
does is get you a quick graph with a minimum of effort.

5. Part of the inaccuracy problem has to do with the fact that each EXCEL
cell by default is treated as a variant variable. Unless you format all the
numerical cells properly (as decimal or integer), you are likely to have
problems. I always format all my cells properly, declaring the type of cell
contents. If, for example, you precede a number with a space, EXCEL may
interpret the number as text. By use of the variant, empty cells can be
handled, and not cause computational halts.

6. The primary use of EXCEL is in business, doing the type of calculations
and reports described in the Microsoft EXCEL User's Guide. In business
applications, accuracy is not that important, except when money is involved.
If, for example, McCullough were to declare his numbers as currency instead
of variant, his accuracy would probably improve. Considering the type of
business applications for stat (for example, see The Complete Idiot's Guide
to Business Statistics), what EXCEL does is fine. From what I have observed,
many business types have a very limited math background, and even learning
simple business stat is a major problem. For example, try getting them to
understand the difference between using z and t tests, and to understand
confidence intervals. Business people expect the computer to give them a
number. The statement by McCullough is that "...it is important for the package
to determine whether the answer is likely to be so corrupted by cumulated
rounding errors as to be worthless and, if so, not to display the answer."
This policy is not acceptable to business types, and this is one of the
ongoing problems on the nets. They would rather get a wrong number than
none. In most cases, the computed result is not the sole basis for a
business decision. (Please note here that these comments do not apply to
those in quality control, research or product improvement/development.)

7. The EXCEL solver was developed by Frontline Systems, at Incline Village,
NV. (Incline Village is an expensive skiing/condominium/housing area up at
the north end of Lake Tahoe. It was named for a huge inclined water 'trough'
that was used in the past to bring logs from the mountains down to sawmills
at the lake.) The solver algorithm has not been divulged, and obviously it
doesn't compare to 

RE: Question about concatenating probability distributions

2001-12-10 Thread David Heiser


RE: The Poisson process and Lognormal action time.

This kind of problem arises a lot in the actuarial literature (a
process for the number of claims and a process for the claim size),
and the Poisson and the lognormal have been used in this context - it
might be worth your while to look there for results.

Glen
...
This is a very general and important event process. It is also used to
describe the general failure-repair process that occurs at any repair shop.
The Poisson is a good approximation of the arrival times of equipment to be
repaired, and the log-normal is a good approximation of the time it takes to
repair it.

From an operations standpoint, the downtime is approximated by the
exponential distribution (occurrence) and a log-normal repair time, which
includes diagnosis, replacement and validation.

In the Air Force (1982-1995), where the reliability and maintainability of
equipment has to be characterized, the means are determined and used in a
form called availability. We never got beyond the use of availability. They
never got into the distribution and confidence interval aspects.
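
A minimal Monte Carlo sketch of that operations picture (my illustration, with
arbitrary parameter values): exponential intervals between failures, log-normal
repair times, and availability estimated as uptime over total time.

import random

random.seed(1)
mtbf = 500.0              # mean time between failures (exponential), arbitrary units
mu, sigma = 2.0, 0.8      # parameters of the log-normal repair time

up = down = 0.0
for _ in range(100000):
    up += random.expovariate(1.0 / mtbf)          # operating interval
    down += random.lognormvariate(mu, sigma)      # diagnosis, replacement, validation

print("availability =", up / (up + down))         # compare with MTBF / (MTBF + MTTR)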

As a general approximation, the log-normal distribution approximates human
reaction times to events.

 DAHeiser



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



RE: What is a confidence interval?

2001-09-27 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Gordon D. Pusch
Sent: Thursday, September 27, 2001 7:33 PM
To: [EMAIL PROTECTED]
Subject: Re: What is a confidence interval?


John Jackson [EMAIL PROTECTED] writes:

 this is the second time I have seen this word used: frequentist?
 What does it mean?

``Frequentist'' is the term used by Bayesians to describe partisans of
Fisher et al's revisionist edict that ``probability'' shall be declared
to be semantically equivalent to ``frequency of events'' in some mythical
ensemble. Bayesians instead hold to the original Laplace-Bernoulli concept
that probability is a measure of one's degree of confidence in an
hypothesis,
whereas the frequency of occurrence of an outcome in a set of trials is a
totally independent concept that does not even live in the same space as
a probability.


-- Gordon D. Pusch
--
I disagree with Pusch.

Bayesians have a way of modifying definitions to support their arguments.

Bayesians are those people who have to invent loss functions in order to
make a decision.

A frequentist defines the concept of probability in terms of gaming,
where the probability is defined as the ratio of the number of times
an event (such as the occurrence of a one showing on a die) is favorable
to the number of times all other events occur (all 6 sides of the die)
as the number of repeats (identically distributed independent random
events) becomes very, very large. This was very difficult to define
mathematically, since what constitutes a repetition could not be adequately
defined.

Von Mises is usually taken as the main source of this concept.

There is a fundamental problem of defining probability, without involving
circular references. The terms identically distributed and independent
(random) events depend on the term equi-probable, and then we are right
back at square 1. The definition of random involves something that
can't be defined, except by saying that the next random event can't be
predicted. When with my die a 2 keeps coming up by chance, then what?

Bayesians say it is just a matter of belief, whatever that is. This leaves
probability undefined, as a mathematical property with values between 0 and
1.

Whether there is such a real thing as zero probability or a probability of
1 or not, for values between 0 and 1, statisticians have to resort to a
frequentist viewpoint in order to establish limiting values as the
number of repetitions approaches infinity.

This is why it is so hard to teach statistics. It all depends on the
student's internal understanding of what probability means. If you
are comfortable with belief then fine. Now tell me what the difference
is between a p value of 0.05 and 0.06 in real world terms?

If my study has a lot of sizzle and has important ramifications
about what we believe about our universe, a p value may not be
important. After all, an early proof of Einstein's theory of relativity was
based on a pretty sloppy observation of the positions of stars near the
Sun at an eclipse.

Fisher in his reflective later life, took great pains to avoid making
a hard and fast decision based on probability values. He always said that it
was up to the investigator to determine whether a p value of 0.06 meant
that there was an improbable chance that random events could have
determined the outcome of his experiment, not the publication editor.

Nowadays it is determined by the stupid peer-review system. Also by editors that
are looking hard at the best way to determine belief in the claims of the
experimenter when they haven't the foggiest idea of what the investigation
was about, and corporate profits or the status quo is the most important
issue. This was very probably the situation in England in the 1950's
which pushed Fisher to go to Australia.

Joseph F. Lucke said this in a recent post:
--
I saw the same show on Nova.  Flower had a different definition of
randomness than we now use.  We now define randomness as (probabilistic)
independence, but that was not always the case.  In the 1930s or so, the
mathematician-philosopher-statistician von Mises developed a theory of
probability based on frequencies.  This was not the Kolmogorov version in
which the axioms are interpreted as frequencies, but an axiomatic system
derived from the properties of repeated events.  Von Mises introduced the
notion of a collective or sequence of potentially infinitely repeatable
events.  Probability was defined as the limiting relative frequency in this
collective. One of his axioms was that the events within the collective were
random.   But because he had not yet developed the concept of independence
in his system, he could not define randomness as mutual independence among
the events within the collective.  Indeed, randomness was a primitive
concept in his axiomatic system.  Von Mises defined randomness 

RE: what type of distribution on this sampling

2001-09-21 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of Joe Galenko
Sent: Friday, September 21, 2001 12:30 PM
To: [EMAIL PROTECTED]
Subject: Re: what type of distribution on this sampling

Just out of curiosity, I'd like to know what kind of population you could
have such that a sample mean with N = 200 wouldn't be approximately
Normally distributed.  That would have to be a very, very strange
distribution indeed.


On Fri, 21 Sep 2001, Gus Gassmann wrote:

 Joe Galenko wrote:

  The mean of a random sample of size 81 from a population of size 1 billion
  is going to be Normally distributed regardless of the distribution of the
  overall population (i.e., the 1 billion). Oftentimes the magic number of
  30 is used to say that the mean will have a Normal distribution, although
  that is when we're drawing from an infinitely large population. But for
  the purposes of determining the distribution of a mean, 1 billion is
  effectively infinite. And so, 81 is plenty.


Certain generating processes such as those generating particles by
mechanical grinders generate products with multimode distributions, with
very evident density peaks. I can take 10^23 particles and measure them and
I would still not have a normal distribution.

I encountered this 40 years ago when trying to identify the effect of
grinding process variables on grinding ammonium perchlorate when the
criterion is propellant burning rate.

DAHeiser



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Comment on Powerball---Expected Value

2001-09-01 Thread David Heiser

Jordan Ellenberg has an interesting comment on the statistical aspects of
the expected value of a Powerball ticket. It was posted on Microsoft
Internet Explorer, Saturday Sept 1st, under Slate.

If you can't get it, I saved the text version and can send it as an e-mail
attachment upon request. Try Internet Explorer first.



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



RE: Bayesian analyses in education

2001-07-12 Thread David Heiser



-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of KKARMA
Sent: Wednesday, July 11, 2001 2:04 AM
To: [EMAIL PROTECTED]
Subject: Bayesian analyses in education


As a teacher of research methodology in (music) education I am
interested in the relation between traditional statistics and the
bayesian approach. Bayesians claim that their approach is superior
compared with the traditional, for instance because it does not assume
normal distributions, is intuitively understandable, works with small
samples, predicts better in the long run etc.

If this is so, why is it so rare in educational research? Are there some
hidden flaws in the approach or are the researchers just ignorant?
Comments?
-
S.F. Thomas gave a good reply.

I might add that probably most statisticians use methods that are
appropriate to the problem. In some problems, only a Bayesian approach
works. There are other problems in which a Bayesian approach is not
appropriate or is not a part of the problem.

In much of educational research, the focus is on the application of
measurement theory to develop relationships between factors, and ways to
measure concepts. The issue of the probability of a parameter value is not
of major concern, since general concepts of means, normality (multivariate)
and chi-square distributions are considered adequate in presenting results.
One example is the use of structural equation modeling (SEM), where the
focus is on model fit, and the cause and effect relationships that the model
implies.

A more recent interest is in the application of Bayesian concepts to
causality, such as, We will adhere to the Bayesian interpretation of
probability, according to which probabilities encode degrees of belief about
events in the world and data are used to strengthen, update, or weaken those
degrees of belief. In this formalism, degrees of belief are assigned to
propositions in some language, and these degrees of belief are combined
and manipulated according to the rules of probability calculus. (Judea
Pearl, Causality, Cambridge Press, 2000). The SEM modeling is non-Bayesian,
but the nature of the conclusions may be expressed in the form of Bayesian
Networks. I would expect to see more of these concepts to show up in
educational research.

It allows one to express a degree of uncertainty in conclusions in a highly
technical language that very few understand.

DAHeiser




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



FW: Diagnosing and addressing co linearity in Survival Analysis

2001-06-06 Thread David Heiser



-Original Message-
From: David Heiser [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 06, 2001 1:55 PM
To: ELANMEL
Subject: RE: Diagnosing and addressing co linearity in Survival Analysis




-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]]On Behalf Of ELANMEL
Sent: Tuesday, June 05, 2001 11:47 PM
To: [EMAIL PROTECTED]
Subject: Diagnosing and addressing collinearity in Survival Analysis


Any assistance would be appreciated:  I am attempting to run some survival
analyses using Stata STCOX, and am getting messages that certain variables
are
collinear and have been dropped.  Unfortunately, these variables are the
ones I
am testing in my analysis!

I would appreciate any information or recommendations on how best to
diagnose
and explore solutions to this problem.

Thanks!
Elan

---
I  see this as a deficiency in your software product Stata STCOX. You
should be climbing down the neck of the company that you bought the software
from. Their manual should describe how their software arrived at that
declaration, what was the logic that selected those particular variables
that were dropped, and how to work around the problem.

These software companies are strictly for profit companies. We should hold
them responsible for bad products, just as we hold Ford and Firestone
responsible for faulty products.

These companies hire software programmers primarily to develop software that
has a lot of flash, just like a computer game. These are the selling
features. Once the product is sold, they have no interest. It is up to the
user to challenge the company and get the problems solved.

We are seeing more and more of users getting their advanced statistical
training from software manuals. We (the statistical community) should be
putting pressure on these developers to put into their manuals all the text
that would be normally found in a textbook on the subject.

David A. Heiser



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



The False Placebo Effect

2001-05-24 Thread David Heiser


Be careful on your assumptions in your models and studies!
---

Placebo Effect An Illusion, Study Says
By Gina Kolata
New York Times
(Published in the Sacramento Bee, Thursday, May 24, 2001)

In a new report that is being met with a mixture of astonishment and some
disbelief, two Danish researchers say that the placebo effect is a myth.

The investigators analyzed 114 published studies involving about 7,500
patients with 40 different conditions. They found no support for the common
notion that, in general, about one-third of patients will improve if they
are given a dummy pill and told it is real.

Instead, they theorize, patients seem to improve after taking placebos
because most diseases have uneven courses in which their severity waxes and
wanes. In studies in which treatments are compared not just to placebos but
also to no treatment at all, they said, participants given no treatment
improve at about the same rate as participants given placebos.

The paper appears today in the New England Journal of Medicine. Both
authors, Dr. Asbjorn Hrobjartsson and Dr. Peter C. Gotzsche, are with the
University of Copenhagen and the Nordic Cochran Center, an international
organization of medical researchers who review randomized clinical trials.

Reaction to the report covers the spectrum.

Dr. Donald Berry a statistician at the M.D. Anderson Cancer Center in
Houston, said: I believe it. In fact, I have long believed that the placebo
effect is nothing more than a regression effect, referring to a statistical
observation that patients who feel terrible one day will almost invariably
feel better the next day, no matter what is done for them.

But others, like David Freedman, a statistician at the University of
California, Berkeley, said he was not convinced. He said that the
statistical method the researchers used -pooling data from many studies and
using a statistical tool called meta-analysis to analyze them -could give
results that were misleading.

I just don't find this report to be incredibly persuasive, Freedman said.

The researchers said they saw a slight effect of placebos on subjective
outcomes reported by patients, like their descriptions of how much pain they
experienced. But Hrobjartsson said he questioned that effect. It could be a
true effect, but it also could be a reporting bias, he said. The patient
wants to please the investigator and tells the investigator, 'I feel
slightly better. ' 

Placebos still are needed in clinical research, Hrobjartsson said, to
prevent researchers from knowing who is getting a real treatment.

Curiosity prompted Hrobjartsson and Gotzsche to act. Over and over, medical
journals and textbooks asserted that placebo effects were so powerful that,
on average, 35 percent of patients would improve if they were told a dummy
treatment was real.

They began asking where this assessment came from. Every paper,
Hrobjartsson said, seemed to refer back to other papers.

He began peeling back the onion, finally coming to the original paper. It
was written by a Boston doctor, Henry Beecher, who had been chief of
anesthesiology at Massachusetts General Hospital in Boston and published a
paper in the Journal of the American Medical Association in 1955 titled,
The Powerful Placebo. In it, Beecher, who died in 1976, reviewed about a
dozen studies that compared placebos to active treatments and concluded that
placebos had medical effects.

He came up with the magical 35 percent number that has entered placebo
mythology, Hrobjartsson said.

But, Hrobjartsson said, diseases naturally wax and wane.

Of the many articles I looked through, no article distinguished between a
placebo effect and the natural course of a disease, Hrobjartsson said.

He and Gotzsche began looking for well-conducted studies that divided
patients into three groups, giving one a real medical treatment, one a
placebo and one nothing at all. That was the only way, they reasoned, to
decide whether placebos had any medical effect.

They found 114, published between 1946 and 1998. When they analyzed the
data, they could detect no effects of placebos on objective measurements,
like cholesterol levels or blood pressure.

The Washington Post contributed to this report.
-end of article-




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: [Q] Generate a Simple Linear Model

2001-03-29 Thread David Heiser

 well, if you have access to a routine that will generate two
variables with
 specified r ... then, you can do it ... i have one that runs in
minitab ...
 it is a macro ... and i know that jon cryer has one too ...

 http://roberts.ed.psu.edu/users/droberts/macro.htm ... check #1 ...

I am not familiar with the Minitab coding schemes. I have never used
Minitab.

To get some insight, I went thru "1. Generate X,Y Data With Desired n
and r". I see some basic FORTRAN in the I/O and early BASIC (dark ages
BASIC) in the statements. Noted an error. The line "cent c3 c6, c12
13" should be "cent c5 c6, c12 c13".

Otherwise, the lines seem pretty obvious as to what is going on. I
don't see extensive use of functions and subroutines which are the
basis for modern programming. They would really reduce much of your
coding.
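
For readers without Minitab, the underlying idea (generate two variables with a
specified r) is only a few lines in any language; here is a minimal Python
sketch (my addition) of the usual construction y = r*x + sqrt(1 - r^2)*e, with
x and e independent standard normals:

import math
import random

def xy_with_r(n, r, seed=0):
    # corr(x, y) is approximately r in a large sample
    random.seed(seed)
    x = [random.gauss(0.0, 1.0) for _ in range(n)]
    e = [random.gauss(0.0, 1.0) for _ in range(n)]
    y = [r * xi + math.sqrt(1.0 - r * r) * ei for xi, ei in zip(x, e)]
    return x, y

x, y = xy_with_r(200, 0.7)
mx, my = sum(x) / len(x), sum(y) / len(y)
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)
print(sxy / math.sqrt(sxx * syy))    # sample r, close to (but not exactly) 0.7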

Are you doing all calculations in double precision or single
precision?

To go through and find other errors would be a prohibitive task for
me.

I am glad that I don't have to teach in your "summer cottage". The
lecture halls must be dreadful. I remember my psych 1 course in a huge
(1000 seat capacity and it was filled) auditorium.

DAHeiser




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: multivariate normality

2001-03-08 Thread David Heiser


- Original Message -
From: "yogab" [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, March 08, 2001 12:23 PM
Subject: multivariate normality



 Hello all.,

 I am using a multivariate normality test (Mardia 1974) to check for
 normality for my sample set of size 1000 (multivariate data).

Please give me the details on your reference. I have Mardia (1970) in
which he developed a multivariate skewness and kurtosis relationship.
His eq. 2.23 was a brilliant insight on multivariate skewness. His
convergent form, eq. 2.24 is a poor measure of b1,p and requires
calculations from the correlation matrix. I have no idea of what SAS
does inside their black box to arrive at b1,p. Mardia also went on to
a different approach in his later papers, ignoring eq 2.23.
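
For reference, Mardia's (1970) sample skewness b1,p and kurtosis b2,p are
direct to compute once the inverse covariance matrix is in hand; here is a
minimal NumPy sketch (my own, and not necessarily what SAS does inside its
black box):

import numpy as np

def mardia(X):
    # X is an n-by-p data matrix
    n, p = X.shape
    d = X - X.mean(axis=0)
    S = d.T @ d / n                      # maximum-likelihood covariance estimate
    G = d @ np.linalg.inv(S) @ d.T       # Mahalanobis cross-products
    b1p = (G ** 3).sum() / n ** 2        # multivariate skewness, b1,p
    b2p = (np.diag(G) ** 2).sum() / n    # multivariate kurtosis, b2,p
    return b1p, b2p

rng = np.random.default_rng(0)
print(mardia(rng.normal(size=(1000, 3))))   # for normal data, b1,p ~ 0 and b2,p ~ p(p+2) = 15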





=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: probability definition

2001-03-02 Thread David Heiser


- Original Message -
From: "Alex Yu" [EMAIL PROTECTED]
To: "Shareef Siddeek" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, March 01, 2001 5:01 PM
Subject: Re: probability definition


1. Very interesting. Would it be possible to get a copy of your paper
on probability? I gave a review of "The Meanings of 'Probability'"
several years ago for our ASA chapter and would like to redo it.

 For a quick walk-through of various prob. theories, you may consult
 "The Cambridge Dictionary of Philosophy," pp. 649-651.

 Basically, propensity theory is to deal with the problem that frequentist
 prob. cannot be applied to a single case. Propensity theory defines prob.
 as the disposition of a given kind of physical situation to yield an
 outcome of a given type.

 The following is extracted from one of my papers. It briefly talks about
 the history of the classical theory, Reichenbach's frequentism and the
 Fisherian school:

 Fisherian hypothesis testing is based upon relative frequency in the long
 run. Since a version of the frequentist view of probability was developed
 by the positivists Reichenbach (1938) and von Mises (1964), the two schools
 of thought seem to share a common thread.

2. Von Mises (1957) quotes Johannes von Kries and goes on to address the
usage "I shall assume therefore a definite probability of the death of
Caius, Sempronius or Titus in the course of the next year" as support for
his concept of "probability in a collective". He does include the "single
event" as part of his "collective". He then states, "The term 'probability'
will be reserved for the limiting value of the relative frequency in a true
collective which satisfies the condition of randomness." With respect to
Caius, Sempronius and Titus, he was considering the collective of aged
rulers of Rome.

 However, it is not necessarily true. Both the Fisherian and the positivist
 frequency theories were proposed in opposition to the classical Laplacean
 theory of probability.

3. My reading of Fisher was that he opposed the Laplacian view because
it had no mathematical basis; Bayes, however, did have one, and was fully
accepted.

 In the Laplacean perspective, probability is deductive, theoretical, and
 subjective. To be specific, this probability is subjectively deduced from
 theoretical principles and assumptions in the absence of objective
 verification with empirical data. Assuming that every member of a set has
 an equal probability of occurring (the principle of indifference),
 probability is treated as a ratio between the desired event and all
 possible events. This probability, derived from the fairness assumption,
 is made before any events occur.

 Positivists such as Reichenbach and von Mises maintained that a very
 large number of empirical outcomes should be observed to form a reference
 class. Probability is the ratio between the frequency of the desired
 outcome and the reference class. Indeed, the empirical probability hardly
 concurs with the theoretical probability. For example, when a die is
 thrown, in theory the probability of the occurrence of the number "one"
 should be 1/6. But even in a million simulations, the actual relative
 frequency of "one" is not exactly one out of six. It appears that the
 positivist frequency theory is more valid than the classical one. However,
 the usefulness of this actual, finite, relative frequency theory is
 limited, for it is difficult to tell how large the reference class must be
 to be considered large enough.

4. The idea of a "limiting condition" is based on the same understanding
as in differential calculus and infinite series. That is, a limit is not
reached at any particular value of N; rather, some value converges to a
limit as N increases. This does not depend on any arbitrarily large value
of N.
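
A trivial simulation (my own illustration in Python, with NumPy) makes the
point: the relative frequency of "one" drifts toward 1/6 as N grows, but at
no finite N is it guaranteed to equal 1/6 exactly.

import numpy as np

rng = np.random.default_rng(42)
rolls = rng.integers(1, 7, size=1_000_000)            # fair six-sided die
freq = np.cumsum(rolls == 1) / np.arange(1, rolls.size + 1)

# The running relative frequency wanders around 1/6 and settles down only
# in the limit; at any finite N it is almost never exactly 1/6.
for N in (100, 10_000, 1_000_000):
    print(N, freq[N - 1], abs(freq[N - 1] - 1 / 6))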

5. Von Mises also based his position on the laws of probability, which
can be defined as a "natural" outcome of the frequentist view, where
limiting occurs as the converging value of a ratio. He differentiated
between "actual occurrences" and a "thought experiment". It was the latter
that he was using.

 Fisher (1930) criticized Laplace's theory as subjective and incompatible
 with the inductive nature of science. However, unlike the positivists'
 empirically based theory, Fisher's is a hypothetical infinite relative
 frequency theory. In the Fisherian school, various theoretical sampling
 distributions are constructed as references for comparing the observed
 data.

6. My reading of the historical record is that it was the K. Pearson
school that did this. Fisher stuck to the uniform, normal, and Poisson
distributions.

 Since Fisher did not mention Reichenbach or von Mises, it is
 reasonable to believe that Fisher developed his frequency theory
 independently.

7. Fisher was aware of many who challenged his views but chose not to
respond, except to K. and E. Pearson. Anyone who challenged his views, as
von Mises did on likelihood (back in 1930), was either ignored or
challenged by an exchange of 

Re: stat question

2000-11-23 Thread David Heiser


- Original Message -
From: Herman Rubin [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, November 23, 2000 4:55 PM
Subject: Re: stat question


 Herman Rubin wrote:

 anyone wanting to learn good statistics should not even
 consider taking an "undergraduate" statistics course

 Nonsense. (Reply by this anonymous lion soaking in oil)

 Not only is that not nonsense, but it is quite difficult
 to get students who have learned techniques to consider
 what, if any, basis was behind those techniques.
 Meaningful statistics is based on the concept of
 probability, not the computation of probabilities, and
 consideration of the totality of consequences.

Wow, Herman, this is deep stuff. There is a huge literature on the attempt
to understand what probability is. Even Fisher had problems trying to
understand it outside of the frequentist viewpoint. There is a lot of stat
work involving maximum likelihood estimation, where there is no probability
support unless you take a Bayesian approach (which is infrequent).

Just look at the extent of the literature on the 2X2 table, and the
difficulty there is in understanding the concepts behind an analysis for
effects.
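
As a small illustration (my own made-up example in Python, using SciPy, not
anything from that literature), even one modest 2x2 table gives noticeably
different p values depending on which standard analysis you pick:

from scipy import stats

table = [[12, 5],
         [6, 12]]

chi2, p_pearson, dof, expected = stats.chi2_contingency(table, correction=False)
_, p_yates, _, _ = stats.chi2_contingency(table, correction=True)
_, p_fisher = stats.fisher_exact(table)

# Three defensible analyses of the same table, three different p values.
print(p_pearson, p_yates, p_fisher)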

I have been reading the absolutely wonderful discussion and raging arguments
on SEMNET between Mulaik, Pearl, Hayduk, Shipley and others on the meaning
of b in the simple linear equation Y=bX+e1, where X is one variable and e1
is a combination of the effects of all other variables and random effects.
When e1 is large with respect to Y, it becomes very difficult to define a
simple meaning of b in terms of quantitative causality. This is deep stuff
that even the professors have difficulty understanding.
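
Here is a rough sketch (my own simulation in Python with NumPy, not anything
posted on SEMNET) of what a large e1 does to b: as the scale of e1 grows
relative to Y, the fitted slope from repeated samples of the same size
scatters more and more widely around the true value, so any causal reading
of a single estimate gets shakier.

import numpy as np

rng = np.random.default_rng(7)
b_true, n, reps = 0.5, 50, 2000

for sigma_e in (0.5, 5.0, 50.0):          # e1 small ... e1 dominating Y
    slopes = []
    for _ in range(reps):
        x = rng.standard_normal(n)
        y = b_true * x + sigma_e * rng.standard_normal(n)
        slopes.append(np.polyfit(x, y, 1)[0])   # ordinary least-squares slope
    print(sigma_e, np.mean(slopes), np.std(slopes))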

Considering that most of the important work involving statistics is in
psychology, marketing, medicine, economics, physics, social studies, and
every other hard or soft science out there, we cannot assume that all these
PhD practitioners understand probability or really understand the nuances of
the models, equations, and conclusions they arrive at. (I did it. One
sentence per paragraph. Does this put me on Fisher's level?)

It would be nice if all these practitioners had graduate courses in stat,
but more than likely it is an undergraduate-level course taught at the
graduate level to a student in, say, psychology or medicine. (E.g., Abelson,
in the first sentence of the introduction to his book "Statistics as
Principled Argument", says, "This book arises from 35 years of teaching a
first-year graduate statistics course in the Yale Psychology Department."
This is typical of most graduate schools, where the first-year stat course
is all that the student gets.) People in these fields will be exposed to
huge databases with large numbers of variables and will find it impossible
to assess all the implications of any model, hypothesis, or set of
conclusions they make.

It is very clear that an education in statistics never stops. The
undergraduate level exposes you to the concepts, and the understanding comes
with continued education and experience.

There has previously been a long discussion on EDSTAT about the 0.05
probability value and its use. There was no common agreement, which is
typical of most of the basic, fundamental things we use in statistics. Since
we as statisticians can't agree on what is significant (in terms of
probability), how can we expect practitioners to fully understand what
probability is?

DAHeiser



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Help needed ... :-(

2000-11-14 Thread David Heiser

Based on the problems we have in answering vague questions on edstat, I can
say that any requestor must be able to state the question so that we here
(using American English) can understand what he is saying and give a helpful
answer.

It is obvious that all of us have problems understanding the questions in
English. The complex field of statistics involves so many variations and
such a large body of knowledge that giving a helpful answer is not easy. I
just don't reply in areas that I am weak in.

The issue is not about an individual being from Germany, or about
international relations, or about responding to people from different
cultures and countries, etc.; the issue is that the requestor should be able
to state the question so we can understand it. If he can state the question
in German, do it. Requestors post questions in Spanish, Italian, Swedish,
and other languages that I can't recognize, and get answers. Let us
encourage those on edstat from foreign countries to answer the questions in
their own languages and to use their own references. If edstat is to be
truly international, we need a lot more questions and responses in other
languages.

DAHeiser



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: Two t tests

2000-11-03 Thread David Heiser


- Original Message -
From: Robert J. MacG. Dawson [EMAIL PROTECTED]
To: Richard Lehman [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, November 03, 2000 1:16 PM
Subject: Re: Two t tests




 Richard Lehman wrote:
 
  A colleague sent me this note.
 
  A statistics question.
  
  Temperatures taken from different portions of a stream:
  
  Portion 1
  16.9
  17
  15.8
  17.1
  18.7
  18
  
  mean = 17.25
  variance = 0.995
  
  Portion 2
  18.3
  18.5
  
  mean = 18.4
  variance = 0.02
  
  Do these portions have different temperatures?
  
  Obviously the variances are unequal and a 2-sample [unequal variance]

From knowing the characteristics of flowing streams, one should expect
considerable variance. The concept of experimental design would enable one
to plan where to measure in order to obtain a reasonable mean value. The
bottom, if downstream of a pool, will be colder, and a cross-section of the
stream will have high variance. Side-stream entries will result in one side
being a different temperature than the other. A turbulent area will have
less variance. You have to determine where the measurements were made in
terms of the flow characteristics of the locality. Therefore, from a
practical standpoint, you can't conclude that the two groups have different
means.
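
For completeness, the formal unequal-variance (Welch) test on the quoted
numbers is a one-liner in Python (a sketch assuming a recent SciPy). It
gives a p value, but no p value can repair a sampling plan that ignores the
flow characteristics above, and with only two observations in Portion 2 the
test says very little about that portion anyway.

from scipy import stats

portion1 = [16.9, 17.0, 15.8, 17.1, 18.7, 18.0]
portion2 = [18.3, 18.5]

# Welch's t test (does not assume equal variances)
t, p = stats.ttest_ind(portion1, portion2, equal_var=False)
print(t, p)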

DAHeiser.



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: November Issue of SecondMoment

2000-11-02 Thread David Heiser


- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, November 01, 2000 10:32 AM
Subject: November Issue of SecondMoment

Now I know why data mining is considered so much hogwash. There is no
attempt to deal with uncertainty or to quantify it. Everything is taken as
gospel from a prime authority. Believe everything that is found in the data.

DAHeiser



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: questions on hypothesis

2000-10-17 Thread David Heiser


- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, October 16, 2000 4:24 PM
Subject: Re: questions on hypothesis


 In article [EMAIL PROTECTED],
  Chris: That's not what Jerry means. What he's saying is that if
  your sample size is large enough, a difference may be statistically
  significant (a term which has a very precise meaning, especially to
  the Apostles of the Holy 5%) but not large enough to be practically
  important. [A hypothetical very large sample might show, let us say,
  that a very expensive diet supplement reduced one's chances of a heart
  attack by 1/10 of 1%.]

 Firstly, I think we can thank publication pressures for the church of the
 Holy 5%. I go with Keppel's approach in suspending judgement for mid-range
 significance levels (although we should do this for nonsignificant results
 anyway, as they are inherently indeterminate).
-
The 5% is a historical artifact, the result of statistics being developed
before electronic computers were invented.

The work in the early 1900s was severely restricted by the fact that
computing the cumulative probability distribution involved tedious
paper-and-pencil calculations and, later, the use of mechanical
calculators. Available tables gave values only for 5% and, in some cases,
1%. R.A. Fisher in his publications consistently referred to values well
below 1% as being "convincing". To illustrate the fundamental test methods,
he had to rely on available tables and chose 5% in most of his examples.
However, he did not consider 5% as being "scientifically convincing".
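
Today the tables are beside the point: the exact tail probability for any
observed statistic is one line of code (a sketch assuming SciPy; the
observed value and degrees of freedom below are made up for illustration).

from scipy import stats

t_observed, df = 2.31, 14
p_two_sided = 2 * stats.t.sf(abs(t_observed), df)   # exact tail area, no table
print(p_two_sided)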

DAH



=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=



Re: questions on hypothesis

2000-10-14 Thread David Heiser


- Original Message -
From: Ting Ting [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Friday, October 13, 2000 10:57 PM
Subject: Re: questions on hypothesis


 
  A good example of a simple situation for which exact P values are
  unavailable is the Behrens-Fisher problem (testing the equality of
  normal means from normal populations with unequal variances).  Some
  might say we have approximate solutions that are good enough.
.
I see this as an imprecise statement of a hypothesis.

From set theory, I can see several different logical constructs, each of
which would arrive at a different probability distribution and,
consequently, different p values. It boils down to just what the hypothesis
about the generator of the data is: is it a statement of logical equality,
or a statement about the value of a difference function?

Does sample "A" come from process "a" and sample "B" come from process "b",
or do both samples come from process "c"?

The problem is simplified when process "a" and process "b" are known. When
processes "a" and "b" are not known, we have that Fisher problem of defining
the set of all "a" parameter values equal to a given p1 value and the set of
all "b" parameter values equal to a given p2 value. When the processes are
one-parameter processes, everything is straightforward. (Fisher in his book
set very nicely used one-parameter distributions to illustrate his ideas.)
However, for a two-parameter process, the Behrens-Fisher problem states an
equality (intersection) of mean values and a disjoint set of variance
values, which cannot be analytically combined (given the normal distribution
function) into a single p value.

Consequently, one finds in the textbooks all the different approaches to
establishing a "c" process, for which tests can be constructed to determine
whether or not "A" and "B" come from process "c". The hypothesis being
tested is then based on process "c", not on the original idea.
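
A small Monte Carlo sketch (my own, in Python with NumPy and a recent SciPy)
of why the choice of "c" process matters: under a true null with unequal
variances and unequal sample sizes, the pooled-variance t test drifts away
from its nominal 5% level, while the approximate Welch solution stays close
to it.

import numpy as np
from scipy import stats

rng = np.random.default_rng(123)
n1, n2, sd1, sd2, reps = 25, 5, 1.0, 3.0, 20_000
reject_pooled = reject_welch = 0

for _ in range(reps):
    a = rng.normal(0.0, sd1, n1)
    b = rng.normal(0.0, sd2, n2)          # same mean, different variance
    if stats.ttest_ind(a, b, equal_var=True).pvalue < 0.05:
        reject_pooled += 1
    if stats.ttest_ind(a, b, equal_var=False).pvalue < 0.05:
        reject_welch += 1

# Proportion of false rejections at the nominal 5% level
print(reject_pooled / reps, reject_welch / reps)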

DAH




=
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
  http://jse.stat.ncsu.edu/
=