In article <[EMAIL PROTECTED]>,
Richard Ulrich  <[EMAIL PROTECTED]> wrote:
>sci.stat.edu  people: There have been other Replies to the original
>post,  in sci.stat.math.

>On 21 Apr 2004 09:10:14 -0500, [EMAIL PROTECTED] (Herman
>Rubin) wrote:

>> In article <[EMAIL PROTECTED]>, S Fan <[EMAIL PROTECTED]> wrote:
>> >If you plot the residuals and they seem to get bigger (or
>> >smaller), then you may need a transformation.
>> >When doing regression, one assumption is that the errors have a
>> >constant (though unknown) sigma.
>> >Hope it helps.
>> >S Fan 

>> This is NOT the most important assumption; one can modify
>> the regression approach to take it into account.  The MOST
>> important assumption is that the relationship is a linear
>> relationship, with the "errors" independent of (or at least
>> uncorrelated with) the predictors.  Non-trivial transformations
>> are extremely unlikely to preserve this property.
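
One common way to "modify the regression approach" for a non-constant sigma is weighted least squares. The post does not name a method, so treating it as weighted least squares is an assumption, and the simulated data, the 1/x**2 weights, and the function name below are all illustrative. A minimal sketch:

```python
import random

def weighted_least_squares(x, y, w):
    """Fit y = a + b*x by minimising sum(w_i * (y_i - a - b*x_i)**2)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope  # (intercept, slope)

# Simulated data (an assumption for illustration): the error standard
# deviation grows in proportion to x, so a residual plot would fan out.
rng = random.Random(0)
x = [float(i) for i in range(1, 51)]
y = [1.0 + 2.0 * xi + rng.gauss(0.0, 0.2 * xi) for xi in x]

# Equal weights reproduce ordinary least squares.
ols = weighted_least_squares(x, y, [1.0] * len(x))
# Weights proportional to 1/variance; here the variance is proportional to x**2.
wls = weighted_least_squares(x, y, [1.0 / xi ** 2 for xi in x])

print("equal weights :", ols)
print("1/x**2 weights:", wls)
```

The model stays linear in the original units; only the weighting changes, so no transformation of the response is needed.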

>Herman is accustomed to data that *have* these properties of
>linearity and independent errors at the start.  He is also facile
>with non-linear analyses where he knows how to  accommodate 
>the error structure directly -- something not always easy to do, and
>sometimes easier to do than to *explain*  to an audience which
>is not sophisticated with numbers.

I am accustomed to dealing with situations in which the
client with the data can provide adequate information to
model the data.  If the model, possibly with transformations
from the user's assumptions, is linear, linear methods can,
and should, be used.  There are also different versions of
linear methods.  The client needs to concentrate on the
assumptions, not on the methodology.

It may be a major undertaking to extract the client's
assumptions, but it is possible to tell the client which
assumptions are the more important.  That of normality is
the least important in linear models.

>My experience is different from his.  In clinical research, bioassays 
>(for one instance) have unit 'concentrations'  but the proper unit 
>of measurement, with those properties he mentions, is apt to be the
>log().  The proper unit of the growth curve is apt to be the logit.
>Bioassay is an area with a long and healthy tradition of 
>transformations; check any textbook.
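
For readers who have not met these two transformations, here is a minimal sketch of the log and the logit as they are usually defined. The function names and the example concentrations are hypothetical and do not come from any particular bioassay.

```python
import math

def log_dose(concentration):
    """Base-10 log of a positive concentration."""
    if concentration <= 0:
        raise ValueError("concentration must be positive")
    return math.log10(concentration)

def logit(p):
    """Log-odds of a proportion strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def inv_logit(z):
    """Inverse of logit: maps any real number back into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# A tenfold dilution series becomes equally spaced on the log scale.
concentrations = [1.0, 10.0, 100.0, 1000.0]
print([log_dose(c) for c in concentrations])

# The logit is symmetric about p = 0.5 and unbounded at both ends.
print([logit(p) for p in (0.1, 0.5, 0.9)])
```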

I have not said anything against transformations based on
the PROBLEM; it is those based on the DATA, where the 
transformation is used to make the data "look right" in
some respect or another, which should not be done, unless
a good theoretician provides the information that it will
not greatly affect the conclusions.

>Tukey provided a rule of thumb for data with natural zero:  IF the
>largest value is 10 or 20 times the smallest, then you probably
>want to transform.  Tukey also provided other guidelines, 
>talking about 'folded' transformations such as the logit, and 
>about the family of power transformations.
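
The rule of thumb above can be written down in a few lines. This is only a sketch: the default threshold of 10 is the lower end of the "10 or 20" quoted above, the function names are made up for illustration, and treating the power p = 0 as the log is the usual convention for the power family rather than something stated in this thread.

```python
import math

def spread_ratio(values):
    """Largest value divided by smallest, for positive data with a natural zero."""
    lo, hi = min(values), max(values)
    if lo <= 0:
        raise ValueError("the rule of thumb assumes strictly positive data")
    return hi / lo

def suggests_transformation(values, threshold=10.0):
    """True when the largest value is at least `threshold` times the smallest."""
    return spread_ratio(values) >= threshold

def power_transform(x, p):
    """Member p of the power family; p == 0 is taken to mean the log."""
    if x <= 0:
        raise ValueError("x must be positive")
    return math.log(x) if p == 0 else x ** p

print(suggests_transformation([2.0, 5.0, 40.0]))   # ratio is 20
print(suggests_transformation([3.0, 4.0, 5.0]))    # ratio is under 2
print([power_transform(100.0, p) for p in (1, 0.5, 0, -1)])
```

A check like this looks only at the data, so it can prompt the question of whether to transform but cannot answer it.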

Unfortunately, Tukey left his excellent understanding of 
abstract mathematics behind, deliberately, when he went
into statistics.  Some of his insights are excellent, while
others are just plain wrong.  The user of statistics does
not need to understand either the methods or the fine points
of theory, but needs to understand the basic probability 
concepts, and to transform the real-world problem into
"statistical" space.  

>Some people are fond of the rank-transformation:  That is the
>useful way, in my opinion, of referring to a large fraction of the
>'non-parametric' alternatives, which I avoid when I can.  

There are a few tests which can be based on rank.  This is
not a transformation, but a complete ignoring of the scale.
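
A small sketch of what "ignoring the scale" means in practice: the ranks are the same whether they are computed from the raw values, their logs, or their cubes, so a rank-based test cannot distinguish between these scales. The data and the midrank handling of ties are illustrative choices.

```python
import math

def ranks(values):
    """Ranks starting at 1 for the smallest value; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

data = [1.2, 3.5, 3.5, 10.0, 250.0]
raw_ranks = ranks(data)
log_ranks = ranks([math.log(v) for v in data])
cube_ranks = ranks([v ** 3 for v in data])

print(raw_ranks)
print(raw_ranks == log_ranks == cube_ranks)
```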

>Finally, some people like arbitrary transformations, including
>adding arbitrary constants before taking the log or power:

How often does one consider what one is doing to the problem 
of interpreting results?

>What I am thinking of are the ones with the single virtue
>of giving residuals that are apparently normal, for the data
>on hand -- That is done in order to improve (or justify) using
>the F-test.

This is not done that often, and is generally quite difficult.
It requires changing the form of the model.  The more typical
transformations attempt to get normal marginals, and there is
rarely justification for this.  It has done harm; many of the
newer IQ tests never return "profoundly gifted", as this is 
beyond the range which the "normal" transformation of the scores
from the too-small sample yields.
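
The ceiling described above can be illustrated with a rough calculation. The sketch assumes scores are normalized by mapping the i-th ranked score of n to the normal quantile at (i - 0.5)/n and then rescaling to mean 100 and standard deviation 15. These conventions are assumptions for illustration, not the procedure of any particular test.

```python
from statistics import NormalDist

def max_normalized_score(n, mean=100.0, sd=15.0):
    """Highest score obtainable when n ranked scores are mapped to normal
    quantiles at the plotting positions (i - 0.5) / n (an assumed convention)."""
    top_position = (n - 0.5) / n
    z = NormalDist().inv_cdf(top_position)
    return mean + sd * z

for n in (200, 1000, 5000):
    print(n, round(max_normalized_score(n), 1))
```

Under these assumptions the top score in the norming sample receives a value only three to four standard deviations above the mean, so anything beyond that has to come from extrapolation rather than from the transformation itself.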

>         The proper p-level is not achieved if you do not 
>meet the assumption about residuals, so this DOES THAT.
>I can admit that I did that a time or two, a long time ago,
>and I might someday do it again.  

In more than a half-century, there have been many justifications
of NOT blindly using a p-value, and none for using it.

>However, the F-test will be more simply wrong, if, say,
>the linearity is fouled up by the transformation, making the 
>coefficients wrong and mis-measuring the error.

The only transformations which do not foul up linearity are
linear transformations.
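
A noise-free sketch of this point: start with a response that is exactly linear in x, transform it, and see how well a straight line still fits. The particular line, the linear transformation, and the choice of the log as the non-linear one are arbitrary illustrations.

```python
import math

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def max_abs_residual(x, y):
    """Largest absolute deviation of y from its least squares line in x."""
    a, b = fit_line(x, y)
    return max(abs(yi - (a + b * xi)) for xi, yi in zip(x, y))

x = [float(i) for i in range(1, 11)]
y = [2.0 + 3.0 * xi for xi in x]          # exactly linear in x

linear = [5.0 - 0.5 * yi for yi in y]     # a linear transformation of y
logged = [math.log(yi) for yi in y]       # a non-linear transformation of y

print("original:", max_abs_residual(x, y))
print("linear  :", max_abs_residual(x, linear))
print("log     :", max_abs_residual(x, logged))
```

The first two residuals are zero up to rounding, while the log of the response is visibly curved in x, so a straight-line model no longer describes it.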

>                 I don't know
>if I avoid 'arbitrary transformations'  because of that, or because
>they are inelegant and hard to justify to anyone else.




>> >On 19 Apr 04 03:24:58 -0400 (EDT), opaow wrote:
>> >>Hi. I am just quite confused about data transformations (especially in
>> >>doing ANOVA and Regression)... When and why do we transform data?...
>> >>Any help??? I'm not quite good at it.....thanks in advance..


>-- 
>Rich Ulrich, [EMAIL PROTECTED]
>http://www.pitt.edu/~wpilib/index.html


-- 
This address is for information only.  I do not claim that these views
are those of the Statistics Department or of Purdue University.
Herman Rubin, Department of Statistics, Purdue University
[EMAIL PROTECTED]         Phone: (765)494-6054   FAX: (765)494-0558