On Mon, Nov 21, 2011 at 08:08:11PM +0000, John Darrington wrote:
> The good news is that I found and fixed a bug which was causing the
> Effects Coding to produce garbage results. The surprising news
> (surprising to me anyway) is that having fixed it, Effects Coding
> produces identical results to Dummy Coding.
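(For what it's worth, identical results here are expected: dummy coding and effects coding of the same factor span the same column space, so the fitted values, and any sums of squares computed from them, must coincide. A quick pure-Python sketch with made-up data; the helper functions and data are mine, not PSPP code:)

```python
def solve(a, b):
    # Solve the square system a @ x = b by Gauss-Jordan elimination
    # with partial pivoting.
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * v for x, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fitted(rows, y):
    # Ordinary least squares fitted values for the design matrix 'rows'.
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yv for r, yv in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return [sum(r[j] * beta[j] for j in range(p)) for r in rows]

# Made-up, deliberately unbalanced data: one factor with three levels.
level = [0, 0, 1, 1, 1, 2, 2]
y = [3.0, 4.0, 6.0, 5.0, 7.0, 9.0, 8.0]

# Dummy coding: intercept plus indicators, level 2 as the reference.
dummy = [[1.0, float(l == 0), float(l == 1)] for l in level]
# Effects coding: level 2 is coded -1 on both contrast columns.
effects = [[1.0,
            1.0 if l == 0 else (-1.0 if l == 2 else 0.0),
            1.0 if l == 1 else (-1.0 if l == 2 else 0.0)] for l in level]

fd = fitted(dummy, y)
fe = fitted(effects, y)
same = all(abs(a - b) < 1e-9 for a, b in zip(fd, fe))
print(same)  # identical fitted values under both codings
```

(Both designs are saturated for the one factor, so the fitted values are just the cell means; that is why every sums-of-squares quantity built from the fitted values comes out the same.)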
That's good news. At first, I was surprised when the sums of squares did
not agree, because, typically, the sums of squares should not change over
different coding schemes.

> 2. Another example, this time for SAS, at
>    http://www.sfu.ca/sasdoc/sashtml/stat/chap30/sect52.htm
>
>    I copied the data given there, and ran it through PSPP and got:
>
> #===============#=======================#==#============#==========#=======#
> # Source        #Type III Sum of Squares|df| Mean Square|    F     |  Sig. #
> #===============#=======================#==#============#==========#=======#
> #Corrected Model#            4259,338506|11|  387,212591|  3,505692|,001298#
> #Intercept      #           20672,844828| 1|20672,844828|187,164963|,000000#
> #drug           #            3063,432863| 3| 1021,144288|  9,245096|,000067#
> #disease        #             418,833741| 2|  209,416870|  1,895990|,161720#
> #drug * disease #             707,266259| 6|  117,877710|  1,067225|,395846#
> #Error          #            5080,816667|46|  110,452536|          |       #
> #Total          #           30013,000000|58|            |          |       #
> #Corrected Total#            9340,155172|57|            |          |       #
>
>    Now these numbers are exactly what the SAS example gives for the
>    Type II sums of squares (although PSPP is labelling them as Type III).
>
> 3. A concise but quite useful description of the various ssq "types" can
>    be found at http://afni.nimh.nih.gov/sscc/gangc/SS.html
>    It says this about Type III:
>
>    "SS gives the sum of squares that would be obtained for each variable
>    if it were entered last into the model. That is, the effect of each
>    variable is evaluated after all other factors have been accounted
>    for. Therefore the result for each term is equivalent to what is
>    obtained with Type I analysis when the term enters the model as the
>    last one in the ordering."

This is what I coded in glm.c to begin with, but it wasn't giving the
same result as SPSS. I found that SPSS drops an interaction term if it
first drops a main effect contained in it. So I went back to mimic that
behavior, and now it doesn't seem to match SAS.
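To make the interaction question concrete, here is a pure-Python sketch on made-up, unbalanced two-factor data. It computes "the" sums of squares for a factor A two ways: dropping A alone from the full model, and dropping A together with the A*B interaction. Both readings are my guesses at the behaviors under discussion, not a claim about what SPSS or SAS actually compute; the point is only that on unbalanced data the two numbers differ:

```python
def solve(a, b):
    # Solve the square system a @ x = b by Gauss-Jordan elimination
    # with partial pivoting.
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * v for x, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fitted(rows, y):
    # Ordinary least squares fitted values for the design matrix 'rows'.
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yv for r, yv in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return [sum(r[j] * beta[j] for j in range(p)) for r in rows]

# Made-up, unbalanced 2x2 layout; A and B are effect-coded (+1/-1).
A = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
B = [-1.0, -1.0, 1.0, 1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
y = [2.0, 3.0, 5.0, 4.0, 6.0, 9.0, 8.0, 10.0, 7.0]
AB = [a * b for a, b in zip(A, B)]

def design(cols):
    # Intercept plus the given predictor columns.
    return [[1.0] + [c[i] for c in cols] for i in range(len(y))]

full = fitted(design([A, B, AB]), y)
no_a = fitted(design([B, AB]), y)     # drop A, keep the A*B term
no_a_ab = fitted(design([B]), y)      # drop A and A*B together

def ssdiff(f_big, f_small):
    # For nested OLS models this equals the drop in the error sums of
    # squares when going from the smaller model to the bigger one.
    return sum((u - v) ** 2 for u, v in zip(f_big, f_small))

ss_keep_int = ssdiff(full, no_a)
ss_drop_int = ssdiff(full, no_a_ab)
print(ss_keep_int, ss_drop_int)  # two different answers for 'SS for A'
```

The second number absorbs the A*B contribution as well, which is one way the same label can end up meaning different things in different programs.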
Now I wonder if SAS and SPSS agree on the meaning of Type 3 sums of
squares. It would be nice to have some tests using both programs.

> 4. However, none of the SPSS examples I have found which feature
>    unbalanced designs actually correspond to what PSPP currently
>    produces for type III ssq. The interactions are the same, but the
>    main effects are quite different.
>
>    The foregoing leads me to infer that SPSS has the meaning of Type II
>    and Type III transposed, in comparison to the rest of the world.
>
>    This sounds somewhat incredible, but seems to be consistent with the
>    evidence so far.
>
>    I can only suggest that we try to implement Type II next, and see
>    what happens.

I agree, and in the meantime, if anybody out there has access to both SAS
and SPSS, please send us a few test results showing the Type 1, Type 2,
and Type 3 sums of squares from both programs. And the data.

Maybe now is the time for me to mention where this comes from, even
though I'm not sure if this will resolve the meaning of 'Type 1', 'Type
2' and 'Type 3'. Pardon me if this explanation is self-evident by now:

We are talking about sums of squares due to regression, that is:

  sum ((predicted Y's - other version of predicted Y's)^2)

if Y is the dependent variable. In this case, the predicted Y may be the
'final' predicted Y, or a predicted value based only on some of the
predictors, or the sample mean of Y.

The idea is to compute the reduction in the sums of squared errors:

  sum ((predicted Y - observed Y)^2)

...by adding more predictors. The usual way to measure this reduction is
to look at the sums of squares due to regression, as mentioned above.
That is, we look for the *drop* in the sums of squared errors as a
corresponding *rise* in the sums of squares due to regression. The two
ways to do this are via 'sequential' sums of squares and 'partial' sums
of squares.

Suppose X1 and X2 are predictors.
Let SSR (X1) be the sums of squares due to regression of Y on X1, that
is,

  SSR (X1) = sum ((predicted Y - mean of Y)^2)

...where 'predicted Y' is the predicted value of Y using X1 as the sole
predictor.

Now we add another predictor, X2. Its sequential sums of squares are

  SSR (X2 | X1) = sum ((predicted Y given X1
                        - predicted Y given X1 *and* X2)^2)

This measures the improvement in our prediction from adding X2 when X1
is already present in the model. The *partial* sums of squares for X2
are just

  SSR (X2) = sum ((predicted Y from X2 - mean of Y)^2)

...with no X1 in the model.

The first issue here is: What should we call the partial and the
sequential sums of squares? One is usually called 'type 1' and the
other, 'type 2'. But I think those names are mostly used by software,
and not by practitioners.

I'm scanning the beginning of chapter 8 of Neter, Wasserman and Kutner's
book, and they seem to be satisfied to refer only to 'extra' sums of
squares, then using something like the notation above to be more
specific. Chapter 4 of Meyers' and Milton's 'A First Course in the
Theory of Linear Statistical Models' refers to sums of squares of X2 'in
the presence of' X1. It then refers to partial and sequential tests. To
be more specific, they use vector and matrix notation. Mendenhall's and
Sincich's 'A Second Course in Statistics: Regression Analysis' mentions
'reduced', 'nested' and 'full' models, but doesn't seem to dwell on the
differences among the types of sums of squares, though they do mention
these a bit in chapter 4.

I myself remember in graduate school that several of us, including the
instructor, occasionally had to pause to figure out which sums of
squares were called 'type 2' by the software. On the other hand, I
remember that some coworkers at an old job always used the terms
'type *' and couldn't say much about what the sums of squares meant.

One other question: What should we do with an interaction if one of its
main effects is dropped? Drop the interaction?
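Before going on, here is a pure-Python sketch of the definitions above, with made-up data. It computes the sequential sums of squares SSR (X2 | X1) both directly from the two fitted vectors and as the difference SSR (X1, X2) - SSR (X1) (the two forms agree for nested least-squares models), and compares it with SSR (X2) computed with no X1 in the model (what I called the partial sums of squares above); when X1 and X2 are correlated, the two quantities differ:

```python
def solve(a, b):
    # Solve the square system a @ x = b by Gauss-Jordan elimination
    # with partial pivoting.
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * v for x, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fitted(rows, y):
    # Ordinary least squares fitted values for the design matrix 'rows'.
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yv for r, yv in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return [sum(r[j] * beta[j] for j in range(p)) for r in rows]

# Made-up data; x1 and x2 are deliberately correlated.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.0, 1.0, 2.0, 5.0, 4.0, 6.0]
y = [2.0, 4.0, 3.0, 7.0, 6.0, 9.0]
n = len(y)
ybar = sum(y) / n

def ssr(cols):
    # SSR = sum ((predicted Y - mean of Y)^2), intercept always included.
    f = fitted([[1.0] + [c[i] for c in cols] for i in range(n)], y)
    return sum((v - ybar) ** 2 for v in f)

f1 = fitted([[1.0, a] for a in x1], y)
f12 = fitted([[1.0, a, b] for a, b in zip(x1, x2)], y)

# Sequential SS for X2, directly from the two fitted vectors:
seq_direct = sum((u - v) ** 2 for u, v in zip(f12, f1))
# ...and as a difference of SSRs; the two forms agree:
seq_asdiff = ssr([x1, x2]) - ssr([x1])
# Stand-alone SSR (X2), with no X1 in the model:
alone = ssr([x2])
print(seq_direct, seq_asdiff, alone)
```

The gap between the sequential and the stand-alone numbers is exactly the order-dependence that makes the 'type' labels matter once a design is unbalanced or the predictors are correlated.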
There are cases where it makes sense to retain an interaction without
the main effect. But there are cases to the contrary. What this means is
that there *may* not be a single definition of 'type 3' sums of squares
that we can code that will always work. Authors and professors seem to
prefer other, more descriptive terms than just 'types 1, 2 and 3'. I'm
pretty sure there is no standard definition of 'type 4'. I guess we
should figure out how to mimic SPSS in the case of unbalanced designs,
though.

When I coded type 3 sums of squares, I assumed it meant 'partial' for
each variable. That is, 'drop Xi, fit the model, add Xi, and find the
sums of squares':

  SSR_type3 (Xi | all but Xi) = sum ((predicted Y from all
                                      - predicted Y from all but Xi)^2)

This matched SPSS, unless Xi was in some interaction. So I dropped the
interaction terms involving Xi, too. Then this was correct, until the
design was unbalanced. Now I'm not sure what it should be.

-Jason

_______________________________________________
pspp-dev mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/pspp-dev
