Hi all,

I wanted to give a brief summary of my thoughts so far on how to do GLM/ANOVA in PSPP.

The linear regression code (math/linreg/*) already handles multiple regression, i.e. it solves y = Xb + e for the vector b of regression coefficients, given a vector of observations y, a matrix of independent variables X, and an error term e. In particular, it handles the case where X is not of full rank, which is one of the distinguishing characteristics of GLM.

The GLM problem is just YM = XB + E (all matrices), where the new term M is a linear transform of the dependent variables. The existing linear regression code could handle this if its vectors were changed to matrices (we could select between the vector and matrix versions where applicable if there's a big efficiency gain). The solution is just the multivariate multiple regression solution postmultiplied by the given M; that postmultiplication could happen elsewhere, and probably should, to allow efficient testing of different contrasts. This part is pretty trivial, especially since almost all the code is already in place (the method of solution is identical).

The bit that's quite a lot harder than I'd realised is extracting all the [M]AN[C]OVA bits from this; I've never really seen it done in full generality. The algorithm is just to take the fitted regression coefficients B and compare sums of squared errors from various parts of the model. There are at least six common ways of doing this (Type I through Type VI sums of squares). Random factors in this method are handled by linear combinations of the dependent variables Y in the M matrix (although I'm not 100% sure about this: my textbook says it must be so, but it's not obvious to me that this yields the same random-factor calculations as other methods).

Anyway, my feeling is that the work splits into three steps:

1. Build design matrices based on a given ANOVA/model spec.
2. Implement the various kinds of sums of squares.
3. Provide the UNIANOVA command and the other GLM interfaces for users in PSPP (and presumably the GLM, GLM repeated measures, etc. dialogs too).
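
To make the first two steps a bit more concrete, here is a rough sketch in plain Python (not PSPP code, and not a proposal for how math/linreg should do it). It builds an over-parameterised design matrix from two factors and computes sequential (Type I) sums of squares as the drop in residual SS when each term is added. All function names and the toy data are made up for illustration, and the Gram-Schmidt projection is just a convenient stand-in that copes with a rank-deficient X; the other types of sums of squares are not covered.

```python
def indicator_columns(factor):
    """One 0/1 column per level of a categorical factor (over-parameterised)."""
    levels = sorted(set(factor))
    return [[1.0 if value == level else 0.0 for value in factor]
            for level in levels]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def residual_ss(columns, y, tol=1e-10):
    """Residual sum of squares of y after projection onto span(columns).

    Uses modified Gram-Schmidt. Columns that are numerically linear
    combinations of earlier ones are dropped, so X need not be of full rank.
    """
    basis = []
    for col in columns:
        v = list(col)
        for q in basis:
            c = dot(v, q)
            v = [a - c * b for a, b in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:
            basis.append([a / norm for a in v])
    r = list(y)
    for q in basis:
        c = dot(r, q)
        r = [a - c * b for a, b in zip(r, q)]
    return dot(r, r)


def sequential_ss(terms, y):
    """Type I SS: drop in residual SS as each term is added, in order.

    The first term (normally the intercept) is the baseline and gets no entry.
    """
    ss = {}
    columns = []
    previous = None
    for name, cols in terms:
        columns = columns + cols
        rss = residual_ss(columns, y)
        if previous is not None:
            ss[name] = previous - rss
        previous = rss
    ss["Residual"] = previous
    return ss


# Made-up balanced 2x2 data, purely for illustration.
a = ["a1", "a1", "a1", "a1", "a2", "a2", "a2", "a2"]
b = ["b1", "b1", "b2", "b2", "b1", "b1", "b2", "b2"]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

terms = [
    ("Intercept", [[1.0] * len(y)]),
    ("A", indicator_columns(a)),
    ("B", indicator_columns(b)),
]
ss = sequential_ss(terms, y)
print(ss)
```

For this balanced toy data the sums of squares for A and B come out at about 32 and 8, with about 2 left as residual. Because the design is balanced, the order of A and B does not matter here; with unbalanced data it would, which is exactly where the different types of sums of squares start to disagree.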

If this seems reasonable, it would be really useful if anyone has good references on the following (pointers will do, I have access to a decent library):

* The exact method of calculation of the various kinds of sums of squares, especially Type IV. It seems to be an unsafe method that enjoyed brief popularity; it is now warned against but not described adequately.

* A decent discussion of fitting random factors in a GLM. Is the method I've read about (i.e. an orthonormal matrix M whose columns each sum to 0 and have magnitude 1) the correct one to use? And how do we estimate degrees of freedom here? (I read about the details of this once, but it was only of academic interest at the time and I've quickly forgotten!)

* A bit prosaic, but the command format: it isn't in our library (we have no non-GUI SPSS references). I pulled the SPSS 15 Base User's Guide from somewhere on the internet, but its description of the command format is terse to say the least, and it doesn't cover all the commands (for example, I read somewhere that legacy SPSS commands such as MANOVA are still available but no longer documented).

Ed
