On Tue, Oct 27, 2009 at 06:25:32PM +0000, John Darrington wrote:
> Will that be enough to allow a subset of GLM to be implemented?
Yes, except for the interactions.

> J'
>
> On Tue, Oct 27, 2009 at 11:47:23AM -0400, Jason Stover wrote:
> > On Tue, Oct 27, 2009 at 06:38:19AM +0000, John Darrington wrote:
> > > Just to make sure I understand things correctly, consider the
> > > following example, where x and y are numeric variables and A and
> > > B are categorical ones:
> > >
> > >    x y A B
> > >    =======
> > >    3 4 x v
> > >    5 6 y v
> > >    7 8 z w
> > >
> > > We replace the categorical variables with bit vectors:
> > >
> > >    x y A_0 A_1 A_2 B_0 B_1
> > >    ========================
> > >    3 4  1   0   0   1   0
> > >    5 6  0   1   0   1   0
> > >    7 8  0   0   1   0   1
> > >
> > > and arbitrarily drop one of the subscripts (say the zeroth):
> > >
> > >    x y A_1 A_2 B_1
> > >    ================
> > >    3 4  0   0   0
> > >    5 6  1   0   0
> > >    7 8  0   1   1
> > >
> > > That will produce a 5x5 matrix.  5 is calculated from n + m - p,
> > > where n is the number of numeric variables, m is the total number
> > > of categories, and p is the number of categorical variables.
> >
> > This is correct.
> >
> > > However, I don't see how such a matrix can be very useful.  A
> > > better one would involve the products of the categorical and
> > > numeric variables:
> > >
> > >    x y x*A_1 x*A_2 y*A_1 y*A_2 x*B_1 y*B_1
> > >    ========================================
> > >    3 4   0     0     0     0     0     0
> > >    5 6   5     0     6     0     0     0
> > >    7 8   0     7     0     8     7     8
> > >
> > > This makes an 8x8 matrix, where 8 is calculated from
> > > n + n * (m - p), which happens to be identical to n * (1 + m - p).
> > > But this involves a whole lot more calculations.
> >
> > This second choice would give you the covariance of x and y, and the
> > covariances of the *interactions* between x and A, x and B, y and A,
> > and y and B, but not the covariance between (say) x and A.  The
> > covariance between x and A would be stored in the first matrix you
> > mentioned, in elements (0,2), (0,3), (2,0) and (3,0), assuming we
> > kept both upper and lower triangles.
> >
> > You mention that matrix not being very useful, and in a sense it
> > isn't: no human would care about the covariance between x and the
> > column corresponding to the first bit vector of A.  But in another
> > sense, that matrix is absolutely necessary: it is used to solve the
> > least squares problem, whose solution tells us whether A and our
> > dependent variable are related.  That relation is shown via analysis
> > of variance, whose p-value is many computations away from the
> > covariance matrix, but depends on it nevertheless.
> >
> > This matrix is unnecessary for a one-way ANOVA, whose computations
> > can be reduced from the matrix above to the simple sums used in
> > oneway.q.  But for a bigger model, with many factors, interactions,
> > and covariates, we need that first matrix because we can't reduce
> > the problem to a few easy-to-read summations.
> >
> > _______________________________________________
> > pspp-dev mailing list
> > [email protected]
> > http://lists.gnu.org/mailman/listinfo/pspp-dev
>
> --
> PGP Public key ID: 1024D/2DE827B3
> fingerprint = 8797 A26D 0854 2EAB 0285 A290 8A67 719C 2DE8 27B3
> See http://pgp.mit.edu or any PGP keyserver for public key.
