On Tue, Oct 27, 2009 at 06:38:19AM +0000, John Darrington wrote:
> Just to make sure I understand things correctly, consider the following
> example, where x and y are numeric variables and A and B are
> categorical ones:
>
>   x  y  A  B
>   ==========
>   3  4  x  v
>   5  6  y  v
>   7  8  z  w
>
> We replace the categorical variables with bit_vectors:
>
>   x  y  A_0  A_1  A_2  B_0  B_1
>   =============================
>   3  4   1    0    0    1    0
>   5  6   0    1    0    1    0
>   7  8   0    0    1    0    1
>
> and arbitrarily drop the (say zeroth) subscript:
>
>   x  y  A_1  A_2  B_1
>   ===================
>   3  4   0    0    0
>   5  6   1    0    0
>   7  8   0    1    1
>
> That will produce a 5x5 matrix.  5 is calculated from n + m - p, where
> n is the number of numeric variables, m is the total number of categories,
> and p is the number of categorical variables.

This is correct.

> However I don't see how such a matrix can be very useful.  A better one
> would involve the products of the categorical and numeric variables:
>
>   x  y  x*A_1  x*A_2  y*A_1  y*A_2  x*B_1  y*B_1
>   ==============================================
>   3  4    0      0      0      0      0      0
>   5  6    5      0      6      0      0      0
>   7  8    0      7      0      8      7      8
>
> This makes an 8x8 matrix, where 8 is calculated from n + n * (m - p),
> which happens to be identical to n * (1 + m - p).  But this involves
> a whole lot more calculations.

This second choice would give you the covariance of x and y, and the
covariances of the *interactions* between x and A, x and B, y and A,
and y and B, but not the covariance between (say) x and A. The
covariance between x and A would be stored in the first matrix you
mentioned, in elements (0,2), (0,3), (2,0) and (3,0) assuming we kept
both upper and lower triangles.

You mention that matrix not being very useful, and in a sense it
isn't: No human would care about the covariance between x and the
column corresponding to the first bit vector of A. But in another
sense, that matrix is absolutely necessary: It's used to solve the
least squares problem, whose solution we use to tell us if A and our
dependent variable are related. That relation is shown via analysis of
variance, whose p-value is many computations away from the covariance
matrix, but depends on it nevertheless.

This matrix is unnecessary for a one-way ANOVA, whose computations
from the matrix above can be simplified into the simple sums used in
oneway.q. But for a bigger model, with many factors and interactions
and covariates, we need that first matrix because we can't reduce the
problem to a few easy-to-read summations.

_______________________________________________
pspp-dev mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/pspp-dev
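
For readers who want to check the arithmetic in this thread, here is a
minimal sketch in plain Python (standard library only). It is not PSPP
code: the helper names indicator_columns and covariance_matrix are
hypothetical, and the use of the sample covariance with divisor n - 1 is
an assumption, since the thread does not say which divisor is meant. It
builds both designs from the three-row example and shows where the
covariances between x and the columns of A land in the first matrix.

```python
# Illustrative sketch only; none of these names come from PSPP.
rows = [
    (3.0, 4.0, "x", "v"),
    (5.0, 6.0, "y", "v"),
    (7.0, 8.0, "z", "w"),
]


def indicator_columns(values):
    """Return 0/1 indicator columns for a categorical variable,
    dropping the zeroth category (in sorted order)."""
    categories = sorted(set(values))
    kept = categories[1:]
    return [[1.0 if v == c else 0.0 for v in values] for c in kept]


def covariance_matrix(columns):
    """Sample covariance matrix (divisor n - 1, an assumption here)
    of a list of equal-length columns."""
    n = len(columns[0])
    means = [sum(col) / n for col in columns]
    k = len(columns)
    return [
        [
            sum(
                (columns[i][r] - means[i]) * (columns[j][r] - means[j])
                for r in range(n)
            ) / (n - 1)
            for j in range(k)
        ]
        for i in range(k)
    ]


x = [r[0] for r in rows]
y = [r[1] for r in rows]
a_cols = indicator_columns([r[2] for r in rows])  # A_1, A_2
b_cols = indicator_columns([r[3] for r in rows])  # B_1

# First design: x, y, A_1, A_2, B_1  ->  n + m - p = 2 + 5 - 2 = 5 columns.
design = [x, y] + a_cols + b_cols
cov = covariance_matrix(design)

# Second design: x, y and the numeric-by-indicator products
# -> n + n * (m - p) = 2 + 2 * 3 = 8 columns.
def times(num, ind):
    return [u * v for u, v in zip(num, ind)]

interactions = (
    [x, y]
    + [times(x, a) for a in a_cols]
    + [times(y, a) for a in a_cols]
    + [times(x, b) for b in b_cols]
    + [times(y, b) for b in b_cols]
)

print(len(cov), len(interactions))
print("cov(x, A_1) =", cov[0][2], " cov(x, A_2) =", cov[0][3])
```

With zero-based indices, cov[0][2] and cov[0][3] (and their mirror
images cov[2][0] and cov[3][0]) are the elements referred to above as
(0,2), (0,3), (2,0) and (3,0). The interaction design contains no
column for A_1 or A_2 on its own, which is why its covariance matrix
cannot supply those entries.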
