Date sent: Sat, 22 Apr 2000 13:31:50 -0400 (EDT)
From: "Donald F. Burrill" [EMAIL PROTECTED]
To: Kermit Rose [EMAIL PROTECTED]
Copies to: [EMAIL PROTECTED]
Subject: Re: categorical data analysis
Hello Donald!
[ KR: Please post any response to the edstat list as well as to me.
I may not have the leisure to continue this conversation, and others on
the list may have better advice for you in any case. -- DFB. ]
OK. edstat is included on the reply list.
To begin with, you have not told us what the professor in criminology is
trying to find out, and why; without that information, no-one can offer
you (or the professor) useful advice on data analysis.
The dependent variable is based on the following question.
Recently the problem of overcrowding in the state university system
in Florida has been the subject of considerable debate. Some have
suggested the elimination of preferential treatment in college
admissions as a potential remedy. In your opinion, which of the
following groups listed should continue receiving preferential
treatment in the admissions process? (1 = yes, 0 = no)
q45a athletes
q45b national merit scholars (honor students)
q45c economically disadvantaged
q45d historically disadvantaged
q45e children of wealthy benefactors
q45f children of university alumni
q45g ethnic and racial minorities
q45h students with disabilities
q45i students with prior criminal records
q45j students with unique artistic talents
The goal is to determine whether or not a person's choices on the
10 questions are predictable in terms of independent variable values
of education, age, race, income, marital status, political party, sex,
racial attitudes, etc.
Your proposed procedure seems unnecessarily cumbersome. So far as I can
tell from all that algebra, you're effectively substituting a whole bunch
of 2x2 tables for a single RxC table (R = number of rows, C = number of
columns) with R > 2 and(/or?) C > 2. Or, for each of several RxC tables.
Yes. This is the intent. I wanted to reduce the R by C table to a
series of 2 by 2 tables.
Why do you not first do the obvious contingency table chi-square
to see if there's anything worth following up? (And if I were doing it,
the follow-up(s) would be in the RxC format as well.)
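For reference, the usual Pearson chi-square on an RxC table can be
sketched in a few lines of Python. This is only a minimal sketch of the
standard statistic, not the test statistic proposed earlier in this
thread. The function name and the example counts are made up for
illustration, and every row and column total is assumed to be positive.

```python
def pearson_chi_square(table):
    """Return (statistic, degrees of freedom) for an R x C table of counts.

    `table` is a list of rows, each row a list of observed counts.
    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical counts: rows = two respondent groups, columns = answer 0 / 1.
stat, dof = pearson_chi_square([[10, 20], [20, 10]])
print(stat, dof)
```

The statistic would then be compared with a chi-square distribution on
`dof` degrees of freedom.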
It is because the dependent variable is multivalued. A person may
check 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 of the preferences. I want to be
able to identify which of the choices were checked, not just the
number of them. There is no theory for weighing the choices before
the analysis is complete. I wanted to construct a method for
systematically examining all 1023 possible choices of dependent
variable (the 2^10 - 1 non-empty subsets of the 10 items).
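The count of 1023 can be checked by listing the non-empty subsets
directly. The following is a minimal Python sketch, assuming the ten
items are labelled q45a through q45j as in the question above; the
function name is hypothetical.

```python
from itertools import combinations

ITEMS = ["q45a", "q45b", "q45c", "q45d", "q45e",
         "q45f", "q45g", "q45h", "q45i", "q45j"]


def nonempty_subsets(items):
    """Yield every non-empty subset of `items` as a tuple, smallest first."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


subsets = list(nonempty_subsets(ITEMS))
print(len(subsets))  # 2**10 - 1 subsets in total
```

Each tuple in `subsets` names one candidate model dependent variable.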
Your #1 null hypothesis is that two dichotomized variables are
independent. I don't believe the test statistic you propose -- more about
that later -- but this is the formal null hyp. for the usual contingency
table chi square.
I can show the derivation of the test statistic. I mistyped the
formula and have given the correct version below.
Do you mean that I've stated correctly that the null hypothesis for
the chi square test on crosstab is that the variables are
independent?
The dependent variable is a multivalued categorical variable.
So far, so good.
The model dependent variable is an interaction variable. It is the
interaction of some subset of the 10 two-valued dummy variables
representing the dependent variable. The 10 values of the dependent
variable are choices of preferential treatment for affirmative action.
Here it begins to get sticky. I cannot tell whether you mean the
same thing by "interaction" that I would mean. In particular,
there seems to be no difference between "interaction variable",
in your terms, and "indicator variable", in my terms.
I'm not sure what you mean by "indicator variable". I've only seen
the term in connection with latent variables in structural equation
modeling.
I'm pretty sure my use of "interaction" is consistent with your use
of "interaction".
Suppose I chose the 1st, 3rd, and 5th dependent variables as my
subset to represent the model dependent variable. I chose the
word "interaction" because I expected readers would know I meant
the product of the chosen dependent variables. Since each
dependent variable is 0 or 1, the product will be 1 if and only if
every one of the chosen dependent variables is 1.
That is, if the model corresponded to dependent variables 1, 3, and 5,
I would call the model dependent variable true (meaning = 1) when all
three of dependent variables 1, 3, and 5 were equal to 1. For this
model I would not care what dependent variables 2, 4, 6, 7, 8, 9, and
10 were.
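As a minimal Python sketch of that product, assuming each respondent's
answers are stored as a mapping from item name to 0 or 1 (the function
name and the example respondent are hypothetical):

```python
from math import prod


def model_dependent_variable(responses, chosen):
    """Return 1 if every item in `chosen` is 1 in `responses`, else 0.

    `responses` maps item names to 0 or 1; items outside `chosen`
    are ignored, whatever their values.
    """
    return prod(responses[item] for item in chosen)


# Hypothetical respondent: checked items 1, 3, 5, and 10 only.
respondent = {"q45a": 1, "q45b": 0, "q45c": 1, "q45d": 0, "q45e": 1,
              "q45f": 0, "q45g": 0, "q45h": 0, "q45i": 0, "q45j": 1}

print(model_dependent_variable(respondent, ["q45a", "q45c", "q45e"]))  # 1
print(model_dependent_variable(respondent, ["q45a", "q45b", "q45c"]))  # 0
```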
When I implement this in a program I would of course figure out
efficient ways to calculate the frequencies.
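One simple way to tally the frequencies for a single 2x2 table is
sketched below in Python. It assumes the independent variable has
already been dichotomized to 0 or 1 and that the model dependent
variable is 0 or 1; the function name and the sample data are
hypothetical, and no claim is made that this is the most efficient
approach.

```python
from collections import Counter


def two_by_two(pairs):
    """Tally (independent, dependent) pairs of 0/1 values into a 2x2 table.

    Returns [[n00, n01], [n10, n11]], where n_ij is the number of
    respondents with independent value i and dependent value j.
    """
    counts = Counter(pairs)
    # A Counter returns 0 for combinations that never occur.
    return [[counts[(i, j)] for j in (0, 1)] for i in (0, 1)]


# Hypothetical data: four respondents.
table = two_by_two([(0, 0), (0, 1), (1, 1), (1, 1)])
print(table)  # [[1, 1], [0, 2]]
```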