Re: categorical data analysis

2000-04-22 Thread kermit

Date sent:  Sat, 22 Apr 2000 13:31:50 -0400 (EDT)
From:   "Donald F. Burrill" [EMAIL PROTECTED]
To:     Kermit Rose [EMAIL PROTECTED]
Copies to:  [EMAIL PROTECTED]
Subject:    Re: categorical data analysis


Hello  Donald!


 
 [ KR:  Please post any response to the edstat list as well as to me.
   I may not have the leisure to continue this conversation, and others on
   the list may have better advice for you in any case. -- DFB. ]

OK.  edstat is included on the reply list.

 
 To begin with, you have not told us what the professor in criminology is
 trying to find out, and why;  without that information, no-one can offer
 you (or the professor) useful advice on data analysis.
 

The dependent variable is based on the following question.

Recently the problem of overcrowding in the state university system 
in Florida has been the subject of considerable debate. Some have 
suggested the elimination of preferential treatment in college 
admissions as a potential remedy.  In your opinion, which of the 
following groups listed should continue receiving preferential 
treatment in the admissions process. (1 = yes, 0 = no)

q45a athletes
q45b national merit scholars (honor students)
q45c economically disadvantaged
q45d historically disadvantaged
q45e children of wealthy benefactors
q45f children of university alumni
q45g ethnic and racial minorities
q45h students with disabilities
q45i students with prior criminal records
q45j students with unique artistic talents

The goal is to determine whether a person's choices on the 
10 questions are predictable from the values of independent variables 
such as education, age, race, income, marital status, political party, 
sex, racial attitudes, etc.


 Your proposed procedure seems unnecessarily cumbersome.  So far as I can
 tell from all that algebra, you're effectively substituting a whole bunch
 of 2x2 tables for a single RxC table (R = number of rows, C = number of
 columns) with R > 2 and(/or?) C > 2.  Or, for each of several RxC tables.

Yes.  This is the intent.  I wanted to reduce the R by C table to a 
series of 2 by 2 tables.

  Why do you not first do the obvious contingency table chi-square 
 to see if there's anything worth following up?  (And if I were doing it,
 the follow-up(s) would be in the RxC format as well.)

It is because the dependent variable is multivalued.  A person may 
check 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 of the preferences.  I want to 
be able to identify which of the choices were checked, not just the 
number of them.  There is no theory for weighing the choices before 
the analysis is complete.  I wanted to construct a method for 
systematically examining all 1023 possible choices of model dependent 
variable (the non-empty subsets of the 10 items).
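
The figure of 1023 is the number of non-empty subsets of the 10 items, that is, 2^10 - 1. A minimal Python sketch of the enumeration follows. The item labels come from the questionnaire above; the function name `nonempty_subsets` is only an illustrative choice, not part of any described implementation.

```python
from itertools import combinations

# Item labels q45a..q45j, as listed in the survey question.
ITEMS = ["q45" + letter for letter in "abcdefghij"]

def nonempty_subsets(items):
    """Yield every non-empty subset of items as a tuple."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)

subsets = list(nonempty_subsets(ITEMS))
print(len(subsets))  # 2**10 - 1 = 1023
```

Each tuple names one candidate model dependent variable, so a systematic search would loop over `subsets`.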


  Your #1 null hypothesis is that two dichotomized variables are 
 independent.  I don't believe the test statistic you propose -- more about
 that later -- but this is the formal null hyp. for the usual contingency
 table chi square.
 

I can show the derivation of the test statistic.  I mistyped the 
formula and have given the correct version below. 

Do you mean that I've stated correctly that the null hypothesis for 
the chi-square test on a crosstab is that the variables are 
independent?


  The dependent variable is a multivalued categorical variable.
  So far, so good.
 
  The model dependent variable is an interaction variable.  It is the
  interaction of some subset of the 10 two-valued dummy variables
  representing the dependent variable.  The 10 values of the dependent
  variable are choices of preferential treatment for affirmative action. 



  Here it begins to get sticky.  I cannot tell whether you mean the 
  same thing by "interaction" that I would mean.  In particular, 
  there seems to be no difference between "interaction variable", 
  in your terms, and "indicator variable", in my terms.


I'm not sure what you mean by "indicator variable".  I've only seen 
the term in connection with latent variables in structural equation 
modeling.

I'm pretty sure my use of "interaction" is consistent with your use 
of "interaction".

Suppose I chose the 1st, 3rd, and 5th dependent variables as my 
subset to represent the model dependent variable.  I chose the 
word "interaction" because I expected readers would know I meant 
the product of the chosen dependent variables.  Since each 
dependent variable is 0 or 1, the product will be 1 if and only if 
every one of the chosen dependent variables is 1.

That is, I would call the model dependent variable true (meaning = 1) 
if it corresponded to dependent variables 1, 3 and 5 and all three 
of dependent variables 1, 3, 5 were equal to 1.  For this model I 
would not care what dependent variables 2, 4, 6, 7, 8, 9, 10 were.

When I implement this in programming I would of course figure out 
efficient ways to calculate the frequencies. 
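
The product definition above can be sketched in a few lines of Python. This is only an illustration: the response rows are made up, items are numbered 1 to 10 in the order q45a to q45j, and `model_dependent` is a hypothetical helper name.

```python
def model_dependent(responses, chosen):
    """Return 1 if every chosen item (1-based) is 1 in this case, else 0.

    Items outside `chosen` are ignored, as in the description above.
    """
    product = 1
    for item in chosen:
        product *= responses[item - 1]
    return product

# Made-up cases: each row holds the ten 0/1 answers q45a..q45j.
cases = [
    [1, 0, 1, 0, 1, 0, 0, 0, 0, 0],  # items 1, 3, 5 all checked
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],  # item 5 not checked
    [1, 0, 1, 1, 1, 1, 0, 1, 0, 1],  # items 1, 3, 5 checked, plus others
]

flags = [model_dependent(row, chosen=(1, 3, 5)) for row in cases]
print(flags)  # [1, 0, 1]
```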

categorical data analysis

2000-04-21 Thread Kermit Rose

Hello friend.

The following is my proposed methodology for analyzing survey data for a
professor in criminology at Florida State University.

I'd like someone experienced in categorical data analysis to review it and
email me comments, or criticisms, or suggestions.

Thank you.

Kermit Rose
[EMAIL PROTECTED]


The dependent variable is a multivalued categorical variable. The model
dependent variable is an interaction variable.  It is the interaction of
some subset of the 10 two-valued dummy variables representing the dependent
variable.  The 10 values of the dependent variable are choices of
preferential treatment for affirmative action.

The model independent variable is also an interaction variable.  It is
the interaction of two-valued dummy variables representing some subset
of the predicting variables.

The raw data variables are either categorical variables or ordinal level
variables.

Suppose V is a categorical variable with values v1,v2,v3.
Then V is converted to two variables  X1,X2 with

X1 = 1 if V = v1, and X1 = 0 otherwise.
X2 = 1 if V = v2, and X2 = 0 otherwise.

Suppose R is an ordinal level variable with values r1 < r2 < r3 < r4.
Then R is converted to three variables S1,S2,S3 with

S1 = 1 if R = r1 and S1 = 0 otherwise.
S2 = 1 if R = r1 or r2 and S2 = 0 otherwise.
S3 = 1 if R = r1 or r2 or r3 and S3 = 0 otherwise.
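
The two coding schemes just described can be sketched in Python as follows. The function names `dummy_code` and `cumulative_code` are hypothetical, and the sketch assumes the levels are passed in the order given above, with the last level acting as the reference (all zeros).

```python
def dummy_code(value, levels):
    """Indicator coding: one 0/1 variable per level except the last."""
    return [1 if value == level else 0 for level in levels[:-1]]

def cumulative_code(value, ordered_levels):
    """Cumulative coding: S_k = 1 if value is at or below the k-th level."""
    position = ordered_levels.index(value)
    return [1 if position <= k else 0 for k in range(len(ordered_levels) - 1)]

print(dummy_code("v2", ["v1", "v2", "v3"]))             # [0, 1]
print(dummy_code("v3", ["v1", "v2", "v3"]))             # [0, 0]
print(cumulative_code("r2", ["r1", "r2", "r3", "r4"]))  # [0, 1, 1]
print(cumulative_code("r4", ["r1", "r2", "r3", "r4"]))  # [0, 0, 0]
```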

We say the model independent variable is true for a case if all the
independent variable values are present in the case. Otherwise the
model independent variable is false for that case.

We say the model dependent variable is true for a case if all the dependent
variable values are present in the case.  Otherwise the dependent variable
is false for that case.

We define parameters t,I,D and N as follows.

t is the number of cases where both the model independent and model dependent
variable are true.

I is the number of cases where the model independent variable is true.

D is the number of cases where the model Dependent variable is true.

N is the total number of cases.
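
As a small illustration of these four definitions, the Python sketch below counts t, I, D and N from two lists of 0/1 values, one entry per case. The data are made up and `count_parameters` is a hypothetical helper name.

```python
def count_parameters(independent, dependent):
    """Return (t, I, D, N) for two equal-length lists of 0/1 values."""
    if len(independent) != len(dependent):
        raise ValueError("both variables need one value per case")
    t = sum(x * y for x, y in zip(independent, dependent))  # both true
    I = sum(independent)                                    # independent true
    D = sum(dependent)                                      # dependent true
    N = len(independent)                                    # all cases
    return t, I, D, N

# Made-up data for eight cases.
x = [1, 1, 1, 0, 0, 1, 0, 0]   # model independent variable
y = [1, 1, 0, 0, 1, 1, 0, 0]   # model dependent variable
print(count_parameters(x, y))  # (3, 4, 4, 8)
```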

Model       Model
Dependent   Independent   Count      Expected        Col pct
variable    variable

true        true          t          D*I/N           t/D
true        false         D-t        D(1-I/N)        1 - t/D
true        Total         D          D               1

false       true          I-t        I(1-D/N)        (I-t)/(N-D)
false       false         N-I-D+t    N-I-D+D*I/N     1 - (I-t)/(N-D)
false       Total         N-D        N-D             1

Total       true          I          I               I/N
Total       false         N-I        N-I             1 - I/N
Total       Total         N          N               1
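
The observed and expected counts in the table can be generated from t, I, D and N alone. The following Python sketch shows one way to do that; the function name `two_by_two` and the example counts are assumptions made for illustration.

```python
def two_by_two(t, I, D, N):
    """Observed and expected counts keyed by (dependent, independent)."""
    observed = {
        (True, True): t,
        (True, False): D - t,
        (False, True): I - t,
        (False, False): N - I - D + t,
    }
    # Expected counts under independence: row total * column total / N.
    expected = {
        (True, True): D * I / N,
        (True, False): D * (N - I) / N,
        (False, True): I * (N - D) / N,
        (False, False): (N - I) * (N - D) / N,
    }
    return observed, expected

observed, expected = two_by_two(t=30, I=50, D=40, N=100)
print(observed)
print(expected)
```

The expected count for the (false, false) cell is written here as (N-I)*(N-D)/N, which is algebraically the same as N-I-D+D*I/N in the table.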



  Covariance of model independent and model dependent variable =

  (t - I^2/N - D^2/N + D * I/N)/(N-1)

  variance of model independent variable =

  I (1 - I/N)/(N-1)

  Variance of model Dependent variable =

  D (1 - D/N)/(N-1)

  square of correlation between model independent variable and model dependent
  variable is  r-square =

  [ t - I^2 /N - D^2 /N  + D*I/N ]/ [I* D * (1 - I/N)* (1-D/N) ] 


 Chisquare of crosstab of model independent variable with model dependent
 variable is

 (t - D*I/N)*( N/[D*I] + N/[I*(N-D)] + N/[D*(N-I)] + N/[(N-I)*(N-D)] )
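
As a cross-check, the textbook statistics for a 2x2 table can be computed directly from t, I, D and N. The Python sketch below is an illustration under stated assumptions, not necessarily the derivation intended above: it uses the usual sample covariance (t - D*I/N)/(N-1), takes r-square as the squared covariance over the product of the two variances, and squares the deviation t - D*I/N in the Pearson chi-square, so its results can differ from the expressions as typed. The function name `two_by_two_statistics` is hypothetical, the counts are made up, and the sketch assumes 0 < I < N and 0 < D < N so that no denominator is zero.

```python
def two_by_two_statistics(t, I, D, N):
    """Covariance, r-square and Pearson chi-square from the four counts."""
    deviation = t - D * I / N  # observed minus expected in the (true, true) cell
    covariance = deviation / (N - 1)
    var_independent = I * (1 - I / N) / (N - 1)
    var_dependent = D * (1 - D / N) / (N - 1)
    r_square = covariance ** 2 / (var_independent * var_dependent)
    # Every cell deviates from its expected count by +/- deviation, so the
    # Pearson sum of (O - E)^2 / E factors into deviation^2 * sum(1 / E).
    chi_square = deviation ** 2 * (
        N / (D * I)
        + N / (I * (N - D))
        + N / (D * (N - I))
        + N / ((N - I) * (N - D))
    )
    return covariance, r_square, chi_square

cov, r2, chi2 = two_by_two_statistics(t=30, I=50, D=40, N=100)
print(cov, r2, chi2)
```

For a 2x2 table, chi-square computed this way equals N times r-square, which gives a quick consistency check on any implementation.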

The significance number that is calculated for a statistic is the
predicted probability that the null hypothesis is true.

There are two different null hypotheses of interest.

null_1:
The dependent variable does not depend on independent variable.  

The significance of null_1 is

(D - t)/N


null_2:

There is not a bidirectional relationship between the independent variable
and the dependent variable.

The significance of null_2 is (I+D-2*t)/N


-- 
signature
To be sure I see your response, use e-mail.
You may post, repost, or publish ANY communication received from me.


===
This list is open to everyone.  Occasionally, less thoughtful
people send inappropriate messages.  Please DO NOT COMPLAIN TO
THE POSTMASTER about these messages because the postmaster has no
way of controlling them, and excessive complaints will result in
termination of the list.

For information about this list, including information about the
problem of inappropriate messages and information about how to
unsubscribe, please see the web page at
http://jse.stat.ncsu.edu/
===