[R] Multivariate discrete HMMs

2013-09-02 Thread Claus O'Rourke
Hi r-help,

I have been using your RHmm package for some time and have recently
had to try using the package for a new dataset.

Basically I have a dataset with a number of discrete observation
variables that change over time, and I would love to try modeling them
using a HMM.

Basically I was wondering if RHmm can be used to model a multivariate
discrete HMM, i.e., the observations are a vector of discrete
measurements? From what I see in the documentation and from playing
around with examples, it seems like this may not be possible. My
understand of the mathematics behind multivariate HMMs is limited, so
I would appreciate any advance you might be able to give.

Thanks for any help anyone can give

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Splitting a categorical variable into multiple variables

2013-08-09 Thread Claus O'Rourke
Thanks Bert. I guess I was just wondering if there was a way to create
the new factors automatically without me having to hard code the level
names manually in my R code.

Rgds

Claus

On Fri, Aug 9, 2013 at 3:42 PM, Bert Gunter  wrote:
> ... or if you want to keep the unchanged levels the same:
>
> zz <- factor(ifelse( z %in% c("a", "b"),"d" ,levels(z)[z]))
>
> -- Bert
>
> On Fri, Aug 9, 2013 at 7:35 AM, Bert Gunter  wrote:
>> If I understand what you mean, just recode them.
>>
>> z <- factor(letters[1:3])
>> z
>> zz <- factor(ifelse( z %in% c("a", "b"),"d" ,z))
>> zz
>>
>> Cheers,
>> Bert
>>
>> On Fri, Aug 9, 2013 at 7:10 AM, Claus O'Rourke  
>> wrote:
>>> Hello R-Help,
>>> I have a variable with > 32 levels and I'd like to split this into two
>>> variables such that both new variables have >= 32 variables. This is
>>> to handle the limit of 32 level predictor variables in R's Random
>>> Forest implementation. Might someone be able to suggest an elegant way
>>> to do this? I've tried googling for this, but haven't hit the right
>>> search terms.
>>>
>>> Regards
>>>
>>> __
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>>
>> Internal Contact Info:
>> Phone: 467-7374
>> Website:
>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Splitting a categorical variable into multiple variables

2013-08-09 Thread Claus O'Rourke
Hello R-Help,
I have a variable with > 32 levels and I'd like to split this into two
variables such that both new variables have >= 32 variables. This is
to handle the limit of 32 level predictor variables in R's Random
Forest implementation. Might someone be able to suggest an elegant way
to do this? I've tried googling for this, but haven't hit the right
search terms.

Regards

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Levels in new data fed to SVM

2013-01-10 Thread Claus O'Rourke
Thanks for clarifying!

On Thu, Jan 10, 2013 at 12:47 PM, Uwe Ligges
 wrote:
>
>
> On 08.01.2013 21:14, Claus O'Rourke wrote:
>>
>> Hi all,
>> I've encountered an issue using svm (e1071) in the specific case of
>> supplying new data which may not have the full range of levels that
>> were present in the training data.
>>
>> I've constructed this really primitive example to illustrate the point:
>>
>>> library(e1071)
>>> training.data <- data.frame(x = c("yellow","red","yellow","red"), a =
>>> c("alpha","alpha","beta","beta"), b = c("a", "b", "a", "c"))
>>> my.model <- svm(x ~ .,data=training.data)
>>> test.data <- data.frame(x = c("yellow","red"), a = c("alpha","beta"), b =
>>> c("a", "b"))
>>> predict(my.model,test.data)
>>
>> Error in predict.svm(my.model, test.data) :
>>test data does not match model !
>>>
>>>
>>> levels(test.data$b) <- levels(training.data$b)
>>> predict(my.model,test.data)
>>
>>   1  2
>> yellowred
>> Levels: red yellow
>>
>> In the first case test.data$b does not have the level "c" and this
>> results in the input data being rejected. I've debugged this down to
>> the point of model matrix creation in the SVM R code. Once I fill up
>> the levels in the test data with the levels from the original data,
>> then there is no problem at all.
>>
>> Assuming my test data has to come from another source where the number
>> of category levels seen might not always be as large as those for the
>> original training data, is there a better way I should be handling
>> this?
>
>
>
> You have to tell the factor about the possible levels, it does not
> necessarily contain examples.
> That means:
>
> levels(test.data$b) <- C("a", "b", "c")
> predict(my.model,test.data)
>
> will help.
>
> Best,
> Uwe Ligges
>
>
>
>> Thanks
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Levels in new data fed to SVM

2013-01-08 Thread Claus O'Rourke
Hi all,
I've encountered an issue using svm (e1071) in the specific case of
supplying new data which may not have the full range of levels that
were present in the training data.

I've constructed this really primitive example to illustrate the point:

> library(e1071)
> training.data <- data.frame(x = c("yellow","red","yellow","red"), a = 
> c("alpha","alpha","beta","beta"), b = c("a", "b", "a", "c"))
> my.model <- svm(x ~ .,data=training.data)
> test.data <- data.frame(x = c("yellow","red"), a = c("alpha","beta"), b = 
> c("a", "b"))
> predict(my.model,test.data)
Error in predict.svm(my.model, test.data) :
  test data does not match model !
>
> levels(test.data$b) <- levels(training.data$b)
> predict(my.model,test.data)
 1  2
yellowred
Levels: red yellow

In the first case test.data$b does not have the level "c" and this
results in the input data being rejected. I've debugged this down to
the point of model matrix creation in the SVM R code. Once I fill up
the levels in the test data with the levels from the original data,
then there is no problem at all.

Assuming my test data has to come from another source where the number
of category levels seen might not always be as large as those for the
original training data, is there a better way I should be handling
this?

Thanks

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Training parameters for a HMM

2011-12-19 Thread Claus O'Rourke
Hi,
I'm a newbie to the world of HMMs and HMMs in R. I've had a look at
the hmm package and the RHmm package but I couldn't see anything
straightforward on how a labelled sequential dataset with observed
values and underlying states might be used to construct and train a
HMM based on that data and no pre-computed values for the transition,
emission or initial state distributions. Does anyone have any excerpts
of code that could get me moving in the right direction?

To put it another way, lets say that I have the simple HMM topology
that is shown here:
http://en.wikipedia.org/wiki/File:HiddenMarkovModel.png
And I have somehow collected datasets with observations and labelled
hidden states:

Sequence 1:
Obs Hid
  y1X1
  y2X2
  y2X2
  y4X1
 ...  ...
  y3X3

...

Sequence N:
Obs Hid
  y2X1
  y2X2
  y2X1
  y4X1
 ...  ...
  y4X1

I'm assuming categorial variables for y and x.

I know I really am starting from from scratch here, so I'd be very
grateful if anyone could point out to me how I could go about
automatically constructing and parameterizing a HMM for data like this.

Thanks for your patience.

Claus.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Newbie-ish question on iteratively applying function to dataframe

2011-03-16 Thread Claus O'Rourke
Brilliant - that was really useful!

On Tue, Mar 15, 2011 at 3:46 PM, Ista Zahn  wrote:
> Hi Claus,
>
> On Tue, Mar 15, 2011 at 9:33 AM, Claus O'Rourke  
> wrote:
>> Hi,
>> I am trying to recursively apply a function to a selection of columns
>> in a dataframe. I've had a look around and from what I have read, I
>> should be using some version of the apply function, but I'm really
>> having some headaches with it.
>
> I would just do it in a loop (see below)
>>
>> Let me be more specific with an example.
>>
>> Say I have a data frame similar to the following
>>
>> A     x     y     z     r1    r2    r3    r4
>> 0.1  0.2  0.1 ...
>> 0.1  0.3 ...
>> 0.2 ...
>>
>> i.e., a number of columns, each of the same length, and all containing
>> real numbers. Of these columns, I want to model one variable, say A,
>> as a function of other variables, say x, y, z, and any one of my r1,
>> r2, r3, ... variables.
>>
>> i.e., I want to model
>> A ~ x + y + z + r1
>> A ~ x + y + z + r2
>> 
>> A ~ x + y + z + rn
>>
>> But where the number of 'r' variables I will have will be large, and I
>> don't know the specific number of these variables in advance.
>>
>> My question first is, how can I select all the columns in a dataframe
>> that have a heading that matches a string pattern?
>
> ?grep
>
>>
>> And then related to this, what would be the best way of repeatedly
>> applying my modelling function to the result?
>
> Well, I don't know about the "best" way. But why not just
>
> set.seed(21 )
> dat <- as.data.frame(matrix(rnorm(10 ), ncol=100, dimnames=list
> (1:1000, c("A", "x", "y", "z", paste("r", 1:96, sep="" )
>
> mods <- list()
> for(i in grep("r", names(dat ), value=TRUE)) {
>    mods[[i]] <- lm(as.formula(paste("A ~ x + y + z + ", i)), data=dat )
> }
>
> Note that  you should be cautious about making any inferences based on
> this kind of method. In the example above 9 r variables are
> "significant" at the .05 level, even though the data was generated
> "randomly":
>
> sort(sapply(mods, function(x) coef(summary(x))[5, 4]))
>
> Best,
> Ista
>>
>> Many thanks for any help for this occasional R armature.
>>
>> Claus
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Ista Zahn
> Graduate student
> University of Rochester
> Department of Clinical and Social Psychology
> http://yourpsyche.org
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Newbie-ish question on iteratively applying function to dataframe

2011-03-15 Thread Claus O'Rourke
Hi,
I am trying to recursively apply a function to a selection of columns
in a dataframe. I've had a look around and from what I have read, I
should be using some version of the apply function, but I'm really
having some headaches with it.

Let me be more specific with an example.

Say I have a data frame similar to the following

A x y z r1r2r3r4
0.1  0.2  0.1 ...
0.1  0.3 ...
0.2 ...

i.e., a number of columns, each of the same length, and all containing
real numbers. Of these columns, I want to model one variable, say A,
as a function of other variables, say x, y, z, and any one of my r1,
r2, r3, ... variables.

i.e., I want to model
A ~ x + y + z + r1
A ~ x + y + z + r2

A ~ x + y + z + rn

But where the number of 'r' variables I will have will be large, and I
don't know the specific number of these variables in advance.

My question first is, how can I select all the columns in a dataframe
that have a heading that matches a string pattern?

And then related to this, what would be the best way of repeatedly
applying my modelling function to the result?

Many thanks for any help for this occasional R armature.

Claus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Mixed Design ANOVA - singular error model

2010-10-18 Thread Claus O'Rourke
Dear r-help list,
I would like to run a mixed design anova to compare the results from
one population sample to another. Here my within subject variable
(stiulusID) has 45 levels and my between subject variable (group) has
two levels. In addition to my number of levels in the within subject
variable being very large, one other 'feature' of my data is that it
is not balanced on the between subject variable. I have attached a
copy of my data for reference.

To run the ANOVA, I did this:
summary(aov(Result ~ StimulusID * Group +
Error(ParticipantToken/(StimulusID + Group)), data = data))

My results were roughly as I had expected, but at the end of the
output I have the warning that the model is singular. I have seen this
warning listed in other help requests, but from what I saw there this
meant that one of my variables was redundant as it was a nesting of
the other. I don't think this is the case in my data.

Can I trust the results of a Mixed Design ANOVA with the warning? Or
indeed with the unbalanced between subject variable?

Many thanks for any help for a stats novice.

Error: ParticipantToken
  Df Sum Sq Mean Sq F value Pr(>F)
Group  1   0.69  0.6934  0.0437 0.8356
Residuals 39 619.47 15.8840

Error: ParticipantToken:StimulusID
   Df Sum Sq Mean Sq F valuePr(>F)
StimulusID 44 4607.4 104.713 60.3868 < 2.2e-16 ***
StimulusID:Group   44  170.3   3.870  2.2317 8.106e-06 ***
Residuals1716 2975.6   1.734
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Warning message:
In aov(Result ~ StimulusID * Group + Error(ParticipantToken/(StimulusID +  :
  Error() model is singular
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Pairwise cross correlation from data set

2010-06-13 Thread Claus O'Rourke
Dear list,

Following up on an earlier post, I would like to reorder a dataset and
compute pairwise correlations. But I'm having some real problems
getting this done.

My data looks something like:

Participant Stimulus Measurement
 p1 s`15
 p1 s`26.1
 p1 s`37
 p2 s`14.8
 p2 s`26
 p2 s`36.5
 p3 s`14
 p3 s`27
 p3 s`36

As a first step I would imagine that I have to rearrange my data into
a frame more like this

Stimulus  p1   p2  p3
   s1   5  4.8 4
   s2   6.1   67
   s3   7  6.5 6

And then do the pairwise correlations between {p1,p2},{p2,p3}.{p2,p3}

I can do all of this manually, i.e., using some messy case specific
code, but can anyone please point out the best way to do this in a
more generalizable way.

Thanks for any help you can give a novice!

Claus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] ANOVA of a sort

2010-06-09 Thread Claus O'Rourke
Dear R Help,

I have a general question - I know this is the R list, but I hope
someone can help me out a little as I've always found the help here to
be absolutely fantastic.

I have run a psychological study where participants are given multiple
stimuli and their responses to those stimuli are measured on the same
numerical scale, i.e., the data is something like

Participant Stimulus Measurement
  p1 s`15
  p1 s`26.1
  p1 s`37
  p2 s`14.8
  p2 s`26
  p2 s`36.5
  p3 s`14
  p3 s`27
  p3 s`36

I would like to be able to measure the between participant variability
for my data - i.e., determine whether measurements are relatively
homogeneous across participants and whether there are very strange
outliers (i.e., participants who maybe gave random or purposefully
incorrect answers).

Can anyone point me towards the correct type of tests for quantifying
this?  I have read that a repeated measure ANOVA might be a starting
point.

Many many thanks for any help you can give me!

Claus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] substring comparison

2010-04-29 Thread Claus O'Rourke
Thanks. It works perfectly.

On Thu, Apr 29, 2010 at 6:24 PM, Henrique Dallazuanna  wrote:
> Try with grepl:
>
> data$ContainsThe <- ifelse(grepl("the",data$Utt),"y","n")
>
> On Thu, Apr 29, 2010 at 2:17 PM, Claus O'Rourke 
> wrote:
>>
>> Hi all,
>>
>> I'm writing a script to do some basic text analysis in R. Let's assume
>> I have a data frame named data which contains a column named 'utt'
>> which contains strings. Is there a straightforward way to achieve
>> something like this:
>>
>> data$ContainsThe <- ifelse(startsWith(data$Utt,"the"),"y","n")
>>
>> or
>>
>> data$ContainsThe <- ifelse(contains(data$Utt,"the"),"y","n")
>> ?
>>
>> I tried using grep
>> data$ContainsThe <- ifelse(grep("the",data$Utt),"y","n")
>>
>> but this doesn't work becausee grep only returns the rows for which
>> grep succeeded.
>>
>> Thanks for any pointers
>>
>> Claus
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Henrique Dallazuanna
> Curitiba-Paraná-Brasil
> 25° 25' 40" S 49° 16' 22" O
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] substring comparison

2010-04-29 Thread Claus O'Rourke
Hi all,

I'm writing a script to do some basic text analysis in R. Let's assume
I have a data frame named data which contains a column named 'utt'
which contains strings. Is there a straightforward way to achieve
something like this:

data$ContainsThe <- ifelse(startsWith(data$Utt,"the"),"y","n")

or

data$ContainsThe <- ifelse(contains(data$Utt,"the"),"y","n")
?

I tried using grep
data$ContainsThe <- ifelse(grep("the",data$Utt),"y","n")

but this doesn't work becausee grep only returns the rows for which
grep succeeded.

Thanks for any pointers

Claus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Practical work with logistic regression

2010-04-23 Thread Claus O'Rourke
Thanks everyone for the replies, that sure cleared up some things for me.

On Fri, Apr 23, 2010 at 9:11 AM, Jan van der Laan
 wrote:
> When you just want to calculate the probability of belong to class A
> or B of a new observation xi and do not have to do any new model
> estimations or other analyses, the easiest way is probably to write
> the estimated coefficients to a text write and read them in in your
> java/c/whatever program and use them directly to calculate the
> probabilities. This is simply
>
>  p_i = 1/(1+e^(-(b0 + b1 x1i + ... + bk xki)))
>
> with the b0 ... bk your parameters. Easy to implement. Having an
> interface to R for this seems to me overkill.
>
> Regards,
> Jan
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Practical work with logistic regression

2010-04-22 Thread Claus O'Rourke
Dear all,

I have a couple of short noob questions for whoever can take them. I'm
from a very non-stats background so sorry for offending anybody with
stupid questions ! :-)

I have been using logistic regression care of glm to analyse a binary
dependent variable against a couple of independent variables. All has
gone well so far. In my work I have to compare the accuracy of
analysis to a C4.5 machine learning approach. With the machine
learning, a straight-forward measure of the quality of the classifier
is simply the percentage of correctly classified instances. I can
calculate this for the resultant model by comparing predictions to
original values 'manually'. My question: is this not automatically -
or easily - calculated in the produced model or the summary of that
model?

I want to use my model in real time to produce results for new inputs.
Basically this model is to be used as a classifier for a robot in real
time. Can anyone suggest the best way that a produced model can be
used directly in external code once the model has been developed in R?
If my external code is in Java, then using jri is one option. A more
efficient method would be to take the intercept and coefficients and
actually code up the function in the appropriate programming language.
Has anyone ever tried doing this?

Apologies again for the stupid questions, but the sooner I get some of
these things straight, the better.

Claus

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] interpretation of p values for highly correlated logistic analysis

2010-04-01 Thread Claus O'Rourke
Thank you both for your advice. I'll follow up on it, but it is good
to know that this is a known effect.

Claus

On Wed, Mar 31, 2010 at 3:02 PM, Stephan Kolassa  wrote:
> Hi Claus,
>
> welcome to the wonderful world of collinearity (or multicollinearity, as
> some call it)! You have a near linear relationship between some of your
> predictors, which can (and in your case does) lead to extreme parameter
> estimates, which in some cases almost cancel out (a coefficient of +/-40 on
> a categorical variable in logistic regression is a lot, and the intercept
> and two of the roman parameter estimates almost cancel out) but which are
> rather unstable (hence your high p-values).
>
> Belsley, Kuh and Welsch did some work on condition indices and variance
> decomposition proportions, and variance inflation factors are quite popular
> for diagnosing multicollinearity - google these terms for a bit, and
> enlightenment will surely follow.
>
> What can you do? You should definitely think long and hard about your data.
> Should you be doing separate regressions for some factor levels? Should you
> drop a factor from the analysis? Should you do a categorical analogue of
> Principal Components Analysis on your data before the regression? I
> personally have never done this, but correspondence analysis has been
> recommended as a "discrete alternative" to PCA on this list, see a couple of
> books by M. J. Greenacre.
>
> Best of luck!
> Stephan
>
>
> claus orourke schrieb:
>>
>> Dear list,
>>
>> I want to perform a logistic regression analysis with multiple
>> categorical predictors (i.e., a logit) on some data where there is a
>> very definite relationship between one predicator and the
>> response/independent variable. The problem I have is that in such a
>> case the p value goes very high (while I as a naive newbie would
>> expect it to crash towards 0).
>>
>> I'll illustrate my problem with some toy data. Say I have the
>> following data as an input frame:
>>
>>   roman animal colour
>> 1  alpha    dog black
>> 2   beta    cat white
>> 3  alpha    dog black
>> 4  alpha    cat black
>> 5   beta    dog white
>> 6  alpha    cat black
>> 7  gamma    dog white
>> 8  alpha    cat black
>> 9  gamma    dog white
>> 10  beta    cat white
>> 11 alpha    dog black
>> 12 alpha    cat black
>> 13 gamma    dog white
>> 14 alpha    cat black
>> 15  beta    dog white
>> 16  beta    cat black
>> 17 alpha    cat black
>> 18  beta    dog white
>>
>> In this toy data you can see that roman:alpha and roman:beta are
>> pretty good predictors of colour
>>
>> Let's say I perform logistic analysis directly on the raw data with
>> colour as a response variable:
>>
>>> options(contrasts=c("contr.treatment","contr.poly"))
>>> anal1 <- glm(data$colour~data$roman+data$animal,family=binomial)
>>
>> then I find that my P values for each individual level coefficient
>> approach 1:
>>
>> Coefficients:
>>                Estimate Std. Error z value Pr(>|z|)
>> (Intercept)       -41.65   19609.49  -0.002    0.998
>> data$romanbeta     42.35   19609.49   0.002    0.998
>> data$romangamma    43.74   31089.48   0.001    0.999
>> data$animaldog     20.48   13866.00   0.001    0.999
>>
>> while I expect the p value for roman:beta to be quite low because it
>> is a good predictor of colour:white
>>
>> On the other hand, if I then run an anova with a Chi-sq test on the
>> result model, I find as I would expect that 'roman' is a good
>> predictor of colour.
>>
>>> anova(anal1,test="Chisq")
>>
>> Analysis of Deviance Table
>>
>> Model: binomial, link: logit
>>
>> Response: data$colour
>>
>> Terms added sequentially (first to last)
>>
>>
>>            Df Deviance Resid. Df Resid. Dev P(>|Chi|)
>> NULL                           17    24.7306
>> data$roman   2  19.3239        15     5.4067 6.366e-05 ***
>> data$animal  1   1.5876        14     3.8191    0.2077
>> ---
>> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Can anyone please explain why my p value is so high for the individual
>> levels?
>>
>> Sorry for what is likely a stupid question.
>>
>> Claus
>>
>> p.s., when I run logistic analysis on data that is more 'randomised'
>> everything comes out as I expect.
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.