Hi all,

I have a question about a linear model with interactions:

I created a data frame df like this:

>df
        V1 V2 V3 V4 V5
1 6.414094  c  t  a  g
2 6.117286  t  a  g  t
3 5.756922  a  g  t  g
4 6.090402  g  t  g  t
...

which holds the response in the first column and letters (a, c, g, t) in the
other columns. I would like to see whether there are interactions between
neighbouring letters, so I defined the following linear model:

>lm<-lm(df[,1] ~ (df[,2]:df[,3]) + (df[,3]:df[,4]) + (df[,4]:df[,5]) )

The result looks like this:
Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)          8.8987163  0.0211457 420.828  < 2e-16 ***
df[, 2]a:df[, 3]a   -0.1021543  0.0253486  -4.030 5.59e-05 ***
df[, 2]c:df[, 3]a    0.0535562  0.0255685   2.095 0.036213 *
df[, 2]g:df[, 3]a    0.0224073  0.0318965   0.703 0.482372
df[, 2]t:df[, 3]a    0.0024165  0.0259862   0.093 0.925911
df[, 2]a:df[, 3]c    0.0355502  0.0260197   1.366 0.171861
df[, 2]c:df[, 3]c    0.0433014  0.0252535   1.715 0.086415 .
df[, 2]g:df[, 3]c    0.1472222  0.0309441   4.758 1.97e-06 ***
df[, 2]t:df[, 3]c    0.0613779  0.0270601   2.268 0.023323 *
df[, 2]a:df[, 3]g    0.0646498  0.0299286   2.160 0.030770 *
df[, 2]c:df[, 3]g    0.1302731  0.0359439   3.624 0.000290 ***
df[, 2]g:df[, 3]g    0.1512754  0.0360951   4.191 2.78e-05 ***
df[, 2]t:df[, 3]g    0.1084278  0.0339142   3.197 0.001389 **
df[, 2]a:df[, 3]t   -0.0249016  0.0262402  -0.949 0.342633
df[, 2]c:df[, 3]t    0.0860302  0.0253518   3.393 0.000691 ***
df[, 2]g:df[, 3]t    0.0241031  0.0358496   0.672 0.501372
df[, 2]t:df[, 3]t           NA         NA      NA       NA
df[, 3]a:df[, 4]1   -0.0970149  0.0143730  -6.750 1.50e-11 ***
df[, 3]c:df[, 4]1   -0.0153732  0.0152519  -1.008 0.313486
df[, 3]g:df[, 4]1   -0.0706682  0.0225665  -3.132 0.001740 **
df[, 3]t:df[, 4]1   -0.0581889  0.0158485  -3.672 0.000241 ***
df[, 3]a:df[, 4]2    0.0485333  0.0150167   3.232 0.001231 **
df[, 3]c:df[, 4]2   -0.0790008  0.0150513  -5.249 1.54e-07 ***
df[, 3]g:df[, 4]2    0.0604465  0.0217557   2.778 0.005465 **
df[, 3]t:df[, 4]2    0.0232283  0.0167224   1.389 0.164826
df[, 3]a:df[, 4]3    0.0740046  0.0182221   4.061 4.89e-05 ***
df[, 3]c:df[, 4]3    0.0797502  0.0234485   3.401 0.000672 ***
df[, 3]g:df[, 4]3    0.0720160  0.0253456   2.841 0.004495 **
df[, 3]t:df[, 4]3    0.0778484  0.0221196   3.519 0.000433 ***
df[, 4]a:df[, 5]1   -0.0916618  0.0143707  -6.378 1.81e-10 ***
df[, 4]c:df[, 5]1   -0.0138048  0.0152609  -0.905 0.365691
df[, 4]g:df[, 5]1   -0.0700765  0.0225639  -3.106 0.001900 **
df[, 4]t:df[, 5]1   -0.0734513  0.0158534  -4.633 3.62e-06 ***
df[, 4]a:df[, 5]2    0.0438002  0.0150128   2.918 0.003531 **
df[, 4]c:df[, 5]2   -0.1107056  0.0150634  -7.349 2.04e-13 ***
df[, 4]g:df[, 5]2    0.0652739  0.0217520   3.001 0.002694 **
df[, 4]t:df[, 5]2    0.0219305  0.0167259   1.311 0.189811
df[, 4]a:df[, 5]3    0.0804106  0.0182290   4.411 1.03e-05 ***
df[, 4]c:df[, 5]3    0.0970780  0.0234745   4.135 3.55e-05 ***
df[, 4]g:df[, 5]3    0.0704516  0.0253372   2.781 0.005430 **
df[, 4]t:df[, 5]3    0.0911914  0.0221237   4.122 3.77e-05 ***
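
The "not defined because of singularities" note and the coefficient labels can be examined on a small simulated example. The following is only a sketch under stated assumptions: the data are made up (one row per combination of the four letters, random response), the column names V1..V5 are taken from the printed data frame, and the names bases, fit and X are illustrative. It shows where to look, not why the real fit was labelled as it was.

```r
# Simulated stand-in for the real data (assumption): every combination of
# the four letters in V2..V5 (256 rows) with a random numeric response V1.
set.seed(1)
bases <- c("a", "c", "g", "t")
df <- expand.grid(V2 = bases, V3 = bases, V4 = bases, V5 = bases)
df$V1 <- rnorm(nrow(df), mean = 6, sd = 0.3)

# Same interaction-only model, written with column names and `data`.
fit <- lm(V1 ~ V2:V3 + V3:V4 + V4:V5, data = df)

X <- model.matrix(fit)                  # design matrix that lm actually used
colnames(X)                             # how each interaction column is coded and named
getOption("contrasts")                  # contrast settings in effect for factors
c(columns = ncol(X), rank = fit$rank)   # rank < columns means aliased columns
names(coef(fit))[is.na(coef(fit))]      # coefficients reported as NA
```

Comparing colnames(X) with getOption("contrasts") on the real data may help to see which factors were coded as full indicator columns and which through contrasts.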



Questions:
1.) Why does lm change the names of the interaction terms after the first
undefined coefficient, from a:a, c:a, g:a, ... to a:1, c:1, g:1, ..., and
why does it apparently omit the interaction terms of the form a:4, c:4,
g:4, t:4, which (if I assume correctly) would correspond to a:t, c:t, g:t,
t:t?

2.) How do I correctly define a data frame new_df for a new sequence of
letters so that the predict function returns the predicted response? I tried
something like this:

>new_df[2:5]=as.data.frame(t('g'))
>new_df[1]=0
>predict(lm, new_df)

I also tried a row of the original data frame that was used to fit the model:
>predict(lm, df[1,])

In both cases predict outputs the predicted values for all rows of the
previously fitted linear model and gives these warning messages:
1: 'newdata' had 1 rows but variable(s) found have 7020 rows
2: In predict.lm(lm_pm, new_df) :
  prediction from a rank-deficient fit may be misleading
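
The first warning is what predict gives when the model formula contains expressions such as df[, 2]: these are evaluated against the original df object, so the columns of newdata are never matched. A minimal sketch of the usual alternative follows, assuming simulated data and the column names V1..V5 from the printed data frame; bases, fit, new_df and pred are illustrative names, and the sequence "gggg" is just an example.

```r
# Simulated stand-in for the real data (assumption): all letter combinations
# in V2..V5 with a random numeric response V1.
set.seed(1)
bases <- c("a", "c", "g", "t")
df <- expand.grid(V2 = bases, V3 = bases, V4 = bases, V5 = bases)
df$V1 <- rnorm(nrow(df), mean = 6, sd = 0.3)

# Refer to the columns by name and pass the data frame via `data`, so that
# predict() can look the variables up by name in `newdata`.
fit <- lm(V1 ~ V2:V3 + V3:V4 + V4:V5, data = df)

# One new sequence "gggg": same column names and factor levels as in df.
# The response column V1 is not needed for prediction.
new_df <- data.frame(
  V2 = factor("g", levels = bases),
  V3 = factor("g", levels = bases),
  V4 = factor("g", levels = bases),
  V5 = factor("g", levels = bases)
)

# Returns one value per row of newdata. A rank-deficiency warning can still
# appear, because the interaction-only model has an aliased coefficient.
pred <- predict(fit, newdata = new_df)
pred_row1 <- predict(fit, newdata = df[1, ])
```

With the formula written this way, the number of predictions follows the number of rows in newdata rather than the number of rows used for fitting.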

Thanks for any help,
Marian
