Hi all, I have a question about a linear model with interactions:
I created a data frame df like this:

> df
        V1 V2 V3 V4 V5
1 6.414094  c  t  a  g
2 6.117286  t  a  g  t
3 5.756922  a  g  t  g
4 6.090402  g  t  g  t
...

which holds the response in the first column and letters (a, c, g, t) in the other columns. I am interested to see whether there are interactions between neighbouring letters, so I have defined the following linear model:

> lm <- lm(df[,1] ~ (df[,2]:df[,3]) + (df[,3]:df[,4]) + (df[,4]:df[,5]))

The result then looks like this:

Coefficients: (1 not defined because of singularities)
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)        8.8987163  0.0211457 420.828  < 2e-16 ***
df[, 2]a:df[, 3]a -0.1021543  0.0253486  -4.030 5.59e-05 ***
df[, 2]c:df[, 3]a  0.0535562  0.0255685   2.095 0.036213 *
df[, 2]g:df[, 3]a  0.0224073  0.0318965   0.703 0.482372
df[, 2]t:df[, 3]a  0.0024165  0.0259862   0.093 0.925911
df[, 2]a:df[, 3]c  0.0355502  0.0260197   1.366 0.171861
df[, 2]c:df[, 3]c  0.0433014  0.0252535   1.715 0.086415 .
df[, 2]g:df[, 3]c  0.1472222  0.0309441   4.758 1.97e-06 ***
df[, 2]t:df[, 3]c  0.0613779  0.0270601   2.268 0.023323 *
df[, 2]a:df[, 3]g  0.0646498  0.0299286   2.160 0.030770 *
df[, 2]c:df[, 3]g  0.1302731  0.0359439   3.624 0.000290 ***
df[, 2]g:df[, 3]g  0.1512754  0.0360951   4.191 2.78e-05 ***
df[, 2]t:df[, 3]g  0.1084278  0.0339142   3.197 0.001389 **
df[, 2]a:df[, 3]t -0.0249016  0.0262402  -0.949 0.342633
df[, 2]c:df[, 3]t  0.0860302  0.0253518   3.393 0.000691 ***
df[, 2]g:df[, 3]t  0.0241031  0.0358496   0.672 0.501372
df[, 2]t:df[, 3]t         NA         NA      NA       NA
df[, 3]a:df[, 4]1 -0.0970149  0.0143730  -6.750 1.50e-11 ***
df[, 3]c:df[, 4]1 -0.0153732  0.0152519  -1.008 0.313486
df[, 3]g:df[, 4]1 -0.0706682  0.0225665  -3.132 0.001740 **
df[, 3]t:df[, 4]1 -0.0581889  0.0158485  -3.672 0.000241 ***
df[, 3]a:df[, 4]2  0.0485333  0.0150167   3.232 0.001231 **
df[, 3]c:df[, 4]2 -0.0790008  0.0150513  -5.249 1.54e-07 ***
df[, 3]g:df[, 4]2  0.0604465  0.0217557   2.778 0.005465 **
df[, 3]t:df[, 4]2  0.0232283  0.0167224   1.389 0.164826
df[, 3]a:df[, 4]3  0.0740046  0.0182221   4.061 4.89e-05 ***
df[, 3]c:df[, 4]3  0.0797502  0.0234485   3.401 0.000672 ***
df[, 3]g:df[, 4]3  0.0720160  0.0253456   2.841 0.004495 **
df[, 3]t:df[, 4]3  0.0778484  0.0221196   3.519 0.000433 ***
df[, 4]a:df[, 5]1 -0.0916618  0.0143707  -6.378 1.81e-10 ***
df[, 4]c:df[, 5]1 -0.0138048  0.0152609  -0.905 0.365691
df[, 4]g:df[, 5]1 -0.0700765  0.0225639  -3.106 0.001900 **
df[, 4]t:df[, 5]1 -0.0734513  0.0158534  -4.633 3.62e-06 ***
df[, 4]a:df[, 5]2  0.0438002  0.0150128   2.918 0.003531 **
df[, 4]c:df[, 5]2 -0.1107056  0.0150634  -7.349 2.04e-13 ***
df[, 4]g:df[, 5]2  0.0652739  0.0217520   3.001 0.002694 **
df[, 4]t:df[, 5]2  0.0219305  0.0167259   1.311 0.189811
df[, 4]a:df[, 5]3  0.0804106  0.0182290   4.411 1.03e-05 ***
df[, 4]c:df[, 5]3  0.0970780  0.0234745   4.135 3.55e-05 ***
df[, 4]g:df[, 5]3  0.0704516  0.0253372   2.781 0.005430 **
df[, 4]t:df[, 5]3  0.0911914  0.0221237   4.122 3.77e-05 ***

Questions:

1.) What could be the reason that the lm function changes the names of the interaction terms (after the first undefined coefficient) from a:a, c:a, g:a, ... to a:1, c:1, g:1, ... and apparently omits the direct calculation of interaction terms of the form a:4, c:4, g:4, t:4, which (if I assume correctly) correspond to a:t, c:t, g:t, t:t?

2.)
How do I correctly define a data frame new_df for a new sequence of letters so that the predict function returns the predicted response? I tried something like this:

> new_df[2:5] = as.data.frame(t('g'))
> new_df[1] = 0
> predict(lm, new_df)

and also a row of the original data frame that was used to fit the model:

> predict(lm, df[1,])

Either way, predict outputs all predicted values with respect to the previously fitted linear model and gives

Warning messages:
1: 'newdata' had 1 rows but variable(s) found have 7020 rows
2: In predict.lm(lm_pm, new_df) :
  prediction from a rank-deficient fit may be misleading

Thanks for any help,
Marian
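
P.S. In case it helps, here is a minimal self-contained sketch of the setup with simulated data. The values, the sample size of 500, the seed, and the names random_bases, fit_indexed, fit_named, pred_indexed and pred_named are all made up for illustration; this is not my real data. It contrasts the df[, i] formula from my post with the same interactions written with column names and data = df, on the assumption that predict can only use newdata when the formula refers to variables by name:

```r
# Simulated stand-in for the real data: all values below are made up.
set.seed(1)
bases <- c("a", "c", "g", "t")
n <- 500  # arbitrary sample size, chosen only for illustration

# Draw n random letters as a factor with a fixed set of levels.
random_bases <- function() {
  factor(sample(bases, n, replace = TRUE), levels = bases)
}

df <- data.frame(
  V1 = rnorm(n, mean = 6, sd = 0.3),  # response
  V2 = random_bases(),
  V3 = random_bases(),
  V4 = random_bases(),
  V5 = random_bases()
)

# Formula as in my post: the terms refer to df[, i] directly.
fit_indexed <- lm(df[, 1] ~ df[, 2]:df[, 3] + df[, 3]:df[, 4] + df[, 4]:df[, 5])

# Same neighbour interactions, written with column names and data = df.
fit_named <- lm(V1 ~ V2:V3 + V3:V4 + V4:V5, data = df)

# One new sequence "gggg"; the factor levels match those used for fitting.
new_df <- data.frame(
  V2 = factor("g", levels = bases),
  V3 = factor("g", levels = bases),
  V4 = factor("g", levels = bases),
  V5 = factor("g", levels = bases)
)

# new_df contains no object named df, so the terms df[, i] are evaluated on
# the original df and one value per original row comes back.
# Warnings are suppressed here only to keep the output short.
pred_indexed <- suppressWarnings(predict(fit_indexed, newdata = new_df))

# The named fit finds V2..V5 in new_df and returns a single value.
# The rank-deficiency warning would still be raised without suppressWarnings.
pred_named <- suppressWarnings(predict(fit_named, newdata = new_df))

length(pred_indexed)
length(pred_named)
```

With the named formula the number of predictions matches the number of rows in new_df, whereas the indexed formula returns as many values as the original df has rows, which looks like the behaviour behind my first warning. The sketch does not address the rank deficiency itself.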