Dear Bert,
Thanks for helping.
Your questions 'answers' why I get the expected behavior if
'group' is a factor. My question was why I don't get the expected
behavior if 'group' is not a factor.
From a theoretical (non-programming) point of view, there is no
difference in a factor with two levels and a 0-1 (bool/integer)
variable (in my case the 1-2 variable 'group'). gam() interprets
these inputs differently, thus distinguishes these cases (and I
was wondering why; In my opinion, this is a purely R/mgcv related
question and belongs here).
As it turned out, the problem was merely the following: By using
factors and thus specifying a GAM, the intercept was 'hidden' in
the estimated coefficients. When using integers as group
variables, this is a glm and there one needs the intercept. The
examples below provide the details.
With best wishes,
Marius
require(mgcv)
n - 10
yrs - 2000+seq_len(n)
loss - c(seq_len(n)+runif(n), 5+seq_len(n)+runif(n))
## Version 1: gam() with 'group' as factor #
set.seed(271)
dat - data.frame(year = rep(yrs, 2),
group = as.factor(rep(1:2, each=n)), # could also be A, B
resp = loss)
fit1 - glm(resp ~ year + group - 1, data=dat)
plot(yrs, fit1$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
xlab=Year, ylab=Response) # fit group A; mean over all
responses in this group
lines (yrs, fit1$fitted.values[n+seq_len(n)], col=blue) # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B
## Version 2: gam() with 'group' as numeric (= glm) ###
set.seed(271)
dat - data.frame(year = rep(yrs, 2),
group = rep(1:2, each=n), # could also be 0:1
resp = loss)
fit2 - glm(resp ~ year + group - 1, data=dat) # (*)
plot(yrs, fit2$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
xlab=Year, ylab=Response) # fit group A; mean over all
responses in this group
lines (yrs, fit2$fitted.values[n+seq_len(n)], col=blue) # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B
## Note: without '-1' (intercept) in (*), an unexpected behavior results
## Explanation:
## S. Wiki GAM (without beta_0):
##g(E(Y)) = f_1(x_1) + f_2(x_2)
## where f_i(x_i) may be functions with a specified parametric form
(for example a
## polynomial, or a coefficient depending on the levels of a factor variable)
## = for f_i's being coefficients (numbers) beta_i, this is a GLM:
##g(E(Y)) = beta_1 x_1 + beta_2 x_2 (x_1 = year, x_2 = group)
## Problem: (*) does not specify an intercept and thus the lines are
not picked up correctly
fit2$coefficients
## Version 3: Version 2 with intercept #
set.seed(271)
dat - data.frame(year = rep(yrs, 2),
group = rep(1:2, each=n), # could also be 0:1
resp = loss)
fit3 - glm(resp ~ year + group, data=dat) # now with intercept
plot(yrs, fit3$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
xlab=Year, ylab=Response) # fit group A; mean over all
responses in this group
lines (yrs, fit3$fitted.values[n+seq_len(n)], col=blue) # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B
## = correct/as expected
fit3$coefficients
## Note: in Version 1, the intercept is already included in the group
coefficients:
fit1$coefficients
On Tue, Oct 29, 2013 at 9:31 PM, Bert Gunter gunter.ber...@gene.com wrote:
Think about it. How can one define a smooth term with a factor???
Further discussion is probably offtopic. Post on
stats.stackexchange.com if it still isn't obvious.
Cheers,
Bert
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.