Re: [R] (gam) formula: Why different results for terms being factor vs. numeric?

2013-11-02 Thread Marius Hofert
Dear Bert,

Thanks for helping.

Your questions 'answers' why I get the expected behavior if
'group' is a factor. My question was why I don't get the expected
behavior if 'group' is not a factor.

From a theoretical (non-programming) point of view, there is no
difference in a factor with two levels and a 0-1 (bool/integer)
variable (in my case the 1-2 variable 'group'). gam() interprets
these inputs differently, thus distinguishes these cases (and I
was wondering why; In my opinion, this is a purely R/mgcv related
question and belongs here).

As it turned out, the problem was merely the following: By using
factors and thus specifying a GAM, the intercept was 'hidden' in
the estimated coefficients. When using integers as group
variables, this is a glm and there one needs the intercept. The
examples below provide the details.

With best wishes,

Marius



require(mgcv)
n - 10
yrs - 2000+seq_len(n)
loss - c(seq_len(n)+runif(n), 5+seq_len(n)+runif(n))


## Version 1: gam() with 'group' as factor #

set.seed(271)
dat - data.frame(year  = rep(yrs, 2),
  group = as.factor(rep(1:2, each=n)), # could also be A, B
  resp  = loss)
fit1 - glm(resp ~ year + group - 1, data=dat)
plot(yrs, fit1$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
 xlab=Year, ylab=Response) # fit group A; mean over all
responses in this group
lines (yrs, fit1$fitted.values[n+seq_len(n)], col=blue) # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B


## Version 2: gam() with 'group' as numeric (= glm) ###

set.seed(271)
dat - data.frame(year  = rep(yrs, 2),
  group = rep(1:2, each=n), # could also be 0:1
  resp  = loss)
fit2 - glm(resp ~ year + group - 1, data=dat) # (*)
plot(yrs, fit2$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
 xlab=Year, ylab=Response) # fit group A; mean over all
responses in this group
lines (yrs, fit2$fitted.values[n+seq_len(n)], col=blue) # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B

## Note: without '-1' (intercept) in (*), an unexpected behavior results
## Explanation:
## S. Wiki GAM (without beta_0):
##g(E(Y)) = f_1(x_1) + f_2(x_2)
## where f_i(x_i) may be functions with a specified parametric form
(for example a
## polynomial, or a coefficient depending on the levels of a factor variable)
## = for f_i's being coefficients (numbers) beta_i, this is a GLM:
##g(E(Y)) = beta_1 x_1 + beta_2 x_2 (x_1 = year, x_2 = group)
## Problem: (*) does not specify an intercept and thus the lines are
not picked up correctly
fit2$coefficients


## Version 3: Version 2 with intercept #

set.seed(271)
dat - data.frame(year  = rep(yrs, 2),
  group = rep(1:2, each=n), # could also be 0:1
  resp  = loss)
fit3 - glm(resp ~ year + group, data=dat) # now with intercept
plot(yrs, fit3$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
 xlab=Year, ylab=Response) # fit group A; mean over all
responses in this group
lines (yrs, fit3$fitted.values[n+seq_len(n)], col=blue) # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B
## = correct/as expected
fit3$coefficients

## Note: in Version 1, the intercept is already included in the group
coefficients:
fit1$coefficients



On Tue, Oct 29, 2013 at 9:31 PM, Bert Gunter gunter.ber...@gene.com wrote:
 Think about it. How can one define a smooth term with a factor???

 Further discussion is probably offtopic. Post on
 stats.stackexchange.com if it still isn't obvious.

 Cheers,
 Bert

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] (gam) formula: Why different results for terms being factor vs. numeric?

2013-10-29 Thread Marius Hofert
Dear expeRts,

If I specify group = as.factor(rep(1:2, each=n)) in the below
definition of dat, I get the expected behavior I am looking for. I
wonder why I
don't get it if group is *not* a factor... My guess was that,
internally, factors are treated as natural numbers (and this indeed
seems to be true if you convert the latter to factors [essentially
meaning changing the levels]), but replacing factors by numeric values
(as below) does not provide the same answer.

Cheers,
Marius


require(mgcv)

n - 10
yrs - 2000+seq_len(n)
set.seed(271)
dat - data.frame(year  = rep(yrs, 2),
  group = rep(1:2, each=n), # *not* a factor
(as.factor() provides the expected behavior)
  resp  = c(seq_len(n)+runif(n), 5+seq_len(n)+runif(n)))
fit3 - gam(resp ~ year + group - 1, data=dat)
plot(yrs, fit3$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
 xlab=Year, ylab=Response) # fit group A; mean over all
responses in this group
lines (yrs, fit3$fitted.values[n+seq_len(n)], col=blue) # fit group
B; mean over all responses in this group
points(yrs, dat$resp[seq_len(n)]) # actual response group A
points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B
## = hmmm... because it is not a factor (?), this does not give an
expected answer,
##but gam() still correctly figures out that there are two groups

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] (gam) formula: Why different results for terms being factor vs. numeric?

2013-10-29 Thread Bert Gunter
Think about it. How can one define a smooth term with a factor???

Further discussion is probably offtopic. Post on
stats.stackexchange.com if it still isn't obvious.

Cheers,
Bert

On Tue, Oct 29, 2013 at 1:16 PM, Marius Hofert
marius.hof...@math.ethz.ch wrote:
 Dear expeRts,

 If I specify group = as.factor(rep(1:2, each=n)) in the below
 definition of dat, I get the expected behavior I am looking for. I
 wonder why I
 don't get it if group is *not* a factor... My guess was that,
 internally, factors are treated as natural numbers (and this indeed
 seems to be true if you convert the latter to factors [essentially
 meaning changing the levels]), but replacing factors by numeric values
 (as below) does not provide the same answer.

 Cheers,
 Marius


 require(mgcv)

 n - 10
 yrs - 2000+seq_len(n)
 set.seed(271)
 dat - data.frame(year  = rep(yrs, 2),
   group = rep(1:2, each=n), # *not* a factor
 (as.factor() provides the expected behavior)
   resp  = c(seq_len(n)+runif(n), 5+seq_len(n)+runif(n)))
 fit3 - gam(resp ~ year + group - 1, data=dat)
 plot(yrs, fit3$fitted.values[seq_len(n)], type=l, ylim=range(dat$resp),
  xlab=Year, ylab=Response) # fit group A; mean over all
 responses in this group
 lines (yrs, fit3$fitted.values[n+seq_len(n)], col=blue) # fit group
 B; mean over all responses in this group
 points(yrs, dat$resp[seq_len(n)]) # actual response group A
 points(yrs, dat$resp[n+seq_len(n)], col=blue) # actual response group B
 ## = hmmm... because it is not a factor (?), this does not give an
 expected answer,
 ##but gam() still correctly figures out that there are two groups

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



-- 

Bert Gunter
Genentech Nonclinical Biostatistics

(650) 467-7374

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.