Re: [R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?

Michael Dewey Thu, 09 Jun 2022 06:53:07 -0700

Dear Jade

Do you really need to fit a separate parameter for each heaping day? Canyou not just make it a binary predictor or a categorical one with fewerlevels, perhaps 14 (for heaping in each year) or 12 (for each calendarmonth). I have no idea whether that would help but it seems worth a try.


Michael

On 08/06/2022 18:15, jade.shodan--- via R-help wrote:

Hi Simon,

Thanks so much for this!! I have two follow up questions, if you don't mind.

1. Does including an autoregressive term not adjust away part of the
effect of the response in a distributed lag model (where the outcome
accumulates over time)?
2. I've tried to fit a model using bam (just a first attempt without
AR term), but including the factor variable heap creates errors:

bam0 <- bam(deaths~te(year, month, week, weekday,
bs=c("cr","cc","cc","cc"), k = c(4,5,5,5)) + heap +
                       te(temp_max, lag, k=c(8, 3)) +
                       te(precip_daily_total, lag, k=c(8, 3)),
                       data = dat, family = nb, method = 'fREML',
select = TRUE, discrete = TRUE,
                       knots = list(month = c(0.5, 12.5), week = c(0.5,
52.5), weekday = c(0, 6.5)))

This model results in errors:

Warning in estimate.theta(theta, family, y, mu, scale = scale1, wt = G$w,  :
   step failure in theta estimation
Warning in sqrt(family$dev.resids(object$y, object$fitted.values,
object$prior.weights)) :
   NaNs produced


Including heap as as.numeric(heap) runs the model without error
messages or warnings, but model diagnostics look terrible, and it also
doesn't make sense (to me) to make heap a numeric. The factor variable
heap (with 169 levels) codes the fact that all deaths for which no
date was known, were registered on the 15th day of each month. I've
coded all non-heaping days as 0. All heaping days were coded as a
value between 1-168. The time series spans 14 years, so a heaping day
in each month results in 14*12 levels = 168, plus one level for
non-heaping days.

So my second question is: Does bam allow factor variables? And if not,
how should I model this heaping on the 15th day of the month instead?

With thanks,

Jade

On Wed, 8 Jun 2022 at 12:05, Simon Wood <simon.w...@bath.edu> wrote:


I would not worry too much about high concurvity between variables like
temperature and time. This just reflects the fact that temperature has a
strong temporal pattern.

I would also not be too worried about the low p-values on the k check.
The check only looks for pattern in the residuals when they are ordered
with respect to the variables of a smooth. When you have time series
data and some smooths involve time then it's hard not to pick up some
degree of residual auto-correlation, which you often would not want to
model with a higher rank smoother.

The NAs  for the distributed lag terms just reflect the fact that there
is no obvious way to order the residuals w.r.t. the covariates for such
terms, so the simple check for residual pattern is not really possible.

One simple approach is to fit using bam(...,discrete=TRUE) which will
let you specify an AR1 parameter to mop up some of the residual
auto-correlation without resorting to a high rank smooth that then does
all the work of the covariates as well. The AR1 parameter can be set by
looking at the ACF of the residuals of the model without this. You need
to look at the ACF of suitably standardized residuals to check how well
this has worked.

best,

Simon

On 05/06/2022 20:01, jade.shodan--- via R-help wrote:

Hello everyone,

A few days ago I asked a question about concurvity in a GAM (the
anologue of collinearity in a GLM) implemented in mgcv. I think my
question was a bit unfocussed, so I am retrying again, but with
additional information included about the autocorrelation function. I
have also posted about this on Cross Validated. Given all the model
output, it might make for easier
reading:https://stats.stackexchange.com/questions/577790/high-concurvity-collinearity-between-time-and-temperature-in-gam-predicting-dea

As mentioned previously, I have problems with concurvity in my thesis
research, and don't have access to a statistician who works with time
series, GAMs or R. I'd be very grateful for any (partial) answer,
however short. I'll gladly return the favour where I can! For really
helpful input I'd be more than happy to offer co-authorship on
publication. Deadlines are very close, and I'm heading towards having
no results at all if I can't solve this concurvity issue :(

I'm using GAMs to try to understand the relationship between deaths
and heat-related variables (e.g. temperature and humidity), using
daily time series over a 14-year period from a tropical, low-income
country. My aim is to understand the relationship between these
variables and deaths, rather than pure prediction performance.

The GAMs include distributed lag models (set up as 7-column matrices,
see code at bottom of post), since deaths may occur over several days
following exposure.

Simple GAMs with just time, lagged temperature and lagged
precipitation (a potential confounder) show very high concurvity
between lagged temperature and time, regardless of the many different
ways I have tried to decompose time. The autocorrelation functions
(ACF) however, shows values close to zero, only just about breaching
the 'significance line' in a few instances. It does show patterning
though, although the regularity is difficult to define.

My questions are:
1) Should I be worried about the high concurvity, or can I ignore it
given the mostly non-significant ACF? I've read dozens of
heat-mortality modelling studies and none report on concurvity between
weather variables and time (though one 2012 paper discussed
autocorrelation).

2) If I cannot ignore it, what should I do to resolve it? Would
including an autoregressive term be appropriate, and if so, where can
I find a coded example of how to do this? I've also come across
sequential regression][1]. Would this be more or less appropriate? If
appropriate, a pointer to an example would be really appreciated!

Some example GAMs are specified as follows:
```r
conc38b <- gam(deaths~te(year, month, week, weekday,
bs=c("cr","cc","cc","cc")) + heap +
                        te(temp_max, lag, k=c(10, 3)) +
                        te(precip_daily_total, lag, k=c(10, 3)),
                        data = dat, family = nb, method = 'REML', select = TRUE,
                        knots = list(month = c(0.5, 12.5), week = c(0.5,
52.5), weekday = c(0, 6.5)))
```
Concurvity for the above model between (temp_max, lag) and (year,
month, week, weekday) is 0.91:

```r
$worst
                                      para te(year,month,week,weekday)
te(temp_max,lag) te(precip_daily_total,lag)
para                        1.000000e+00                1.125625e-29
       0.3150073                  0.6666348
te(year,month,week,weekday) 1.400648e-29                1.000000e+00
       0.9060552                  0.6652313
te(temp_max,lag)            3.152795e-01                8.998113e-01
       1.0000000                  0.5781015
te(precip_daily_total,lag)  6.666348e-01                6.652313e-01
       0.5805159                  1.0000000
```

Output from ```gam.check()```:
```r
Method: REML   Optimizer: outer newton
full convergence after 16 iterations.
Gradient range [-0.01467332,0.003096643]
(score 8915.994 & scale 1).
Hessian positive definite, eigenvalue range [5.048053e-05,26.50175].
Model rank =  544 / 544

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                                    k'      edf k-index p-value
te(year,month,week,weekday) 319.0000  26.6531    0.96    0.06 .
te(temp_max,lag)             29.0000   3.3681      NA      NA
te(precip_daily_total,lag)   27.0000   0.0051      NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

Some output from ```summary(conc38b)```:
```r
Approximate significance of smooth terms:
                                    edf Ref.df  Chi.sq p-value
te(year,month,week,weekday) 26.653127    319 166.803 < 2e-16 ***
te(temp_max,lag)             3.368129     27  11.130 0.00145 **
te(precip_daily_total,lag)   0.005104     27   0.002 0.69636
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.839   Deviance explained = 53.3%
-REML =   8916  Scale est. = 1         n = 5107
```


Below are the ACF plots (note limit y-axis = 0.1 for clarity of
pattern). They show peaks at 5 and 15, and then there seems to be a
recurring pattern at multiples of approx. 30 (suggesting month is not
modelled adequately?). Not sure what would cause the spikes at 5 and
15. There is heaping of deaths on the 15th day of each month, to which
deaths with unknown date were allocated. This heaping was modelled
with categorical variable/ factor ```heap``` with 169 levels (0 for
all non-heaping days and 1-168 (i.e. 14 * 12 for each subsequent
heaping day over the 14-year period):

    [2]: https://i.stack.imgur.com/FzKyM.png
    [3]: https://i.stack.imgur.com/fE3aL.png


I get an identical looking ACF when I decompose time into (year,
month, monthday) as in model conc39 below, although concurvity between
(temp_max, lag) and the time term has now dropped somewhat to 0.83:

```r
conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
                       te(temp_max, lag, k=c(10, 4)) +
                       te(precip_daily_total, lag, k=c(10, 4)),
                       data = dat, family = nb, method = 'REML', select = TRUE,
                       knots = list(month = c(0.5, 12.5)))
```
```r

Method: REML   Optimizer: outer newton
full convergence after 14 iterations.
Gradient range [-0.001578187,6.155096e-05]
(score 8915.763 & scale 1).
Hessian positive definite, eigenvalue range [1.894391e-05,26.99215].
Model rank =  323 / 323

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                                  k'     edf k-index p-value
te(year,month,monthday)    79.0000 25.0437    0.93  <2e-16 ***
te(temp_max,lag)           39.0000  4.0875      NA      NA
te(precip_daily_total,lag) 36.0000  0.0107      NA      NA
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```
Some output from ```summary(conc39)```:
```r
Approximate significance of smooth terms:
                                  edf Ref.df  Chi.sq  p-value
te(year,month,monthday)    38.75573     99 187.231  < 2e-16 ***
te(temp_max,lag)            4.06437     37  25.927 1.66e-06 ***
te(precip_daily_total,lag)  0.01173     36   0.008    0.557
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.839   Deviance explained = 53.8%
-REML =   8915  Scale est. = 1         n = 5107
```


```r
$worst
                                     para te(year,month,monthday)
te(temp_max,lag) te(precip_daily_total,lag)
para                       1.000000e+00            3.261007e-31
0.3313549                  0.6666532
te(year,month,monthday)    3.060763e-31            1.000000e+00
0.8266086                  0.5670777
te(temp_max,lag)           3.331014e-01            8.225942e-01
1.0000000                  0.5840875
te(precip_daily_total,lag) 6.666532e-01            5.670777e-01
0.5939380                  1.0000000
```

Modelling time as ```te(year, doy)``` with a cyclic spline for doy and
various choices for k or as ```s(time)``` with various k does not
reduce concurvity either.


The default approach in time series studies of heat-mortality is to
model time with fixed df, generally between 7-10 df per year of data.
I am, however, apprehensive about this approach because a) mortality
profiles vary with locality due to sociodemographic and environmental
characteristics and b) the choice of df is based on higher income
countries (where nearly all these studies have been done) with
different mortality profiles and so may not be appropriate for
tropical, low-income countries.

Although the approach of fixing (high) df does remove more temporal
patterns from the ACF (see model and output below), concurvity between
time and lagged temperature has now risen to 0.99! Moreover,
temperature (which has been a consistent, highly significant predictor
in every model of the tens (hundreds?) I have run, has now turned
non-significant. I am guessing this is because time is now a very
wiggly function that not only models/ removes seasonal variation, but
also some of the day-to-day variation that is needed for the
temperature smooth  :

```r
conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
                        te(temp_max, lag, k=c(10,3)) +
                        te(precip_daily_total, lag, k=c(10,3)),
                        data = dat, family = nb, method = 'REML', select = TRUE)
```
Output from ```gam.check(conc20a, rep = 1000)```:

```r
Method: REML   Optimizer: outer newton
full convergence after 9 iterations.
Gradient range [-0.0008983099,9.546022e-05]
(score 8750.13 & scale 1).
Hessian positive definite, eigenvalue range [0.0001420112,15.40832].
Model rank =  336 / 336

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

                                   k'      edf k-index p-value
s(time)                    111.0000 111.0000    0.98    0.56
te(temp_max,lag)            29.0000   0.6548      NA      NA
te(precip_daily_total,lag)  27.0000   0.0046      NA      NA
```
Output from ```concurvity(conc20a, full=FALSE)$worst```:

```r
                                     para      s(time) te(temp_max,lag)
te(precip_daily_total,lag)
para                       1.000000e+00 2.462064e-19        0.3165236
                  0.6666348
s(time)                    2.462398e-19 1.000000e+00        0.9930674
                  0.6879284
te(temp_max,lag)           3.170844e-01 9.356384e-01        1.0000000
                  0.5788711
te(precip_daily_total,lag) 6.666348e-01 6.879284e-01        0.5788381
                  1.0000000

```

Some output from ```summary(conc20a)```:
```r
Approximate significance of smooth terms:
                                   edf Ref.df  Chi.sq p-value
s(time)                    1.110e+02    111 419.375  <2e-16 ***
te(temp_max,lag)           6.548e-01     27   0.895   0.249
te(precip_daily_total,lag) 4.598e-03     27   0.002   0.868
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.843   Deviance explained = 56.1%
-REML = 8750.1  Scale est. = 1         n = 5107
```

ACF functions:

[4]: https://i.stack.imgur.com/7nbXS.png
[5]: https://i.stack.imgur.com/pNnZU.png

Data can be found on my [GitHub][6] site in the file
[data_cross_validated_post2.rds][7]. A csv version is also available.
This is my code:

```r
library(readr)
library(mgcv)

df <- read_rds("data_crossvalidated_post2.rds")

# Create matrices for lagged weather variables (6 day lags) based on
example by Simon Wood
# in his 2017 book ("Generalized additive models: an introduction with
R", p. 349) and
# gamair package documentation
(https://cran.r-project.org/web/packages/gamair/gamair.pdf, p. 54)

lagard <- function(x,n.lag=7) {
n <- length(x); X <- matrix(NA,n,n.lag)
for (i in 1:n.lag) X[i:n,i] <- x[i:n-i+1]
X
}

dat <- list(lag=matrix(0:6,nrow(df),7,byrow=TRUE),
deaths=df$deaths_total,doy=df$doy, year = df$year, month = df$month,
weekday = df$weekday, week = df$week, monthday = df$monthday, time =
df$time, heap=df$heap, heap_bin = df$heap_bin, precip_hourly_dailysum
= df$precip_hourly_dailysum)
dat$temp_max <- lagard(df$temp_max)
dat$temp_min <- lagard(df$temp_min)
dat$temp_mean <- lagard(df$temp_mean)
dat$wbgt_max <- lagard(df$wbgt_max)
dat$wbgt_mean <- lagard(df$wbgt_mean)
dat$wbgt_min <- lagard(df$wbgt_min)
dat$temp_wb_nasa_max <- lagard(df$temp_wb_nasa_max)
dat$sh_mean <- lagard(df$sh_mean)
dat$solar_mean <- lagard(df$solar_mean)
dat$wind2m_mean <- lagard(df$wind2m_mean)
dat$sh_max <- lagard(df$sh_max)
dat$solar_max <- lagard(df$solar_max)
dat$wind2m_max <- lagard(df$wind2m_max)
dat$temp_wb_nasa_mean <- lagard(df$temp_wb_nasa_mean)
dat$precip_hourly_dailysum <- lagard(df$precip_hourly_dailysum)
dat$precip_hourly <- lagard(df$precip_hourly)
dat$precip_daily_total <- lagard( df$precip_daily_total)
dat$temp <- lagard(df$temp)
dat$sh <- lagard(df$sh)
dat$rh <- lagard(df$rh)
dat$solar <- lagard(df$solar)
dat$wind2m <- lagard(df$wind2m)


conc38b <- gam(deaths~te(year, month, week, weekday,
bs=c("cr","cc","cc","cc")) + heap +
                        te(temp_max, lag, k=c(10, 3)) +
                        te(precip_daily_total, lag, k=c(10, 3)),
                        data = dat, family = nb, method = 'REML', select = TRUE,
                        knots = list(month = c(0.5, 12.5), week = c(0.5,
52.5), weekday = c(0, 6.5)))

conc39 <- gam(deaths~te(year, month, monthday, bs=c("cr","cc","cr")) + heap +
                       te(temp_max, lag, k=c(10, 4)) +
                       te(precip_daily_total, lag, k=c(10, 4)),
                       data = dat, family = nb, method = 'REML', select = TRUE,
                       knots = list(month = c(0.5, 12.5)))

conc20a <- gam(deaths~s(time, k=112, fx=TRUE) + heap +
                        te(temp_max, lag, k=c(10,3)) +
                        te(precip_daily_total, lag, k=c(10,3)),
                        data = dat, family = nb, method = 'REML', select = TRUE)

```
Thank you if you've read this far!! :-))

    [1]: 
https://scholar.google.co.uk/scholar?output=instlink&q=info:PKdjq7ZwozEJ:scholar.google.com/&hl=en&as_sdt=0,5&scillfp=17865929886710916120&oi=lle
    [2]: https://i.stack.imgur.com/FzKyM.png
    [3]: https://i.stack.imgur.com/fE3aL.png
    [4]: https://i.stack.imgur.com/7nbXS.png
    [5]: https://i.stack.imgur.com/pNnZU.png
    [6]: https://github.com/JadeShodan/heat-mortality
    [7]: 
https://github.com/JadeShodan/heat-mortality/blob/main/data_cross_validated_post2.rds

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Simon Wood, School of Mathematics, University of Edinburgh,
https://www.maths.ed.ac.uk/~swood34/


______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Michael
http://www.dewey.myzen.co.uk/home.html

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] High concurvity/ collinearity between time and temperature in GAM predicting deaths but low ACF. Does this matter?

Reply via email to