Dear R users,
I hope that someone may be able to assist me with a problem I have.
I have data on the number of sports facilities located within datazones (a
small measure of area) in Scotland. There are over six thousand datazones in
Scotland and these are nested within local authorities. There are a high number
of zero observations so a Poisson regression does not seem to be appropriate
for the modelling as there is great overdispersion present. I am interested in
assessing whether or not there is an association between the number of
facilities and a datazone level measure of deprivation, adjusting for both
datazone level population (using log(population) as an offset in the model) and
urban/rural classification.
I tried fitting single level quasipoisson regression models and negative
binomial models, adopting dummy variables for local authority (a fairly
inefficient way to tackle this problem) using glm() with family=quasipoisson
and glm.nb(), respectively. The negative binomial regression appeared to be the
better option so I then assessed the residuals for this model using moran.mc()
from the spdep package (an approach recommended by a spatial colleague) and
found that there was significant residual spatial autocorrelation in my model
(value of 0.23 with p-value < 0.0001). After adopting a multilevel negative
binomial model using glmmPQL(), with local authorities as the higher level, the
residual spatial autocorrelation remained the same.
I hoped to carry out some sort of spatial regression, such as CAR modelling, to
attempt to remove the spatial autocorrelation to see if this has an impact on
the association between area level deprivation and the number of facilities. I
created a neighbours matrix for the datazones but as this is of dimension
6412x6412, I struggled to identify a modelling approach that was equipped to
deal with this matrix. I was advised to try a variety of approaches using
information on the eastings and northings of my datazone centroids, such as
including polynomials of the eastings and northings or smoothing terms of the
eastings and northings (using GAM models) in my single level or multilevel
models and reassessing the residual spatial autocorrelation. I tried this and
it did not reduce the spatial autocorrelation. I was then advised to include a
spatial parameter in the model based on the neighbours matrix, as follows:
S <- vector(length=6412)
for(i in 1:6412)
{
S[i] <- sum(bigmatrix[i,]*log((walk$all_walk10+0.5)/walk$totpop))
}
where bigmatrix is my 6412x6412 neighbours matrix of 0s and 1s and, walk is my
dataset with all_walk10 as my response and totpop is the population size, used
as the offset. (This is similar to an approach adopted in 'Bayesian quantile
regression for count data with application to environmental epidemiology', D.
Lee and T. Neocleous, Royal Statistical Society Journal Series C). This
approach reduced the residual spatial autocorrelation to 0.16 in the multilevel
model. However, the correlation remained significant. I have no further ideas
about how to remove the residual spatial autocorrelation and would appreciate
any suggestions. It may be the case that the removal of the correlation would
have no impact on the results obtained and conclusions reached for the
association between deprivation and the number of facilities. However, I was
hoping to be able to explore the difference that the removal of the correlation
would have on the results.
Any suggestions would be greatly appreciated.
Cheers,
Karen
[[alternative HTML version deleted]]
_______________________________________________
R-sig-Geo mailing list
[email protected]
https://stat.ethz.ch/mailman/listinfo/r-sig-geo