Re: [R] svy / weighted regression

David Winsemius Tue, 13 Oct 2009 08:26:55 -0700


On Oct 13, 2009, at 6:07 AM, Laust wrote:

Dear David,

Thanks again for your input! I realize that I did a bad job of
explaining this in my first email, but the setup is that in Finland
persons who die are sampled with a different probability (1) from
those who live (.5). This was done by the Finnish data protection
authorities to protect individuals against identification. In the rest
of the countries everyone is sampled with a probability of 1. The data
that I am supplying to R is summarized data for each country
stratified by case status. Another way of organizing the data would
be:

# creating data
listc <- c("Denmark","Finland","Norway","Sweden")
listw <- c(1,2,1,1)
listd <- c(1000,1000,1000,2000)
listt <- c(755000,505000,905000,1910000)
list.cwdt <- c(listc, listw, listd, listt)

country2 <-data.frame(country=listc,weight=listw,deaths=listd,time=listt)


I hope that it is clearer now that for no value of the independent
variable 'country' is the rate going to be zero.

It is clearer now, and I think you were correct in believing thatshould not have been the problem, so please accept my apologies. Thedenominators and numerators should have been properly summed prior toestimation.

I think this was also
not the case in my original example, but this was obscured by my poor
communication- & R-skills. But if data is organized this way then
sampling weight of 2 for Finland should only be applied to the
time-variable that contains person years at risk and *not* to the
number of deaths, which would complicate matters further. I would know
how to get this to work in R or in any other statistical package.

Perhaps it is - as Peter Dalgaard suggested - the estimation of the
dispersion parameter by the survey package that is causing trouble,
not the data example eo ipso. Or perhaps I am just using survey in a
wrong way.

I think it is likely that we are now both using it incorrectly, but myefforts are also creating nonsense. From the help page I thought thatthe formula in svydesign might be need to be the country variable ...wrong. Or that the weights might need to be the inverse of what youhad used ...wrong. Or that you ought to use quasipoisson for thefamily ,,,, wrong again.

Lumley is preparing a book to accompany the package but that is stillseveral months away from release. He and Norm Breslow also published apaper very recently in the American Journal of Epidemiology on theusing of survey sampling for analysis of case-cohort designs (of whichyour problem seems to be an exceedingly simple example, albeit only inone of the four strata.) I don't have access to the original paper atthe moment, but perhaps you are in an academic setting where suchaccess would be routine.

Or probably even more efficient would be to shoot a letter to ThomasLumley.


--
David

Best
Laust

****
Post doc. Laust Mortensen, PhD
Epidemiology Unit
University of Southern Denmark
On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius <dwinsem...@comcast.net> wrote:
I think you are missing the point. You have 4 zero death countsassociatedwith much higher person years of exposure followed by 4 deathcounts in thethousands associated with lower degrees of exposures. It seemsunlikely thatthese are real data as there are not cohorts that would exhibitsuch lowerdeath-rates. So it appears that in setting up your test case, youhave
created an impossibly unrealistic test problem.

--
David


On Oct 12, 2009, at 9:12 AM, Laust wrote:
Dear Peter,

Thanks for the input. The zero rates in some strata occurs because
sampling depended on case status: In Finland only 50% of the non-cases
were sampled, while all others were sampled with 100% probability.

Best
Laust

On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
<p.dalga...@biostat.ku.dk> wrote:
Sorry, forgot to "reply all"...

Laust wrote:
Dear list,

I am trying to set up a propensity-weighted regression using the
survey package. Most of my population is sampled with a sampling
probability of one (that is, I have the full population).However, fora subset of the data I have only a 50% sample of the fullpopulation.
In previous work on the data, I analyzed these data using SAS and
STATA. In those packages I used a propensity weight of 1/[sampling
probability] in various generalized linear regression-procedures, butI am having trouble setting this up. I bet the solution issimple, but
I’m a R newbie. Code to illustrate my problem below.
Hi Laust,
You probably need the package author to explain fully, but as faras Ican see, the crux is that a dispersion parameter is being used,based onPearson residuals, even in the Poisson case (i.e. you effectivelyget
the same result as with quasipoisson()).
I don't know what the rationale is for this, but it is clear thatwithyour data, an estimated dispersion parameter is going to belarge. E.g.the data has both 0 cases in 750000 person-years and 1000 casesin 5000person-years for Denmark, and in your model they are supposed tohave
the same Poisson rate.

summary.svyglm starts off with

  est.disp <- TRUE
and AFAICS there is no way it can get set to FALSE. KnowingThomas,there is probably a perfectly good reason not to just set thedispersion
to 1, but I don't get it either...
Thanks
Laust

# loading survey
library(survey)

# creating data
listc <-
c("Denmark","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden")
listw <- c(1,2,1,1,1,1,1,1)
listd <- c(0,0,0,0,1000,1000,1000,2000)
listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000)
list.cwdt <- c(listc, listw, listd, listt)
country <-
data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)
# running a frequency weighted regression to get the correct point
estimates for comparison
glm <- glm(deaths ~ country + offset(log(yrs_at_risk)),
weights=weight, data=country, family=poisson())
summary(glm)
regTermTest(glm, ~ country)

# running survey weighted regression
svy <- svydesign(~0,,data=country, weight=~weight)
svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)),
design=svy, data=country, family=poisson())
summary(svyglm)
# point estimates are correct, but standard error is way too large
regTermTest(svyglm, ~ country)
# test indicates no country differences

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
 O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
 c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45)35327918~~~~~~~~~~ - (p.dalga...@biostat.ku.dk) FAX: (+45)35327907
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
Heritage Laboratories
West Hartford, CT


David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] svy / weighted regression

Reply via email to