Dear Ruben:

>I'm still very interested in discussion of the issue of probability
sampling vs. model-based sampling, but have been waiting until reading and
digesting the Godambe and Hansen et al papers.  I struggled years ago with
how/whether to incorporate inclusion probabilities in regional estimation
and ended up just using separate models (variances) by strata.
 We demonstrated the use of a cokriging model to "find" low alkalinity
streams at high elevations in the Southern Blue Ridge that were missed by
the sample, but this depended in part on extrapolating a relationship
between elevation and alkalinity.

Hi Yetta: the above is an example of model-based inference.

>Say we have a finite population (i.e., lakes) and we need to estimate the
total of some attribute. Both approaches are attempting to quantify
uncertainty in the attributes of unmeasured lakes.  In the case of
geostatistics, the kriging variances reflect uncertainty due to
interpolating to unmeasured locations.  Because of the strong assumed
superpopulation model,   kriging estimates of variance are (in my
experience) too small, in the sense that a new sample would show larger
actual MSEs than kriging MSEs.

I see that there are two sources of variance from a geostatistical model:
the variance due to partial observation of the spatial process and the
variance due to incomplete specification of the model. These are accounted
for in the model, at least when it is formulated as a stochastic process.
Thus variance estimates of total estimates (or other functions or
parameters) from the geostatistical model are higher than variance
estimates from a pure random sampling approach, which only takes into
account the variance due to incomplete observation of the population.

>For simplicity, let's assume there is no spatial autocorrelation, a
stationary model, and the joint probabilities of inclusion are zero.  Is
there a difference in the uncertainty represented by the design-based and
model-based variance estimates?

Yes, the model-based estimate of variance is generally higher than the
design-based one. This is because the model-based variance is composed at
least of a term for model specification plus a term due to incomplete
observation (i.e. sampling). This can be shown after some algebra for the
simplest example of the expansion estimator for finite populations. The
simple regression estimator of the total has a variance estimated by a
formula which includes a term due to the model plus a term due to the
sampling, while the variance of the equivalent design-based estimator only
has the term coming from the sampling. This shows that in the simplest of
cases, the model-based estimator actually has higher variance, i.e. is
more conservative, than the corresponding design-based estimator, contrary
to popular belief.

>If there is no SA, the model-based pop. variance is the number of lakes
not sampled (N-n) times the overall sample variance, V(n).

I think there should be another term for the model, whatever this model
might be, probably a relation with another, predictor variable, since the
coordinates are irrelevant (no spatial autocorrelation). If there is no
term for a model, then we do not have a model-based estimation. Perhaps I
am missing something in your scenario.

>As sample size, n, increases, uncertainty decreases only because N-n
decreases.  Note that the actual observed values play no part in the
variance, which depends only on the distances involved (and with no SA,
not even that).  Implicitly, the total, T, is estimated as though the
sample is equal-probability, by summing the values obtained in the sample,
and the kriged estimates [here, =sample mean x (N-n)].

I think you are talking of the kriging estimator of the mean when the
variogram is flat. In that case I guess the kriging estimator should be no
different from the simple expansion estimator of design-based inference
and we do not have a case of model-based estimation.
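
A minimal sketch of that equivalence, assuming a flat variogram so that the
prediction for every unsampled lake is just the sample mean. The lake values
and the function names are made up for illustration:

```python
from statistics import mean

def total_sample_plus_predicted(sample, N):
    """Observed values plus the sample mean for each of the N - n unsampled units."""
    n = len(sample)
    return sum(sample) + (N - n) * mean(sample)

def total_expansion(sample, N):
    """Design-based expansion estimator under equal-probability sampling."""
    return N * mean(sample)

# Hypothetical alkalinity values for n = 5 sampled lakes out of N = 40
sample = [12.0, 7.5, 9.0, 15.5, 6.0]
N = 40

t_pred = total_sample_plus_predicted(sample, N)
t_exp = total_expansion(sample, N)
print(t_pred, t_exp)  # both are 400.0
```

The two totals agree because sum(sample) = n x mean, so the "kriged" total
collapses to N x mean.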

>If there is SA, one way of thinking about it is that higher weight is
assigned to measured values of lakes that are near many unmeasured lakes.
 The variance of a Horvitz-Thompson estimator will also decrease as
sample size increases as the probabilities of inclusion n/N increase, but
unlike the kriging-model-based estimator, it depends on the values
observed in the sample.  What are the implications of this?

I understand that the variance of the kriging model-based estimator of the
total depends on the observations as a consequence of the use of the sill
parameter in the computation of the estimation variance.
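
For reference, here is a small sketch of the Horvitz-Thompson estimator of
the total mentioned above, under the assumption of simple random sampling
without replacement (every inclusion probability equal to n/N). The variance
formula is the standard textbook one for that design; the data and function
names are hypothetical. It shows how the design-based variance estimate
follows the spread of the observed values:

```python
def ht_total(y, pi):
    """Horvitz-Thompson total: each value weighted by 1 / inclusion probability."""
    return sum(yi / p for yi, p in zip(y, pi))

def srs_variance_of_total(y, N):
    """Estimated variance of the total under simple random sampling
    without replacement: N^2 * (1 - n/N) * s^2 / n."""
    n = len(y)
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    return N ** 2 * (1 - n / N) * s2 / n

N = 40
# Two hypothetical samples with the same mean but different spread
wide = [12.0, 7.5, 9.0, 15.5, 6.0]
narrow = [10.0, 9.5, 10.5, 10.0, 10.0]

pi = [len(wide) / N] * len(wide)      # equal inclusion probabilities n/N
total_wide = ht_total(wide, pi)       # 400.0
total_narrow = ht_total(narrow, pi)   # 400.0
var_wide = srs_variance_of_total(wide, N)      # 4025.0
var_narrow = srs_variance_of_total(narrow, N)  # 35.0
print(total_wide, total_narrow, var_wide, var_narrow)
```

Both samples give the same estimated total, but the variance estimates differ
by two orders of magnitude because they are computed from the observed values.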

>Each sampled lake represents a number of others, but we don't assume
anything about their location and the "number of others" cloned from each
sampled lake is determined by the inclusion probability.  If these are
equal, then each sample lake is given equal weight regardless of how many
close neighbors it has.  Using a list frame would therefore give more
similar results to kriging, whereas an equal-area design would yield less
similar results.  I think, in general, the underestimation of variance
increases as the semivariogram model gets away from the reality of the
sample, in terms of requiring a sill that relates to sample variance.  I
would like to see studies comparing the two approaches for different

The particular case of predicting over spatial processes that are
spatially separate (like predicting with some observed lakes for other
unobserved lakes) seems to me rather difficult. However, regarding the
underestimation of variance, in general I expect higher estimation
variances from model-based as compared with design-based estimators.

>I have digested the Hansen et al paper and discussion thereafter, but am
still struggling with Godambe.  I don't see how the Hansen et al. paper
supports the conclusion that probabilistic sampling design has been called
into question as a basis for inference.

It wasn't Hansen et al. who questioned randomization-based inference in
finite populations (they are, rather, among the main proponents of that
approach); it was the discussion around that paper.

>Hansen's point is that model-based inference requires an assumption that
the superpopulation model is true, and can lead to bad results if this is
not the case.  (As I recall that's how this discussion started was from
the observation that the model is always wrong).

It is true that models are always wrong in some sense, but:
1) the construction of model-based estimators forces you to think about how
the system under study works: what the mechanistic relations among
observable variables are, how the known laws of physics (or biology, etc.)
affect your spatial process or finite population, and so on. Building models
has allowed the development of science as we know it.
2) it is also true that sampling is never truly random, because
practitioners very often violate the dictates of random sampling theory;
they do so out of common sense, since samples obtained by drawing
numbers from a hat are usually very inconvenient or misleading.
3) models can be made robust by balancing on predictor variables.

>From the back-and-forth following the Hansen et al. paper (and reading
about adaptive design), I get the sense that few discount the importance
of beginning with a sample drawn according to a probabilistic design. 
This is important to emphasize for the geostatistics community because
many practitioners are in the habit of beginning with a "found" sample
with unknown relationship to the population.  Where the statisticians
diverge is in the use of a model vs. design to draw inferences.

True. It is convenient to design a sampling program, to plan in advance
how to take samples, but it is not completely clear that the sampling
should be probabilistic. For example, if I know of a certain nuisance
parameter I want to get rid of at the inference stage, then I sample in
a deterministic way such that the nuisance parameter is eliminated
by simple algebraic operations. For example, pairing usually allows
computing a difference which deterministically eliminates a nuisance
parameter. Probabilistic sampling enters the picture when I suspect
there are hidden, latent parameters of which I am not aware, and then
I apply a randomization procedure in order to average over those hidden,
latent parameters, i.e. in expectation. However, after I have
obtained my sample and made my deterministic computations to eliminate
the nuisance parameter, I forget about the sampling procedure and make
inference based on my model, conditioning on the observed sample.
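
A toy sketch of the pairing idea, assuming a purely additive nuisance effect
shared by both members of each pair. The site effects, baseline, and
difference of interest are made-up numbers, and the function name is my own:

```python
def paired_differences(treated, control):
    """Difference within each pair; any additive effect common to the pair cancels."""
    return [t - c for t, c in zip(treated, control)]

# Hypothetical nuisance values, one per site (unknown in practice)
site_effects = [3.0, -1.5, 8.0, 0.5]
baseline = 10.0
delta = 2.0  # hypothetical quantity of interest

# Both members of a pair are measured at the same site, so they share its effect
control = [baseline + s for s in site_effects]
treated = [baseline + s + delta for s in site_effects]

diffs = paired_differences(treated, control)
print(diffs)  # every difference equals delta, whatever the site effects are
```

No randomization is needed to remove the site effect here; the subtraction
does it deterministically, which is the point made above. If the nuisance
effect were not additive, the cancellation would no longer be exact.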

>I tried to read Godambe, but it's too godambe hard to follow - and the
editors/reviewers let him get away with not defining his terms.  It's
interesting to think of the value of a probabilistic design as a means of
removing nuisance parameters (such as spatial autocorrelation), but I
confess I can't follow his ancillary principle.  If you have time and can
explain it to me/us in English, I'd sure appreciate it.

After the time I have put into this reply, I guess there is no point in
avoiding this other topic. Right away, I believe Godambe’s paradox paper
is a landmark in statistics. In a nutshell, Godambe shows that when a
purely randomization-based inferential approach is used along with
theoretically sound pivotal methods for a parameter of a finite
population, you arrive at an unavoidable contradiction: the
procedure gives a correct probability coverage for an estimated value of
the parameter of interest, say Theta1, and also a correct probability
coverage for ANY other parameter value, say Theta2, in the parameter

I have a more detailed discussion and an analysis of the meaning of the
paradox in my thesis. Also, you can use Google Groups, check
sci.stat.math, and search for Godambe. The thread is entitled
"Godambe’s paradox".

Note that Godambe defends the randomization theory, which is ironic.

>I disagree with Royall's assessment that a particular random sample
is biased just because its mean is not that of the population -- the
expectation of the mean of a sample is still unbiased and it is
unreasonable to expect to draw a "balanced" sample without first
enumerating the population.  That's why the standard error is of
interest.  Sure, if you can stratify or use ancillary variables to improve
balance, fine.

I think Royall is right. You know that *the expectation of the mean*
equals the population mean, but you don’t know if *your particular mean*,
the one derived from the actual sample you obtained, approximates the true
mean. In contrast, in model-based inference you condition on the observed
sample and calculate expectations based on repetitions of the model,
rather than of the sampling (or you use likelihood-based inference).
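
A small sketch of the distinction, assuming a made-up population of five
values and enumerating every possible sample of size two. The average of all
sample means equals the population mean, yet any single sample can be far off:

```python
from itertools import combinations
from statistics import mean

# Hypothetical finite population with one unusually large unit
population = [2.0, 4.0, 6.0, 8.0, 30.0]
n = 2

pop_mean = mean(population)  # 10.0

# Mean of every possible sample of size n (10 equally likely samples)
sample_means = [mean(s) for s in combinations(population, n)]

avg_of_means = mean(sample_means)                            # 10.0
worst_error = max(abs(m - pop_mean) for m in sample_means)   # 9.0

print(sorted(sample_means))
print(avg_of_means, worst_error)
```

Unbiasedness is a statement about the whole list of sample means; the sample
actually drawn is only one entry in it, here anywhere between 3 and 19.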

>Sorry this has gotten so long.  Sometime (when I retire?), I'd like to
write a paper arguing that worries about spatial autocorrelation, except
in the case of regression, are misplaced.  As far as I'm concerned, it's
perfectly reasonable to guarantee an unbiased estimate of the proportion
of black/white marbles in a hat by shaking them up before drawing a
it's a hell of a lot easier than mapping out their spatial positions in the
hat and fitting some variogram model. :-)

Yes, that is a nice comment. However, you cannot shake natural populations
to destroy the mechanisms that determine their functioning and make them
random. I gotta go now, or else I'll miss the soccer game!


