

Douglas

It seems that this example highlights an important difference between ML and LS. Although the absolute configuration parameter is primarily of interest to small-molecule crystallographers, the same arguments apply to twin fractions, occupancies and B-values, which should be of interest to protein crystallographers too.

As you point out, Q(hkl) (see below) will have a much more Gaussian distribution than I(hkl). At first sight, if Q(hkl) really has a Gaussian error distribution, ML and LS refinement should give the same values for x and its esd. This is however wrong, because by integrating from x=0 to x=1 we have constrained the ML refinement by our prior knowledge that only x values in this range are physically possible; this constraint cannot be applied in the LS framework.

So if I determine the Flack x (absolute configuration) parameter from 10 different crystals of sucrose by LS, I will get values scattered around 0.0 with (I hope) a scatter corresponding to the estimated esd. If I do the same by ML, because of the integration between x=0 and x=1, ALL experiments will give x values slightly greater than zero!

Through Bayesian eyes the ML result makes perfect sense, despite the fact that the true value of x is 0.0. However I could never persuade my small-molecule colleagues to accept it. The combined might of the Acta C editors and the CIF people has never forgiven me for deciding, when I released the program in 1993, that SHELXL would never print physically impossible results (so negative occupancies and B values were truncated to zero, and 0 =< x =< 1 was imposed). They would argue that x is the result of an experiment, so negative values (within a couple of esds) are not only possible, they must occur! And their final twist is that SHELXL refines against Iobs rather than Fobs specifically so that negative experimental values can be used directly without fudging.

George

PS. Thanks for spotting the accidentally missing (1-x) in my original email, I hope that this did not confuse too many readers.


Depending on the model assumed, this LS method can be interpreted as ML. For
instance, let's assume that the 'real' intensity I(hkl) is subject to random
Gaussian error. Then, one model is

Iobs(hkl) = (1-x)I(hkl) + x.I(-h-k-l) + e.sigmaI(hkl)

where e is a standard Gaussian random variable (with variance 1) and
sigmaI(hkl) is the std deviation for each hkl. Here, x is a variable
dependent on the current molecular model.

Then, an LS fit of Iobs(hkl) to (1-x)I(hkl) + x.I(-h-k-l), weighted by the
inverse of sigmaI(hkl), is equivalent to the ML solution given by the above
model. In this case there is a relatively simple analytical solution for x,
if we estimate I(hkl), I(-h-k-l), and sigmaI(hkl) by Iobs(hkl),
Iobs(-h-k-l), and sigmaIobs(hkl) respectively.

The LL(x) is given by

-0.5[(Iobs(hkl) - (1-x)I(hkl) - x.I(-h-k-l))/sigmaI(hkl)]^2

up to an arbitrary constant.
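Because the residual is linear in x, the weighted LS minimiser (equivalently, the ML maximiser for this Gaussian model) has a closed form. A minimal sketch, with hypothetical names: `I_plus` and `I_minus` stand for the model intensities I(hkl) and I(-h-k-l), and `I_obs`/`sigma` for the observations and their esds.

```python
import numpy as np

def flack_ls(I_plus, I_minus, I_obs, sigma):
    """Weighted linear LS / Gaussian-ML estimate of the twin fraction x.

    Model: I_obs = (1-x)*I_plus + x*I_minus + e*sigma.
    Writing the model as I_plus + x*(I_minus - I_plus), the residual is
    linear in x, so the weighted sum of squares has a closed-form minimum.
    """
    w = 1.0 / sigma**2            # LS weights = 1/sigma^2
    d = I_minus - I_plus          # coefficient of x in the model
    y = I_obs - I_plus            # data with the x = 0 model subtracted
    x = np.sum(w * d * y) / np.sum(w * d * d)
    esd = 1.0 / np.sqrt(np.sum(w * d * d))
    return x, esd
```

On synthetic data with a known twin fraction, the estimate recovers x to within its esd; the esd here is the usual LS one, with no constraint to the physical range.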

Note that this model has at least one theoretical problem, in that real,
physically measurable intensities are always positive, and this model will
give negative Iobs with high probability when sigmaI is large enough. IOW,
intensities cannot be distributed by a Gaussian, except approximately.

The following 'ML' approach was proposed in discussions at the Computing
School in Siena (contributions from Rob Hooft, Simon Parsons and David
Watkin are acknowledged, but I'm sure that there were others). We define
the quantity Qobs = [Iobs(hkl)-Iobs(-h-k-l)]/[Iobs(hkl)+Iobs(-h-k-l)]
and a corresponding Qcalc. This should cancel some systematic errors.
The esd of Qobs can be calculated by standard methods from the known
esds of Iobs(hkl) and Iobs(-h-k-l). Then we estimate the log likelihood
(LL) of a particular value of x by -0.5[(Qobs-Qcalc)/esd(Qobs)]^2 summed
over all reflections.

This model is both LS and ML in the same way as the above model. The
difference is that the implied model is a little different: here you assume
that there is a 'real' Q(hkl), and that Qobs(hkl) is subject to random
Gaussian errors as

Qobs(hkl) = Q(hkl) + e.sigmaQ(hkl)
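The "standard methods" for esd(Qobs) are first-order error propagation on the quotient, and for the twinned model Icalc(h) = (1-x)I(h) + x.I(-h) the calculated quotient reduces to Qcalc = (1-2x).Qmodel. A sketch with hypothetical names (`A`/`B` are Iobs(hkl)/Iobs(-h-k-l), `Qmodel` the quotient of the untwinned model intensities):

```python
import numpy as np

def q_and_esd(A, sigA, B, sigB):
    """Qobs = (A-B)/(A+B) with first-order error propagation.

    dQ/dA = 2B/(A+B)^2 and dQ/dB = -2A/(A+B)^2, hence
    sigmaQ = 2*sqrt(B^2*sigA^2 + A^2*sigB^2) / (A+B)^2.
    """
    s = A + B
    Q = (A - B) / s
    sigQ = 2.0 * np.sqrt(B**2 * sigA**2 + A**2 * sigB**2) / s**2
    return Q, sigQ

def log_likelihood(x, Qobs, sigQ, Qmodel):
    """LL(x) = -0.5 * sum(((Qobs - Qcalc)/sigQ)^2), up to a constant.

    For Icalc(h) = (1-x)*I(h) + x*I(-h), the differences pick up a
    factor (1-2x) while the sums are unchanged, so Qcalc = (1-2x)*Qmodel.
    """
    Qcalc = (1.0 - 2.0 * x) * Qmodel
    return -0.5 * np.sum(((Qobs - Qcalc) / sigQ) ** 2)
```

Note that at x = 0.5 (perfect racemic twin) Qcalc vanishes for every reflection, as it should.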

So my question is: should I estimate the value of x and its esd by
integrating over -infinity < x < +infinity (in practice this range can
be truncated to say -5 to +5) or should I integrate over 0 =< x =< 1 or
should I use only the values of the LL at x = 0 and x = 1?

My opinion is to use 0 =< x =< 1 if you expect a reasonable possibility
of racemic twinning; otherwise, constrain x to 0 and 1.
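The effect of the choice of range can be illustrated numerically. With a quadratic LL the unconstrained posterior for x is Gaussian; restricting the integration to 0 =< x =< 1 truncates it, which (as noted above for the sucrose example) pushes the posterior mean above zero when the true x is 0, and also shrinks the esd. A sketch under these assumptions, with hypothetical names:

```python
import numpy as np

def posterior_moments(x_hat, esd, lo, hi, n=20001):
    """Mean and esd of x under a Gaussian likelihood N(x_hat, esd^2)
    integrated over the prior range [lo, hi], by numerical quadrature
    on a uniform grid (weights cancel in the normalisation)."""
    x = np.linspace(lo, hi, n)
    p = np.exp(-0.5 * ((x - x_hat) / esd) ** 2)  # unnormalised posterior
    p /= p.sum()
    mean = np.sum(x * p)
    var = np.sum((x - mean) ** 2 * p)
    return mean, np.sqrt(var)
```

For x_hat = 0 and esd = 0.05, integrating over a wide symmetric range gives mean 0, while integrating over [0, 1] gives the half-normal mean esd*sqrt(2/pi), about 0.04, with a reduced esd.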

Preliminary tests suggest that these methods give appreciably lower esds for x
than the LS approach.

As I said, the 'LS' approach above (the first model discussed) can also be
thought of as ML. So the only real difference is the assumed model: whether
the random Gaussian errors apply to I(hkl) or to Q(hkl). This immediately
raises the question of which model is more appropriate. For one thing, it
appears to me that Qobs(hkl) can take negative values in reality, so
Gaussian errors are probably more justifiable for this model, and this may
be the main reason it performs better.
