Hi Marc,

Sorry it's taken me so long to get back to you on this. I've had it on my mind for a while that your comments deserved a response, but you raised many interesting points, and the objections you raised to what I was doing required that I think more carefully about the statistics.
You say, and I agree, that what matters in the end are the individual intensities and amplitudes. Nevertheless, the purpose of computing statistics such as Rmerge, mean(I/SU(I)), etc. is to *condense* this information for presentation to the user: if asked to summarise the data quality you could hardly present a list of all the structure factors, even though that list may contain all the information required to assess the data quality.

I was originally focused on the shell intensity averages because George Sheldrick raised an objection to my suggestion of using the Bayesian-corrected intensities, which are actually calculated by the TRUNCATE program but not output to the MTZ file (see subroutines NEGIAS & NEGICS in truncate.f). He thought that the corrected shell intensity averages would be biased and would give a misleading indication of data quality, particularly in the highest resolution shells, where it's clearly of some importance in deciding where to make a high resolution cutoff of the data. So I wanted to show that the average of the truncated Is using the French & Wilson method is unbiased.

I've since retracted the idea of using corrected intensities in refinement, as I think the only effect would be to reduce the R factors for the weak data without having any effect on the model. I do think corrected intensities would have some value in twinning tests, particularly those which rely on testing the proportion of de-twinned negative intensities generated for an assumed twin fraction (e.g. Britton's test). As you say, the shell average of the intensity is relevant to the Wilson plot, but it's also relevant to the calculation of normalised amplitudes (Es) using Karle & Hauptman's 'k-curve' method (implemented in ECALC), which are widely used and indeed critical in calculations of structure factor probabilities, so I think it cannot be dismissed as of little importance in crystallography.
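To make the kind of correction I mean concrete, here is a minimal sketch of a French & Wilson-style Bayesian correction for an acentric reflection. This is my own reconstruction, not the TRUNCATE code: the function name `fw_corrected_intensity` is hypothetical, and it simply computes the posterior mean of the true intensity J under an acentric Wilson (exponential) prior with parameter S and a normal error model for the observed intensity, by brute-force quadrature:

```python
# Sketch (not the TRUNCATE implementation) of a F&W-style correction:
# posterior mean E[J | Iobs, sigma, S] with an acentric Wilson prior
# p(J) = exp(-J/S)/S and a normal error model for the observed I.
import numpy as np

def fw_corrected_intensity(i_obs, sigma, S, n=20001, width=10.0):
    """Posterior mean of the true intensity J for one acentric reflection."""
    # Integrate only over the physically allowed range J >= 0.
    j = np.linspace(0.0, max(i_obs, 0.0) + width * sigma, n)
    # log of (normal likelihood x Wilson prior), up to an additive constant
    log_w = -0.5 * ((i_obs - j) / sigma) ** 2 - j / S
    w = np.exp(log_w - log_w.max())  # stabilise before normalising
    return float(np.sum(j * w) / np.sum(w))
```

Note that even a negative observed intensity yields a small positive corrected value, since the prior puts all its mass on J >= 0.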
As you point out, the related average of I/SU(I) is the statistic widely used to guide the decision of where to apply the high resolution cutoff. For the high resolution shells where this matters, the intensities are of course very weak, so SU(I) is mainly determined by the X-ray background and will be roughly constant within a given shell. The main reason the background is not constant at a given d-spacing for strong reflections is that the 'acoustic' contribution to the diffuse scattering background is correlated with the Bragg intensity. However, if the Bragg intensity is weak, the TDS intensity is weaker by an order of magnitude and so won't contribute significantly to the background in comparison with other sources of background X-rays, such as scattering from the cryo-glass and the air path, and noise in the detector electronics; the first two of these still vary with Bragg angle but have no other directional dependence. Hence it's reasonable to assume that SU(I) is constant for the outer shells, so I arbitrarily set it uniformly to 1, and am therefore implicitly calculating the average I/SU(I).

On the question of expectations: what I'm doing is calculating expectations of various quantities related to the Is, as 'corrected' by the various methods which have been proposed (I use 'corrected' in a loose sense, since one method merely consists of rejecting all Is <= 0). So for the F&W method, for example, I'm calculating the expectation of the bias of the corrected I and the sqrt(expected squared error) (see here for the definition: http://en.wikipedia.org/wiki/Mean_squared_error), given only the Wilson distribution parameter S for the shell, and assuming the form of the Wilson PDF (acentric) and the assumed error distribution (normal) for the observed Is.
I am certainly not averaging over reflections. The method of integration that I'm using (Monte Carlo) requires the generation of randomised discrete points in the sample space, which you could, I suppose, identify as simulated reflections; nevertheless it is still an integral, and is therefore in my view a perfectly valid expectation. Note that the Wilson PDF is used both to calculate the correction in the F&W case and to calculate the expectation based on the assumed distribution of intensities given S. What you are describing is the expectation given S *and* given the observed I, assuming the Wilson PDF in the F&W case. Both of these seem to me to be equally valid calculations of expectation; they just make different assumptions. However, to show the results for the variation with I as well as S, I would need a much bigger table, and I was trying to keep the information presented as condensed as possible (this was in fact the rationale for making the simplifying assumption that SU(I) = 1, as it means there's one less variable to vary!).

You say that my calculations are flawed because I assume that in a real situation I would know S precisely. First, I would say that taking the average observed I in shells should in most cases give a perfectly good estimate, and indeed the TRUNCATE program has worked in practice with real data for ~30 years with very few problems (at least, AFAIK, no problems of getting S < 0 have been reported), though I'm sure you can contrive simulated data which will cause it to fail. Second, if you read my first postings on this subject you'll see that I anticipated this very problem and suggested that S should be calculated iteratively from the corrected Is, which would make it impossible for S to be < 0. Though I haven't tried this yet, I think it actually requires that the new estimate of S be unbiased, otherwise I can see that it might increase without bound.
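To spell out the kind of Monte Carlo expectation I mean, here is a minimal sketch (my own, not the actual simulation script) for the simplest of the 'corrections', thresholding at zero; the other methods would just replace the correction line. True intensities J are drawn from the acentric Wilson (exponential) distribution with parameter S, normal noise with sigma = 1 is added, and the bias of the corrected shell average and the RMS error against the true J are estimated:

```python
# Sketch of a Monte Carlo expectation given S (my own reconstruction,
# not the original simulation): the 'correction' shown is I' = max(I, 0).
import numpy as np

def mc_bias_rmse(S, sigma=1.0, n=1_000_000, seed=0):
    """Bias of the corrected shell average, and RMS error vs the true J."""
    rng = np.random.default_rng(seed)
    # True intensities from the acentric Wilson (exponential) distribution
    J = rng.exponential(S, n) if S > 0 else np.zeros(n)
    I_obs = J + rng.normal(0.0, sigma, n)   # normal measurement error
    I_corr = np.maximum(I_obs, 0.0)         # the I' = max(I, 0) 'correction'
    bias = I_corr.mean() - S                # bias of the corrected average
    rmse = np.sqrt(np.mean((I_corr - J) ** 2))
    return bias, rmse
```

For S = 0 the exact values are E[max(e,0)] = 1/sqrt(2*pi) ≈ 0.40 and RMS error sqrt(1/2) ≈ 0.71, which is what the I'av column of the table quoted further down shows.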
As indicated above, the RMS error is not the sample SD about the mean, as you suppose. 'Error' is defined in what I believe is the conventional way, namely the difference between the estimate and the true value, so the RMS error is a direct measure of the accuracy of the estimate (the advantage of a simulation is of course that one knows the 'true' value, a luxury not available in real experiments!). You could of course argue that the use of the RMS is not ideal: it should perhaps be the mean absolute error, or weighted in some way. However, I suspect that using alternative measures would not change the conclusions fundamentally.

Further on, you seem to recognise that the values in the table represent I/SU(I), and you say that the S&D results are perfectly acceptable because then I/SU(I) ~ 1 (actually 0.9) for the case S=0. However, it's clear from the table that the bias remains significant (0.2) even up to I/SU(I) ~ 5. If we take the typical cutoff of I/SU(I) > 1.5 in the outer shell, this means that you would have to change the cutoff value to 1.9 in order to get the same effect, and I think that many users would find this very confusing.

However, even supposing the table values were the sample SD (which is always <= the RMS error; see the second equation in the Wikipedia article), I'm rather puzzled that you say:

> The F&W procedure completely distorts the second moment about the mean of the
> intensity distributions. For weak data it is close to 0, whereas the true second
> moment about the mean of the distribution is 1.0. Now, I would find this much more
> worrisome than a slightly distorted shell average.

Surely the effect of including prior information is invariably to reduce the uncertainties of the derived parameters, often drastically? And doesn't a reduced uncertainty or RMS error represent an improvement?
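The relation between the two quantities is just the usual decomposition MSE = bias^2 + variance, which is why the sample SD can never exceed the RMS error. A toy numerical check (invented numbers, purely illustrative):

```python
# Toy illustration (my own invented numbers) of MSE = bias^2 + variance:
# the sample SD measures spread about the mean, the RMS error measures
# deviation from the *true* value, so sample SD <= RMS error always.
import numpy as np

rng = np.random.default_rng(1)
true_value = 2.0
# A biased estimator: offset by +0.5, plus unit normal noise
estimates = true_value + 0.5 + rng.normal(0.0, 1.0, 1_000_000)

bias = estimates.mean() - true_value
sd = estimates.std()                                    # spread about the sample mean
rmse = np.sqrt(np.mean((estimates - true_value) ** 2))  # deviation from the truth

# The decomposition holds as an algebraic identity (population SD, ddof=0)
assert abs(rmse**2 - (bias**2 + sd**2)) < 1e-6
assert sd <= rmse
```

So a 'small' rmsE in the table is not evidence of a distorted distribution; it simply means the corrected values sit closer to the truth.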
For example, see this nice paper by Parisini, Sheldrick et al.: http://journals.iucr.org/d/issues/1999/11/00/jn0061/jn0061.pdf, where they show that the effect of geometric restraints in refinement is to drastically reduce the parameter uncertainties compared with unrestrained refinement (see Fig. 2). From my own experience I know that a typical co-ordinate uncertainty at 1.5 Ang resolution is ~0.2 Ang for unrestrained refinement, but ~0.02 Ang for restrained: an order of magnitude less! Would you say the large uncertainties from the unrestrained refinement are 'distorted' by the use of restraints?

Again, apologies for the long delay, and for the essay!

Cheers

-- Ian

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: 06 October 2008 21:17
> To: Ian Tickle
> Cc: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] Reading the old literature / truncate / refinement programs
>
> Hi Ian,
>
> I did not follow up our recent debate about the respective merits of various
> "truncate" procedures, in particular the comparison of the French & Wilson
> (1978) and Sivia & David (1994) methods. Since you have extended the
> previous simulations, I feel that the discussion should be continued.
>
> I still cannot understand why you are so focused on the shell averages
> (i.e. intensity averages over many reflections in intensity or resolution
> shells). This is apparently the criterion that you use to claim that the FW
> method is superior to all others.
>
> The truncate method (whichever prior is used) is a procedure to estimate
> amplitudes for individual reflections. This is the goal of the whole
> procedure. So we have an estimator E(Im,Sm) for the true intensity J (or
> amplitude sqrt(J)) of one reflection, given its "measured" intensity Im and
> sigma Sm.
> According to the usual statistical definition of bias, this estimator is
> unbiased if its expectation value (NOT its average over many different
> reflections) is equal to the true value J (or sqrt(J)), for all values of
> J. We seem to agree that neither the estimator proposed by F&W, nor the one
> proposed by S&D, is unbiased, and I gave a simple example in an earlier
> mail: for J=0, both estimators will return an estimate E > 0, whatever the
> measured data (Im,Sm) are. They are thus biased.
>
> Now, you seem to be highly preoccupied by the intensity averages over many
> different reflections in a resolution/intensity shell. Of course, both the
> F&W and the S&D methods also return an estimate for such shell averages,
> and your simulations indicate that the F&W estimates for these shell
> averages are unbiased. But the quantities that are used in all subsequent
> computations are individual reflection intensities/amplitudes. The shell
> averages are not used in any important crystallographic computation (apart
> maybe from the Wilson plot). So what matters really is to get good
> estimates for individual reflection intensities/amplitudes. Shell averages
> are not used in crystallographic refinements: we refine against individual
> reflections! They are not used in phasing: we phase individual reflections.
> So what matters is the bias on individual reflections, not on shell
> averages. I think that you cannot simply claim that, because the F&W method
> returns unbiased shell averages, it is necessarily superior to other
> methods.
>
> In that sense, your previous statement that
>
> > the average bias of <J> is the same as the bias of the average <J>
>
> may be formally correct, but I doubt that it is of any relevance. Because
> even if the average bias is zero, this does not mean that the estimator is
> unbiased. Otherwise, I would suggest a truncate procedure where the
> intensities of all reflections are simply set equal to their shell
> averages.
> Clearly, this would yield unbiased estimates for the shell averages (and
> would therefore be a "good" procedure according to your criteria), but the
> estimates for individual reflection intensities would be highly biased.
>
> Also, your simulations are flawed by the fact that you assume you know S
> exactly. However, this is not the case in reality. In the F&W procedure, S
> is estimated in shells of resolution from the measured intensities. This
> turns S into a random variable and you will never have S=0 exactly. If you
> now imagine the case of data collected and integrated when there is no
> diffraction at all, you would get some random number for S in each shell.
> Half of the time it would be negative and so you could not even apply the
> F&W procedure. I have tested this: TRUNCATE (with the option truncate yes)
> fails on such data.
>
> More importantly, to decide whether your crystal is diffracting up to a
> certain resolution (or whether there is any diffraction at all), no one
> would look at the average I or average F. This is not a meaningful quantity
> because the I's or F's are not on an absolute scale and because, in any
> event, they would have to be gauged against a measure of their uncertainty.
> Therefore, what really matters is the average I/sigI. This is the sensible
> criterion to decide whether you have a signal (or crystal) or not. The
> average I is of little use for this purpose. Now, if there is no crystal at
> all, the S&D procedure returns average I/sigI (corrected I's) of 1 and I
> believe that most crystallographers would then conclude that the data are
> at the noise level (i.e. there is no measurable diffraction). No one would
> look at the average I. So, contrary to what you wrote earlier, the S&D
> results are perfectly acceptable, even in the case when there is no
> diffraction (or crystal) at all.
>
> Regarding your simulations, I am much more concerned about the columns
> which you label "rmsE".
> You state that this is the RMS error for each average. I presume (but I am
> not sure; please provide a formula) that this is the sample second moment
> about the sample mean (i.e. the sample variance), the sample being all
> reflections in a given shell? If so, then the F&W procedure appears to be
> somewhat problematic for weak data. It seems that the F&W procedure
> completely distorts the second moment about the mean of the intensity
> distributions. For weak data it is close to 0, whereas the true second
> moment about the mean of the distribution is 1.0. Now, I would find this
> much more worrisome than a slightly distorted shell average.
>
> So I am not yet convinced that your simulations demonstrate in any way the
> superiority of the FW method over other methods.
>
> Cheers,
>
> Marc
>
> Ian J. Tickle wrote:
>
> > From: "Ian Tickle" <[EMAIL PROTECTED]>
> > To: "Frank von Delft" <[EMAIL PROTECTED]>, "Pavel Afonine" <[EMAIL PROTECTED]>
> > Cc: <CCP4BB@JISCMAIL.AC.UK>
> > Sent: 03 October 2008 16:38
> >
> > I agree completely with Frank, IMO this is not something you should be
> > doing, particularly as the likelihood function for intensities can handle
> > negative & zero intensities perfectly well (Randy assures me). Out of
> > interest I've updated my simulation where I calculate the average
> > intensity after correction by the various methods in use, to include the
> > case where you simply drop I <= 0. I didn't include this case before
> > because I didn't think anyone would be using it!
> > Here:
> >
> > S is the Wilson distribution parameter (true intensity), assuming an
> > acentric distribution;
> > Iav is the average uncorrected (raw) intensity;
> > I'av is the average intensity after thresholding at zero (i.e. I' = max(I,0));
> > I"av is the new case, the average ignoring I <= 0;
> > <Ja>av is the average of the Bayesian estimate assuming a uniform
> > distribution of the true intensity as the prior (Sivia & David);
> > <Jb>av is the average of the Bayesian estimate assuming an acentric Wilson
> > distribution of the true intensity as the prior (French & Wilson a la
> > TRUNCATE);
> > rmsE are the respective RMS errors (RMS difference between the respective
> > 'corrected' intensity and the true intensity).
> >
> > sigma(I) = 1 throughout.
> >
> >   S    Iav  rmsE   I'av  rmsE   I"av  rmsE  <Ja>av rmsE  <Jb>av rmsE
> >  0.0  0.00  1.00   0.40  0.71   0.80  1.00   0.90  1.00   0.00  0.00
> >  0.5  0.50  1.01   0.74  0.78   1.06  0.92   1.17  0.89   0.51  0.43
> >  1.0  1.00  1.00   1.16  0.84   1.40  0.90   1.50  0.86   1.00  0.64
> >  1.5  1.50  0.99   1.62  0.87   1.80  0.91   1.90  0.86   1.49  0.74
> >  2.0  2.00  1.00   2.10  0.90   2.25  0.93   2.34  0.88   2.00  0.80
> >  2.5  2.50  1.00   2.58  0.91   2.71  0.94   2.79  0.89   2.50  0.83
> >  3.0  3.00  1.00   3.07  0.93   3.19  0.95   3.26  0.91   3.00  0.86
> >  3.5  3.50  1.00   3.56  0.93   3.66  0.95   3.72  0.91   3.50  0.88
> >  4.0  4.00  0.99   4.05  0.94   4.13  0.95   4.20  0.91   4.00  0.89
> >  4.5  4.50  0.99   4.55  0.94   4.62  0.95   4.68  0.92   4.50  0.90
> >  5.0  5.00  1.00   5.04  0.95   5.08  0.96   5.17  0.93   5.00  0.91
> >
> > It can be seen that the new case (I"av) is worse than using all the
> > uncorrected intensities (Iav) in terms of average bias (difference between
> > average corrected I and true average S), and only marginally better in
> > terms of rmsE, and is significantly worse than including the negative
> > intensities as zero (I'av) on both counts, particularly for low average
> > intensity (<= 1 sigma).
> > The Bayesian-corrected intensities are not needed in practice for
> > refinement (but may be better for other purposes such as twinning tests)
> > because the likelihood function can handle the uncorrected negative & zero
> > intensities.
> >
> > Cheers
> >
> > -- Ian
> >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frank von Delft
> > > Sent: 03 October 2008 10:41
> > > To: Pavel Afonine
> > > Cc: CCP4BB@JISCMAIL.AC.UK
> > > Subject: Re: [ccp4bb] Reading the old literature / truncate / refinement programs
> > >
> > > > > I mentioned previously phenix.refine tosses your weak data
> > > > > if IMEAN, SIGIMEAN are chosen during refinement.
> > > >
> > > > phenix.refine does not automatically remove the data based on sigma
> > > > (it does it by user's request only). phenix.refine removes only
> > > > negative or zero values for Iobs (Fobs).
> > >
> > > That is in fact the same as removing based on sigma: standard practice
> > > has been for some time that no data is removed, ever.
> > >
> > > At the refinement stage, that is. Of course, we do remove data at
> > > merging, for various reasons which probably also need investigating
> > > (e.g. "high res cutoff" = truncation; cf. Free Lunch).
> > >
> > > phx.
>
> Disclaimer
> This communication is confidential and may contain privileged information
> intended solely for the named addressee(s). It may not be used or disclosed
> except for the purpose for which it has been sent. If you are not the
> intended recipient you must not review, use, disclose, copy, distribute or
> take any action in reliance upon it. If you have received this communication
> in error, please notify Astex Therapeutics Ltd by emailing [EMAIL PROTECTED]
> and destroy all copies of the message and any attached documents. Astex
> Therapeutics Ltd monitors, controls and protects all its messaging traffic
> in compliance with its corporate email policy.
> The Company accepts no liability or responsibility for any onward
> transmission or use of emails and attachments having left the Astex
> Therapeutics domain. Unless expressly stated, opinions in this message are
> those of the individual sender and not of Astex Therapeutics Ltd. The
> recipient should check this email and any attachments for the presence of
> computer viruses. Astex Therapeutics Ltd accepts no liability for damage
> caused by any virus transmitted by this email. E-mail is susceptible to
> data corruption, interception, unauthorized amendment, and tampering; Astex
> Therapeutics Ltd only sends and receives e-mails on the basis that the
> Company is not liable for any such alteration or any consequences thereof.
> Astex Therapeutics Ltd., Registered in England at 436 Cambridge Science
> Park, Cambridge CB4 0QA under number 3751674
>
> --
> Marc SCHILTZ  http://lcr.epfl.ch