Hi Marc,

Sorry it's taken me so long to get back to you on this. I've had it on my mind for a while that your comments deserved a response, but you raised many interesting points, and the objections you raised to what I was doing required that I think more carefully about the statistics.
You say, and I agree, that what matters in the end are the individual intensities and amplitudes. Nevertheless, the purpose of computing statistics such as Rmerge, mean(I/SU(I)), etc. is to *condense* this information for presentation to the user: if asked to summarise the data quality you could hardly present a list of all the structure factors, even though that list may contain all the information required to assess the data quality.

I was originally focused on the shell intensity averages because George Sheldrick raised an objection to my suggestion of using the Bayesian-corrected intensities, which are actually calculated by the TRUNCATE program but not output to the MTZ file (see subroutines NEGIAS & NEGICS in truncate.f). He thought that the corrected shell intensity averages would be biased and would give a misleading indication of data quality, particularly in the highest resolution shells, where it's clearly of some importance in deciding where to make a high resolution cutoff of the data. So I wanted to show that the average of the truncated Is using the French & Wilson method is unbiased.

I've since retracted the idea of using corrected intensities in refinement, as I think the only effect would be to reduce the R factors for the weak data without having any effect on the model. I do think corrected intensities would have some value in twinning tests, particularly those which rely on testing the proportion of de-twinned negative intensities generated for an assumed twin fraction (e.g. Britton's test). As you say, the shell average of the intensity is relevant to the Wilson plot, but it's also relevant to the calculation of normalised amplitudes (Es) using Karle & Hauptman's 'k-curve' method (implemented in ECALC), which are widely used and indeed critical in calculations of structure factor probabilities, so I think it cannot be dismissed as of little importance in crystallography.
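To make the kind of correction I mean concrete, here is a minimal sketch of a French & Wilson-style Bayesian correction for an acentric reflection. This is my own reconstruction, not the TRUNCATE code: the function name `fw_corrected_intensity` is hypothetical, and it simply computes the posterior mean of the true intensity J under an acentric Wilson (exponential) prior with parameter S and a normal error model for the observed intensity, by brute-force quadrature:

```python
# Sketch (not the TRUNCATE implementation) of a F&W-style correction:
# posterior mean E[J | Iobs, sigma, S] with an acentric Wilson prior
# p(J) = exp(-J/S)/S and a normal error model for the observed I.
import numpy as np

def fw_corrected_intensity(i_obs, sigma, S, n=20001, width=10.0):
    """Posterior mean of the true intensity J for one acentric reflection."""
    # Integrate only over the physically allowed range J >= 0.
    j = np.linspace(0.0, max(i_obs, 0.0) + width * sigma, n)
    # log of (normal likelihood x Wilson prior), up to an additive constant
    log_w = -0.5 * ((i_obs - j) / sigma) ** 2 - j / S
    w = np.exp(log_w - log_w.max())  # stabilise before normalising
    return float(np.sum(j * w) / np.sum(w))
```

Note that even a negative observed intensity yields a small positive corrected value, since the prior puts all its mass on J >= 0.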
As you point out, the related average of I/SU(I) is the statistic widely used to guide the decision of where to apply the high resolution cutoff. For the high resolution shells where this matters, the intensities are of course very weak, so SU(I) is mainly determined by the X-ray background and will be roughly constant within a given shell. The main reason the background is not constant at a given d-spacing for strong reflections is that the 'acoustic' contribution to the diffuse scattering background is correlated with the Bragg intensity. However, if the Bragg intensity is weak, the TDS intensity is weaker by an order of magnitude and so won't contribute significantly to the background in comparison with other sources of background X-rays, such as scattering from the cryo-glass and the air path, and noise in the detector electronics; the first two of these still vary with Bragg angle but have no other directional dependence. Hence it's reasonable to assume that SU(I) is constant for the outer shells, so I arbitrarily set it uniformly to 1, and am therefore implicitly calculating the average I/SU(I).

On the question of expectations: what I'm doing is calculating expectations of various quantities related to the Is, as 'corrected' by the various methods which have been proposed (I use 'corrected' in a loose sense, since one method merely consists of rejecting all Is <= 0). So for the F&W method, for example, I'm calculating the expectation of the bias of the corrected I and the sqrt(expected squared error) (see here for the definition: http://en.wikipedia.org/wiki/Mean_squared_error), given only the Wilson distribution parameter S for the shell, and assuming the form of the Wilson PDF (acentric) and the assumed error distribution (normal) for the observed Is.
I am certainly not averaging over reflections. The method of integration that I'm using (Monte Carlo) requires the generation of randomised discrete points in the sample space, which you could, I suppose, identify as simulated reflections; nevertheless it is still an integral, and is therefore in my view a perfectly valid expectation. Note that the Wilson PDF is used both to calculate the correction in the F&W case and to calculate the expectation based on the assumed distribution of intensities given S. What you are describing is the expectation given S *and* given the observed I, assuming the Wilson PDF in the F&W case. Both of these seem to me to be equally valid calculations of expectation; they just make different assumptions. However, to show the results for the variation with I as well as S, I would need a much bigger table, and I was trying to keep the information presented as condensed as possible (this was in fact the rationale for making the simplifying assumption that SU(I) = 1, as it means there's one less variable to vary!).

You say that my calculations are flawed because I assume that in a real situation I would know S precisely. First, I would say that taking the average observed I in shells should in most cases give a perfectly good estimate, and indeed the TRUNCATE program has worked in practice with real data for ~30 years with very few problems (at least, AFAIK, no problems of getting S < 0 have been reported), though I'm sure you can contrive simulated data which will cause it to fail. Second, if you read my first postings on this subject you'll see that I anticipated this very problem and suggested that S should be calculated iteratively from the corrected Is, which would make it impossible for S to be < 0. Though I haven't tried this yet, I think it actually requires that the new estimate of S be unbiased, otherwise I can see that it might increase without bound.
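To spell out the kind of Monte Carlo expectation I mean, here is a minimal sketch (my own, not the actual simulation script) for the simplest of the 'corrections', thresholding at zero; the other methods would just replace the correction line. True intensities J are drawn from the acentric Wilson (exponential) distribution with parameter S, normal noise with sigma = 1 is added, and the bias of the corrected shell average and the RMS error against the true J are estimated:

```python
# Sketch of a Monte Carlo expectation given S (my own reconstruction,
# not the original simulation): the 'correction' shown is I' = max(I, 0).
import numpy as np

def mc_bias_rmse(S, sigma=1.0, n=1_000_000, seed=0):
    """Bias of the corrected shell average, and RMS error vs the true J."""
    rng = np.random.default_rng(seed)
    # True intensities from the acentric Wilson (exponential) distribution
    J = rng.exponential(S, n) if S > 0 else np.zeros(n)
    I_obs = J + rng.normal(0.0, sigma, n)   # normal measurement error
    I_corr = np.maximum(I_obs, 0.0)         # the I' = max(I, 0) 'correction'
    bias = I_corr.mean() - S                # bias of the corrected average
    rmse = np.sqrt(np.mean((I_corr - J) ** 2))
    return bias, rmse
```

For S = 0 the exact values are E[max(e,0)] = 1/sqrt(2*pi) ≈ 0.40 and RMS error sqrt(1/2) ≈ 0.71, which is what the I'av column of the table quoted further down shows.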
As indicated above, the RMS error is not the sample SD about the mean, as you suppose. 'Error' is defined in what I believe is the conventional way, namely the difference between the estimate and the true value, so the RMS error is a direct measure of the accuracy of the estimate (the advantage of a simulation is of course that one knows the 'true' value, a luxury not available in real experiments!). You could of course argue that the use of the RMS is not ideal: it should perhaps be the mean absolute error, or weighted in some way. However, I suspect that using alternative measures would not change the conclusions fundamentally.

Further on, you seem to recognise that the values in the table represent I/SU(I), and you say that the S&D results are perfectly acceptable because then I/SU(I) ~ 1 (actually 0.9) for the case S=0. However, it's clear from the table that the bias remains significant (0.2) even up to I/SU(I) ~ 5. If we take the typical cutoff of I/SU(I) > 1.5 in the outer shell, this means that you would have to change the cutoff value to 1.9 in order to get the same effect, and I think that many users would find this very confusing.

However, even supposing the table values were the sample SD (which is always <= the RMS error; see the second equation in the Wikipedia article), I'm rather puzzled that you say:

> The F&W procedure completely distorts the second moment about the mean of the
> intensity distributions. For weak data it is close to 0, whereas the true second
> moment about the mean of the distribution is 1.0. Now, I would find this much more
> worrisome than a slightly distorted shell average.

Surely the effect of including prior information is invariably to reduce the uncertainties of the derived parameters, often drastically? And doesn't a reduced uncertainty or RMS error represent an improvement?
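The relation between the two quantities is just the usual decomposition MSE = bias^2 + variance, which is why the sample SD can never exceed the RMS error. A toy numerical check (invented numbers, purely illustrative):

```python
# Toy illustration (my own invented numbers) of MSE = bias^2 + variance:
# the sample SD measures spread about the mean, the RMS error measures
# deviation from the *true* value, so sample SD <= RMS error always.
import numpy as np

rng = np.random.default_rng(1)
true_value = 2.0
# A biased estimator: offset by +0.5, plus unit normal noise
estimates = true_value + 0.5 + rng.normal(0.0, 1.0, 1_000_000)

bias = estimates.mean() - true_value
sd = estimates.std()                                    # spread about the sample mean
rmse = np.sqrt(np.mean((estimates - true_value) ** 2))  # deviation from the truth

# The decomposition holds as an algebraic identity (population SD, ddof=0)
assert abs(rmse**2 - (bias**2 + sd**2)) < 1e-6
assert sd <= rmse
```

So a 'small' rmsE in the table is not evidence of a distorted distribution; it simply means the corrected values sit closer to the truth.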
For example, see this nice paper by Parisini, Sheldrick et al.: http://journals.iucr.org/d/issues/1999/11/00/jn0061/jn0061.pdf, where they show that the effect of geometric restraints in refinement is to drastically reduce the parameter uncertainties compared with unrestrained refinement (see Fig. 2). From my own experience I know that a typical co-ordinate uncertainty at 1.5 Ang resolution is ~0.2 Ang for unrestrained refinement, but ~0.02 Ang for restrained: an order of magnitude less! Would you say the large uncertainties from the unrestrained refinement are 'distorted' by the use of restraints?

Again, apologies for the long delay, and for the essay!

Cheers

-- Ian

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: 06 October 2008 21:17
> To: Ian Tickle
> Cc: CCP4BB@JISCMAIL.AC.UK
> Subject: Re: [ccp4bb] Reading the old literature / truncate / refinement programs
>
> Hi Ian,
>
> I did not follow up our recent debate about the respective merits of various
> "truncate" procedures, in particular the comparison of the French & Wilson
> (1978) and Sivia & David (1994) methods. Since you have extended the
> previous simulations, I feel that the discussion should be continued.
>
> I still cannot understand why you are so focused on the shell averages
> (i.e. intensity averages over many reflections in intensity or resolution
> shells). This is apparently the criterion that you use to claim that the FW
> method is superior to all others.
>
> The truncate method (whichever prior is used) is a procedure to estimate
> amplitudes for individual reflections. This is the goal of the whole
> procedure. So we have an estimator E(Im,Sm) for the true intensity J (or
> amplitude sqrt(J)) of one reflection, given its "measured" intensity Im and
> sigma Sm.
> According to the usual statistical definition of bias, this estimator is
> unbiased if its expectation value (NOT its average over many different
> reflections) is equal to the true value J (or sqrt(J)), for all values of
> J. We seem to agree that neither the estimator proposed by F&W, nor the one
> proposed by S&D, is unbiased, and I gave a simple example in an earlier
> mail: for J=0, both estimators will return an estimate E > 0, whatever the
> measured data (Im,Sm) are. They are thus biased.
>
> Now, you seem to be highly preoccupied by the intensity averages over many
> different reflections in a resolution/intensity shell. Of course, both the
> F&W and the S&D methods also return an estimate for such shell averages,
> and your simulations indicate that the F&W estimates for these shell
> averages are unbiased. But the quantities that are used in all subsequent
> computations are individual reflection intensities/amplitudes. The shell
> averages are not used in any important crystallographic computation (apart
> maybe from the Wilson plot). So what matters really is to get good
> estimates for individual reflection intensities/amplitudes. Shell averages
> are not used in crystallographic refinements: we refine against individual
> reflections! They are not used in phasing: we phase individual reflections.
> So what matters is the bias on individual reflections, not on shell
> averages. I think that you cannot simply claim that, because the F&W method
> returns unbiased shell averages, it is necessarily superior to other
> methods.
>
> In that sense, your previous statement that
>
> > the average bias of <J> is the same as the bias of the average <J>
>
> may be formally correct, but I doubt that it is of any relevance. Because
> even if the average bias is zero, this does not mean that the estimator is
> unbiased. Otherwise, I would suggest a truncate procedure where the
> intensities of all reflections are simply set equal to their shell
> averages.
> Clearly, this would yield unbiased estimates for the shell averages (and
> would therefore be a "good" procedure according to your criteria), but the
> estimates for individual reflection intensities would be highly biased.
>
> Also, your simulations are flawed by the fact that you assume you know S
> exactly. However, this is not the case in reality. In the F&W procedure, S
> is estimated in shells of resolution from the measured intensities. This
> turns S into a random variable and you will never have S=0 exactly. If you
> now imagine the case of data collected and integrated when there is no
> diffraction at all, you would get some random number for S in each shell.
> Half of the time it would be negative and so you could not even apply the
> F&W procedure. I have tested this: TRUNCATE (with the option truncate yes)
> fails on such data.
>
> More importantly, to decide whether your crystal is diffracting up to a
> certain resolution (or whether there is any diffraction at all), no one
> would look at the average I or average F. This is not a meaningful quantity
> because the I's or F's are not on an absolute scale and because, in any
> event, they would have to be gauged against a measure of their uncertainty.
> Therefore, what really matters is the average I/sigI. This is the sensible
> criterion to decide whether you have a signal (or crystal) or not. The
> average I is of little use for this purpose. Now, if there is no crystal at
> all, the S&D procedure returns average I/sigI (corrected I's) of 1 and I
> believe that most crystallographers would then conclude that the data are
> at the noise level (i.e. there is no measurable diffraction). No one would
> look at the average I. So, contrary to what you wrote earlier, the S&D
> results are perfectly acceptable, even in the case when there is no
> diffraction (or crystal) at all.
>
> Regarding your simulations, I am much more concerned about the columns
> which you label "rmsE".
> You state that this is the RMS error for each average. I presume (but I am
> not sure; please provide a formula) that this is the sample second moment
> about the sample mean (i.e. the sample variance), the sample being all
> reflections in a given shell? If so, then the F&W procedure appears to be
> somewhat problematic for weak data. It seems that the F&W procedure
> completely distorts the second moment about the mean of the intensity
> distributions. For weak data it is close to 0, whereas the true second
> moment about the mean of the distribution is 1.0. Now, I would find this
> much more worrisome than a slightly distorted shell average.
>
> So I am not yet convinced that your simulations demonstrate in any way the
> superiority of the FW method over other methods.
>
> Cheers,
>
> Marc
>
> Ian J. Tickle wrote:
>
> > From: "Ian Tickle" <[EMAIL PROTECTED]>
> > To: "Frank von Delft" <[EMAIL PROTECTED]>, "Pavel Afonine" <[EMAIL PROTECTED]>
> > Cc: <CCP4BB@JISCMAIL.AC.UK>
> > Sent: 03 October 2008 16:38
> >
> > I agree completely with Frank, IMO this is not something you should be
> > doing, particularly as the likelihood function for intensities can handle
> > negative & zero intensities perfectly well (Randy assures me). Out of
> > interest I've updated my simulation where I calculate the average
> > intensity after correction by the various methods in use, to include the
> > case where you simply drop I <= 0. I didn't include this case before
> > because I didn't think anyone would be using it!
> > Here:
> >
> > S is the Wilson distribution parameter (true intensity), assuming an
> > acentric distribution;
> > Iav is the average uncorrected (raw) intensity;
> > I'av is the average intensity after thresholding at zero (i.e. I' = max(I,0));
> > I"av is the new case, the average ignoring I <= 0;
> > <Ja>av is the average of the Bayesian estimate assuming a uniform
> > distribution of the true intensity as the prior (Sivia & David);
> > <Jb>av is the average of the Bayesian estimate assuming an acentric Wilson
> > distribution of the true intensity as the prior (French & Wilson a la
> > TRUNCATE);
> > rmsE are the respective RMS errors (RMS difference between the respective
> > 'corrected' intensity and the true intensity).
> >
> > sigma(I) = 1 throughout.
> >
> >   S    Iav  rmsE   I'av  rmsE   I"av  rmsE  <Ja>av rmsE  <Jb>av rmsE
> >  0.0  0.00  1.00   0.40  0.71   0.80  1.00   0.90  1.00   0.00  0.00
> >  0.5  0.50  1.01   0.74  0.78   1.06  0.92   1.17  0.89   0.51  0.43
> >  1.0  1.00  1.00   1.16  0.84   1.40  0.90   1.50  0.86   1.00  0.64
> >  1.5  1.50  0.99   1.62  0.87   1.80  0.91   1.90  0.86   1.49  0.74
> >  2.0  2.00  1.00   2.10  0.90   2.25  0.93   2.34  0.88   2.00  0.80
> >  2.5  2.50  1.00   2.58  0.91   2.71  0.94   2.79  0.89   2.50  0.83
> >  3.0  3.00  1.00   3.07  0.93   3.19  0.95   3.26  0.91   3.00  0.86
> >  3.5  3.50  1.00   3.56  0.93   3.66  0.95   3.72  0.91   3.50  0.88
> >  4.0  4.00  0.99   4.05  0.94   4.13  0.95   4.20  0.91   4.00  0.89
> >  4.5  4.50  0.99   4.55  0.94   4.62  0.95   4.68  0.92   4.50  0.90
> >  5.0  5.00  1.00   5.04  0.95   5.08  0.96   5.17  0.93   5.00  0.91
> >
> > It can be seen that the new case (I"av) is worse than using all the
> > uncorrected intensities (Iav) in terms of average bias (difference between
> > average corrected I and true average S), and only marginally better in
> > terms of rmsE, and is significantly worse than including the negative
> > intensities as zero (I'av) on both counts, particularly for low average
> > intensity (<= 1 sigma).
> > The Bayesian-corrected intensities are not needed in practice for
> > refinement (but may be better for other purposes such as twinning tests)
> > because the likelihood function can handle the uncorrected negative & zero
> > intensities.
> >
> > Cheers
> >
> > -- Ian
> >
> > > -----Original Message-----
> > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Frank von Delft
> > > Sent: 03 October 2008 10:41
> > > To: Pavel Afonine
> > > Cc: CCP4BB@JISCMAIL.AC.UK
> > > Subject: Re: [ccp4bb] Reading the old literature / truncate / refinement programs
> > >
> > > > > I mentioned previously phenix.refine tosses your weak data
> > > > > if IMEAN, SIGIMEAN are chosen during refinement.
> > > >
> > > > phenix.refine does not automatically remove the data based on sigma
> > > > (it does it by user's request only). phenix.refine removes only
> > > > negative or zero values for Iobs (Fobs).
> > >
> > > That is in fact the same as removing based on sigma: standard practice
> > > has been for some time that no data is removed, ever.
> > >
> > > At the refinement stage, that is. Of course, we do remove data at
> > > merging, for various reasons which probably also need investigating
> > > (e.g. "high res cutoff" = truncation; cf. Free Lunch).
> > >
> > > phx.
>
> Disclaimer
> This communication is confidential and may contain privileged information
> intended solely for the named addressee(s). It may not be used or disclosed
> except for the purpose for which it has been sent. If you are not the
> intended recipient you must not review, use, disclose, copy, distribute or
> take any action in reliance upon it. If you have received this communication
> in error, please notify Astex Therapeutics Ltd by emailing [EMAIL PROTECTED]
> and destroy all copies of the message and any attached documents. Astex
> Therapeutics Ltd monitors, controls and protects all its messaging traffic
> in compliance with its corporate email policy.
> The Company accepts no liability or responsibility for any onward
> transmission or use of emails and attachments having left the Astex
> Therapeutics domain. Unless expressly stated, opinions in this message are
> those of the individual sender and not of Astex Therapeutics Ltd. The
> recipient should check this email and any attachments for the presence of
> computer viruses. Astex Therapeutics Ltd accepts no liability for damage
> caused by any virus transmitted by this email. E-mail is susceptible to
> data corruption, interception, unauthorized amendment, and tampering; Astex
> Therapeutics Ltd only sends and receives e-mails on the basis that the
> Company is not liable for any such alteration or any consequences thereof.
> Astex Therapeutics Ltd., Registered in England at 436 Cambridge Science
> Park, Cambridge CB4 0QA under number 3751674
>
> --
> Marc SCHILTZ  http://lcr.epfl.ch