Re: [ccp4bb] I/sigma continued

Eleanor Dodson Tue, 31 Mar 2009 01:13:44 -0700

I agree absolutely with James - be as succinct as you like in a table but include the verbose definition for each entry in the log file - or at the very least in the manual. It should be easy to search for with the table tag.

People will not go and read a reference.
 Eleanor

James Holton wrote:

I think the best way to deal with issues like this can be found in Strunk & White "The Elements of Style" (1918). Among other things, these authors put forward a rather simple yet often overlooked rule to writing in general, which I think applies equally well to computer programs:
"Be clear."
The sentence itself is an example of how brevity need not sacrifice clarity. Yes, you need labels in the table itself to be short, but there is space immediately below (and above) every table that (IMHO) ought to contain the definitions of each and every variable/abbreviation used in the table, spelled out no matter how obvious it may seem to the author. I can tell you many long and painful stories about me trying to figure out what some variable in some equation in some paper actually meant! Context is everything. If you are tight for space, cite a reference (such as the manual).
That, and scientists talking about such quantities in email, papers, etc. (such as myself) should also heed Strunk and White and also not just assume that everyone knows exactly what "structure factor" means as opposed to a "structure amplitude", let alone I/sigma. Indeed, the word "intensity" is an incredibly ill-defined unit all by itself, to the point of being useless. It can have units of photons, photons/s, photons/area/s, photons/area, energy/volume, and many many more. Often even in the same equation!
I would strongly advise against changing the "variable names" printed out in log files by SCALA and other programs, especially when a given name has persisted for a decade or more. Adding an "inline definition" is fine, but changing names not only breaks programs that were written to read these logs (and sometimes even humans reading the log), but it also confines the meaning of "I/SIGMA from SCALA" to a particular period in history.
So, what statistic do we want to look at? That depends on what you are trying to do with the data. There is no way for Phil to know this, so it is good that he prints out lots of different statistics. That said, when talking about the data quality requirements for structure solution by MAD/SAD, I suggest looking at I/sigma(I) where:
I          - merged intensity (proportional to photons) assigned to a reciprocal lattice point (hkl index)
sigma(I)   - the error assigned to I
Exactly what I/sigma(I) is required to solve a structure, or make some conclusion about a solved structure is a topic for another day.
-James Holton
MAD Scientist


Phil Evans wrote:
“I/sigma” statistics seem to be contentious & confusing (see recent discussions on CCP4BB), particularly in what the various measures should be called (and how they should be labelled in a table, where there is only room for a very short name). I thought it worth commenting on this issue at a little more length.
There are several interacting issues:
1) Statistics can be calculated either for individual observations Ihl or for intensities averaged over multiple (symmetry-related or replicate) measurements Ih(avg): both are useful, but they need to be distinguished
2) The statistic can be (a) the ratio of means <I>/<sigma> or (b) the mean of ratios <I/sigma>. These are not the same.
3) The “sigma” used in 2(a) can be either (a) the estimated corrected SD or (b) the RMS scatter of observations, i.e. the RMS deviation (which is itself generally used to estimate a “correction” to the SD). The RMS scatter cannot be used for 2(b) of course, since that needs individual sigmas for each reflection.
4) Values will depend on how many outliers have been rejected.
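To make point 2 concrete, here is a minimal sketch with made-up numbers showing that the ratio of means and the mean of ratios can differ by a factor of two for the same data. The intensities and sigmas are purely illustrative and do not come from any real dataset.

```python
# Hypothetical merged intensities and their estimated SDs (made-up values).
intensities = [1000.0, 100.0, 10.0, 1.0]
sigmas = [20.0, 10.0, 5.0, 2.0]

n = len(intensities)

# (a) ratio of means: <I>/<sigma>
ratio_of_means = (sum(intensities) / n) / (sum(sigmas) / n)

# (b) mean of ratios: <I/sigma>
mean_of_ratios = sum(i / s for i, s in zip(intensities, sigmas)) / n

print(f"<I>/<sigma> = {ratio_of_means:.2f}")
print(f"<I/sigma>   = {mean_of_ratios:.2f}")
```

The ratio of means is dominated by the strongest reflections, whereas the mean of ratios gives every reflection equal weight.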

For what it’s worth, Scala outputs two such statistics:-
(i) “I/sigma”: this is calculated for individual observations Ihl and is the (mean intensity <Ihl>)/(RMS scatter of Ihl). RMS scatter = RMS[Ihl – Ih(avg)]. This is some measure of the average significance of individual observations, but does not take into account multiplicity. In my new program under development (a Scala replacement) I have relabelled this column “I/RMS” but I don’t really know what best to call it. This value is a ratio of means (see 2(a) above).
(ii) “Mn(I/sd)”: this is the mean value of (Ih(avg)/sd(Ih(avg))), where Ih(avg) is the (weighted) average over all observations for reflection h, and sd(Ih(avg)) is the estimated SD of this average, after any “corrections” have been applied. This is, I think, the best estimate of “signal-to-noise ratio”, but does depend on realistic estimates of sd(Ih(avg)), which is not entirely straightforward (and certainly doesn’t allow for systematic errors!). This value is a mean of ratios (see 2(b) above).
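The two statistics in (i) and (ii) can be sketched as follows. This is only an illustration of the definitions, not Scala's actual code: the data layout (a dict from hkl to a list of (I, sd) pairs), the function names, and the use of inverse-variance weights for the "(weighted) average" are all assumptions.

```python
from math import sqrt

def merge(observations):
    """Weighted mean and its SD for one reflection.

    observations: list of (I, sd) pairs.
    Assumption: inverse-variance weights w = 1/sd**2.
    """
    weights = [1.0 / sd ** 2 for _, sd in observations]
    i_avg = sum(w * i for w, (i, _) in zip(weights, observations)) / sum(weights)
    sd_avg = sqrt(1.0 / sum(weights))
    return i_avg, sd_avg

def shell_statistics(reflections):
    """Return (I/RMS, Mn(I/sd)) for one resolution shell.

    reflections: dict mapping hkl -> list of (I, sd) observations.
    """
    all_intensities = []   # every individual Ihl
    squared_devs = []      # (Ihl - Ih(avg))**2 for every observation
    merged_ratios = []     # Ih(avg)/sd(Ih(avg)) for every reflection
    for observations in reflections.values():
        i_avg, sd_avg = merge(observations)
        merged_ratios.append(i_avg / sd_avg)
        for i, _ in observations:
            all_intensities.append(i)
            squared_devs.append((i - i_avg) ** 2)
    mean_i = sum(all_intensities) / len(all_intensities)
    rms_scatter = sqrt(sum(squared_devs) / len(squared_devs))
    i_over_rms = mean_i / rms_scatter                        # ratio of means
    mn_i_over_sd = sum(merged_ratios) / len(merged_ratios)   # mean of ratios
    return i_over_rms, mn_i_over_sd

# Made-up example: two reflections, two observations each.
example = {
    (1, 0, 0): [(100.0, 10.0), (110.0, 10.0)],
    (2, 0, 0): [(20.0, 5.0), (30.0, 5.0)],
}
i_over_rms, mn_i_over_sd = shell_statistics(example)
print(i_over_rms, mn_i_over_sd)
```

Note that the sketch does not guard against a zero RMS scatter (for example, a shell in which every reflection was measured only once), which a real program would have to handle.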
The “corrected” sd(Ihl) is calculated in Scala for each observation as
sd(Ihl)corrected = SdFac * sqrt{sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2}
with the parameters SdFac, SdB & SdAdd determined by trying to make the RMS normalised deviation Delta(hl) = (Ihl - Ih(avg))/sd(Ihl)corrected = 1.0 for all intensity ranges (different parameters for each run). If the sd estimates are correct, then the distribution of Delta(hl) should have SD = 1.0, and this “correction” tries to enforce this. This is more or less equivalent to making the RMS scatter == average SD. However the uncertainties in how best to estimate the real error do then influence the reliability of the Mn(I/sd) statistic (see (ii) above).
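The formula above translates directly into code. The sketch below only evaluates the correction and the RMS normalised deviation for given parameters; it does not attempt the parameter refinement, and the function names and example values are hypothetical.

```python
from math import sqrt

def corrected_sd(sd_i, i_hl, lp, sd_fac, sd_b, sd_add):
    """sd(Ihl)corrected = SdFac * sqrt(sd(I)**2 + SdB*Ihl*LP + (SdAdd*Ihl)**2)."""
    variance = sd_i ** 2 + sd_b * i_hl * lp + (sd_add * i_hl) ** 2
    return sd_fac * sqrt(variance)

def rms_normalised_deviation(observations, i_avg):
    """RMS of Delta(hl) = (Ihl - Ih(avg)) / sd(Ihl)corrected.

    observations: list of (Ihl, corrected sd) pairs for one reflection.
    """
    deltas = [(i - i_avg) / sd for i, sd in observations]
    return sqrt(sum(d ** 2 for d in deltas) / len(deltas))

# Made-up values: input sd of 3, intensity 100, LP factor 1.
sd = corrected_sd(sd_i=3.0, i_hl=100.0, lp=1.0, sd_fac=1.0, sd_b=0.0, sd_add=0.04)
print(sd)  # sqrt(9 + 0 + 16)

# Two observations scattered by +/-5 about their mean, each with corrected sd 5.
print(rms_normalised_deviation([(100.0, 5.0), (110.0, 5.0)], i_avg=105.0))
```

A fitting procedure would then adjust SdFac, SdB and SdAdd until the RMS normalised deviation is close to 1.0 in every intensity range.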
So what statistics do we want to look at? Probably the main reason for looking at signal/noise statistics is to choose a “real resolution” cutoff, from some sort of signal/noise ratio. It isn’t clear (to me) what is the best way of doing this, and it is particularly difficult if the data are significantly anisotropic. The multiplicity needs to be taken into account, so the individual “I/sigma” (see (i) above) isn’t the best guide. Personally, I generally cut data at around the point where Mn(I/sd) =~ 2, but I would cut off at <2 for anisotropic data. I also find a useful guide from the correlation coefficient between Ih(avg) (Imean) pairs in half-datasets (plotted by Scala): the CC should be >0.5 at least, I think.
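A half-dataset correlation coefficient of this kind can be sketched as below. How Scala actually divides the observations into halves is not stated here, so the random split, the unweighted means, and the function names are assumptions made for illustration only.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def half_dataset_cc(reflections, seed=0):
    """CC between the Imean values of two random half-datasets.

    reflections: dict mapping hkl -> list of observed intensities.
    Reflections with fewer than two observations are skipped.
    """
    rng = random.Random(seed)
    half1, half2 = [], []
    for intensities in reflections.values():
        if len(intensities) < 2:
            continue
        shuffled = list(intensities)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half1.append(sum(shuffled[:mid]) / mid)
        half2.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson(half1, half2)
```

In practice this would be evaluated per resolution shell and the cutoff placed where the CC falls below the chosen threshold (0.5 in the suggestion above).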
Note that the overall value of any of these statistics over all resolution ranges is not very useful and can be confusing, depending on the distribution of intensities, since it mixes up strong low resolution data (high signal/noise) with weak high resolution data (low signal/noise).
That leaves the question of how to label these statistics in a consistent, clear and concise way: suggestions?
Phil Evans
