James,
Graeme is right. While <I> does indeed (approximately) follow a Gaussian, <|I-<I>|> cannot. The absolute value operator keeps it positive (it reflects the negative half across the origin), so it follows a half-Gaussian, whose mean cannot be zero unless the variance is zero. For standard normals (variance = 1), the mean of |I-<I>| is 0.798, just as Graeme said. You can do the integration. So the fact that <|I-<I>|>/<I> is unstable at low I/sigma is *not* a consequence of the peculiar divergent properties of a Cauchy (Lorentzian). Rather, it is a consequence of E(I) being zero. And, as your calculator knows, division by zero is undefined (or infinite, depending on your proclivities).
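
For anyone who would rather not do the integration by hand, here is a minimal Python/NumPy sketch (purely illustrative; the sample size and seed are arbitrary) that checks the 0.798 figure both analytically and by simulation:

import numpy as np
from scipy.integrate import quad

# Closed form: E|X| = sqrt(2/pi) for X ~ N(0,1)
print(np.sqrt(2.0 / np.pi))                    # 0.79788...

# The integration Douglas mentions: 2 * integral of x*phi(x) over [0, inf)
phi = lambda x: np.exp(-x * x / 2.0) / np.sqrt(2.0 * np.pi)
val, _ = quad(lambda x: 2.0 * x * phi(x), 0.0, np.inf)
print(val)                                     # 0.79788...

# Monte Carlo sanity check
x = np.random.default_rng(0).standard_normal(1_000_000)
print(np.abs(x).mean())                        # ~0.798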
Cheers,
Douglas
On Jul 15, 2009, at 5:03 PM, James Holton wrote:
I tried plugging I/sigma = 0 into your formula below, but my
calculator returned "EEEEEEEE"
-James Holton
MAD Scientist
Graeme Winter wrote:
James,
I'm not sure you're completely right here. It's reasonably straightforward to show that

Rmerge ~ 0.7979 / <I/sigma>

(Weiss & Hilgenfeld, J. Appl. Cryst. 1997), which can be verified from e.g. the Scala log file, provided that the *unmerged* I/sigma is considered:

http://www.ccp4.ac.uk/xia/rmerge.jpg
This example did not exhibit much radiation damage so it does
represent a best case.
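
The relation is also easy to reproduce with a toy Monte Carlo. A sketch under assumed conditions (Gaussian errors with sigma = 1 and a made-up multiplicity; nothing below comes from the Scala run in the plot):

import numpy as np

rng = np.random.default_rng(1)
n_refl, mult = 20_000, 20   # unique reflections, observations per reflection

for i_over_sig in (2.0, 5.0, 10.0, 20.0):
    obs = i_over_sig + rng.standard_normal((n_refl, mult))  # sigma = 1
    means = obs.mean(axis=1, keepdims=True)
    rmerge = np.abs(obs - means).sum() / obs.sum()
    # finite multiplicity pulls Rmerge below the asymptote by sqrt((n-1)/n)
    print(f"<I/sig> = {i_over_sig:4.1f}  Rmerge = {rmerge:.3f}"
          f"  vs 0.7979/<I/sig> = {0.7979 / i_over_sig:.3f}")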
For (unmerged) I/sigma < 1 the statistics do tend to become unreliable, which I found was best demonstrated by inspection of the E^4 plot: up to I/sigma ~ 1 it was ~ 2, but it increased substantially thereafter. I had assumed this reflected the fact that at low I/sigma the "intensities" were drawn from a Gaussian distribution, rather than from the exponential (Wilson) distribution that would be expected for true intensities.
By repeatedly selecting small random subsets* of unique reflections in the example data set and merging them separately, I found that the "error" on the Rmerge above for the weakest reflections was about 0.05. Since this retains the same multiplicity, and the mean value converges on the complete-data-set statistics, I believe the comparisons are valid.
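
A plain-NumPy stand-in for that subset experiment (synthetic weak data rather than the real data set, so the subset size and the numbers that come out are illustrative only):

import numpy as np

rng = np.random.default_rng(2)
n_refl, mult = 50_000, 10
obs = 0.5 + rng.standard_normal((n_refl, mult))   # <I/sig> = 0.5, sigma = 1

def rmerge(block):
    means = block.mean(axis=1, keepdims=True)
    return np.abs(block - means).sum() / block.sum()

# merge many small random subsets of unique reflections separately
subset_rs = [rmerge(obs[rng.choice(n_refl, 500, replace=False)])
             for _ in range(200)]
print("full set:", round(rmerge(obs), 3))
print("subsets : mean %.3f  sd %.3f" % (np.mean(subset_rs), np.std(subset_rs)))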
I guess I "don't believe you" :o)
Best,
Graeme
* CCTBX is awesome for this kind of thing!
2009/7/15 James Holton <jmhol...@lbl.gov>:
Actually, if I/sd < 3, then Rmerge, Rpim, Rrim, etc. are all infinity. Doesn't matter what your redundancy is.

Don't believe me? Try it.

The extreme case is I/sd = 0, and as long as there is some background (and, let's face it, there always is), the "observed" spot intensity will be equally likely to be positive or negative, with a (basically) Gaussian distribution.
So, if you generate, say, ten Gaussian-random numbers (centered on zero), take their average value <I>, compute the average deviation from that average, <|I-<I>|>, and then divide <|I-<I>|>/<I>, you will get the "Rmerge" expected for I/sd = 0 at a redundancy of 10. Problem is, if you do this again with a different random number seed, you will get a very different Rmerge. Even if you do it with a million different random number seeds and compute the "average Rmerge", you will always get wildly different values, some positive, some negative. And it doesn't matter how many "data points" you use to compute the Rmerge: averaging a million Rmerge values will give a different answer than averaging a million and one.
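
That recipe takes only a few lines to try; a direct transcription of the experiment above (the number of seeds is arbitrary):

import numpy as np

for seed in range(6):
    i = np.random.default_rng(seed).standard_normal(10)  # I/sd = 0, n = 10
    r = np.abs(i - i.mean()).mean() / i.mean()           # <|I-<I>|>/<I>
    print(f"seed {seed}: Rmerge = {r:+10.2f}")

# Typical behavior: the values lurch between large positive and large
# negative numbers, because the denominator <I> hovers around zero.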
The reason for this numerical instability is that both <I> and <|I-<I>|> follow a Gaussian distribution that is centered at zero, and the ratio of two numbers like this has a Lorentzian distribution. The Lorentzian looks a lot like a Gaussian, but has much fatter tails. Fat enough that the Lorentzian distribution has NO MEAN VALUE. Seriously. It is hard to believe that the average value of something that is equally likely to be positive or negative could be anything but zero, but for all practical purposes you can never arrive at the average value of something with a Lorentzian distribution. At least not by taking finite samples. So, no matter what the redundancy, you will always get a different Rmerge.
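
The no-mean-value claim is easy to watch in action. A short sketch comparing the running mean of Cauchy samples against Gaussian ones (sample sizes are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
cauchy = rng.standard_cauchy(1_000_000)
normal = rng.standard_normal(1_000_000)
for k in (10**3, 10**4, 10**5, 10**6):
    print(f"n = {k:>7}  Cauchy mean = {cauchy[:k].mean():+8.3f}"
          f"  Gaussian mean = {normal[:k].mean():+8.4f}")
# The Gaussian column settles toward zero; the Cauchy column never does.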
However, if <I> is not centered on zero (I/sd > 0), then the ratio of the two Gaussian-random numbers starts to look like a Gaussian itself, and this distribution does have a mean value (Rmerge will be "reproducible"). But this does not happen all at once. The tails start to shrink by I/sd = 1, they are even smaller at I/sd = 2, and the distribution finally loses all "Lorentzian character" when I/sd >= 3. Only then is Rmerge a meaningful quantity.
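
The gradual stabilization shows up in the same toy model if you track the scatter of Rmerge across trials as I/sd grows (median and IQR are reported below because, as argued above, the mean does not exist at I/sd = 0):

import numpy as np

rng = np.random.default_rng(4)
for i_over_sd in (0.0, 1.0, 2.0, 3.0, 5.0):
    obs = i_over_sd + rng.standard_normal((2000, 10))  # 2000 trials, n = 10
    means = obs.mean(axis=1, keepdims=True)
    r = np.abs(obs - means).mean(axis=1) / means[:, 0]  # per-trial "Rmerge"
    p75, p25 = np.percentile(r, [75, 25])
    print(f"I/sd = {i_over_sd:3.1f}  median Rmerge = {np.median(r):7.2f}"
          f"  IQR = {p75 - p25:7.2f}")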
So, perhaps our "forefathers" who first instituted the practice of a 3-sigma cutoff for all intensities actually DID know what they were doing! All R statistics (including Rcryst and Rfree) are unstable in this way for weak data, but sometime in the early 1990s the practice of computing R factors on "all data" crept into the field. I'm not saying we should not use all data; maximum-likelihood refinement uses sigmas properly, and "weak" data are powerful restraints. However, I will go on record as suggesting that a 3-sigma cutoff should be used for all R statistics. There is still a place in your PDB file to put the sigma cutoff you used for your R factors.
-James Holton
MAD Scientist
Lijun Liu wrote:
Hi Frank,
This is off the original topic, but important to clarify. If I muddled the concepts, I apologize.
Outer shell Rmerge will always be very high:
----------
True! Especially when I/Sig ~ 1 or less.
Only I/sigI (and completeness, although it's related) is really
relevant
for deciding high resolution cutoff.
---------
Normally I use I/sigI = 2.0 for the resolution cutoff. For this "accuracy" (please do not ask me the exact meaning of sigma; too many things contribute to it, including hardware, software, protocol, strategy, ...), the average measuring error for the reflections can be expected to be the inverse of this number, 1/2.0, i.e. 50%, which in general suggests that Rmerge should not go much past this value if inclusion of the data is to be meaningful. (Please read this carefully, since I do not want to confuse two different concepts.) Otherwise you are merging data with a merging error much larger than the measuring error of the data. Although the estimation of Sig(I) is difficult, and Sig(I) itself may carry a large error, when I/sig ~ 3 an Rmerge of 70% still seems too high to accept.
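
(For rough calibration: Graeme's relation above, Rmerge ~ 0.7979/<I/sigma>, would predict Rmerge ~ 0.40 at <I/sigI> = 2 and ~ 0.27 at <I/sigI> = 3, so 70% is indeed far above the noise-only expectation.)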
Rmerge is well known to be a weak indicator, but it is not just a mathematical issue, and it is certainly not garbage. It should be used together with other indicators (I/sig, redundancy, ...). I agree with Ian that all data should be included, provided the quality is guaranteed.
I have not combed through the history of refinement software and its philosophy, but today it seems all the prevailing refinement packages use resolution bins for shelling (I know there is ample theoretical ground for doing so), which is the source of the RESOLUTION CUTOFF and of some problems arising from it, for example the Rmerge issue. I would appreciate being told if any software has ever used I, I/SigI, F, F/SigF or something else for binning, especially in the early days of refinement package development. RESOLUTION BINNING might not be a must? :D
Best regards.
Lijun Liu, PhD
http://www.uoregon.edu/~liulj/