Many good points have been made on this thread so far, but mostly
addressing the question "how many free reflections is enough", whereas
the original question was "how many is too many".
I suppose a reasonable definition of "too many" is when the error
introduced into the map by leaving out all those reflections starts to
become a problem. It is easy to calculate this error: it is simply the
difference between the map made using all reflections (regardless of
Free-R flag) and the map made with 5% of the reflections left out. Of
course, this "difference map" is identical to a map calculated using
only the 5% "free" reflections, setting all others to zero. The RMS
variation of this "error map" is actually independent of the phases used
(Parseval's theorem), and it ends up being:
RMSerror = RMSall * sqrt( free_frac )
where:
RMSerror is the RMS variation of the "error map"
RMSall is the RMS variation of the map calculated with all reflections
free_frac is the fraction of hkls left out of the calculation.
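This Parseval relationship is easy to check numerically with a toy
calculation (pure numpy on a grid of random coefficients, no
crystallographic program involved; the grid size, seed, and 5% flagging
below are just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64  # grid points per axis

# Toy "structure factors": random complex coefficients on an n^3 grid.
F_all = rng.normal(size=(n, n, n)) + 1j * rng.normal(size=(n, n, n))

# Flag a random ~5% of hkls as "free".
free_frac = 0.05
free = rng.random((n, n, n)) < free_frac

# Map from all Fs, and map with the free Fs set to zero.
F_work = F_all.copy()
F_work[free] = 0.0
map_all = np.fft.ifftn(F_all)
map_work = np.fft.ifftn(F_work)

# The "error map" is their difference; Parseval says its RMS relative
# to the full map depends only on the fraction of coefficients zeroed.
rms_all = map_all.std()
rms_err = (map_all - map_work).std()
print(rms_err / rms_all, np.sqrt(free_frac))  # the two agree to ~1%
```

(A real crystallographic map would need Hermitian-symmetric Fs, but
Parseval holds for the complex transform all the same, which is why the
ratio doesn't care about the phases.)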
So, with 5% free reflections, the errors induced in the electron density
will have an RMS variation that is 22.3% of the full map's RMS
variation, or 0.223 "sigma units". 1% free reflections will result in
RMS 10% error, or 0.1 "sigmas". This means, for example, that with 5%
free reflections a 1.0 "sigma" peak might come up as a 1.2 or 0.8
"sigma" feature. Note that these are not the "sigmas" of the Fo-Fc map,
(which changes as you build) but rather the "sigma" of the Fo map. Most
of us don't look at Fo maps, but rather 2Fo-Fc or 2mFo-DFc maps, with or
without the missing reflections "filled in". These are a bit different
from a straight "Fo" map. The absolute electron number density (e-/A^3)
of the 1 "sigma" contour for all these maps is about the same, but no
doubt the "fill in", extra Fo-Fc term, and the likelihood weights
reduce the overall RMS error. By how much? That is a good question.
Still, we can take this RMS 0.223 "sigma" variation from 5% free
reflections as a worst-case scenario, and then ask the question: is this
a "problem"? Well, any source of error can be a problem, but when you
are trying to find the best compromise between two
difficult-to-reconcile considerations (such as the stability of Rfree
and the interpretability of the map), it is usually helpful to bring in
a third consideration, such as: how much noise is in the map already due
to other sources? My colleagues and I measured this recently (doi:
10.1073/pnas.1302823110), and found that the 1-sigma contour ranges from
0.8 to 1.2 e-/A^3 (relative to vacuum), experimental measurement errors
are RMS ~0.04 e-/A^3, and map errors from the model-data difference are
about RMS 0.13 e-/A^3. So, 22.3% of "sigma" is around RMS 0.22 e-/A^3.
This is a bit larger than our biggest empirically-measured error: the
"modelling error", indicating that 5% free flags may indeed be "too much".
However, 22.3% is the worst-case error, in the absence of all the
corrections used to make 2mFo-DFc maps, so in reality the modelling
error and the omitted-reflection errors are probably comparable,
indicating that 5% is about the right amount. Any more and the error
from omitted reflections starts to dominate the total error. On the
other hand, the modelling error is (by definition) the Fo-Fc difference,
so as Rwork/Rfree get smaller the RMS map variation due to modelling
errors gets smaller as well, eventually exposing the omitted-reflection
error. So, once your Rwork/Rfree get to be less than ~22%, the errors
in the map are starting to be dominated by the missing Fs of the 5% free
set.
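To make that crossover concrete, here is a back-of-the-envelope sketch
(my own framing of the argument above: treating the R factor as a crude
proxy for the relative RMS of the Fo-Fc difference map, by the same
Parseval reasoning; it is a rule of thumb, not a derivation):

```python
import numpy as np

def omitted_reflection_error(free_frac):
    """Relative RMS map error from leaving out a random free set."""
    return np.sqrt(free_frac)

def modelling_error(r_factor):
    """Crude proxy: the relative RMS of the Fo-Fc difference map is
    taken to scale with the R factor (Parseval again)."""
    return r_factor

free_frac = 0.05  # sqrt(0.05) = 0.223, hence the ~22% crossover
for r in (0.50, 0.30, 0.22, 0.15):
    if modelling_error(r) > omitted_reflection_error(free_frac):
        dominant = "modelling error"
    else:
        dominant = "omitted-reflection error"
    print(f"Rwork/Rfree ~ {r:.2f}: dominated by {dominant}")
```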
However, early in the refinement, when your R factors are in the 30%s,
40%s, or even 50%s, I don't think the errors due to missing 5% of the
reflections are going to be important. Then again, late in refinement,
it might be a good idea to start including some or all of the "free"
reflections back into the working set in order to reduce the overall map
error (cue lamentations from validation experts such as Jane Richardson).
This is perhaps the most important topic on this thread. There are so
many ways to "contaminate", "bias" or otherwise compromise the free set,
and once done we don't have generally accepted procedures for
re-sanctifying the free reflections, other than starting over again from
scratch. This is especially problematic if your starting structure for
molecular replacement was refined against all reflections, and your
ligand soak is nice and isomorphous to those original crystals. How do
you remove the evil "bias" from this model? You can try shaking it, but
that only really removes bias at high spatial frequencies and is not so
effective at low resolution.
So, if bias is so easy to generate why not use it to our advantage?
Instead of leaving the free-flagged reflections out of the refinement,
put them in, but give them random F values. Then do everything you can
to "bias" your model toward these random values. Loosen the geometry
weights, turn on B-factors, build dummy atoms into all the "noise peaks"
in the maps, and refine for a lot of cycles. Now you've got a model
that is highly biased, but to a completely unrelated "free" set. It
stands to reason that a model cannot be biased toward two completely
unrelated things. Now you can take this model and put it back into
normal refinement against your original data (with the true free-set Fs
restored and flagged as "free"). Your Rfree will start out very high,
but then rapidly drop to what can only be an unbiased value. You can
then try different random number seeds for the
over-refine-against-randomized-free-set run, and see how consistent the
final Rfree becomes. Theoretically, you could do this "bias
misdirection" with as little as one reflection at a time. This would be
more true to classical cross-validation, where you are supposed to do
~20 parallel optimizations with 20 different free sets, and look at the
combined Rfree across them. Takes a lot longer, but these days what do
we have if not lots of CPUs lying around?
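The data-manipulation half of this "bias misdirection" idea can be
sketched in a few lines; the over-refinement itself of course needs a
real refinement program, and the choice of Wilson-style (Rayleigh)
decoy amplitudes and all the array names below are my assumptions, not
a published recipe:

```python
import numpy as np

def randomize_free_set(f_obs, free_flag, rng):
    """Replace free-flagged amplitudes with random draws from an
    acentric Wilson (Rayleigh) distribution, scaled to match the mean
    intensity of the working set, leaving working Fs untouched."""
    f_decoy = f_obs.copy()
    mean_i = np.mean(f_obs[~free_flag] ** 2)
    n_free = int(free_flag.sum())
    # Rayleigh with scale sqrt(mean_i/2) has mean intensity mean_i.
    f_decoy[free_flag] = rng.rayleigh(scale=np.sqrt(mean_i / 2.0),
                                      size=n_free)
    return f_decoy

# Toy data: 10000 reflections, ~5% flagged free.
rng = np.random.default_rng(seed=1)
f_obs = rng.rayleigh(scale=10.0, size=10000)
free_flag = rng.random(10000) < 0.05
f_decoy = randomize_free_set(f_obs, free_flag, rng)
```

After over-refining against f_decoy with whatever program you use,
restore f_obs and the original flags, and watch where Rfree settles;
repeating with different seeds shows how consistent that value is.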
Anyway, to answer the OP question: 5% is close to being "too many", so
less than that is preferable as long as you have "enough" as others on
this thread have so aptly described. In cases where the
observations-to-parameters ratio is becoming too precious, perhaps more
robust (aka time-consuming) cross-validation protocols are called for?
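One way to encode a "lower of a percentage or a fixed count" rule for
the free-set size is sketched below; the specific thresholds (a floor
of 1000 and a cap of 2000 reflections) are illustrative assumptions on
my part, not a community standard:

```python
def free_set_size(n_reflections, frac=0.05, min_count=1000,
                  max_count=2000):
    """Pick a free-set size: frac of the data, but never more than
    max_count, and (where the data allow) never fewer than min_count."""
    n_free = int(round(frac * n_reflections))
    n_free = min(n_free, max_count)   # cap for huge high-res datasets
    n_free = max(n_free, min_count)   # floor so Rfree is stable
    return min(n_free, n_reflections)  # can't flag more than we have

for n in (5000, 40000, 400000):
    print(n, free_set_size(n))
```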
-James Holton
MAD Scientist
On 6/2/2015 3:26 AM, Graeme Winter wrote:
Hi Folks
Had a vague comment handed my way that "xia2 assigns too many free
reflections" - I have a feeling that by default it makes a free set of
5% which was OK back in the day (like I/sig(I) = 2 was OK) but maybe
seems excessive now.
This was particularly in the case of high resolution data where you
have a lot of reflections, so 5% could be several thousand which would
be more than you need to just check Rfree seems OK.
Since I really don't know what is the right # reflections to assign to
a free set thought I would ask here - what do you think? Essentially I
need to assign a minimum %age or minimum # - the lower of the two
presumably?
Any comments welcome!
Thanks & best wishes Graeme