As I see it, the size of the test set is a question of the desired
precision of the free R.  At the point of test set selection there is
variability between the many possible choices: you could happen to pick
a test set with a spuriously low free R or one with an unfortunately
high free R.  These variations don't indicate anything about the quality
of your model, because you haven't created one yet; they are just
statistical fluctuations.

   To investigate this I selected several structures I had worked on,
where I had an unrefined starting model.  For each structure I picked a
percentage for the size of the test set and started a loop of test set
selection and "free R" calculation.  (Since there had been no refinement
yet, the R value for the whole data set is the true free R; all "free R's"
calculated from subsets are just estimates of it.)  For each percentage
of each structure I selected 900 test sets.

   The result is that the variance of the free R estimate is not a function
of the size of the protein, the space group, the solvent content, the
magnitude of free R (only checked between about 35% and 55%), nor the
size of the test set measured as a percent.  It is simply a function of
the number of reflections in the test set.  As Axel said in his paper, a
test set of 1000 reflections has a precision of about 1%, and the
precision varies with counting statistics as 1/Sqrt[n].
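The scaling described above can be reproduced with a quick simulation.  Everything below is synthetic and illustrative: the amplitudes, the noise level, and the test-set sizes are all made up, standing in for a real data set and an unrefined model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a data set: per-reflection |Fobs| and |Fcalc|.
# The 45% relative noise gives an overall R in the 35-40% range, roughly
# the regime discussed above.  (All numbers here are hypothetical.)
n_total = 40000
f_obs = rng.gamma(shape=2.0, scale=100.0, size=n_total)
f_calc = f_obs * (1.0 + rng.normal(0.0, 0.45, size=n_total))

def r_value(sel):
    """Conventional R = sum|Fo - Fc| / sum|Fo| over selected reflections."""
    return np.abs(f_obs[sel] - f_calc[sel]).sum() / f_obs[sel].sum()

# Repeat the test-set selection 900 times for each test-set size and look
# at the spread of the resulting "free R" estimates.  The standard
# deviation should shrink roughly as 1/sqrt(n_test).
for n_test in (250, 1000, 4000):
    estimates = [r_value(rng.choice(n_total, size=n_test, replace=False))
                 for _ in range(900)]
    print(n_test, round(float(np.std(estimates)), 4))
```

Quadrupling the test-set size should roughly halve the spread, independent of what fraction of the full data set that size represents.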

   If you have a test set of 1000 reflections and your free R estimate is
40%, you have 95% confidence that the true free R lies between 37% and 43%,
if I recall my confidence intervals correctly.
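A rough normal-approximation version of this interval can be sketched as follows, assuming (as the 1/Sqrt[n] counting-statistics argument above suggests) that the standard error of the estimate scales like R/sqrt(n).  This is a back-of-the-envelope approximation, not an exact treatment of the statistics of free R.

```python
from math import sqrt

def free_r_interval(r_est, n_test, z=1.96):
    """Approximate confidence interval for a free R estimate, assuming
    a standard error of roughly r_est / sqrt(n_test); z = 1.96 gives
    ~95% coverage under the normal approximation."""
    se = r_est / sqrt(n_test)
    return r_est - z * se, r_est + z * se

lo, hi = free_r_interval(0.40, 1000)
print(f"{lo:.3f} - {hi:.3f}")  # prints 0.375 - 0.425
```

For 1000 test reflections at 40% this gives roughly 37.5%-42.5%, in line with the interval recalled above.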

   The open question is how these deviants track with refinement.  If you
luck out and happen to pick a test set with a particularly low free R
(estimate), does this mean that all your future free R's will look
inappropriately good?  I suspect so, but I have not done the test
of performing 900 independent refinements with differing test sets.

   My short answer to the original question: The precision of an R-free
estimate is determined by the number of reflections, not by the percent
of the total data set.  Your 0.3% test set is as precise as a 10% test set
in HEWL.  (Even though the effect of leaving these reflections out of
the refinement will be quite different, of course.)

Dale Tronrud


Andreas Forster wrote:
Hey all,

let me give this discussion a little kick and see if it spins into outer space.

How many reflections do people use for cross-validation? Five per cent is a value that I often read in papers. Georg Zocher started with 5% but lowered that to 1.5% in the course of refinement. We once had a problem with reviewers complaining that the 0.3% of reflections we used were not enough. However, Axel Brünger's initial publication deems 1000 reflections sufficient, and that's exactly what 0.3% of reflections corresponded to in our data set.

I would think the fewer observations are discarded, the better. Can one lower this number further by picking reflections smartly, e.g. avoiding symmetry-related reflections, as was discussed on the ccp4bb a little while back? Should one agonize at all, given that one should do a last run of refinement without any reflections excluded?
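One "smart picking" scheme along these lines is to select the test set in thin resolution shells, so that symmetry- or NCS-related reflections, which fall at the same (or strongly correlated) resolution, end up on the same side of the free/work split.  A minimal sketch of the idea; the d-spacings here are random placeholders, whereas a real program would compute them from the unit cell and Miller indices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical d-spacing for each reflection (placeholder values only).
n_refl = 20000
d_spacing = rng.uniform(1.5, 30.0, size=n_refl)

# Thin-shell selection: order reflections by resolution, slice them into
# many narrow shells, and flag a few entire shells as the test set.
order = np.argsort(d_spacing)
n_shells = 100
shells = np.array_split(order, n_shells)
free_shells = rng.choice(n_shells, size=5, replace=False)  # ~5% of data
test_set = np.concatenate([shells[i] for i in free_shells])

print(len(test_set) / n_refl)  # fraction flagged free: 0.05
```

Whether thin shells actually let you shrink the test set further is exactly the open question above; the sketch only shows the selection mechanics.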



Andreas


On 1/31/07, *Georg Zocher* <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:

    First of all, I would like to thank you for your comments.

    After consideration of all your comments, I conclude that there are
    three possibilities.

    1.) search for particularly poorly-behaved regions using the
    PARVATI server
       a.) refine the occupancies of those atoms and/or
       b.) tighten the restraints

    Problems which have already been mentioned:
    If I tighten the restraints, the anisotropic model may not be
    statistically justified, which seems to be the case.

    Using all reflections may not help that much, because I already chose
    a set of only 1.5% for Rfree (~1300 reflections) to keep as much data
    as possible in the refinement. For my first tries of anisotropic
    refinement I used 5% of the reflections for Rfree, but the same
    problem arose, so I decided to cut the Rfree set to 1.5%.

    2.) Using shelxl

    3.) TLS with multi-groups
       Should be the safe way!?

    I will try all the possibilities, but the TLS refinement in
    particular seems worth trying.

    Thanks for your helpful advice,

    georg
