On Friday, October 14, 2011 02:45:08 pm Ed Pozharski wrote:
> On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:
> > The benefit of including those extra 5% of data is always minimal
>
> And so is probably the benefit of excluding them when all the steps that
> require cross-validation have already been performed. My thinking is
> that excluding data from analysis should always be justified (and in the
> initial stages of refinement it might be, as it prevents overfitting),
> not the other way around.
A model with error bars is more useful than a marginally more accurate model
without error bars, not least because you are probably taking it on faith
that the second model is "more accurate".

Crystallographers were kind of late in realizing that a cross-validation
test could be useful in assessing refinement. What's more, we never really
learned the whole lesson: rather than using the full test, we use only one
blade of the jackknife.

http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation

The full test would involve running multiple parallel refinements, each one
omitting a different disjoint set of reflections. The CCP4 suite is set up
to do this, since Rfree flags by default run from 0-19 and refmac lets you
specify which 5% subset is to be omitted from the current run. Of course,
evaluating the end point then becomes more complex than looking at a single
number, "Rfree".

Surely someone must have done this! But I can't recall ever reading an
analysis of such a refinement protocol. Does anyone know of relevant
reports in the literature? Is there a program or script that will collect
the K-fold parallel output models and their residuals to generate a net
indicator of model quality?

Ethan

--
Ethan A Merritt
Biomolecular Structure Center, K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742
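P.S. For concreteness, one minimal way the pooling step might look: a short
Python sketch that combines the per-run free residuals from K parallel
refinements into a single reflection-weighted indicator. This is just a
jackknife-style aggregation I am sketching, not an existing program; the
fold counts and R-free values below are made up for illustration.

```python
# Sketch: pool per-fold free residuals from K parallel refinements
# into one net cross-validation indicator.  All numbers hypothetical.

def pooled_rfree(folds):
    """folds: list of (n_test_reflections, rfree) pairs, one per
    parallel run.  Each run held out a different disjoint subset.
    Returns the reflection-count-weighted mean R-free over all
    K held-out sets, i.e. the full K-fold cross-validation score."""
    total = sum(n for n, _ in folds)
    return sum(n * r for n, r in folds) / total

# Example: a few of the 20 possible folds (CCP4 free flags 0-19),
# each ~5% of the data; counts and residuals are invented.
runs = [(2500, 0.24), (2480, 0.25), (2510, 0.23), (2495, 0.26)]
print(pooled_rfree(runs))
```

Weighting by the number of held-out reflections matters because the 5%
subsets are not exactly equal in size; an unweighted mean would bias the
net indicator toward the smaller folds.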