On Friday, October 14, 2011 02:45:08 pm Ed Pozharski wrote:
> On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote:
> > 
> > The benefit of including those extra 5% of data is always minimal 
> 
> And so, probably, is the benefit of excluding it once all the steps
> that require cross-validation have already been performed.  My
> thinking is that excluding data from analysis should always be
> justified (and in the initial stages of refinement it might be, as it
> prevents overfitting), not the other way around.

A model with error bars is more useful than a marginally more
accurate model without error bars, not least because you are probably
taking it on faith that the second model is "more accurate".

Crystallographers were rather late in realizing that a cross-validation
test could be useful in assessing refinement.  What's more, we never
really learned the whole lesson: rather than using the full test, we
use only one blade of the jackknife.

http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation

The full test would involve running multiple parallel refinements,
each one omitting a different disjoint set of reflections.
The CCP4 suite is set up to do this, since the Rfree flags by default
run from 0 to 19 and refmac lets you specify which 5% subset is to be
omitted from the current run.  Of course, evaluating the end point
becomes more complex than looking at a single number, Rfree.
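
To make the bookkeeping concrete, here is a minimal sketch of the
parallel-refinement step in Python.  The file names, column labels
(FP, SIGFP, FreeR_flag) and cycle count are placeholders, and I am
assuming that refmac's FREE keyword is what selects the held-out flag
value; check the documentation for your version before trusting any
of this.

import subprocess

# One refinement per FreeR_flag value (CCP4's default flags run 0-19,
# ~5% of reflections each).  The 20 runs are independent, so they can
# just as well be farmed out to a queue in parallel.
KEYWORDS = """\
LABIN FP=FP SIGFP=SIGFP FREE=FreeR_flag
FREE {k}
NCYC 10
END
"""

for k in range(20):
    cmd = ["refmac5",
           "HKLIN", "input.mtz", "HKLOUT", "out_%02d.mtz" % k,
           "XYZIN", "model.pdb", "XYZOUT", "out_%02d.pdb" % k]
    with open("refmac_%02d.log" % k, "w") as log:
        subprocess.run(cmd, input=KEYWORDS.format(k=k),
                       stdout=log, text=True, check=True)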

Surely someone must have done this!  But I can't recall ever reading
an analysis of such a refinement protocol.  
Does anyone know of relevant reports in the literature?

Is there a program or script that will collect K-fold parallel output
models and their residuals to generate a net indicator of model quality?
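
For the residual-collection half, something along these lines would
do.  It is only a sketch: the log-scraping regex is a guess at
refmac-style output ("R free" followed by a number), so adapt it to
whatever your refinement program actually prints.

import re, statistics

def last_rfree(logfile):
    """Return the last 'R free' value printed in a refmac-style log."""
    value = None
    with open(logfile) as fh:
        for line in fh:
            m = re.search(r"R\s*free\D*(\d+\.\d+)", line, re.IGNORECASE)
            if m:
                value = float(m.group(1))
    return value

# Pool the 20 folds: the mean is the K-fold cross-validation estimate,
# and the spread across folds is the error bar a single Rfree never gives.
rfrees = [last_rfree("refmac_%02d.log" % k) for k in range(20)]
rfrees = [r for r in rfrees if r is not None]
print("K-fold Rfree: mean %.4f  sd %.4f  (n=%d folds)"
      % (statistics.mean(rfrees), statistics.stdev(rfrees), len(rfrees)))

Comparing the output models themselves (per-atom shifts, map
correlations) is of course harder than pooling the residuals.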

        Ethan

-- 
Ethan A Merritt
Biomolecular Structure Center,  K-428 Health Sciences Bldg
University of Washington, Seattle 98195-7742
