On 1 September 2013 11:31, Frank von Delft <frank.vonde...@sgc.ox.ac.uk> wrote:

>
> 2.
> I'm struck by how small the improvements in R/Rfree are in Diederichs &
> Karplus (ActaD 2013, http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3689524/);
> the authors don't discuss it, but what's current thinking on how to
> estimate the expected variation in R/Rfree - does the Tickle formalism
> (1998) still apply for ML with very weak data?
>

Frank, our paper is still relevant, unfortunately just not to the question
you're trying to answer!  We were trying to answer 2 questions: 1) what
value of Rfree would you expect to get if the structure were free of
systematic error and only random errors were present, so that it could be
used as a baseline (assuming a fixed cross-validation test set) to identify
models with gross (e.g. chain-tracing) errors; and 2) how much would you
expect Rfree to vary assuming a fixed starting model but with a different
random sampling of the test set (i.e. the "sampling standard deviation").
The latter is relevant if, say, you want to compare the same structure (at
the same resolution, obviously) done independently in 2 labs, since it tells
you how big the difference in Rfree for an arbitrary choice of test set
needs to be before you can claim that it's statistically significant.
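
To make the idea in 2) concrete, here is a minimal sketch (not the
calculation from our paper) that estimates the sampling standard deviation
of an R-factor by drawing many random test sets from one fixed list of
observed and calculated amplitudes.  Everything in it is an assumption for
illustration: the data are synthetic, the function names are hypothetical,
and the model is held fixed rather than re-refined against each working set.

```python
import random
import statistics


def r_factor(f_obs, f_calc, indices):
    """Conventional R = sum|Fo - Fc| / sum(Fo) over the chosen reflections."""
    num = sum(abs(f_obs[i] - f_calc[i]) for i in indices)
    den = sum(f_obs[i] for i in indices)
    return num / den


def rfree_sampling_sd(f_obs, f_calc, test_fraction=0.05, n_trials=200, seed=0):
    """Mean and standard deviation of R over random test-set selections.

    The model (f_calc) is fixed; only the choice of test set varies.
    """
    rng = random.Random(seed)
    n = len(f_obs)
    n_test = max(1, round(test_fraction * n))
    values = []
    for _ in range(n_trials):
        test_set = rng.sample(range(n), n_test)  # sample without replacement
        values.append(r_factor(f_obs, f_calc, test_set))
    return statistics.mean(values), statistics.stdev(values)


def make_synthetic(n=20000, rel_error=0.25, seed=1):
    """Hypothetical amplitudes: Fo is Fc perturbed by purely random error."""
    rng = random.Random(seed)
    f_calc = [rng.uniform(10.0, 100.0) for _ in range(n)]
    f_obs = [abs(fc * (1.0 + rng.gauss(0.0, rel_error))) for fc in f_calc]
    return f_obs, f_calc


f_obs, f_calc = make_synthetic()
mean_small, sd_small = rfree_sampling_sd(f_obs, f_calc, test_fraction=0.01)
mean_large, sd_large = rfree_sampling_sd(f_obs, f_calc, test_fraction=0.10)
print(f"1% test set:  mean R = {mean_small:.4f}, sampling SD = {sd_small:.4f}")
print(f"10% test set: mean R = {mean_large:.4f}, sampling SD = {sd_large:.4f}")
```

The spread shrinks as the test set grows, which is why a difference in
Rfree between two arbitrary test sets has to be judged against this
sampling SD before it is called significant.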

In this case the questions are different because you're certainly not
comparing different models using the same test set, nor, I suspect, are
you comparing the same model with different randomly selected test sets.  I
assume in this case that the test sets for different resolution cut-offs
are highly correlated, which I suspect makes it quite difficult to say what
is a significant difference in Rfree (I have not attempted to do the
algebra!).

Rfree is one of a number of "model selection criteria" (see
http://en.wikipedia.org/wiki/Model_selection#Criteria_for_model_selection)
whose purpose is to provide a metric for comparison of different models
given specific data, i.e. as for the likelihood function they all take the
form f(model | data), so in all cases you're varying the model with fixed
data.  Its use in the form f(data | model), i.e. where you're varying the
data with a fixed model, I would say is somewhat questionable and certainly
requires careful analysis to determine whether the results are
statistically significant.  Even assuming we can argue our way around the
inappropriate application of model selection methodology to a different
problem, unfortunately Rfree is far from an ideal criterion in this
respect; a better one would surely be the free log-likelihood as originally
proposed by Gerard Bricogne.
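
For reference, the two criteria can be written side by side.  This is only
a schematic sketch: T denotes the test set, k an overall scale factor, and
p the per-reflection probability from whatever error model the ML
refinement uses, none of which are spelled out above.

```latex
R_{\mathrm{free}} =
  \frac{\sum_{h \in T} \bigl|\, |F_o(h)| - k\,|F_c(h)| \,\bigr|}
       {\sum_{h \in T} |F_o(h)|},
\qquad
LL_{\mathrm{free}} =
  \sum_{h \in T} \ln p\bigl( |F_o(h)| \,\big|\, F_c(h) \bigr)
```

Both sums run over the test-set reflections only; the difference is that
the log-likelihood weights each reflection through the error model rather
than treating all amplitude differences alike.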

Cheers

-- Ian
