Re: [ccp4bb] Rfree in similar data set

Dale Tronrud Thu, 24 Sep 2009 09:21:18 -0700

   While I agree with Ian on the theoretical level, in practice
people use free R's to make decisions before the ultimate model
is finished, and our refinement programs are still limited in
their abilities to find even a local minimum.


   On the automated level the test set is used, sometimes, to
determine bulk solvent parameters, and more importantly to calibrate
the likelihood calculations in refinement.  If the test set is
not "free" the likelihood calculation will overestimate the reliability
of the model, and I'm not confident that this error will not become
a self-fulfilling prophecy.  It is not useful to divine meaning
from the free R until convergence is achieved, but the test
set is used from the first cycle.

   Perhaps I'm in one of my more persnickety moods, but every
paper I've read about optimization algorithms says that the method
requires a number of iterations many times the number of parameters
in the model.  The methods used in refinement programs are pretty
amazing in their ability to drop the residuals with a small number
of cycles, but we are violating the mathematical warranty on
each and every one of them.   A refinement program will produce
a model that is close to optimal, but cannot be expected to be
optimal.  Since we haven't seen an optimal model yet it's hard
to say how far we are off.
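
   As a toy illustration of the point above (not anything taken from a
real refinement program), here is a minimal sketch that runs steepest
descent with exact line search on a made-up quadratic target with 200
parameters.  The target, the spread of curvatures, and the number of
cycles are all assumptions chosen for illustration.  The residual drops
a long way in the first few cycles, yet after 20 cycles (a tenth of the
number of parameters) the result is still not the optimum, which for
this function is exactly zero.

```python
def steepest_descent(curvatures, start, n_cycles):
    """Minimise f(x) = 0.5 * sum(k_i * x_i**2) by steepest descent
    with an exact line search.  Returns the final x and f per cycle."""
    x = list(start)
    history = []
    for _ in range(n_cycles):
        # Gradient of the quadratic: g_i = k_i * x_i
        g = [k * xi for k, xi in zip(curvatures, x)]
        gg = sum(gi * gi for gi in g)
        if gg == 0.0:
            break  # already exactly at the minimum
        # Exact line search step for a quadratic: (g.g) / (g.K.g)
        gkg = sum(k * gi * gi for k, gi in zip(curvatures, g))
        step = gg / gkg
        x = [xi - step * gi for xi, gi in zip(x, g)]
        history.append(0.5 * sum(k * xi * xi for k, xi in zip(curvatures, x)))
    return x, history


# Hypothetical problem: 200 parameters, curvatures spread from 1 to 1000.
n_params = 200
curvatures = [1.0 + 999.0 * i / (n_params - 1) for i in range(n_params)]
start = [1.0] * n_params
f0 = 0.5 * sum(curvatures)  # value of the target at the starting point

_, hist = steepest_descent(curvatures, start, n_cycles=20)
print("start      :", f0)
print("cycle 1    :", hist[0])
print("cycle 20   :", hist[-1])
print("true optimum: 0.0")
```

   Real refinement programs use far better minimizers than this, so the
sketch only shows the shape of the problem: a large early drop in the
residual says little about how close you are to the minimum.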

   If you have the ability to choose a test set that is "unbiased"
you might as well do so.
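
   One way to do that when moving to a similar data set is to carry
the old flags over by Miller index.  The following is only a sketch
under stated assumptions: the function name `transfer_free_flags`, the
dictionary layout, and the 5% free fraction are hypothetical, and it
assumes both data sets are already indexed consistently and reduced to
the same asymmetric unit.  Reflections present in the old set keep
their flag; reflections that are new (for example at higher resolution)
are assigned at random.

```python
import random


def transfer_free_flags(old_flags, new_hkls, free_fraction=0.05, seed=0):
    """Carry free-set flags from an old data set over to a new one.

    old_flags     : dict mapping (h, k, l) -> True if in the test set
    new_hkls      : iterable of (h, k, l) present in the new data set
    free_fraction : fraction of previously unseen reflections to flag
    """
    rng = random.Random(seed)
    new_flags = {}
    for hkl in new_hkls:
        if hkl in old_flags:
            # Seen before: keep its old status so the test set stays "free".
            new_flags[hkl] = old_flags[hkl]
        else:
            # Never used in refinement: safe to assign at random.
            new_flags[hkl] = rng.random() < free_fraction
    return new_flags


# Tiny made-up example: three old reflections, five in the new data set.
old_flags = {(1, 0, 0): False, (1, 1, 0): True, (2, 0, 0): False}
new_hkls = [(1, 0, 0), (1, 1, 0), (2, 0, 0), (3, 1, 0), (4, 0, 0)]
new_flags = transfer_free_flags(old_flags, new_hkls)
print(new_flags)
```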

Dale Tronrud


Ian Tickle wrote:
>> -----Original Message-----
>> From: owner-ccp...@jiscmail.ac.uk [mailto:owner-ccp...@jiscmail.ac.uk] On
>> Behalf Of Eric Bennett
>> Sent: 24 September 2009 13:31
>> To: CCP4BB@JISCMAIL.AC.UK
>> Subject: Re: [ccp4bb] Rfree in similar data set
>>
>> Ian Tickle wrote:
>>
>>> For that to
>>> be true it would have to be possible to arrive at a different unbiased
>>> Rfree from another starting point.  But provided your starting point
>>> wasn't a local maximum LL and you haven't gotten into a local maximum
>>> along the way, convergence will be to a unique global maximum of the LL,
>>> so the Rfree must be the same whatever starting point is used (within
>>> the radius of convergence of course).
>> But if you're using a different set of data the minima and maxima of
>> the function aren't necessarily going to be in the same place.  Rfree
>> is supposed to inform about overfitting.  In an overfitting situation
>> there are multiple possible models which describe the data well and
>> which overfit solution you end up with could be sensitive to the data
>> set used.  The provisions that you haven't gotten stuck in a local
>> maximum and are within radius of convergence don't seem safe
>> considering historical situations that led to the introduction of
>> Rfree.  What algorithm is going to converge main chain tracing errors
>> to the correct maximum?  Thinking about that situation, isn't part of
>> the goal of Rfree to give you a hint in situations where you have, in
>> fact, gotten stuck in a local maximum due to a significant error in
>> the model that places it outside the radius of convergence of the
>> refinement algorithm?
> 
> Hi Eric,
> 
> Yes clearly the function optima won't necessarily be in the same place
> for different datasets; the question is whether the distance between the
> optima is less than the convergence radius.  This will depend largely on
> whether the datasets have similar dmin; if they do then the differences
> will be largely random measurement errors (I'm assuming that there's
> nothing fundamentally wrong with the data).  Then there should be no
> problem re-refining against the 2nd dataset, and the Rfree will be
> unbiased at the global optimum.  The more common situation perhaps is
> that the 2nd dataset is at much higher resolution; in that case it's
> quite likely that there are undetected local optima in the model from
> the 1st dataset that only become apparent in the maps when the 2nd
> dataset is used.  In that case refinement is almost certainly not the
> answer (or at least not the whole answer), you're going to have to go
> back to the maps and model building.
> 
> On the question of overfitting, again any problems of local optima
> (possibly indicated by a higher than expected Rfree as you say) have to
> be resolved first for each of your candidate parameterizations of the
> model, as best as the data will allow.  Then if you find that Rfree at
> convergence is higher (or LLfree lower) for one parameterization than
> another, you choose the parameterization with the lower Rfree (higher
> LLfree) to go forward.  You cannot safely reject a model as being
> overfitted unless the refinement generating the Rfree has converged, so
> that the Rfree is unbiased.  I don't see the problem there (except of
> course in choosing which parameterizations to try).
> 
> I think you misunderstood my provisos.  I only included them to simplify
> the argument; if there are local optima then they have to be resolved,
> most likely by means other than refinement, but their presence does not
> affect the argument about Rfree bias.  My contention is that once all
> issues of local optima are resolved, by whatever means it takes, you
> will end up at the same unique global optimum no matter where you
> started from (unless of course you're very unlucky and there are
> multiple global optima with identical likelihoods but I think we can
> discount that as unlikely!), and therefore Rfree must be unbiased at
> that point.  At intermediate points in this process (i.e. on the paths
> connecting optima), Rfree has no meaning or indeed usefulness and
> therefore the question whether it's biased or not is also meaningless.
> 
> Cheers
> 
> -- Ian
