Re: [ccp4bb] should the final model be refined against full datset
> Selecting a test set that minimizes Rfree is so wrong on so many levels.
> Unless, of course, the only thing I know about Rfree is that it is the
> magic number that I need to make small by all means necessary.

By using a simple genetic algorithm, I managed to get Rfree for a well-refined model as low as 14.6% and as high as 19.1%. The dataset is not too small (~40,000 reflections in all, with the standard-sized 5% test set), so you can get a spread as wide as 4.5% even with a not-so-small dataset. Only ~1/3 of the test reflections are exchanged to achieve this. What is curious is that, contrary to my expectations, the test set remains well distributed throughout the resolution shells after this awful "optimization", and the average reflection strength for the working and test sets remains close. I am not sure how to judge which model is actually better, but it is noteworthy that the FOM gets worse for *both* upward and downward "optimization" of the test set.

-- After much deep and profound brain things inside my head, I have decided to thank you for bringing peace to our home. Julian, King of Lemurs
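To make that "optimization" concrete, here is a minimal sketch of the idea in Python. It is illustrative only: the greedy swap loop stands in for the genetic algorithm, all_hkl is assumed to be a list of (h, k, l) tuples, and calc_rfree() is a placeholder for whatever re-refinement or re-scoring step produces Rfree for a given test set (both function names are assumptions, not the script actually used).

    import random

    def calc_rfree(test_set):
        # Placeholder: in practice this would re-refine (or at least
        # rescore) the model against the complementary working set and
        # return Rfree computed over test_set.
        raise NotImplementedError

    def optimize_test_set(all_hkl, n_test, n_steps=1000, minimize=True):
        # Start from a random selection, then keep swaps of a few
        # reflections between test and working sets whenever they move
        # Rfree in the desired direction.
        test = set(random.sample(all_hkl, n_test))
        best = calc_rfree(test)
        for _ in range(n_steps):
            trial = set(test)
            out = random.sample(sorted(trial), 5)
            pool = [h for h in all_hkl if h not in trial]
            into = random.sample(pool, 5)
            trial.difference_update(out)
            trial.update(into)
            r = calc_rfree(trial)
            if (r < best) if minimize else (r > best):
                test, best = trial, r
        return test, best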
Re: [ccp4bb] should the final model be refined against full datset
Yes, Rsleep seems to be just the right thing to use for this: "Separating model optimization and model validation in statistical cross-validation as applied to crystallography", G. J. Kleywegt, Acta Cryst. (2007). D63, 939-940. Practically, it would mean setting aside 10% of the reflections and splitting them into a 5% test set used for optimizations like #1-4 and another 5% (the sleep set) that is never, ever used for anything. The big question is whether this will make any important difference. I suspect, as with many similar things, there will be no clear-cut answer (that is, it may or may not make a difference, depending on the case). Pavel On Mon, Oct 17, 2011 at 8:57 AM, Thomas C. Terwilliger wrote: > I think that we are using the test set for many things: > > 1. Determining and communicating to others whether our overall procedure > is overfitting the data. > > 2. Identifying the optimal overall procedure in cases where very different > options are being considered (e.g., should I use TLS). > > 3. Calculating specific parameters (eg sigmaA). > > 4. Identifying the "best" set of overall parameters. > > I would suggest that we should generally restrict our usage of the test > set to purposes #1-3. Given a particular overall procedure for > refinement, a very good set of parameters should be obtainable from the > working set of data. > > In particular, approaches in which many parameters (in the limit... all > parameters) are fit to minimize Rfree do not seem likely to produce the > best model overall. It might be worth doing some experiments with the > super-free set approach to determine whether this is true. > > > >> Hi, > >> > >> On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski > >> wrote: > >> > >>> On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote: > >>> > > > For structures with a small number of reflections, the > >>> > statistical > >>> > > > noise in the 5% sets can be very significant indeed. We have seen > >>> > > > differences between Rfree values obtained from different sets > >>> > reaching > >>> > > > up to 4%. > >>> > >> > >> this is in line with my observations too. > >> Not surprising at all, though (see my previous post on this subject): a > >> small seemingly insignificant change somewhere may result in refinement > >> taking a different pathway leading to a different local minimum. There is > >> even way of making practical use of this (Rice, Shamoo & Brunger, 1998; > >> Korostelev, Laurberg & Noller, 2009; ...). > >> > >> This "seemingly insignificant change somewhere" may be: > >> - what Ed mentioned (different noise level in free reflections or simply > >> different strength of reflections in free set between sets); > >> - slightly different staring conditions (starting parameter value); > >> - random seed used in Xray/restraints target weight calculation (applies to > >> phenix.refine), > >> - I can go on for 10+ possibilities. > >> > >> I do not know whether choosing the result with the lowest Rfree is a good > >> idea or not (after reading Ed's post I am slightly puzzled now), but > >> what's > >> definitely a good idea in my opinion is to know the range of possible > >> R-factor values in your specific case, so you know which difference > >> between > >> two R-factors obtained in two refinement runs is significant and which one > >> is not. > >> > >> Pavel > >> >
Re: [ccp4bb] should the final model be refined against full datset
I think that we are using the test set for many things: 1. Determining and communicating to others whether our overall procedure is overfitting the data. 2. Identifying the optimal overall procedure in cases where very different options are being considered (e.g., should I use TLS). 3. Calculating specific parameters (eg sigmaA). 4. Identifying the "best" set of overall parameters. I would suggest that we should generally restrict our usage of the test set to purposes #1-3. Given a particular overall procedure for refinement, a very good set of parameters should be obtainable from the working set of data. In particular, approaches in which many parameters (in the limit... all parameters) are fit to minimize Rfree do not seem likely to produce the best model overall. It might be worth doing some experiments with the super-free set approach to determine whether this is true. >> Hi, >> >> On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski >> wrote: >> >>> On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote: >>> > > > For structures with a small number of reflections, the >>> > statistical >>> > > > noise in the 5% sets can be very significant indeed. We have seen >>> > > > differences between Rfree values obtained from different sets >>> > reaching >>> > > > up to 4%. >>> >> >> this is in line with my observations too. >> Not surprising at all, though (see my previous post on this subject): a >> small seemingly insignificant change somewhere may result in refinement >> taking a different pathway leading to a different local minimum. There is >> even way of making practical use of this (Rice, Shamoo & Brunger, 1998; >> Korostelev, Laurberg & Noller, 2009; ...). >> >> This "seemingly insignificant change somewhere" may be: >> - what Ed mentioned (different noise level in free reflections or simply >> different strength of reflections in free set between sets); >> - slightly different staring conditions (starting parameter value); >> - random seed used in Xray/restraints target weight calculation (applies >> to >> phenix.refine), >> - I can go on for 10+ possibilities. >> >> I do not know whether choosing the result with the lowest Rfree is a good >> idea or not (after reading Ed's post I am slightly puzzled now), but >> what's >> definitely a good idea in my opinion is to know the range of possible >> R-factor values in your specific case, so you know which difference >> between >> two R-factors obtained in two refinement runs is significant and which one >> is not. >> >> Pavel >>
Re: [ccp4bb] should the final model be refined against full datset
Dear Gerard, Tom and Bernhard, Thank you for highlighting the IUCr Diffraction Data Deposition Working Group and Forum.

Dear Colleagues, I am travelling at present and apologise for not replying sooner to the CCP4bb; I also have only intermittent email access until later this week, when I 'return to office'. The points being raised in this CCP4bb thread are very important, and the IUCr also recognises this. The role of the IUCr Working Group that has been set up is to bring information into focus and to identify steps forward. We seek to make progress towards archiving and making available all relevant scientific data associated with a publication (or a completed structure deposition in a validated database such as the PDB). The consultation process is being formalised via the IUCr Forum pages. The Working Group, and a wider group consisting of IUCr Commissions and consultants, has been established for discussion and planning. We are also aiming at a community consultation via the Forum approach, and we will launch the Forum for this asap. The IUCr invites the widest possible input, from the various communities that the IUCr Commissions serve, on the future of diffraction data deposition, which can surely be improved. Thus this Forum will help to record an organised set of inputs for future reference. The Forum is being set up and will require registration, which is a straightforward process. Details will follow shortly. Members of the Working Group and its consulted representatives are listed below.

Best wishes and regards, Yours sincerely, John

Prof John R Helliwell DSc
Chairman of the IUCr Diffraction Data Deposition Working Group (IUCr DDD WG)

IUCr DDD WG Members:
Steve Androulakis (TARDIS representative)
John R. Helliwell (Chair) (IUCr ICSTI Representative; Chairman of the IUCr Journals Commission 1996-2005)
Loes Kroon-Batenburg (Data processing software)
Brian McMahon (IUCr CODATA Representative)
John Westbrook (wwPDB representative and COMCIFS)
Sol Gruner (Diffuse scattering specialist and SR Facility Director)
Heinz-Josef Weyer (SR and Neutron Facility user)
Tom Terwilliger (Macromolecular Crystallography)

Consultants:
Alun Ashton (Diamond Light Source (DLS); Data Archive leader there)
Herbert Bernstein (Head of the imgCIF Dictionary Maintenance Group and member of COMCIFS)
Frances Bernstein (Observer on data deposition policies)
Gerard Bricogne (Active software and methods developer)
Bernhard Rupp (Macromolecular crystallographer)
IUCr Commissions (Chairs and/or alternates)

On Sat, Oct 15, 2011 at 1:32 AM, Gerard Bricogne wrote: > Dear Tom, > > I am not sure that I feel happy with your invitation that views on such > crucial matters as these deposition issues be communicated to you off-list. > It would seem much healthier if these views were aired out within the BB. > Again!, some will say ... but the difference is that there is now a forum > for them, set up by the IUCr, that may eventually turn opinions into some > form of action. > > I am sure that many subscribers to this BB, and not just you as a > member of some committees, would be interested to hear the full variety of > views on the desirable and the feasible in these areas, and to express their > own for everyone to read and discuss. > > Perhaps John Helliwell can elaborate on this and on the newly created > forum. > > > With best wishes, > > Gerard. > > -- > On Fri, Oct 14, 2011 at 04:56:20PM -0600, Thomas C. Terwilliger wrote: >> For those who have strong opinions on what data should be deposited... 
>> >> The IUCR is just starting a serious discussion of this subject. Two >> committees, the "Data Deposition Working Group", led by John Helliwell, >> and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su) >> are working on this. >> >> Two key issues are (1) feasibility and importance of deposition of raw >> images and (2) deposition of sufficient information to fully reproduce the >> crystallographic analysis. >> >> I am on both committees and would be happy to hear your ideas (off-list). >> I am sure the other members of the committees would welcome your thoughts >> as well. >> >> -Tom T >> >> Tom Terwilliger >> terwilli...@lanl.gov >> >> >> >> This is a follow up (or a digression) to James comparing test set to >> >> missing reflections. I also heard this issue mentioned before but was >> >> always too lazy to actually pursue it. >> >> >> >> So. >> >> >> >> The role of the test set is to prevent overfitting. Let's say I have >> >> the final model and I monitored the Rfree every step of the way and can >> >> conclude that there is no overfitting. Should I do the final refinement >> >> against complete dataset? >> >> >> >> IMCO, I absolutely should. The test set reflections contain >> >> information, and the "final" model is actually biased towards the >> >> working set. Refining using all the data can only improve the accuracy >> >> of the model, if only slightly. >> >> >> >> T
Re: [ccp4bb] should the final model be refined against full datset
Dear Nicholas, for a data set with 5132 unique reflections you should flag 10.5% for Rfree, otherwise you could just as well drop Rfree completely and use the whole data set for refinement. At least this is how I understand Axel Brunger's article about Rfree, where he states that one needs 500-1000 reflections for Rfree to be statistically meaningful. I have wondered where the '5% rule' came from, since it compromises Rfree for low-resolution data sets (especially with high symmetry). If Axel Brunger's initial statement has become obsolete I would appreciate some clarification on the required number of flagged reflections, but until then I will keep on flagging 500-1000 reflections rather than 5%. Tim On 10/15/2011 10:48 AM, Nicholas M Glykos wrote: >>> For structures with a small number of reflections, the statistical >>> noise in the 5% sets can be very significant indeed. We have seen >>> differences between Rfree values obtained from different sets reaching >>> up to 4%. >> >> This is very intriguing indeed! Is there something specific in these >> structures that Rfree differences depending on the set used reach 4%? >> NCS? Or the 5% set having less than ~1000-1500 reflections? > > Tassos, by your standards, these structures should have been described as > 'tiny' and not small ... ;-) [Yes, significantly less than 1000. In one > case the _total_ number of reflections was 5132 reflections (which were, > nevertheless, slowly and meticulously measured by a CAD4 one-by-one. These > were the days ... :-)) ]. -- Dr Tim Gruene, Institut fuer anorganische Chemie, Tammannstr. 4, D-37077 Goettingen, GPG Key ID = A46BEE1A
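A small helper along these lines (a sketch only; the 500-1000 target is the Brunger guidance quoted above, while the function name and the 10% cap are assumptions of this illustration): pick the free-set fraction so that the number of flagged reflections does not fall below a sensible floor, using 5% only when the data set is large enough.

    def free_fraction(n_unique, target_free=1000, default=0.05, max_fraction=0.10):
        # Fraction of reflections to flag for Rfree: at least 'default',
        # enough to give roughly 'target_free' test reflections, but never
        # more than 'max_fraction' of the data.
        frac = max(default, float(target_free) / n_unique)
        return min(frac, max_fraction)

    # e.g. free_fraction(5132) -> 0.10 (capped; ~513 reflections),
    #      free_fraction(40000) -> 0.05 (the usual 5% rule applies)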
Re: [ccp4bb] should the final model be refined against full datset
Hi, On Sun, Oct 16, 2011 at 7:48 PM, Ed Pozharski wrote: > On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote: > > > > For structures with a small number of reflections, the > > statistical > > > > noise in the 5% sets can be very significant indeed. We have seen > > > > differences between Rfree values obtained from different sets > > reaching > > > > up to 4%. > this is in line with my observations too. Not surprising at all, though (see my previous post on this subject): a small, seemingly insignificant change somewhere may result in refinement taking a different pathway leading to a different local minimum. There is even a way of making practical use of this (Rice, Shamoo & Brunger, 1998; Korostelev, Laurberg & Noller, 2009; ...). This "seemingly insignificant change somewhere" may be:
- what Ed mentioned (a different noise level in the free reflections, or simply a different strength of reflections in the free set between sets);
- slightly different starting conditions (starting parameter values);
- the random seed used in the X-ray/restraints target weight calculation (applies to phenix.refine);
- I can go on for 10+ possibilities.
I do not know whether choosing the result with the lowest Rfree is a good idea or not (after reading Ed's post I am slightly puzzled now), but what's definitely a good idea in my opinion is to know the range of possible R-factor values in your specific case, so you know which difference between two R-factors obtained in two refinement runs is significant and which one is not. Pavel
Re: [ccp4bb] should the final model be refined against full datset
On Sat, 2011-10-15 at 11:48 +0300, Nicholas M Glykos wrote: > > > For structures with a small number of reflections, the > statistical > > > noise in the 5% sets can be very significant indeed. We have seen > > > differences between Rfree values obtained from different sets > reaching > > > up to 4%. This produces a curious paradox. One possible reason for the variation in Rfree when choosing a different test set is that, by pure chance, reflections with more/less noise can be selected. Which automatically means that the working set contains reflections with less/more noise, and therefore the model (presumably) gets better/worse. So, selecting a test set that results in a lower Rfree leads to a model which is likely worse? In fact, an obvious way to improve the Rfree through choice of a better test set is by biasing it towards stronger reflections in each resolution shell. Selecting a test set that minimizes Rfree is so wrong on so many levels. Unless, of course, the only thing I know about Rfree is that it is the magic number that I need to make small by all means necessary. Cheers, Ed. -- Oh, suddenly throwing a giraffe into a volcano to make water is crazy? Julian, King of Lemurs
Re: [ccp4bb] should the final model be refined against full datset
> > For structures with a small number of reflections, the statistical > > noise in the 5% sets can be very significant indeed. We have seen > > differences between Rfree values obtained from different sets reaching > > up to 4%. > > This is very intriguing indeed! Is there something specific in these > structures that Rfree differences depending on the set used reach 4%? > NCS? Or the 5% set having less than ~1000-1500 reflections? Tassos, by your standards, these structures should have been described as 'tiny' and not small ... ;-) [Yes, significantly less than 1000. In one case the _total_ number of reflections was 5132 (which were, nevertheless, slowly and meticulously measured one-by-one on a CAD4. These were the days ... :-)) ]. -- Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
Re: [ccp4bb] should the final model be refined against full datset
> > > For structures with a small number of reflections, the statistical noise > in the 5% sets can be very significant indeed. We have seen differences > between Rfree values obtained from different sets reaching up to 4%. This is very intriguing indeed! Is there something specific about these structures that makes the Rfree differences between test sets reach 4%? NCS? Or the 5% set having fewer than ~1000-1500 reflections? It would indeed be very interesting if there were a correlation there! A. > > Ideally, and instead of PDBSET+REFMAC we should have been using simulated > annealing (without positional refinement), but moving continuously between > the CNS-XPLOR and CCP4 was too much for my laziness. > > All the best, > Nicholas > > > -- > > > Dr Nicholas M. Glykos, Department of Molecular Biology > and Genetics, Democritus University of Thrace, University Campus, > Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, >Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
Re: [ccp4bb] should the final model be refined against full datset
Dear Ethan, List, > Surely someone must have done this! But I can't recall ever reading > an analysis of such a refinement protocol. > Does anyone know of relevant reports in the literature? Total statistical cross-validation is indeed what we should be doing, but for large structures the computational cost may be significant. In the absence of total statistical cross-validation, the reported Rfree may be an 'outlier' (with respect to the distribution of Rfree values that would have been obtained from all disjoint sets). To tackle this, we usually resort to the following ad hoc procedure: at an early stage of the positional refinement, we use a shell script which (a) uses Phil's PDBSET with the NOISE keyword to randomly shift the atomic positions, (b) refines the resulting models to completion, one with each of the different free sets, (c) calculates the mean of the resulting free R values, and (d) selects (once and for all) the free set whose Rfree is closest to that mean. For structures with a small number of reflections, the statistical noise in the 5% sets can be very significant indeed. We have seen differences between Rfree values obtained from different sets reaching up to 4%. Ideally, and instead of PDBSET+REFMAC, we should have been using simulated annealing (without positional refinement), but moving back and forth between CNS/X-PLOR and CCP4 was too much for my laziness. All the best, Nicholas -- Dr Nicholas M. Glykos, Department of Molecular Biology and Genetics, Democritus University of Thrace, University Campus, Dragana, 68100 Alexandroupolis, Greece, Tel/Fax (office) +302551030620, Ext.77620, Tel (lab) +302551030615, http://utopia.duth.gr/~glykos/
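A skeleton of that shell-script logic, transcribed to Python for clarity. The PDBSET/REFMAC invocations are deliberately left as a placeholder (refine_with_free_set below is an assumed stand-in, not a real command line); the point is only the bookkeeping: one completed refinement per candidate free set, then pick the set whose Rfree sits closest to the mean.

    def refine_with_free_set(set_id):
        # Placeholder: perturb the coordinates (e.g. PDBSET with NOISE),
        # refine to completion using free set 'set_id', and return the
        # final Rfree. The actual program calls are site-specific.
        raise NotImplementedError

    def pick_representative_free_set(set_ids):
        # Run one full refinement per free set, then choose the set whose
        # Rfree is closest to the mean over all sets.
        rfree = {i: refine_with_free_set(i) for i in set_ids}
        mean = sum(rfree.values()) / len(rfree)
        best = min(rfree, key=lambda i: abs(rfree[i] - mean))
        spread = max(rfree.values()) - min(rfree.values())
        print("mean Rfree %.4f, spread %.4f, chosen set %s" % (mean, spread, best))
        return best

    # e.g. pick_representative_free_set(range(20))  # CCP4 free flags 0-19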
Re: [ccp4bb] should the final model be refined against full datset
Hi, yes, shifts depend on resolution indeed. See pages 75-77 here: http://www.phenix-online.org/presentations/latest/pavel_refinement_general.pdf Pavel On Fri, Oct 14, 2011 at 7:34 PM, Ed Pozharski wrote: > On Fri, 2011-10-14 at 23:41 +0100, Phil Evans wrote: > > I just tried refining a "finished" structure turning off the FreeR > > set, in Refmac, and I have to say I can barely see any difference > > between the two sets of coordinates. > > The amplitude of the shift, I presume, depends on the resolution and > data quality. With a very good 1.2A dataset refined with anisotropic > B-factors to R~14% what I see is ~0.005A rms shift. Which is not much, > however the reported ML DPI is ~0.02A, so perhaps the effect is not that > small compared to the precision of the model. > > On the other hand, the more "normal" example at 1.7A (and very good data > refining down to R~15%) shows ~0.03A general variation with a variable > test set. Again, not much, but the ML DPI in this case is ~0.06A - > comparable to the variation induced by the choice of the test set. > > Cheers, > > Ed. > > -- > Hurry up, before we all come back to our senses! > Julian, King of Lemurs >
Re: [ccp4bb] should the final model be refined against full datset
On Fri, 2011-10-14 at 23:41 +0100, Phil Evans wrote: > I just tried refining a "finished" structure turning off the FreeR > set, in Refmac, and I have to say I can barely see any difference > between the two sets of coordinates. The amplitude of the shift, I presume, depends on the resolution and data quality. With a very good 1.2A dataset refined with anisotropic B-factors to R~14% what I see is ~0.005A rms shift. Which is not much, however the reported ML DPI is ~0.02A, so perhaps the effect is not that small compared to the precision of the model. On the other hand, the more "normal" example at 1.7A (and very good data refining down to R~15%) shows ~0.03A general variation with a variable test set. Again, not much, but the ML DPI in this case is ~0.06A - comparable to the variation induced by the choice of the test set. Cheers, Ed. -- Hurry up, before we all come back to our senses! Julian, King of Lemurs
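For anyone who wants to reproduce this kind of comparison, a rough sketch of measuring the coordinate shift between two refinements of the same structure (same atom naming assumed; this is a bare-bones reader of ATOM/HETATM records, not a substitute for a proper library, and the file names in the usage comment are invented):

    import math

    def read_coords(pdb_file):
        # Map (chain, resseq, atom name) -> (x, y, z) for ATOM/HETATM records.
        coords = {}
        with open(pdb_file) as fh:
            for line in fh:
                if line.startswith(("ATOM", "HETATM")):
                    key = (line[21], line[22:26].strip(), line[12:16].strip())
                    coords[key] = (float(line[30:38]), float(line[38:46]),
                                   float(line[46:54]))
        return coords

    def rms_shift(pdb_a, pdb_b):
        # RMS displacement over the atoms common to both models.
        a, b = read_coords(pdb_a), read_coords(pdb_b)
        common = set(a) & set(b)
        ssq = sum(sum((a[k][i] - b[k][i]) ** 2 for i in range(3)) for k in common)
        return math.sqrt(ssq / len(common))

    # e.g. compare rms_shift("refined_set1.pdb", "refined_set2.pdb") against
    # the reported ML-based DPI to judge whether the test-set choice moves
    # atoms by more or less than the coordinate precision of the model.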
Re: [ccp4bb] should the final model be refined against full datset
Each R-free flag corresponds to a particular HKL index. Redundancy refers to the number of times a reflection corresponding to a given HKL index is observed. The final structure factor of a given HKL can be thought of as an average of these redundant observations. Related to your question, someone once mentioned that for each particular space group, there should be a preferred R-free assignment. As far as I know, nothing tangible ever came of that idea. James On Oct 14, 2011, at 5:34 PM, D Bonsor wrote: > I may be missing something or someone could point out that I am wrong and why > as I am curious, but with a highly redundant dataset the difference between > refining the final model against the full dataset would be small based upon > the random selection of reflections for Rfree?
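To make the distinction explicit: free-R flags live on unique (merged) HKL indices, so every redundant observation of a given HKL ends up with the same flag. A minimal sketch (the deterministic hash is only an illustration of per-HKL assignment; real programs such as FREERFLAG use their own schemes):

    import hashlib

    def free_flag(hkl, n_sets=20):
        # Assign a reproducible flag 0..n_sets-1 to a unique HKL index.
        # All observations that merge to this HKL inherit the same flag.
        h = hashlib.md5(("%d_%d_%d" % hkl).encode()).hexdigest()
        return int(h, 16) % n_sets

    # Redundant observations are merged first; the merged reflection, not
    # each individual measurement, is what lands in the working or test set:
    # free_flag((1, 2, 3)) returns the same value every time for (1, 2, 3).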
Re: [ccp4bb] should the final model be refined against full datset
Now it would be interesting to refine this structure to convergence, with the original free set. If I understood correctly, Ian Tickle has done essentially this, and the Free R returns essentially to its original value: the minimum arrived at is independent of the starting point, perhaps within the limitation that one might get caught in a different false minimum (which is unlikely given the minuscule changes you see). If that is the case we should stop worrying about "corrupting" the free set by refining against it, or even using it to make the maps in which models will be adjusted. This is a perennial discussion, but I have never seen a report that the original free R is in fact _not_ recoverable by refining to convergence. Indeed, perhaps we worry too much about such things. Phil Evans wrote: I just tried refining a "finished" structure turning off the FreeR set, in Refmac, and I have to say I can barely see any difference between the two sets of coordinates. From this n=1 trial, I can't see that it improves the model significantly, nor that it ruins the model irretrievably for future purposes. I suspect we worry too much about these things Phil Evans On 14 Oct 2011, at 21:35, Nat Echols wrote: On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang wrote: Sorry, I don't quite understand your reasoning for how the structure is rendered useless if one refined it with all data. "Useless" was too strong a word (it's Friday, sorry). I guess simulated annealing can address the model-bias issue, but I'm not totally convinced that this solves the problem. And not every crystallographer will run SA every time he/she solves an isomorphous structure, so there's a real danger of misleading future users of the PDB file. The reported R-free, of course, is still meaningless in the context of the deposited model. Would your argument also apply to all the structures that were refined before R-free existed? Technically, yes - but how many proteins are there whose only representatives in the PDB were refined this way? I suspect very few; in most cases, a more recent model should be available. -Nat
Re: [ccp4bb] should the final model be refined against full datset
Dear Gerard, I'm very happy for the discussion to be on the CCP4 list (or on the IUCR forums, or both). I was only trying to not create too much traffic. All the best, Tom T >> Dear Tom, >> >> I am not sure that I feel happy with your invitation that views on >> such >> crucial matters as these deposition issues be communicated to you >> off-list. >> It would seem much healthier if these views were aired out within the BB. >> Again!, some will say ... but the difference is that there is now a forum >> for them, set up by the IUCr, that may eventually turn opinions into some >> form of action. >> >> I am sure that many subscribers to this BB, and not just you as a >> member of some committees, would be interested to hear the full variety of >> views on the desirable and the feasible in these areas, and to express >> their >> own for everyone to read and discuss. >> >> Perhaps John Helliwell can elaborate on this and on the newly created >> forum. >> >> >> With best wishes, >> >> Gerard. >> >> -- >> On Fri, Oct 14, 2011 at 04:56:20PM -0600, Thomas C. Terwilliger wrote: >>> For those who have strong opinions on what data should be deposited... >>> >>> The IUCR is just starting a serious discussion of this subject. Two >>> committees, the "Data Deposition Working Group", led by John Helliwell, >>> and the Commission on Biological Macromolecules (chaired by Xiao-Dong >>> Su) >>> are working on this. >>> >>> Two key issues are (1) feasibility and importance of deposition of raw >>> images and (2) deposition of sufficient information to fully reproduce >>> the >>> crystallographic analysis. >>> >>> I am on both committees and would be happy to hear your ideas >>> (off-list). >>> I am sure the other members of the committees would welcome your >>> thoughts >>> as well. >>> >>> -Tom T >>> >>> Tom Terwilliger >>> terwilli...@lanl.gov >>> >>> >>> >> This is a follow up (or a digression) to James comparing test set to >>> >> missing reflections. I also heard this issue mentioned before but >>> was >>> >> always too lazy to actually pursue it. >>> >> >>> >> So. >>> >> >>> >> The role of the test set is to prevent overfitting. Let's say I have >>> >> the final model and I monitored the Rfree every step of the way and >>> can >>> >> conclude that there is no overfitting. Should I do the final >>> refinement >>> >> against complete dataset? >>> >> >>> >> IMCO, I absolutely should. The test set reflections contain >>> >> information, and the "final" model is actually biased towards the >>> >> working set. Refining using all the data can only improve the >>> accuracy >>> >> of the model, if only slightly. >>> >> >>> >> The second question is practical. Let's say I want to deposit the >>> >> results of the refinement against the full dataset as my final model. >>> >> Should I not report the Rfree and instead insert a remark explaining >>> the >>> >> situation? If I report the Rfree prior to the test set removal, it >>> is >>> >> certain that every validation tool will report a mismatch. It does >>> not >>> >> seem that the PDB has a mechanism to deal with this. >>> >> >>> >> Cheers, >>> >> >>> >> Ed. >>> >> >>> >> >>> >> >>> >> -- >>> >> Oh, suddenly throwing a giraffe into a volcano to make water is >>> crazy? >>> >> Julian, King of >>> Lemurs >>> >> >> >> -- >> >> === >> * * >> * Gerard Bricogne g...@globalphasing.com * >> * * >> * Global Phasing Ltd. * >> * Sheraton House, Castle Park Tel: +44-(0)1223-353033 * >> * Cambridge CB3 0AX, UK Fax: +44-(0)1223-366889 * >> * * >> === >>
Re: [ccp4bb] should the final model be refined against full datset
I may be missing something (someone could point out that I am wrong, and why, as I am curious), but with a highly redundant dataset, wouldn't the difference between refining the final model against the full dataset and against the working set alone be small, given the random selection of reflections for Rfree?
Re: [ccp4bb] should the final model be refined against full datset
Dear Tom, I am not sure that I feel happy with your invitation that views on such crucial matters as these deposition issues be communicated to you off-list. It would seem much healthier if these views were aired out within the BB. Again!, some will say ... but the difference is that there is now a forum for them, set up by the IUCr, that may eventually turn opinions into some form of action. I am sure that many subscribers to this BB, and not just you as a member of some committees, would be interested to hear the full variety of views on the desirable and the feasible in these areas, and to express their own for everyone to read and discuss. Perhaps John Helliwell can elaborate on this and on the newly created forum. With best wishes, Gerard. -- On Fri, Oct 14, 2011 at 04:56:20PM -0600, Thomas C. Terwilliger wrote: > For those who have strong opinions on what data should be deposited... > > The IUCR is just starting a serious discussion of this subject. Two > committees, the "Data Deposition Working Group", led by John Helliwell, > and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su) > are working on this. > > Two key issues are (1) feasibility and importance of deposition of raw > images and (2) deposition of sufficient information to fully reproduce the > crystallographic analysis. > > I am on both committees and would be happy to hear your ideas (off-list). > I am sure the other members of the committees would welcome your thoughts > as well. > > -Tom T > > Tom Terwilliger > terwilli...@lanl.gov > > > >> This is a follow up (or a digression) to James comparing test set to > >> missing reflections. I also heard this issue mentioned before but was > >> always too lazy to actually pursue it. > >> > >> So. > >> > >> The role of the test set is to prevent overfitting. Let's say I have > >> the final model and I monitored the Rfree every step of the way and can > >> conclude that there is no overfitting. Should I do the final refinement > >> against complete dataset? > >> > >> IMCO, I absolutely should. The test set reflections contain > >> information, and the "final" model is actually biased towards the > >> working set. Refining using all the data can only improve the accuracy > >> of the model, if only slightly. > >> > >> The second question is practical. Let's say I want to deposit the > >> results of the refinement against the full dataset as my final model. > >> Should I not report the Rfree and instead insert a remark explaining the > >> situation? If I report the Rfree prior to the test set removal, it is > >> certain that every validation tool will report a mismatch. It does not > >> seem that the PDB has a mechanism to deal with this. > >> > >> Cheers, > >> > >> Ed. > >> > >> > >> > >> -- > >> Oh, suddenly throwing a giraffe into a volcano to make water is crazy? > >> Julian, King of Lemurs > >> -- === * * * Gerard Bricogne g...@globalphasing.com * * * * Global Phasing Ltd. * * Sheraton House, Castle Park Tel: +44-(0)1223-353033 * * Cambridge CB3 0AX, UK Fax: +44-(0)1223-366889 * * * ===
Re: [ccp4bb] should the final model be refined against full datset
For those who have strong opinions on what data should be deposited... The IUCR is just starting a serious discussion of this subject. Two committees, the "Data Deposition Working Group", led by John Helliwell, and the Commission on Biological Macromolecules (chaired by Xiao-Dong Su) are working on this. Two key issues are (1) feasibility and importance of deposition of raw images and (2) deposition of sufficient information to fully reproduce the crystallographic analysis. I am on both committees and would be happy to hear your ideas (off-list). I am sure the other members of the committees would welcome your thoughts as well. -Tom T Tom Terwilliger terwilli...@lanl.gov >> This is a follow up (or a digression) to James comparing test set to >> missing reflections. I also heard this issue mentioned before but was >> always too lazy to actually pursue it. >> >> So. >> >> The role of the test set is to prevent overfitting. Let's say I have >> the final model and I monitored the Rfree every step of the way and can >> conclude that there is no overfitting. Should I do the final refinement >> against complete dataset? >> >> IMCO, I absolutely should. The test set reflections contain >> information, and the "final" model is actually biased towards the >> working set. Refining using all the data can only improve the accuracy >> of the model, if only slightly. >> >> The second question is practical. Let's say I want to deposit the >> results of the refinement against the full dataset as my final model. >> Should I not report the Rfree and instead insert a remark explaining the >> situation? If I report the Rfree prior to the test set removal, it is >> certain that every validation tool will report a mismatch. It does not >> seem that the PDB has a mechanism to deal with this. >> >> Cheers, >> >> Ed. >> >> >> >> -- >> Oh, suddenly throwing a giraffe into a volcano to make water is crazy? >> Julian, King of Lemurs >>
Re: [ccp4bb] should the final model be refined against full datset
I just tried refining a "finished" structure turning off the FreeR set, in Refmac, and I have to say I can barely see any difference between the two sets of coordinates. From this n=1 trial, I can't see that it improves the model significantly, nor that it ruins the model irretrievably for future purposes. I suspect we worry too much about these things Phil Evans On 14 Oct 2011, at 21:35, Nat Echols wrote: > On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang wrote: > Sorry, I don't quite understand your reasoning for how the structure is > rendered useless if one refined it with all data. > > "Useless" was too strong a word (it's Friday, sorry). I guess simulated > annealing can address the model-bias issue, but I'm not totally convinced > that this solves the problem. And not every crystallographer will run SA > every time he/she solves an isomorphous structure, so there's a real danger > of misleading future users of the PDB file. The reported R-free, of course, > is still meaningless in the context of the deposited model. > > Would your argument also apply to all the structures that were refined before > R-free existed? > > Technically, yes - but how many proteins are there whose only representatives > in the PDB were refined this way? I suspect very few; in most cases, a more > recent model should be available. > > -Nat
Re: [ccp4bb] should the final model be refined against full datset
On Friday, October 14, 2011 02:45:08 pm Ed Pozharski wrote: > On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote: > > > > The benefit of including those extra 5% of data is always minimal > > And so is probably the benefit of excluding when all the steps that > require cross-validation have already been performed. My thinking is > that excluding data from analysis should always be justified (and in the > initial stages of refinement, it might be as it prevents overfitting), > not the other way around. A model with error bars is more useful than a marginally more accurate model without error bars, not least because you are probably taking it on faith that the second model is "more accurate". Crystallographers were kind of late in realizing that a cross validation test could be useful in assessing refinement. What's more, we never really learned the whole lesson. Rather than using the full test, we use only one blade of the jackknife. http://en.wikipedia.org/wiki/Cross-validation_(statistics)#K-fold_cross-validation The full test would involve running multiple parallel refinements, each one omiting a different disjoint set of reflections. The ccp4 suite is set up to do this, since Rfree flags by default run from 0-19 and refmac lets you specify which 5% subset is to be omitted from the current run. Of course, evaluating the end point becomes more complex than looking at a single number "Rfree". Surely someone must have done this! But I can't recall ever reading an analysis of such a refinement protocol. Does anyone know of relevant reports in the literature? Is there a program or script that will collect K-fold parallel output models and their residuals to generate a net indicator of model quality? Ethan -- Ethan A Merritt Biomolecular Structure Center, K-428 Health Sciences Bldg University of Washington, Seattle 98195-7742
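In the spirit of the question above, a sketch of how the K disjoint free sets could be combined into a single cross-validated indicator once the parallel refinements are done. It assumes each run reports the quantities needed for an R factor over its own omitted set (parsing them from the refinement logs is left out); pooling those sums gives an overall free R in which every reflection has been predicted exactly once, alongside the per-fold mean and spread.

    import statistics

    def combine_folds(folds):
        # folds: list of dicts, one per parallel refinement, each with
        # 'rfree' (R over the omitted set) and 'sum_fobs' (sum of |Fobs|
        # over that omitted set). Returns the per-fold mean and spread and
        # the pooled cross-validated R.
        rfrees = [f["rfree"] for f in folds]
        pooled = (sum(f["rfree"] * f["sum_fobs"] for f in folds) /
                  sum(f["sum_fobs"] for f in folds))
        return {
            "mean_rfree": statistics.mean(rfrees),
            "sd_rfree": statistics.stdev(rfrees),
            "pooled_rfree": pooled,
        }

    # e.g. with refmac run 20 times, once per FreeR_flag value 0-19, each
    # fold's 'rfree' and 'sum_fobs' would be collected from its own run.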
Re: [ccp4bb] should the final model be refined against full datset
Thanks for the clear explanation. I understood that. But I was trying to understand how this would negatively affect the initial model, rendering it useless or less useful. In the scenario that you presented, I would expect a better result (a better model) if the initial model was refined with all data, and thus a more useful one. Sure, again in your scenario, the "new" structure has seen R-free reflections at the equivalent indices of its replacement (starting) model, but their intensities should be different anyway, so I am not sure how this is bad. Even if the bias is huge - let's say this bias results in a 1% reduction in the initial R-free (exaggerating here) - how would this make one's model bad, or how would this be bad for one's science? In the end, our objective is to build the best model possible, and I think that more data would likely result in a better model, not the other way around. If we can agree that refining a model with all data would result in a better model, then wouldn't not doing so constitute a compromise of model quality for a more "pure" statistic? I had not refined a model with all data before (just to keep in line), but I wondered if I was doing the best thing. Cheers, Quyen On Oct 14, 2011, at 5:27 PM, Phil Jeffrey wrote: Let's say you have two isomorphous crystals of two different protein-ligand complexes. Same protein different ligand, same xtal form. Conventionally you'd keep the same free set reflections (hkl values) between the two datasets to reduce biasing. However if the first model had been refined against all reflections there is no longer a free set for that model, thus all hkl's have seen the atoms during refinement, and so your R-free in the second complex is initially biased to the model from the first complex. [*] The tendency is to do less refinement in these sort of isomorphous cases than in molecular replacement solutions, because the structural changes are usually far less (it is isomorphous after all) so there's a risk that the R-free will not be allowed to fully float free of that initial bias. That makes your R-free look better than it actually is. This is rather strongly analogous to using different free sets in the two datasets. However I'm not sure that this is as big of a deal as it is being made to sound. It can be dealt with straightforwardly. However refining against all the data weakens the use of R-free as a validation tool for that particular model so the people that like to judge structures based on a single number (i.e. R-free) are going to be quite put out. It's also the case that the best model probably *is* the one based on a careful last round of refinement against all data, as long as nothing much changes. That would need to be quantified in some way(s). Phil Jeffrey Princeton [* Your R-free is also initially model-biased in cases where the data are significant non-isomorphous or you're using two different xtal forms, to varying extents] I still don't understand how a structure model refined with all data would negatively affect the determination and/or refinement of an isomorphous structure using a different data set (even without doing SA first). Quyen On Oct 14, 2011, at 4:35 PM, Nat Echols wrote: On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang wrote: Sorry, I don't quite understand your reasoning for how the structure is rendered useless if one refined it with all data. "Useless" was too strong a word (it's Friday, sorry). 
I guess simulated annealing can address the model-bias issue, but I'm not totally convinced that this solves the problem. And not every crystallographer will run SA every time he/she solves an isomorphous structure, so there's a real danger of misleading future users of the PDB file. The reported R-free, of course, is still meaningless in the context of the deposited model. Would your argument also apply to all the structures that were refined before R-free existed? Technically, yes - but how many proteins are there whose only representatives in the PDB were refined this way? I suspect very few; in most cases, a more recent model should be available. -Nat
Re: [ccp4bb] should the final model be refined against full datset
We have obligations that extend beyond simply presenting a "best" model. In an ideal world, the PDB would accept two coordinate sets and two sets of statistics, one for the last step where the cross-validation set was valid, and a final model refined against all the data. Until there is a clear way to do that, and an unambiguous presentation of them to the public, IMO, the gains won by refinement against all the data are outweighed by the confusion that it can cause when presenting the model and associated statistics to the public. On Oct 14, 2011, at 3:32 PM, Jan Dohnalek wrote: > Regarding refinement against all reflections: the main goal of our work is to > provide the best possible representation of the experimental data in the form > of the structure model. Once the structure building and refinement process is > finished keeping the Rfree set separate does not make sense any more. Its > role finishes once the last set of changes have been done to the model and > verified ... > > J. Dohnalek
Re: [ccp4bb] should the final model be refined against full datset
On Fri, 2011-10-14 at 13:07 -0700, Nat Echols wrote: > You should enter the statistics for the model and data that you > actually deposit, not statistics for some other model that you might > have had at one point but which the PDB will never see. If you read my post carefully, you'll see that I never suggested reporting statistics for one model and depositing the other. > Not only does refining against R-free make it impossible to verify and > validate your structure, it also means that any time you or anyone > else wants to solve an isomorphous structure by MR using your > structure as a search model, or continue the refinement with > higher-resolution data, you will be starting with a model that has > been refined against all reflections. So any future refinements done > with that model against isomorphous data are pre-biased, making your > model potentially useless. Frankly, I think you are exaggerating the magnitude of model bias in the situation that I described. You assume that the refinement will become severely unstable after tossing in the test reflections. Depending on the resolution etc., the rms shift of the model may vary, but even if it is, say, half an angstrom (and that is hugely overestimated), the model hardly becomes useless. And at least in theory, including *all the data* should make the model more, not less, accurate. > The benefit of including those extra 5% of data is always minimal And so, probably, is the benefit of excluding them once all the steps that require cross-validation have already been performed. My thinking is that excluding data from analysis should always be justified (and in the initial stages of refinement it might be, as it prevents overfitting), not the other way around. Cheers, Ed. -- "Hurry up before we all come back to our senses!" Julian, King of Lemurs
Re: [ccp4bb] should the final model be refined against full datset
Recently we (I mean WE - the community) frequently refine structures at around 1 Angstrom resolution. This is not what Rfree was invented for. It was invented to get by with 3.0-2.8 Angstrom data, in times when people did not possess facilities good enough to look at the electron density maps... We finish (WE - again, I mean the community) the refinement of our structures too early. Dr Felix Frolow Professor of Structural Biology and Biotechnology Department of Molecular Microbiology and Biotechnology Tel Aviv University 69978, Israel Acta Crystallographica F, co-editor e-mail: mbfro...@post.tau.ac.il Tel: ++972-3640-8723 Fax: ++972-3640-9407 Cellular: 0547 459 608 On Oct 14, 2011, at 22:35 , Nat Echols wrote: > On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang wrote: > Sorry, I don't quite understand your reasoning for how the structure is > rendered useless if one refined it with all data. > > "Useless" was too strong a word (it's Friday, sorry). I guess simulated > annealing can address the model-bias issue, but I'm not totally convinced > that this solves the problem. And not every crystallographer will run SA > every time he/she solves an isomorphous structure, so there's a real danger > of misleading future users of the PDB file. The reported R-free, of course, > is still meaningless in the context of the deposited model. > > Would your argument also apply to all the structures that were refined before > R-free existed? > > Technically, yes - but how many proteins are there whose only representatives > in the PDB were refined this way? I suspect very few; in most cases, a more > recent model should be available. > > -Nat
Re: [ccp4bb] should the final model be refined against full datset
Let's say you have two isomorphous crystals of two different protein-ligand complexes. Same protein different ligand, same xtal form. Conventionally you'd keep the same free set reflections (hkl values) between the two datasets to reduce biasing. However if the first model had been refined against all reflections there is no longer a free set for that model, thus all hkl's have seen the atoms during refinement, and so your R-free in the second complex is initially biased to the model from the first complex. [*] The tendency is to do less refinement in these sort of isomorphous cases than in molecular replacement solutions, because the structural changes are usually far less (it is isomorphous after all) so there's a risk that the R-free will not be allowed to fully float free of that initial bias. That makes your R-free look better than it actually is. This is rather strongly analogous to using different free sets in the two datasets. However I'm not sure that this is as big of a deal as it is being made to sound. It can be dealt with straightforwardly. However refining against all the data weakens the use of R-free as a validation tool for that particular model so the people that like to judge structures based on a single number (i.e. R-free) are going to be quite put out. It's also the case that the best model probably *is* the one based on a careful last round of refinement against all data, as long as nothing much changes. That would need to be quantified in some way(s). Phil Jeffrey Princeton [* Your R-free is also initially model-biased in cases where the data are significant non-isomorphous or you're using two different xtal forms, to varying extents] I still don't understand how a structure model refined with all data would negatively affect the determination and/or refinement of an isomorphous structure using a different data set (even without doing SA first). Quyen On Oct 14, 2011, at 4:35 PM, Nat Echols wrote: On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang mailto:qqho...@gmail.com>> wrote: Sorry, I don't quite understand your reasoning for how the structure is rendered useless if one refined it with all data. "Useless" was too strong a word (it's Friday, sorry). I guess simulated annealing can address the model-bias issue, but I'm not totally convinced that this solves the problem. And not every crystallographer will run SA every time he/she solves an isomorphous structure, so there's a real danger of misleading future users of the PDB file. The reported R-free, of course, is still meaningless in the context of the deposited model. Would your argument also apply to all the structures that were refined before R-free existed? Technically, yes - but how many proteins are there whose only representatives in the PDB were refined this way? I suspect very few; in most cases, a more recent model should be available. -Nat
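The bookkeeping Phil describes - keeping the same free reflections (by HKL) across isomorphous datasets - is simple to express. A sketch, assuming each dataset is represented as a plain dict keyed by (h, k, l); a real workflow would typically transfer the flags with the CCP4 tools (e.g. CAD/FREERFLAG) rather than hand-rolled code.

    import random

    def transfer_free_flags(old_flags, new_hkls, n_sets=20):
        # Copy free-R flags from a previous isomorphous dataset so the same
        # HKLs stay in the test set; any HKL absent from the old dataset
        # gets a fresh random flag.
        flags = {}
        for hkl in new_hkls:
            if hkl in old_flags:
                flags[hkl] = old_flags[hkl]
            else:
                flags[hkl] = random.randrange(n_sets)
        return flags

    # old_flags: {(h, k, l): flag} from the first complex; new_hkls: the
    # unique indices of the second complex. Reflections the first model was
    # refined against keep their working status, so Rfree in the new
    # refinement is not pre-biased by the starting model.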
Re: [ccp4bb] should the final model be refined against full datset
I still don't understand how a structure model refined with all data would negatively affect the determination and/or refinement of an isomorphous structure using a different data set (even without doing SA first). Quyen On Oct 14, 2011, at 4:35 PM, Nat Echols wrote: On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang wrote: Sorry, I don't quite understand your reasoning for how the structure is rendered useless if one refined it with all data. "Useless" was too strong a word (it's Friday, sorry). I guess simulated annealing can address the model-bias issue, but I'm not totally convinced that this solves the problem. And not every crystallographer will run SA every time he/she solves an isomorphous structure, so there's a real danger of misleading future users of the PDB file. The reported R-free, of course, is still meaningless in the context of the deposited model. Would your argument also apply to all the structures that were refined before R-free existed? Technically, yes - but how many proteins are there whose only representatives in the PDB were refined this way? I suspect very few; in most cases, a more recent model should be available. -Nat
Re: [ccp4bb] should the final model be refined against full datset
On Fri, Oct 14, 2011 at 1:20 PM, Quyen Hoang wrote: > Sorry, I don't quite understand your reasoning for how the structure is > rendered useless if one refined it with all data. > "Useless" was too strong a word (it's Friday, sorry). I guess simulated annealing can address the model-bias issue, but I'm not totally convinced that this solves the problem. And not every crystallographer will run SA every time he/she solves an isomorphous structure, so there's a real danger of misleading future users of the PDB file. The reported R-free, of course, is still meaningless in the context of the deposited model. Would your argument also apply to all the structures that were refined > before R-free existed? Technically, yes - but how many proteins are there whose only representatives in the PDB were refined this way? I suspect very few; in most cases, a more recent model should be available. -Nat
Re: [ccp4bb] should the final model be refined against full datset
Regarding refinement against all reflections: the main goal of our work is to provide the best possible representation of the experimental data in the form of the structure model. Once the structure building and refinement process is finished keeping the Rfree set separate does not make sense any more. Its role finishes once the last set of changes have been done to the model and verified ... J. Dohnalek On Fri, Oct 14, 2011 at 10:23 PM, Craig A. Bingman < cbing...@biochem.wisc.edu> wrote: > Recent experience indicates that the PDB is checking these statistics very > closely for new depositions. The checks made by the PDB are intended to > prevent accidents and oversights made by honest people from creeping into > the database. "Getting away" with something seems to imply some intention > to deceive, and that is much more difficult to detect. > > On Oct 14, 2011, at 3:09 PM, Robbie Joosten wrote: > > The deposited R-free sets in the PDB are quite frequently 'unfree' or the > wrong set was deposited (checking this is one of the recommendations in the > VTF report in Structure). So at the moment you would probably get away with > depositing an unfree R-free set ;) > > > -- Jan Dohnalek, Ph.D Institute of Macromolecular Chemistry Academy of Sciences of the Czech Republic Heyrovskeho nam. 2 16206 Praha 6 Czech Republic Tel: +420 296 809 390 Fax: +420 296 809 410
Re: [ccp4bb] should the final model be refined against full datset
Recent experience indicates that the PDB is checking these statistics very closely for new depositions. The checks made by the PDB are intended to prevent accidents and oversights made by honest people from creeping into the database. "Getting away" with something seems to imply some intention to deceive, and that is much more difficult to detect. On Oct 14, 2011, at 3:09 PM, Robbie Joosten wrote: > The deposited R-free sets in the PDB are quite frequently 'unfree' or the > wrong set was deposited (checking this is one of the recommendations in the > VTF report in Structure). So at the moment you would probably get away with > depositing an unfree R-free set ;) >
Re: [ccp4bb] should the final model be refined against full datset
Sorry, I don't quite understand your reasoning for how the structure is rendered useless if one refined it with all data. Would your argument also apply to all the structures that were refined before R-free existed? Quyen You should enter the statistics for the model and data that you actually deposit, not statistics for some other model that you might have had at one point but which the PDB will never see. Not only does refining against R-free make it impossible to verify and validate your structure, it also means that any time you or anyone else wants to solve an isomorphous structure by MR using your structure as a search model, or continue the refinement with higher- resolution data, you will be starting with a model that has been refined against all reflections. So any future refinements done with that model against isomorphous data are pre-biased, making your model potentially useless. I'm amazed that anyone is still depositing structures refined against all data, but the PDB does still get a few. The benefit of including those extra 5% of data is always minimal in every paper I've seen that reports such a procedure, and far outweighed by having a reliable and relatively unbiased validation statistic that is preserved in the final deposition. (The situation may be different for very low resolution data, but those structures are a tiny fraction of the PDB.) -Nat
Re: [ccp4bb] should the final model be refined against full datset
Hi Ed, > This is a follow up (or a digression) to James comparing test set to > missing reflections. I also heard this issue mentioned before but was > always too lazy to actually pursue it. > > So. > > The role of the test set is to prevent overfitting. Let's say I have > the final model and I monitored the Rfree every step of the way and can > conclude that there is no overfitting. Should I do the final refinement > against complete dataset? > > IMCO, I absolutely should. The test set reflections contain > information, and the "final" model is actually biased towards the > working set. Refining using all the data can only improve the accuracy > of the model, if only slightly. Hmm, if your R-free set is small, the added value will also be small. If it is relatively big, then your previously established optimal weights may no longer be optimal. A more elegant thing to do would be to refine the model with, say, 20 different 5% R-free sets, deposit the ensemble, and report the average R(-free) plus a standard deviation. AFAIK, this is what the R-free set numbers that CCP4's FREERFLAG generates are for. Of course, in that case you should do enough refinement (and perhaps rebuilding) to make sure each R-free set is free. > The second question is practical. Let's say I want to deposit the > results of the refinement against the full dataset as my final model. > Should I not report the Rfree and instead insert a remark explaining the > situation? If I report the Rfree prior to the test set removal, it is > certain that every validation tool will report a mismatch. It does not > seem that the PDB has a mechanism to deal with this. The deposited R-free sets in the PDB are quite frequently 'unfree', or the wrong set was deposited (checking this is one of the recommendations in the VTF report in Structure). So at the moment you would probably get away with depositing an unfree R-free set ;) Cheers, Robbie > > Cheers, > > Ed. > > > > -- > Oh, suddenly throwing a giraffe into a volcano to make water is crazy? > Julian, King of Lemurs
Re: [ccp4bb] should the final model be refined against full datset
On Fri, Oct 14, 2011 at 12:52 PM, Ed Pozharski wrote: > The second question is practical. Let's say I want to deposit the > results of the refinement against the full dataset as my final model. > Should I not report the Rfree and instead insert a remark explaining the > situation? If I report the Rfree prior to the test set removal, it is > certain that every validation tool will report a mismatch. It does not > seem that the PDB has a mechanism to deal with this. > You should enter the statistics for the model and data that you actually deposit, not statistics for some other model that you might have had at one point but which the PDB will never see. Not only does refining against R-free make it impossible to verify and validate your structure, it also means that any time you or anyone else wants to solve an isomorphous structure by MR using your structure as a search model, or continue the refinement with higher-resolution data, you will be starting with a model that has been refined against all reflections. So any future refinements done with that model against isomorphous data are pre-biased, making your model potentially useless. I'm amazed that anyone is still depositing structures refined against all data, but the PDB does still get a few. The benefit of including those extra 5% of data is always minimal in every paper I've seen that reports such a procedure, and far outweighed by having a reliable and relatively unbiased validation statistic that is preserved in the final deposition. (The situation may be different for very low resolution data, but those structures are a tiny fraction of the PDB.) -Nat