Re: [ccp4bb] How many is too many free reflections?
Dear Axel and Paul, Thank you for reopening the Rfree and TEST set discussion. The concepts of Rfree and the TEST set play an important role in crystallography. When you introduced them back in 1992, Rfree was the first systematic method of structure validation. Its advantage is that it can use data from the structure being determined, in the absence of any other data sources. Nowadays, over two decades later, we have learned a lot about structures. Real-space approaches, from density fit, through deviations from ideal, statistically derived geometrical restraints, to packing information, together provide insight into structure correctness and guard against over-interpretation. Not to mention the relevance of the fit of the structure factors of the model (Fmodel) to the measured data (Fobs), expressed by the R-factor (Rwork). While the concept of the TEST set and its use in refinement provided a simple criterion for structure validation, it raises the following concerns:
- Refining structures against incomplete data results in structures which are off the true target. Namely, the omission of reflections from the WORK set introduces a bias of absence. This bias is a direct consequence of the orthogonality of the Fourier series terms. The bias of absence is diminished by reducing the amount of data included in the TEST set, but it nevertheless remains. Over time the portion and size of the TEST set have indeed diminished substantially.
- The identification of TEST reflections faces the problem of their independence when identical subunits present in a structure are related by NCS. I think that a substantial proportion of structures contains NCS. An interesting angle on the NCS issue is provided by the work of Silva & Rossmann (1985), who discarded most of the data almost proportionally to the level of NCS redundancy (using 1/7th for the WORK set and 6/7ths for the TEST set in the case of 10-fold NCS).
- An additional, so far almost neglected concern is the cross-propagation of systematic errors in structures. These errors are a consequence of interactions of structural parts through the chemical bonding and nonbonding energy terms used in refinement. Failure to consider errors of this origin results in too small coordinate error estimates, which are essential for the Maximum Likelihood (ML) function.
- The original use of the TEST set in refinement employed the Least Squares target; apart from the bias of absence, the omitted data do not affect the Least Squares target itself, whereas the standard ML function relies on these data and is therefore biased by them.
- Rfree is an indicator of structure correctness and is monitored during refinement to assure its decrease; however, a different choice of TEST set will result in a different phase error and a different gap between Rfree and Rwork. The relationship between the Rfree-Rwork gap and the phase error across different test sets, calculated on our 5 test cases with 4 different portions of 31 different TEST sets, is either statistically significant or insignificant; both groups contain approximately equal numbers of members. When the relationship turned out statistically significant, the lower Rfree-Rwork gap quite often delivered the higher phase error. (This part of the analysis was not included in the paper; however, the negative correlation may be seen in the trend of the orange dots in several graphs of Figure 6.) Hence, there is no guarantee that the TEST set with the lowest gap between Rfree and Rwork will also deliver the structure with the lowest phase error, which is an underlying assumption of the use of Rfree for the purpose of structure validation. This suggests that the gap between Rfree and Rwork can be easily manipulated and the manipulation not spotted.
In the absence of a reference structure it is namely impossible to discover which choice of TEST set, with its corresponding gap between Rfree and Rwork, delivers the structure with the lowest phase error. (This argument in a way supports Gerard's point that the TEST set may not be exchanged when various structures of the same crystal form of a molecule are being determined using the Rfree methodology.) The "trick" of exchanging the TEST set is no surprise to the community, which uses it on occasions when a too-large gap between Rfree and Rwork might lead to potential problems with a stubborn referee. To overcome these concerns we developed the Maximum Likelihood Free Kick function (ML FK). As the cases used in the paper indicate, the ML FK target function delivered more accurate structures and narrower solutions than today's standard Maximum Likelihood Cross Validation (ML CV) function in all tested cases, including the case of the 2AHN structure built in the wrong direction. Our understanding is that the role of Rfree should be considered from the historical perspective. In our paper we wrote "Regarding the use of Rfree to
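Dusan's point that the choice of TEST set alone changes the Rfree-Rwork gap can be illustrated with a toy simulation. This is purely synthetic arithmetic, not taken from the paper: a fixed "model" with a fixed level of error is compared against 31 different random 5% test sets, and the gap fluctuates from one choice to the next even though nothing about the model or data has changed.

```python
# Toy illustration (synthetic numbers, not from Praznikar & Turk 2014):
# the Rfree-Rwork gap for one fixed model, over 31 random TEST set choices.
import numpy as np

rng = np.random.default_rng(0)
n = 20000                                             # hypothetical unique reflections
fobs = rng.gamma(shape=2.0, scale=100.0, size=n)      # synthetic |Fobs|
fcalc = fobs * (1.0 + 0.05 * rng.standard_normal(n))  # model with ~5% random error

def r_factor(fo, fc):
    """Conventional R = sum|Fo - Fc| / sum(Fo)."""
    return np.abs(fo - fc).sum() / fo.sum()

gaps = []
for _ in range(31):                      # 31 different TEST set choices
    test = rng.random(n) < 0.05          # ~5% TEST set, chosen at random
    gap = r_factor(fobs[test], fcalc[test]) - r_factor(fobs[~test], fcalc[~test])
    gaps.append(gap)

print(f"gap spread over 31 test sets: {min(gaps):+.4f} to {max(gaps):+.4f}")
```

Here the model was not refined against the work set, so there is no systematic overfitting gap; what the spread shows is the purely statistical variation of the gap with the test-set choice, which is the part Dusan argues cannot be used to rank test sets.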
Re: [ccp4bb] How many is too many free reflections?
What is wrong with using Rfree until the very late stages of refinement, then alternating refinements with all reflections and with the Rfree reflections, while not introducing more refinement parameters? This way you would get a structure and e-map based on all the data while ensuring that the data have not been overfitted. Just a thought. On Tue, Jun 16, 2015 at 8:07 AM, dusan turk dusan.t...@ijs.si wrote: snip
Re: [ccp4bb] How many is too many free reflections?
Dear Dusan, Following up on Gerard's comment, we also read your nice paper with great interest. Your method appears most useful for cases with a limited number of reflections (e.g., small unit cell and/or low resolution) resulting in 5% test sets with less than 1000 reflections in total. It improves the performance of your implementation of ML refinement for the cases that you described. However, we don't think that you can conclude that cross-validation is no longer needed. To quote your paper, in the Discussion section: "To address the use of Rfree as an indicator of wrong structures, we repeated the Kleywegt and Jones experiment (Kleywegt & Jones, 1995, 1997) and built the 2ahn structure in the reverse direction and then refined it in the absence of solvent using the ML CV and ML FK approaches. Fig. 9 shows that Rfree stayed around 50% and Rfree-Rwork around 15% in the case of the reverse structure regardless of the ML approach and the fraction of data used in the test set. These values indicate that there is a fundamental problem with the structure, which supports the further use of Rfree as an indicator." Thank you for reaffirming the utility of the statistical tool of cross-validation. The reverse chain trace of 2ahn is admittedly an extreme case of misfitting, and would probably be detected with other validation tools as well these days. However, the danger of overfitting or misfitting is still a very real possibility for large structures, especially when only moderate- to low-resolution data are available, even with today's tools. Cross-validation can help even at very low resolution: in Structure 20, 957-966 (2012) we showed that cross-validation is useful for certain low-resolution refinements where additional restraints (DEN restraints in that case) are used to reduce overfitting and obtain a more accurate structure. Cross-validation made it possible to detect overfitting of the data when no DEN restraints were used.
We believe this should also apply when other types of restraints are used (e.g., reference model restraints in phenix.refine, REFMAC, or BUSTER). In summary, we believe that cross-validation remains an important (and conceptually simple) method to detect overfitting and for overall structure validation.
Axel T. Brunger, Professor and Chair, Department of Molecular and Cellular Physiology; Investigator, HHMI. Email: brun...@stanford.edu Phone: 650-736-1031 Web: http://atbweb.stanford.edu
Paul Adams, Deputy Division Director, Physical Biosciences Division, Lawrence Berkeley Lab; Division Deputy for Biosciences, Advanced Light Source, Lawrence Berkeley Lab; Adjunct Professor, Department of Bioengineering, U.C. Berkeley; Vice President for Technology, the Joint BioEnergy Institute; Laboratory Research Manager, ENIGMA Science Focus Area. Tel: 1-510-486-4225, Fax: 1-510-486-5909 http://cci.lbl.gov/paul
On Jun 5, 2015, at 2:18 AM, Gerard Bricogne g...@globalphasing.com wrote: snip
Re: [ccp4bb] How many is too many free reflections?
Dear Frank, I was going to reply to Ian's last comment last night, but got distracted. This last paragraph of Ian's message does sound rather negative if detached from the context of the previous one, which was about non-isomorphism between fragment complexes and the apo being the rule rather than the exception. Ian uses the Crick-Magdoff definition of an acceptable level of non-isomorphism, which is quite a stringent one because it refers to a level that would invalidate isomorphism for experimental phasing purposes. A much greater level of non-isomorphism can be tolerated when it comes to solving a target-fragment complex starting from the apo structure, so the Crick-Magdoff criterion is not relevant here. Furthermore I think that Ian identifies perhaps too readily the effect of non-isomorphism in creating noise in the comparison of intensities with its effect on invalidating the working vs. free status of observations. I think, therefore, that Ian's claim that failing the Crick-Magdoff criterion for isomorphism results in scrambling the distinction between the working set and the free set is a very big overstatement. You describe as "bookkeeping faff" the procedures that Ian and I outlined to preserve the FreeR flags of the apo refinement, and ask for a paper. These matters are probably not glamorous enough to find their way into papers, and would best be discussed (or re-discussed) in a specialised BB like this one. If the shift from the question "How many is too many?" to "How should the free set be chosen?" that I tried to bring about yesterday results in a general sharing of evidence that otherwise gets set aside, I will be very happy. I would find it unwise to dismiss this question by expecting that there would be a mountain of published evidence if it was really important. Let us go ahead, then: could everyone who has evidence (rather than preconceptions) on this matter please come forward and share it?
Answering this question is very important, even if the conclusion is that the faff is unimportant. With best wishes, Gerard. -- On Thu, Jun 04, 2015 at 10:43:15PM +0100, Frank von Delft wrote: snip
Re: [ccp4bb] How many is too many free reflections?
Dear Dusan, This is a nice paper and an interestingly different approach to avoiding bias and/or quantifying errors - and indeed there are all kinds of possibilities if you have a particular structure on which you are prepared to spend unlimited time and resources. The specific context in which Graeme's initial question led me to query instead "who should set the FreeR flags, at what stage and on what basis?" was that of the data analysis linked to high-throughput fragment screening, in which speed is of the essence at every step. Creating FreeR flags afresh for each target-fragment complex dataset without any reference to those used in the refinement of the apo structure is by no means an irrecoverable error, but it will take extra computing time to let the refinement of the complex adjust to a new free set, starting from a model refined with the ignored one. It is in order to avoid the need for that extra time, or for a recourse to various debiasing methods, that the book-keeping faff described yesterday has been introduced. Operating without it is perfectly feasible, it is just likely to not be optimally direct. I will probably bow out here, before someone asks "How many [e-mails from me] is too many?" :-) . With best wishes, Gerard. -- On Fri, Jun 05, 2015 at 09:14:18AM +0200, dusan turk wrote: snip
Re: [ccp4bb] How many is too many free reflections?
Graeme, one more suggestion. You can avoid all the recipes by using all data for the WORK set and 0 reflections for the TEST set, regardless of the amount of data, by using the FREE KICK ML target. For an explanation see our recent paper: Praznikar, J. & Turk, D. (2014) Free kick instead of cross-validation in maximum-likelihood refinement of macromolecular crystal structures. Acta Cryst. D70, 3124-3134. A link to the paper can be found at "http://www-bmb.ijs.si/doc/references.HTML". best, dusan
On Jun 5, 2015, at 1:03 AM, CCP4BB automatic digest system lists...@jiscmail.ac.uk wrote: Date: Thu, 4 Jun 2015 08:30:57 +0000 From: Graeme Winter graeme.win...@gmail.com Subject: Re: How many is too many free reflections? Hi Folks, Many thanks for all of your comments - in keeping with the spirit of the BB I have digested the responses below. Interestingly I suspect that the responses to this question indicate the very wide range of resolution limits of the data people work with! Best wishes Graeme
===
Proposal 1: 10% of reflections, max 2000
Proposal 2: from the wiki: http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set including Randy Read's recipe: "So here's the recipe I would use, for what it's worth: fewer than 10,000 reflections: set aside 10%; 10,000-20,000 reflections: set aside 1000 reflections; 20,000-40,000 reflections: set aside 5%; more than 40,000 reflections: set aside 2000 reflections"
Proposal 3: 5%, maximum 2-5k
Proposal 4: 3%, minimum 1000
Proposal 5: 5-10% of reflections, minimum 1000
Proposal 6: at least 50 reflections per bin in order to get reliable ML parameter estimation, ideally around 150 per bin
Proposal 7: if there are lots of reflections (i.e. 800K unique), around 1% selected - 5% would be 40k, i.e. rather a lot. Referees question the use of 5k reflections as a test set. Comment 1 in response to this: surely the absolute # of test reflections is not relevant, the percentage is.
Approximate consensus (i.e. what I will look at doing in xia2) - probably follow the Randy Read recipe from the ccp4wiki as this seems to (probably) satisfy most of the criteria raised by everyone else.
On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter graeme.win...@gmail.com wrote: Hi Folks, Had a vague comment handed my way that xia2 assigns too many free reflections - I have a feeling that by default it makes a free set of 5%, which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems excessive now. This was particularly the case for high-resolution data where you have a lot of reflections, so 5% could be several thousand, which would be more than you need just to check that Rfree seems OK. Since I really don't know what is the right # of reflections to assign to a free set I thought I would ask here - what do you think? Essentially I need to assign a minimum %age or minimum # - the lower of the two presumably? Any comments welcome! Thanks, best wishes, Graeme
Dr. Dusan Turk, Prof. Head of Structural Biology Group http://bio.ijs.si/sbl/ Head of Centre for Protein and Structure Production Centre of excellence for Integrated Approaches in Chemistry and Biology of Proteins, Scientific Director http://www.cipkebip.org/ Professor of Structural Biology at IPS Jozef Stefan e-mail: dusan.t...@ijs.si phone: +386 1 477 3857 Dept. of Biochem. Mol. Struct. Biol. fax: +386 1 477 3984 Jozef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia Skype: dusan.turk (voice over internet: www.skype.com)
Re: [ccp4bb] How many is too many free reflections?
I'm afraid Gerard and Ian between them have left me a bit confused with conflicting statements: On 04/06/2015 15:29, Gerard Bricogne wrote: snip In order to guard the detection of putative bound fragments against the evils of model bias, it is very important to ensure that the refinement of each complex against data collected on it does not treat as free any reflections that were part of the working set in the refinement of the apo structure. snip On 04/06/2015 17:34, Ian Tickle wrote: snip So I suspect that most of our efforts in maintaining common free R flags are for nothing; however it saves arguments with referees when it comes to publication! snip I also remember conversations and even BB threads that made me conclude that it did NOT matter to have the same Rfree set for independent datasets (e.g. different crystals). I confess I don't remember the arguments, only the relief at not having to bother with all the bookkeeping faff Gerard outlines and Ian describes. So: could someone explain in detail why this matters (or why not), and is there a URL to the evidence (a paper or anything else) in either direction? (As far as I remember, the argument went that identical free sets were unnecessary even for exactly isomorphous crystals. Something like this: model bias is not a big deal when the model has largely converged, and that's what you have for molecular substitution (as Jim Pflugrath calls it). In addition, even a weakly binding fragment compound produces intensity perturbations large enough to make model bias irrelevant.) phx
Re: [ccp4bb] How many is too many free reflections?
It seems to me that the "how many is too many" aspect of this question, and the various culinary procedures that have been proposed as answers, may have obscured another, much more fundamental issue, namely: is it really the business of the data processing package to assign FreeR flags? I would argue that it isn't. (...) Excellent point! I can't agree more. Pavel
Re: [ccp4bb] How many is too many free reflections?
In other words, the free set for each complex must be such that reflections that are also present in the apo dataset retain the FreeR flag they had in that dataset. A very easy way to achieve this: generate a complete dataset to ridiculously high resolution with the cell of your crystal, and assign free-R flags. (If the first structure has already been solved, merge its free set and extend to the new reflections.) Now for every new structure solved, discard any free set that the data reduction program may have generated and merge with the complete set, discarding reflections with no Fobs (MNF) or with SigF=0. In fact, if we consider that a dataset is just a 3-dimensional array, or some subset of it enclosing the reciprocal-space asymmetric unit, I don't see any reason we couldn't assign one universal P1 free-R set and use it for every structure in whatever space group. By taking each new dataset, merging with the universal free-R set, and discarding those reflections not present in the new data, you would obtain a random set for your structure. There could be nested (concentric?) free-R sets with 10%, 5%, 2%, 1% free, so that if you start out excluding 5% for a low-resolution structure and then get a high-resolution dataset and want to exclude 2%, you could be sure that all the 2% free reflections were also free in your previous 5% set. Thin or thick shells could be predefined. There may be problems when it is desired to exclude reflections according to some twin law or NCS. (I have just now read Nick Keep's post, which expresses some similar ideas.) eab On 06/04/2015 10:29 AM, Gerard Bricogne wrote: Dear Graeme and other contributors to this thread, It seems to me that the "how many is too many" aspect of this question, and the various culinary procedures that have been proposed as answers, may have obscured another, much more fundamental issue, namely: is it really the business of the data processing package to assign FreeR flags? I would argue that it isn't.
Re: [ccp4bb] How many is too many free reflections?
Many good points have been made on this thread so far, but mostly addressing the question of how many free reflections is enough, whereas the original question was how many is too many. I suppose a reasonable definition of too many is when the error introduced into the map by leaving out all those reflections starts to become a problem. It is easy to calculate this error: it is simply the difference between the map made using all reflections (regardless of Free-R flag) and the map made with 5% of the reflections left out. Of course, this difference map is identical to a map calculated using only the 5% free reflections, setting all others to zero. The RMS variation of this error map is actually independent of the phases used (Parseval's theorem), and it ends up being: RMSerror = RMSall * sqrt( free_frac ) where: RMSerror is the RMS variation of the error map, RMSall is the RMS variation of the map calculated with all reflections, and free_frac is the fraction of hkls left out of the calculation. So, with 5% free reflections, the errors induced in the electron density will have an RMS variation that is 22.3% of the full map's RMS variation, or 0.223 sigma units. 1% free reflections will result in an RMS 10% error, or 0.1 sigmas. This means, for example, that with 5% free reflections a 1.0 sigma peak might come up as a 1.2 or 0.8 sigma feature. Note that these are not the sigmas of the Fo-Fc map (which changes as you build) but rather the sigma of the Fo map. Most of us don't look at Fo maps, but rather 2Fo-Fc or 2mFo-DFc maps, with or without the missing reflections filled in. These are a bit different from a straight Fo map. The absolute electron number density (e-/A^3) of the 1 sigma contour for all these maps is about the same, but no doubt the fill-in, the extra Fo-Fc term, and the likelihood weights reduce the overall RMS error. By how much? That is a good question. 
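The sqrt(free_frac) relation above can be checked numerically. A minimal sketch (random Gaussian amplitudes standing in for structure-factor coefficients; Parseval's theorem lets the map RMS be computed directly from the coefficients, up to a common scale):

```python
import math
import random

random.seed(42)

# Stand-in Fourier coefficients; the distribution does not matter much,
# since Parseval's theorem relates map variance to sum(|F|^2).
n_refl = 100_000
amps = [random.gauss(0.0, 1.0) for _ in range(n_refl)]

# RMS of the full map (up to a constant scale factor).
rms_all = math.sqrt(sum(a * a for a in amps) / n_refl)

# Leave out a random 5% "free" set; the error map is built from exactly
# those omitted coefficients, with all the others set to zero.
free_frac = 0.05
free = random.sample(range(n_refl), int(n_refl * free_frac))
rms_err = math.sqrt(sum(amps[i] ** 2 for i in free) / n_refl)

ratio = rms_err / rms_all
print(f"RMSerror/RMSall = {ratio:.3f}, sqrt(free_frac) = {math.sqrt(free_frac):.3f}")
```

With a 5% free set the ratio comes out very close to sqrt(0.05) = 0.224, matching the 22.3% figure quoted in the post.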
Still, we can take this RMS 0.223 sigma variation from 5% free reflections as a worst-case scenario, and then ask the question: is this a problem? Well, any source of error can be a problem, but when you are trying to find the best compromise between two difficult-to-reconcile considerations (such as the stability of Rfree and the interpretability of the map), it is usually helpful to bring in a third consideration, such as: how much noise is in the map already due to other sources? My colleagues and I measured this recently (doi: 10.1073/pnas.1302823110), and found that the 1-sigma contour ranges from 0.8 to 1.2 e-/A^3 (relative to vacuum), experimental measurement errors are RMS ~0.04 e-/A^3, and the map error from the model-data difference is about RMS 0.13 e-/A^3. So, 22.3% of sigma is around RMS 0.22 e-/A^3. This is a bit larger than our biggest empirically-measured error, the modelling error, indicating that 5% free flags may indeed be too much. However, 22.3% is the worst-case error, in the absence of all the corrections used to make 2mFo-DFc maps, so in reality the modelling error and the omitted-reflection errors are probably comparable, indicating that 5% is about the right amount. Any more and the error from omitted reflections starts to dominate the total error. On the other hand, the modelling error is (by definition) the Fo-Fc difference, so as Rwork/Rfree get smaller the RMS map variation due to modelling errors gets smaller as well, eventually exposing the omitted-reflection error. So, once your Rwork/Rfree get to be less than ~22%, the errors in the map are starting to be dominated by the missing Fs of the 5% free set. However, early in the refinement, when your R factors are in the 30%s, 40%s, or even 50%s, I don't think the errors due to missing 5% of the reflections are going to be important. 
Then again, late in refinement, it might be a good idea to start including some or all of the free reflections back into the working set in order to reduce the overall map error (cue lamentations from validation experts such as Jane Richardson). This is perhaps the most important topic on this thread. There are so many ways to contaminate, bias or otherwise compromise the free set, and once that is done we don't have generally accepted procedures for re-sanctifying the free reflections, other than starting over again from scratch. This is especially problematic if your starting structure for molecular replacement was refined against all reflections, and your ligand soak is nice and isomorphous to those original crystals. How do you remove the evil bias from this model? You can try shaking it, but that only really removes bias at high spatial frequencies and is not so effective at low resolution. So, if bias is so easy to generate, why not use it to our advantage? Instead of leaving the free-flagged reflections out of the refinement, put them in, but give them random F values. Then do everything you can to bias your model toward these random values. Loosen the
Re: [ccp4bb] How many is too many free reflections?
Nick, What you describe is (almost) exactly the way we have always done it at Astex; I'm surprised to hear that others are not routinely doing the same. The difference is that we don't generate a free R flag MTZ file to ultra-high resolution as you suggest, since there's never any need to. What we do is generate by default a 1.5 Ang. free R flag file using UNIQUE, FREERFLAG and MTZUTILS whenever a new apo structure for a given target/crystal form is solved and keep that with the initial apo data as a reference dataset for auto-re-indexing (so that all the protein-ligand datasets are indexed the same way). When a dataset is combined with the higher resolution free R flag file we would of course cut the resolution to that of the data (still keeping the original free R flag file), mainly in order to save space in the database. Obviously if the initial apo data were higher resolution than 1.5 Ang., the processing script would generate an initial free R flag file to a correspondingly higher resolution (say to 1 Ang.). If a ligand dataset comes along later at higher resolution than 1.5 Ang., the script would do the same thing, but then it would use the MTZUTILS UNIQ option to merge the old free R flags up to 1.5 Ang. with the new ones between 1.5 and 1 Ang. Then it would combine the data file with the free R flag file as before and cut the resolution of the combined data file to the actual resolution of the data. The script would then replace the old free R flag file with the new one and use the latter for all subsequent datasets from that target/crystal form. The users are completely unaware that any of this is happening (unless they want to dig into the scripts!). 
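Stripped of the MTZ bookkeeping, the inheritance rule in this procedure is simple: every hkl keeps whatever flag it had in the reference set, and only genuinely new reflections (e.g. beyond the old resolution limit) get freshly assigned flags. A toy sketch of that logic with dicts keyed by hkl; `inherit_free_flags` is a made-up name, not a real CCP4 facility, and in practice this is done with MTZ tools rather than Python dicts:

```python
import random

def inherit_free_flags(reference_flags, dataset_hkls, free_frac=0.05, seed=0):
    """Return free/work flags for dataset_hkls, inheriting from a reference.

    reference_flags: dict mapping (h, k, l) -> True (free) / False (work)
    dataset_hkls:    iterable of (h, k, l) present in the new dataset

    Reflections already in the reference keep their status; reflections not
    seen before are flagged free with probability free_frac.
    """
    rng = random.Random(seed)
    flags = {}
    for hkl in dataset_hkls:
        if hkl in reference_flags:
            flags[hkl] = reference_flags[hkl]      # inherit work/free status
        else:
            flags[hkl] = rng.random() < free_frac  # newly assigned
    return flags

# Usage: apo reference set, then a higher-resolution ligand dataset
# containing one hkl the apo data never had.
apo = {(1, 0, 0): True, (2, 0, 0): False, (0, 3, 1): False}
ligand_hkls = [(1, 0, 0), (2, 0, 0), (0, 3, 1), (5, 5, 2)]
flags = inherit_free_flags(apo, ligand_hkls)
print(flags[(1, 0, 0)], flags[(2, 0, 0)])  # True False (inherited)
```

The point of the sketch is the branch: mixing up the two cases is exactly the "mixup of work vs. free status" Gerard warns about.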
We enforce use of 'approved' scripts for all the processing and refinement, essentially by using an Oracle database with web-based access authentication, which means that if you don't use the approved scripts to process your data then you can't upload your data to the database, which then means that no-one else will get to see and/or use your results! Our scripts make full use of CCP4 and Global Phasing programs (autoPROC, autoBUSTER, GRADE etc.); however using CCP4i or other programs from the command line to process the data and only uploading the final results to the database is severely deprecated (and totally unsupported!), mainly because there will then be no permanent traceback in the database of the user's actions for others to see. On Gerard's final point of the effect of non-isomorphism, we find that isomorphism is the exception rather than the rule, i.e. the majority of our datasets would fail the Crick-Magdoff test for isomorphism (i.e. no more than 0.5% change in all cell lengths for 3 Ang. data, and a correspondingly lower threshold at the more typical resolution limits of 2 - 1.5 Ang.). This is obviously very target- and crystal form-dependent; some targets/crystal forms give more isomorphous crystals than others. So I suspect that most of our efforts in maintaining common free R flags are for nothing; however it saves arguments with referees when it comes to publication! Cheers -- Ian On 4 June 2015 at 16:00, Nicholas Keep n.k...@mail.cryst.bbk.ac.uk wrote: I agree with Gerard. It would be much better in many ways to generate a separate file of Free R flags for each crystal form of a project to some high resolution that is unlikely to ever be exceeded, eg 0.4 A, that is a separate input file to refinement rather than in the mtz. The generation of this free set could ask some questions like: is the data twinned, do you want to extend the free set from a higher symmetry free set. 
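Ian's Crick-Magdoff criterion (at most a 0.5% change in each cell length for 3 Ang. data, with a proportionally tighter tolerance at higher resolution) can be sketched as a quick check. The linear scaling of the tolerance with d_min is my reading of "correspondingly lower threshold", so treat that scaling as an assumption:

```python
def crick_magdoff_ok(cell_ref, cell_new, d_min, tol_at_3A=0.005):
    """Rough isomorphism check on cell lengths (a, b, c).

    Allows a fractional change of 0.5% at 3 A resolution, scaled linearly
    with d_min (assumption: tighter tolerance at higher resolution).
    """
    tol = tol_at_3A * (d_min / 3.0)
    return all(abs(n - r) / r <= tol for r, n in zip(cell_ref, cell_new))

# A uniform 0.3% change passes at 3 A (tolerance 0.5%) but fails at
# 1.5 A, where only 0.25% is allowed.
print(crick_magdoff_ok((100, 60, 45), (100.3, 60.18, 45.135), 3.0))  # True
print(crick_magdoff_ok((100, 60, 45), (100.3, 60.18, 45.135), 1.5))  # False
```

This matches Ian's observation: a dataset that looks isomorphous at 3 Ang. may well fail the test at the 2 - 1.5 Ang. limits typical of ligand work.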
Re: [ccp4bb] How many is too many free reflections?
Dear Graeme and other contributors to this thread, It seems to me that the how many is too many aspect of this question, and the various culinary procedures that have been proposed as answers, may have obscured another, much more fundamental issue, namely: is it really the business of the data processing package to assign FreeR flags? I would argue that it isn't. From the statistical viewpoint that justifies the need for FreeR flags, these are pre-refinement entities rather than post-processing ones. If one considers a single instance of going from a dataset to a refined structure, then this distinction may seem artificial. Consider, instead, the case of high-throughput screening to detect fragment binding on a large number of crystals of complexes between a given target protein (the apo) and a multitude of small, weakly-binding fragments into solutions of which crystals of the apo have been soaked. The model for the apo crystal structure comes from a refinement against a dataset, using a certain set of FreeR flags. In order to guard the detection of putative bound fragments against the evils of model bias, it is very important to ensure that the refinement of each complex against data collected on it does not treat as free any reflections that were part of the working set in the refinement of the apo structure. In other words, the free set for each complex must be such that reflections that are also present in the apo dataset retain the FreeR flag they had in that dataset. Any mixup, in the FreeR flags for a complex, of the work vs. free status of the reflections also in the apo would push Rwork up and Rfree down, invalidating their role as indicators of quality of fit or of incipient overfitting. Great care must therefore be exercised, in the form of adequate book-keeping and procedures for generating the FreeR flags in the mtz file for each complex from that for the apo, to properly enforce this inheritance of work vs. free status. 
In such a context there is a clear and crucial difference between a post-processing entity and a pre-refinement one. FreeR flags belong to the latter category. In fact, the creation of FreeR flags at the end of the processing step can create a false perception, among people doing ligand screening under pressure, that they cannot re-use the FreeR flag information of the apo in refining their complexes, simply because a new set has been created for each of them. This is clearly to be avoided. Preserving the FreeR flags of the reflections that were used in the refinement of the apo structure is one of the explicit recommendations in the 2013 paper by Pozharski et al. (Acta Cryst. D69, 150-167) - see section 1.1.3, p.152. Best practice in this area may therefore not be only a question of numbers, but also of doing the appropriate thing in the appropriate place. There are of course corner cases where e.g. substantial unit-cell changes start to introduce some cross-talk between working and free reflections, but the possibility of such complications is no argument to justify giving up on doing the right thing when the right thing can be done. With best wishes, Gerard. -- On Thu, Jun 04, 2015 at 08:30:57AM +, Graeme Winter wrote: Hi Folks, Many thanks for all of your comments - in keeping with the spirit of the BB I have digested the responses below. Interestingly I suspect that the responses to this question indicate the very wide range of resolution limits of the data people work with! 
Re: [ccp4bb] How many is too many free reflections?
I agree with Gerard. It would be much better in many ways to generate a separate file of Free R flags for each crystal form of a project to some high resolution that is unlikely to ever be exceeded, eg 0.4 A, that is a separate input file to refinement rather than in the mtz. The generation of this free set could ask some questions like: is the data twinned, do you want to extend the free set from a higher symmetry free set, eg C2 rather than C2221 (symmetry is close to the higher symmetry but not perfect; this seems to happen not infrequently). Could some judicious selection of sets of potentially related hkls work as a universal free set? (Not thought this through fully.) This would get around practical issues like I had yesterday in refining in another well known package, where coot drew the map as if it was 0.5 A data even though there were only observed data to 2.1 A, the rest just being a hopelessly overoptimistic guess of the best ever dataset we might collect. I agree you CAN do this with current software - it is just not the path of least resistance, so you have to double check your group are doing this. Best wishes Nick -- Prof Nicholas H. Keep Executive Dean of School of Science Professor of Biomolecular Science Crystallography, Institute for Structural and Molecular Biology, Department of Biological Sciences Birkbeck, University of London, Malet Street, Bloomsbury LONDON WC1E 7HX email n.k...@mail.cryst.bbk.ac.uk Telephone 020-7631-6852 (Room G54a Office) 020-7631-6800 (Department Office) Fax 020-7631-6803 If you want to access me in person you have to come to the crystallography entrance and ring me or the department office from the internal phone by the door
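One concrete way to get the crystal-form-independent free set Nick describes, and the nested 10%/5%/2%/1% percentages suggested earlier in the thread, is to derive the flag deterministically from the index itself, so no flag file needs to be stored at all. This is only a sketch of the idea, not any existing program's scheme, and it sidesteps the hard parts the thread raises (symmetry-equivalent hkls would first need mapping to a common asymmetric unit, and NCS/twin-related reflections are not handled):

```python
import hashlib

def free_u(hkl):
    """Map an hkl index to a reproducible pseudo-uniform number in [0, 1)."""
    h, k, l = hkl
    digest = hashlib.md5(f"{h} {k} {l}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def is_free(hkl, fraction=0.05):
    """True if hkl belongs to the free set of the given size.

    The same u is tested against every threshold, so the 1% set is a
    subset of the 2% set, which is a subset of the 5% set, and so on;
    exactly the nesting wanted when tightening the free fraction later.
    """
    return free_u(hkl) < fraction

# Flags agree across datasets wherever the hkls overlap, at any resolution.
sample = [(h, k, l) for h in range(30) for k in range(30) for l in range(20)]
n_free = sum(is_free(hkl, 0.05) for hkl in sample)
print(f"{n_free / len(sample):.3f}")  # close to 0.050
```

Because the flag is a pure function of the index, a reflection can never silently change its work/free status between the apo and a later ligand dataset.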
Re: [ccp4bb] How many is too many free reflections?
Hi Folks, Many thanks for all of your comments - in keeping with the spirit of the BB I have digested the responses below. Interestingly I suspect that the responses to this question indicate the very wide range of resolution limits of the data people work with! Best wishes Graeme
===
Proposal 1: 10% reflections, max 2000
Proposal 2: from the wiki: http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set including the Randy Read recipe: So here's the recipe I would use, for what it's worth: < 10,000 reflections: set aside 10%; 10,000-20,000 reflections: set aside 1000 reflections; 20,000-40,000 reflections: set aside 5%; > 40,000 reflections: set aside 2000 reflections
Proposal 3: 5%, maximum 2-5k
Proposal 4: 3%, minimum 1000
Proposal 5: 5-10% of reflections, minimum 1000
Proposal 6: 50 reflections per bin in order to get reliable ML parameter estimation, ideally around 150 / bin
Proposal 7: If lots of reflections (e.g. 800K unique) around 1% selected - 5% would be 40k, i.e. rather a lot. Referees question use of 5k reflections as test set. Comment 1 in response to this: Surely the absolute # of test reflections is not relevant, the percentage is.
Approximate consensus (i.e. what I will look at doing in xia2) - probably follow the Randy Read recipe from the ccp4wiki as this seems to (probably) satisfy most of the criteria raised by everyone else. On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter graeme.win...@gmail.com wrote: Hi Folks Had a vague comment handed my way that xia2 assigns too many free reflections - I have a feeling that by default it makes a free set of 5% which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems excessive now. This was particularly in the case of high resolution data where you have a lot of reflections, so 5% could be several thousand which would be more than you need to just check Rfree seems OK. Since I really don't know what is the right # reflections to assign to a free set I thought I would ask here - what do you think? 
Essentially I need to assign a minimum %age or minimum # - the lower of the two presumably? Any comments welcome! Thanks best wishes Graeme
Re: [ccp4bb] How many is too many free reflections?
Hi Graeme, We have had a similar discussion with PDB_REDO, which is frequently forced to assign a new R-free set when the input data doesn't have one (this still happens with new PDB entries!). The '500/1000/1500/2000 reflections is enough' school seems to look only at the variance of R-free for different choices of test set, which depends on the absolute number of reflections. You also want a representative sample of reciprocal space, which depends on the fraction of reflections. In PDB_REDO we make a new test set if: - The test set is smaller than 1% of the reflections - The set has fewer than 500 reflections AND is smaller than 10% of the reflections. The new set is chosen as at least 5% of the possible reflections given the cell parameters and the resolution. If there are between 20,000 and 10,000 reflections, the percentage is increased to get at least 1000 reflections in the test set. So the maximum percentage is 10%. Funny side note: The random number generator in freerflag was set up to always pick the same test set for given resolution and cell parameters, which is useful if you misplace your test set. Unfortunately, we also had data sets from the PDB where the newly generated test set had no observed reflections. Most of these datasets were close to 95% complete ;) Cheers, Robbie From: CCP4 bulletin board [mailto:CCP4BB@JISCMAIL.AC.UK] On Behalf Of Graeme Winter Sent: Tuesday, June 2, 2015 12:27 To: CCP4BB@JISCMAIL.AC.UK Subject: [ccp4bb] How many is too many free reflections? Hi Folks Had a vague comment handed my way that xia2 assigns too many free reflections - I have a feeling that by default it makes a free set of 5% which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems excessive now. This was particularly in the case of high resolution data where you have a lot of reflections, so 5% could be several thousand which would be more than you need to just check Rfree seems OK. 
Re: [ccp4bb] How many is too many free reflections?
Hi Graeme, There's a very nice page on the (unofficial?) CCP4 wiki about it http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set For structures with a lot of reflections, a rule of thumb would be that 2000 free reflections would give an adequate reliability in the free R factor. Hope this helps, Folmer Fredslund 2015-06-02 12:26 GMT+02:00 Graeme Winter graeme.win...@gmail.com: Hi Folks Had a vague comment handed my way that xia2 assigns too many free reflections [...] -- Folmer Fredslund
Re: [ccp4bb] How many is too many free reflections?
Hi Graeme, free reflections are used for two purposes, at least: cross-validation (calculation of Rfree) and ML parameter estimation (sigmaa or alpha/beta). For the latter it is important that each relatively thin resolution bin (sufficiently thin that alpha/beta can be considered constant within it) receives no fewer than 50 reflections as an absolute minimum; in Phenix we found that ~150 per bin is sufficient and this is what's used by default. Pavel On Tue, Jun 2, 2015 at 3:26 AM, Graeme Winter graeme.win...@gmail.com wrote: Hi Folks Had a vague comment handed my way that xia2 assigns too many free reflections [...]
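Pavel's per-bin requirement gives a second lower bound on the free-set size, independent of the Rfree-stability argument: with B resolution bins you want roughly 150 free reflections in each. A small sketch (the 50 minimum and ~150 target are from the post; the arithmetic and function names are mine):

```python
def min_free_for_ml(n_bins, per_bin=150):
    """Minimum free reflections for stable ML parameter (alpha/beta)
    estimation with the given number of resolution bins."""
    return n_bins * per_bin

def max_bins(n_free, per_bin=150):
    """How finely alpha/beta can be binned with a given free-set size."""
    return n_free // per_bin

print(min_free_for_ml(20))  # 3000 free reflections needed for 20 bins
print(max_bins(2000))       # a 2000-reflection free set supports 13 bins
```

This is why a blanket "2000 is always enough" rule can conflict with ML refinement defaults: the supportable bin count, not just the Rfree variance, depends on the free-set size.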
Re: [ccp4bb] How many is too many free reflections?
Hi Graeme, in a data set with just below 800,000 independent reflections I use 1% for freeR, which is still an impressive 8,000. xia2 would have assigned 40,000 for freeR at 5%. I think this is way too much. Often we collect many data sets of the same project to find the best data. We use the default xia2 FreeR assignments at this stage, and after locating the best data set we cannot go back and reassign FreeR, as the new set would be biased towards the model. Referees/editors however query cases when over 5,000 reflections were used for cross-validation. Misha From: CCP4 bulletin board [CCP4BB@JISCMAIL.AC.UK] on behalf of Pavel Afonine [pafon...@gmail.com] Sent: Tuesday, June 2, 2015 3:10 PM To: CCP4BB@JISCMAIL.AC.UK Subject: Re: [ccp4bb] How many is too many free reflections? Hi Graeme, free reflections are used for two purposes, at least: cross-validation (calculation of Rfree) and ML parameter estimation [...]