Dear Jonathan,

Here is some additional explanation on the NCS dependent and NCS independent 
R-free issues.

I think we can all agree that redundancy is related to the accuracy of a 
measurement: the higher the redundancy of a data set, the more accurate its 
averaged values are. In crystallography, compare the P1 and P6 space groups. 
If we collect and process a data set rotated through 180 degrees about an 
arbitrary spindle axis, then the P6 data will have approximately 6-fold higher 
redundancy than the P1 data when merged to the asymmetric unit. However, the 
asymmetric unit of P6 is 6 times smaller than that of P1. If we leave out a 
small portion of the P6 data set, completeness will hardly be affected and the 
processed data will be almost the same; the same will be true for the P1 data 
set. The difference will only be in redundancy, and thereby in accuracy. 
However, if we leave out 5/6 of the data (also considering Friedel symmetry), 
then the P6 data set will still have high completeness, whereas the P1 data 
set will be highly incomplete. The same argument can be applied to data with 
NCS, as demonstrated by the Silva and Rossmann paper. Hence, one can remove a 
lot of data from a system with high NCS and the system will still be 
completely described, even though the lowered redundancy will lower the 
accuracy. What I want to say is that redundancy is good because it increases 
accuracy; as for NCS bias in R-free, the bias is there, and it is therefore 
understandable that R-work, R-free and the R-free gap should remain about the 
same. I am actually surprised that they are that different.
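The P1 versus P6 argument above can be illustrated with a toy simulation (pure Python, made-up numbers, not a real crystallographic calculation): give every unique reflection a multiplicity of 6 (P6-like) or 1 (P1-like), discard 5/6 of the raw observations at random, and see what fraction of unique reflections still has at least one observation.

```python
import random

def completeness_after_discard(n_unique, multiplicity, discard_frac, seed=0):
    """Toy model: every unique reflection is measured `multiplicity` times;
    throw away a random fraction of all observations, then report the
    fraction of unique reflections still observed at least once."""
    rng = random.Random(seed)
    observations = [h for h in range(n_unique) for _ in range(multiplicity)]
    kept = {h for h in observations if rng.random() >= discard_frac}
    return len(kept) / n_unique

# Discard 5/6 of the observations in both cases:
p6 = completeness_after_discard(20000, 6, 5 / 6)  # ~1 - (5/6)^6, i.e. mostly complete
p1 = completeness_after_discard(20000, 1, 5 / 6)  # ~1/6, i.e. highly incomplete
print(f"P6-like completeness: {p6:.2f}, P1-like completeness: {p1:.2f}")
```

The redundant (P6-like) set retains roughly two thirds of its unique reflections, while the non-redundant (P1-like) set keeps only about one sixth — the completeness contrast the paragraph describes, in miniature.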

On the R-work, R-free and R-free gap of the molecules without NCS: visual 
comparison of the numbers does not provide a validated answer on whether they 
correspond to the same structure or not. This is a matter of accuracy 
requirements. As it appears from your message, your requirements were met, 
hence you are right. However, to answer the question whether the differences 
between them are meaningful (i.e. whether the structures are equally far from 
the true answer), one must not only look at R-work and R-free, but also at the 
accompanying model that delivers these numbers. In order to deliver a 
validated statement about the relationship between the R-free gap and the 
correctness of the structure, one needs references (the actual objects) to 
which the refined structures are compared. In our paper we used two 
comparisons: the RMS deviation between the refined and the true reference 
structure, and the average phase difference between the Fcalc of the refined 
model and those of the true reference structure. Because in reality we do not 
have the true structures, for our analysis to be valid we chose data sets 
truncated at high resolution, a model with errors, and starting models (coming 
out of molecular replacement), and compared them with the final structures. 
All these tests were based on the assumption that when the target model is far 
enough from the starting point, the final refined structure can be used as a 
reference for the true structure. To deliver a validated statement about the 
poor to nonexistent correlation between the R-free gap and the phase error of 
the model, we selected 31 different data sets with different portions assigned 
to the TEST set and refined each structure against each of the sets until 
convergence, without any model rebuilding or human intervention. We then 
calculated the "correlation" between the phase errors and the R-free gaps and 
established that the relation is poor or nonexistent. So, in the absence of a 
comparison with the true reference structures, it is in my opinion impossible 
to establish whether the differences between the numbers in the table you 
calculated are meaningful or not.
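The last step — correlating phase errors with R-free gaps across refinements — is just a sample Pearson coefficient. A minimal sketch (pure Python; the data below are invented for illustration and are NOT the paper's numbers):

```python
import math

def pearson(xs, ys):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-refinement statistics (one entry per TEST-set choice):
rfree_gap = [0.041, 0.035, 0.052, 0.047, 0.038, 0.044]
phase_err = [39.1, 38.7, 39.4, 38.9, 39.2, 39.0]  # mean phase error, degrees
r = pearson(rfree_gap, phase_err)
```

With 31 refinements per structure, as in the paper, a persistently small |r| is what "poor or nonexistent" correlation means quantitatively.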

I hope this helps,

dusan


> On 3 Jun 2019, at 01:00, CCP4BB automatic digest system 
> <lists...@jiscmail.ac.uk> wrote:
> 
> Date:    Sun, 2 Jun 2019 19:28:04 +0100
> From:    Eleanor Dodson <eleanor.dod...@york.ac.uk>
> Subject: Re: Does ncs bias R-free? And if so, can it be avoided by special 
> selection of the free set?
> 
> The current Rfree selection is done in the highest possible Laue group - eg
> trigonal uses P6/mmm - then the selection is propagated to the chosen Laue
> group - eg P3. So IF the ncs reflects a higher Laue symmetry, as it often
> does, the FreeR is sort of buffered against the ncs effect.
> 
> That won't always be true of course, but it does help avoid NCS bias.
> Eleanor
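An editorial sketch of the idea Eleanor describes (hypothetical helper names, pure Python, not the CCP4 implementation): derive each reflection's free flag from a canonical representative under the higher symmetry, so that symmetry-related mates always receive the same flag.

```python
import hashlib

def canonical(hkl, ops):
    """Map (h,k,l) to the lexicographically smallest equivalent index
    under the given symmetry operations (each op permutes/negates hkl)."""
    return min(op(hkl) for op in ops)

def free_flag(hkl, ops, free_frac=0.05):
    """Deterministic flag derived from the canonical index, so mates
    under `ops` always land in the same (work or free) set."""
    key = repr(canonical(hkl, ops)).encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return (digest % 10_000) < free_frac * 10_000

# Toy 2-fold operator (h,k,l) -> (k,h,-l) plus identity; a real program
# would use the full Laue-group operators and Friedel mates.
ops = [lambda v: v, lambda v: (v[1], v[0], -v[2])]
assert free_flag((3, 5, -2), ops) == free_flag((5, 3, 2), ops)
```

Because the flag depends only on the canonical index, propagating the selection from the highest Laue group to a subgroup cannot split a symmetry-related pair between the work and free sets.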
> 
> On Sat, 1 Jun 2019 at 22:57, Jonathan Cooper <
> 00000c2488af9525-dmarc-requ...@jiscmail.ac.uk> wrote:
> 
>> I have done some more tests with different programs for choosing the
>> R-free set in shells or at random and the results are at the same link:
>> 
>> https://www.ucl.ac.uk/~rmhajc0/rfreetests.pdf
>> 
>> There still seems to be no significant difference between the normal
>> R-free and the R-free in shells, with up to 20-fold NCS present. I can't
>> comment on twinning, but with NCS it would seem that the normal CCP4 way of
>> picking the R-free set is as good as anything else!
>> On Sunday, 26 May 2019, 14:02:50 BST, dusan turk <dusan.t...@ijs.si>
>> wrote:
>> 
>> 
>> Dear colleagues,
>> 
>> 
>>> Does ncs bias R-free? And if so, can it be avoided by special selection
>> of
>>    the free set?
>> 
>> It occurs to me that we tend to forget that the objective of structure
>> determination is not the model with the lowest model bias, but the model
>> which is closest to the true structure. The structure without model bias is
>> the structure without a model - which is not really helpful.
>> 
>> An angle on the NCS issue is provided by the work of Silva & Rossmann
>> (1985, Acta Cryst B41, 147-157), who discarded most of the data almost
>> proportionally to the level of NCS redundancy (using 1/7th for the WORK set
>> and 6/7 for the TEST set in the case of 10-fold NCS). They did it in the 1980s in order
>> to make refinement of their large structure computationally feasible:
>> “Despite the reduction in the number of variables imposed by the
>> non-crystallographic constraints, the problem remained a formidable one if
>> all 298615 crystallographically independent reflections were to be used in
>> the refinement. However, the reduction of size of the asymmetric unit in
>> real space should be equivalent to a corresponding reduction in reciprocal
>> space. Hence, one-tenth of refinement of the independent data might suffice
>> for refinement.” In conclusion they stated that “This is the first time
>> that the structure of a complete virus has been refined by a
>> reciprocal-space method.” To conclude, to select an independent data set to
>> refine against, one should take an n-th fraction of reflections from the
>> data set containing the n-fold NCS.
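Read literally, the selection described above is a random 1/n partition of the unique reflections. A sketch of it (editorial illustration, hypothetical function name, pure Python):

```python
import random

def split_for_ncs(reflections, ncs_order, seed=0):
    """Randomly keep a 1/n fraction of the reflections as the WORK set
    for n-fold NCS; everything else goes to the TEST set."""
    rng = random.Random(seed)
    shuffled = list(reflections)
    rng.shuffle(shuffled)
    n_work = max(1, len(shuffled) // ncs_order)
    return shuffled[:n_work], shuffled[n_work:]

refls = [(h, k, l) for h in range(12) for k in range(12) for l in range(12)]
work, test = split_for_ncs(refls, ncs_order=10)
# work holds ~1/10 of the data, as in the Silva & Rossmann scheme
```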
>> 
>> Now on the bias of the concept of R-free itself. As we know, each term in
>> a Fourier series is orthogonal to all other terms, hence the projection
>> of any two terms on each other is zero. We also know that the diffraction
>> pattern of a crystal structure is composed of Iobs, which reflect Fobs, and
>> that the Fobs are the terms of a Fourier series. From the measured set of
>> Iobs we can directly calculate |Fobs|, but not their phases. To calculate
>> the phases in refinement we use Fmodel structure factors, of which the most
>> significant part is the Fcalc calculated from the atomic model. As the
>> model is changed during model building and refinement (atomic positions,
>> B-factors and occupancies), all Fmodel structure factors change in size and
>> in phase angle.
>> 
>> During refinement with a cross-validated maximum-likelihood target
>> function, the atomic model is fitted against a selected subset of |Fobs|,
>> called the WORK set, using the corresponding subset of Fmodel. The
>> remaining part of the Fmodel structure factors, called the TEST set, is
>> used to calculate the weighting terms used in refinement, which are based
>> on phase error estimates. This Fmodel fraction depends equally on the
>> attributes of all atoms of the model. As a consequence, the TEST fraction
>> of Fmodel structure factors is model dependent. Now comes the catch: if the
>> TEST fraction of structure factors (Fobs) were truly independent of the
>> model, then it should remain so during refinement. As a consequence, and as
>> simultaneous proof of this independence, R-free should not be affected by
>> refinement. As we know, this holds only for incorrect structure solutions,
>> whose atoms are refined in directions that do not lead towards the true
>> structure. As soon as a structure solution is correct, its improvement will
>> lower R-free, because the model is related to the true crystal structure.
>> This is in my opinion the only true value of the R-free gap criterion. The
>> problems are that the use of the WORK subset makes refinement aim off the
>> true target, and that the use of the TEST fraction for estimating phase
>> error is an approximation not justified by the claimed independence of the
>> TEST set. I do not want to undermine the historical importance of the TEST
>> set for refinement and structure validation; however, we need and can do
>> better.
>> 
>> As shown by Silva & Rossmann in 1985, the concept of independence of a
>> TEST subset fraction of the Fobs structure factors does not hold for
>> structures composed of equal copies of a molecule in the asymmetric unit
>> of a crystal (crystals with NCS). The same reasoning can be applied to
>> twinned data sets. Moreover, de-twinning is model dependent, hence the
>> claim of independence of the TEST and WORK subsets of Fobs structure
>> factors actually fails due to the dependency of the Fmodel WORK and TEST
>> subsets.
>> 
>> A significant part of model bias originates from the use of chemical
>> restraints in refinement, which affect the positions of intermediate
>> bonding and non-bonding partners and propagate through the
>> crystallographic terms to all atoms. To overcome this problem we replaced
>> the calculation of phase error estimates based on the TEST subset of
>> structure factors with a calculation of phase error estimates that uses
>> the WORK subset or all data, with Fmodel structure factors calculated from
>> a kicked model generated by randomly displacing the atomic positions. In
>> Figures 6 and 7 there is a poor or non-existent correlation between R-free
>> gaps and phase errors. For details please read Praznikar, J. & Turk, D.
>> (2014). Free kick instead of cross-validation in maximum-likelihood
>> refinement of macromolecular crystal structures. Acta Cryst. D70,
>> 3124-3134 (http://journals.iucr.org/d/issues/2014/12/00/lv5072/lv5072.pd).
>> We concluded: “Since the ML FK approach allows the use of all data in
>> refinement with a gain in structure accuracy and thereby delivers lower
>> model bias, this work encourages the use of all data in the refinement of
>> macromolecular structures.”
>> 
>> Just to add, it appears that the R-free discussions keep resurfacing,
>> because the use of the R-free concept in refinement and structure
>> validation persistently raises doubts about its validity. The discussions
>> that follow try to strengthen the beliefs. In my opinion, however, “the
>> persistent use of R-free as an indicator of structure correctness is a
>> result of the desire to simplify reality by wishful thinking” (Turk
>> (2017), Boxes of Model Building and Visualization, in Protein
>> Crystallography, Methods in Molecular Biology 1607, Springer Protocols).
>> 
>> I hope this helps to clarify a few issues.
>> 
>> dusan turk
>> 
>>> On 25 May 2019, at 01:00, CCP4BB automatic digest system <
>> lists...@jiscmail.ac.uk> wrote:
>>> 
>>> Date:    Fri, 24 May 2019 22:27:28 +0000
>>> From:    Jonathan Cooper <bogba...@yahoo.co.uk>
>>> Subject: Re: Does ncs bias R-free? And if so, can it be avoided by
>> special selection of the free set?
>>> 
>>> Having been fond of the idea discussed above i.e. that when NCS is
>> present, one should have the R-free set chosen in shells, I did some simple
>> tests. Many others must have done the same, but here's how it went:
>>> 1) Choose a few familiar structures, both with and without NCS and get
>> the data.
>>> 2) Since there was some difficulty in remembering if the original R-free
>> sets were in shells or not, I ditched any existing test set (shock, but see
>> 3 below) and generated new ones, both at random and in shells (using
>> SFTOOLS and I repeated some with an old copy of SHELXPRO). Some of the
>> reflection files lacked original R-free sets since they were deposited
>> before the R-free was invented.
>>> 3) Reduce the bias of each model to the reflections that are now in the
>> new test sets and tease out over-fitting by rattling the structures a bit,
>> i.e. add a random +/-0.1 Angstroms to x, y and z of each atom (0.17
>> Angstroms net shift) and reset all the B-factors to 30 A^2.
>>> 4) Refine the rattled structures with the new R-free sets, i.e. random
>> and in shells (no NCS restraints).
>>> 5) If anyone is really interested, the results are here:
>>> https://www.ucl.ac.uk/~rmhajc0/rfreetests.pdf
>>> but to summarise, assuming the programs have picked the test sets in
>> shells or otherwise correctly (!), there seems to be no significant
>> difference between the R-free in shells and the normal one, whether NCS is
>> present or not. If anything, the R-free in shells tends to be a tiny bit
>> lower than the normal R-free when NCS is present, although this is probably
>> by chance due to the small number of tests done!
>>> I am sure this is a well known fact, but haven't had the chance to test
>> it till now!
>> On Sunday, 19 May 2019, 13:22:00 BST, Ian Tickle <
>> ianj...@gmail.com> wrote:
>>> 
>>> 
>>> Hi Ed
>>> Yes, Rfree: my favourite topic, I'll take this one on!  First off, we
>> all need to be ultra-careful and precise about the terminology here, for
>> fear of creating even more confusion.  For example what on earth is meant
>> by "reflections ... are uncorrelated"?  A reflection can be regarded as an
>> object that possesses a set of attributes (indices, d spacing, setting
>> angles, position on detector, LP correction, intensity, amplitude, phase,
>> errors in those, etc. etc.).  An object as such is not associated with any
>> kind of value (it is rather an instance of a class of objects possessing
>> the same set of attributes but with different values for those attributes),
>> so it's totally meaningless to talk about the correlation, or lack thereof,
>> of two sets of objects (what's the correlation of a bag of apples and a bag
>> of oranges?).  You can only talk about the correlation of the values of the
>> objects' attributes (e.g. the apples' and oranges' size or weight).
>> Perhaps you'll say that it was clear from the context that you meant the
>> correlation of the reflection's measured intensities (or amplitudes).  If
>> that is what you meant then you would be wrong!  The fact that it's not
>> about NCS-related intensities or amplitudes does rather throw a spanner in
>> the works of those who claim that it's the correlation of these
>> quantities that obliges one to choose the test set in a certain way.
>>> Before I say why, I would also point out that R factors are not the
>> quantities minimised in refinement: for one thing the conventional Rwork
>> and Rfree are unweighted so all reflections whether poorly or well-measured
>> contribute equally, which makes no sense.  In ML refinement it's the
>> negative log-likelihood gain (-LLG) that's minimised so that is the
>> quantity you should be using.  This means that one cannot expect Rwork to
>> be a minimum at convergence since it's not directly related to LLGwork.  In
>> addition one has no idea what is the confidence interval of an R factor so
>> it's impossible to say whether a given decrease in R is significant or
>> not.  So R factors are entirely unsuited for any kind of quantitative
>> analysis of model errors, and I despair when I read papers that do just
>> that.  The R factor was devised in the 50's before calculators or computers
>> became readily available and crystallographic computations were performed
>> with pencil & paper!  So the form of the R factor, i.e. using an unweighted
>> absolute value instead of a weighted square as would have been appropriate
>> for least squares refinement, was specifically designed as a
>> rough-and-ready guide of refinement progress, not a quantitative measure.
>>> To see why it's not about intensities or amplitudes, it's important to
>> understand the purpose and operation of cross-validation (a.k.a.
>> 'jack-knife test') with a test set set aside for this purpose and using a
>> statistic such as LLGfree (or Rfree if you must), in order to quantify the
>> agreement of the model with the test set.  In any scientific experiment the
>> measuring apparatus is never perfect so never reports the true values of
>> the quantities being measured: measurement errors are an inevitable fact of
>> life.  Cross-validation flags up the impact of these errors on the model
>> that is used to explain the measurements by some process of best-fitting to
>> them.  Note that by 'model' I mean the mathematical model, i.e. in this
>> case the structure-factor equation that relates the atomic model to the
>> measurements.  The adjustments in the model's variable parameters (x, y, z,
>> B etc.) during refinement may give a closer fit between the true and
>> calculated amplitudes, in which case both -LLGwork and -LLGfree will
>> decrease (as indicated above, Rwork and Rfree may go up or down
>> unpredictably).
>>> Unfortunately we have only the measured amplitudes, not the true ones,
>> so in this process of fitting one may go too far and fit to the measurement
>> errors ('overfitting'), which will obviously introduce errors in the
>> model.  If one only considers the refinement target function (LLG) or
>> Rwork, it will always appear that the model is improving even when it isn't
>> (i.e. agreeing better with the measured values but not necessarily with the
>> true values due to the errors in the measured values).  This generally
>> happens because in the attempt to extract more detail in the model from the
>> data one has set up a model with more variables (or fewer/too loose
>> restraints) than the data can support.
>>> Since the changes in the model on overfitting will not be related to
>> changes required to obtain the true model values but to completely
>> arbitrary random numbers unrelated to the truth, and provided the
>> measurement errors in the test set are uncorrelated with those in the
>> working set, the test-set statistic will most likely go on its own sweet
>> way (i.e. up) indicating overfitting.  If for any reason the measurement
>> errors of working and test-set reflections are correlated, then the
>> test-set statistic will be biased towards the working-set value and so will
>> not be a reliable diagnostic of overfitting.  Note that the overfitting
>> fate is decided at the point where we choose the starting set of parameters
>> and restraints, though it doesn't become apparent until after the
>> subsequent refinement run has completed.  Then one should redesign the
>> model with fewer variables and/or more/tighter restraints, and repeat the
>> last run, rather than proceed further with the faulty model.  If
>> overfitting is diagnosed by the cross-validation test, try something else!
>>> So there you have it: what matters is that the _errors_ in the
>> NCS-related amplitudes are uncorrelated, or at least no more correlated
>> than the errors in the non-NCS-related amplitudes, NOT the amplitudes
>> themselves.  This is like when talking about the standard deviation of a
>> quantity, do you mean the quantity itself (e.g. the electron density in the
>> map), or the _error_ in that quantity (the practice of calling the latter
>> the 'standard deviation in the error' or 'standard error' to avoid this
>> confusion is to be commended).
>>> Finally let's examine this: are the _errors_ in the NCS-related
>> amplitudes expected to be more correlated than errors of non-NCS-related
>> amplitudes, giving test-set statistic bias if the NCS-related working-set
>> reflection is selected to be in the test set, as opposed to having both in
>> the same set?  Clearly counting errors are totally random and uncorrelated
>> with anything so they will contribute zero correlation to both NCS and
>> non-NCS-related errors in amplitudes.  What other sources of measurement
>> error are there?  - most likely errors in image scale factors, errors due
>> to variability in the illuminated volume of the crystal and errors due to
>> radiation damage.  Is there any reason to believe that any of these effects
>> could introduce more correlation of errors of NCS-related intensities
>> compared with non-NCS-related?  I would suggest that this could happen only
>> by a complete fluke!
>>> Cheers
>>> -- Ian
>>> 
>>> On Sun, 19 May 2019 at 04:34, Edward A. Berry <ber...@upstate.edu>
>> wrote:
>>> 
>>> Revisiting (and testing) an old question:
>>> 
>>> On 08/12/2003 02:38 PM, wgsc...@chemistry.ucsc.edu wrote:
>>> 
>>>> On 08/12/2003 06:43 AM, Dirk Kostrewa wrote:
>>>>> 
>>>>> (1) you only need to take special care for choosing a test set if you
>> _apply_
>>>>> the NCS in your refinement, either as restraints or as constraints. If
>> you
>>>>> refine your NCS protomers without any NCS restraints/constraints, both
>> your
>>>>> protomers and your reflections will be independent, and thus no
>> special care
>>>>> for choosing a test set has to be taken
>>>> 
>>>> If your space group is P6 with only one molecule in the asymmetric unit
>> but you instead choose the subgroup P3 in which to refine it, and you now
>> have two molecules per asymmetric unit related by "local" symmetry to one
>> another, but you don't apply it, does that mean that reflections that are
>> the same (by symmetry) in P6 are uncorrelated in P3 unless you apply the
>> "NCS"?
>>> 
>>> ===================================================
>>> The experiment described below  seems to show that Dirk's initial
>>> statement was correct: even in the case where the "ncs" is actually
>>> crystallographic, and the free set is chosen randomly, R-free is not
>>> affected by how you pick the free set.  A structure is refined with
>>> artificially low symmetry, so that a 2-fold crystallographic operator
>>> becomes "NCS". Free reflections are picked either randomly (in which
>>> case the great majority of free reflections are related by the NCS to
>>> working reflections), or taking the lattice symmetry into account so
>>> that symm-related pairs are either both free or both working. The final
>>> R-factors are not significantly different, even with repeating each mode
>>> 10 times with independently selected free sets. They are also not
>>> significantly different from the values obtained refining in the correct
>>> space group, where there is no ncs.
>>> 
>>> Maybe this is not really surprising. Since symmetry-related reflections
>>> have the same resolution, picking free reflections this way is one way
>>> of picking them in (very) thin shells, and this has been reported not to
>>> avoid bias: See Table 2 of Kleywegt and Brunger Structure 1996, Vol 4,
>>> 897-904. Also results of Chapman et al.(Acta Cryst. D62, 227–238). And
>> see:
>>> http://www.phenix-online.org/pipermail/phenixbb/2012-January/018259.html
>>> 
>>> But this is more significant: in cases of lattice symmetry like this,
>>> the ncs takes working reflections directly onto free reflections. In the
>>> case of true ncs the operator takes the reflection to a point between
>>> neighboring reflections, which are closely coupled to that point by the
>>> Rossmann G function. Some of these neighbors are outside the thin shell
>>> (if the original reflection was inside; or vice versa), and thus defeat
>>> the thin-shells strategy.  In our case the symm-related free reflection
>>> is directly coupled to the working reflection by the ncs operator, and
>>> its neighbors are no closer than the neighbors of the original
>>> reflection, so if there is bias due to NCS it should be principally
>>> through the sym-related reflection and not through its neighbors. And so
>>> most of the bias should be eliminated by picking the free set in thin
>>> shells or by lattice symmetry.
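The thin-shells selection discussed above (the Dataman-style scheme) can be sketched as binning the unique reflections by resolution and flagging whole bins as free. This is an editorial illustration with a hypothetical helper and fake geometry, not the actual program's algorithm:

```python
def thin_shell_free_flags(reflections, d_spacing, n_shells=100, free_every=20):
    """Sort unique reflections by resolution, cut the sorted list into
    `n_shells` thin shells, and flag every `free_every`-th shell as free,
    so that reflections of (nearly) the same resolution share a flag."""
    order = sorted(reflections, key=d_spacing)
    shell_size = max(1, len(order) // n_shells)
    flags = {}
    for i, hkl in enumerate(order):
        shell = i // shell_size
        flags[hkl] = (shell % free_every == 0)
    return flags

# Toy example: fake "d-spacings" from |h|+|k|+|l| (not real geometry)
refls = [(h, k, l) for h in range(6) for k in range(6) for l in range(6)]
flags = thin_shell_free_flags(refls, d_spacing=lambda v: sum(map(abs, v)))
frac_free = sum(flags.values()) / len(flags)  # roughly 1/free_every
```

Since NCS-related reflections lie at (almost) the same resolution, a whole thin shell is either entirely free or entirely working, which is exactly why the scheme is proposed as a guard against NCS coupling, and why the G-function argument above says its protection is incomplete.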
>>> 
>>> Also, since the "ncs" is really crystallographic, we have the control of
>>> refining in the correct space group where there is no ncs. The R-factors
>>> were not significantly different when the structure was refined in the
>>> correct space group. (Although it could be argued that that leads to a
>>> better structure, and the only reason the R-factors were the same is
>>> that bias in the lower symmetry refinement resulted in lowering Rfree
>>> to the same level.)
>>> 
>>> Just one example, but it is the first I tried- no cherry-picking. I
>>> would be interested to know if anyone has an example where taking
>>> lattice symmetry into account did make a difference.
>>> 
>>> For me the lack of effect is most simply explained by saying that, while
>>> of course ncs-related reflections are correlated in their Fo's and Fc's,
>>> and perhaps in their |Fo-Fc|'s, I see no reason to expect that the
>>> _changes_ in |Fo-Fc| produced by a step of refinement will be correlated
>>> (I can expound on this). Therefore whatever refinement is doing to
>>> improve the fit to working reflections is equally likely to improve or
>>> worsen the fit to sym-related free reflections. In that case it is hard
>>> to see how refinement against working reflections could bias their
>>> symm-related free reflections.  (Then how does R-free work? Why does
>>> R-free come down at all when you refine? Because of coupling to
>>> neighboring working reflections by the G-function?)
>>> 
>>> Summary of results (details below):
>>> 0. structure 2CHR, I422, as reported in PDB (with 2-Sigma cutoff)
>>> R: 0.189          Rfree: 0.264  Nfree:442(5%)  Nrefl: 9087
>>> 
>>> 1. The deposited 2chr (I422) was refined in that space group with the
>>> original free set. No Sigma cutoff, 10 macrocycles.
>>> R: 0.1767        Rfree: 0.2403  Nfree:442(5%)  Nrefl: 9087
>>> 
>>> 2. The deposited structure was refined in I422 10 times, 50 macrocycles
>>> each, with randomly picked 10% free reflections
>>> R: 0.1725±0.0013  Rfree: 0.2507±0.0062  Nfree: 908.9±0.32  Nrefl: 9087
>>> 
>>> 3. The structure was expanded to an I4 dimer related by the unused I422
>>> crystallographic operator, matching the dimer of 1chr. This dimer was
>>> refined against the original (I4) data of 1chr, picking free reflections
>>> in symmetry related pairs. This was repeated 10 times with different
>>> random seed for picking reflections.
>>> R: 0.1666±0.0012  **Rfree:0.2523±0.0077  Nfree: 1601.4  Nrefl:16011
>>> 
>>> 4. same as 3 but picking free reflections randomly without regard for
>>> lattice symmetry.
>>> On average 15 free reflections were in pairs, 212 were invariant under
>>> the operator (no sym-mate) and 1374 (86%) were paired with working
>>> reflections.
>>> R: 0.1674±0.0017  **Rfree:0.2523±0.0050  Nfree: 1600.9 Nrefl:16011
>>> 
>>> (**-Average Rfree almost identical by coincidence- the individual
>>> results were all different)
>>> 
>>> Detailed results from the individual refinement runs are available in
>>> spreadsheet in dropbox:
>>> https://www.dropbox.com/s/fwk6q90xbc5r8n1/NCSbias.xls?dl=0
>>> Scripts used in running the tests are also there in NCSbias.tgz:
>>> https://www.dropbox.com/s/sul7a6hzd5krppw/NCSbias.tgz?dl=0
>>> 
>>> ========================================
>>> 
>>> Methods:
>>> I would like an experiment where relatively complete data is available
>>> in the lower symmetry. To get something that is available to everyone, I
>>> choose from the PDB. A good example is 2CHR, in space group I422, which
>>> was originally solved and the data deposited in I4 with two molecules in
>>> the asymmetric unit(structure 1CHR).
>>> 
>>> 2CHR statistics from the PDB:
>>>       R      R-free  complete  (Refined 8.0 to 3.0 A
>>>       0.189  0.264  81.4      reported in PDB, with 2-Sig cutoff)
>>>                                   Nfree=442  (4.86%)
>>> Further refinement in phenix with same free set, no sigma cutoff:
>>> 10 macrocycles bss, indiv XYZ, indiv ADP refinement; phenix default
>>> Resol 37.12 - 3.00 A 92.95% complete, Nrefl=9087 Nfree=442(4.86%)
>>> Start: r_work = 0.2097 r_free = 0.2503 bonds = 0.008 angles = 1.428
>>> Final: r_work = 0.1787 r_free = 0.2403 bonds = 0.011 angles = 1.284
>>>   (2chr_orig_001.pdb)
>>> 
>>> The number of free reflections is small, so the uncertainty
>>> in Rfree is large (a good case for Rcomplete)
>>> Instead for better statistics, use new 10% free set and repeat 10 times;
>>> 50 macrocycles, with different random seeds:
>>> R: 0.1725±0.0013  Rfree: 0.2507±0.0062 bonds:0.010 Angles:1.192
>>>   Nfree: 908.9±0.32  Nrefl: 9087
>>> 
>>> For artificially low symmetry, expand the I422 structure (making what I
>>> call 3chr for convenience although I'm sure that ID has been taken):
>>> 
>>> pdbset xyzin 2CHR.pdb xyzout 3chr.pdb <<eof
>>> exclude header
>>> spacegroup I4
>>> cell 111.890  111.890  148.490  90.00  90.00  90.00
>>> symgen  X,Y,Z
>>> symgen X,1-Y,1-Z
>>> CHAIN SYMMETRY 2 A B
>>> eof
>>> 
>>> Get the structure factors from 1CHR: 1chr-sf.cif
>>> Run phenix.refine on 3chr.pdb with 1chr-sf.cif.
>>> This file has no free set (deposited 1993) so tell phenix to generate
>>> one. I don't want phenix to protect me from my own stupidity, so I use:
>>>         generate = True
>>>         use_lattice_symmetry = False
>>>         use_dataman_shells = False
>>>   (the .eff file with all non-default parameters is available as
>>> 3chr_rand_001.eff in the .tgz mentioned above)
>>> 
>>> For more significance, use the script multirefine.csh to repeat the
>> refinement 10 times with different random seed. After each run, grep
>> significant results into a log file.
>>> 
>>> 
>>> To check this gives free reflections related to working reflections, I
>>> used mtz2various and a fortran prog (sortfree.f in .tgz) to separate the
>>> data (3chr_rand_data.mtz) into two asymmetric units: h,k,l with h>k
>>> (columns 4-5) and with h<k (col 6-7), listed the pairs, thusly:
>>> 
>>> mtz2various hklin 3chr_rand_data.mtz hklout temp.hkl <<eof
>>>   LABIN FP=F-obs DUM1=R-free-flags
>>>   OUTPUT USER '(3I4,2F10.5)'
>>> eof
>>> sortfree <<eof >sort3.hkl
>>> 
>>> sort3.hkl  looks like:
>>>                 ______h>k______    ______h<k______
>>>   h  k  l      F        free    F*        free*
>>>   1  2  3    208.97      0.00    174.95      0.00
>>>   1  2  5    226.85      0.00    191.65      0.00
>>>   1  2  7    144.85      0.00    164.86      0.00
>>>   1  2  9    251.26      0.00    261.71      0.00
>>>   1  2  11    333.84      0.00    335.18      0.00
>>>   1  2  13    800.37      0.00    791.77      0.00
>>>   1  2  15    412.92      0.00    409.90      0.00
>>>   1  2  17    306.99      0.00    317.53      0.00
>>>   1  2  19    225.54      0.00    220.91      0.00
>>>   1  2  21    101.20      1.00*  104.84      0.00
>>>   1  2  23    156.27      0.00    156.49      0.00
>>>   1  2  25    202.97      0.00    202.23      0.00
>>>   1  2  27    216.10      0.00    219.28      0.00
>>>   1  2  29    106.76      0.00    100.93      0.00
>>>   1  2  31    157.32      0.00    154.37      1.00*
>>>   1  2  33    71.84      0.00    20.78      0.00
>>>   1  2  35    179.05      0.00    165.67      0.00
>>>   1  2  37    254.04      0.00    239.96      1.00*
>>>   1  2  39    69.56      0.00    30.61      0.00
>>>   1  2  41    56.20      0.00    51.02      0.00
>>> 
>>> , and awked for 1 in the free columns. Out of 6922 pairs of reflections,
>>> in one case:
>>> 674 in the first asu (h>k) are in the free set,
>>> 703 in the second asu (h<k) are in the free set
>>> only 11 pairs have the reflections in both asu free.
>>> 
>>> out of 16011 refl in I4,
>>> 6922 pairs (=13844 refl), 1049 invariant (h=k or h=0), 1118 with absent
>> mate.
>>> 
>>> out of 1601 free reflections:
>>> On average 15 free reflections were in pairs, 212 were invariant under
>>> the operator (no sym-mate) and 1374 (86%) were paired with working
>>> reflections.
>>> 
>>> Then do 10 more runs of 50 macrocycles with:
>>>   use_lattice_symmetry = False
>>>   collecting the same statistics
>>> (also scripted in multirefine.csh)
>>> 
>>> Finally, use ref2chr.eff to refine (as previously mentioned) a monomer in
>> I422 (2chr.pdb) 10 times with 10% free, 50 macrocycles
>>> (also scripted in multirefine.csh)
>> 
