[ccp4bb] Crystallographic group leader position

2015-06-04 Thread Yvonne TAN Yih Wan (ETC)
Hi Crystallographers,

The Experimental Therapeutics Centre (Agency for Science, Technology and 
Research, Singapore) is hiring a senior Research Fellow to lead a small team of 
structural biologists. The successful candidate will lead, from the bench, all 
aspects of X-ray crystallography, from protein expression to co-crystal trials 
to data collection and visualization.
Kindly refer to 
https://astar.aqayo.com/site-YXN0YXJ8MjA/member_offerdetail.jsp?siteid=YXN0YXJ8MjArequisitionuid=d6dc8ac9dfef0d05ccb2c08d8d1029c9781fec8e
 for more details.

Applicants should send their CV to Jeffrey Hill (jh...@etc.a-star.edu.sg) or apply 
directly through the website.

This position is suitable for crystallographers with postdoc experience who 
would like to step up.

Best regards,
Yvonne

Re: [ccp4bb] PyMOL v. Coot map 'level'

2015-06-04 Thread Emilia C. Arturo (Emily)
Thomas,


 I tried to figure out the PyMOL vs. Coot normalization discrepancy a while
 ago. As far as I remember, PyMOL normalizes on the raw data array, while
 Coot normalizes across the unit cell. So if the data doesn't exactly cover
 the cell, the results might be different.


I posted the same question to the Coot mailing list (the thread can be
found here: https://goo.gl/YjVtTu), and got the following reply from Paul
Emsley; I highlight the questions that I think you could best answer with
'***':

[ ...]
I suspect that the issue comes down to different answers to the question:
the rmsd of what?

In Coot, we use all the grid points in the asymmetric unit - other programs
make a selection of grid points around the protein (and therefore have less
solvent).

More solvent means lower rmsd. If one then contours at n-rmsd levels, the
absolute level used in Coot will be lower - and the map will thus seem
noisier (perhaps).  I suppose that if you want comparable levels from the
same map/mtz file then you should use absolute levels, not rmsd. ***What
does PyMOL's 1.0 mean in electrons/A^3?***

Regards,

Paul.

Regards,
Emily.
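
To put numbers on Paul's point: a toy sketch in Python (the grid size and
solvent fraction here are invented, purely for illustration) showing that an
rmsd taken over a whole cell that is mostly flat solvent comes out lower than
an rmsd taken over a region selected around the protein:

    import numpy as np

    # Toy "map": 70% flat solvent-like noise, 30% stronger protein-like density.
    rng = np.random.default_rng(0)
    grid = rng.normal(0.0, 0.1, size=100_000)           # solvent: low variation
    grid[:30_000] = rng.normal(0.0, 1.0, size=30_000)   # region around the protein

    sigma_cell = grid.std()              # Coot-style: every grid point in the ASU
    sigma_protein = grid[:30_000].std()  # selection around the protein only

    print(f"rmsd over the whole cell:  {sigma_cell:.3f}")
    print(f"rmsd around protein only:  {sigma_protein:.3f}")
    # Contouring at "1 sigma" therefore means a lower absolute level in the
    # program that normalizes over the whole cell, so its map looks noisier.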


 On 01 Jun 2015, at 11:37, Emilia C. Arturo (Emily) ec...@drexel.edu
 wrote:
 One cannot understand what is going on without knowing how this map
  was calculated.  Maps calculated by the Electron Density Server have
  density in units of electron/A^3 if I recall, or at least its best
  effort to do so.
 
  This is what I was looking for! (i.e. what the units are) Thanks. :-)
  Yes, I'd downloaded the 2mFo-DFc map from the EDS, and got the same Coot
 v. PyMOL discrepancy whether or not I turned off the PyMOL map
 normalization feature.
 
 If you load the same map into Pymol and ask it to normalize the
  density values you should set your contour level to Coot's rmsd level.
   If you don't normalize you should use Coot's e/A^3 level.  It is
  quite possible that they could differ by a factor of two.
 
 This was exactly the case. The map e/A^3 level (not the rmsd level) in
Coot matched, visually, the map 'level' in PyMOL very well; the rmsd levels
were off by roughly a factor of 2.
 
 I did end up also generating a 2mFo-DFc map using phenix, which fetched
the structure factors of the model in which I was interested. The result
was the same (i.e. PyMOL 'level' = Coot e/A^3 level ~= 1/2 Coot's rmsd
level) whether I used the CCP4 map downloaded from the EDS or the map
generated from the structure factors with phenix.
 
  Thanks All.
 
  Emily.
 
 
 
  Dale Tronrud
 
  On 5/29/2015 1:15 PM, Emilia C. Arturo (Emily) wrote:
   Hello. I am struggling with an old question--old because I've found
   several discussions and wiki bits on this topic, e.g. on the PyMOL
   mailing list
   (http://sourceforge.net/p/pymol/mailman/message/26496806/ and
   http://www.pymolwiki.org/index.php/Display_CCP4_Maps), but the
   suggestions about how to fix the problem are not working for me,
   and I cannot figure out why. Perhaps someone here can help:
  
   I'd like to display (for beauty's sake) a selection of a model with
   the map about this selection. I've fetched the model from the PDB,
   downloaded its 2mFo-DFc CCP4 map, loaded both the map and model
   into both PyMOL (student version) and Coot (0.8.2-pre EL (revision
   5592)), and decided that I would use PyMOL to make the figure. I
   notice, though, that the map 'level' in PyMOL is not equivalent to
   the rmsd level in Coot, even when I set normalization off in PyMOL.
   I expected that a 1.0 rmsd level in Coot would look identical to a
   1.0 level in PyMOL, but it does not; rather, a 1.0 rmsd level in
   Coot looks more like a 0.5 level in PyMOL. Does anyone have insight
   they could share about the difference between how Coot and PyMOL
   load maps? Maybe the PyMOL 'level' is not an rmsd? Is there some
   other normalization factor in PyMOL that I should set? Or, perhaps
   there is a mailing list post out there that I've missed, to which
   you could point me. :-)
  
   Alternatively, does anyone have instructions on how to use Coot to
   do what I'm trying to do in PyMOL? In PyMOL I displayed the mesh of
   the 2Fo-Fc map, contoured at 1.0 about a 3-residue-long
   'selection' like so: isomesh map, My_2Fo-Fc.map, 1.0, selection,
   carve=2.0, and after hiding everything but the selection, I have a
   nice picture ... but with a map at a level I cannot interpret in
   PyMOL relative to Coot :-/
  
   Regards, Emily.
 

 --
 Thomas Holder
 PyMOL Principal Developer
 Schrödinger, Inc.




Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Frank von Delft
I'm afraid Gerard and Ian between them have left me a bit confused with 
conflicting statements:



On 04/06/2015 15:29, Gerard Bricogne wrote:

snip
In order to guard the detection of putative bound fragments against the evils 
of model bias, it is very important to ensure that the refinement of each 
complex against data collected on it does not treat as free any reflections 
that were part of the working set in the refinement of the apo structure.
snip


On 04/06/2015 17:34, Ian Tickle wrote:

snip
So I suspect that most of our efforts in maintaining common free R 
flags are for nothing; however it saves arguments with referees when 
it comes to publication!

snip



I also remember conversations and even BB threads that made me conclude 
that it did NOT matter to have the same Rfree set for independent 
datasets (e.g. different crystals).  I confess I don't remember the 
arguments, only the relief at not having to bother with all the 
bookkeeping faff Gerard outlines and Ian describes.


So:  could someone explain in detail why this matters (or why not), and 
is there a URL to the evidence (paper or anything else) in either 
direction?


(As far as I remember, the argument went that identical free sets were 
unnecessary even for exactly isomorphous crystals.  Something like 
this: model bias is not a big deal when the model has largely 
converged, and that's what you have for molecular substitution (as Jim 
Pflugrath calls it).  In addition, even a weakly binding fragment 
compound produces intensity perturbations large enough to make model 
bias irrelevant.)


phx


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Pavel Afonine
  It seems to me that the "how many is too many" aspect of this
 question, and the various culinary procedures that have been proposed
 as answers, may have obscured another, much more fundamental issue,
 namely: is it really the business of the data processing package to
 assign FreeR flags?

  I would argue that it isn't. (...)



Excellent point! I can't agree more.

Pavel


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Edward A. Berry

In other words, the free set for each complex must be
such that reflections that are also present in the apo dataset retain
the FreeR flag they had in that dataset.


A very easy way to achieve this: generate a complete dataset to ridiculously
high resolution with the cell of your crystal, and assign free-R flags.
(If the first structure has already been solved, merge its free set and
extend to the new reflections.)
Now for every new structure solved, discard any free set that the data
reduction program may have generated, merge with the complete set, and
discard reflections with no Fobs (MNF) or with SigF=0.

In fact, if we consider that a dataset is just a 3-dimensional array, or some
subset of it enclosing the reciprocal-space asymmetric unit, I don't
see any reason we couldn't assign one universal P1 free-R set and
use it for every structure in whatever space group. By taking each
new dataset, merging it with the universal free-R set, and discarding those
reflections not present in the new data, you would obtain a random
set for your structure. There could be nested (concentric?) free-R sets
with 10%, 5%, 2%, 1% free, so that if you start out excluding 5% for a
low-resolution structure, then get a high-resolution dataset and want to
exclude 2%, you could be sure that all the 2% free reflections were also
free in your previous 5% set.

Thin or thick shells could be predefined. There may be problems when
it is desired to exclude reflections according to some twin law or NCS.
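
One way to realize such a universal, nested free-R set (a sketch, not from the
thread: the hashing scheme and the helper name free_fraction are mine) is to
derive a deterministic number in [0,1) from each hkl, so that every dataset
containing a given reflection assigns it the same value and the nested sets
come for free:

    import hashlib

    def free_fraction(hkl):
        """Deterministic value in [0,1) for a reflection index.

        The same hkl always hashes to the same value, so a 'universal'
        free set falls out: a reflection is free at the p% level when
        free_fraction(hkl) < p/100, and the sets are nested by
        construction (the 2% set is always inside the 5% set).  For
        non-P1 data, reduce each hkl to a unique symmetry representative
        before hashing.
        """
        digest = hashlib.sha256(repr(tuple(hkl)).encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64

    # Example: a block of reciprocal space standing in for a real dataset.
    hkls = [(h, k, l) for h in range(20) for k in range(20) for l in range(20)]
    for pct in (1, 2, 5, 10):
        n_free = sum(free_fraction(hkl) < pct / 100 for hkl in hkls)
        print(f"{pct:2d}% set: {n_free} of {len(hkls)} reflections free")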

(just now read Nick Keep's post which expresses some similar ideas)
eab

On 06/04/2015 10:29 AM, Gerard Bricogne wrote:

Dear Graeme and other contributors to this thread,

  It seems to me that the "how many is too many" aspect of this
question, and the various culinary procedures that have been proposed
as answers, may have obscured another, much more fundamental issue,
namely: is it really the business of the data processing package to
assign FreeR flags?

  I would argue that it isn't. From the statistical viewpoint that
justifies the need for FreeR flags, these are pre-refinement entities
rather than post-processing ones. If one considers a single instance
of going from a dataset to a refined structure, then this distinction
may seem artificial. Consider, instead, the case of high-throughput
screening to detect fragment binding on a large number of crystals of
complexes between a given target protein (the apo) and a multitude
of small, weakly-binding fragments into solutions of which crystals of
the apo have been soaked.

  The model for the apo crystal structure comes from a refinement
against a dataset, using a certain set of FreeR flags. In order to
guard the detection of putative bound fragments against the evils of
model bias, it is very important to ensure that the refinement of each
complex against data collected on it does not treat as free any
reflections that were part of the working set in the refinement of the
apo structure. In other words, the free set for each complex must be
such that reflections that are also present in the apo dataset retain
the FreeR flag they had in that dataset. Any mixup, in the FreeR flags
for a complex, of the work vs. free status of the reflections also in
the apo would push Rwork up and Rfree down, invalidating their role as
indicators of quality of fit or of incipient overfitting.

  Great care must therefore be exercised, in the form of adequate
book-keeping and procedures for generating the FreeR flags in the mtz
file for each complex from that for the apo, to properly enforce this
inheritance of work vs. free status.

  In such a context there is a clear and crucial difference between
a post-processing entity and a pre-refinement one. FreeR flags belong
to the latter category. In fact, the creation of FreeR flags at the
end of the processing step can create a false perception, among people
doing ligand screening under pressure, that they cannot re-use the
FreeR flag information of the apo in refining their complexes, simply
because a new set has been created for each of them. This is clearly
to be avoided. Preserving the FreeR flags of the reflections that were
used in the refinement of the apo structure is one of the explicit
recommendations in the 2013 paper by Pozharski et al. (Acta
Cryst. D69, 150-167) - see section 1.1.3, p. 152.

  Best practice in this area may therefore not be only a question
of numbers, but also of doing the appropriate thing in the appropriate
place. There are of course corner cases where e.g. substantial
unit-cell changes start to introduce some cross-talk between working
and free reflections, but the possibility of such complications is no
argument to justify giving up on doing the right thing when the right
thing can be done.


  With best wishes,

   Gerard.

--
On Thu, Jun 04, 2015 at 08:30:57AM +, Graeme Winter wrote:

Hi Folks,

Many thanks for all of your comments - in keeping with the spirit of the BB
I have digested 

Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread James Holton
Many good points have been made on this thread so far, but mostly 
addressing the question how many free reflections is enough, whereas 
the original question was how many is too many.


I suppose a reasonable definition of "too many" is when the error 
introduced into the map by leaving out all those reflections starts to 
become a problem.  It is easy to calculate this error: it is simply the 
difference between the map made using all reflections (regardless of 
Free-R flag) and the map made with 5% of the reflections left out.  Of 
course, this "difference map" is identical to a map calculated using 
only the 5% free reflections, setting all others to zero.  The RMS 
variation of this error map is actually independent of the phases used 
(Parseval's theorem), and it ends up being:


RMSerror = RMSall * sqrt( free_frac )

where:
  RMSerror  is the RMS variation of the error map
  RMSall    is the RMS variation of the map calculated with all reflections
  free_frac is the fraction of hkls left out of the calculation
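
Plugging the two free fractions discussed below into this formula (a quick
arithmetic check in Python, nothing more):

    import math

    # RMS error of the map when a fraction of the hkls is omitted,
    # per Parseval: RMSerror = RMSall * sqrt(free_frac).
    for free_frac in (0.05, 0.01):
        rms_err = math.sqrt(free_frac)   # in units of RMSall (map "sigmas")
        print(f"{free_frac:.0%} free -> RMS map error = {rms_err:.3f} sigma")
    # 5% free -> 0.224 sigma (the ~22.3% quoted below); 1% free -> 0.100 sigma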

So, with 5% free reflections, the errors induced in the electron density 
will have an RMS variation that is 22.3% of the full map's RMS 
variation, or 0.223 sigma units.  1% free reflections will result in 
RMS 10% error, or 0.1 sigmas.  This means, for example, that with 5% 
free reflections a 1.0-sigma peak might come up as a 1.2- or 0.8-sigma 
feature.  Note that these are not the sigmas of the Fo-Fc map 
(which change as you build) but rather the sigma of the Fo map.  Most 
of us don't look at Fo maps, but rather 2Fo-Fc or 2mFo-DFc maps, with or 
without the missing reflections filled in.  These are a bit different 
from a straight Fo map.  The absolute electron number density (e-/A^3) 
of the 1-sigma contour is about the same for all these maps, but no 
doubt the fill-in, the extra Fo-Fc term, and the likelihood weights 
reduce the overall RMS error.  By how much?  That is a good question.


Still, we can take this RMS 0.223-sigma variation from 5% free 
reflections as a worst-case scenario, and then ask the question: is this 
a problem?  Well, any source of error can be a problem, but when you 
are trying to find the best compromise between two 
difficult-to-reconcile considerations (such as the stability of Rfree 
and the interpretability of the map), it is usually helpful to bring in 
a third consideration, such as: how much noise is in the map already due 
to other sources?  My colleagues and I measured this recently (doi: 
10.1073/pnas.1302823110), and found that the 1-sigma contour ranges from 
0.8 to 1.2 e-/A^3 (relative to vacuum), experimental measurement errors 
are RMS ~0.04 e-/A^3, and map errors from the model-data difference are 
about RMS 0.13 e-/A^3.  So, 22.3% of sigma is around RMS 0.22 e-/A^3.  
This is a bit larger than our biggest empirically measured error (the 
modelling error), indicating that 5% free flags may indeed be too many.


However, 22.3% is the worst-case error, in the absence of all the 
corrections used to make 2mFo-DFc maps, so in reality the modelling 
error and the omitted-reflection errors are probably comparable, 
indicating that 5% is about the right amount.  Any more and the error 
from omitted reflections starts to dominate the total error.  On the 
other hand, the modelling error is (by definition) the Fo-Fc difference, 
so as Rwork/Rfree get smaller the RMS map variation due to modelling 
errors gets smaller as well, eventually exposing the omitted-reflection 
error.  So, once your Rwork/Rfree get below ~22%, the errors in the map 
start to be dominated by the missing Fs of the 5% free set.


However, early in the refinement, when your R factors are in the 30%s, 
40%s, or even 50%s, I don't think the errors due to missing 5% of the 
reflections are going to be important.  Then again, late in refinement, 
it might be a good idea to start including some or all of the free 
reflections back into the working set in order to reduce the overall map 
error (cue lamentations from validation experts such as Jane Richardson).


This is perhaps the most important topic on this thread.  There are so 
many ways to contaminate, bias, or otherwise compromise the free set, 
and once that is done we don't have generally accepted procedures for 
re-sanctifying the free reflections, other than starting over again from 
scratch.  This is especially problematic if your starting structure for 
molecular replacement was refined against all reflections, and your 
ligand soak is nice and isomorphous to those original crystals.  How do 
you remove the evil bias from this model?  You can try shaking it, but 
that only really removes bias at high spatial frequencies and is not so 
effective at low resolution.
So, if bias is so easy to generate, why not use it to our advantage?  
Instead of leaving the free-flagged reflections out of the refinement, 
put them in, but give them random F values.  Then do everything you can 
to bias your model toward these random values.  Loosen the 

Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Ian Tickle
Nick

What you describe is (almost) exactly the way we have always done it at
Astex, and I'm surprised to hear that others are not routinely doing the
same.  The difference is that we don't generate a free R flag MTZ file to
ultra-high resolution as you suggest, since there's never any need to.
What we do is generate by default a 1.5 Ang. free R flag file using UNIQUE,
FREERFLAG and MTZUTILS whenever a new apo structure for a given
target/crystal form is solved, and keep that with the initial apo data as a
reference dataset for auto-re-indexing (so that all the protein-ligand
datasets are indexed the same way).  When a dataset is combined with the
higher-resolution free R flag file we would of course cut the resolution to
that of the data (still keeping the original free R flag file), mainly in
order to save space in the database.

Obviously if the initial apo data were higher resolution than 1.5 Ang., the
processing script would generate an initial free R flag file to a
correspondingly higher resolution (say 1 Ang.).  If a ligand dataset comes
along later at higher resolution than 1.5 Ang., the script would do the same
thing, but then it would use the MTZUTILS UNIQ option to merge the old free
R flags up to 1.5 Ang. with the new ones between 1.5 and 1 Ang.  Then it
would combine the data file with the free R flag file as before and cut the
resolution of the combined data file to the actual resolution of the data.
The script would then replace the old free R flag file with the new one and
use the latter for all subsequent datasets from that target/crystal form.
The users are completely unaware that any of this is happening (unless they
want to dig into the scripts!).
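
In outline, the flag-extension step looks like this (a minimal sketch in
Python, not Astex's actual code: real MTZ handling via the programs named
above is replaced by plain dicts, and the helper name extend_free_flags is
mine):

    import random

    def extend_free_flags(old_flags, new_hkls, free_frac=0.05, seed=0):
        """Merge an existing {(h,k,l): flag} table with newly observed hkls.

        Every reflection that already has a flag keeps it; only hkls that
        are genuinely new (e.g. between 1.5 and 1.0 Ang. when a
        higher-resolution dataset arrives) get fresh flags.  Flag 1 marks
        a free reflection, 0 a working one.
        """
        rng = random.Random(seed)
        merged = dict(old_flags)
        for hkl in new_hkls:
            if hkl not in merged:
                merged[hkl] = 1 if rng.random() < free_frac else 0
        return merged

    # Usage: old 1.5 Ang. reference flags plus hkls from a 1.0 Ang. dataset.
    old = {(1, 2, 3): 0, (2, 0, 0): 1}
    print(extend_free_flags(old, [(1, 2, 3), (10, 11, 12), (9, 9, 9)]))
    # (1,2,3) keeps its old flag; only the genuinely new hkls are assigned.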

We enforce use of 'approved' scripts for all the processing and refinement
essentially by using an Oracle database with web-based access
authentication which means that if you don't use the approved scripts to
process your data then you can't upload your data to the database, which
then means that no-one else will get to see and/or use your results!  Our
scripts make full use of CCP4 and Global Phasing programs (autoPROC,
autoBUSTER, GRADE etc): however using CCP4i or other programs from the
command line to process the data and only uploading the final results to
the database is severely deprecated (and totally unsupported!), mainly
because there will then be no permanent traceback in the database of the
user's actions for others to see.

On Gerard's final point about the effect of non-isomorphism: we find that
isomorphism is the exception rather than the rule, i.e. the majority of our
datasets would fail the Crick-Magdoff test for isomorphism (no more
than a 0.5% change in any cell length for 3 Ang. data, and a correspondingly
lower threshold at more typical resolution limits of 2 - 1.5 Ang.).  This
is obviously very target- and crystal-form-dependent; some targets/crystal
forms give more isomorphous crystals than others.  So I suspect that most
of our efforts in maintaining common free R flags are for nothing; however
it saves arguments with referees when it comes to publication!
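
Ian's version of the Crick-Magdoff criterion is easy to script (a sketch under
the assumption, implied by the parenthesis above, that the 0.5% threshold at
3 Ang. scales linearly with the resolution limit; the function name and the
example cell are mine):

    def crick_magdoff_isomorphous(cell_ref, cell_new, d_min):
        """Rough isomorphism check on the three cell lengths (a, b, c).

        Allows at most a 0.5% change at 3 Ang. resolution, scaled down
        proportionally for higher-resolution data (e.g. 0.25% at 1.5 Ang.).
        """
        threshold = 0.005 * (d_min / 3.0)
        return all(abs(new - ref) / ref <= threshold
                   for ref, new in zip(cell_ref, cell_new))

    # A 0.3% change in a: acceptable for 3 Ang. data, a failure at 1.5 Ang.
    print(crick_magdoff_isomorphous((78.0, 78.0, 37.0), (78.23, 78.0, 37.0), 3.0))
    print(crick_magdoff_isomorphous((78.0, 78.0, 37.0), (78.23, 78.0, 37.0), 1.5))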

Cheers

-- Ian


On 4 June 2015 at 16:00, Nicholas Keep n.k...@mail.cryst.bbk.ac.uk wrote:

 I agree with Gerard.  It would be much better in many ways to generate a
 separate file of Free R flags for each crystal form of a project, to some
 high resolution that is unlikely ever to be exceeded (e.g. 0.4 A), as a
 separate input file to refinement rather than in the mtz.


 The generation of this free set could ask some questions, like: is the data
 twinned? do you want to extend the free set from a higher-symmetry free
 set, e.g. C2 rather than C2221 (the symmetry is close to the higher symmetry
 but not perfect; this seems to happen not infrequently)?

 Could some judicious selection of sets of potentially related hkls
 work as a universal free set? (Not thought this through fully.)

 This would get around practical issues like the one I had yesterday when
 refining in another well-known package, where coot drew the map as if it
 were 0.5 A data even though there were only observed data to 2.1 A, the rest
 just being a hopelessly overoptimistic guess of the best dataset we might
 ever collect.

 I agree you CAN do this with current software - it is just not the path of
 least resistance, so you have to double-check that your group is doing this.

 Best wishes
 Nick





 --
 Prof Nicholas H. Keep
 Executive Dean of School of Science
 Professor of Biomolecular Science
 Crystallography, Institute for Structural and Molecular Biology,
 Department of Biological Sciences
 Birkbeck,  University of London,
 Malet Street,
 Bloomsbury
 LONDON
 WC1E 7HX

 email n.k...@mail.cryst.bbk.ac.uk
 Telephone 020-7631-6852  (Room G54a Office)
   020-7631-6800  (Department Office)
 Fax   020-7631-6803
 If you want to access me in person you have to come to the crystallography
 entrance
 and ring me or the department office from the internal phone by the door



[ccp4bb] why are the bond lengths all different between D and L amino acids?

2015-06-04 Thread Kenneth Satyshur
We have been trying to deposit our peptide structures with D amino acids in 
them. They are 15-mers, all D, with an L racemate, refined in Refmac 5.8.0107.
When we run validation, all the D amino acids have bond-length outliers, while 
none of the L do. Example from the validation server:

A bond length (or angle) with |Z| > 2 is considered an outlier worth 
inspection.

Mol  Type  Chain  Res  Link   Bond lengths                 Bond angles
                              Counts   RMSZ  #|Z|>2        Counts   RMSZ  #|Z|>2
1    DLY   A      10   -      8,?,?    0.47  0             8,?,?    1.74  2 (25%)
1    DLY   A      11   -      4,?,?    1.38  1 (25%)       4,?,?    1.51  1 (25%)
1    DPN   A      12   -      11,?,?   1.14  1 (9%)        13,?,?   1.29  2 (15%)
1    DAL   A      13   -      4,?,?    0.83  0             4,?,?    2.11  2 (50%)
1    DLY   A      14   -      4,?,?    1.19  1 (25%)       4,?,?    2.43  1 (25%)
1    DAL   A      15   -      4,?,?    1.55  1 (25%)       4,?,?    2.27  1 (25%)
1    DPN   A      16   -      11,?,?   1.97  1 (9%)        13,?,?   1.10  1 (7%)
1    DVA   A      17   -      6,?,?    1.03  1 (16%)       7,?,?    1.03  1 (14%)
etc.

None of the L's have this.
I also downloaded the lib file for D and L Leucine and the bond lengths are 
different:
Leu
_chem_comp_bond.value_dist
_chem_comp_bond.value_dist_esd
 LEU  N    CA    single  1.491  0.021
 LEU  CA   HA    single  0.980  0.020
 LEU  CA   CB    single  1.530  0.020
 LEU  CB   HB3   single  0.970  0.020
 LEU  CB   HB2   single  0.970  0.020
 LEU  CB   CG    single  1.530  0.020
 LEU  CG   HG    single  0.970  0.020
 LEU  CG   CD1   single  1.521  0.020
 LEU  CD1  HD11  single  0.960  0.020
 LEU  CD1  HD12  single  0.960  0.020
 LEU  CD1  HD13  single  0.960  0.020
 LEU  CG   CD2   single  1.521  0.020
 LEU  CD2  HD21  single  0.960  0.020
 LEU  CD2  HD22  single  0.960  0.020
 LEU  CD2  HD23  single  0.960  0.020
 LEU  CA   C     single  1.525  0.021
 LEU  C    O     deloc   1.231  0.020
 LEU  N    H1    single  0.960  0.020
 LEU  N    H2    single  0.960  0.020
 LEU  N    H3    single  0.960  0.020
 LEU  C    OXT   deloc   1.231  0.020

and for DLE
_chem_comp_bond.value_dist
_chem_comp_bond.value_dist_esd
 DLe  N     CA    single  1.455  0.020
 DLe  CB    CA    single  1.524  0.020
 DLe  CA    C     single  1.500  0.020
 DLe  CG    CB    single  1.524  0.020
 DLe  CD1   CG    single  1.524  0.020
 DLe  CD2   CG    single  1.524  0.020
 DLe  O     C     deloc   1.250  0.020
 DLe  C     OXT   deloc   1.250  0.020
 DLe  HN    N     single  0.954  0.020
 DLe  HA    CA    single  1.099  0.020
 DLe  HB1   CB    single  1.092  0.020
 DLe  HB2   CB    single  1.092  0.020
 DLe  HG    CG    single  1.099  0.020
 DLe  HD21  CD2   single  1.059  0.020
 DLe  HD22  CD2   single  1.059  0.020
 DLe  HD23  CD2   single  1.059  0.020
 DLe  HD11  CD1   single  1.059  0.020
 DLe  HD12  CD1   single  1.059  0.020
 DLe  HD13  CD1   single  1.059  0.020
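
Putting the heavy-atom bond-length targets from the two dictionaries side by
side makes the discrepancy plain (a quick sketch in Python using the values
transcribed above, with each bond keyed in a common atom order):

    # Heavy-atom bond-length targets from the LEU and DLE library entries above.
    leu = {("N", "CA"): 1.491, ("CA", "CB"): 1.530, ("CB", "CG"): 1.530,
           ("CG", "CD1"): 1.521, ("CG", "CD2"): 1.521, ("CA", "C"): 1.525,
           ("C", "O"): 1.231, ("C", "OXT"): 1.231}
    dle = {("N", "CA"): 1.455, ("CA", "CB"): 1.524, ("CB", "CG"): 1.524,
           ("CG", "CD1"): 1.524, ("CG", "CD2"): 1.524, ("CA", "C"): 1.500,
           ("C", "O"): 1.250, ("C", "OXT"): 1.250}

    for bond in leu:
        if abs(leu[bond] - dle[bond]) > 0.0005:
            print(f"{bond[0]}-{bond[1]}:  LEU {leu[bond]:.3f}  DLE {dle[bond]:.3f}")
    # Every heavy-atom target differs, although D- and L-leucine are mirror
    # images and should share identical bond-length restraints.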

Has anyone seen this? And how do I refine if the restraints are different for 
the enantiomorphs? I doubt that ALL the bond lengths
should be different.
Thanks for your help.




Kenneth A. Satyshur, M.S., Ph.D.

Senior Scientist

University of Wisconsin-Madison

Madison, Wisconsin, 53706

608-215-5207


Re: [ccp4bb] Off-topic: Request for DNA

2015-06-04 Thread Kimberly Stanek
There is a group at Regensburg University in Germany that has worked with
this organism. That is where we got our DNA from.

Kim

On Tue, May 26, 2015 at 2:54 PM, Mohamed Noor mohamed.n...@staffmail.ul.ie
wrote:

 Dear all

 I am looking for a small amount of Aquifex aeolicus DNA or cell pellet for
 PCR. Unfortunately neither ATCC nor DSMZ holds this bacterium.

 Thanks.




-- 
Kimberly Stanek
Graduate Student
Mura Lab
Department of Chemistry
University of Virginia
(434) 924-7979


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Gerard Bricogne
Dear Graeme and other contributors to this thread,

 It seems to me that the "how many is too many" aspect of this
question, and the various culinary procedures that have been proposed
as answers, may have obscured another, much more fundamental issue,
namely: is it really the business of the data processing package to
assign FreeR flags?

 I would argue that it isn't. From the statistical viewpoint that
justifies the need for FreeR flags, these are pre-refinement entities
rather than post-processing ones. If one considers a single instance
of going from a dataset to a refined structure, then this distinction
may seem artificial. Consider, instead, the case of high-throughput
screening to detect fragment binding on a large number of crystals of
complexes between a given target protein (the apo) and a multitude
of small, weakly-binding fragments into solutions of which crystals of
the apo have been soaked.

 The model for the apo crystal structure comes from a refinement
against a dataset, using a certain set of FreeR flags. In order to
guard the detection of putative bound fragments against the evils of
model bias, it is very important to ensure that the refinement of each
complex against data collected on it does not treat as free any
reflections that were part of the working set in the refinement of the
apo structure. In other words, the free set for each complex must be
such that reflections that are also present in the apo dataset retain
the FreeR flag they had in that dataset. Any mixup, in the FreeR flags
for a complex, of the work vs. free status of the reflections also in
the apo would push Rwork up and Rfree down, invalidating their role as
indicators of quality of fit or of incipient overfitting.

 Great care must therefore be exercised, in the form of adequate
book-keeping and procedures for generating the FreeR flags in the mtz
file for each complex from that for the apo, to properly enforce this 
inheritance of work vs. free status.

 In such a context there is a clear and crucial difference between
a post-processing entity and a pre-refinement one. FreeR flags belong
to the latter category. In fact, the creation of FreeR flags at the
end of the processing step can create a false perception, among people
doing ligand screening under pressure, that they cannot re-use the
FreeR flag information of the apo in refining their complexes, simply
because a new set has been created for each of them. This is clearly
to be avoided. Preserving the FreeR flags of the reflections that were
used in the refinement of the apo structure is one of the explicit
recommendations in the 2013 paper by Pozharski et al. (Acta
Cryst. D69, 150-167) - see section 1.1.3, p. 152.

 Best practice in this area may therefore not be only a question
of numbers, but also of doing the appropriate thing in the appropriate
place. There are of course corner cases where e.g. substantial
unit-cell changes start to introduce some cross-talk between working
and free reflections, but the possibility of such complications is no
argument to justify giving up on doing the right thing when the right
thing can be done.


 With best wishes,
  
  Gerard.

--
On Thu, Jun 04, 2015 at 08:30:57AM +, Graeme Winter wrote:
 Hi Folks,
 
 Many thanks for all of your comments - in keeping with the spirit of the BB
 I have digested the responses below. Interestingly I suspect that the
 responses to this question indicate the very wide range of resolution
 limits of the data people work with!
 
 Best wishes Graeme
 
 ===
 
 Proposal 1:
 
 10% reflections, max 2000
 
 Proposal 2: from wiki:
 
 http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set
 
 including Randy Read recipe:
 
 So here's the recipe I would use, for what it's worth:
  < 10k reflections:   set aside 10%
  10-20k reflections:  set aside 1000 reflections
  20-40k reflections:  set aside 5%
  > 40k reflections:   set aside 2000 reflections
 
 Proposal 3:
 
 5% maximum 2-5k
 
 Proposal 4:
 
 3% minimum 1000
 
 Proposal 5:
 
 5-10% of reflections, minimum 1000
 
 Proposal 6:
 
  > 50 reflections per bin in order to get reliable ML parameter
  estimation, ideally around 150 / bin.
 
 Proposal 7:
 
  If lots of reflections (i.e. 800K unique) around 1% selected - 5% would be
  40k, i.e. rather a lot. Referees question the use of > 5k reflections as a
  test set.
 
 Comment 1 in response to this:
 
 Surely absolute # of test reflections is not relevant, percentage is.
 
 
 
 Approximate consensus (i.e. what I will look at doing in xia2) - probably
 follow Randy Read recipe from ccp4wiki as this seems to (probably) satisfy
 most of the criteria raised by everyone else.
 
 
 
 On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter graeme.win...@gmail.com
 wrote:
 
  Hi Folks
 
  Had a vague comment handed my way that xia2 assigns too many free
  reflections - I have a 

Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Nicholas Keep
I agree with Gerard.  It would be much better in many ways to generate a 
separate file of Free R flags for each crystal form of a project, to some 
high resolution that is unlikely ever to be exceeded (e.g. 0.4 A), as a 
separate input file to refinement rather than in the mtz.



The generation of this free set could ask some questions, like: is the 
data twinned? do you want to extend the free set from a higher-symmetry 
free set, e.g. C2 rather than C2221 (the symmetry is close to the higher 
symmetry but not perfect; this seems to happen not infrequently)?


Could some judicious selection of sets of potentially related 
hkls work as a universal free set? (Not thought this through fully.)


This would get around practical issues like the one I had yesterday when 
refining in another well-known package, where coot drew the map as if it 
were 0.5 A data even though there were only observed data to 2.1 A, the 
rest just being a hopelessly overoptimistic guess of the best dataset we 
might ever collect.


I agree you CAN do this with current software - it is just not the path 
of least resistance, so you have to double-check that your group is doing this.


Best wishes
Nick





--
Prof Nicholas H. Keep
Executive Dean of School of Science
Professor of Biomolecular Science
Crystallography, Institute for Structural and Molecular Biology,
Department of Biological Sciences
Birkbeck,  University of London,
Malet Street,
Bloomsbury
LONDON
WC1E 7HX

email n.k...@mail.cryst.bbk.ac.uk
Telephone 020-7631-6852  (Room G54a Office)
  020-7631-6800  (Department Office)
Fax   020-7631-6803
If you want to access me in person you have to come to the crystallography 
entrance
and ring me or the department office from the internal phone by the door


Re: [ccp4bb] How many is too many free reflections?

2015-06-04 Thread Graeme Winter
Hi Folks,

Many thanks for all of your comments - in keeping with the spirit of the BB
I have digested the responses below. Interestingly I suspect that the
responses to this question indicate the very wide range of resolution
limits of the data people work with!

Best wishes Graeme

===

Proposal 1:

10% reflections, max 2000

Proposal 2: from wiki:

http://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/Test_set

including Randy Read recipe:

So here's the recipe I would use, for what it's worth (see the sketch after
this digest):
  < 10k reflections:   set aside 10%
  10-20k reflections:  set aside 1000 reflections
  20-40k reflections:  set aside 5%
  > 40k reflections:   set aside 2000 reflections

Proposal 3:

5% maximum 2-5k

Proposal 4:

3% minimum 1000

Proposal 5:

5-10% of reflections, minimum 1000

Proposal 6:

> 50 reflections per bin in order to get reliable ML parameter
estimation, ideally around 150 / bin.

Proposal 7:

If lots of reflections (i.e. 800K unique) around 1% selected - 5% would be
40k, i.e. rather a lot. Referees question the use of > 5k reflections as a test
set.

Comment 1 in response to this:

Surely absolute # of test reflections is not relevant, percentage is.



Approximate consensus (i.e. what I will look at doing in xia2) - probably
follow Randy Read recipe from ccp4wiki as this seems to (probably) satisfy
most of the criteria raised by everyone else.
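
The Randy Read recipe is simple to state in code (a sketch of the recipe as
digested above, assuming the 10k/20k/40k break points given there; the
function name is mine):

    def n_free_reflections(n_unique):
        """Size of the free set under the Randy Read recipe."""
        if n_unique < 10_000:
            return round(0.10 * n_unique)   # < 10k: set aside 10%
        if n_unique < 20_000:
            return 1000                     # 10-20k: a flat 1000
        if n_unique < 40_000:
            return round(0.05 * n_unique)   # 20-40k: set aside 5%
        return 2000                         # > 40k: a flat 2000

    for n in (5_000, 15_000, 30_000, 800_000):
        print(f"{n:>7} unique -> {n_free_reflections(n)} free")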



On Tue, Jun 2, 2015 at 11:26 AM Graeme Winter graeme.win...@gmail.com
wrote:

 Hi Folks

 Had a vague comment handed my way that xia2 assigns too many free
 reflections - I have a feeling that by default it makes a free set of 5%,
 which was OK back in the day (like I/sig(I) = 2 was OK) but maybe seems
 excessive now.

 This was particularly in the case of high resolution data where you have a
 lot of reflections, so 5% could be several thousand which would be more
 than you need to just check Rfree seems OK.

 Since I really don't know what the right # of reflections to assign to a
 free set is, I thought I would ask here - what do you think? Essentially I
 need to assign a minimum %age or a minimum # - the lower of the two,
 presumably?

 Any comments welcome!

 Thanks & best wishes Graeme