Here are my two or three cents' worth on the topic:
The first thing to keep in mind is that the goal of a structure determination
is not to get the best statistics or to claim the highest possible resolution.
The goal is to get the best possible structure and to be confident that the
observed features in that structure are real and not the result of noise.
From that perspective, if any of the conclusions one draws from a structure
change depending on whether the highest-resolution shell has an average
I/sigI of 2 or 1, one is probably treading on thin ice.
The general guideline that one should include only data for which the shell's
average I/sigI > 2 comes from the following simple consideration:

    F/sigF = 2 * (I/sigI)

So if you include data with an I/sigI of 2, your F/sigF is about 4. In other
words, you will have roughly a 25% experimental uncertainty in your F.
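As a quick numerical illustration (just a sketch; the factor of two comes from
propagating the error through I = F^2, so sigma(I) is roughly 2*F*sigma(F), and
the relation only holds for reasonably strong reflections):

# Relation between intensity and amplitude signal-to-noise:
# since I = F^2, sigma(I) ~= 2 * F * sigma(F), so F/sigF ~= 2 * I/sigI.
def amplitude_signal_to_noise(i_over_sig_i):
    """Approximate F/sigma(F) from I/sigma(I) via simple error propagation."""
    return 2.0 * i_over_sig_i

for i_over_sig in (1.0, 2.0, 3.0):
    f_over_sig = amplitude_signal_to_noise(i_over_sig)
    print(f"I/sigI = {i_over_sig:.0f}  ->  F/sigF = {f_over_sig:.0f}"
          f"  (~{100.0 / f_over_sig:.0f}% relative error in F)")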
Now assume that you actually knew the true structure of your protein and
calculated the crystallographic R-factor between the Fcalcs from that true
structure and the observed Fs. In this situation you would expect a
crystallographic R-factor of around 25%, simply because of the average error
in your experimental structure factors. Since most macromolecular structures
have R-factors around 20%, it makes little sense to include data for which
the experimental uncertainty alone guarantees that your R-factor will be worse.
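To put a rough number on that, here is a toy simulation (Gaussian noise applied
directly to F, which is a simplification; the exact value depends on the noise
model, but it comes out right around the size of a typical final R-factor):

import random

# Toy Monte Carlo: if Fobs carries ~25% random error, what R-factor do we get
# against the true Fcalc even with a perfect model?
random.seed(0)
f_true = [random.uniform(10.0, 100.0) for _ in range(100000)]  # made-up "true" amplitudes
f_obs = [f + random.gauss(0.0, 0.25 * f) for f in f_true]      # 25% relative error

r = sum(abs(fo - fc) for fo, fc in zip(f_obs, f_true)) / sum(f_obs)
print(f"R-factor from experimental noise alone: {r:.2f}")      # comes out around 0.20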
Of course, these days maximum-likelihood refinement will simply down-weight
such data, and all you do is burn CPU cycles.
If you actually want to do a semi-rigorous test of where you should stop
including data, simply include progressively higher-resolution data in your
refinement and see whether your structure improves.
If you have really high-resolution data (i.e. better than 1.2 Angstrom), you
can do matrix inversion in SHELX and get estimated standard deviations
(e.s.d.s) for your refined parameters. As you include more and more data, the
e.s.d.s should initially decrease. Simply keep including higher-resolution
data until the e.s.d.s start to increase again.
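If it helps, here is a rough sketch of that scan in Python; run_refinement_at_cutoff
is just a made-up placeholder for whatever refinement script you actually use (e.g.
a SHELXL run followed by parsing the e.s.d.s), and the numbers it returns here are
fake, purely to show the logic:

# Sketch of a resolution-cutoff scan: extend the high-resolution limit in steps
# and watch the mean e.s.d. of the refined parameters; stop once it turns upward.
def run_refinement_at_cutoff(d_min):
    """Stand-in: pretend to refine against all data to resolution d_min (Angstrom)
    and return the mean e.s.d. of the refined parameters."""
    return 0.010 + 0.02 * (d_min - 1.3) ** 2   # fake U-shaped curve, minimum at 1.3 A

cutoffs = [1.6, 1.5, 1.4, 1.3, 1.2, 1.1]       # high-resolution limits to try
best_cutoff, best_esd = None, float("inf")

for d_min in cutoffs:
    mean_esd = run_refinement_at_cutoff(d_min)
    print(f"d_min = {d_min:.1f} A   mean e.s.d. = {mean_esd:.4f}")
    if mean_esd < best_esd:
        best_cutoff, best_esd = d_min, mean_esd
    else:
        break   # e.s.d.s started increasing again; the previous cutoff was the sweet spot

print(f"Suggested high-resolution cutoff: {best_cutoff:.1f} A")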
Similarly, at lower resolution you can monitor molecular parameters that are
not included in the stereochemical restraints and see whether including
higher-resolution data improves the agreement between the observed and
expected values. For example, SHELX does not restrain torsion angles in the
aliphatic portions of side chains. If your structure improves, those angles
should cluster more tightly around +60, -60 and 180 degrees...
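As a rough illustration of that kind of check (the chi angles below are made-up
numbers standing in for values read from the two models):

import math

# Score how tightly side-chain torsion angles cluster around the staggered
# positions (+60, -60, 180 degrees). A smaller RMS deviation from the nearest
# staggered value after adding higher-resolution data suggests the data helped.
STAGGERED = (60.0, -60.0, 180.0)

def circular_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def rms_deviation_from_staggered(chi_angles):
    devs = [min(circular_diff(chi, s) for s in STAGGERED) for chi in chi_angles]
    return math.sqrt(sum(d * d for d in devs) / len(devs))

# Made-up example chi angles:
chi_before = [55.0, -75.0, 170.0, 48.0, -52.0, 165.0]   # lower-resolution refinement
chi_after = [59.0, -62.0, 178.0, 61.0, -58.0, 181.0]    # higher-resolution refinement

print(f"RMS deviation before: {rms_deviation_from_staggered(chi_before):.1f} deg")
print(f"RMS deviation after:  {rms_deviation_from_staggered(chi_after):.1f} deg")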
Cheers,
Ulrich
Could someone point me to some standards for data quality,
especially for publishing structures? I'm wondering in particular
about highest shell completeness, multiplicity, sigma and Rmerge.
A co-worker pointed me to a '97 article by Kleywegt and Jones:
http://xray.bmc.uu.se/gerard/gmrp/gmrp.html
"To decide at which shell to cut off the resolution, we nowadays
tend to use the following criteria for the highest shell:
completeness > 80 %, multiplicity > 2, more than 60 % of the
reflections with I > 3 sigma(I), and Rmerge < 40 %. In our opinion,
it is better to have a good 1.8 Å structure, than a poor 1.637 Å
structure."
Are these recommendations still valid with maximum-likelihood
methods? We tend to use more data, especially in terms of the
Rmerge and sigma cutoffs.
Thanks in advance,
Shane Atwell