On Tue, May 25, 2010 at 6:58 AM, Pavel Afonine <pafon...@lbl.gov> wrote:

> Yes, because all the information you need is encapsulated in 1 number
> per region of interest!
>
> I agree it's a good reason.

Actually what I described doesn't quite achieve this, because you need
to keep a record of both the RMSD Z-score *and* Npoints (or Ndof) in
order to decide whether the critical p-value is exceeded or not.  Note
that this is also true of the RSCC and RSR scores: the reliability of
the score will depend critically on the number of statistically
independent grid points used to compute it.  For this reason I would
suggest a small modification to the algorithm I described that avoids
this problem:

1. Compute chi-squared = [sum(delta_rho^2) / sigma(rho)^2] / 8 and Ndof =
Npoints / 8 for the hypothetical Shannon-sampled map.

Here the factors of 8 account for the over-sampling of the actual map
as before but may need to be adjusted in practice because FFT doesn't
necessarily use grid sampling that's an exact integer fraction of the
resolution.

2. Look up or compute the p-value ( = 1 - CDF of the chi-square
distribution).  Note that Ndof doesn't need to be an integer for this,
since the CDF is basically a regularized incomplete gamma function,
which is defined for non-integer arguments.

3. Use the p-value, or probably more conveniently its negative log, as
the score.  Now you don't need to keep track of Npoints or Ndof.

4. Alternatively you could convert the p-value to the equivalent sigma
level for the normal distribution, using tables or the inverse normal
CDF function.  I suspect many people are not familiar with p-values
but are more comfortable with sigma levels (i.e. > 3 sigma probably
means that it's significant).
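
Steps 1-4 could be sketched roughly as follows in Python, using only the
standard library.  This is a minimal sketch, not a tested implementation
from any crystallographic package: the function names (gamma_q,
chi2_p_value, region_score, sigma_level) are hypothetical, the incomplete
gamma function is evaluated with the usual series / continued-fraction
split, the negative log is taken to base 10, and the two-sided convention
for the sigma level is my assumption (a one-sided convention would use p
instead of p/2).

```python
import math
from statistics import NormalDist


def gamma_q(a, x, eps=1e-12, max_iter=10000):
    """Regularized upper incomplete gamma function Q(a, x), for a > 0, x >= 0."""
    if a <= 0.0 or x < 0.0:
        raise ValueError("gamma_q requires a > 0 and x >= 0")
    if x == 0.0:
        return 1.0
    # log of x^a * exp(-x) / Gamma(a), shared by both expansions
    log_prefactor = -x + a * math.log(x) - math.lgamma(a)
    if x < a + 1.0:
        # Series expansion of P(a, x); return Q = 1 - P.
        n = a
        term = 1.0 / a
        total = term
        for _ in range(max_iter):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_prefactor)
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        if abs(d) < tiny:
            d = tiny
        c = b + an / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_prefactor)


def chi2_p_value(chi2, ndof):
    """Step 2: p-value = 1 - CDF of chi-square; ndof may be non-integer."""
    return gamma_q(0.5 * ndof, 0.5 * chi2)


def region_score(delta_rho, sigma_rho, oversampling=8.0):
    """Steps 1-3 for one region.

    delta_rho    : difference-map values at the grid points of the region
    sigma_rho    : standard uncertainty of the map density
    oversampling : factor of 8 from the text; may need adjusting in practice
    Returns (chi2, ndof, p, -log10(p)).
    """
    chi2 = sum(d * d for d in delta_rho) / sigma_rho ** 2 / oversampling
    ndof = len(delta_rho) / oversampling
    p = chi2_p_value(chi2, ndof)
    neg_log_p = -math.log10(p) if p > 0.0 else math.inf
    return chi2, ndof, p, neg_log_p


def sigma_level(p):
    """Step 4: equivalent sigma level (two-sided convention, an assumption)."""
    if p <= 0.0:
        return math.inf
    return -NormalDist().inv_cdf(0.5 * p)
```

Only the last element of region_score (or sigma_level applied to p) needs
to be stored per region; Npoints and Ndof are already folded into it.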

> But I don't understand what you mean by
> 2mFo-DFc & mFo-DFc being counted each as 1 number.
>
> Given the map and model, you can get the map value at (x,y,z) position, for
> example, at the center of atom. This is what phenix.model_vs_data reports.
> For each atom you get three numbers: map CC, and the values of 2mFo-DFc and
> mFo-DFc maps at the atomic position.

I was thinking of per-residue scores (maybe separated according to
main/side-chain atoms); I don't think per-atom scores are so useful.

> We can more or less relate these values to the map appearance and
> model-to-map fit quality. Looking at these numbers one can approximately
> tell whether it's good, so so, or bad. It's like crystallographic reciprocal
> space R-factor. If I see R=35% for a structure at 2A resolution - it's not
> good, and R=17% is much better.

What you need is a continuously variable *single* score of 'goodness'
(essentially a probability score such as a p-value) that you can plot
vertically on a graph with the atom/residue numbers running horizontally.
Then if you want you can divide the graph vertically into regions:
e.g. 'good', 'bad' and 'so-so'.  If you have multiple scores which
individually do not give you that information, you still have the
problem of coming up with a formula to combine them into a single
score that can be graphed.
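
As a rough illustration of dividing a single score into regions, here is a
minimal Python sketch.  It assumes the score is an equivalent sigma level
(higher means a more significant misfit).  The function name and the lower
cutoff of 2 sigma are hypothetical choices of mine; the upper cutoff of
3 sigma just follows the "> 3 sigma" rule of thumb.

```python
from bisect import bisect_left

LABELS = ("good", "so-so", "bad")


def classify(score, cutoffs=(2.0, 3.0)):
    """Map a sigma-level score to a label; a score must exceed a cutoff to move up."""
    return LABELS[bisect_left(cutoffs, score)]


def classify_residues(scores_by_residue, cutoffs=(2.0, 3.0)):
    """scores_by_residue: dict of residue number -> sigma-level score."""
    return {res: classify(s, cutoffs) for res, s in sorted(scores_by_residue.items())}
```

With a single score per residue the cutoffs are just horizontal lines on
the graph, so they can be moved without recomputing anything.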

Also you really need to distinguish those scores that purely measure
model *consistency* (i.e. with the X-ray data), in order to inform the
crystallographer what further improvements to the model are necessary
during the structure determination phase, from scores that purely
measure model *quality*.  By that I mean those scores that allow users
of structure models to compare your model with other structures
obtained from different datasets (and quite possibly different crystal
forms), for the purpose of obtaining biological information from the
structure; that really only becomes an issue after structure
determination.  After all, do you really want to be told that the
density for your surface side-chains and loops is poor quality when
there's nothing more you can do to improve the fit?  You knew that
already!  The problem with the RSCC, RSR and 2mFo-DFc map scores is
that they confuse consistency with quality, so don't necessarily tell
you where changes need to be made, whereas the difference map and
scores based on it measure only consistency.  I agree though that the
2mFo-DFc map when used in conjunction with the difference map is
useful for telling you where changes need to be made.

> Anyway I will code that formula and play with it.

Great!

Cheers

-- Ian
