On Tue, May 25, 2010 at 6:58 AM, Pavel Afonine <pafon...@lbl.gov> wrote:
> Yes, because all the information you need is encapsulated in 1 number
> per region of interest!
>
> I agree it's a good reason.

Actually what I described doesn't quite achieve this, because you need
to keep a record of both the RMSD Z-score *and* Npoints (or Ndof) in
order to decide whether the critical p-value is exceeded or not. Note
that this is also true of the RSCC and RSR scores: the reliability of
the score will depend critically on the number of statistically
independent grid points used to compute it.

For this reason I would suggest a small modification to the algorithm I
described that avoids this problem:

1. Compute chi-squared = sum(delta_rho^2)/sigma(rho)^2 / 8 and
   Ndof = Npoints / 8 for the hypothetical Shannon-sampled map. Here
   the factors of 8 account for the over-sampling of the actual map as
   before but may need to be adjusted in practice because FFT doesn't
   necessarily use grid sampling that's an exact integer fraction of
   the resolution.

2. Look up or compute the p-value ( = 1 - CDF of the chi-square
   distribution). Note that Ndof doesn't need to be an integer for
   this, since it's basically a gamma function.

3. Use the p-value, or probably more conveniently its negative log, as
   the score. Now you don't need to keep track of Npoints or Ndof.

4. Alternatively you could convert the p-value to the equivalent sigma
   level for the normal distribution, using tables or the inverse
   normal CDF function. I suspect many people are not familiar with
   p-values but are more comfortable with sigma levels (i.e. > 3 sigma
   probably means that it's significant).

> But I don't understand what you mean by
> 2mFo-DFc & mFo-DFc being counted each as 1 number.
>
> Given the map and model, you can get the map value at (x,y,z) position, for
> example, at the center of atom. This is what phenix.model_vs_data reports.
> For each atom you get three numbers: map CC, and the values of 2mFo-DFc and
> mFo-DFc maps at the atomic position.
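
For concreteness, here is a rough sketch of steps 1-4 of the modified
algorithm in Python, using only the standard library. It is an
illustration under stated assumptions, not tested production code. The
function names (gamma_q, chi2_sf, sigma_level, density_score) are made
up for this example; sigma_rho is assumed to be a single value for the
whole map; the negative log is taken to base 10; and the sigma level
uses the one-sided normal tail. The factor of 8 is exposed as a
parameter since, as noted in step 1, it may need adjusting.

```python
import math
from statistics import NormalDist


def gamma_q(a, x, eps=1e-12, max_iter=1000):
    """Regularized upper incomplete gamma function Q(a, x), a > 0."""
    if x <= 0.0:
        return 1.0
    log_pref = a * math.log(x) - x - math.lgamma(a)
    if x < a + 1.0:
        # Power series for P(a, x); return Q = 1 - P.
        n, term = a, 1.0 / a
        total = term
        for _ in range(max_iter):
            n += 1.0
            term *= x / n
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * math.exp(log_pref)
    # Continued fraction for Q(a, x), evaluated by the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c = 1.0 / tiny
    d = 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * math.exp(log_pref)


def chi2_sf(chi2, n_dof):
    """Step 2: p-value = 1 - CDF of chi-square; n_dof may be non-integer."""
    return gamma_q(n_dof / 2.0, chi2 / 2.0)


def sigma_level(p):
    """Step 4: one-sided normal sigma level with the same tail probability."""
    if p <= 0.0:
        return math.inf
    if p >= 1.0:
        return -math.inf
    return -NormalDist().inv_cdf(p)


def density_score(delta_rho, sigma_rho, oversampling=8.0):
    """Steps 1-4 for the difference-map values in one region of interest."""
    if not delta_rho:
        raise ValueError("need at least one grid point")
    # Step 1: scale chi-squared and Ndof down to the Shannon-sampled map.
    chi2 = sum(d * d for d in delta_rho) / sigma_rho ** 2 / oversampling
    n_dof = len(delta_rho) / oversampling
    p = chi2_sf(chi2, n_dof)
    # Step 3: negative log (base 10 here) of the p-value as the score.
    neg_log10_p = math.inf if p <= 0.0 else -math.log10(p)
    return {"chi2": chi2, "n_dof": n_dof, "p": p,
            "neg_log10_p": neg_log10_p, "sigma": sigma_level(p)}
```

Calling density_score(values, sigma_rho) on the difference-map values
for one residue gives a single number per region (neg_log10_p or sigma)
without having to carry Npoints along. If SciPy is available,
scipy.stats.chi2.sf(chi2, n_dof) should give the same p-value as
chi2_sf here.
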
I was thinking of per-residue scores (maybe separated according to
main/side-chain atoms); I don't think per-atom scores are so useful.

> We can more or less relate these values to the map appearance and
> model-to-map fit quality. Looking at these numbers one can approximately
> tell whether it's good, so so, or bad. It's like crystallographic reciprocal
> space R-factor. If I see R=35% for a structure at 2A resolution - it's not
> good, and R=17% is much better.

What you need is a continuously variable *single* score of 'goodness'
(essentially a probability score such as a p-value) that you can plot
vertically on a graph with the atom/residue numbers running
horizontally. Then if you want you can divide the graph vertically into
regions: e.g. 'good', 'bad' and 'so-so'. If you have multiple scores
which individually do not give you that information, you still have the
problem of coming up with a formula to combine them into a single score
that can be graphed.

Also you really need to distinguish those scores that purely measure
model *consistency* (i.e. with the X-ray data), in order to inform the
crystallographer what further improvements to the model are necessary
during the structure determination phase, from scores that purely
measure model *quality*. By that I mean those scores that allow users
of structure models to compare your model with other structures
obtained from different datasets (and quite possibly different crystal
forms), for the purpose of obtaining biological information from the
structure, so that really only becomes an issue post-structure
determination. After all, do you really want to be told that the
density for your surface side-chains and loops is poor quality when
there's nothing more you can do to improve the fit - you knew that
already!
The problem with the RSCC, RSR and 2mFo-DFc map scores is that they
confuse consistency with quality, so don't necessarily tell you where
changes need to be made, whereas the difference map and scores based on
it measure only consistency. I agree though that the 2mFo-DFc map when
used in conjunction with the difference map is useful for telling you
where changes need to be made.

> Anyway I will code that formula and play with it.

Great!

Cheers

-- Ian