The second part of your question has to do with assessing the
probability of correctness of a model by comparing the distribution of
the individual values of geometry items with the distribution observed
in large sets of high quality crystal structures. Certainly, if your
model has many more large deviants than expected from the observed
distribution of deviants in quality models I would have doubts about it.
(I would also like to say that too few large deviants is a mark of
shame too, but read on.)
Actually, this is nothing more than comparing the rmsd bond lengths
and rmsd bond angles with the rmsd's of the restraint library. You are
basically fitting a Normal distribution to both sets of observations and
comparing their sigmas. We used to do that explicitly, and still do so
implicitly when we publish these rmsd's in Table 1.
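As a rough illustration of what "comparing their sigmas" amounts to: for deviations assumed to be zero-mean and Normal, the maximum-likelihood estimate of sigma is simply the rms deviation. The sketch below uses synthetic deviations and made-up sigma values (0.01 A for the model, 0.02 A for the library); none of these numbers come from a real structure or a real restraint library.

```python
import math
import random

def rmsd(deviations):
    """Root-mean-square of (model - ideal) deviations. For a zero-mean
    Normal distribution this is the maximum-likelihood estimate of sigma."""
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))

# Assumptions for illustration only: synthetic bond-length deviations
# drawn with sigma = 0.01 A, and a hypothetical library sigma of 0.02 A.
rng = random.Random(0)
model_deviations = [rng.gauss(0.0, 0.01) for _ in range(1000)]
library_sigma = 0.02

model_rmsd = rmsd(model_deviations)
ratio = model_rmsd / library_sigma
print(f"model rmsd = {model_rmsd:.4f} A, ratio to library sigma = {ratio:.2f}")
```

A ratio well below 1 means the model's deviations are distributed more tightly than the library's, and a ratio well above 1 means more loosely.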
What we have learned is that a model with rmsd's that are too large
is certainly suspect, but people only rarely produce such models any
more. The real complication is that we, as a community, have decided
based on other criteria that it is best for our models to have rmsd's
for geometry that are much smaller than the rmsd's of our restraint
libraries.
The rmsd bond length of the quality models that I've seen tends to be
around 0.02 A. Looking in the PDB we tend to prefer 0.01 A and often
less. There are good reasons for this, based on the fact that low
resolution data cannot define the correct values of the deviants, and in
that case we prefer to have deviants that are too small rather than
deviants that have the correct magnitude distribution but are not
related to the "real" deviants on a bond-by-bond basis. (SigmaA
weighting comes to mind as a similar solution to a similar problem.)
If we assess the reliability of our models by looking to see if the
distribution of deviants matches that of the library, all of our models
will be flagged as extremely unlikely. Does that mean that matching the
distributions will improve the model, as measured by the reliability of
the individual or relative locations of the atoms? I don't think so.
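To put a rough number on "extremely unlikely", here is a small sketch. It assumes independent Gaussian deviations, a library sigma of 0.02 A and a model rmsd of 0.01 A (the round figures mentioned above), and 1000 bonds; the bond count and the function name are arbitrary choices for illustration.

```python
import math

def frac_beyond(z_cut, sigma_model, sigma_library):
    """Expected fraction of deviations larger than z_cut library sigmas
    when the deviations actually have standard deviation sigma_model."""
    z_effective = z_cut * sigma_library / sigma_model
    return math.erfc(z_effective / math.sqrt(2.0))

n_bonds = 1000  # assumed size, for illustration only

# Model whose deviations follow the library distribution.
expected_if_matched = n_bonds * frac_beyond(3.0, 0.02, 0.02)
# Model restrained to half the library sigma.
expected_if_tight = n_bonds * frac_beyond(3.0, 0.01, 0.02)

print(f"expected 3-sigma bonds, matched model: {expected_if_matched:.2f}")
print(f"expected 3-sigma bonds, tight model:   {expected_if_tight:.2e}")
```

Under these assumptions a library-like model should show two or three 3-sigma bonds, while the tightly restrained one should show essentially none. That is the sense in which a test on the distribution would flag it.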
Dale E. Tronrud
On 11/8/2022 3:25 PM, James Holton wrote:
Thank you Ian for your quick response!
I suppose what I'm really trying to do is put a p-value on the
"geometry" of a given PDB file. As in: what are the odds the deviations
from ideality of this model are due to chance?
I am leaning toward the need to take all the deviations in the structure
together as a set, but, as Joao just noted, it just "feels wrong"
to tolerate a 3-sigma deviate. Even more wrong to tolerate 4 sigma, 5
sigma. And 6 sigma deviates are really difficult to swallow unless you
have trillions of data points.
To put it down in equations, is the p-value of a structure with 1000
bonds in it with one 3-sigma deviate given by:
a) p = 1-erf(3/sqrt(2))
or
b) p = 1-erf(3/sqrt(2))**1000
or
c) something else?
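For reference, the two candidate expressions are easy to evaluate numerically. This is a minimal sketch that assumes independent Gaussian deviates and reads (b) as 1 - (erf(3/sqrt(2)))**1000, that is, with the power applied before the subtraction. The function names are hypothetical.

```python
import math

def two_sided_tail(z):
    """Probability that one Gaussian deviate lies more than z sigma
    from the mean, in either direction: 1 - erf(z / sqrt(2))."""
    return math.erfc(z / math.sqrt(2.0))

def prob_at_least_one(z, n):
    """Probability that at least one of n independent Gaussian deviates
    lies more than z sigma from the mean."""
    return 1.0 - (1.0 - two_sided_tail(z)) ** n

p_single = two_sided_tail(3.0)        # expression (a)
p_any = prob_at_least_one(3.0, 1000)  # expression (b)

print(f"(a) one given deviate beyond 3 sigma: {p_single:.4f}")
print(f"(b) at least one of 1000 beyond 3 sigma: {p_any:.3f}")
```

Read this way, (a) is the chance for one pre-selected bond and (b) is the chance that the largest of 1000 deviates exceeds 3 sigma. They answer different questions.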
On 11/8/2022 2:56 PM, Ian Tickle wrote:
Hi James
I don't think it's meaningful to ask whether the deviation of a single
bond length (or anything else that's single) from its expected value
is significant, since as you say there's always some finite
probability that it occurred purely by chance. Statistics can only
meaningfully be applied to samples of a 'reasonable' size. I know
there are statistics designed for small samples but not for samples of
size 1! It's more meaningful to talk about distributions. For
example if 1% of the sample contained deviations > 3 sigma when you
expected there to be only 0.3 %, that is probably significant (but it
still has a finite probability of occurring by chance), as would be
finding no deviations > 3 sigma (for a reasonably large sample to
avoid sampling errors).
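One way to make the distribution argument concrete is a binomial tail calculation. The sketch below assumes a sample of 1000 independent Gaussian deviations (the sample size is not stated above and is chosen only for illustration), so that "1%" means 10 deviations beyond 3 sigma.

```python
import math

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1.0 - p) ** (n - k)

def binom_upper_tail(k, n, p):
    """Probability of k or more successes in n independent trials."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))

n = 1000                                 # assumed sample size
p3 = math.erfc(3.0 / math.sqrt(2.0))     # chance one deviate exceeds 3 sigma

expected_count = n * p3
p_ten_or_more = binom_upper_tail(10, n, p3)  # 1% of the sample or more
p_none = binom_pmf(0, n, p3)                 # no 3-sigma deviations at all

print(f"expected count: {expected_count:.1f}")
print(f"P(10 or more): {p_ten_or_more:.1e}")
print(f"P(none): {p_none:.3f}")
```

With these assumptions both the excess and the complete absence of 3-sigma deviations come out as unlikely, the excess much more so.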
Cheers
-- Ian
On Tue, Nov 8, 2022, 22:22 James Holton <jmhol...@lbl.gov> wrote:
OK, so let's suppose there is this bond in your structure that is
stretched a bit. Is that for real? Or just a random fluke? Let's say,
for example, it's a CA-CB bond that is supposed to be 1.529 A long, but
in your model it's 1.579 A. This is 0.05 A too long. Doesn't seem like
much, right? But the "sigma" given to such a bond in our geometry
libraries is 0.016 A. These sigmas are typically derived from a
database of observed bonds of similar type found in highly accurate
structures, like small molecules. So, that makes this a 3-sigma
outlier.
Assuming the distribution of deviations is Gaussian, that's a pretty
unlikely thing to happen. You expect 3-sigma deviates to appear less
than 0.3% of the time. So, is that significant?
But, then again, there are lots of other bonds in the structure. Let's
say there are 1000. With that many samplings from a Gaussian
distribution you generally expect to see a 3-sigma deviate at least
once. That is, do an "experiment" where you pick 1000 Gaussian-random
numbers from a distribution with a standard deviation of 1.0. Then, look
for the maximum over all 1000 trials. Is that one > 3 sigma? It probably
is. If you do this "experiment" millions of times it turns out seeing at
least one 3-sigma deviate in 1000 tries is very common. Specifically,
about 93% of the time. It is rare indeed to have every member of a
1000-deviate set all lie within 3 sigmas. So, we have gone from one
3-sigma deviate being highly unlikely to being a virtual certainty if
you look at enough samples.
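The "experiment" described here can be sketched in a few lines. This version makes some assumptions of its own: it uses only a few thousand repeats rather than millions, to keep the run time short, a fixed seed for reproducibility, and it counts a deviate as an outlier when its absolute value exceeds 3, which is the reading that matches the figure of about 93%.

```python
import random

def fraction_with_outlier(n_deviates=1000, n_trials=2000,
                          threshold=3.0, seed=0):
    """Fraction of trials in which at least one of n_deviates
    unit-variance Gaussian numbers lies beyond +/- threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # any() stops drawing as soon as the first outlier is found.
        if any(abs(rng.gauss(0.0, 1.0)) > threshold
               for _ in range(n_deviates)):
            hits += 1
    return hits / n_trials

frac = fraction_with_outlier()
print(f"fraction of trials with a 3-sigma deviate: {frac:.3f}")
```

Counting only positive deviates (the maximum alone, without the absolute value) would give a noticeably lower fraction, so the choice of one-sided versus two-sided matters when quoting the number.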
So, my question is: is a 3-sigma deviate significant? Is it significant
only if you have one bond in the structure? What about angles? What if
you have 500 bonds and 500 angles? Do they count as 1000 deviates
together? Or separately?
I'm sure the more mathematically inclined out there will have some
intelligent answers for the rest of us. However, if you are not a
mathematician, how about a vote? Is a 3-sigma bond length deviation
significant? Or not?
Looking forward to both kinds of responses,
-James Holton
MAD Scientist
########################################################################
To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
<https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1>
This message was issued to members of www.jiscmail.ac.uk/CCP4BB
<http://www.jiscmail.ac.uk/CCP4BB>, a mailing list hosted by
www.jiscmail.ac.uk <http://www.jiscmail.ac.uk>, terms & conditions
are available at https://www.jiscmail.ac.uk/policyandsecurity/