The characters on my car license plate are in one sense highly improbable -
billions-to-one against.  In another sense they are normal and probable
because there is nothing special about those characters.

However, if I ordered a new car and it was delivered with a license plate
that included my initials, as in PDSS 0001, that would clearly not be a
coincidence!

On the other hand, we sometimes overlook the number of possible paths that
lead to an observation, and see connections where there aren't any.  E.g.,
what would you have said, at the time, was the chance that the "umbrella
man" was NOT involved in the assassination of JFK (spoiler: he wasn't!)?

https://en.wikipedia.org/wiki/Umbrella_man_(JFK_assassination)




On Wed, Nov 9, 2022 at 7:38 AM Dale Tronrud <de...@daletronrud.com> wrote:

>     Let's say you have decided that you want to know if the CA-CB bond
> of residue 123 in your favorite protein differs from the expected value
> for that type of bond.  You solve the structure and refine a model
> against your crystallographic data, then look at residue 123's CA-CB
> bond and find that it is 3 sigma from the expected value.  Is this
> observation unlikely given the uncertainties in the parameters of the
> model?
>
>     Now, let's look at a different case.  You have solved and refined a
> model of your favorite protein.  After examining all of 1000 bond
> lengths in your model you notice that the CA-CB bond of residue 123 is 3
> sigma from its expected value.  Is this observation unlikely given the
> uncertainties in the parameters of the model?
>
>     Even though you are looking at the same bond in the same model and
> see exactly the same thing, the calculation of the probability that this
> bond actually differs from the usual value is very different.  The
> calculation that you want to perform - the classic p test based on a
> Normal distribution - is valid for the first case but is quite
> inappropriate for the second.
>
>     It is clearly much more likely that, among 1000 bonds, one of them
> will have a deviation of 3 sigma.  In fact I would say it is a near
> certainty.
>
>     This twist of statistical analysis was never discussed in the basic
> classes on stats that I took and most scientists tend to ignore it.  To
> avoid the apparent paradox that you are confronting you have to include
> in your calculations the consequences of the actual question you have
> asked.
>
>     There are huge problems with calculating this sort of "significance"
> because it is quite tempting to change your question after the fact and
> conclude that something is significant when it is not.  TNT always
> produced a list of the geometry outliers after refinement.  If you
> notice that a residue in the active site is present in that list, you
> will be tempted to forget that this residue was brought to your
> attention by a search over all geometry restraints and not a prior
> interest in the active site.
>
>     This is a problem that many other fields of research are contending
> with.  One solution is to publish the questions you hope your model will
> answer before you perform the research.  That is certainly difficult
> with our sort of research.
>
>     An example from another area might be helpful.  A researcher
> performs a survey of a lot of people asking questions about their diet
> and about their medical history.  Very often the published conclusion
> will be that, say, dietary item number 5 is correlated with medical
> condition number 12.  These studies tend to assess the significance of
> this result by just comparing the odds of these two items having the
> observed magnitude of correlation.
>
>     This ignores the fact that a host of correlations were calculated
> and only this one was "significant".  If the survey had 20 dietary
> factors and 20 conditions then 400 comparisons were made and it was a
> virtual certainty that one of them would be "significant" unless the
> proper correction was made to the probability calculations.
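The size of the effect in the 20 x 20 survey example can be sketched as below. This is only an illustration under stated assumptions: the 400 tests are treated as independent, and the per-test threshold of 0.05 is an assumed value that does not appear in the message above. Bonferroni and Sidak are two standard corrections, named here as examples rather than as the correction Dale had in mind; the function names are hypothetical.

```python
def familywise_error(alpha, m):
    """P(at least one false positive) over m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_threshold(alpha, m):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / m


def sidak_threshold(alpha, m):
    """Per-test threshold giving exactly alpha family-wise, assuming independence."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)


m = 20 * 20      # 20 dietary factors x 20 conditions, as in the example
alpha = 0.05     # assumed per-test significance level (not from the thread)

fwer = familywise_error(alpha, m)
bonf = bonferroni_threshold(alpha, m)
sidak = sidak_threshold(alpha, m)

print(f"chance of at least one 'significant' result by luck: {fwer:.6f}")
print(f"Bonferroni per-test threshold: {bonf:.3e}")
print(f"Sidak per-test threshold:      {sidak:.3e}")
```

With real survey data the comparisons are rarely independent, so these numbers are best read as a rough guide to how strict the per-test threshold has to become.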
>
> Dale E. Tronrud
>
> On 11/8/2022 3:25 PM, James Holton wrote:
> > Thank you Ian for your quick response!
> >
> > I suppose what I'm really trying to do is put a p-value on the
> > "geometry" of a given PDB file.  As in: what are the odds the deviations
> > from ideality of this model are due to chance?
> >
> > I am leaning toward the need to take all the deviations in the structure
> > together as a set, but, as Joao just noted, it just "feels wrong"
> > to tolerate a 3-sigma deviate.  Even more wrong to tolerate 4 sigma, 5
> > sigma. And 6 sigma deviates are really difficult to swallow unless you
> > have trillions of data points.
> >
> > To put it down in equations, is the p-value of a structure with 1000
> > bonds in it with one 3-sigma deviate given by:
> >
> > a)  p = 1-erf(3/sqrt(2))
> > or
> > b)  p = 1-erf(3/sqrt(2))**1000
> > or
> > c) something else?
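For reference, options (a) and (b) can be evaluated numerically. This is a minimal sketch, assuming independent Gaussian deviations and a two-sided 3-sigma threshold; the function names are illustrative and not from the thread. Option (b) is read with Python operator precedence, i.e. 1 - (erf(3/sqrt(2))**1000).

```python
import math


def two_sided_tail(z):
    """P(|x| > z) for a single standard normal deviate."""
    return 1.0 - math.erf(z / math.sqrt(2.0))


def prob_at_least_one(z, n):
    """P(at least one of n independent deviates exceeds z sigma in magnitude)."""
    return 1.0 - math.erf(z / math.sqrt(2.0)) ** n


p_single = two_sided_tail(3.0)           # option (a): one bond chosen in advance
p_any = prob_at_least_one(3.0, 1000)     # option (b): any of 1000 bonds

print(f"(a) single pre-chosen bond: {p_single:.4f}")
print(f"(b) at least one in 1000:   {p_any:.3f}")
```

Under these assumptions (a) answers "how surprising is this particular bond, picked before looking?", while (b) answers "how surprising is it that some bond out of 1000 is this far out?".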
> >
> >
> >
> > On 11/8/2022 2:56 PM, Ian Tickle wrote:
> >> Hi James
> >>
> >> I don't think it's meaningful to ask whether the deviation of a single
> >> bond length (or anything else that's single) from its expected value
> >> is significant, since as you say there's always some finite
> >> probability that it occurred purely by chance.  Statistics can only
> >> meaningfully be applied to samples of a 'reasonable' size.  I know
> >> there are statistics designed for small samples but not for samples of
> >> size 1!  It's more meaningful to talk about distributions.  For
> >> example if 1% of the sample contained deviations > 3 sigma when you
> >> expected there to be only 0.3 %, that is probably significant (but it
> >> still has a finite probability of occurring by chance), as would be
> >> finding no deviations > 3 sigma (for a reasonably large sample to
> >> avoid sampling errors).
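Ian's distribution-level comparison can be sketched with a binomial model. The assumptions here are mine, not his: a sample of 1000 independent deviates, a two-sided Gaussian tail probability for "> 3 sigma", and "1% of the sample" taken to mean 10 or more outliers. The helper names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)


def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return 1.0 - sum(binom_pmf(i, n, p) for i in range(k))


n = 1000                                       # assumed sample size
p3 = 1.0 - math.erf(3.0 / math.sqrt(2.0))      # two-sided 3-sigma tail, about 0.0027

p_many = binom_tail_ge(10, n, p3)   # 1% or more of the sample beyond 3 sigma
p_none = binom_pmf(0, n, p3)        # no deviate beyond 3 sigma at all

print(f"P(>= 10 outliers in {n}): {p_many:.2e}")
print(f"P(no outliers in {n}):    {p_none:.3f}")
```

With these inputs the first probability comes out well below 1% and the second at several percent, so in this sketch an excess of outliers is the much stronger signal of the two.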
> >>
> >> Cheers
> >>
> >> -- Ian
> >>
> >>
> >> On Tue, Nov 8, 2022, 22:22 James Holton <jmhol...@lbl.gov> wrote:
> >>
> >>     OK, so let's suppose there is this bond in your structure that is
> >>     stretched a bit.  Is that for real? Or just a random fluke?  Let's
> >>     say, for example, it's a CA-CB bond that is supposed to be 1.529 A
> >>     long, but in your model it's 1.579 A.  This is 0.05 A too long.
> >>     Doesn't seem like much, right? But the "sigma" given to such a bond
> >>     in our geometry libraries is 0.016 A.  These sigmas are typically
> >>     derived from a database of observed bonds of similar type found in
> >>     highly accurate structures, like small molecules. So, that makes
> >>     this a 3-sigma outlier.  Assuming the distribution of deviations is
> >>     Gaussian, that's a pretty unlikely thing to happen. You expect
> >>     3-sigma deviates to appear less than 0.3% of the time.  So, is that
> >>     significant?
> >>
> >>     But, then again, there are lots of other bonds in the structure.
> >>     Let's say there are 1000. With that many samplings from a Gaussian
> >>     distribution you generally expect to see a 3-sigma deviate at least
> >>     once.  That is, do an "experiment" where you pick 1000
> >>     Gaussian-random numbers from a distribution with a standard
> >>     deviation of 1.0.  Then, look for the maximum over all 1000 trials.
> >>     Is that one > 3 sigma? It probably is. If you do this "experiment"
> >>     millions of times it turns out seeing at least one 3-sigma deviate
> >>     in 1000 tries is very common.  Specifically, about 93% of the time.
> >>     It is rare indeed to have every member of a 1000-deviate set all
> >>     lie within 3 sigmas.  So, we have gone from one 3-sigma deviate
> >>     being highly unlikely to being a virtual certainty if you look at
> >>     enough samples.
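The "experiment" James describes can be reproduced roughly as follows. This is a sketch, not his actual script: it uses far fewer trials than "millions" so that it runs quickly, a fixed seed for repeatability, and the absolute value of each deviate, which is the reading that gives a figure near 93% (counting only positive excursions would give a noticeably lower fraction). The function and parameter names are hypothetical.

```python
import random


def fraction_with_outlier(n_bonds=1000, n_trials=2000, threshold=3.0, seed=0):
    """Fraction of trials in which at least one of n_bonds standard normal
    deviates has magnitude above threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # any() stops drawing as soon as the first outlier in this trial appears
        if any(abs(rng.gauss(0.0, 1.0)) > threshold for _ in range(n_bonds)):
            hits += 1
    return hits / n_trials


frac = fraction_with_outlier()
print(f"fraction of trials with at least one |deviate| > 3 sigma: {frac:.3f}")
```

Raising n_trials tightens the estimate; with 2000 trials the statistical scatter is on the order of half a percent.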
> >>
> >>     So, my question is: is a 3-sigma deviate significant?  Is it
> >>     significant only if you have one bond in the structure?  What about
> >>     angles?  What if you have 500 bonds and 500 angles?  Do they count
> >>     as 1000 deviates together? Or separately?
> >>
> >>     I'm sure the more mathematically inclined out there will have some
> >>     intelligent answers for the rest of us, however, if you are not a
> >>     mathematician, how about a vote?  Is a 3-sigma bond length deviation
> >>     significant? Or not?
> >>
> >>     Looking forward to both kinds of responses,
> >>
> >>     -James Holton
> >>     MAD Scientist
> >>
> >>
>  ########################################################################
> >>
> >>     To unsubscribe from the CCP4BB list, click the following link:
> >>     https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
> >>     <https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1>
> >>
> >>     This message was issued to members of www.jiscmail.ac.uk/CCP4BB
> >>     <http://www.jiscmail.ac.uk/CCP4BB>, a mailing list hosted by
> >>     www.jiscmail.ac.uk <http://www.jiscmail.ac.uk>, terms & conditions
> >>     are available at https://www.jiscmail.ac.uk/policyandsecurity/
> >>
> >
> >
> >
>
>


-- 
 patr...@douglas.co.uk    Douglas Instruments Ltd.
 Douglas House, East Garston, Hungerford, Berkshire, RG17 7HD, UK
 Directors: Patrick Shaw Stewart, Peter Baldock, Stefan Kolek

 http://www.douglas.co.uk
 Tel: 44 (0) 148-864-9090    US toll-free 1-877-225-2034
 Regd. England 2177994, VAT Reg. GB 480 7371 36

