They say one test is worth a thousand expert opinions, so I tried my
hand at the former.
The question is: what is the right way to treat disordered side
chains?:
a) omit atoms you cannot see
b) build them, and set occupancy to zero
c) build them, and "let the B factors take care of it"
d) none of the above
The answer, of course, is d).
Oh, c'mon. Yes, I know one of a,b, or c is what you've been doing
your whole life. I do it too. But, let's face it: none of these
solutions are perfect. So, the real question is not which one is
"right", but which is the least wrong?
We all know what is really going on: the side chain is flapping
around. No doubt it spends most of its time in energetically
reasonable but nevertheless numerous conformations. There are 41
"Favorable" rotamers for Lys alone, and it doesn't take that many to
spread the density thin enough to fall below the classical 1-sigma
contour level. The atoms are still there, they are still
contributing to the data, and they haven't gone far. So why don't we
"just" model that? Already, I can hear the cries of "over-fitting!"
and "observations/parameters!", "model bias!", and "think of the
children!" Believe it or not, none of these are the major issue
here. Allow me to demonstrate:
Consider a simple case where we have a Lys side chain in ten
conformers. I chose from popular rotamers, but evenly spread. That
is, all 10 conformers have an occupancy of 0.10, and there is a
3-3-4 split of chi1 values between minus, plus and trans. This will
give the maximum contrast of density between CB and CG. Let us
further require that there is no strain in this ground-truth. No
stretched bonds, no tortured angles, no clashes, etc. Real
molecules don't occupy such high-energy states unless they
absolutely have to. Let us further assume that the bulk solvent
works the way phenix models it, which is a probe radius of 1.1 A for
both ions and aliphatics and a shrink radius of 0.9. But, instead
of running one phenix.fmodel job, I ran ten: one for each conformer
(A thru J). To add some excitement, I moved the main chain ~0.2 A
in a random direction for each conformer. I then took these ten
calculated electron density maps (bulk solvent and all) and added
them together to form the ground truth for the following trials.
Before refinement, I added noise consistent with an I/sigma of 50
and cut the resolution at 2.0 A. Wilson B is 50:
CCtrue Rwork% Rfree% fo-fc(sigma)
description
0.8943 9.05 10.60 5.9 stump at CB
0.9540 9.29 11.73 6.0 single conformer, zero
occupancy
0.9471 10.35 15.04 5.1 single conformer, full
occupancy, refmac5
0.9523 9.78 15.61 4.9 single conformer, full
occupancy, phenix.refine
So, it would appear that the zero-occupancy choice "wins", but by
the narrowest of margins. Here CCtrue is the Pearson correlation
coefficient between the ground-truth right-answer electron density
and the 2fofc map resulting from the refinement. Rwork and Rfree
are the usual suspects, and fo-fc indicates the tallest peak in the
difference map. Refinement was with refmac unless otherwise
indicated. I think we often forget that both phenix and refmac
restrain B factor values, not just through bonds but through space,
and they use rather different algorithms. Refmac tries to make the
histogram of B factors "look right", whereas phenix allows steeper
gradients. I also ran all 10 correct rotamers separately and picked
the one with the best CCtrue to show above. If you instead sort on
Rfree (which you really shouldn't do), you get different bests, but
they are not much better (as low as 10.5%). So, the winner here
depends on how you score. CCtrue is the best score, but also
unfortunately unavailable for real data.
It is perhaps interesting here that better CCtrue goes along with
worse Rfree. This is not what I usually see in experiments like
this. Rather, what I think is going on here is the system is
frustrated. We are trying to fit various square pegs into a round
hole, and none of them fit all that well.
In all cases here the largest difference peak was indicating another
place to put the Lys, so why not build into that screaming, 6-sigma
difference peak? Here is what happens when you do that:
CCtrue Rwork% Rfree% fo-fc(sigma)
description
0.8943 9.05 10.60 5.9
stump
at CB
0.9580 9.95 11.60 6.4
stump
at CG
0.9585
10.20
12.29 6.2
stump
at CG, all 10
confs
0.9543 10.61 12.24 5.3
stump
at CD, all 10
confs
0.9383 10.69 14.64
4.1
stump at CE,
all 10 confs
0.9476 9.66 13.48
4.6
all atoms, all
10 confs
0.9214 7.09 11.8
5.6
three
conformers
(worst of 120
combos)
0.9718
6.53
8.55
4.3
three
conformers
(best of 120
combos)
0.9710 7.17 9.44
6.1
two conformers
(best of 45
combos)
0.9471
10.35
15.04
5.1
single
conformer
(best of 10
choices)
If I add one CG, the other two chi1 positions light up. So, I tried
building in all 10 true CG positions, and let the refinement decide
what to do with them. The clear indication was that a CD should be
added. After adding all the CDs, the difference peaks were weaker,
but still indicating more atoms were needed. Rwork and Rfree,
however, tell the opposite story. They get worse the more atoms you
add. CCtrue, on the other hand, was best when cutting everything
after CG. Why is that? Well, every time you add another atom you
fill in the difference density, but then that atom pushes back the
bulk solvent model that was filling in the density for the next
atom. The atom-to-solvent distance is roughly twice that of a
covalent bond. So again, square pegs and round holes.
Three conformers coming out as the winner may be because it is a
selective process with a noisy score. In the ground truth there are
10 conformers at equal occupancy, so no one triplet is really any
better than any other. However, one has a density shape that fits
better than other combos. My search over all possible quartets is
still running.
But what if we got the solvent "right"? Well, here is what that
looks like:
CCtrue
Rwork% Rfree%
fo-fc(sigma)
description
0.9476
9.66
13.48
4.6
all atoms, all
confs, refmac
defaults
0.9696
6.15
8.88 3.7
all
atoms, all
confs,
phenix.refine
0.9825 0.80 0.89 3.9 all
atoms, all confs, true solvent
0.9824 0.92 1.26 7.3
true
model, minus
one H atom
from ordered
HIS side chain
You can see that the default solvent of phenix.refine fares better
than refmac here, but since I generated the solvent with phenix
refine it may have an unfair advantage. Nevertheless, providing the
"true solvent" here is quite a striking drop in R factors. This is
not surprising since this was the last systematic error in this
ground truth. In all cases, I provided the true atomic positions at
the start of refinement, so there was no confusion about
strain-inducing local minima, such as which rotamer goes with which
main chain shift. And yes, you can provide arbitrary bulk solvent
maps to refmac5 using the "Fpart" feature. I've had good luck with
real data using bulk density derived form MD simulations.
What is more, once the R factors are this low I can remove just one
hydrogen atom and it comes back as a 7.3-sigma difference peak. This
corresponds to the protonation state of that His. This kind of
sensitivity is really attractive if you are looking for low-lying
features, such as partially-occupied ligands. Some may pooh-pooh R
factors as "cosmetic" features of structures, but they are, in fact,
nothing more or less than the % error between your model and your
data. This % error translates directly into the noise level of your
map. At 20% error there is no hope whatsoever of seeing 1-electron
changes. This is because hydrogen is only 17% of a carbon. But 3-5%
error, which is a typical experimental error in crystallographic
data, anything bigger than one electron is clear.
-James Holton
MAD Scientist
On 3/18/2023 2:10 PM, Nicholas Pearce
wrote:
Not stupid, but essentially the same as
modelling alt confs, though would probably give more
overfitting. Alt confs can easily be converted to an
ensemble (if done properly…).
Thanks,
Nick
———
Nicholas Pearce
Assistant Professor in Bioinformatics &
DDLS Fellow
Linköping University
Sweden
Hi,
Probably a stupid question.
Could you multiply a, b and c cell dimensions by 2 or 3
(to give 8 or 27 structures) and restrain well defined
parts of structure to be ‘identical’ ? To give you a more
NMR like chemically sensible ensemble of structures?
Ben
> On 18 Mar 2023, at 12:04, Helen Ginn
<ccp...@hginn.co.uk> wrote:
>
> Models for crystallography have two purposes:
refinement and interpretation. Here these two purposes are
in conflict. Neither case is handled well by either trim
or not trim scenario, but trimming results in a deficit
for refinement and not-trimming results in a deficit for
interpretation.
>
> Our computational tools are not “fixed” in the same
way that the standard amino acids are “fixed” or your
government’s bureaucracy pathways are “fixed”. They are
open for debate and for adjustments. This is a fine
example where it may be more productive to discuss the
options for making changes to the model itself or its
representation, to better account for awkward situations
such as these. Otherwise we are left figuring out the best
imperfect way to use an imperfect tool (as all tools are,
to varying degrees!), which isn’t satisfying for enough
people, enough of the time.
>
> I now appreciate the hypocrisy in the argument “do
not trim, but also don’t model disordered regions”, even
though I’d be keen to avoid trimming. This discussion has
therefore softened my own viewpoint.
>
> My refinement models (as implemented in Vagabond) do
away with the concept of B factors precisely for the
anguish it causes here, and refines a distribution of
protein conformations which is sampled to generate an
ensemble. By describing the conformations through the
torsion angles that comprise the protein, modelling
flexibility of a disordered lysine is comparatively
trivial, and indeed modelling all possible conformations
of a disordered loop becomes feasible. Lysines end up
looking like a frayed end of a rope. Each conformation can
produce its own solvent mask, which can be summed together
to produce a blurring of density that matches what you
would expect to see in the crystal.
>
> In my experience this doesn’t drop the R factors as
much as you’d assume, because blurred out protein density
does look very much like solvent, but it vastly improves
the interpretability of the model. This also better models
the boundary between the atoms you would trim and those
you’d leave untrimmed, by avoiding such a binary
distinction. No fear of trimming and pushing those errors
unseen into the rest of the structure. No fear of leaving
atoms in with an inadequate B factor model that cannot
capture the nature of the disorder.
>
> Vagabond is undergoing a heavy rewrite though, and is
not yet ready for human consumption. Its first iteration
worked on single-dataset-single-model refinement, which
handled disordered side chains well enough, with no need
to decide to exclude atoms. The heart of the issue lies in
main chain flexibility, and this must be handled
correctly, for reasons of interpretability and elucidating
the biological impact. This model isn’t perfect either,
and necessitates its own compromises - but will provide
another tool in the structural biology arsenal.
>
> —-
>
> Dr Helen Ginn
> Group leader, DESY
> Hamburg Advanced Research Centre for Bioorganic
Chemistry (HARBOR)
> Luruper Chaussee 149
> 22607 Hamburg
>
>
########################################################################
>
> To unsubscribe from the CCP4BB list, click the
following link:
>
https://eur01.safelinks.protection.outlook.com/?url="">
>
> This message was issued to members of
https://eur01.safelinks.protection.outlook.com/?url="">,
a mailing list hosted by
https://eur01.safelinks.protection.outlook.com/?url="">,
terms & conditions are available at
https://eur01.safelinks.protection.outlook.com/?url="">
########################################################################
To unsubscribe from the CCP4BB list, click the following
link:
https://eur01.safelinks.protection.outlook.com/?url="">
This message was issued to members of
https://eur01.safelinks.protection.outlook.com/?url="">,
a mailing list hosted by
https://eur01.safelinks.protection.outlook.com/?url="">,
terms & conditions are available at
https://eur01.safelinks.protection.outlook.com/?url="">
To unsubscribe from the CCP4BB list, click the
following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1
To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1