Re: [ccp4bb] To Trim or Not to To Trim

Lijun Liu Sun, 19 Mar 2023 21:31:25 -0700

James:

Sent from my iPhone

On Mar 19, 2023, at 2:37 PM, James Holton <jmhol...@lbl.gov> wrote:

They say one test is worth a thousand expert opinions, so I tried my hand at the former.

The question is: what is the right way to treat disordered side chains?:
a) omit atoms you cannot see
b) build them, and set occupancy to zero
c) build them, and "let the B factors take care of it"
d) none of the above

The answer, of course, is d).

Oh, c'mon. Yes, I know one of a,b, or c is what you've been doing your whole life. I do it too. But, let's face it: none of these solutions are perfect. So, the real question is not which one is "right", but which is the least wrong?

We all know what is really going on: the side chain is flapping around. No doubt it spends most of its time in energetically reasonable but nevertheless numerous conformations. There are 41 "Favorable" rotamers for Lys alone, and it doesn't take that many to spread the density thin enough to fall below the classical 1-sigma contour level. The atoms are still there, they are still contributing to the data, and they haven't gone far. So why don't we "just" model that? Already, I can hear the cries of "over-fitting!" and "observations/parameters!", "model bias!", and "think of the children!" Believe it or not, none of these are the major issue here. Allow me to demonstrate:

Consider a simple case where we have a Lys side chain in ten conformers. I chose from popular rotamers, but evenly spread. That is, all 10 conformers have an occupancy of 0.10, and there is a 3-3-4 split of chi1 values between minus, plus and trans. This will give the maximum contrast of density between CB and CG. Let us further require that there is no strain in this ground-truth. No stretched bonds, no tortured angles, no clashes, etc. Real molecules don't occupy such high-energy states unless they absolutely have to. Let us further assume that the bulk solvent works the way phenix models it, which is a probe radius of 1.1 A for both ions and aliphatics and a shrink radius of 0.9. But, instead of running one phenix.fmodel job, I ran ten: one for each conformer (A thru J). To add some excitement, I moved the main chain ~0.2 A in a random direction for each conformer. I then took these ten calculated electron density maps (bulk solvent and all) and added them together to form the ground truth for the following trials. Before refinement, I added noise consistent with an I/sigma of 50 and cut the resolution at 2.0 A. Wilson B is 50:

CCtrue Rwork% Rfree% fo-fc(sigma)   description
0.8943     9.05   10.60      5.9         stump at CB
0.9540     9.29 11.73      6.0        single conformer, zero occupancy
0.9471    10.35 15.04     5.1     single conformer, full occupancy, refmac5
0.9523     9.78 15.61      4.9     single conformer, full occupancy, phenix.refine

So, it would appear that the zero-occupancy choice "wins", but by the narrowest of margins. Here CCtrue is the Pearson correlation coefficient between the ground-truth right-answer electron density and the 2fofc map resulting from the refinement. Rwork and Rfree are the usual suspects, and fo-fc indicates the tallest peak in the difference map. Refinement was with refmac unless otherwise indicated. I think we often forget that both phenix and refmac restrain B factor values, not just through bonds but through space, and they use rather different algorithms. Refmac tries to make the histogram of B factors "look right", whereas phenix allows steeper gradients. I also ran all 10 correct rotamers separately and picked the one with the best CCtrue to show above. If you instead sort on Rfree (which you really shouldn't do), you get different bests, but they are not much better (as low as 10.5%). So, the winner here depends on how you score. CCtrue is the best score, but also unfortunately unavailable for real data.

It is perhaps interesting here that better CCtrue goes along with worse Rfree. This is not what I usually see in experiments like this. Rather, what I think is going on here is the system is frustrated. We are trying to fit various square pegs into a round hole, and none of them fit all that well.

In all cases here the largest difference peak was indicating another place to put the Lys, so why not build into that screaming, 6-sigma difference peak? Here is what happens when you do that:

CCtrue Rwork% Rfree% fo-fc(sigma)   description
0.8943     9.05 10.60      5.9 stump at CB
0.9580     9.95   11.60      6.4 stump at CG
0.9585    10.20   12.29      6.2 stump at CG, all 10 confs
0.9543    10.61   12.24      5.3 stump at CD, all 10 confs
0.9383    10.69   14.64      4.1         stump at CE, all 10 confs
0.9476     9.66   13.48      4.6         all atoms, all 10 confs
0.9214 7.09    11.8      5.6         three conformers (worst of 120 combos)
0.9718     6.53    8.55      4.3         three conformers (best of 120 combos)
0.9710 7.17    9.44      6.1         two conformers (best of 45 combos)
0.9471    10.35   15.04      5.1         single conformer (best of 10 choices)

If I add one CG, the other two chi1 positions light up. So, I tried building in all 10 true CG positions, and let the refinement decide what to do with them. The clear indication was that a CD should be added. After adding all the CDs, the difference peaks were weaker, but still indicating more atoms were needed. Rwork and Rfree, however, tell the opposite story. They get worse the more atoms you add. CCtrue, on the other hand, was best when cutting everything after CG. Why is that? Well, every time you add another atom you fill in the difference density, but then that atom pushes back the bulk solvent model that was filling in the density for the next atom. The atom-to-solvent distance is roughly twice that of a covalent bond. So again, square pegs and round holes.

Three conformers coming out as the winner may be because it is a selective process with a noisy score. In the ground truth there are 10 conformers at equal occupancy, so no one triplet is really any better than any other. However, one has a density shape that fits better than other combos. My search over all possible quartets is still running.

But what if we got the solvent "right"? Well, here is what that looks like:

CCtrue Rwork% Rfree% fo-fc(sigma)   description
0.9476     9.66   13.48      4.6         all atoms, all confs, refmac defaults
0.9696 6.15    8.88      3.7 all atoms, all confs, phenix.refine
0.9825 0.80    0.89      3.9 all atoms, all confs, true solvent
0.9824     0.92    1.26      7.3 true model, minus one H atom from ordered HIS side chain

You can see that the default solvent of phenix.refine fares better than refmac here, but since I generated the solvent with phenix refine it may have an unfair advantage. Nevertheless, providing the "true solvent" here is quite a striking drop in R factors. This is not surprising since this was the last systematic error in this ground truth. In all cases, I provided the true atomic positions at the start of refinement, so there was no confusion about strain-inducing local minima, such as which rotamer goes with which main chain shift. And yes, you can provide arbitrary bulk solvent maps to refmac5 using the "Fpart" feature. I've had good luck with real data using bulk density derived form MD simulations.

What is more, once the R factors are this low I can remove just one hydrogen atom and it comes back as a 7.3-sigma difference peak. This corresponds to the protonation state of that His. This kind of sensitivity is really attractive if you are looking for low-lying features, such as partially-occupied ligands. Some may pooh-pooh R factors as "cosmetic" features of structures, but they are, in fact, nothing more or less than the % error between your model and your data. This % error translates directly into the noise level of your map. At 20% error there is no hope whatsoever of seeing 1-electron changes. This is because hydrogen is only 17% of a carbon. But 3-5% error, which is a typical experimental error in crystallographic data, anything bigger than one electron is clear.

-James Holton
MAD Scientist

On 3/18/2023 2:10 PM, Nicholas Pearce wrote:

Not stupid, but essentially the same as modelling alt confs, though would probably give more overfitting. Alt confs can easily be converted to an ensemble (if done properly…).

Thanks,

Nick

———

Nicholas Pearce

Assistant Professor in Bioinformatics & DDLS Fellow

Linköping University

Sweden

From: CCP4 bulletin board <CCP4BB@JISCMAIL.AC.UK> on behalf of benjamin bax <ben.d.v....@gmail.com>
Sent: Saturday, March 18, 2023 10:07:26 PM
To: CCP4BB@JISCMAIL.AC.UK <CCP4BB@JISCMAIL.AC.UK>
Subject: Re: [ccp4bb] To Trim or Not to To Trim

Hi,
Probably a stupid question.
Could you multiply a, b and c cell dimensions by 2 or 3 (to give 8 or 27 structures) and restrain well defined parts of structure to be ‘identical’ ? To give you a more NMR like chemically sensible ensemble of structures?
Ben

> On 18 Mar 2023, at 12:04, Helen Ginn <ccp...@hginn.co.uk> wrote:
>
> Models for crystallography have two purposes: refinement and interpretation. Here these two purposes are in conflict. Neither case is handled well by either trim or not trim scenario, but trimming results in a deficit for refinement and not-trimming results in a deficit for interpretation.
>
> Our computational tools are not “fixed” in the same way that the standard amino acids are “fixed” or your government’s bureaucracy pathways are “fixed”. They are open for debate and for adjustments. This is a fine example where it may be more productive to discuss the options for making changes to the model itself or its representation, to better account for awkward situations such as these. Otherwise we are left figuring out the best imperfect way to use an imperfect tool (as all tools are, to varying degrees!), which isn’t satisfying for enough people, enough of the time.
>
> I now appreciate the hypocrisy in the argument “do not trim, but also don’t model disordered regions”, even though I’d be keen to avoid trimming. This discussion has therefore softened my own viewpoint.
>
> My refinement models (as implemented in Vagabond) do away with the concept of B factors precisely for the anguish it causes here, and refines a distribution of protein conformations which is sampled to generate an ensemble. By describing the conformations through the torsion angles that comprise the protein, modelling flexibility of a disordered lysine is comparatively trivial, and indeed modelling all possible conformations of a disordered loop becomes feasible. Lysines end up looking like a frayed end of a rope. Each conformation can produce its own solvent mask, which can be summed together to produce a blurring of density that matches what you would expect to see in the crystal.
>
> In my experience this doesn’t drop the R factors as much as you’d assume, because blurred out protein density does look very much like solvent, but it vastly improves the interpretability of the model. This also better models the boundary between the atoms you would trim and those you’d leave untrimmed, by avoiding such a binary distinction. No fear of trimming and pushing those errors unseen into the rest of the structure. No fear of leaving atoms in with an inadequate B factor model that cannot capture the nature of the disorder.
>
> Vagabond is undergoing a heavy rewrite though, and is not yet ready for human consumption. Its first iteration worked on single-dataset-single-model refinement, which handled disordered side chains well enough, with no need to decide to exclude atoms. The heart of the issue lies in main chain flexibility, and this must be handled correctly, for reasons of interpretability and elucidating the biological impact. This model isn’t perfect either, and necessitates its own compromises - but will provide another tool in the structural biology arsenal.
>
> —-
>
> Dr Helen Ginn
> Group leader, DESY
> Hamburg Advanced Research Centre for Bioorganic Chemistry (HARBOR)
> Luruper Chaussee 149
> 22607 Hamburg
>
> ########################################################################
>
> To unsubscribe from the CCP4BB list, click the following link:
> https://eur01.safelinks.protection.outlook.com/?url="">
>
> This message was issued to members of https://eur01.safelinks.protection.outlook.com/?url="">, a mailing list hosted by https://eur01.safelinks.protection.outlook.com/?url="">, terms & conditions are available at https://eur01.safelinks.protection.outlook.com/?url="">

########################################################################

To unsubscribe from the CCP4BB list, click the following link:
https://eur01.safelinks.protection.outlook.com/?url="">

This message was issued to members of https://eur01.safelinks.protection.outlook.com/?url="">, a mailing list hosted by https://eur01.safelinks.protection.outlook.com/?url="">, terms & conditions are available at https://eur01.safelinks.protection.outlook.com/?url="">

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

To unsubscribe from the CCP4BB list, click the following link:
https://www.jiscmail.ac.uk/cgi-bin/WA-JISC.exe?SUBED1=CCP4BB&A=1

Re: [ccp4bb] To Trim or Not to To Trim

Reply via email to