Re: [ccp4bb] To Trim or Not to Trim
Hi James,

First of all, I think all these trim-or-not-to-trim practices are compromises for cases where the data do not offer local density strong enough to model a side chain reliably. That so many smart pioneers could not reach a simple agreement means this art also has a personality-dependent component. Depending on how the model builder interprets and how the reader understands it, I am OK with (a) or (c) in your list, and I personally dislike the zero-occupancy treatment (sorry if this annoys many people). The reason: if a "side chain conformation" is modeled at zero occupancy, the information reads as "all other conformations are possible (their occupancies sum to 1.0) except this one", yet the modeled conformation is normally the preferred one, and it is the only one declared impossible by its zero occupancy. Logically, a paradox.

Your test modeling and refinement of a lysine is interesting, but the experiment may not be a flawless comparison of the different strategies. Such an experiment should be based on a dataset that genuinely makes side chain modeling difficult, with very weak density around the expected location. The huge spread of CCtrue in your experiment probably just reflects model completeness (4 out of 10 non-H atoms are missing in the trimmed treatment), and from your description your dataset may actually allow building the side chain, fully or partially, in one or more conformations. Also, is this system too small?
In a normal macromolecular refinement (say, more than 100 residues), not to say it is unimportant, but treating a few side chains in different ways may not shift the refinement statistics very significantly (unlike a single bond's huge deviation, which can increase the overall bond rmsd several-fold).

Lijun
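Lijun's closing point, that a single severely strained bond can inflate the overall bond rmsd far more than the treatment of a few disordered side chains, is easy to check with a back-of-the-envelope calculation. All numbers below are made up for illustration:

```python
import math

# Hypothetical numbers: ~800 bond lengths in a ~100-residue model,
# each deviating ~0.01 A from ideal geometry.
n_bonds = 800
deviations = [0.01] * n_bonds

rmsd_before = math.sqrt(sum(d * d for d in deviations) / n_bonds)

# Now introduce one badly strained bond deviating by 1.0 A.
deviations[0] = 1.0
rmsd_after = math.sqrt(sum(d * d for d in deviations) / n_bonds)

print(round(rmsd_before, 4))  # 0.01
print(round(rmsd_after, 4))   # roughly 3-4x larger
```

A single outlier dominates because rmsd sums squared deviations: one 1.0 A deviation contributes as much as ten thousand 0.01 A deviations.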
Re: [ccp4bb] To Trim or Not to Trim
They say one test is worth a thousand expert opinions, so I tried my hand at the former. The question is: what is the right way to treat disordered side chains?

a) omit atoms you cannot see
b) build them, and set occupancy to zero
c) build them, and "let the B factors take care of it"
d) none of the above

The answer, of course, is d). Oh, c'mon. Yes, I know one of a, b, or c is what you've been doing your whole life. I do it too. But, let's face it: none of these solutions are perfect. So, the real question is not which one is "right", but which is the least wrong?

We all know what is really going on: the side chain is flapping around. No doubt it spends most of its time in energetically reasonable but nevertheless numerous conformations. There are 41 "Favorable" rotamers for Lys alone, and it doesn't take that many to spread the density thin enough to fall below the classical 1-sigma contour level. The atoms are still there, they are still contributing to the data, and they haven't gone far. So why don't we "just" model that?

Already, I can hear the cries of "over-fitting!", "observations/parameters!", "model bias!", and "think of the children!" Believe it or not, none of these are the major issue here. Allow me to demonstrate.

Consider a simple case where we have a Lys side chain in ten conformers. I chose from popular rotamers, but evenly spread. That is, all 10 conformers have an occupancy of 0.10, and there is a 3-3-4 split of chi1 values between minus, plus and trans. This will give the maximum contrast of density between CB and CG. Let us further require that there is no strain in this ground truth: no stretched bonds, no tortured angles, no clashes, etc. Real molecules don't occupy such high-energy states unless they absolutely have to. Let us further assume that the bulk solvent works the way phenix models it, which is a probe radius of 1.1 A for both ions and aliphatics and a shrink radius of 0.9.
But, instead of running one phenix.fmodel job, I ran ten: one for each conformer (A thru J). To add some excitement, I moved the main chain ~0.2 A in a random direction for each conformer. I then took these ten calculated electron density maps (bulk solvent and all) and added them together to form the ground truth for the following trials. Before refinement, I added noise consistent with an I/sigma of 50 and cut the resolution at 2.0 A. Wilson B is 50:

CCtrue  Rwork%  Rfree%  fo-fc(sigma)  description
0.8943   9.05   10.60   5.9   stump at CB
0.9540   9.29   11.73   6.0   single conformer, zero occupancy
0.9471  10.35   15.04   5.1   single conformer, full occupancy, refmac5
0.9523   9.78   15.61   4.9   single conformer, full occupancy, phenix.refine

So, it would appear that the zero-occupancy choice "wins", but by the narrowest of margins. Here CCtrue is the Pearson correlation coefficient between the ground-truth right-answer electron density and the 2fofc map resulting from the refinement. Rwork and Rfree are the usual suspects, and fo-fc indicates the tallest peak in the difference map. Refinement was with refmac unless otherwise indicated.

I think we often forget that both phenix and refmac restrain B factor values, not just through bonds but through space, and they use rather different algorithms. Refmac tries to make the histogram of B factors "look right", whereas phenix allows steeper gradients.

I also ran all 10 correct rotamers separately and picked the one with the best CCtrue to show above. If you instead sort on Rfree (which you really shouldn't do), you get different bests, but they are not much better (as low as 10.5%). So, the winner here depends on how you score. CCtrue is the best score, but also unfortunately unavailable for real data. It is perhaps interesting here that better CCtrue goes along with worse Rfree. This is not what I usually see in experiments like this. Rather, what I think is going on here is the system is frustrated.
We are trying to fit various square pegs into a round hole, and none of them fit all that well. In all cases here the largest difference peak was indicating another place to put the Lys, so why not build into that screaming, 6-sigma difference peak? Here is what happens when you do that:

CCtrue  Rwork%  Rfree%  fo-fc(sigma)  description
0.8943   9.05   10.60   5.9   stump at CB
0.9580   9.95   11.60   6.4   stump at CG
0.9585  10.20   12.29   6.2   stump at CG, all 10 confs
0.9543  10.61   12.24   5.3   stump at CD, all 10 confs
0.9383  10.69   14.64   4.1   stump at CE, all 10 confs
0.9476   9.66   13.48   4.6   all atoms, all 10 confs
0.9214   7.09   11.80   5.6   three conformers (worst of 120 combos)
0.9718   6.53    8.55   4.3   three conformers (best of 120 combos)
0.9710   7.17    9.44   6.1   two conformers (best of 45 combos)
0.9471  10.35   15.04   5.1   single conformer (best of 10 choices)
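For anyone wanting to reproduce the scoring, the CCtrue metric described above (Pearson correlation between the ground-truth density and the refined 2fofc map) is straightforward to compute once both maps are on the same grid. A minimal sketch with numpy, using random arrays as stand-ins for real maps (reading actual CCP4/MRC map files, e.g. with gemmi, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the ten per-conformer calculated maps, each taken at
# occupancy 0.10 and summed to form the ground truth (as in the ten
# phenix.fmodel runs described in the post):
conformer_maps = [rng.random((8, 8, 8)) for _ in range(10)]
ground_truth = 0.10 * np.sum(conformer_maps, axis=0)

# Stand-in for the 2fofc map coming out of a refinement:
two_fofc = ground_truth + 0.02 * rng.standard_normal(ground_truth.shape)

def cc_true(map_a, map_b):
    """Pearson correlation coefficient between two density maps."""
    return float(np.corrcoef(map_a.ravel(), map_b.ravel())[0, 1])

print(cc_true(ground_truth, two_fofc))
```

As noted in the post, this score needs the ground truth, so it is only available in simulations like this one, not for real data.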
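The combination counts in the table (120 three-conformer and 45 two-conformer combinations of the ten candidates) follow from simple combinatorics. A sketch of the enumerate-and-score loop, with a made-up scoring function standing in for an actual refinement run:

```python
from itertools import combinations

# The ten candidate conformers from the post, labeled A through J.
conformers = "ABCDEFGHIJ"

triples = list(combinations(conformers, 3))
pairs = list(combinations(conformers, 2))
print(len(triples))  # 120 three-conformer combinations
print(len(pairs))    # 45 two-conformer combinations

def score(combo):
    # Hypothetical stand-in for CCtrue; in the real experiment each
    # combination would be built, refined, and scored against the truth.
    return sum(ord(c) for c in combo) % 7

best_triple = max(triples, key=score)
worst_triple = min(triples, key=score)
```

This is the "best of 120 combos" / "worst of 120 combos" bookkeeping from the table, with refinement replaced by a dummy function.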
Re: [ccp4bb] To Trim or Not to Trim
Dear all,

Well, at least ChatGPT is still way off the heart of the matter and several steps behind some of the best views in the community. I asked it:

Please write a paragraph on the subject: "10 things I hate about structural biology data deposition policy"

It gave me this:

"As an AI language model, I do not have personal opinions or feelings. However, I can provide a general response to the topic you have given me. Structural biology data deposition policies are critical to ensure that research data is accessible and reproducible, facilitating scientific progress. However, some researchers may find these policies frustrating, especially when they encounter obstacles during the process. Ten things that researchers may dislike about structural biology data deposition policies could include complicated data formats, unclear data preparation instructions, data release delays, data sharing restrictions, lack of feedback from data repositories, non-standardized metadata requirements, confidentiality concerns, data loss risks, high submission fees, and legal obligations. Despite these challenges, data deposition policies remain necessary to promote transparency, collaboration, and scientific integrity in the field of structural biology."

Ciao

Pietro

On 18-03-2023 10:28, Dale Tronrud wrote:

I'm going to dive back in here again to expand this discussion. Whether this diversion clarifies or obscures issues surrounding the "crystallographers' dilemma" I'll leave for others to decide.

There is currently considerable discussion, among people who care about cell phone cameras, over the behavior of the cameras in some Samsung cell phones when photos are taken that include the Moon.

https://www.reddit.com/r/Android/comments/11nzrb0/samsung_space_zoom_moon_shots_are_fake_and_here/

In this post, evidence is presented that when a photo taken with one of these phone cameras includes the Earth's largest satellite, the image shows a properly exposed and detailed picture of the orb.
This is despite the fact that the test photograph described in the post is of a tableau containing a deliberately blurred photo of the cratered globe. The claim is made that the Samsung app is adding information from sources other than the camera's light sensor, and that the image is therefore "fake". I expect that Samsung would reply that, once the app is confident that the silver disk in the image is the traditional nighttime symbol of romance, it is perfectly reasonable to make that now-identified aerial phenomenon appear in the image as expected by every sighted human in the history of our species. There have been billions of high-quality photos of the silicate sphere taken. (This is both literally true and a gross underestimate.) How can the photo be fake if it better reflects what the photographer saw than what can be deduced from only the raw pixels of the sensor?

Of course, this example differs only in degree from common practice going back to the beginning of photography. Photos have always been modified, sometimes in order to deceive the viewer, but most often to make the photos more like what the photographer believed the scene actually looked like. For example, in nearly every photo I take I "correct" the color balance. Is the photo with a detailed Moon fake? Are my photos taken on the forest floor, but without everything being some shade of green, fake? I think most people would be satisfied if there were a way for them to know what sources of information were used in creating the image.

We, as scientists, are much more demanding of our PDB models. We build better models when we use all the knowledge at our disposal. If we are interpreting a 9 A resolution map of hemoglobin and see a disk-shaped piece of density where we know the heme goes, we are perfectly justified in building an atomic model of the heme.
We are also obliged to make clear that the exact atomic positions, bond lengths, and angles were not derived from that map, just as a journalist needs to make clear to the reader that their photo has been processed to include detail which was not present when the image was "taken". I have deposited models that contained features which were only "consistent" with the electron density but were supported by enough other forms of evidence to make me confident in their existence. I have done my best to make the justification of these models clear in the reports I have written, but I continue to be frustrated by the lack of tools to represent, WITHIN the deposition, the precise interplay of data sources that support my model. I am not so naive as to believe that everyone who has cited my papers has actually read them.

Dale E. Tronrud

On 3/10/2023 1:05 AM, Julia Griese wrote:

Hi all,

My impression has been that the most common approach these days is to "let the B-factors take care of it", but I might be wrong. Maybe it's time to run another poll? Personally, I call any other approach R-facto