Well, when it comes to observations/parameters, it is important to remember that not all observations are created equal. 10,000 "observations" with I/sigma = 1 are definitely not as desirable as 10,000 observations with I/sigma = 10. Not all parameters are created equal either. Yes, you may have 3,000 atoms, but there are bonds between them and that means you don't REALLY have 3,000*3 degrees of freedom. How many "parameters" are removed by each bond, of course, depends on how tight your geometry weight is. So, although the observations/parameters rule of thumb is useful for knowing roughly just how much you are asking of your fitting program, it is a qualitative assessment only. Not quantitative.

It is instructive to consider the 1-dimensional case. Say you have 100 data points, evenly spaced, and you are fitting a curve to them. If you fit a 100th-order polynomial to these points, then you can always get your curve to pass straight through every single point. But do you want to do that? What if the error bars are huge and you can see that the points follow a much smoother curve? In that case, you definitely want to reduce the number of "parameters" so that you are not "over-fitting". But how can you tell if your simplified model is plausible? Well, one way to do it is to leave some of the observations out of the fit and see if a curve fit to the remaining ones predicts those points reasonably well. This is called a "cross check" (aka Rfree).
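
If you want to see the cross-check idea in action, here is a toy Python sketch (the underlying curve, the noise level, and the polynomial orders are all just made up for illustration): fit the "work" points with polynomials of increasing order and watch how well each fit predicts the held-out "free" points.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y_true = np.sin(2.0 * np.pi * x)              # the smooth curve underneath
y = y_true + rng.normal(0.0, 0.5, x.size)     # data points with big error bars

free = np.zeros(x.size, dtype=bool)
free[::10] = True                             # hold out every 10th point (the "free" set)
work = ~free

for deg in (3, 30, 60):
    fit = np.polynomial.Chebyshev.fit(x[work], y[work], deg)
    rms_work = np.sqrt(np.mean((fit(x[work]) - y[work]) ** 2))
    rms_free = np.sqrt(np.mean((fit(x[free]) - y[free]) ** 2))
    print(f"degree {deg:2d}: work rms {rms_work:.2f}  free rms {rms_free:.2f}")

The work rms keeps dropping as you add parameters, but the free rms bottoms out and then climbs once you start chasing the noise, and that is exactly the signal you use to decide how many parameters the data can support.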

The equivalent of "resolution" in this case is the scale on the x-axis. Yes, a scale factor. 100 points along the x-axis with a unit cell length of 200 A is equivalent to "2A data", but if you change the unit cell to 300 A and you still have 100 "samples", then that is equivalent to 3A data with the same number of "observations". But wait! Shouldn't 2A data always be "better" than 3A data if everything else is equal? Well, the difference comes from knowing the "scale" of what you are trying to measure. If you're trying to assign an atom-atom distance to either a hydrogen bond or a van der Waals bump, then you need to tell ~2.0 A from ~3.5 A, so now suddenly the scale of the x-axis (200 A vs 300 A) matters. The problem becomes one of propagating the error bars of the data (on the y-axis) into error bars on the parameters of the fit (on the x-axis). For my 1-D case of 100 points, this is equivalent to "knowing" how smooth your fitted function should be. Yes, having more data points can be "better", but it is the size of your error bars relative to what you are trying to measure that is actually important.
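
To make that error-propagation step concrete, here is a rough sketch (using scipy; the peak shape, sampling interval, and noise level are invented numbers, only chosen to feel vaguely like ~3 A data) of pushing the y-axis error bars through a fit and out the other side as an error bar on a position along x:

import numpy as np
from scipy.optimize import curve_fit

def gauss(x, height, centre, width):
    return height * np.exp(-0.5 * ((x - centre) / width) ** 2)

rng = np.random.default_rng(1)
x = np.arange(0.0, 10.0, 0.5)                  # sampling interval along the x axis
sigma_y = 0.2                                  # error bar on each observation
y = gauss(x, 1.0, 5.0, 1.5) + rng.normal(0.0, sigma_y, x.size)

popt, pcov = curve_fit(gauss, x, y, p0=[1.0, 5.0, 1.5],
                       sigma=np.full(x.size, sigma_y), absolute_sigma=True)
print(f"fitted centre = {popt[1]:.2f} +/- {np.sqrt(pcov[1, 1]):.2f}")

The number that matters is that +/- on the fitted centre compared to the 1.5 A gap between a ~2.0 A hydrogen bond and a ~3.5 A van der Waals bump, not the raw count of data points.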

So, one way of thinking about the difference between 3A data with 50% solvent vs 3.6A data with 80% solvent is to "scale" the 3.0A unit cell so that it has the same volume as the 3.6A crystal's unit cell. Then you effectively have two structures with different observations/parameters and the SAME resolution (because you have changed the scale of space for one of them). The re-scaling is mathematically equivalent to stretching the 3.0 A electron density map, so all you have done is "inflate" the protein so that the atoms are now farther apart. Does this make them easier to distinguish? No, because although they are farther apart, they are also "fatter". Stretching the map changes both the peak widths and the distances between them proportionally. The width of atomic peaks is actually very closely related to the resolution (especially at ~3A), so after stretching the map the peak widths in both the 3A/50% and 3.6A/80% cases will be about the same, but the distances between the peaks will be larger in the map that came from the 3A/50% data. So, relatively speaking (delta-bond/bond_length), you still have more "accuracy" with the 3A case, no matter what the observations/parameters ratio is.
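
Putting rough numbers on that stretching argument (the 1.54 A C-C bond is just a convenient example distance):

cell_volume_ratio = 0.5 / 0.2                 # same protein, 50% vs 80% solvent
stretch = cell_volume_ratio ** (1.0 / 3.0)    # linear stretch factor, ~1.36

print(f"stretched 3.0 A peak-width scale: {3.0 * stretch:.2f} A  (vs 3.6 A in the big cell)")
print(f"stretched C-C separation        : {1.54 * stretch:.2f} A  (vs 1.54 A in the big cell)")

The peak widths come out roughly comparable, but the peak-to-peak distances from the 3 A data are ~36% bigger, which is why delta-bond/bond_length still favours the 3 A case.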

Of course, all this is assuming that all you are interested in is bond lengths. If you happen to know your bond lengths already (such as covalent bonds), then that changes the relationship between the errors in your data and the parameter you are trying to measure. To put things another way, a 60 nm resolution 3D reconstruction of a 5-micron-wide cell represents about as many "observations" as a 1.2A crystal structure of a 100 A unit cell. Which is "more accurate"? Depends on the question you are trying to ask.
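
The arithmetic behind that comparison, counting (object size / resolution)^3 "resolution elements" in each case:

cryo_elements = (5000.0 / 60.0) ** 3    # 5 micron cell at 60 nm resolution (both in nm)
xtal_elements = (100.0 / 1.2) ** 3      # 100 A unit cell at 1.2 A resolution (both in A)
print(f"60 nm map of a 5 micron cell    : {cryo_elements:.1e} elements")
print(f"1.2 A structure in a 100 A cell : {xtal_elements:.1e} elements")

Both come out near 5.8e5 resolution elements, which is why they represent "about as many" observations.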

So, to answer the OP's question, I still think 3.0A is better than 3.6A. Yes, higher solvent content gives you better phases, and phase accuracy IS important for placing atoms (30 degrees of phase error at 3 A means that the spatial waves at that resolution are "off" by an average of ~0.25 A). But, that advantage is really only in the initial stages of phasing, and it fades as soon as the experimental phases start holding your refinement back more than they help (which actually happens rather quickly). Remember, the 'bulk solvent' model, as far as the phases are concerned, is really just another kind of "solvent flattening".
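
Where that ~0.25 A comes from: a phase error slides a Fourier component sideways by (phase error / 360 degrees) of its period, and at 3 A resolution the period of those waves is 3 A.

phase_error_deg = 30.0
d_spacing = 3.0
shift = (phase_error_deg / 360.0) * d_spacing
print(f"{phase_error_deg:.0f} deg phase error at {d_spacing:.1f} A -> ~{shift:.2f} A positional shift")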

-James Holton
MAD Scientist

On 3/14/2013 5:27 PM, Guangyu Zhu wrote:
I have this question. For example, a protein could be crystallized in two crystal forms. The two crystal forms have the same space group and 1 molecule/asymm. One crystal form diffracts to 3A with 50% solvent, and the other diffracts to 3.6A with 80% solvent. The cell volume of the 3.6A crystal must be 5/2 = 2.5 times larger because of the higher solvent content. If both datasets are collected to the same completeness (say 100%), the 3.6A data actually have a higher data/parameter ratio, (5/2)/(3.6/3)**3 = 1.45 times that of the 3A data. For refinement, a better data/parameter ratio should give a more accurate structure, i.e. the 3.6A data are better. But higher resolution should give a better-resolved electron density map. So which crystal form really gives a better (more reliable and accurate) protein structure?
