Well, when it comes to observations/parameters, it is important to remember that not all observations are created equal. 10,000 "observations" with I/sigma = 1 are definitely not as desirable as 10,000 observations with I/sigma = 10. Not all parameters are created equal either. Yes, you may have 3,000 atoms, but there are bonds between them and that means you don't REALLY have 3,000*3 degrees of freedom. How many "parameters" are removed by each bond, of course, depends on how tight your geometry weight is. So, although the observations/parameters rule of thumb is useful for knowing roughly just how much you are asking of your fitting program, it is a qualitative assessment only. Not quantitative.

It is instructive to consider the 1-dimensional case. Say you have 100 data points, evenly spaced, and you are fitting a curve to them. If you fit a 100th-order polynomial to these points, then you can always get your curve to pass straight through every single point. But do you want to do that? What if the error bars are huge and you can see that the points follow a much smoother curve? In that case, you definitely want to reduce the number of "parameters" so that you are not "over-fitting". But how can you tell if your simplified model is plausible? Well, one way to do it is to leave some of the observations out of the fit and see if a curve fit to the remaining ones predicts those points reasonably well. This is called a "cross check" (aka Rfree).
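
If you want to see the cross-check idea in action, here is a toy Python sketch (the underlying curve, the noise level, and the polynomial orders are all just made up for illustration): fit the "work" points with polynomials of increasing order and watch how well each fit predicts the held-out "free" points.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y_true = np.sin(2.0 * np.pi * x)              # the smooth curve underneath
y = y_true + rng.normal(0.0, 0.5, x.size)     # data points with big error bars

free = np.zeros(x.size, dtype=bool)
free[::10] = True                             # hold out every 10th point (the "free" set)
work = ~free

for deg in (3, 30, 60):
    fit = np.polynomial.Chebyshev.fit(x[work], y[work], deg)
    rms_work = np.sqrt(np.mean((fit(x[work]) - y[work]) ** 2))
    rms_free = np.sqrt(np.mean((fit(x[free]) - y[free]) ** 2))
    print(f"degree {deg:2d}: work rms {rms_work:.2f}  free rms {rms_free:.2f}")

The work rms keeps dropping as you add parameters, but the free rms bottoms out and then climbs once you start chasing the noise, and that is exactly the signal you use to decide how many parameters the data can support.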

The equivalent of "resolution" in this case is the scale on the x-axis. Yes, a scale factor. 100 points along the x-axis with a unit cell length of 200 A is equivalent to "2A data", but if you change the unit cell to 300 A and you still have 100 "samples", then that is equivalent to 3A data with the same number of "observations". But wait! Shouldn't 2A data always be "better" than 3A data if everything else is equal? Well, the difference comes from knowing the "scale" of what you are trying to measure. If you're trying to assign an atom-atom distance to either a hydrogen bond or a van der Waals bump, then you need to tell ~2.0 A from ~3.5 A, so now suddenly the scale of the x-axis (200 A vs 300 A) matters. The problem becomes one of propagating the error bars of the data (on the y-axis) into error bars on the parameters of the fit (on the x-axis). For my 1-D case of 100 points, this is equivalent to "knowing" how smooth your fitted function should be. Yes, having more data points can be "better", but it is the size of your error bars relative to what you are trying to measure that is actually important.
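
To make that error-propagation step concrete, here is a rough sketch (using scipy; the peak shape, sampling interval, and noise level are invented numbers, only chosen to feel vaguely like ~3 A data) of pushing the y-axis error bars through a fit and out the other side as an error bar on a position along x:

import numpy as np
from scipy.optimize import curve_fit

def gauss(x, height, centre, width):
    return height * np.exp(-0.5 * ((x - centre) / width) ** 2)

rng = np.random.default_rng(1)
x = np.arange(0.0, 10.0, 0.5)                  # sampling interval along the x axis
sigma_y = 0.2                                  # error bar on each observation
y = gauss(x, 1.0, 5.0, 1.5) + rng.normal(0.0, sigma_y, x.size)

popt, pcov = curve_fit(gauss, x, y, p0=[1.0, 5.0, 1.5],
                       sigma=np.full(x.size, sigma_y), absolute_sigma=True)
print(f"fitted centre = {popt[1]:.2f} +/- {np.sqrt(pcov[1, 1]):.2f}")

The number that matters is that +/- on the fitted centre compared to the 1.5 A gap between a ~2.0 A hydrogen bond and a ~3.5 A van der Waals bump, not the raw count of data points.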

So, one way of thinking about the difference between 3A data with 50% solvent vs 3.6A data with 80% solvent is to "scale" the 3.0A unit cell so that it has the same volume as the 3.6A crystal's unit cell. Then you effectively have two structures with different observations/parameters and the SAME resolution (because you have changed the scale of space for one of them). The re-scaling is mathematically equivalent to stretching the 3.0 A electron density map, so all you have done is "inflate" the protein so that the atoms are now farther apart. Does this make them easier to distinguish? No, because although they are farther apart, they are also "fatter". Stretching the map changes both the peak widths and the distances between them proportionally. The width of atomic peaks is actually very closely related to the resolution (especially at ~3A), so after stretching the map the peak widths in both the 3A/50% and 3.6A/80% cases will be about the same, but the distances between the peaks will be larger in the map that came from the 3A/50% data. So, relatively speaking (delta-bond/bond_length), you still have more "accuracy" with the 3A case, no matter what the observations/parameters ratio is.
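
Putting rough numbers on that stretching argument (the 1.54 A C-C bond is just a convenient example distance):

cell_volume_ratio = 0.5 / 0.2                 # same protein, 50% vs 80% solvent
stretch = cell_volume_ratio ** (1.0 / 3.0)    # linear stretch factor, ~1.36

print(f"stretched 3.0 A peak-width scale: {3.0 * stretch:.2f} A  (vs 3.6 A in the big cell)")
print(f"stretched C-C separation        : {1.54 * stretch:.2f} A  (vs 1.54 A in the big cell)")

The peak widths come out roughly comparable, but the peak-to-peak distances from the 3 A data are ~36% bigger, which is why delta-bond/bond_length still favours the 3 A case.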

Of course, all this is assuming that all you are interested in is bond lengths. If you happen to know your bond lengths already (such as covalent bonds), then that changes the relationship between the errors in your data and the parameter you are trying to measure. To put things another way, a 60 nm resolution 3D reconstruction of a 5-micron-wide cell represents about as many "observations" as a 1.2A crystal structure of a 100 A unit cell. Which is "more accurate"? Depends on the question you are trying to ask.
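
The arithmetic behind that comparison, counting (object size / resolution)^3 "resolution elements" in each case:

cryo_elements = (5000.0 / 60.0) ** 3    # 5 micron cell at 60 nm resolution (both in nm)
xtal_elements = (100.0 / 1.2) ** 3      # 100 A unit cell at 1.2 A resolution (both in A)
print(f"60 nm map of a 5 micron cell    : {cryo_elements:.1e} elements")
print(f"1.2 A structure in a 100 A cell : {xtal_elements:.1e} elements")

Both come out near 5.8e5 resolution elements, which is why they represent "about as many" observations.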

So, to answer the OP's question, I still think 3.0A is better than 3.6A. Yes, higher solvent content gives you better phases, and phase accuracy IS important for placing atoms (30 degrees of phase error at 3 A means that the spatial waves at that resolution are "off" by an average of ~0.25 A). But, that advantage is really only in the initial stages of phasing, and it fades as soon as the experimental phases start holding your refinement back more than they help (which actually happens rather quickly). Remember, the 'bulk solvent' model, as far as the phases are concerned, is really just another kind of "solvent flattening".
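
Where that ~0.25 A comes from: a phase error slides a Fourier component sideways by (phase error / 360 degrees) of its period, and at 3 A resolution the period of those waves is 3 A.

phase_error_deg = 30.0
d_spacing = 3.0
shift = (phase_error_deg / 360.0) * d_spacing
print(f"{phase_error_deg:.0f} deg phase error at {d_spacing:.1f} A -> ~{shift:.2f} A positional shift")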

-James Holton
MAD Scientist

On 3/14/2013 5:27 PM, Guangyu Zhu wrote:
I have this question. For example, a protein could be crystallized in two crystal forms. The two crystal forms have the same space group and 1 molecule/asymm. One crystal form diffracts to 3A with 50% solvent, and the other diffracts to 3.6A with 80% solvent. The cell volume of the 3.6A crystal must be 5/2 = 2.5 times larger because of the higher solvent content. If both datasets are collected to the same completeness (say 100%), the 3.6A data actually have a higher data/parameter ratio, (5/2)/(3.6/3)**3 = 1.45 times that of the 3A data. For refinement, a better data/parameter ratio should give a more accurate structure, i.e. the 3.6A data are better. But higher resolution should give a better-resolved electron density map. So which crystal form really gives a better (more reliable and accurate) protein structure?
