Dear Pavel & CCP4bb readers,
On Wed, Feb 14, 2024 at 08:28:03PM -0800, Pavel Afonine wrote:
> What follows below is not very specific to the particular program
> (STAIRSANISO) nor the original questions, but nonetheless, I believe it is
> relevant.
Thanks for joining the discussion: always good to have different viewpoints
or opinions made visible - especially for less knowledgeable users and
readers of the CCP4bb.
And apologies to anyone getting tired of "another long post" here, but
some remarks do require follow-ups that hopefully will help keep the
discussion at a level useful to all readers.
> In the past, performing any adjustments to the diffraction data intended
> for solving and refining atomic models was more or less considered taboo.
That is a very broad statement that I have trouble making sense of: what do
you mean with "adjustments" and what do you mean with "diffraction data"?
If we are truly looking at diffraction data as it comes out of our
experiment, we are looking at the raw images, right? Those are then
handled roughly as follows (as an example for MX):
* initial integrated intensities (simplifying 3D pixel data)
* profile fitting of integrated intensities
* scaling (with various parametrisation models)
* selection of data (excluding image ranges due to radiation damage or
because a crystal moves out of the beam, excluding/handling ice-ring
contamination, selecting datasets in SSX etc)
* adjustment of error model (to get "meaningful" error estimates,
i.e. sigma values)
* outlier rejection (based largely on those sigmas)
* merging (inverse-variance weighted)
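The last two steps above (sigma-based weighting and merging) can be sketched in a few lines. This is a deliberately simplified toy version in Python (all names are mine); real scaling/merging programs such as AIMLESS do considerably more:

```python
# Toy sketch of inverse-variance weighted merging of symmetry-equivalent
# intensity observations, after scaling and outlier rejection. Illustrative
# only - real programs handle partiality, error models, anomalous pairs etc.
from collections import defaultdict

def merge(observations):
    """observations: list of (hkl, I, sigI) tuples.
    Returns {hkl: (I_merged, sig_merged)} using 1/sigma^2 weights."""
    groups = defaultdict(list)
    for hkl, i_obs, sig in observations:
        groups[hkl].append((i_obs, sig))
    merged = {}
    for hkl, obs in groups.items():
        weights = [1.0 / sig ** 2 for _, sig in obs]  # inverse-variance weights
        w_sum = sum(weights)
        i_merged = sum(w * i for (i, _), w in zip(obs, weights)) / w_sum
        merged[hkl] = (i_merged, (1.0 / w_sum) ** 0.5)
    return merged
```

Note how the merged sigma shrinks as consistent observations accumulate - which is exactly why the error-model adjustment in the previous step matters so much.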
Maybe all those "adjustments to the diffraction data" are not what you are
referring to in your remark above? So let's assume you are referring to the
merged intensity data after all of the above steps as being "the
diffraction data" ... and what we are doing after that:
* conversion from intensities to amplitudes (using different methods and
priors) which most often will include an adjustment of weak and
negative intensities (e.g. via the French & Wilson method [1]).
* decision what reflections to use for subsequent steps
- defined by geometric constraints (we can only use those that hit the
detector)
- defined by some significance criterion
resulting in an adjustment of the dataset (not the values themselves)
coming out of the raw diffraction data.
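The "significance criterion" mentioned above can be illustrated with a toy isotropic I/sigI cut-off (function name and threshold are mine; STARANISO's actual criterion is direction-dependent and considerably more sophisticated):

```python
# Toy illustration of a significance-based selection of merged reflections.
# A traditional spherical cut-off would use a resolution limit instead;
# STARANISO applies an anisotropic (direction-dependent) criterion.
def select_significant(reflections, threshold=2.0):
    """reflections: iterable of (hkl, I, sigI) tuples.
    Keep reflections whose signal-to-noise I/sigI meets the threshold.
    Note: this adjusts the *set* of reflections used downstream, not the
    measured values themselves."""
    return [(hkl, i, s) for hkl, i, s in reflections if i / s >= threshold]
```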
Maybe this is still not what you are referring to as "adjustments to
the diffraction data"? Let's see what additional "adjustments to the
diffraction data" might happen further along ...
* anisotropic scaling of the diffraction data /without/ the use of any
atomic model, as provided e.g. by
- the UCLA Anisotropy Server [2,3], using the anisotropy analysis
from Phaser [4]
- STARANISO [5], using its own analysis
* relative anisotropic scaling of the diffraction data and the current
model in refinement, e.g. in
- REFMAC [6,7]
Note: this includes writing a set of observed amplitudes into the
output MTZ file that have been corrected using the model-based
overall anisotropy factors (as far as we know, and at least up to
version 5.8.0352). So any "structure factor" deposition using only
the output reflection data from such a run will have
anisotropy-corrected observed data in the PDB archive. Our
aB_deposition_combine tool described below detects and undoes this
(when combining the reflection mmCIF data from processing with the
reflection data after refinement) to ensure that data exactly as
used as /input/ to the refinement program is deposited.
- CCTBX [8] and Phenix [9,10]
- SHELX [11,12,13]
- BUSTER [14]
* classification and rejection of model-based outlier reflections
- in Phenix [15] (still default?)
* DFc completion for missing observations in 2mFo-DFc electron density maps
- default in REFMAC (into single FWT/PHWT by default as far as we know)
- default in Phenix (into an additional set of map coefficients?)
- default in BUSTER (into two additional sets of map coefficients,
2FOFCWT_iso-fill/PH2FOFCWT_iso-fill using a sphere and
2FOFCWT_aniso-fill/PH2FOFCWT_aniso-fill using the anisotropic cut-off
information from STARANISO).
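To make the "anisotropic scaling" items above a bit more concrete, here is a minimal hedged sketch of applying an overall anisotropic scale factor to an amplitude. It assumes an orthorhombic cell (so the reciprocal vector is trivial) and a symmetric B tensor; real programs refine that tensor (against the data and/or the model) and work in the general reciprocal basis:

```python
import math

# Hedged toy example: apply an overall anisotropic correction of the form
# exp(-(1/4) s^T B s) to an observed amplitude. "cell" is (a, b, c) of an
# orthorhombic cell, so the reciprocal vector is simply s = (h/a, k/b, l/c);
# B is a symmetric 3x3 tensor in A^2. For an isotropic B this reduces to the
# familiar exp(-B / (4 d^2)) Debye-Waller factor on amplitudes.
def aniso_scale(f_obs, hkl, cell, B):
    s = tuple(idx / length for idx, length in zip(hkl, cell))
    sBs = sum(B[i][j] * s[i] * s[j] for i in range(3) for j in range(3))
    return f_obs * math.exp(-0.25 * sBs)
```

Whether such a factor is applied to the observations, to the model, or only used relatively between the two is exactly the kind of choice that differs between the programs listed above.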
Which of all of the above are you referring to as being "considered taboo"?
It would be helpful if you could clarify this so that we can then focus on
that particular point in our discussion.
> When cryo-EM emerged as a competitor to x-ray crystallography, the paradigm
> began to shift. In cryo-EM, manipulations applied to the data (the map) are
> a standard practice. The map can be boxed, filtered (sharpened, blurred,
> etc.), modified (e.g., setting something outside the molecular region), and
> so forth; you name it. One might wonder why the same isn't done to x-ray data.
I don't think it is true that this "isn't done to x-ray data":
* In small-molecule crystallography the "SQUEEZE" procedure [16] exists,
that does modify the diffraction data
* Electron-density sharpening [17] is widely used/described [18,19]
> Historical analogies include truncating data beyond 6-8Å resolution
> to avoid dealing with the bulk solvent
Maybe that should be rephrased from
... truncating data beyond 6-8Å resolution to avoid dealing with the bulk
solvent ...
to
... truncating data below 6-8Å resolution because we couldn't deal with
the bulk solvent at the time (but once we could [20], there was no longer
a need for that) ...
That's then a fairly normal evolution of science/methods: we do the best we
can at a given time with the tools at our disposal while trying to develop
better methods that might complement or replace those existing tools.
> or default sharpening (a feature available in X-plor for some time,
> then removed for obvious reasons, AFAIK), choosing resolution limits
> (PAIRREF), and anisotropic data massaging by the UCLA server as a
> more recent example. STAIRSANISO is the leader in doing things along
> these lines as of today.
Your wording here makes it very difficult to take that seriously: describing
the whole range of features provided by STARANISO (sic) as "massaging data"
takes us into the quicksands of polemics ... I'm sure we can do better than
that.
Which component are you criticizing here exactly? Maybe by being a bit more
explicit we can have a scientific discussion about those items and come to
some understanding (or agree to disagree) that is useful to the average
reader of these CCP4bb threads. We have:
(1) The analysis of anisotropy in the data (as also provided by
e.g. Phaser [4] or ctruncate [21])?
(2) The selection of reflection data without an isotropic constraint (that
would be leading to a spherical cut-off)?
==> this can be switched off on the STARANISO server [22]
(3) The anisotropic scaling/correction of the data according to the
analysis in (1)?
==> this can be switched off on the STARANISO server [22]
We've chosen defaults in STARANISO (if run through autoPROC or through the
server) that we feel make the most sense in our hands and are based on a
lot of user feedback. We don't force users to stick with those defaults or
to use STARANISO in the first place.
> Indeed, why not if this is helpful to solve the structure?
Exactly: up to the point where STARANISO provides reflection data, no
notion of a structural model has entered any of the computations. If a
particular method of processing the raw diffraction data (images) leads to
a model and electron density map that shows more information and allows for
better interpretation and correction of that model, it clearly suggests to
me that it provides a higher information content and is useful for that
purpose.
This has to be seen obviously in the context of the methods, programs and
parametrisations we currently use: nothing is set in stone and new
developments will come along that make current approaches redundant at some
stage in the future ... it's called "progress" ;-)
> However, it's important that the deposition clearly contains and
> annotates at least the following:
> - the original unmanipulated data;
> - modified data (by whatever method or program);
> - accessible information about the data that was used to obtain the final
> deposited atomic model.
Completely agree with you (even if I would choose "unmodified" instead of
"unmanipulated" here: the choice of words matters and we should stay as
neutral as possible, I think).
That is exactly the reason why autoPROC/STARANISO has been providing
deposition-ready PDBx/mmCIF files by default since the March 2019 release
[23]. We are trying to explain the usage of those in great detail [24], but
users are often not aware of those files if their data were auto-processed
at synchrotrons [25] (the presentation of autoPROC/STARANISO results/files
is not always complete and could be improved upon).
There are several issues a normal user has to deal with when it comes to
PDB deposition:
* There is often a significant time lag between data collection (and
probably processing) and deposition: the last time I checked, this was on
average about 2.5 years. Making sure that the model and reflection
data from the final refinement steps are correctly associated with the
original data processing can become tricky (which is why we provide the
"aB_deposition_combine" tool to help users, [24]).
* Historical baggage that seems impossible to get rid of. I'm especially
thinking of the requirement (by the OneDep system, as far as I
understand) for the data quality metrics (i.e. those statistics that
describe the reflection data) to be part of the model mmCIF file. This
basically goes back to the time when deposition of "structure factor"
(sic) data was not compulsory (pre Feb 2008) and these items had to be in
the model file.
With the use of mmCIF files for the model /and/ the reflection data
during deposition, that requirement should not be necessary anymore. It
would especially avoid the use of data-preparation tools that try to
extract some values from a variety of logfiles, with the intrinsic
problems this entails:
- these logfiles are by definition separate from the reflection file
with the danger of encountering mix-ups;
- they are completely unstructured and could (and do) change at will -
while a mmCIF file is structured and can be validated (for format
mainly, but also somewhat for content) against an official mmCIF
dictionary;
* The guidance - by documentation and deposition systems - concerning how
best to provide the correct information in the correct format to the
deposition software is too long, too scattered, not detailed enough,
confusing, contradictory etc. We could all do much better here, I guess,
and ensure that at deposition time users need to deal primarily with the
correct scientific content of a deposition and not with format and
format-validation questions. The latter often seem to end up forcing
users to deposit "something" - often under stress - as long as it goes
through those checks and they can move on.
* The uncanny power of sloppy throw-away remarks. I remember the times
when everyone said "SHARP is slow and only needed as a last resort for
really difficult cases." (for the novice readers: SHARP is a program for
experimental phasing). Yes, it was slow back in the early 90s on SGI
workstations etc., because it does some pretty extensive
computations. For the last 20 years, though, it has usually run in
seconds on nearly all problems and is by far the fastest step during a
typical experimental phasing experiment (site detection, density
modification and automatic building are MUCH slower). But we still hear
the same old remarks ...
Or we could look at the discussion about Rmerge (and how we still see it
in depositions and papers and have reviewers commenting on it being too
high) ... 25 years after the papers pointing out its flaws?
Now we often hear questions about "can I deposit STARANISO data", with
extremely little scientific reasoning why one couldn't or shouldn't. It
all seems to be based on some fear that powerful referees, PIs or well
known experts will complain about this at some point. These don't seem
like very good reasons for doing or not doing something if it otherwise
seems sound to the actual user, but pushing back against that external
pressure as a new or one-off crystallographer is really hard. It is up
to us (so-called) "experts" to be aware of the power we wield here - and
use it wisely.
By all means, have a scientific argument with us and show everyone why
some of our methods are not doing the right thing or are buggy. We'd be
the first to welcome any such comments because ultimately they lead to
improved methods and programs for everyone. But remarks like "data
massaging" or "manipulated" have a real negative impact without adding
anything to such a discussion ...
The bottom line for our software [24]: it should be trivial to provide (a)
the deposition-ready model mmCIF file (coming from Phenix, REFMAC or
BUSTER) that contains the correct data quality metrics, and (b) a
deposition-ready reflection data mmCIF file including all the above
datablocks described by Pavel.
> Note *accessible* above as this is the key for what follows below.
> Let's consider this example: https://files.rcsb.org/download/6R72-sf.cif,
> which is representative of the class of problems I'm trying to convey here.
That is one example (but maybe not a good one, see below) - maybe a better
one would e.g. be 8ar7.
> The file has everything, kudos to the authors: The original data, the
> manipulated data and a whole lot more.
> Are these data accessible?
> YES, if you download the file, open it in your favorite text editor, and
> carefully scroll and read through its 76,566 lines and use your best guess
> to infer what are the original data arrays, what are the modified data
> arrays and so on.
A mmCIF file is not something anyone would want to look at in a text
editor! So isn't that more a problem with the software you decided to use
for getting and looking at that data? BUSTER provides a simple tool
("fetch_PDB_gemmi") that will fetch a PDB entry and not only extract the
reflection data for each block, but also the explanations they carry. Here
is an excerpt of what it reports for the entry you picked (the full output
is a bit longer and I didn't want to make this email even less likely to be
read):
### merged data block #1 = r6r72sf
data as used in refinement and resulting electron density maps.
Converted by gemmi-mtz2cif 0.2.0
### merged data block #2 = r6r72Asf
merged and scaled data post-processed by for conversion from intensities to
structure factor amplitudes.
### merged data block #3 = r6r72Bsf
merged and scaled data from AIMLESS without any post-processing and/or data
cut-off.
The reason I mentioned that 6r72 is not a good example is visible
above: somehow the string "STARANISO" got lost in the description
(_diffrn.details) of the second data block ... you can see that by the
incomplete sentence and the double spaces. If you do the same for 8ar7
via
fetch_PDB_gemmi 8ar7
you get
### merged data block #1 = r8ar7sf
data as used in refinement and resulting electron density maps.
### merged data block #2 = r8ar7Asf
2mFo-DFc map coefficients complemented for missing data (as defined by
SA_flag from STARANISO).
### merged data block #3 = r8ar7Bsf
2mFo-DFc map coefficients complemented for missing data (within full
resolution range).
### merged data block #4 = r8ar7Csf
merged and scaled data post-processed by STARANISO for conversion from
intensities to structure factor amplitudes and anomalous data.
### merged data block #5 = r8ar7Dsf
merged and scaled EARLY (potentially least radiation-damaged) data
post-processed by STARANISO for conversion from intensities to structure factor
amplitudes - useful for radiation-damage detection/description maps (as e.g.
done in BUSTER).
### merged data block #6 = r8ar7Esf
merged and scaled LATE (potentially most radiation-damaged) data
post-processed by STARANISO for conversion from intensities to structure factor
amplitudes - useful for radiation-damage detection/description maps (as e.g.
done in BUSTER).
### merged data block #7 = r8ar7Fsf
merged and scaled data from AIMLESS without any post-processing and/or data
cut-off.
### unmerged data block #8 = r8ar7Gsf
unmerged and scaled data from AIMLESS without any post-processing and/or
data cut-off
and a MTZ file for each of those data blocks. It should be very easy from
that to pick any datablock you like based on that description (which
unfortunately isn't based on a fixed vocabulary, but that could be added to
the mmCIF dictionary if needed).
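For readers without BUSTER installed, the same kind of block listing can be approximated with a few lines of Python. This is a rough, stdlib-only sketch (function name is mine; a robust version should use a proper mmCIF parser such as gemmi, which is what fetch_PDB_gemmi builds on):

```python
# Rough sketch of what a tool like fetch_PDB_gemmi reports: list the data
# blocks of a "structure factor" mmCIF file together with their
# _diffrn.details annotation. Handles only simple single-line values and
# semicolon-delimited text fields; use a real mmCIF parser for anything
# serious.
def list_sf_blocks(cif_text):
    blocks, name, details = [], None, None
    lines = cif_text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if line.startswith("data_"):
            if name is not None:
                blocks.append((name, details))
            name, details = line[len("data_"):], None
        elif line.startswith("_diffrn.details"):
            value = line[len("_diffrn.details"):].strip()
            if value:
                # value on the same line, possibly quoted
                details = value.strip("'\"")
            elif i + 1 < len(lines) and lines[i + 1].startswith(";"):
                # semicolon-delimited multi-line text field
                parts = [lines[i + 1][1:].strip()]
                i += 2
                while i < len(lines) and not lines[i].startswith(";"):
                    parts.append(lines[i].strip())
                    i += 1
                details = " ".join(p for p in parts if p)
        i += 1
    if name is not None:
        blocks.append((name, details))
    return blocks
```

Run against a downloaded "-sf.cif" file, this prints essentially the per-block summaries shown above - which is all it takes to tell the "as used in refinement" block apart from the "without any post-processing" one.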
> NO, absolutely NO, if you parse data files in PDB automatically with a
> script, and attempt to extract particular data (e.g., original unmanipulated
> data). And this is what I find problematic, especially given 215+k entries
> in PDB as of today.
> Hope someone does something about it!
Well, from our side I think we've done already a fair amount here through
* our software (creating and combining deposition-ready mmCIF files from
processing+refinement and providing a tool to fetch archived PDB
entries),
* tools like our "Table 1" server [27],
* very useful discussions with e.g. the PDBj that has resulted in a much
enriched description for the data archived with a given entry [28], and
* our work within the PDBx/mmCIF WG [29], especially the Processing
Subgroup [30].
If the software systems at your disposal don't provide adequate tools you
should probably discuss this with those developers ;-)
Once you have defined the "something", the best "someone" is yourself - so
feel free to join in productively rather than disparagingly.
Cheers
Clemens
[1] French, S. and Wilson, K., 1978. On the treatment of negative
intensity observations. Acta Crystallographica Section A: Crystal
Physics, Diffraction, Theoretical and General Crystallography,
34(4), pp.517-525.
[2] Sawaya, M.R., 2014. Methods to refine macromolecular structures in
cases of severe diffraction anisotropy. Structural Genomics:
General Applications, pp.205-214.
[3] https://srv.mbi.ucla.edu/Anisoscale/
[4] McCoy, A.J., Grosse-Kunstleve, R.W., Adams, P.D., Winn, M.D.,
Storoni, L.C. and Read, R.J., 2007. Phaser crystallographic
software. Journal of applied crystallography, 40(4), pp.658-674.
[5] https://staraniso.globalphasing.org/
[6] Murshudov, G.N., Davies, G.J., Isupov, M., Krzywda, S. and Dodson,
E.J., 1998. The effect of overall anisotropic scaling in
macromolecular refinement. CCP4 newsletter on protein
crystallography, 35, pp.37-42.
[7] Murshudov, G.N., Skubák, P., Lebedev, A.A., Pannu, N.S., Steiner,
R.A., Nicholls, R.A., Winn, M.D., Long, F. and Vagin, A.A.,
2011. REFMAC5 for the refinement of macromolecular crystal
structures. Acta Crystallographica Section D: Biological
Crystallography, 67(4), pp.355-367.
[8] Afonine, P.V., Grosse-Kunstleve, R.W. and Adams, P.D., 2005. A
robust bulk-solvent correction and anisotropic scaling
procedure. Acta Crystallographica Section D: Biological
Crystallography, 61(7), pp.850-855.
[9] Afonine, P.V., Grosse-Kunstleve, R.W., Chen, V.B., Headd, J.J.,
Moriarty, N.W., Richardson, J.S., Richardson, D.C., Urzhumtsev,
A., Zwart, P.H. and Adams, P.D., 2010. phenix. model_vs_data: A
high-level tool for the calculation of crystallographic model and
data statistics. Journal of applied crystallography, 43(4),
pp.669-676.
[10] Afonine, P.V., Grosse-Kunstleve, R.W., Adams, P.D. and
Urzhumtsev, A., 2013. Bulk-solvent and overall scaling revisited:
faster calculations, improved results. Acta Crystallographica
Section D: Biological Crystallography, 69(4), pp.625-634.
Afonine, P.V., Grosse-Kunstleve, R.W., Adams, P.D. and
Urzhumtsev, A., 2023. Bulk-solvent and overall scaling revisited:
faster calculations, improved results. Corrigendum. Acta
Crystallographica Section D: Structural Biology, 79(7).
[11] Shakked, Z., 1983. Anisotropic scaling of three-dimensional
intensity data. Acta Crystallographica Section A: Foundations of
Crystallography, 39(3), pp.278-279.
[12] Pohl, E., Schneider, T.R., Dauter, Z., Schmidt, A., Fritz,
H.J. and Sheldrick, G.M., 1999. 1.7 Å structure of the stabilized
REIv mutant T39K. Application of local NCS restraints. Acta
Crystallographica Section D: Biological Crystallography, 55(6),
pp.1158-1167.
[13] Sheldrick, G.M., 2012. Macromolecular applications of
SHELX. International Tables for Crystallography
(2012). Vol. F. ch. 18.9, pp. 529-533.
[14] Blanc, E., Roversi, P., Vonrhein, C., Flensburg, C., Lea,
S.M. and Bricogne, G., 2004. Refinement of severely incomplete
structures with maximum likelihood in BUSTER–TNT. Acta
Crystallographica Section D: Biological Crystallography, 60(12),
pp.2210-2221.
[15] https://phenix-online.org/documentation/faqs/refine.html#general ("Why
does phenix.refine not use all data in refinement?") and
https://phenix-online.org/pipermail/phenixbb/2010-December/016283.html
[16] Spek, A.L., 2015. PLATON SQUEEZE: a tool for the calculation of
the disordered solvent contribution to the calculated structure
factors. Acta Crystallographica Section C: Structural Chemistry,
71(1), pp.9-18.
[17] DeLaBarre, B. and Brunger, A.T., 2006. Considerations for the
refinement of low-resolution crystal structures. Acta
Crystallographica Section D: Biological Crystallography, 62(8),
pp.923-932.
[18] Liu, C. and Xiong, Y., 2014. Electron density sharpening as a
general technique in crystallographic studies. Journal of
molecular biology, 426(4), pp.980-993.
[19] Terwilliger, T.C., Sobolev, O.V., Afonine, P.V. and Adams, P.D.,
2018. Automated map sharpening by maximization of detail and
connectivity. Acta Crystallographica Section D: Structural
Biology, 74(6), pp.545-559.
[20] Jiang, J.S. and Brünger, A.T., 1994. Protein hydration observed
by X-ray diffraction: solvation properties of penicillopepsin and
neuraminidase crystal structures. Journal of molecular biology,
243(1), pp.100-115.
[21] https://www.ccp4.ac.uk/html/ctruncate.html
[22] https://staraniso.globalphasing.org/
[23] https://www.globalphasing.com/autoproc/ReleaseNotes/ReleaseNotes-autoPROC_snapshot_20190301.txt
[24] https://www.globalphasing.com/buster/wiki/index.cgi?DepositionMmCif
[25] https://www.globalphasing.com/autoproc/wiki/index.cgi?RunningAutoProcAtSynchrotrons
[26] https://www.globalphasing.com/buster/
[27] https://staraniso.globalphasing.org/table1/
     (e.g. https://staraniso.globalphasing.org/table1/ar/8ar7.html)
[28] https://pdbj.org/mine/experimental_details/8AR7
[29] https://www.wwpdb.org/task/mmcif
[30] https://github.com/pdbxmmcifwg/mmcif-data-proc