Dear Pavel & CCP4bb readers,

On Wed, Feb 14, 2024 at 08:28:03PM -0800, Pavel Afonine wrote:
> What follows below is not very specific to the particular program
> (STAIRSANISO) nor the original questions, but nonetheless, I believe it is
> relevant.
Thanks for joining the discussion: it is always good to have different viewpoints or opinions made visible - especially for less knowledgeable users and readers of the CCP4bb. And apologies to anyone getting tired of "another long post" here, but some remarks do require follow-ups that hopefully will help keep the discussion at a level useful to all readers.

> In the past, performing any adjustments to the diffraction data intended
> for solving and refining atomic models was more or less considered taboo.

That is a very broad statement that I have trouble making sense of: what do you mean by "adjustments" and what do you mean by "diffraction data"? If we are truly looking at diffraction data as it comes out of our experiment, we are looking at the raw images, right? Those are then handled roughly as follows (as an example for MX):

  * initial integrated intensities (simplifying 3D pixel data)
  * profile fitting of integrated intensities
  * scaling (with various parametrisation models)
  * selection of data (excluding image ranges due to radiation damage or because a crystal moves out of the beam, excluding/handling ice-ring contamination, selecting datasets in SSX etc)
  * adjustment of error model (to get "meaningful" error estimates, i.e. sigma values)
  * outlier rejection (based largely on those sigmas)
  * merging (inverse-variance weighted)

Maybe all those "adjustments to the diffraction data" are not what you are referring to in your remark above? So let's assume you are referring to the merged intensity data after all of the above steps as being "the diffraction data" ... and what we are doing after that:

  * conversion from intensities to amplitudes (using different methods and priors) which most often will include an adjustment of weak and negative intensities (e.g. via the French & Wilson method [1]).
  * decision what reflections to use for subsequent steps
    - defined by geometric constraints (we can only use those that hit the detector)
    - defined by some significance criterion resulting in an adjustment of the dataset (not the values themselves) coming out of the raw diffraction data.

Maybe this is still not what you are referring to as "adjustments to the diffraction data"? Let's see what additional "adjustments to the diffraction data" might happen further along ...

  * anisotropic scaling of the diffraction data /without/ the use of any atomic model, as provided e.g. by
    - the UCLA Anisotropy Server [2,3], using the anisotropy analysis from Phaser [4]
    - STARANISO [5], using its own analysis
  * relative anisotropic scaling of the diffraction data and the current model in refinement, e.g. in
    - REFMAC [6,7]
      Note: this includes writing a set of observed amplitudes into the output MTZ file that have been corrected using the model-based overall anisotropy factors (as far as we know and at least up to version 5.8.0352). So any "structure factor" deposition using only the output reflection data from such a run will have anisotropy-corrected observed data in the PDB archive. Our aB_deposition_combine tool described below detects and undoes this (when combining the reflection mmCIF data from processing with the reflection data after refinement) to ensure that data exactly as used as /input/ to the refinement program is deposited.
    - CCTBX [8] and Phenix [9,10]
    - SHELX [11,12,13]
    - BUSTER [14]
  * classification and rejection of model-based outlier reflections
    - in Phenix [15] (still default?)
  * DFc completion for missing observations in 2mFo-DFc electron density maps
    - default in REFMAC (into single FWT/PHWT by default as far as we know)
    - default in Phenix (into an additional set of map coefficients?)
    - default in BUSTER (into two additional sets of map coefficients, 2FOFCWT_iso-fill/PH2FOFCWT_iso-fill using a sphere and 2FOFCWT_aniso-fill/PH2FOFCWT_aniso-fill using the anisotropic cut-off information from STARANISO).

Which of all of the above are you referring to as being "considered taboo"? It would be helpful if you could clarify this so that we can then focus on that particular point in our discussion.

> When cryo-EM emerged as a competitor to x-ray crystallography, the paradigm
> began to shift. In cryo-EM, manipulations applied to the data (the map) are
> a standard practice. The map can be boxed, filtered (sharpened, blurred,
> etc.), modified (e.g., setting something outside the molecular region), and
> so forth; you name it. One might wonder why the same isn't done to x-ray data.

I don't think it is true that this "isn't done to x-ray data":

  * In small-molecule crystallography the "SQUEEZE" procedure [16] exists, that does modify the diffraction data
  * Electron-density sharpening [17] is widely used/described [18,19]

> Historical analogies include truncating data beyond 6-8Å resolution
> to avoid dealing with the bulk solvent

Maybe that should be rephrased from

  ... truncating data beyond 6-8Å resolution to avoid dealing with the bulk solvent ...

to

  ... truncating data below 6-8Å resolution because we couldn't deal with the bulk solvent at the time (but once we could [20], there was no longer a need for that) ...

That's then a fairly normal evolution of science/methods: we do the best we can at a given time with the tools at our disposal while trying to develop better methods that might complement or replace those existing tools.

> or default sharpening (a feature available in X-plor for some time,
> then removed for obvious reasons, AFAIK), choosing resolution limits
> (PAIRREF), and anisotropic data massaging by the UCLA server as a
> more recent example. STAIRSANISO is the leader in doing things along
> these lines as of today.
Your wording here makes it very difficult to take that seriously: describing the whole range of features provided by STARANISO (sic) as "massaging data" takes us into the quicksands of polemics ... I'm sure we can do better than that. Which component are you criticizing here exactly? Maybe by being a bit more explicit we can have a scientific discussion about those items and come to some understanding (or agree to disagree) that is useful to the average reader of these CCP4bb threads. We have:

  (1) The analysis of anisotropy in the data (as also provided by e.g. Phaser [4] or ctruncate [21])?

  (2) The selection of reflection data without an isotropic constraint (that would be leading to a spherical cut-off)?
      ==> this can be switched off on the STARANISO server [22]

  (3) The anisotropic scaling/correction of the data according to the analysis in (1)?
      ==> this can be switched off on the STARANISO server [22]

We've chosen defaults in STARANISO (if run through autoPROC or through the server) that we feel make the most sense in our hands and are based on a lot of user feedback. We don't force users to stick with those defaults or to use STARANISO in the first place.

> Indeed, why not if this is helpful to solve the structure?

Exactly: up to the point where STARANISO provides reflection data, no notion of a structural model has entered any of the computations. If a particular method of processing the raw diffraction data (images) leads to a model and electron density map that shows more information and allows for better interpretation and correction of that model, it clearly suggests to me that it provides a higher information content and is useful for that purpose. This has to be seen obviously in the context of the methods, programs and parametrisations we currently use: nothing is set in stone and new developments will come along that make current approaches redundant at some stage in the future ...
it's called "progress" ;-)

> However, it's important that the deposition clearly contains and
> annotates at least the following:
>
> - the original unmanipulated data;
> - modified data (by whatever method or program);
> - accessible information about the data that was used to obtain the final
>   deposited atomic model.

Completely agree with you (even if I would choose "unmodified" instead of "unmanipulated" here: choice of words matters and we should stay as neutral as possible I think). That is exactly the reason why autoPROC/STARANISO has been providing deposition-ready PDBx/mmCIF files by default since the March 2019 release [23]. We are trying to explain the usage of those in great detail [24], but users are often not aware of those files if their data were auto-processed at synchrotrons [25] (the presentation of autoPROC/STARANISO results/files is not always complete and could be improved upon).

There are several issues a normal user has to deal with when it comes to PDB deposition:

  * There is often a significant time lag between data collection (and probably processing) and deposition: this was on average about 2.5 years the last time I checked this. Making sure that the model and reflection data from the final refinement steps are correctly associated with the original data processing can become tricky (which is why we provide the "aB_deposition_combine" tool to help users, [24]).
  * Historical baggage that seems impossible to get rid of. I'm especially thinking of the requirement (by the OneDep system, as far as I understand) for the data quality metrics (i.e. those statistics that describe the reflection data) to be part of the model mmCIF file. This basically goes back to the time when deposition of "structure factor" (sic) data was not compulsory (pre Feb 2008) and these items had to be in the model file. With the use of mmCIF files for the model /and/ the reflection data during deposition, that requirement should not be necessary anymore.
    It would especially avoid the use of data-preparation tools that try and extract some values from a variety of logfiles with the intrinsic problems this entails:
    - these logfiles are by definition separate from the reflection file with the danger of encountering mix-ups;
    - they are completely unstructured and could (and do) change at will - while an mmCIF file is structured and can be validated (for format mainly, but also somewhat for content) against an official mmCIF dictionary;
  * The guidance - by documentation and deposition systems - concerning how best to provide the correct information in the correct format to the deposition software is too long, too scattered, not detailed enough, confusing, contradictory etc. We could all do much better here I guess and ensure that at deposition time users need to deal primarily with the correct scientific content of a deposition and not with format and format-validation questions. The latter often seem to end up forcing users to deposit "something" - often under stress - as long as it goes through those checks and they can move on.
  * The uncanny power of sloppy throw-away remarks. I remember the times when everyone said "SHARP is slow and only needed as a last resort for really difficult cases." (for the novice readers: SHARP is a program for experimental phasing). Yes, it was slow back in the early 90s on SGI workstations etc, because it does some pretty extensive computations. For the last 20 years though, it has usually been running in seconds on nearly all problems and is by far the fastest step during a typical experimental phasing experiment (site detection, density modification and automatic building are MUCH slower). But we still hear the same old remarks ... Or we could look at the discussion about Rmerge (and how we still see it in depositions and papers and have reviewers commenting on it being too high) ... 25 years after the papers pointing out its flaws?
    Now we often hear questions about "can I deposit STARANISO data", with extremely little scientific reasoning why one couldn't or shouldn't. It all seems to be based on some fear that powerful referees, PIs or well known experts will complain about this at some point. These don't seem like very good reasons for doing or not doing something if it otherwise seems sound to the actual user, but pushing back against that external pressure as a new or one-off crystallographer is really hard. It is up to us (so-called) "experts" to be aware of the power we wield here - and use it wisely.

    By all means, have a scientific argument with us and show everyone why some of our methods are not doing the right thing or are buggy. We'd be the first to welcome any such comments because ultimately they lead to improved methods and programs for everyone. But remarks like "data massaging" or "manipulated" have a real negative impact without adding anything to such a discussion ...

The bottom line for our software [24]: it should be trivial to provide

  (a) the deposition-ready model mmCIF file (coming from Phenix, REFMAC or BUSTER) that contains the correct data quality metrics, and
  (b) a deposition-ready reflection data mmCIF file including all the above datablocks described by Pavel.

> Note *accessible* above as this is the key for what follows below.
>
> Let's consider this example: https://files.rcsb.org/download/6R72-sf.cif ,
> which is representative of the class of problems I'm trying to convey here.

That is one example (but maybe not a good one, see below) - maybe a better one would e.g. be 8ar7.

> The file has everything, kudos to the authors: The original data, the
> manipulated data and a whole lot more.
>
> Are these data accessible?
>
> YES, if you download the file, open it in your favorite text editor, and
> carefully scroll and read through its 76,566 lines and use your best guess
> to infer what are the original data arrays, what are the modified data
> arrays and so on.
An mmCIF file is not something anyone would want to look at in a text editor! So isn't that more a problem with the software you decided to use for getting and looking at that data? BUSTER provides a simple tool ("fetch_PDB_gemmi") that will fetch a PDB entry and not only extract the reflection data for each block, but also the explanations they carry. Here is an excerpt of what it reports for the entry you picked (the full output is a bit longer and I didn't want to make this email even less likely to be read):

  ### merged data block #1 = r6r72sf
      data as used in refinement and resulting electron density maps.
      Converted by gemmi-mtz2cif 0.2.0

  ### merged data block #2 = r6r72Asf
      merged and scaled data post-processed by  for conversion from intensities to structure factor amplitudes.

  ### merged data block #3 = r6r72Bsf
      merged and scaled data from AIMLESS without any post-processing and/or data cut-off.

The reason I mentioned that 6r72 is not a good example is visible above: somehow the string "STARANISO" got lost in the description (_diffrn.details) of the second data block ... you can see that by the incomplete sentence and the double spaces. If you do the same for 8ar7 via

  fetch_PDB_gemmi 8ar7

you get

  ### merged data block #1 = r8ar7sf
      data as used in refinement and resulting electron density maps.

  ### merged data block #2 = r8ar7Asf
      2mFo-DFc map coefficients complemented for missing data (as defined by SA_flag from STARANISO).

  ### merged data block #3 = r8ar7Bsf
      2mFo-DFc map coefficients complemented for missing data (within full resolution range).

  ### merged data block #4 = r8ar7Csf
      merged and scaled data post-processed by STARANISO for conversion from intensities to structure factor amplitudes and anomalous data.
  ### merged data block #5 = r8ar7Dsf
      merged and scaled EARLY (potentially least radiation-damaged) data post-processed by STARANISO for conversion from intensities to structure factor amplitudes - useful for radiation-damage detection/description maps (as e.g. done in BUSTER).

  ### merged data block #6 = r8ar7Esf
      merged and scaled LATE (potentially most radiation-damaged) data post-processed by STARANISO for conversion from intensities to structure factor amplitudes - useful for radiation-damage detection/description maps (as e.g. done in BUSTER).

  ### merged data block #7 = r8ar7Fsf
      merged and scaled data from AIMLESS without any post-processing and/or data cut-off.

  ### unmerged data block #8 = r8ar7Gsf
      unmerged and scaled data from AIMLESS without any post-processing and/or data cut-off

and an MTZ file for each of those data blocks. It should be very easy from that to pick any datablock you like based on that description (which unfortunately isn't based on a fixed vocabulary, but that could be added to the mmCIF dictionary if needed).

> NO, absolutely NO, if you parse data files in PDB automatically with a
> script, and attempt to extract particular data (eg., original unmanipulated
> data). And this is what I find problematic, especially given 215+k entries
> in PDB as of today.
>
> Hope someone does something about it!

Well, from our side I think we've done already a fair amount here through

  * our software (creating and combining deposition-ready mmCIF files from processing+refinement and providing a tool to fetch archived PDB entries),
  * tools like our "Table 1" server [27]
  * very useful discussions with e.g. the PDBj that have resulted in a much enriched description for the data archived with a given entry [28], and
  * our work within the PDBx/mmCIF WG [29], especially the Processing Subgroup [30].
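For anyone who would rather script the data-block selection described above than read the file by eye, here is a minimal Python sketch using only the standard library. It lists each data block name together with its _diffrn.details text and then picks a block by a phrase in that description. Please treat everything in it as an assumption for illustration: the function names (list_data_blocks, pick_block) are made up, the entry id "0xyz" and the example text are invented and not taken from any archived entry, _diffrn.details is assumed to appear as a plain key-value item (not inside a loop_), and the line-based parsing is only a stand-in for a proper mmCIF parser, which is what one should use for real work.

```python
def _unquote(value):
    """Strip one pair of matching single or double quotes, if present."""
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        return value[1:-1]
    return value


def _read_value(same_line, following):
    """Read an item value given on the same line, the next line,
    or as a semicolon-delimited multi-line text field."""
    if same_line:
        return _unquote(same_line[0].strip())
    if not following:
        return None
    if following[0].startswith(";"):
        parts = [following[0][1:]]
        for line in following[1:]:
            if line.startswith(";"):  # closing delimiter of the text field
                break
            parts.append(line)
        return " ".join(p.strip() for p in parts if p.strip())
    return _unquote(following[0].strip())


def list_data_blocks(cif_text):
    """Return [(block_name, details_or_None), ...] in file order."""
    lines = cif_text.splitlines()
    blocks = []
    for i, line in enumerate(lines):
        tokens = line.split(None, 1)
        if not tokens:
            continue
        if tokens[0].startswith("data_"):
            blocks.append([tokens[0][len("data_"):], None])
        elif tokens[0] == "_diffrn.details" and blocks:
            blocks[-1][1] = _read_value(tokens[1:], lines[i + 1:])
    return [tuple(b) for b in blocks]


def pick_block(blocks, phrase):
    """Name of the first block whose description contains `phrase`
    (case-insensitive), or None. The descriptions are free text,
    not a fixed vocabulary, so the phrase is a heuristic."""
    phrase = phrase.lower()
    for name, details in blocks:
        if details and phrase in details.lower():
            return name
    return None


# Invented example text (hypothetical entry id "0xyz").
EXAMPLE = """data_r0xyzsf
_diffrn.id 1
_diffrn.details 'data as used in refinement'
data_r0xyzAsf
_diffrn.id 1
_diffrn.details
;merged and scaled data from AIMLESS
without any post-processing
;
"""

if __name__ == "__main__":
    found = list_data_blocks(EXAMPLE)
    for name, details in found:
        print(name, "=", details)
    print("picked:", pick_block(found, "without any post-processing"))
```

Since the descriptions are free text, matching on a phrase remains a guess until a fixed vocabulary exists in the dictionary; a script like this should at least report when no block matches instead of silently falling back to the first one.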
If the software systems at your disposal don't provide adequate tools you should probably discuss this with those developers ;-) Once you have defined the "something", the best "someone" is yourself - so feel free to join in productively rather than disparagingly.

Cheers

Clemens

[1] French, S. and Wilson, K., 1978. On the treatment of negative intensity observations. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 34(4), pp.517-525.
[2] Sawaya, M.R., 2014. Methods to refine macromolecular structures in cases of severe diffraction anisotropy. Structural Genomics: General Applications, pp.205-214.
[3] https://srv.mbi.ucla.edu/Anisoscale/
[4] McCoy, A.J., Grosse-Kunstleve, R.W., Adams, P.D., Winn, M.D., Storoni, L.C. and Read, R.J., 2007. Phaser crystallographic software. Journal of applied crystallography, 40(4), pp.658-674.
[5] https://staraniso.globalphasing.org/
[6] Murshudov, G.N., Davies, G.J., Isupov, M., Krzywda, S. and Dodson, E.J., 1998. The effect of overall anisotropic scaling in macromolecular refinement. CCP4 newsletter on protein crystallography, 35, pp.37-42.
[7] Murshudov, G.N., Skubák, P., Lebedev, A.A., Pannu, N.S., Steiner, R.A., Nicholls, R.A., Winn, M.D., Long, F. and Vagin, A.A., 2011. REFMAC5 for the refinement of macromolecular crystal structures. Acta Crystallographica Section D: Biological Crystallography, 67(4), pp.355-367.
[8] Afonine, P.V., Grosse-Kunstleve, R.W. and Adams, P.D., 2005. A robust bulk-solvent correction and anisotropic scaling procedure. Acta Crystallographica Section D: Biological Crystallography, 61(7), pp.850-855.
[9] Afonine, P.V., Grosse-Kunstleve, R.W., Chen, V.B., Headd, J.J., Moriarty, N.W., Richardson, J.S., Richardson, D.C., Urzhumtsev, A., Zwart, P.H. and Adams, P.D., 2010. phenix.model_vs_data: A high-level tool for the calculation of crystallographic model and data statistics. Journal of applied crystallography, 43(4), pp.669-676.
[10] Afonine, P.V., Grosse-Kunstleve, R.W., Adams, P.D. and Urzhumtsev, A., 2013. Bulk-solvent and overall scaling revisited: faster calculations, improved results. Acta Crystallographica Section D: Biological Crystallography, 69(4), pp.625-634.
     Afonine, P.V., Grosse-Kunstleve, R.W., Adams, P.D. and Urzhumtsev, A., 2023. Bulk-solvent and overall scaling revisited: faster calculations, improved results. Corrigendum. Acta Crystallographica Section D: Structural Biology, 79(7).
[11] Shakked, Z., 1983. Anisotropic scaling of three-dimensional intensity data. Acta Crystallographica Section A: Foundations of Crystallography, 39(3), pp.278-279.
[12] Pohl, E., Schneider, T.R., Dauter, Z., Schmidt, A., Fritz, H.J. and Sheldrick, G.M., 1999. 1.7 Å structure of the stabilized REIv mutant T39K. Application of local NCS restraints. Acta Crystallographica Section D: Biological Crystallography, 55(6), pp.1158-1167.
[13] Sheldrick, G.M., 2012. Macromolecular applications of SHELX. International Tables for Crystallography (2012). Vol. F. ch. 18.9, pp. 529-533.
[14] Blanc, E., Roversi, P., Vonrhein, C., Flensburg, C., Lea, S.M. and Bricogne, G., 2004. Refinement of severely incomplete structures with maximum likelihood in BUSTER–TNT. Acta Crystallographica Section D: Biological Crystallography, 60(12), pp.2210-2221.
[15] https://phenix-online.org/documentation/faqs/refine.html#general ("Why does phenix.refine not use all data in refinement?") and https://phenix-online.org/pipermail/phenixbb/2010-December/016283.html
[16] Spek, A.L., 2015. PLATON SQUEEZE: a tool for the calculation of the disordered solvent contribution to the calculated structure factors. Acta Crystallographica Section C: Structural Chemistry, 71(1), pp.9-18.
[17] DeLaBarre, B. and Brunger, A.T., 2006. Considerations for the refinement of low-resolution crystal structures. Acta Crystallographica Section D: Biological Crystallography, 62(8), pp.923-932.
[18] Liu, C. and Xiong, Y., 2014.
     Electron density sharpening as a general technique in crystallographic studies. Journal of molecular biology, 426(4), pp.980-993.
[19] Terwilliger, T.C., Sobolev, O.V., Afonine, P.V. and Adams, P.D., 2018. Automated map sharpening by maximization of detail and connectivity. Acta Crystallographica Section D: Structural Biology, 74(6), pp.545-559.
[20] Jiang, J.S. and Brünger, A.T., 1994. Protein hydration observed by X-ray diffraction: solvation properties of penicillopepsin and neuraminidase crystal structures. Journal of molecular biology, 243(1), pp.100-115.
[21] https://www.ccp4.ac.uk/html/ctruncate.html
[22] https://staraniso.globalphasing.org/
[23] https://www.globalphasing.com/autoproc/ReleaseNotes/ReleaseNotes-autoPROC_snapshot_20190301.txt
[24] https://www.globalphasing.com/buster/wiki/index.cgi?DepositionMmCif
[25] https://www.globalphasing.com/autoproc/wiki/index.cgi?RunningAutoProcAtSynchrotrons
[26] https://www.globalphasing.com/buster/
[27] https://staraniso.globalphasing.org/table1/ (e.g. https://staraniso.globalphasing.org/table1/ar/8ar7.html)
[28] https://pdbj.org/mine/experimental_details/8AR7
[29] https://www.wwpdb.org/task/mmcif
[30] https://github.com/pdbxmmcifwg/mmcif-data-proc

--

*--------------------------------------------------------------
* Clemens Vonrhein, Ph.D.     vonrhein AT GlobalPhasing DOT com
* Global Phasing Ltd., Sheraton House, Castle Park
* Cambridge CB3 0AX, UK                   www.globalphasing.com
*--------------------------------------------------------------