Dear Pavel & CCP4bb readers,

On Wed, Feb 14, 2024 at 08:28:03PM -0800, Pavel Afonine wrote:
> What follows below is not very specific to the particular program
> (STAIRSANISO) nor the original questions, but nonetheless, I believe it is
> relevant.

Thanks for joining the discussion: always good to have different viewpoints
or opinions made visible - especially for less knowledgeable users and
readers of the CCP4bb.

And apologies to anyone getting tired of "another long post" here, but
some remarks do require follow-ups that hopefully will help keep the
discussion at a level useful to all readers.

> In the past, performing any adjustments to the diffraction data intended
> for solving and refining atomic models was more or less considered taboo.

That is a very broad statement that I have trouble making sense of: what do
you mean by "adjustments" and what do you mean by "diffraction data"?
If we are truly looking at diffraction data as it comes out of our
experiment, we are looking at the raw images, right?  Those are then
handled roughly as follows (as an example for MX):

  * initial integrated intensities (simplifying 3D pixel data)
  
  * profile fitting of integrated intensities

  * scaling (with various parametrisation models)

  * selection of data (excluding image ranges due to radiation damage or
    because a crystal moves out of the beam, excluding/handling ice-ring
    contamination, selecting datasets in SSX etc)

  * adjustment of error model (to get "meaningful" error estimates,
    i.e. sigma values)

  * outlier rejection (based largely on those sigmas)

  * merging (inverse-variance weighted)
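
The last step can be illustrated with a minimal sketch of an
inverse-variance weighted merge of repeated measurements of a single
reflection (the numbers are made up, and real programs such as AIMLESS do
considerably more):

```python
import numpy as np

# Hypothetical repeated measurements of one reflection after scaling:
I = np.array([105.0, 98.0, 120.0])     # intensities
sig = np.array([5.0, 4.0, 12.0])       # their sigma estimates

w = 1.0 / sig**2                       # inverse-variance weights
I_merged = np.sum(w * I) / np.sum(w)   # weighted mean intensity
sig_merged = np.sqrt(1.0 / np.sum(w))  # sigma of the weighted mean

print(I_merged, sig_merged)
```

Note how the poorly measured third observation contributes very little:
this is exactly why the "adjustment of error model" step above matters so
much for the merged values.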

Maybe all those "adjustments to the diffraction data" are not what you are
referring to in your remark above? So let's assume you are referring to the
merged intensity data after all of the above steps as being "the
diffraction data" ... and what we are doing after that:

  * conversion from intensities to amplitudes (using different methods and
    priors) which most often will include an adjustment of weak and
    negative intensities (e.g. via the French & Wilson method [1]).

  * decision on which reflections to use for subsequent steps

    - defined by geometric constraints (we can only use those that hit the
      detector)

    - defined by some significance criterion

    resulting in an adjustment of the dataset (not the values themselves)
    coming out of the raw diffraction data.
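
The effect of a French & Wilson-style treatment of weak and negative
intensities can be sketched numerically for the simplified acentric case.
This assumes a known Wilson parameter Sigma and ignores centric
reflections, resolution binning etc - it is only meant to illustrate why a
naive F = sqrt(I) is not used:

```python
import numpy as np

def french_wilson_acentric(I, sigma, Sigma, n=20001):
    """Posterior-mean amplitude <sqrt(J)> for one acentric reflection.

    Gaussian likelihood for the measured intensity I (error sigma),
    acentric Wilson prior exp(-J/Sigma) for the true intensity J >= 0.
    """
    J, dJ = np.linspace(0.0, max(I, 0.0) + 10.0 * sigma, n, retstep=True)
    post = np.exp(-0.5 * ((J - I) / sigma) ** 2 - J / Sigma)
    post /= post.sum() * dJ                       # normalise numerically
    return float((np.sqrt(J) * post).sum() * dJ)  # E[sqrt(J)] = F estimate

# A strong reflection comes out essentially as sqrt(I) ...
print(french_wilson_acentric(I=100.0, sigma=3.0, Sigma=50.0))
# ... while a weak negative measurement still yields a small,
# physically sensible positive amplitude:
print(french_wilson_acentric(I=-2.0, sigma=3.0, Sigma=50.0))
```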

Maybe this is still not what you are referring to as "adjustments to
the diffraction data"? Let's see what additional "adjustments to the
diffraction data" might happen further along ...

  * anisotropic scaling of the diffraction data /without/ the use of any
    atomic model, as provided e.g. by

    - the UCLA Anisotropy Server [2,3], using the anisotropy analysis
      from Phaser [4]

    - STARANISO [5], using its own analysis

  * relative anisotropic scaling of the diffraction data and the current
    model in refinement, e.g. in

    - REFMAC [6,7]

        Note: this includes writing a set of observed amplitudes into the
        output MTZ file that have been corrected using the model-based
        overall anisotropy factors (as far we know and at least up to
        version 5.8.0352). So any "structure factor" deposition using only
        the output reflection data from such a run will have anisotropy
        corrected observed observed data in the PDB archive. Our
        aB_deposition_combine tool described below detects and undoes this
        (when combining the reflection mmCIF data from processing with the
        reflection data after refinement) to ensure that data exactly as
        used as /input/ to the refinement program is deposited.

    - CCTBX [8] and Phenix [9,10]

    - SHELX [11,12,13]

    - BUSTER [14]

  * classification and rejection of model-based outlier reflections

    - in Phenix [15] (still default?)

  * DFc completion for missing observations in 2mFo-DFc electron density maps

    - default in REFMAC (into a single FWT/PHWT set by default, as far as
      we know)

    - default in Phenix (into an additional set of map coefficients?)

    - default in BUSTER (into two additional sets of map coefficients,
      2FOFCWT_iso-fill/PH2FOFCWT_iso-fill using a sphere and
      2FOFCWT_aniso-fill/PH2FOFCWT_aniso-fill using the anisotropic cut-off
      information from STARANISO).
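
Whatever the program and however the tensor is determined, the core
operation of the anisotropic scaling mentioned in several items above is a
smooth, reflection-dependent rescaling. A minimal sketch for an
orthorhombic cell, with a made-up traceless tensor dB standing in for the
anisotropic part of the overall B factor (signs and conventions differ
between programs):

```python
import numpy as np

def aniso_correction(hkl, rec_cell, dB):
    """Multiplicative correction exp(+(1/4) s^T dB s) for one reflection.

    hkl      : Miller indices (h, k, l)
    rec_cell : reciprocal cell lengths (a*, b*, c*), orthorhombic case
    dB       : 3x3 anisotropic part of the overall B tensor (made up here)
    """
    s = np.array(hkl, dtype=float) * np.array(rec_cell)  # reciprocal vector
    return float(np.exp(0.25 * s @ dB @ s))

# Traceless example tensor: attenuation is strongest along c*, so
# reflections with large l get boosted the most by the correction.
dB = np.diag([-4.0, -4.0, 8.0])
rec = (0.01, 0.01, 0.01)
print(aniso_correction((0, 0, 20), rec, dB))   # boosted
print(aniso_correction((20, 0, 0), rec, dB))   # attenuated
```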

Which of all of the above are you referring to as being "considered taboo"?
It would be helpful if you could clarify this so that we can then focus on
that particular point in our discussion.

> When cryo-EM emerged as a competitor to x-ray crystallography, the paradigm
> began to shift. In cryo-EM, manipulations applied to the data (the map) are
> a standard practice. The map can be boxed, filtered (sharpened, blurred,
> etc.), modified (e.g., setting something outside the molecular region), and
> so forth; you name it. One might wonder why the same isn't done to x-ray data.

I don't think it is true that this "isn't done to x-ray data":

 * In small-molecule crystallography there is the "SQUEEZE" procedure [16],
   which does modify the diffraction data

 * Electron-density sharpening [17] is widely used/described [18,19]

> Historical analogies include truncating data beyond 6-8Å resolution
> to avoid dealing with the bulk solvent

Maybe that should be rephrased from

 ... truncating data beyond 6-8Å resolution to avoid dealing with the bulk
 solvent ...

to

 ... truncating data below 6-8Å resolution because we couldn't deal with
 the bulk solvent at the time (but once we could [20], there was no longer
 a need for that) ...

That's then a fairly normal evolution of science/methods: we do the best we
can at a given time with the tools at our disposal while trying to develop
better methods that might complement or replace those existing tools.

> or default sharpening (a feature available in X-plor for some time,
> then removed for obvious reasons, AFAIK), choosing resolution limits
> (PAIRREF), and anisotropic data massaging by the UCLA server as a
> more recent example. STAIRSANISO is the leader in doing things along
> these lines as of today.

Your wording here makes it very difficult to take that seriously: describing
the whole range of features provided by STARANISO (sic) as "massaging data"
takes us into the quicksands of polemics ... I'm sure we can do better than
that.

Which component are you criticizing here exactly? Maybe by being a bit more
explicit we can have a scientific discussion about those items and come to
some understanding (or agree to disagree) that is useful to the average
reader of these CCP4bb threads. We have:

 (1) The analysis of anisotropy in the data (as also provided by
     e.g. Phaser [4] or ctruncate [21])?

 (2) The selection of reflection data without an isotropic constraint (a
     constraint that would lead to a spherical cut-off)?

     ==> this can be switched off on the STARANISO server [22]

 (3) The anisotropic scaling/correction of the data according to the
     analysis in (1)?

     ==> this can be switched off on the STARANISO server [22]

We've chosen defaults in STARANISO (if run through autoPROC or through the
server) that we feel make the most sense in our hands and are based on a
lot of user feedback. We don't force users to stick with those defaults or
to use STARANISO in the first place.

> Indeed, why not if this is helpful to solve the structure?

Exactly: up to the point where STARANISO provides reflection data, no
notion of a structural model has entered any of the computations. If a
particular method of processing the raw diffraction data (images) leads to
a model and electron density map that shows more information and allows for
better interpretation and correction of that model, it clearly suggests to
me that it provides a higher information content and is useful for that
purpose.

This has to be seen obviously in the context of the methods, programs and
parametrisations we currently use: nothing is set in stone and new
developments will come along that make current approaches redundant at some
stage in the future ... it's called "progress" ;-)

> However, it's important that the deposition clearly contains and
> annotates at least the following:
> 
> - the original unmanipulated data;
> - modified data (by whatever method or program);
> - accessible information about the data that was used to obtain the final 
> deposited atomic model.

Completely agree with you (even if I would choose "unmodified" instead of
"unmanipulated" here: the choice of words matters, and I think we should
stay as neutral as possible).

That is exactly the reason why autoPROC/STARANISO has been providing
deposition-ready PDBx/mmCIF files by default since the March 2019 release
[23]. We try to explain the usage of those in great detail [24], but users
are often not aware of these files if their data were auto-processed at
synchrotrons [25] (the presentation of autoPROC/STARANISO results/files is
not always complete and could be improved upon).

There are several issues a normal user has to deal with when it comes to
PDB deposition:

 * There is often a significant time lag between data collection (and
   probably processing) and deposition: the last time I checked, this was
   on average about 2.5 years. Making sure that the model and reflection
   data from the final refinement steps are correctly associated with the
   original data processing can become tricky (which is why we provide the
   "aB_deposition_combine" tool to help users, [24]).

 * Historical baggage that seems impossible to get rid of. I'm especially
   thinking of the requirement (by the OneDep system, as far as I
   understand) for the data quality metrics (i.e. those statistics that
   describe the reflection data) to be part of the model mmCIF file. This
   basically goes back to the time when deposition of "structure factor"
   (sic) data was not compulsory (pre Feb 2008) and these items had to be in
   the model file.

   With the use of mmCIF files for the model /and/ the reflection data
   during deposition, that requirement should no longer be necessary. It
   would especially avoid the use of data-preparation tools that try to
   extract values from a variety of logfiles, with the intrinsic problems
   this entails:

    - these logfiles are by definition separate from the reflection file
      with the danger of encountering mix-ups;

    - they are completely unstructured and could (and do) change at will -
      while a mmCIF file is structured and can be validated (for format
      mainly, but also somewhat for content) against an official mmCIF
      dictionary;

 * The guidance - from documentation and deposition systems - on how best
   to provide the correct information in the correct format to the
   deposition software is too long, too scattered, not detailed enough,
   confusing, sometimes contradictory etc. We could all do much better here
   and ensure that at deposition time users need to deal primarily with the
   correct scientific content of a deposition and not with format and
   format-validation questions. The latter often seem to end up forcing
   users to deposit "something" - often under stress - as long as it goes
   through those checks and they can move on.

 * The uncanny power of sloppy throw-away remarks. I remember the times
   when everyone said "SHARP is slow and only needed as a last resort for
   really difficult cases." (for the novice readers: SHARP is a program for
   experimental phasing). Yes, it was slow back in the early 90s on SGI
   workstations etc, because it does some pretty extensive
   computations. For the last 20 years though, it has usually run in
   seconds on nearly all problems and is by far the fastest step during a
   typical experimental phasing experiment (site detection, density
   modification and automatic building are MUCH slower). But we still hear
   the same old remarks ...

   Or we could look at the discussion about Rmerge (and how we still see it
   in depositions and papers and have reviewers commenting on it being too
   high) ... 25 years after the papers pointing out its flaws?

   Now we often hear questions like "can I deposit STARANISO data?", with
   very little scientific reasoning as to why one couldn't or shouldn't. It
   all seems to be based on some fear that powerful referees, PIs or well
   known experts will complain about this at some point. These don't seem
   like very good reasons for doing or not doing something if it otherwise
   seems sound to the actual user, but pushing back against that external
   pressure as a new or one-off crystallographer is really hard. It is up
   to us (so-called) "experts" to be aware of the power we wield here - and
   use it wisely.

   By all means, have a scientific argument with us and show everyone why
   some of our methods are not doing the right thing or are buggy. We'd be
   the first to welcome any such comments because ultimately they lead to
   improved methods and programs for everyone. But remarks like "data
   massaging" or "manipulated" have a real negative impact without adding
   anything to such a discussion ...

The bottom line for our software [24]: it should be trivial to provide (a)
the deposition-ready model mmCIF file (coming from Phenix, REFMAC or
BUSTER) that contains the correct data quality metrics, and (b) a
deposition-ready reflection data mmCIF file including all the above
datablocks described by Pavel.

> Note *accessible* above as this is the key for what follows below.
> 
> Let's consider this example: https://files.rcsb.org/download/6R72-sf.cif ,
> which is representative of the class of problems I'm trying to convey here.

That is one example (but maybe not a good one, see below) - a better one
would be e.g. 8ar7.

> The file has everything, kudos to the authors: The original data, the
> manipulated data and a whole lot more.
> 
> Are these data accessible?
> 
> YES, if you download the file, open it in your favorite text editor, and
> carefully scroll and read through its 76,566 lines and use your best guess
> to infer what are the original data arrays, what are the modified data
> arrays and so on.

An mmCIF file is not something anyone would want to look at in a text
editor! So isn't that more a problem with the software you decided to use
for fetching and inspecting that data? BUSTER provides a simple tool
("fetch_PDB_gemmi") that will fetch a PDB entry and not only extract the
reflection data for each block, but also the explanations they carry. Here
is an excerpt of what it reports for the entry you picked (the full output
is a bit longer and I didn't want to make this email even less likely to be
read):

 ### merged data block #1 = r6r72sf

  data as used in refinement and resulting electron density maps.
  Converted by gemmi-mtz2cif 0.2.0

 ### merged data block #2 = r6r72Asf

  merged and scaled data post-processed by  for conversion from intensities to 
structure factor amplitudes.

 ### merged data block #3 = r6r72Bsf

  merged and scaled data from AIMLESS without any post-processing and/or data 
cut-off.

The reason I mentioned that 6r72 is not a good example is visible
above: somehow the string "STARANISO" got lost in the description
(_diffrn.details) of the second data block ... you can see that by the
incomplete sentence and the double spaces. If you do the same for 8ar7
via

  fetch_PDB_gemmi 8ar7

you get

 ### merged data block #1 = r8ar7sf
  data as used in refinement and resulting electron density maps.

 ### merged data block #2 = r8ar7Asf
  2mFo-DFc map coefficients complemented for missing data (as defined by 
SA_flag from STARANISO).

 ### merged data block #3 = r8ar7Bsf
  2mFo-DFc map coefficients complemented for missing data (within full 
resolution range).

 ### merged data block #4 = r8ar7Csf
  merged and scaled data post-processed by STARANISO for conversion from 
intensities to structure factor amplitudes and anomalous data.

 ### merged data block #5 = r8ar7Dsf
  merged and scaled EARLY (potentially least radiation-damaged) data 
post-processed by STARANISO for conversion from intensities to structure factor 
amplitudes - useful for radiation-damage detection/description maps (as e.g. 
done in BUSTER).

 ### merged data block #6 = r8ar7Esf
  merged and scaled LATE (potentially most radiation-damaged) data 
post-processed by STARANISO for conversion from intensities to structure factor 
amplitudes - useful for radiation-damage detection/description maps (as e.g. 
done in BUSTER).

 ### merged data block #7 = r8ar7Fsf
  merged and scaled data from AIMLESS without any post-processing and/or data 
cut-off.

 ### unmerged data block #8 = r8ar7Gsf
  unmerged and scaled data from AIMLESS without any post-processing and/or data 
cut-off

and an MTZ file for each of those data blocks. From that it should be very
easy to pick whichever data block you like, based on its description (which
unfortunately isn't based on a fixed vocabulary, but that could be added to
the mmCIF dictionary if needed).
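
Scripted access to those per-block descriptions does not require heroics
either. A real implementation should use a proper mmCIF parser such as
gemmi (which is what fetch_PDB_gemmi builds on); the stdlib-only sketch
below, run on an inline fragment that mimics the layout of a
structure-factor file (block names and details strings made up for
illustration), just shows how little is needed to enumerate data blocks
and their _diffrn.details:

```python
import re

# Inline fragment mimicking a PDB structure-factor mmCIF file
# (block names and details strings are invented for this example):
sf_cif = """\
data_r0xyzsf
_diffrn.details 'data as used in refinement'
#
data_r0xyzAsf
_diffrn.details
;merged and scaled data from AIMLESS without any post-processing
;
"""

def list_blocks(text):
    """Return [(block_name, details)] for each data_ block in the text."""
    blocks = []
    for m in re.finditer(r"^data_(\S+)", text, re.M):
        start = m.end()
        nxt = text.find("\ndata_", start)
        body = text[start: nxt if nxt != -1 else len(text)]
        # _diffrn.details as a quoted value or a semicolon text field:
        d = re.search(r"_diffrn\.details\s+'([^']*)'"
                      r"|_diffrn\.details\s*\n;(.*?)\n;", body, re.S)
        details = (d.group(1) or d.group(2)).strip() if d else None
        blocks.append((m.group(1), details))
    return blocks

for name, details in list_blocks(sf_cif):
    print(name, "->", details)
```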

> NO, absolutely NO, if you parse data files in PDB automatically with a
> script, and attempt to extract particular data (eg., original unmanipulated
> data). And this is what I find problematic, especially given 215+k entries
> in PDB as of today.
> 
> Hope someone does something about it!

Well, from our side I think we've already done a fair amount here through

  * our software (creating and combining deposition-ready mmCIF files from
    processing+refinement and providing a tool to fetch archived PDB
    entries),

  * tools like our "Table 1" server [27]

  * very useful discussions with e.g. the PDBj that has resulted in a much
    enriched description for the data archived with a given entry [28], and

  * our work within the PDBx/mmCIF WG [29], especially the Processing
    Subgroup [30].

If the software systems at your disposal don't provide adequate tools you
should probably discuss this with those developers ;-)

Once you have defined the "something", the best "someone" is yourself - so
feel free to join in productively rather than disparagingly.

Cheers

Clemens


[1] French, S. and Wilson, K., 1978. On the treatment of negative
    intensity observations. Acta Crystallographica Section A: Crystal
    Physics, Diffraction, Theoretical and General Crystallography,
    34(4), pp.517-525.

[2] Sawaya, M.R., 2014. Methods to refine macromolecular structures in
    cases of severe diffraction anisotropy. Structural Genomics:
    General Applications, pp.205-214.

[3] https://srv.mbi.ucla.edu/Anisoscale/

[4] McCoy, A.J., Grosse-Kunstleve, R.W., Adams, P.D., Winn, M.D.,
    Storoni, L.C. and Read, R.J., 2007. Phaser crystallographic
    software. Journal of applied crystallography, 40(4), pp.658-674.

[5] https://staraniso.globalphasing.org/

[6] Murshudov, G.N., Davies, G.J., Isupov, M., Krzywda, S. and Dodson,
    E.J., 1998. The effect of overall anisotropic scaling in
    macromolecular refinement. CCP4 newsletter on protein
    crystallography, 35, pp.37-42.

[7] Murshudov, G.N., Skubák, P., Lebedev, A.A., Pannu, N.S., Steiner,
    R.A., Nicholls, R.A., Winn, M.D., Long, F. and Vagin, A.A.,
    2011. REFMAC5 for the refinement of macromolecular crystal
    structures. Acta Crystallographica Section D: Biological
    Crystallography, 67(4), pp.355-367.

[8] Afonine, P.V., Grosse-Kunstleve, R.W. and Adams, P.D., 2005. A
    robust bulk-solvent correction and anisotropic scaling
    procedure. Acta Crystallographica Section D: Biological
    Crystallography, 61(7), pp.850-855.

[9] Afonine, P.V., Grosse-Kunstleve, R.W., Chen, V.B., Headd, J.J.,
    Moriarty, N.W., Richardson, J.S., Richardson, D.C., Urzhumtsev,
    A., Zwart, P.H. and Adams, P.D., 2010. phenix. model_vs_data: A
    high-level tool for the calculation of crystallographic model and
    data statistics. Journal of applied crystallography, 43(4),
    pp.669-676.

[10] Afonine, P.V., Grosse-Kunstleve, R.W., Adams, P.D. and
     Urzhumtsev, A., 2013. Bulk-solvent and overall scaling revisited:
     faster calculations, improved results. Acta Crystallographica
     Section D: Biological Crystallography, 69(4), pp.625-634.

     Afonine, P.V., Grosse-Kunstleve, R.W., Adams, P.D. and
     Urzhumtsev, A., 2023. Bulk-solvent and overall scaling revisited:
     faster calculations, improved results. Corrigendum. Acta
     Crystallographica Section D: Structural Biology, 79(7).

[11] Shakked, Z., 1983. Anisotropic scaling of three-dimensional
     intensity data. Acta Crystallographica Section A: Foundations of
     Crystallography, 39(3), pp.278-279.

[12] Pohl, E., Schneider, T.R., Dauter, Z., Schmidt, A., Fritz,
     H.J. and Sheldrick, G.M., 1999. 1.7 Å structure of the stabilized
     REIv mutant T39K. Application of local NCS restraints. Acta
     Crystallographica Section D: Biological Crystallography, 55(6),
     pp.1158-1167.

[13] Sheldrick, G.M., 2012. Macromolecular applications of
     SHELX. International Tables for Crystallography
     (2012). Vol. F. ch. 18.9, pp. 529-533.

[14] Blanc, E., Roversi, P., Vonrhein, C., Flensburg, C., Lea,
     S.M. and Bricogne, G., 2004. Refinement of severely incomplete
     structures with maximum likelihood in BUSTER–TNT. Acta
     Crystallographica Section D: Biological Crystallography, 60(12),
     pp.2210-2221.

[15] https://phenix-online.org/documentation/faqs/refine.html#general ("Why
     does phenix.refine not use all data in refinement?") and
     https://phenix-online.org/pipermail/phenixbb/2010-December/016283.html

[16] Spek, A.L., 2015. PLATON SQUEEZE: a tool for the calculation of
     the disordered solvent contribution to the calculated structure
     factors. Acta Crystallographica Section C: Structural Chemistry,
     71(1), pp.9-18.

[17] DeLaBarre, B. and Brunger, A.T., 2006. Considerations for the
     refinement of low-resolution crystal structures. Acta
     Crystallographica Section D: Biological Crystallography, 62(8),
     pp.923-932.

[18] Liu, C. and Xiong, Y., 2014. Electron density sharpening as a
     general technique in crystallographic studies. Journal of
     molecular biology, 426(4), pp.980-993.

[19] Terwilliger, T.C., Sobolev, O.V., Afonine, P.V. and Adams, P.D.,
     2018. Automated map sharpening by maximization of detail and
     connectivity. Acta Crystallographica Section D: Structural
     Biology, 74(6), pp.545-559.

[20] Jiang, J.S. and Brünger, A.T., 1994. Protein hydration observed
     by X-ray diffraction: solvation properties of penicillopepsin and
     neuraminidase crystal structures. Journal of molecular biology,
     243(1), pp.100-115.

[21] https://www.ccp4.ac.uk/html/ctruncate.html

[22] https://staraniso.globalphasing.org/

[23] 
https://www.globalphasing.com/autoproc/ReleaseNotes/ReleaseNotes-autoPROC_snapshot_20190301.txt

[24] https://www.globalphasing.com/buster/wiki/index.cgi?DepositionMmCif

[25] 
https://www.globalphasing.com/autoproc/wiki/index.cgi?RunningAutoProcAtSynchrotrons

[26] https://www.globalphasing.com/buster/

[27] https://staraniso.globalphasing.org/table1/
     (e.g. https://staraniso.globalphasing.org/table1/ar/8ar7.html)

[28] https://pdbj.org/mine/experimental_details/8AR7

[29] https://www.wwpdb.org/task/mmcif

[30] https://github.com/pdbxmmcifwg/mmcif-data-proc

-- 

*--------------------------------------------------------------
* Clemens Vonrhein, Ph.D.     vonrhein AT GlobalPhasing DOT com
* Global Phasing Ltd., Sheraton House, Castle Park 
* Cambridge CB3 0AX, UK                   www.globalphasing.com
*--------------------------------------------------------------
