Re: [ccp4bb] problem of conventions

Ian Tickle Wed, 06 Apr 2011 09:47:11 -0700

Bernhard,

> Well, "it" *IS* broke.

As they say "it works for me", so either you're using a different set
of programs from me, or you're using the same programs but in a
different way. Perhaps you could be more specific as to which
program(s) appear to be broken? If possible please post the
logfile(s) on this forum, then someone might recognise the problem(s).
Did you try reporting it to CCP4 (assuming of course we're talking
about CCP4 programs)? You're the 2nd person in this thread to claim
that the space-group handling for the alternate settings is broken, so
it would be nice to get to the bottom of it!

> If you are running some type of process, as you
> implied in referring to LIMS, then there is a step in which you move from
> the crystal system and point group to the actual space group. So, at that
> point you identify P22121. The next clear step, automatically by software,
> is to convert to P21212, and move on. That doesn't take an enormous amount
> of code writing, and you have a clear trail on how you got there.

I'm puzzled why I need a workaround for a bug that only you and
possibly James have experienced: AFAIK no-one else has reported
problems with this recently. Wouldn't it make more sense to fix
the bug(s)? - that way, everyone benefits and I don't need to do
anything! Anyway, to respond to your suggestion: I've spent some time
looking into this (so I hope you'll forgive the delay in replying!),
and unfortunately it's not as simple as you think. I can see 3 main
steps that would be required for a workaround:

Step 1 (create new crystal form entry): First I would have to make a
copy of the entry for the old crystal form in the PROTEINS table,
giving it a new unique ID. Then I would perform the
re-indexing/re-orientation operations on the reference & free-R MTZ
files and the PDB file for the refined structure, and change the
filename entries in the row of PROTEINS table just created to point to
them. This row also contains the parameters for MR, rigid-body
refinement, TLS and binding site definitions but these won't need to
be changed. The user interface would need to be modified to give
users the option of implementing this change, since I know some
(most?) users who won't be happy to do it!

One problem I foresee is confusing the users with a multiplicity of
unit cells, since we already work with potentially 2 different cells
per crystal: first the 'canonical' unit cell for the crystal form from
the reference MTZ file header; then there's the unit cell for the
isomorphous crystal as found by the indexing software. Users
understand that the indexing program won't necessarily choose the
reference cell, particularly in the situation you indicate below where
2 cell lengths are almost equal. Now you want me to add a 3rd
possibly different unit cell, i.e. that after a second run of
re-indexing to the 'standard setting'; the users won't understand the
need for this.

Next comes a tricky bit: for tracking purposes I would somehow need to
make a link from the new crystal form to the old one, my guess is with
a self-referencing foreign key. All the database applications for
doing searches & reports would need to be modified to recognise this
change. This doesn't look trivial to me! I would need to hand this
task over to the database administrator & programmers, since I'm not
involved with administration of the database. Getting a "clear trail"
doesn't happen automatically, it has to be programmed! I anticipate
some searching questions from all the users and the db admin, such as
"why do we need to do this?", "what bad things will happen if we
don't?" and "why haven't we seen these bad things happening before?".
I'm hoping that you will be able to provide convincing answers to
these questions - because I can't!

Step 2 (re-index historical data): Then I would need to copy each
entry for the historical datasets that were previously added to the
database for the old crystal form to the new crystal form (of course
it's actually the _same_ crystal form, but we're fooling the LIMS into
treating it as though it were a new one). This is so that we can
continue to track the data using the new crystal form ID. All
datasets for a given crystal form must be indexed in the same way
since the LIMS interface allows you to mix & match PDB, MTZ & MAP
files for the crystal form without the need to do superpositions (of
course superpositions can be done if needed, but then you lose the
symmetry info). These 'historical' datasets are all the ones
generated in the process of getting and optimising the crystal form,
i.e. from all the different constructs made (typically ~ 30 +- 20),
the purifications and crystallisation trials, optimising the
cryobuffer & DMSO concentration for soaking ligands, then the datasets
used during the structure determination (MR/MAD/SAD etc). This may
run to 100-150 datasets, but the actual number is immaterial since
it's just as easy to write the database application for many as for
one. So a database application would have to be developed for this:
this is not as straightforward as you seem to think.

First the easy bit: I would re-index the processed MTZ file for each
of these historical datasets. Then I would copy each corresponding
row in the CRYSTALS table, changing the filename to the re-indexed MTZ
file and permuting the cell lengths. For the re-indexing I use the
CCP4 REFINDEX program that I wrote for this purpose: this
automatically re-indexes a dataset to maximise the correlation of the
F^2 between the 'reference' dataset and the new one. I know I could
also use POINTLESS for this, but I wrote REFINDEX while we were
developing the LIMS software from 1999-2001, since nothing equivalent
was available at the time (POINTLESS was not developed until several
years later).
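
To illustrate the idea only (this is not the REFINDEX code, and all
names below are hypothetical), here is a minimal Python sketch of
choosing a re-indexing by correlation of F^2. It assumes an
orthorhombic crystal with reflections already reduced to non-negative
indices, so that only the six axis permutations need to be tried:

```python
from itertools import permutations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def best_reindexing(reference, new):
    """Pick the axis permutation that best matches `new` to `reference`.

    Both arguments map (h, k, l) -> intensity (F^2).
    Returns (permutation, correlation); permutation is None if no
    permutation gave at least 3 common reflections.
    """
    best_perm, best_cc = None, -2.0
    for perm in permutations(range(3)):
        ref_vals, new_vals = [], []
        for hkl, intensity in new.items():
            reindexed = tuple(hkl[i] for i in perm)
            if reindexed in reference:
                ref_vals.append(reference[reindexed])
                new_vals.append(intensity)
        if len(ref_vals) < 3:
            continue  # too few common reflections to judge
        cc = pearson(ref_vals, new_vals)
        if cc > best_cc:
            best_perm, best_cc = perm, cc
    return best_perm, best_cc
```

Because the choice is driven by the intensities, the cell lengths never
enter into it.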

The primary key which identifies the new dataset would obviously need
to be updated (it has to be unique) - we can't simply change the old
primary key because it's used as a foreign key in several other
tables: for example the crystal ID is referenced, among others, in the
tables which contain info related to transport of the crystals to the
synchrotron and/or the mounting robot assignment (i.e. cane/puck
positions). This row in the table also contains statistics (Rmerges
etc) and other info (e.g. mosaicity, phi range & step, image file
location) extracted from the processing logfiles, but again this info
can be simply copied across. I would need to add an entry in the JOBS
table (which will also contain links to the old crystal IDs), to
record the fact I had done all this (i.e. foreign key to new crystal
ID, user ID, date/time, protocol name/version, command line,
completion status).

For convenience, again in the CRYSTALS table I would need to link the
old data entries to the new ones, so we have a record of what was
done, and to deal with the fact that there would still be foreign keys
in several other tables pointing at all the _old_ crystal ID primary
keys. So again we would need to add a self-referencing foreign key to
the CRYSTALS table, and ensure the upgraded database applications can
also recognise this new column, again not trivial.
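
As a rough sketch of what a self-referencing foreign key looks like
(using SQLite from Python purely for illustration; the column names
and helper functions are hypothetical and not the real LIMS schema):

```python
import sqlite3


def make_db():
    """Create an in-memory database with a toy CRYSTALS table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only if asked
    conn.execute("""
        CREATE TABLE crystals (
            crystal_id     INTEGER PRIMARY KEY,
            mtz_file       TEXT NOT NULL,
            a REAL, b REAL, c REAL,
            -- self-referencing foreign key: NULL for original entries
            reindexed_from INTEGER REFERENCES crystals(crystal_id)
        )
    """)
    return conn


def add_reindexed_copy(conn, old_id, new_mtz, new_cell):
    """Insert a re-indexed copy linked back to the old row; return its ID."""
    cur = conn.execute(
        "INSERT INTO crystals (mtz_file, a, b, c, reindexed_from) "
        "VALUES (?, ?, ?, ?, ?)",
        (new_mtz, *new_cell, old_id),
    )
    return cur.lastrowid


def original_id(conn, crystal_id):
    """Follow the reindexed_from links back to the original entry."""
    while True:
        row = conn.execute(
            "SELECT reindexed_from FROM crystals WHERE crystal_id = ?",
            (crystal_id,),
        ).fetchone()
        if row is None:
            raise KeyError(crystal_id)
        if row[0] is None:
            return crystal_id
        crystal_id = row[0]
```

Every search or report that joins on the crystal ID would need
something like original_id() to see the old and new rows as one
crystal, which is where the real work lies.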

Step 3 (process new data): Finally, for each new dataset I would
process the data in the normal way, pretty well as you suggest.
However I already foresee another problem here: the crystal form is
currently recognised from the lattice type (P, I, C etc), the point
group and the unit cell volume; the software chooses the unit cell
with the same lattice type & point group which has the closest volume
(+- 20%) to the reference cell. The implicit assumption we have made
in the design is that no two crystal forms for a given protein can
have all 3 criteria equal at the same time (we've never seen
exceptions to this). However you are now proposing to violate this
assumption, since all 3 criteria will be identical for the old & new
crystal forms. There's a 50% chance that the software would re-index
the data into the old crystal form instead of the new one and we would
have changed nothing! Clearly, we would need to add another criterion
(e.g. the space group) to resolve the ambiguity.
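
The ambiguity can be seen in a small Python sketch of the matching
rule. This is an illustration under assumed names (CrystalForm,
match_form), not the actual LIMS code, and it raises an error on a tie
rather than picking one form at random:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrystalForm:
    form_id: str
    lattice: str                 # "P", "I", "C", ...
    point_group: str             # e.g. "222"
    volume: float                # reference cell volume in A^3
    space_group: Optional[str] = None


def match_form(forms, lattice, point_group, volume,
               space_group=None, tol=0.20):
    """Return the form with matching lattice and point group whose
    reference volume is closest (within +-tol), or None if none fits.

    If space_group is given it is used as an extra criterion.
    Raises ValueError when two forms are equally close.
    """
    candidates = [
        f for f in forms
        if f.lattice == lattice
        and f.point_group == point_group
        and abs(volume - f.volume) <= tol * f.volume
    ]
    if space_group is not None:
        candidates = [f for f in candidates if f.space_group == space_group]
    if not candidates:
        return None
    smallest = min(abs(volume - f.volume) for f in candidates)
    closest = [f for f in candidates if abs(volume - f.volume) == smallest]
    if len(closest) > 1:
        raise ValueError("ambiguous crystal form: "
                         + ", ".join(f.form_id for f in closest))
    return closest[0]
```

With an old form in P22121 and a re-indexed copy in P21212 the first
three criteria are identical, so the call only succeeds once the space
group is supplied.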

> To be even more intrusive, what if you had cell parameters of 51.100,
> 51.101, and 51.102, and it's orthorhombic, P21212. For other co-crystals,
> soaks, mutants, etc., you might have both experimental errors and real
> differences in the unit cell, so you're telling me that you would process
> according to the a < b < c rule in P222 to average and scale, and then it
> might turn out to be P22121, P21221, or P21212 later on? When you wish to
> compare coordinates, then you have re-assign one coordinate data to match
> the other by using superposition, rather than taking on an earlier step of
> just using the conventional space group of P21212?

No, the cell lengths are irrelevant even if they're almost equal,
since as I mentioned above, REFINDEX tries all possible re-indexings
and maximises the correlation coefficient of the _F^2_. So yes the
data is always processed as P222 using the a<b<c rule, but then it may
be re-indexed to the reference (and the correct space group assigned
from the reference MTZ header) so that (nearly) isomorphous datasets
are then all indexed identically. When I advised against re-indexing
earlier, I was talking about re-indexing to a 'standard setting'
without a good reason: you will recall I said "isomorphism overrides
convention". The PDB file is re-oriented using the inverse transposed
re-indexing matrix, it's not necessary to use superposition (though it
would of course give a similar result).
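
For the record, here is a minimal Python sketch of that re-orientation.
It assumes the re-indexing is written as h' = M h with column vectors
(the source does not state which convention the programs use), and it
works on fractional coordinates; converting to and from the orthogonal
coordinates stored in a PDB file is left out. Function names are
hypothetical:

```python
from fractions import Fraction


def transpose(m):
    return [[m[j][i] for j in range(3)] for i in range(3)]


def inverse(m):
    """Exact inverse of a 3x3 matrix via cofactors."""
    m = [[Fraction(x) for x in row] for row in m]

    def cofactor(i, j):
        r = [x for x in range(3) if x != i]
        c = [x for x in range(3) if x != j]
        minor = m[r[0]][c[0]] * m[r[1]][c[1]] - m[r[0]][c[1]] * m[r[1]][c[0]]
        return minor if (i + j) % 2 == 0 else -minor

    det = sum(m[0][j] * cofactor(0, j) for j in range(3))
    if det == 0:
        raise ValueError("singular re-indexing matrix")
    # inverse[i][j] = cofactor(j, i) / det
    return [[cofactor(j, i) / det for j in range(3)] for i in range(3)]


def apply(m, v):
    """Matrix times column vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def reorient_fractional(reindex, xyz):
    """Transform fractional coordinates with the inverse transposed
    re-indexing matrix, so that h.x is unchanged for every reflection."""
    return apply(transpose(inverse(reindex)), xyz)
```

For a pure axis permutation the inverse transpose is the permutation
matrix itself, so the coordinates are simply permuted in the same way
as the indices.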

> Again, while I see use of the a < b < c rule when there isn't an
> overriding reason to assign it otherwise, as in P222 or P212121, there
> *is* a reason to stick to the convention of one standard setting. That's
> the rationale on using P21/n sometimes vs. P21/c, or I2 vs C2, to avoid a
> large beta angle, and adopt a non-standard setting.

So what is the reason to stick to one standard setting? If there's
already an isomorphous structure in that setting I can see its value,
but how does it help in the case there's no similar structure?
Equally if say there's an isomorphous structure already in the
non-standard setting it would be sensible to use that: "isomorphism
always trumps convention".

You say "convention of one standard setting": I wasn't aware that such
a convention existed, and IT certainly doesn't mention one (how could
it? It would then be inconsistent with
itself!). I would like to be reassured that a group of experts has
considered the details of such a convention at length and produced
readily accessible reference documentation. Can you provide a
reference to such documentation for your convention? Like you, I did
my doctorate in crystallography (small molecule) and I was taught that
IT was the 'bible' in all matters crystallographic! You seem to want
to pick the parts of the convention you like but have exceptions for
those that you don't. It would be like saying that you will use the
SI convention on units, with the exception that you will use feet
instead of metres (so instead of 1.1 metres you would have "3 ft 20
cm"!).

What's so special about the 'standard setting' anyway? In the 1935 &
1952 IT editions the 'standard setting' was chosen arbitrarily only as
a _representative_ setting for illustrative purposes (from which the
reader was expected to derive the other settings by permutation), and
the corresponding 'standard symbol' was used as the page heading for
indexing purposes. Those were their sole functions. Please take a
look at IT/A: to ensure we're seeing the same info I suggest we both
look at the 2006 online edition which can be found here:

http://ahrenkiel.sdsmt.edu/courses/Spring_2011/NANO704/International_Tables_For_Crystallography_A.pdf

On p.39 in the PDF document (p.22 on the printed page) it reads:

In the earlier (1935 and 1952) editions of International Tables,
only one setting was illustrated, in a projection along c, so that it
was usual to consider it as the ‘standard setting’ and to accept its cell
edges as crystal axes and its space-group symbol as ‘standard
Hermann–Mauguin symbol’. In the present edition, however, all six
orthorhombic settings are illustrated, as explained below.

(there are 6 orthorhombic settings excluding all those related by
simply negating 1 or more axes; however of course only 3 of these can
be differently labelled).

In the 2002 edition the entry for C2 covers 8 pages (pp 124-131): only
one of those illustrates the 'standard setting' (i.e. the setting
corresponding to the 'standard symbol' in the page heading). This use
of one of the settings chosen arbitrarily had nothing to do with the
choice of axes, which is a totally unrelated issue - all settings have
equal status in this respect.

If there hadn't been the need to save paper after WW2, all alternative
settings might have been illustrated in the 1952 edition and there
would have been no need for a "standard setting"! In fact later
editions of IT/A tabulate _all_ settings in chapters 3 & 4
("Space-group Determination" & "Synoptic Tables of Space-Group
Symbols") and the latest editions also illustrate all the settings, so
the "standard setting" concept is now largely redundant. Look up the
subject index (PDF p.909) for "standard setting" (or even "setting,
standard"). Actually don't, because you won't find it! If the
concept of "standard setting" is as critical as you claim, wouldn't it
deserve at the very least an index entry and a dedicated section
explaining your "standard settings convention"? In reality it gets a
few brief mentions, mainly in the historical context.

IT does actually give an example of space-group assignment which is
relevant to this discussion, see PDF p.60 (printed p.45) in the
chapter by the late & great Martin J. Buerger (I remember "Elementary
Crystallography" and "The Precession Method" with the hard grey covers
well!) :

The diffraction pattern of a compound has Laue class mmm.
The crystal system is thus orthorhombic. The diffraction spots
are indexed such that the reflection conditions are 0kl : l = 2n;
h0l : h = l = 2n; h00 : h = 2n; 00l : l = 2n. Table 3.1.4.1 shows
that the diffraction symbol is mmmPcn–. Possible space groups
are Pcn2 (30) and Pcnm (53). For neither space group does the
axial choice correspond to that of the standard setting. For No.
30, the standard symbol is Pnc2, for No. 53 it is Pmna.

It doesn't say what the cell lengths are, but I would guess this is a
small-molecule crystal (no prizes for guessing why!) and so the unit
cell is likely to have been chosen in full knowledge of the two
space-group possibilities (which differ only in a mirror plane). Yet
a non-standard setting was chosen! No mention here of the vital
importance to re-index to the standard setting!

I suspect that most people who think that there's a "standard
settings" convention believe it because that's what they were taught
(and also what their teachers were taught!), or if they actually have
consulted IT/A at all they will have only looked at the space-group
diagrams. If they haven't taken the trouble to read the important
explanatory chapters (1 through 5 and 8 through 15) which precede and
follow chapter 7 (which contains the diagrams for the 3-D space
groups), it's easy to see how they would have come to the mistaken
conclusion concerning the significance of the "standard symbols" in
the page headings!

> Finally, if you think it's fine to use P22121, then can I assume that you
> also allow the use of space group A2 and B2?

If there's an existing isomorphous structure in the same orientation
(though possibly in a different space group), then yes. As the
conventional cell with no good reason to choose otherwise, then no,
simply because the IUCr/NIST unit cell convention won't generate those
space groups! To be clear, here's a summary of the steps involved in
using the convention:

1) Choose a unit cell which has the full point symmetry of the
diffraction pattern, or a supergroup thereof (e.g. for all trigonal
point groups the unit cell has point symmetry 6/mmm).

2) Step 1 will generate an infinite number of cells containing
different numbers of lattice points related by integer multiples of the
lattice vectors (i.e. not counting centred cells where the lattice
translations are fractions of the lattice vectors), so choose the
one(s) with the smallest number of integer lattice points.

3) In the case that the point group has a unique axis (i.e. 2, 3, 4 or
6-fold), steps 1 & 2 will generate unit cells having different
orientations of the unique axis, so choose the one(s) which have the 2
|| b in monoclinic or the 3/4/6 || c in tri/tetra/hexagonal.

4) In triclinic & monoclinic steps 1, 2 & 3 will produce cells where
the lattice vectors are not the shortest, so choose cell(s) having the
shortest lattice vectors (e.g. this is the cell with beta nearest to
90 deg in monoclinic).

5) In triclinic & monoclinic the previous steps will still give
multiple choices of cell angle(s), so in triclinic choose cell angles
either all <= 90 or all >= 90; in monoclinic choose beta >= 90.

6) In centred monoclinic space groups, the previous steps will
generate A-, C- and I-centred cells. Eliminate the A (there will
always be an equivalent C cell since you can swap a and c).

7) In the R-centred hexagonal cell, there will remain an ambiguity
between the 'obverse' & 'reverse' settings: choose the obverse.

8) Finally if an ambiguity in the orientation of the cell still
remains (in triclinic, primitive & I-centred monoclinic and all
orthorhombic), apply the rule a <= b <= c.

This procedure will generate a unique unit cell; note that A centring
is eliminated at step 6, B centring with unique axis b is eliminated
at step 2 because it contains 2 integer lattice points, and B centring
with unique axis a or c is eliminated at step 3.
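
Step 8 for the primitive orthorhombic case can be sketched in a few
lines of Python. This is an illustration only: the function name is
made up, it handles only point group 222 symbols where each axis
carries its own "2" or "21", and it shows how the symbol follows from
the cell once the space group is eventually known (the cell itself is
chosen without it):

```python
def conventional_orthorhombic(cell, axis_symbols):
    """Apply a <= b <= c to a primitive orthorhombic (222) cell.

    cell         -- (a, b, c) cell lengths
    axis_symbols -- per-axis symbols, e.g. ("21", "21", "2") for P21212
    Returns (sorted_cell, symbol); the axis symbols are permuted along
    with the axes they belong to.

    Handedness is ignored here: an odd permutation needs one axis
    negated to stay right-handed, which does not change these symbols.
    """
    # sorted() is stable, so equal lengths keep their original order
    order = sorted(range(3), key=lambda i: cell[i])
    new_cell = tuple(cell[i] for i in order)
    new_symbols = tuple(axis_symbols[i] for i in order)
    return new_cell, "P" + "".join(new_symbols)
```

For example, a cell of (50, 60, 40) with the 2-fold along the shortest
axis comes out as (40, 50, 60) in P22121, not P21212.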

Note that the space group doesn't figure at all in this decision, for
the simple reason that in most cases it cannot be reliably deduced at
the point in the process where the unit cell is chosen (i.e. the
indexing step). To choose the unit cell you don't need to have
measured any intensities (how can you if you haven't indexed the spots
yet?), whereas to select the space group unambiguously, in most cases
you need to have measured the intensities first.

-- Ian
