On 12/4/2020 12:15 PM, Marcin Wojdyr wrote:
> On Fri, 4 Dec 2020 at 19:16, Dale Tronrud <de...@daletronrud.com> wrote:
>> learn the sequence you have to go to the mmCIF records that define the
>> connectivity between residues.  It is entirely possible that "3" comes
>> before "1" because these indexes don't contain any information, other
>> than being unique within the chain.
>
> In mmCIF you have label_seq_id that must be both unique and
> sequential. So 3 is always the third residue wrt the full sequence.
>

It is very important not to read more meaning into a data tag than is actually defined in the mmCIF spec. _atom_site.label_seq_id is defined

http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.label_seq_id.html

as a pointer into the _entity_poly_seq table. It has to be a signed integer (although I'm not clear on what a negative value for a pointer means). In that table there is a data item _entity_poly_seq.num,

http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_entity_poly_seq.num.html

which is not a pointer, not an ID, but a name for that particular _entity_poly_seq row. It must be a number that is unique and sequential, and presumably indicates a "sequence number". Note that the rows in _entity_poly_seq can be listed in the loop_ in any order. You can't assume that the order in which they are listed in the mmCIF says anything about connectivity. You get the order of the "things" in the sequence from _entity_poly_seq.num.

This means that the _atom_site.label_seq_id could be "3", pointing to the third entry in _entity_poly_seq which happens to have its .num equal to "1". You may not think that someone would choose to do this, but if the first .num is -15 you can't avoid a mismatch. In either case the mmCIF is perfectly acceptable and the meaning is absolutely clear.

Pulling up one of my favorite PDB entries, I get:

loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1   ILE n
1 2   THR n
1 3   GLY n
1 4   THR n
1 5   SER n
1 6   THR n
1 7   VAL n

These rows are listed in order of their .num item, and all the _atom_site.label_seq_id's will be equal to the _entity_poly_seq.num, but nothing in the spec forces that to be the case, and your software should not, ever, make that assumption. Your software should also never assume that successive rows in _entity_poly_seq are chemically linked. The order is arbitrary. You also can't assume that the row with _entity_poly_seq.num equal to "3" is chemically linked to the one with .num equal to "2", much less the chemical nature of such a link. _entity_poly_seq is not a data table that defines chemistry, only "sequence".
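As a minimal sketch of that point, the Python below recovers the sequence order from the .num values rather than from the position of the rows in the loop_. The tuples are the rows from the example above; the function name and the tuple layout are assumptions made for illustration and are not part of any mmCIF library.

```python
import random

# Rows of the _entity_poly_seq loop shown above, as
# (entity_id, num, mon_id, hetero) tuples. The tuple layout is an assumption
# for this sketch; a real reader would get these values from a CIF parser.
ROWS = [
    ("1", 1, "ILE", "n"),
    ("1", 2, "THR", "n"),
    ("1", 3, "GLY", "n"),
    ("1", 4, "THR", "n"),
    ("1", 5, "SER", "n"),
    ("1", 6, "THR", "n"),
    ("1", 7, "VAL", "n"),
]


def sequence_in_order(rows, entity_id):
    """Return the mon_id values of one entity, ordered by .num, not by row position."""
    selected = [row for row in rows if row[0] == entity_id]
    return [mon_id for _, _, mon_id, _ in sorted(selected, key=lambda row: row[1])]


# Simulate a file whose loop_ lists the rows in an arbitrary order.
shuffled = ROWS[:]
random.Random(0).shuffle(shuffled)

print(sequence_in_order(shuffled, "1"))
# prints ['ILE', 'THR', 'GLY', 'THR', 'SER', 'THR', 'VAL']
```

The result is the same however the rows are shuffled, and it says nothing about which residues are chemically linked. That information has to come from other records.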

The whole point of a proper data base structure is that you don't assume anything! All information has to be specifically encoded in the tables of the data base. If your software makes use of a particular tag, you should go to the definition of that tag and use it, and not make additional extrapolations about it.

I'm not saying that the data tag definitions of mmCIF are perfect, far from it. But the foundation of CIF is sound and you have to stick with that formal structure, based in data base theory, if you are going to get the benefit of a proper data base.

We have been used to the slap-dash world of PDB format for decades, where we try to make it work by stuffing extra characters on the end of the line or into a little gap whose real purpose has been forgotten. This has led to nothing but grief. When I was writing my refinement program I can tell you that the most complex and difficult subroutine system was the one trying to read PDB files. There were PDB files that had the number of electrons in the atom written in the occupancy column! Some had the name of a calcium atom shifted to the left and some did not, making them indistinguishable from Calpha atoms. The PDB format is an insane mess and is completely unworkable. Please, let it die!
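To make the calcium example concrete, here is a small Python sketch. It assumes the usual PDB convention that the atom name occupies columns 13-16 and that the element symbol is right-justified in columns 13-14, so a calcium ion is written "CA  " and an alpha carbon " CA ". The helper names are invented for this illustration, and the builder fills in only the first 26 columns of a record.

```python
def pdb_atom_line(record, serial, name_field, resname, chain, resseq):
    """Build the first 26 columns of an ATOM/HETATM record (illustrative only).

    name_field must be exactly 4 characters; it lands in columns 13-16.
    """
    return f"{record:<6}{serial:>5} {name_field} {resname:>3} {chain}{resseq:>4}"


def classify_ca(line):
    """Guess what a 'CA' atom is from the alignment of the name field alone."""
    raw = line[12:16]  # columns 13-16, zero-based slice
    if raw == " CA ":
        return "C-alpha"
    if raw == "CA  ":
        return "calcium"
    return "unknown"


calpha = pdb_atom_line("ATOM", 2, " CA ", "ALA", "A", 1)
calcium = pdb_atom_line("HETATM", 1000, "CA  ", "CA", "A", 201)
# A writer that ignores the alignment convention for the calcium ion:
misaligned = pdb_atom_line("HETATM", 1000, " CA ", "CA", "A", 201)

print(classify_ca(calpha))      # prints C-alpha
print(classify_ca(calcium))     # prints calcium
print(classify_ca(misaligned))  # prints C-alpha, which is wrong for a calcium ion
```

The only thing separating the two atoms in the name field is a space, and a reader that strips whitespace, or a writer that ignores the alignment, loses the distinction entirely.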

The problem with Dr. Croll's suggestion "Using chain A as an example, perhaps the glycans could become Ag1, Ag2, etc.?" is that it loads connectivity information into names. How can one write a standard database validation script to verify the correctness of this information? You have defined a meaning to the characters in a "name" which is not defined in the data schema. On the other hand, the data in the mmCIF, as currently defined, is certainly complete enough that his software could generate names of this style for display to his users. His user interface is not limited by mmCIF in any way, and "value added" features like this might make his software even more successful.
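A rough sketch of what such a display layer might look like, in Python. The (polymer chain, glycan chain) pairs are made-up values standing in for whatever a program would extract from the explicit link records in the file (for example, the partner chain identifiers in _struct_conn). The function name and the "g" naming pattern are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical (polymer chain ID, glycan chain ID) pairs. In a real program
# these would be derived from explicit connectivity records in the mmCIF,
# not from the characters of the chain IDs themselves.
LINKS = [("A", "C"), ("A", "D"), ("B", "E")]


def display_names(links):
    """Map each glycan chain ID to a display-only label such as 'Ag1'."""
    counters = defaultdict(int)
    names = {}
    for polymer, glycan in links:
        if glycan in names:
            continue  # already labelled through an earlier link
        counters[polymer] += 1
        names[glycan] = f"{polymer}g{counters[polymer]}"
    return names


print(display_names(LINKS))
# prints {'C': 'Ag1', 'D': 'Ag2', 'E': 'Bg1'}
```

The identifiers stored in the file stay as they are. The labels exist only in the user interface and can be regenerated whenever the connectivity records change.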

I certainly agree that the names chosen by the authors are of considerable value when examining their model in the light of their paper and understanding. My understanding is that mmCIF has places for these names. I do find it distressing that the PDB has chosen to change the chain names, water molecule names, and some residue names, in pretty much all of the models I've deposited, and they've changed them multiple times. That is not a problem with mmCIF, but an administrative choice of the PDB.

CIF and mmCIF are huge leaps forward, but they represent a new data model that many scientists have not been trained in. Everyone who writes code to read and write them should *not* try to learn them by simply picking up a few examples and guessing what is meant in their structure. Pick up a book, or find a web site, on relational data base theory and design, and then start looking through the tag definitions for mmCIF.

Dale Tronrud

P.S. None of this tirade should be taken as support for, or argument with, any of the PDB's "remediation" efforts over the years. The value of their continual churning of the models in their care is independent of the value of the basic structure of mmCIF.

On 12/4/2020 12:15 PM, Marcin Wojdyr wrote:
On Fri, 4 Dec 2020 at 19:16, Dale Tronrud <de...@daletronrud.com> wrote:

     Creating meaning in the chain names "A, B, C, Ag1, Ag2, Ag3" is
exactly the problem.

It's not about "creating meaning" but about consistent naming. For humans.

"chain names" ( or "entity identifiers" if I
recall the mmCIF terminology correctly) are simply database "indexes".

No, entity is a somewhat different thing (multiple chains can point to
the same entity). entity_id is specified in addition to label_asym_id
and auth_asym_id.
asym = "structural element in the asymmetric unit" (so-called chain).

The values of indices are meaningless in themselves, they are just
unique values that can be used to unambiguously identify a record. In
principle, you could just assign random ISO characters (I don't think
mmCIF allows unicode) and the mmCIF would be considered identical.

And then you'd use this random string also in a publication when
referring to the chain, and in the user interface?

     You are trying to force meaning to the characters with an index, and
that puts multiple types of information in a single field. As Robbie
said already exists, if you want to encode connectivity into the data
base you have to add records that define that connectivity.  That places
the connectivity information explicitly in the data models and allows
standard data base tools to track and validate.

No one was proposing to replace connectivity with names.
It was about naming that will be easier to work with for people.

