Re: [ccp4bb] Coming July 29: Improved Carbohydrate Data at the PDB -- N-glycans are now separate chains if more than one residue

Dale Tronrud Fri, 04 Dec 2020 13:36:26 -0800

On 12/4/2020 12:15 PM, Marcin Wojdyr wrote:
> On Fri, 4 Dec 2020 at 19:16, Dale Tronrud <de...@daletronrud.com> wrote:
>> learn the sequence you have to go to the mmCIF records that define the
>> connectivity between residues.  It is entirely possible that "3" comes
>> before "1" because these indexes don't contain any information, other
>> than being unique within the chain.
>
> In mmCIF you have label_seq_id that must be both unique and
> sequential. So 3 is always the third residue wrt to the full sequence.
>

It is very important not to read more meaning into a data tag than is actually defined in the mmCIF spec. _atom_site.label_seq_id is defined


http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_atom_site.label_seq_id.html

as a pointer into the _entity_poly_seq table. It has to be a signed integer (although I'm not clear on what a negative value for a pointer means). In that table there is a data item _entity_poly_seq.num,


http://mmcif.wwpdb.org/dictionaries/mmcif_pdbx_v40.dic/Items/_entity_poly_seq.num.html

which is not a pointer, not an ID, but a name for that particular _entity_poly_seq row. It must be a number that is unique and sequential, and presumably indicates a "sequence number". Note that the rows in _entity_poly_seq can be listed in the loop_ in any order. You can't assume that the order they are listed in the mmCIF says anything about connectivity. You get the order of the "things" in the sequence from _entity_poly_seq.num.

This means that the _atom_site.label_seq_id could be "3", pointing to the third entry in _entity_poly_seq, which happens to have its .num equal to "1". You may not think that someone would choose to do this, but if the first .num is -15 you can't avoid a mismatch. In either case the mmCIF is perfectly acceptable and the meaning is absolutely clear.


   Pulling up one of my favorite PDB entries I get

loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1   ILE n
1 2   THR n
1 3   GLY n
1 4   THR n
1 5   SER n
1 6   THR n
1 7   VAL n

These rows are listed in order of their .num item, and all the _atom_site.label_seq_id's will be equal to the _entity_poly_seq.num, but nothing in the spec forces that to be the case, and your software should not, ever, make that assumption. Your software should also never assume that successive rows in _entity_poly_seq are chemically linked. The order is arbitrary. You also can't assume that the row with _entity_poly_seq.num equal to "3" is chemically linked to the one with .num equal to "2", much less assume anything about the chemical nature of such a link. _entity_poly_seq is not a data table that defines chemistry, only "sequence".
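As a minimal sketch of what "don't rely on row order" looks like in code, the Python below takes the seven rows listed above, deliberately shuffled, and recovers the sequence by sorting on _entity_poly_seq.num. The tuple layout and the function name sequence_of are illustrative assumptions, not part of any mmCIF library, and the rows are assumed to have been parsed already.

```python
# Rows of _entity_poly_seq as (entity_id, num, mon_id, hetero).
# Assumption: already parsed from the loop_; the tuple layout is hypothetical.
# The rows are shuffled on purpose: their order in the file carries no meaning.
rows = [
    ("1", 5, "SER", "n"),
    ("1", 2, "THR", "n"),
    ("1", 7, "VAL", "n"),
    ("1", 1, "ILE", "n"),
    ("1", 4, "THR", "n"),
    ("1", 6, "THR", "n"),
    ("1", 3, "GLY", "n"),
]


def sequence_of(rows, entity_id):
    """Return the monomer names of one entity, ordered by .num.

    The order comes only from the .num value, never from the position
    of a row in the list.
    """
    selected = [
        (num, mon_id)
        for eid, num, mon_id, _hetero in rows
        if eid == entity_id
    ]
    return [mon_id for _num, mon_id in sorted(selected)]


print(sequence_of(rows, "1"))
# ['ILE', 'THR', 'GLY', 'THR', 'SER', 'THR', 'VAL']
```

Note that this recovers only the "sequence". It says nothing about which residues are chemically linked; that has to come from the records that define connectivity.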

The whole point of a proper data base structure is that you don't assume anything! All information has to be specifically encoded in the tables of the data base. If your software makes use of a particular tag, you should go to the definition of that tag and use it, and not make additional extrapolations about it.

I'm not saying that the data tag definitions of mmCIF are perfect, far from it. But the foundation of CIF is sound and you have to stick with that formal structure, based in data base theory, if you are going to get the benefit of a proper data base.

We have been used to the slap-dash world of PDB format for decades, where we try to make it work by stuffing extra characters on the end of the line or in a little gap whose real purpose you have forgotten. This has led to nothing but grief. When I was writing my refinement program I can tell you that the most complex and difficult subroutine system was the one trying to read PDB files. There were PDB files that had the number of electrons in the atom written in the occupancy column! Some had the name of a calcium atom shifted to the left and some did not, making them indistinguishable from Calpha atoms. The PDB format is an insane mess and is completely unworkable. Please, let it die!

The problem with Dr. Croll's suggestion "Using chain A as an example, perhaps the glycans could become Ag1, Ag2, etc.?" is that it loads connectivity information into names. How can one write a standard database validation script to verify the correctness of this information? You have defined a meaning to the characters in a "name" which is not defined in the data schema. On the other hand, the data in the mmCIF, as currently defined, is certainly complete enough that his software could generate names of this style for display to his users. His user interface is not limited by mmCIF in any way, and "value added" features like this might make his software even more successful.
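To illustrate how display names of this style could be derived from explicit connectivity rather than stored in the file, here is a rough Python sketch. Everything in it is hypothetical: the links list stands in for protein-to-glycan attachments that a program would first extract from the connectivity records of the mmCIF, and the function name display_names and the chain IDs are made up for the example.

```python
from collections import defaultdict

# Hypothetical, pre-extracted attachments: (protein chain ID, glycan chain ID).
# Assumption: a real program would build this list from the connectivity
# records in the mmCIF, not from anything encoded in the chain names.
links = [("A", "B"), ("A", "C"), ("D", "E")]


def display_names(links):
    """Map each glycan chain ID to a display-only name such as 'Ag1'."""
    counters = defaultdict(int)
    names = {}
    # Sort so the numbering is reproducible regardless of input order.
    for protein_chain, glycan_chain in sorted(links):
        counters[protein_chain] += 1
        names[glycan_chain] = f"{protein_chain}g{counters[protein_chain]}"
    return names


print(display_names(links))
# {'B': 'Ag1', 'C': 'Ag2', 'E': 'Dg1'}
```

The names exist only in the user interface. The identifiers in the file stay opaque, and the connectivity stays in the tables where standard tools can validate it.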

I certainly agree that the names chosen by the authors are of considerable value when examining their model in the light of their paper and understanding. My understanding is that mmCIF has places for these names. I do find it distressing that the PDB has chosen to change the chain names, water molecule names, and some residue names, in pretty much all of the models I've deposited, and they've changed them multiple times. That is not a problem with mmCIF, but an administrative choice of the PDB.

CIF and mmCIF are huge leaps forward, but they represent a new data model that many scientists have not been trained in. Everyone who writes code to read and write them should *not* try to learn them by simply picking up a few examples and guessing what is meant in their structure. Pick up a book, or find a web site, on relational data base theory and design, and then start looking through the tag definitions for mmCIF.


Dale Tronrud

P.S. None of this tirade should be taken as support for, or argument with, any of the PDB's "remediation" efforts over the years. The value of their continual churning of the models in their care is independent of the value of the basic structure of mmCIF.


On 12/4/2020 12:15 PM, Marcin Wojdyr wrote:

On Fri, 4 Dec 2020 at 19:16, Dale Tronrud <de...@daletronrud.com> wrote:


     Creating meaning in the chain names "A, B, C, Ag1, Ag2, Ag3" is
exactly the problem.


It's not about "creating meaning" but about consistent naming. For humans.

"chain names" ( or "entity identifiers" if I
recall the mmCIF terminology correctly) are simply database "indexes".


No, entity is a somewhat different thing (multiple chains can point to
the same entity). entity_id is specified in addition to label_asym_id
and auth_asym_id.
asym = "structural element in the asymmetric unit" (so-called chain).

The values of indices are meaningless in themselves, they are just
unique values that can be used to unambiguously identify a record. In
principle, you could just assign random ISO characters (I don't think
mmCIF allows unicode) and the mmCIF would be considered identical.


And then you'd use this random string also in a publication when
referring to the chain, and in the user interface?

     You are trying to force meaning to the characters with an index, and
that puts multiple types of information in a single field. As Robbie
said already exists, if you want to encode connectivity into the data
base you have to add records that define that connectivity.  That places
the connectivity information explicitly in the data models and allows
standard data base tools to track and validate.


No one was proposing to replace connectivity with names.
It was about naming that will be easier to work with for people.

learn the sequence you have to go to the mmCIF records that define the
connectivity between residues.  It is entirely possible that "3" comes
before "1" because these indexes don't contain any information, other
than being unique within the chain.


In mmCIF you have label_seq_id that must be both unique and
sequential. So 3 is always the third residue wrt to the full sequence.

