Re: [ccp4bb] PDB format survey?

Warren DeLano Thu, 09 Aug 2007 17:51:08 -0700

Joe,

I feel that atom serial numbers are particularly important, since they,
combined with CONECT records, provide the only semi-standard convention
I know of for reliably encoding bond valence information in a PDB
file.

single bond = bond listed once
double bond = bond listed twice
triple bond = bond listed thrice
aromatic bond = bond listed four times.

This is a convention long supported by tools like MacroModel and PyMOL.
For example, here is formamide, where the bond between atoms 1 and 3
is listed twice:

HETATM    1  C01 C=O     1       0.000  -0.020   0.000  0.00  0.00           C
HETATM    2  N01 C=O     1       1.268  -0.765   0.000  0.00  0.00           N
HETATM    3  O02 C=O     1       0.000   1.188   0.000  0.00  0.00           O
HETATM    4  H01 C=O     1       1.260  -1.775   0.000  0.00  0.00           H
HETATM    5  H02 C=O     1       2.146  -0.266   0.000  0.00  0.00           H
HETATM    6  H03 C=O     1      -0.946  -0.562   0.000  0.00  0.00           H
CONECT    1    2
CONECT    1    3
CONECT    1    3
CONECT    1    6
CONECT    2    1    4    5
CONECT    3    1
CONECT    3    1
CONECT    4    2
CONECT    5    2
CONECT    6    1
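
The convention above can be read back mechanically. The following is a
minimal Python sketch, not code taken from MacroModel or PyMOL: it assumes
the usual fixed-column CONECT layout (source serial in columns 7-11, up to
four bonded serials in columns 12-31), and the function name bond_orders is
made up for illustration.

```python
from collections import Counter


def bond_orders(pdb_text):
    """Return {(low_serial, high_serial): times_listed} from CONECT records."""
    directed = Counter()
    for line in pdb_text.splitlines():
        if not line.startswith("CONECT"):
            continue
        # Source atom serial sits in columns 7-11 (0-based slice 6:11).
        src = int(line[6:11])
        # Bonded atom serials follow in four 5-column fields (columns 12-31).
        for start in range(11, 31, 5):
            field = line[start:start + 5].strip()
            if field:
                directed[(src, int(field))] += 1
    orders = {}
    for (a, b), count in directed.items():
        key = (min(a, b), max(a, b))
        # A bond is normally listed from both of its atoms; taking the
        # larger count also tolerates files that list it from one side only.
        orders[key] = max(orders.get(key, 0), count)
    return orders
```

Run on the ten CONECT records above, the bond between atoms 1 and 3 comes
out with a count of 2 (double) and every other bond with a count of 1.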

I second the proposal of treating the atom serial number field as a
unique string rather than a numeric quantity.

Two letter chain IDs would be fine with me, but I do think we could also
make better use of SEGI and/or MODEL to break things up while still
preserving the utility of certain other records (SHEET, HELIX, etc.)
within their existing column definitions.

However, we are still lacking a standard way of designating formal
charges, so maybe that free column could be better used for encoding a
formal charge, such as ["q", "t", "d", "-", "+", "D", "T", "Q"] over the
formal charge range of [-4,-3,-2,-1,0,1,2,3,4] -- just an idea :)...
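
As a rough illustration of the one-character charge code floated above,
here is a small Python sketch. The list gives eight symbols for nine
charges, so the sketch assumes a blank column stands for a charge of 0;
that assumption, and the names CHARGE_CODES, decode_charge and
encode_charge, are not part of the proposal.

```python
# Symbol -> formal charge. The blank entry for 0 is an assumption:
# the proposal names no symbol for a neutral atom.
CHARGE_CODES = {
    "q": -4, "t": -3, "d": -2, "-": -1,
    " ": 0,
    "+": 1, "D": 2, "T": 3, "Q": 4,
}
# Reverse lookup: formal charge -> symbol.
CODE_FOR_CHARGE = {charge: symbol for symbol, charge in CHARGE_CODES.items()}


def decode_charge(symbol):
    """Translate a one-character code into an integer formal charge."""
    return CHARGE_CODES[symbol]


def encode_charge(charge):
    """Translate an integer formal charge (-4..4) into its one-character code."""
    return CODE_FOR_CHARGE[charge]
```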

With valences plus formal charges along with expansion of the cap on
atom counts, I think we could support chemically-complete PDB files and
push back the date of PDB demise for a few more years!

A Wiki dedicated to practical PDB file hacks and extensions is a superb
idea -- of course, the goal should be to ultimately come up with a
single well-defined standard set of hacks we all agree upon by
supporting them in our code.

Cheers,
Warren 

-----Original Message-----
From: CCP4 bulletin board [mailto:[EMAIL PROTECTED] On Behalf Of
Joe Krahn
Sent: Thursday, August 09, 2007 1:15 PM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] PDB format survey?

Edward A. Berry wrote:
> Ethan A Merritt wrote:
>> On Wednesday 08 August 2007 20:47, Ralf W. Grosse-Kunstleve wrote:
>>> Implementations to generate intuitive, maximally backward compatible
>>> numbers can be found here:
>>>
>>>   http://cci.lbl.gov/hybrid_36/
>>
>> From that URL:
>>
>> ATOM  99998  SD  MET L9999      48.231 -64.383  -9.257  1.00 11.54           S
>> ATOM  99999  CE  MET L9999      49.398 -63.242 -10.211  1.00 14.60           C
>> ATOM  A0000  N   VAL LA000      52.228 -67.689 -12.196  1.00  8.76           N
>> ATOM  A0001  CA  VAL LA000      53.657 -67.774 -12.458  1.00  3.40           C
>>
>> Could you please clarify this example?
>> Is that "A0000" a hexadecimal number, or is it a decimal number
>> that just happens to have an "A" in front of it?
>> [A-Z][0-9999] gives a larger range of values than 5 bytes of
>> hexadecimal, so I'm guessing it's the former.  But the example is
>> not clear.
>>
> I'm guessing the former also. A 5-digit hex number would not be
> backwards compatible. With this system legacy programs can still
> read the files with 99999 atoms or less, and anything more than
> that they couldn't have handled anyway. Very nice!
> 
> Ed
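
For reference, a Python sketch of how a 5-column serial in the hybrid-36
style can be decoded. It reflects my understanding of the scheme at the
URL above (plain decimal up to 99999, then upper-case base-36 starting at
A0000, then lower-case base-36 starting at a0000) and is not the reference
implementation; decode_serial is a made-up name.

```python
def decode_serial(field):
    """Decode a 5-character atom serial written in the hybrid-36 style."""
    s = field.strip()
    if not s:
        raise ValueError("empty serial field")
    if s[0].isdigit() or s[0] == "-":
        # Ordinary decimal serial, exactly as legacy readers expect.
        return int(s)
    if len(s) != 5 or not s.isalnum() or s not in (s.upper(), s.lower()):
        raise ValueError("not a valid 5-character hybrid-36 field: %r" % field)
    if s[0].isupper():
        # Upper-case base-36 continues right after 99999: A0000 -> 100000.
        return int(s, 36) - 10 * 36**4 + 10**5
    # Lower-case base-36 continues right after ZZZZZ.
    return int(s, 36) + 16 * 36**4 + 10**5
```

Under this reading "A0000" is neither hexadecimal nor decimal: it is a
base-36 number shifted so that it follows 99999 directly.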
I still prefer the idea of just truncating serial numbers, and using an
alternative to CONECT for large structures. Almost nobody uses
atomSerial, but it still may be parsed as an integer, where the above
idea could cause errors. Furthermore, non-digit encoding still results
in another maximum, whereas truncating the numbers has no limit. The
truncated serial number is ambiguous only if taken out of context of the
complete PDB file, but PDB files are by design sequential.
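
A short Python sketch of how a reader could recover full serial numbers
from truncated ones by relying on file order. It assumes serials only
increase through the file and never jump by 100000 or more between
consecutive records; the name untruncate is made up for illustration.

```python
def untruncate(serials, modulus=100_000):
    """Rebuild full serials from values truncated to their last 5 digits."""
    full = []
    offset = 0
    previous = None
    for value in serials:
        # A drop in the truncated value means the counter wrapped around.
        if previous is not None and value < previous:
            offset += modulus
        full.append(offset + value)
        previous = value
    return full
```

For example, the truncated sequence 99998, 99999, 0, 1 is read back as
99998, 99999, 100000, 100001.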

Another alternative is to define an "atom-serial offset" record. It can
define a number which is added to all subsequently parsed atom serial
numbers. Every ATOM/HETATM record is then perfectly valid to an older
program, but such a program may only be able to handle one chunk of
atoms at a time.
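
A Python sketch of how a newer reader might apply such an offset record.
No such record exists in the PDB format; the record name "SEROFF" and its
layout (the offset as the first whitespace-separated token after the
record name) are invented here purely for illustration.

```python
def read_serials(lines, offset_record="SEROFF"):
    """Collect atom serials, honoring a hypothetical serial-offset record."""
    offset = 0
    serials = []
    for line in lines:
        record = line[:6].strip()
        if record == offset_record:
            # Hypothetical layout: "SEROFF 100000"
            offset = int(line[6:].split()[0])
        elif record in ("ATOM", "HETATM"):
            # The standard serial field, columns 7-11, stays untouched.
            serials.append(offset + int(line[6:11]))
    return serials
```

An older reader would skip the unknown record and see each chunk of atoms
with serials that restart, which matches the limitation described above.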

Likewise, I like the idea of a ChainID map record, which maps
single-letter chainIDs to larger named IDs. Each existing PDB record
can then be used unchanged, but files can then support very long ChainID
strings. The only disadvantage is that old PDB readers will get
confused, but at least the individual record formats are not changed in
a way that makes them crash.
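
A Python sketch of the ChainID map idea. The record name "CHNMAP" and its
layout (single-letter ID, then the long ID) are hypothetical, as are the
function names; only the chainID position in column 22 of ATOM/HETATM
records comes from the existing format.

```python
def read_chain_map(lines, map_record="CHNMAP"):
    """Build {single_letter_id: long_id} from hypothetical map records."""
    chain_map = {}
    for line in lines:
        if line[:6].strip() == map_record:
            # Hypothetical layout: "CHNMAP L light_chain_copy_27"
            short_id, long_id = line[6:].split(None, 1)
            chain_map[short_id] = long_id.strip()
    return chain_map


def long_chain_id(atom_line, chain_map):
    """Look up the long ID for an ATOM/HETATM line (chainID is column 22)."""
    short_id = atom_line[21]
    # Fall back to the single letter when no mapping was given.
    return chain_map.get(short_id, short_id)
```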

I think that keeping the old record definitions completely unchanged is
an important feature of any PDB format revision. Even if we continue to
use it for another 20 years, its primary advantage is that it is a
well-established "legacy" format. If we change existing records, we
break that one useful feature. Therefore, I think that any changes to
existing records should be limited to using character positions that
are currently unused. (The one exception is that we need to make the
HEADER Y2K compatible by using a 4-digit year, which means the existing
decade+year characters have to be moved.)

Of course, the more important issue is that the final decision needs 
community involvement, and not just a decision by a small group of RCSB 
or wwPDB administrators.

Maybe it would be useful to set up a PDB format "Wiki" where
alternatives can be defined, along with advantages and disadvantages. If
there was sufficient agreement, it could be used as a community tool to
put together a draft revision of the next PDB format. With any luck,
some RCSB or wwPDB people would participate as well.

Joe Krahn
