Joe, I feel that atom serial numbers are particularly important, since they, combined with CONECT records, provide the only semi-standard convention I know of for reliably encoding bond valence information into a PDB file:

  single bond   = bond listed once
  double bond   = bond listed twice
  triple bond   = bond listed thrice
  aromatic bond = bond listed four times

This is a convention long supported by tools like MacroModel and PyMOL. For example, here is formamide, where the bond between atoms 1 and 3 is listed twice:

HETATM    1  C01 C=O     1       0.000  -0.020   0.000  0.00  0.00           C
HETATM    2  N01 C=O     1       1.268  -0.765   0.000  0.00  0.00           N
HETATM    3  O02 C=O     1       0.000   1.188   0.000  0.00  0.00           O
HETATM    4  H01 C=O     1       1.260  -1.775   0.000  0.00  0.00           H
HETATM    5  H02 C=O     1       2.146  -0.266   0.000  0.00  0.00           H
HETATM    6  H03 C=O     1      -0.946  -0.562   0.000  0.00  0.00           H
CONECT    1    2
CONECT    1    3
CONECT    1    3
CONECT    1    6
CONECT    2    1    4    5
CONECT    3    1
CONECT    3    1
CONECT    4    2
CONECT    5    2
CONECT    6    1

I second the proposal of treating this field as a unique string rather than a numeric quantity.

Two-letter chain IDs would be fine with me, but I do think we could also make better use of SEGI and/or MODEL to break things up while still preserving the utility of certain other records (SHEET, HELIX, etc.) within their existing column definitions.

However, we are still lacking a standard way of designating formal charges. So maybe that free column could be better used for encoding a formal charge, such as ["q", "t", "d", "-", "+", "D", "T", "Q"] over the formal charge range of [-4,-3,-2,-1,0,1,2,3,4] -- just an idea :)...

With valences plus formal charges, along with expansion of the cap on atom counts, I think we could support chemically-complete PDB files and push back the date of PDB demise for a few more years!

A Wiki dedicated to practical PDB file hacks and extensions is a superb idea -- of course, the goal should be to ultimately come up with a single well-defined standard set of hacks we all agree upon by supporting them in our code.

Cheers,
Warren

-----Original Message-----
From: CCP4 bulletin board [mailto:[EMAIL PROTECTED] On Behalf Of Joe Krahn
Sent: Thursday, August 09, 2007 1:15 PM
To: CCP4BB@JISCMAIL.AC.UK
Subject: Re: [ccp4bb] PDB format survey?

Edward A. Berry wrote:
> Ethan A Merritt wrote:
>> On Wednesday 08 August 2007 20:47, Ralf W. Grosse-Kunstleve wrote:
>>> Implementations to generate intuitive, maximally backward compatible
>>> numbers can be found here:
>>>
>>> http://cci.lbl.gov/hybrid_36/
>>
>> From that URL:
>>
>> ATOM  99998  SD  MET L9999      48.231 -64.383  -9.257  1.00 11.54           S
>> ATOM  99999  CE  MET L9999      49.398 -63.242 -10.211  1.00 14.60           C
>> ATOM  A0000  N   VAL LA000      52.228 -67.689 -12.196  1.00  8.76           N
>> ATOM  A0001  CA  VAL LA000      53.657 -67.774 -12.458  1.00  3.40           C
>>
>> Could you please clarify this example?
>> Is that "A0000" a hexadecimal number, or is it a decimal number
>> that just happens to have an "A" in front of it?
>> [A-Z][0-9999] gives a larger range of values than 5 bytes of hexadecimal,
>> so I'm guessing it's the former. But the example is not clear.
>>
> I'm guessing the former also. A 5-digit hex number would not be
> backwards compatible. With this system legacy programs can still
> read the files with 99999 atoms or less, and anything more than
> that they couldn't have handled anyway. Very nice!
>
> Ed

I still prefer the idea of just truncating serial numbers, and using an alternative to CONECT for large structures. Almost nobody uses atomSerial, but it may still be parsed as an integer, in which case the above idea could cause errors. Furthermore, non-digit encoding still results in another maximum, whereas truncating the numbers has no limit. The truncated serial number is ambiguous only if taken out of context of the complete PDB file, but PDB files are by design sequential.

Another alternative is to define an "atom-serial offset" record. It can define a number which is added to all subsequently parsed atom serial numbers. Every ATOM/HETATM record is then perfectly valid to an older program, although such a program may only be able to handle one chunk of atoms at a time.

Likewise, I like the idea of a ChainID map record, which maps single-letter chain IDs to longer named IDs.
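
As a rough illustration of how a reader could honor both of these ideas, here is a minimal Python sketch. The record names SEROFF and CHNMAP and their layouts are invented purely for this example (no such records exist in the PDB format); only the ATOM/HETATM column positions are taken from the existing format.

```python
# Sketch only: "SEROFF" and "CHNMAP" are hypothetical record names made up
# for illustration; they are not part of any PDB specification.

def read_atoms(lines):
    """Yield (full_serial, chain_name, atom_name) for ATOM/HETATM records."""
    offset = 0        # added to every serial parsed after a SEROFF record
    chain_map = {}    # single-letter chainID -> longer named ID
    for line in lines:
        record = line[0:6].strip()
        if record == "SEROFF":
            # assumed layout: SEROFF <integer offset>
            offset = int(line[6:].split()[0])
        elif record == "CHNMAP":
            # assumed layout: CHNMAP <one-letter ID> <long ID>
            short_id, long_id = line[6:].split()[:2]
            chain_map[short_id] = long_id
        elif record in ("ATOM", "HETATM"):
            serial = int(line[6:11]) + offset   # columns 7-11, unchanged
            name = line[12:16].strip()          # columns 13-16
            chain = line[21]                    # column 22
            # Chains without a map entry keep their one-letter ID.
            yield serial, chain_map.get(chain, chain), name


example = [
    "CHNMAP A HeavyChain1",
    "ATOM  99999  CA  ALA A   1",
    "SEROFF 100000",
    "ATOM      0  CB  ALA A   1",
    "HETATM    1  O   HOH B   2",
]

for atom in read_atoms(example):
    print(atom)
```

An older reader that skips the two extra records still sees well-formed ATOM/HETATM lines; it simply gets the short serial numbers and one-letter chain IDs.
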
Each existing PDB record can then be used unchanged, but files can then support very long ChainID strings. The only disadvantage is that old PDB readers will get confused, but at least the individual record formats are not changed in a way that makes them crash.

I think that keeping the old record definitions completely unchanged is an important feature of any PDB format revision. Even if we continue to use it for another 20 years, its primary advantage is that it is a well-established "legacy" format. If we change existing records, we break that one useful feature. Therefore, I think that any changes to existing records should be limited to using character positions that are currently unused. (The one exception is that we need to make the HEADER Y2K compatible by using a 4-digit year, which means the existing decade+year characters have to be moved.)

Of course, the more important issue is that the final decision needs community involvement, and not just a decision by a small group of RCSB or wwPDB administrators. Maybe it would be useful to set up a PDB format "Wiki" where alternatives can be defined, along with advantages and disadvantages. If there were sufficient agreement, it could be used as a community tool to put together a draft revision of the next PDB format. With any luck, some RCSB or wwPDB people would participate as well.

Joe Krahn
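
For reference, the repeated-CONECT convention described at the top of this thread (a bond listed once, twice, thrice or four times for single, double, triple or aromatic) can be read back with a few lines of Python. This is only a sketch under stated assumptions: the function name bond_orders is made up here, serial numbers are assumed to be plain integers in the standard 5-character CONECT fields, and a bond listed from both ends is counted once.

```python
from collections import Counter

# Names for the repeat counts, following the convention in this thread.
ORDER_NAMES = {1: "single", 2: "double", 3: "triple", 4: "aromatic"}


def bond_orders(pdb_lines):
    """Return {(a, b): times_listed} with a < b, read from CONECT records."""
    directed = Counter()
    for line in pdb_lines:
        if not line.startswith("CONECT"):
            continue
        body = line.rstrip()
        # Serial numbers sit in 5-character fields starting at column 7.
        fields = [body[i:i + 5] for i in range(6, len(body), 5)]
        serials = [int(f) for f in fields if f.strip()]
        if len(serials) < 2:
            continue
        origin = serials[0]
        for partner in serials[1:]:
            directed[(origin, partner)] += 1
    orders = {}
    for (a, b), count in directed.items():
        key = (min(a, b), max(a, b))
        # A bond is normally listed from both atoms; keep the larger count
        # so that a file listing only one direction still works.
        orders[key] = max(orders.get(key, 0), count)
    return orders


conect = [
    "CONECT    1    2",
    "CONECT    1    3",
    "CONECT    1    3",
    "CONECT    1    6",
    "CONECT    2    1    4    5",
    "CONECT    3    1",
    "CONECT    3    1",
    "CONECT    4    2",
    "CONECT    5    2",
    "CONECT    6    1",
]

for (a, b), count in sorted(bond_orders(conect).items()):
    print(a, b, ORDER_NAMES.get(count, "unknown"))
```

Run on the CONECT records from the example above, this reports the 1-3 bond as double and the other four bonds as single.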