Edward A. Berry wrote:
Ethan A Merritt wrote:
On Wednesday 08 August 2007 20:47, Ralf W. Grosse-Kunstleve wrote:
Implementations to generate intuitive, maximally backward compatible
numbers can be found here:

  http://cci.lbl.gov/hybrid_36/

From that URL:

ATOM  99998  SD  MET L9999      48.231 -64.383  -9.257  1.00 11.54           S
ATOM  99999  CE  MET L9999      49.398 -63.242 -10.211  1.00 14.60           C
ATOM  A0000  N   VAL LA000      52.228 -67.689 -12.196  1.00  8.76           N
ATOM  A0001  CA  VAL LA000      53.657 -67.774 -12.458  1.00  3.40           C

Could you please clarify this example?
Is that "A0000" a hexadecimal number, or is it a decimal number
that just happens to have an "A" in front of it?
[A-Z][0-9999] gives a larger range of values than 5 bytes of hexadecimal,
so I'm guessing it's the former.  But the example is not clear.

I'm guessing the former also. A 5-digit hex number would not be
backwards compatible. With this system legacy programs can still
read the files with 99999 atoms or less, and anything more than
that they couldn't have handled anyway. Very nice!
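For concreteness, here is a minimal sketch of a width-5 hybrid-36 decoder, based only on the scheme as described at the URL above (plain decimal through 99999, then an uppercase base-36 continuation starting at "A0000", then a lowercase one). The function name is illustrative; the code at cci.lbl.gov/hybrid_36 is the authoritative implementation:

```python
def hy36decode_width5(s):
    """Decode a width-5 hybrid-36 atom serial number (a sketch, not the
    reference code). 0..99999 are plain decimal; 'A0000'.. continues in
    base 36 with an uppercase first character, then 'a0000'.. lowercase."""
    s = s.strip()
    if s[0].isdigit() or s[0] in "+-":
        return int(s)                      # plain decimal: 0 .. 99999
    if s[0].isupper():
        # 'A0000' -> 100000; subsequent values count in base 36
        return int(s, 36) - 10 * 36**4 + 10**5
    # lowercase range starts where the uppercase range ends (after 'ZZZZZ')
    return int(s.upper(), 36) - 10 * 36**4 + 10**5 + 26 * 36**4

print(hy36decode_width5("99999"))  # 99999
print(hy36decode_width5("A0000"))  # 100000
print(hy36decode_width5("A0001"))  # 100001
```

This is what makes the scheme backward compatible: any serial a legacy program can represent is stored as an ordinary decimal number, and only serials past 99999 use the letter-prefixed continuation.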

Ed
I still prefer the idea of just truncating serial numbers, and using an alternative to CONECT for large structures. Almost nobody uses the atom serial field, but it may still be parsed as an integer, so the scheme above could cause parse errors. Furthermore, a non-digit encoding still imposes another maximum, whereas truncating the numbers has no limit. A truncated serial number is ambiguous only when taken out of the context of the complete PDB file, but PDB files are by design sequential.
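The truncation idea amounts to ignoring the on-disk serial field entirely and assigning serial numbers from record order, which the file's sequential layout guarantees. A minimal sketch (the function name is illustrative; real files also spend serial numbers on TER records, which this ignores):

```python
def read_atoms_sequentially(lines):
    """Assign atom serial numbers from record order, ignoring whatever
    is (or is not) stored in the 5-character serial field."""
    serial = 0
    atoms = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            serial += 1
            # keep the true serial and the atom name (columns 13-16)
            atoms.append((serial, line[12:16].strip()))
    return atoms
```

Under this scheme the serial column could hold anything past atom 99999 (stars, wrapped digits) without any loss of information.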

Another alternative is to define an "atom-serial offset" record. It would define a number that is added to all subsequently parsed atom serial numbers. Every ATOM/HETATM record is then perfectly valid to an older program, though such a program may only be able to handle one chunk of atoms at a time.
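A sketch of how such an offset record might be parsed; the record name "SEROFF" and its layout are purely hypothetical, chosen here only for illustration:

```python
def parse_with_offset(lines):
    """Parse atom serials with a hypothetical 'atom-serial offset' record:
    SEROFF sets a value added to every subsequent 5-column serial field."""
    offset = 0
    serials = []
    for line in lines:
        if line.startswith("SEROFF"):
            offset = int(line.split()[1])
        elif line.startswith(("ATOM", "HETATM")):
            serials.append(offset + int(line[6:11]))  # columns 7-11
    return serials

print(parse_with_offset([
    "ATOM  99999  CE  MET L 999 ...",
    "SEROFF 100000",
    "ATOM      0  N   VAL L1000 ...",
]))  # [99999, 100000]
```

An old reader that skips unknown record types would still see syntactically valid ATOM records, just with serials that restart from zero in each chunk.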

Likewise, I like the idea of a ChainID map record, which maps single-letter chainID's to larger named ID's. Each existing PDB record can then be used unchanged, but files can then support very long ChainID strings. The only disadvantage is that old PDB readers will get confused, but at least the individual record formats are not changed in a way that makes them crash.
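A sketch of the chain-ID map idea; the "CHNMAP" record name and layout are again hypothetical. The point is that the ATOM records themselves stay byte-for-byte unchanged, and only the mapping record is new:

```python
def expand_chain_ids(lines):
    """Expand one-character chain IDs via a hypothetical CHNMAP record
    that associates a single-letter chain ID with a longer name."""
    chain_map = {}
    chains = []
    for line in lines:
        if line.startswith("CHNMAP"):
            short, long_name = line.split()[1:3]
            chain_map[short] = long_name
        elif line.startswith(("ATOM", "HETATM")):
            c = line[21]                      # chain ID, column 22
            chains.append(chain_map.get(c, c))
    return chains
```

A legacy reader would simply see the single-letter IDs and ignore the unknown CHNMAP records.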

I think that keeping the old record definitions completely unchanged is an important feature of any PDB format revision. Even if we continue to use it for another 20 years, its primary advantage is that it is a well-established "legacy" format. If we change existing records, we break that one useful feature. Therefore, I think that any changes to existing records should be limited to using character positions that are currently unused. (The one exception is that we need to make the HEADER record Y2K compatible by using a 4-digit year, which means the existing decade+year characters have to be moved.)

Of course, the more important issue is that the final decision needs community involvement, and not just a decision by a small group of RCSB or wwPDB administrators.

Maybe it would be useful to set up a PDB format "Wiki" where alternatives can be defined, along with their advantages and disadvantages. If there were sufficient agreement, it could be used as a community tool to put together a draft revision of the next PDB format. With any luck, some RCSB or wwPDB people would participate as well.

Joe Krahn
