Edward A. Berry wrote:
Ethan A Merritt wrote:
On Wednesday 08 August 2007 20:47, Ralf W. Grosse-Kunstleve wrote:
Implementations to generate intuitive, maximally backward compatible
numbers can be found here:
http://cci.lbl.gov/hybrid_36/
From that URL:
ATOM  99998  SD  MET L9999      48.231 -64.383  -9.257  1.00 11.54           S
ATOM  99999  CE  MET L9999      49.398 -63.242 -10.211  1.00 14.60           C
ATOM  A0000  N   VAL LA000      52.228 -67.689 -12.196  1.00  8.76           N
ATOM  A0001  CA  VAL LA000      53.657 -67.774 -12.458  1.00  3.40           C
Could you please clarify this example?
Is that "A0000" a hexadecimal number, or is it a decimal number
that just happens to have an "A" in front of it?
[A-Z][0000-9999] gives a larger range of values than 5 hexadecimal digits,
so I'm guessing it's the former. But the example is not clear.
I'm guessing the former also. A 5-digit hex number would not be
backwards compatible. With this system legacy programs can still
read the files with 99999 atoms or less, and anything more than
that they couldn't have handled anyway. Very nice!
Ed
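
For what it's worth, the reading side of that scheme can be sketched in a
few lines of Python. This is my own sketch based on the example above, not
the reference implementation at the URL; the function name and the exact
offsets are my assumptions:

```python
def hy36decode(width, field):
    """Decode a hybrid-36 style serial number field of the given width.

    Plain decimal values are returned as-is. Fields starting with an
    uppercase letter are read as base 36, offset so that "A0000"
    continues directly after 99999; a lowercase range follows "ZZZZZ".
    """
    s = field.strip()
    if s.lstrip("-").isdigit():
        return int(s)  # legacy decimal serial, readable by old programs
    if s[0].isupper():
        # "A0000" is 10*36**4 in base 36; shift it down to 100000
        return int(s, 36) - 10 * 36 ** (width - 1) + 10 ** width
    if s[0].islower():
        # lowercase continuation once the uppercase range is exhausted
        return int(s.upper(), 36) + 16 * 36 ** (width - 1) + 10 ** width
    raise ValueError("not a hybrid-36 number: %r" % field)
```

So hy36decode(5, "99999") gives 99999 and hy36decode(5, "A0000") gives
100000, consistent with the records quoted above.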
I still prefer the idea of just truncating serial numbers, and using an
alternative to CONECT for large structures. Almost nobody uses the atom
serial number, but legacy code may still parse the field as an integer,
in which case the scheme above could cause errors. Furthermore, a
non-digit encoding still has a maximum, whereas truncating the numbers
imposes no limit at all. A truncated serial number is ambiguous only when
taken out of context of the complete PDB file, and PDB files are
sequential by design.
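
To make the "ambiguous only out of context" point concrete, here is a
hypothetical sketch of how a sequential reader could restore full serial
numbers from a field that simply wraps back to 00000 after 99999. The
helper name and the wrap-at-100000 convention are my assumptions, not any
agreed format:

```python
def restore_serial(prev_full, field_value):
    """Reconstruct a full serial number from a truncated (wrapped) field.

    prev_full is the full serial of the previous atom; field_value is the
    integer read from the 5-digit field, which wraps at 100000.
    """
    base = (prev_full // 100000) * 100000
    candidate = base + field_value
    if candidate < prev_full:
        candidate += 100000  # the field wrapped since the previous atom
    return candidate
```

A reader that tracks the previous atom this way never sees any ambiguity,
because serials within one file increase monotonically.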
Another alternative is to define an "atom-serial offset" record. It would
define a number which is added to all subsequently parsed atom serial
numbers. Every ATOM/HETATM record is then perfectly valid to an older
program, although such a program may only be able to handle one chunk of
atoms at a time.
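
A sketch of what parsing that might look like, using "SEROFF" as a purely
hypothetical record name (no such record exists in the current format):

```python
def parse_serials(lines):
    """Collect ATOM/HETATM serial numbers, applying a running offset.

    "SEROFF" is a hypothetical record carrying the proposed atom-serial
    offset; ATOM serials occupy the usual columns 7-11 (0-based 6:11).
    """
    offset = 0
    serials = []
    for line in lines:
        if line.startswith("SEROFF"):
            offset = int(line.split()[1])
        elif line.startswith(("ATOM", "HETATM")):
            serials.append(offset + int(line[6:11]))
    return serials
```

An old reader would silently skip the unknown record and still parse
every ATOM record, just with serials that restart from 1.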
Likewise, I like the idea of a chainID map record, which maps
single-letter chainIDs to longer names. Each existing PDB record can then
be used unchanged, yet files can support much longer chainID strings. The
only disadvantage is that old PDB readers will get confused, but at least
the individual record formats are not changed in a way that makes them
crash.
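
As a sketch, with "CHNMAP" as a hypothetical record name of my own
invention:

```python
def read_chain_map(lines):
    """Collect short -> long chain ID mappings from a hypothetical
    "CHNMAP" record, e.g. "CHNMAP A heavy_chain_1"."""
    chain_map = {}
    for line in lines:
        if line.startswith("CHNMAP"):
            _, short_id, long_id = line.split(None, 2)
            chain_map[short_id] = long_id.strip()
    return chain_map
```

Readers that know the record get the long names; everything else keeps
working on the single-letter IDs already present in the ATOM records.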
I think that keeping the old record definitions completely unchanged is
an important feature of any PDB format revision. Even if we continue to
use it for another 20 years, its primary advantage is that it is a
well-established "legacy" format. If we change existing records, we
break that one useful feature. Therefore, I think that any changes to
existing records should be limited to using character positions that
are currently unused. (The one exception is that we need to make the
HEADER record Y2K compatible by using a 4-digit year, which means the
existing decade+year characters have to be moved.)
Of course, the more important issue is that the final decision needs
community involvement, and not just a decision by a small group of RCSB
or wwPDB administrators.
Maybe it would be useful to set up a PDB format "Wiki" where
alternatives can be defined, along with advantages and disadvantages. If
there was sufficient agreement, it could be used as a community tool to
put together a draft revision of the next PDB format. With any luck,
some RCSB or wwPDB people would participate as well.
Joe Krahn