Re: [ccp4bb] PDB format survey?

Ethan Merritt Wed, 01 Aug 2007 15:06:31 -0700

On Wednesday 01 August 2007 14:10, Joe Krahn wrote:
> In addition to questions about the PDB standard, it is probably
> important to consider mmCIF. One thing I don't like about it is that
> columns can be randomized (i.e. X, Y, and Z can be in any column), but
> the mmCIF standards people have no interest in defining a more strict
> standard that would require files to be as human readable as RCSB's
> mmCIF files.


The important thing about mmCIF is not the precise file format,
which is ultimately not of interest except as a parsable exchange
medium, but rather the existence of the mmCIF dictionaries.

A more productive discussion may be to revisit the definition
of what information we as a community expect to be captured in the
PDB database.  The question of export formats is secondary.
 
> Does this sound useful, or have most people given up on having any
> influence on standards? Or, should the structural biology software
> developers get together and just make our own OpenPDB format?

As came up in the PDB group discussion at the ACA meeting, some new
depositions are not representable in the PDB format (including v3).

Examples include:
- very large structures, for which the current 80 column PDB format
  runs out of space for atom numbers (5 columns -> max 99999)
  or for chain ids (1 column -> single char A-Z 0-9)
  [don't ask me why they don't want lower case]
- new classes of experiment (SAXS, EM)
- new classes of model (TLS or normal-mode displacements,
  ensemble models, envelope representations)
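
The capacity limits in the first item follow directly from fixed-width
fields. A minimal sketch of the arithmetic; the field widths are passed in
as parameters rather than read from the format specification, and the
helper names (max_serial, fits, CHAIN_IDS) are hypothetical:

```python
import string


def max_serial(width):
    """Largest non-negative integer that fits in `width` decimal columns."""
    return 10 ** width - 1


def fits(value, width):
    """True if `value` can be written in a field `width` columns wide."""
    return len(str(value)) <= width


# One-column chain identifiers restricted to A-Z and 0-9, as described above.
CHAIN_IDS = string.ascii_uppercase + string.digits

if __name__ == "__main__":
    for width in (4, 5):
        print(f"{width} columns -> max {max_serial(width)}")
    print(f"1-column chain id -> {len(CHAIN_IDS)} distinct chains")
```

Any structure with more numbered items or more chains than these limits
cannot be written without either wrapping the numbers or widening the field,
and widening the field breaks every reader that assumes fixed columns.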

I am inclined to say that there should be a fork into two distinct
formats, used for different purposes.

The 80 column PDB format should be frozen, preferably at the
pre-version3 state. Freezing it would allow legacy programs to continue
to read old PDB files without modification. These programs will not be
able to handle certain classes of new structures, but this would be true
in any case for legacy code.  Churn in the 80 column PDB format would
aggravate rather than relieve this limitation. This branch would serve
the general community who are primarily viewers of previously deposited
structures, and any programs not currently being maintained.

Currently-maintained programs should move to mmCIF or XML, whichever
is convenient.  These formats are intrinsically open-ended, and can
handle the problematic structures mentioned above so long as the
corresponding mmCIF dictionaries are updated to define the relevant
entities.

The wwPDB database is already capable of exporting to any of the PDB,
XML, or mmCIF formats. So this would really be a change on the user
side more than on the database side.

The barrier to converting programs to mmCIF is lower than you
might think.  Several mmCIF parsing libraries are available to
allow currently maintained programs to offer mmCIF input/output
if they do not already do so.  One such is the mmLib library
developed by Jay Painter and hosted on SourceForge:

    http://pymmlib.sourceforge.net/
        
    J Painter and EA Merritt
    J. Appl. Cryst. 37, 174-178 (2004).
    "mmLib Python toolkit for manipulating annotated structural
     models of biological macromolecules".  
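
To illustrate why column order in an mmCIF loop need not matter to a
program, here is a rough sketch that reads an atom_site loop by tag name
rather than by position. It does not use mmLib or any other library, and it
is not a complete mmCIF parser: it assumes whitespace-separated values with
no quoted strings or multi-line text fields. The function names and the
sample data are made up for illustration.

```python
def read_loop(text, category):
    """Return the rows of the first loop_ whose tags all belong to
    `category`, as dicts keyed by tag name (simplified sketch)."""
    prefix = category + "."
    tags, rows = [], []
    in_loop = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            if rows:
                break  # treat a blank or comment line as the end of the loop
            continue
        if line == "loop_":
            if rows:
                break
            tags, in_loop = [], True
            continue
        if not in_loop:
            continue
        if line.startswith("_"):
            if rows:
                break
            tags.append(line.split()[0])
            continue
        if tags and all(t.startswith(prefix) for t in tags):
            values = line.split()
            if len(values) == len(tags):
                rows.append(dict(zip(tags, values)))
        else:
            in_loop = False  # a loop for some other category; skip its data
    return rows


def coords(text):
    """Extract (id, x, y, z) for each atom, looked up by tag name."""
    return [
        (r["_atom_site.id"],
         float(r["_atom_site.Cartn_x"]),
         float(r["_atom_site.Cartn_y"]),
         float(r["_atom_site.Cartn_z"]))
        for r in read_loop(text, "_atom_site")
    ]


# The same two atoms, written with two different column orders.
SAMPLE_A = """
loop_
_atom_site.id
_atom_site.label_atom_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
1 N 10.0 20.0 30.0
2 CA 11.5 21.0 29.0
"""

SAMPLE_B = """
loop_
_atom_site.Cartn_z
_atom_site.id
_atom_site.Cartn_y
_atom_site.label_atom_id
_atom_site.Cartn_x
30.0 1 20.0 N 10.0
29.0 2 21.0 CA 11.5
"""

if __name__ == "__main__":
    print(coords(SAMPLE_A) == coords(SAMPLE_B))
```

Because every value is looked up by its tag, the reader gets the same
coordinates from both samples; a real library additionally handles quoting,
multi-line values, and dictionary validation.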

-- 
Ethan A Merritt
