I hate these "intelligent" naming schemes for connected residues - programming with them is a nightmare! The sensible mmCIF solution is to provide numeric numbering of linked residues/nucleotides etc., and an alphanumeric residue name which can be anything.

Within CCP4 programs we require numeric numbering of monomers. You need to keep a table somewhere of what residue 72A corresponds to in a previous molecule, but we have to do that as a rule anyway to compare structures.
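
A minimal sketch of such a table, in Python. It assumes residue labels are a number with an optional single-letter insertion code (as in "72A"); `parse_label` and `build_table` are hypothetical names for illustration, not CCP4 routines.

```python
import re

# A label is an integer optionally followed by one insertion-code letter.
_LABEL = re.compile(r"^(-?\d+)([A-Za-z]?)$")

def parse_label(label):
    """Split a residue label such as '72A' into (72, 'A')."""
    m = _LABEL.match(label.strip())
    if m is None:
        raise ValueError(f"unrecognised residue label: {label!r}")
    return int(m.group(1)), m.group(2).upper()

def build_table(labels):
    """Map each original label, in file order, to a sequential number from 1."""
    table = {}
    for seq, label in enumerate(labels, start=1):
        key = parse_label(label)
        if key in table:
            raise ValueError(f"duplicate residue label: {label!r}")
        table[key] = seq
    return table

# Hypothetical labels in the order they appear in the coordinate file.
labels = ["71", "72", "72A", "72B", "73"]
table = build_table(labels)
print(table[parse_label("72A")])  # prints 3
```

Keeping the key as a (number, letter) pair rather than the raw string means "72a" and "72A" resolve to the same entry.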

One of the boons of SSM, which matches structures by secondary-structure elements, is that it gives you a nice list of what matches what.

 Eleanor


James Whisstock wrote:

***  For details on how to be removed from this list visit the  ***
***          CCP4 home page http://www.ccp4.ac.uk         ***


Hi Linda

The serine proteases were a bit of a classic for this issue - a lot of the early structures had the loops numbered 60A, 60B, etc. with respect to chymotrypsin, with the result that many programs merrily crashed as soon as they hit them. The result was (apart from much frantic hair-tearing and time wasted because one had forgotten for the umpteenth time that the PDB file contained such oddities) a large number of parsers that converted the file into standard 1-to-end numbering, which you ran before you did any work.
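
A minimal sketch of what such a converter might look like, in Python. It relies only on the fixed-column layout of PDB ATOM records (chain ID in column 22, residue number in columns 23-26, insertion code in column 27). The names `renumber` and `atom` are hypothetical, and the sketch handles ATOM records only; HETATM, TER, ANISOU, SSBOND, LINK and similar records would need the same treatment in a real tool.

```python
def renumber(lines):
    """Renumber residues 1..N per chain; return (new_lines, mapping)."""
    out = []
    mapping = {}    # (chain, old number, insertion code) -> new number
    counters = {}   # chain -> last sequential number issued
    for line in lines:
        if line.startswith("ATOM  "):
            chain = line[21]
            old = (chain, line[22:26].strip(), line[26:27].strip())
            if old not in mapping:
                counters[chain] = counters.get(chain, 0) + 1
                mapping[old] = counters[chain]
            # Overwrite columns 23-26 and blank the insertion code (column 27).
            # Assumes fewer than 10000 residues per chain.
            line = f"{line[:22]}{mapping[old]:>4} {line[27:]}"
        out.append(line)
    return out, mapping

def atom(serial, resname, chain, resseq, icode=""):
    """Build a dummy CA ATOM record for the example (coordinates are zero)."""
    return (f"ATOM  {serial:>5}  CA  {resname:>3} {chain}{resseq:>4}{icode:1}"
            f"   {0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")

lines = [atom(1, "GLY", "A", 60), atom(2, "SER", "A", 60, "A"),
         atom(3, "THR", "A", 60, "B"), atom(4, "ALA", "A", 61)]
new_lines, mapping = renumber(lines)
print(mapping[("A", "60", "B")])  # prints 3
```

Returning the mapping alongside the rewritten lines keeps the original labels recoverable, which matters when comparing against older structures.
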
The A, B, C numbering scheme came about so as to number with respect to an
"archetypal" first reference structure, but really I don't think this is
practical any more, not least because of the breadth of most families. It is
not, in my opinion, really feasible or useful to perform such a task for two
structures that are (say) 15% identical and contain major insertions and
deletions, or indeed to impose such a scheme across 80 or so different
structures. Also, given the quality and speed of the structural alignment
packages now available, I really don't think the specialised numbering schemes
are useful any more. Finally, such an effort is only really useful if EVERY
structure in the family is numbered in that fashion to allow universal
cross-comparison, and this clearly is never going to be the case.

So we just number continuously. However, I think it is useful and important
to ensure that the numbering is consistent with the sequence entry in the
GENPEPT file (so long as that is accurate!). Also, when publishing, if need
be we include an alignment with an archetypal member, with the relevant and
different numbering schemes above and below the alignment.

Happy new year!

Cheers

James


William Scott <[EMAIL PROTECTED]> wrote:


Hi Linda:

I've been working on an RNA called the "hammerhead ribozyme" that has a
particularly Kafkaesque numbering scheme. Originally I had tried adhering
to it, but found that various programs, including pymol and refmac, simply
couldn't cope.

I finally came to my senses and numbered the RNA 1,2,3, ...N. Then I made
a table to translate between the sequential and canonical numbering
schemes.
http://xanana.ucsc.edu/hh/rosetta_hh.html

I'm not sure that this is the best way, but at least I know that if other
people want to display or re-refine the structure, it will behave properly.

HTH,

Bill Scott


Linda Brinen wrote:
This isn't specifically a CCP4-related question, but I'm hoping for feedback
on a topic that most of us have had to consider. I'm motivated to ask the
question because I'm currently trying to answer it for myself. I should make
the disclaimer right off that I'm not looking to start a heated debate about
PDB guidelines, but am genuinely looking for constructive suggestions.



My situation involves a two-domain protein in a somewhat well-studied family
of molecules. There is a long-standing history of how these are numbered,
and examples of this can be found in the PDB. The first domain can typically
be found with a letter descriptor after the number (i.e., 1P, 2P, 3P...),
with the numbering then resetting to 1, with no letter following, for the
second domain. All numbering is done relative to the original member of the
family of these proteins, so if there is a gap based on sequence alignment to
that sequence, the numbering skips. Similarly, if there are inserts, the
numbering becomes 45a, 45b, 45c, etc. Again, lots of precedent for this in
the PDB.
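
The reference-relative scheme described above can be sketched in Python as follows. This is an illustration under stated assumptions, not any database's actual procedure: the input is a pairwise alignment given as two equal-length strings with "-" for gaps, insertion letters are emitted in upper case, and `reference_numbering` is a hypothetical name.

```python
import string

def reference_numbering(ref_aln, tgt_aln, ref_start=1):
    """Label each residue of the target in the reference's numbering scheme."""
    if len(ref_aln) != len(tgt_aln):
        raise ValueError("aligned sequences must have equal length")
    labels = []
    ref_num = ref_start - 1   # number of the last reference residue seen
    inserted = 0              # target residues inserted since that residue
    for r, t in zip(ref_aln, tgt_aln):
        if r != "-":
            ref_num += 1
            inserted = 0
            if t != "-":
                labels.append(str(ref_num))
            # A gap in the target means this number is simply skipped.
        elif t != "-":
            # Insertion relative to the reference: reuse the last number
            # and append A, B, C, ...
            if inserted >= len(string.ascii_uppercase):
                raise ValueError("more than 26 residues in one insertion")
            labels.append(f"{ref_num}{string.ascii_uppercase[inserted]}")
            inserted += 1
    return labels

# Hypothetical alignment: two residues inserted after 3, residue 5 deleted.
print(reference_numbering("MKT--LVA", "MKTGSL-A"))
# prints ['1', '2', '3', '3A', '3B', '4', '6']
```

Note that an insertion longer than 26 residues has no representation in this scheme, which is one practical argument for plain sequential numbering plus a lookup table.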



BUT, now there is a push from databases for more 'simplification' and
standardization of numbering, i.e., start from 1 and go sequentially to the
end. Obviously there are arguments to be made for maintaining biologically
relevant and historically established precedents. But there are arguments
for the other side as well.



How do you handle the numbering of your protein sequence if there are gaps,
inserts, different biologically relevant domains? Do you use the accepted
precedents set by other related structures that have been solved or do you
simply start from 1 and push on through to your end point?



Thanks in advance for any input.



-Linda Brinen





begin:vcard
fn:Eleanor  Dodson
n:Dodson;Eleanor 
email;internet:[EMAIL PROTECTED]
tel;work:+44 (0) 1904 328259
tel;fax:+44 (0) 1904 328266
tel;home:+44 (0) 1904 424449
version:2.1
end:vcard
