Interesting thread this - it's something that hasn't been properly discussed at previous DAS developer meetings.

On 30/07/2010 20:33, Dave Messina wrote:
I too agree with Eugene.

No magic numbers.
You're too late here. 0 *is already* a magic number in the normalised protein sequence world, since it indicates the translation start site for a coding sequence (i.e. the initial M). This causes some ambiguity in the bio* bindings, and confusion on the part of more simple-minded programmers like myself :)

Types can be used for filtering, and actually you get more fine-grained control 
than simply positional or non-positional. (I use this technique now in DASher.) *

In my opinion, the current spec as written is correct. That is, non-positional 
features don't just apply to the whole sequence, they apply to any part of the 
sequence.
Agreed. But read on...
As an example, consider a journal reference — a particular protein was isolated 
by a lab, they wrote a paper about it, and deposited the protein sequence in a 
database. If you look at a subsequence of the protein sequence, that 
subsequence still derives from the paper, right? So therefore the feature 
containing that journal reference should still be attached to the subsequence.

On that basis, I think the uniprot server is technically doing it wrong and 
should be changed, although I have to say that in practice it hasn't been an 
issue for me.
It's a difficult call. The uniprot server's behaviour is almost certainly due to the ambiguity between non-positional annotations that have start/end attributes (where start==end && start==0) and those that have none at all (the annotation is then usually derived from some other table, viz. the BioSQL schema). Other DAS servers do similar things, and kludges are needed to work around them.
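
To make the ambiguity concrete, here is a minimal sketch of the kind of kludge a client ends up with. It treats both flavours described above (start==end==0, and no start/end at all) as non-positional. The FEATURE, TYPE, START and END element names follow the layout of a DAS features response, but the fragment is simplified and the ids, types and accession are made up for illustration; this is not any real server's or client's code.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Simplified, made-up fragment of a features response (illustration only).
SAMPLE = """
<SEGMENT id="P12345" start="1" stop="100">
  <FEATURE id="f1"><TYPE id="domain"/><START>10</START><END>50</END></FEATURE>
  <FEATURE id="f2"><TYPE id="publication"/><START>0</START><END>0</END></FEATURE>
  <FEATURE id="f3"><TYPE id="dbxref"/></FEATURE>
</SEGMENT>
"""


def _int_or_none(text: Optional[str]) -> Optional[int]:
    """Return an int for a non-empty string, otherwise None."""
    if text is None or not text.strip():
        return None
    return int(text)


def is_non_positional(feature: ET.Element) -> bool:
    """True if the feature has no coordinates, or has start == end == 0."""
    start = _int_or_none(feature.findtext("START"))
    end = _int_or_none(feature.findtext("END"))
    if start is None and end is None:
        return True  # flavour 2: coordinates omitted entirely
    return start == 0 and end == 0  # flavour 1: the 0,0 convention


def split_features(xml_text: str) -> Tuple[List[str], List[str]]:
    """Return (positional ids, non-positional ids) in document order."""
    root = ET.fromstring(xml_text.strip())
    positional: List[str] = []
    non_positional: List[str] = []
    for feature in root.iter("FEATURE"):
        target = non_positional if is_non_positional(feature) else positional
        target.append(feature.get("id"))
    return positional, non_positional
```

With the sample above, split_features(SAMPLE) puts f1 in the positional list and both f2 and f3 in the non-positional list, even though the two arrive in different shapes.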

My only worry with the expectation of 'proper behaviour' is that, currently, I frequently see IDs with more non-positional annotation than positional (notwithstanding histogram-like continuous quantitative annotation such as running averages of predicted or observed local sequence properties). Enforcing compliance with the spec as written means that the average DAS metaserver (i.e. uniprot, or some server that aggregates sequence database info with other data) will send a huge non-positional header in response to every range-qualified feature request, which is pretty inefficient. It may not scale well, either, since the number of database cross-references is (still) increasing.
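
A rough sketch of why this worries me, assuming a server that follows the spec as written: every range-qualified request returns the positional features overlapping the range plus all of the non-positional ones. The Feature class, the features_for_range function and the example data are hypothetical, not taken from any real server.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Feature:
    id: str
    type: str
    start: Optional[int] = None
    end: Optional[int] = None

    @property
    def positional(self) -> bool:
        """Positional means real coordinates: present and not the 0,0 marker."""
        if self.start is None or self.end is None:
            return False
        return (self.start, self.end) != (0, 0)


def features_for_range(
    features: List[Feature], start: int, end: int
) -> Tuple[List[Feature], List[Feature]]:
    """Return (overlapping positional features, all non-positional features)."""
    hits = [
        f for f in features
        if f.positional and f.start <= end and f.end >= start
    ]
    # The non-positional part does not depend on the requested range at all.
    always = [f for f in features if not f.positional]
    return hits, always


# Made-up annotation for a single ID: two positional, three non-positional.
EXAMPLE = [
    Feature("d1", "domain", 10, 20),
    Feature("d2", "domain", 60, 80),
    Feature("r1", "publication", 0, 0),
    Feature("x1", "dbxref"),
    Feature("x2", "dbxref"),
]
```

Here a request for 1-30 returns one positional feature, a request for 90-100 returns none, and both carry the same three non-positional features. As the cross-references grow, that fixed part dominates every response.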

* It might be nice, though, to add 'positional' and 'non-positional' types, 
which would be a way to grab all of the existing positional or non-positional 
types in one go. (currently it's necessary to specify multiple types to get the 
same functionality.)
This is essential, I think. However, the only way you are going to be able to do this in a DAS type constraint currently is to ensure the feature annotation source is ontology-aware (and that said ontology includes a distinct positional/non-positional hierarchy)**. One route would be to introduce a DAS-specific type term that the server maps to its source's ontology. Another, simpler approach would be to introduce a new boolean constraint, 'positional', which, if specified, limits the response to positional annotation only.
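
As a sketch of the simpler approach, assuming the server already knows each feature's type and coordinates: the 'positional' constraint below is the proposal above and is not part of the current DAS spec. The select function, the tuple layout and the example data are hypothetical, and allowing positional=False (non-positional only) is an extension of my own for symmetry.

```python
from typing import List, Optional, Set, Tuple

# (type, start, end); start/end may be None when the source omits them.
Feature = Tuple[str, Optional[int], Optional[int]]


def is_positional(feature: Feature) -> bool:
    """Positional means real coordinates: present and not the 0,0 marker."""
    _, start, end = feature
    if start is None or end is None:
        return False
    return not (start == 0 and end == 0)


def select(
    features: List[Feature],
    types: Optional[Set[str]] = None,
    positional: Optional[bool] = None,
) -> List[Feature]:
    """Apply the existing type constraint and the proposed positional one.

    positional=None leaves the response unchanged, True keeps positional
    annotation only, and False keeps non-positional annotation only.
    """
    selected = []
    for feature in features:
        if types is not None and feature[0] not in types:
            continue
        if positional is not None and is_positional(feature) != positional:
            continue
        selected.append(feature)
    return selected


# Made-up features for illustration.
FEATURES: List[Feature] = [
    ("domain", 10, 50),
    ("publication", 0, 0),
    ("dbxref", None, None),
    ("variant", 7, 7),
]
```

The point of the flag is that the client no longer has to enumerate every positional type by name: select(FEATURES, positional=True) keeps the domain and variant entries without the client knowing either type exists.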

Jim.

** but this immediately brings to mind a nasty potential gotcha: e.g. 'expression' in the context of a genome is positional, but is a non-positional feature in the context of a proteome. So terms will have to be fully qualified in the type constraint on a feature request.

--
-------------------------------------------------------------------
J. B. Procter  (JALVIEW/ENFIN)  Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das
