Gregg Helt wrote:
I think there are better ways to assign URIs than
either the way currently used in the DAS1 registry (very opaque) or the DAS2
GlobalSeqIDs (transparent but encroaching on NCBI namespace), but the more
important point is that we should only have one strategy for all versions of
DAS.

Currently DAS1 does not formally include URIs; should it do so, we can improve how the registry handles them.


Attempts to infer sequence URIs currently lead to all sorts of trouble, as
I've found in working on the Trellis/Ivy DAS1-->DAS2 proxy.  For example the
proxy assumes that if versioned source V1 has coordinates C and entry_points
capability E1, then E1 describes the segments available for coordinates C.
Based on this assumption, if versioned source V2 also has coordinates C but
doesn't have an entry_points capability, then the proxy uses E1 from V1
instead, since the versioned sources share the same coordinates.  This
sometimes works but not always -- what happens if versioned source V3 also
has coordinates C but has an entry_points capability E3 that disagrees with
E1?
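
A minimal sketch of the fallback described above, and of where it stops being
safe. This is not the Trellis/Ivy proxy code; the function name and the dict
keys ('uri', 'coordinates', 'entry_points') are assumptions made up for
illustration.

```python
from collections import defaultdict


def resolve_entry_points(sources):
    """Map each versioned source URI to (status, entry_points URIs).

    `sources` is a list of dicts with hypothetical keys 'uri', 'coordinates'
    and an optional 'entry_points' capability URI.
    """
    # Collect every entry_points capability declared for each coordinates URI.
    by_coordinates = defaultdict(set)
    for src in sources:
        if src.get("entry_points"):
            by_coordinates[src["coordinates"]].add(src["entry_points"])

    resolved = {}
    for src in sources:
        own = src.get("entry_points")
        candidates = sorted(by_coordinates[src["coordinates"]])
        if own:
            resolved[src["uri"]] = ("own", [own])
        elif len(candidates) == 1:
            # The inference described above: borrow from a sibling source.
            resolved[src["uri"]] = ("borrowed", candidates)
        elif candidates:
            # Several siblings declare entry_points; they may disagree.
            resolved[src["uri"]] = ("ambiguous", candidates)
        else:
            resolved[src["uri"]] = ("none", [])
    return resolved


sources = [
    {"uri": "V1", "coordinates": "C", "entry_points": "E1"},
    {"uri": "V2", "coordinates": "C"},
    {"uri": "V3", "coordinates": "C", "entry_points": "E3"},
]
print(resolve_entry_points(sources)["V2"])
```

With only V1 and V2 present, V2 borrows E1; once V3 is added, V2 has two
candidates and nothing in the sources data says which one to trust.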

I'm seeing the above situation in the DAS1 registry -- for example, for
coordinates .../CS_DS40 (NCBI human genome assembly v.36) which has 44
different versioned sources in the registry.  Two of these versioned sources
have entry_points capabilities:
    A) http://hgwdev-gencode.cse.ucsc.edu/cgi-bin/das/hg18/entry_points
    B) http://www.snpbox.org/cgi-box/das/SNPbox_human_44_36f/entry_points
However, these entry_points queries don't return the same thing.  They agree
on naming for chromosome IDs, but for non-chromosomal sequences the naming
starts varying, for instance "M" vs "MT" for the mitochondrial DNA.  More
importantly, they disagree on the stop/length value for nearly every
chromosome!
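
The kind of comparison described above can be sketched roughly as follows.
The two documents are tiny made-up stand-ins, not the real responses from
servers A and B, and the ids and stop values in them are invented; the sketch
assumes the DAS1 entry_points layout with SEGMENT elements carrying id and
stop attributes.

```python
import xml.etree.ElementTree as ET

# Illustrative documents only; ids and stop values are invented.
DOC_A = """<DASEP><ENTRY_POINTS href="http://example.org/das/a/entry_points" version="1">
  <SEGMENT id="1" start="1" stop="1000"/>
  <SEGMENT id="M" start="1" stop="160"/>
</ENTRY_POINTS></DASEP>"""

DOC_B = """<DASEP><ENTRY_POINTS href="http://example.org/das/b/entry_points" version="1">
  <SEGMENT id="1" start="1" stop="1005"/>
  <SEGMENT id="MT" start="1" stop="160"/>
</ENTRY_POINTS></DASEP>"""


def parse_entry_points(xml_text):
    """Return {segment id: stop} for one entry_points document."""
    root = ET.fromstring(xml_text)
    return {seg.get("id"): int(seg.get("stop")) for seg in root.iter("SEGMENT")}


def compare_entry_points(a, b):
    """Return (ids only in a, ids only in b, {shared id: (stop in a, stop in b)})."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    stop_mismatch = {
        seg_id: (a[seg_id], b[seg_id])
        for seg_id in sorted(set(a) & set(b))
        if a[seg_id] != b[seg_id]
    }
    return only_a, only_b, stop_mismatch


print(compare_entry_points(parse_entry_points(DOC_A), parse_entry_points(DOC_B)))
```

Naming differences such as "M" vs "MT" show up in the first two lists, and
stop/length disagreements on shared ids show up in the dictionary.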

So I think the sequence URIs should be specified -- given the coordinate
URIs and capability URIs of a versioned source, there should be a query
mechanism to return sequence info for the coordinate URI and this info
should include sequence URIs.  As illustrated above, both the DAS1
entry_points and DAS2 segments queries currently seem too disconnected from
the coordinates URIs without some changes to the sources XML.  One change would be
to add to the entry_points and/or segments capabilities of "authoritative"
versioned sources a coordinates attribute which would be a relative URI
reference to the coordinates for which they are the authoritative list of
sequences.  This is actually in the RelaxNG schema for DAS2, but currently
commented out.
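
A rough sketch of how a client might use such an attribute, if it existed.
Everything specific here is an assumption: the attribute name "coordinates"
on CAPABILITY, the example.org base URI, and the sample sources document,
which also omits the XML namespace a real DAS2 sources document would carry.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE = "http://example.org/das2/"  # hypothetical base URI for relative references

# Invented sources document; only human/v1 declares itself authoritative.
SOURCES_XML = """<SOURCES>
  <SOURCE uri="human" title="Human">
    <VERSION uri="human/v1">
      <COORDINATES uri="coords/human_36" authority="NCBI" version="36"/>
      <CAPABILITY type="segments" query_uri="human/v1/segments"
                  coordinates="coords/human_36"/>
      <CAPABILITY type="features" query_uri="human/v1/features"/>
    </VERSION>
    <VERSION uri="human/v2">
      <COORDINATES uri="coords/human_36" authority="NCBI" version="36"/>
      <CAPABILITY type="segments" query_uri="human/v2/segments"/>
    </VERSION>
  </SOURCE>
</SOURCES>"""


def authoritative_segments(xml_text, base, coordinates_uri):
    """Return query URIs of capabilities marked authoritative for coordinates_uri."""
    root = ET.fromstring(xml_text)
    hits = []
    for cap in root.iter("CAPABILITY"):
        if cap.get("type") not in ("segments", "entry_points"):
            continue
        declared = cap.get("coordinates")  # hypothetical attribute
        # Resolve the relative reference before comparing absolute URIs.
        if declared and urljoin(base, declared) == coordinates_uri:
            hits.append(urljoin(base, cap.get("query_uri")))
    return hits


print(authoritative_segments(SOURCES_XML, BASE, BASE + "coords/human_36"))
```

The point of the attribute is that human/v2, which shares the coordinates but
makes no such claim, is never picked as the sequence list for them.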

Merging sequence info with sequence URIs won't work for UniProt; it's just too big.

We need to either make one source authoritative for a coordinate system (in either the sources or the coordsys documents) or have the registry validate coordinate system compliance. I'd suggest the latter because it allows for redundancy. Either way, we need to make it a requirement that a coordinate system has at least one server providing segments and sequence, which is not currently the case.
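
A minimal sketch of the last requirement as a registry-side check. This is
not existing registry code; the record layout ('coordinates', 'capabilities')
is hypothetical, and it assumes a single source must offer both a segment
list (entry_points or segments) and sequence to count.

```python
def uncovered_coordinate_systems(sources):
    """Return coordinate system URIs with no source offering segments and sequence.

    `sources` is a list of dicts with hypothetical keys 'coordinates'
    (list of coordinate system URIs) and 'capabilities' (capability names).
    """
    seen = set()
    covered = set()
    for src in sources:
        caps = set(src["capabilities"])
        # DAS1 calls the segment list entry_points; DAS2 calls it segments.
        has_segments = bool(caps & {"entry_points", "segments"})
        has_sequence = "sequence" in caps
        for coord in src["coordinates"]:
            seen.add(coord)
            if has_segments and has_sequence:
                covered.add(coord)
    return sorted(seen - covered)


registry = [
    {"coordinates": ["CS_A"], "capabilities": ["features"]},
    {"coordinates": ["CS_A"], "capabilities": ["entry_points", "sequence"]},
    {"coordinates": ["CS_B"], "capabilities": ["features", "entry_points"]},
]
print(uncovered_coordinate_systems(registry))
```

Because any number of sources can satisfy the check for the same coordinate
system, this style of validation leaves room for redundant reference servers.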
_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das
