On Thu, Oct 30, 2008 at 2:01 PM, Garret Wilson <[EMAIL PROTECTED]> wrote:
> ...
> This brings up a related issue regarding the assembly and sequence URIs at
> http://www.biodas.org/wiki/GlobalSeqIDs . Before on this list I've brought
> up the issue of whether DAS has authority to maintain identifiers in
> namespaces from domains controlled by third parties (i.e. NCBI). This still
> worries me.
>
> How confident can we be that the DAS GlobalSeqIDs are stable and will not
> change for a while?

The GlobalSeqIDs were created because at the time there were no stable URIs for genome assemblies and assembly sequences from authoritative sources like NCBI. As far as I know that's still the case, though since then there's been some movement towards stable URIs at NCBI (see http://lists.w3.org/Archives/Public/public-semweb-lifesci/2007Feb/0123.html) and other authoritative sources.

Also, when the GlobalSeqIDs were created, the DAS registry identified coordinates by IDs, not URIs. Now the DAS registry uses the DAS1.53E/2.0 "sources" document, so every COORDINATES entry has a URI. For example, http://www.dasregistry.org/coordsys/CS_DS40 is the registry coordinates URI corresponding to the GlobalSeqID URI http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/ .

Given that we have a central DAS registry, I do think it makes sense for the registry to maintain stable URIs for sequences and assemblies (and other collections of sequences) -- at least when there are no stable URIs from an authoritative source. I think there are better ways to assign URIs than either the way currently used in the DAS1 registry (very opaque) or the DAS2 GlobalSeqIDs (transparent but encroaching on the NCBI namespace), but the more important point is that we should have only one strategy for all versions of DAS. We discussed this back in 2006/2007, and I know Andreas Prlic joined in on several teleconference conversations about merging the DAS2 notion of global seq and assembly IDs into the DAS registry and the "sources" doc coordinates elements.

> Secondly, related to URI resolution, I note that I cannot take an assembly
> URI such as http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/ and simply
> resolve the chromosome ID (e.g. chr1) against it to form the sequence URI.
> My application instead has to have specific knowledge of this particular
> assembly namespace, knowing that it must first append the path segment
> "dna/" to the URI, yielding
> http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/dna/chr1 .
>
> I'd rather my application, once it knew the assembly URI, simply need to
> resolve the chromosome ID to the assembly URI to determine the sequence URI,
> such as http://www.ncbi.nlm.nih.gov/genome/H_sapiens/B36.1/chr1 .
>
> Garret

This illustrates one weakness of the current DAS sources XML -- given the coordinates URI, there is still no way to directly determine the authoritative/reference sequence URIs for those coordinates. These sequence URIs can't be reliably inferred from the coordinates URIs, and I don't think they should be inferred (or constructed) at all.

Attempts to infer sequence URIs currently lead to all sorts of trouble, as I've found in working on the Trellis/Ivy DAS1-->DAS2 proxy. For example, the proxy assumes that if versioned source V1 has coordinates C and entry_points capability E1, then E1 describes the segments available for coordinates C. Based on this assumption, if versioned source V2 also has coordinates C but doesn't have an entry_points capability, then the proxy uses E1 from V1 instead, since the two versioned sources share the same coordinates. This sometimes works, but not always -- what happens if versioned source V3 also has coordinates C but has an entry_points capability E3 that disagrees with E1?

I'm seeing this situation in the DAS1 registry: for example, coordinates .../CS_DS40 (NCBI human genome assembly v.36) have 44 different versioned sources in the registry.
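
To make the V1/V2/V3 problem concrete, here is a minimal Python sketch of the fallback rule with a check for disagreement added. It is not the Trellis/Ivy proxy code. The data layout, the function names, and the segment names and lengths are all hypothetical and chosen only to illustrate the ambiguity.

```python
from collections import defaultdict

# Hypothetical registry contents: (versioned source, coordinates URI, entry_points).
# entry_points maps segment id -> length, or is None when the source has no
# entry_points capability. All ids and lengths here are invented.
SOURCES = [
    ("V1", "C", {"chr1": 1000, "chrM": 50}),
    ("V2", "C", None),
    ("V3", "C", {"chr1": 1200, "chrMT": 50}),
]


def group_entry_points(sources):
    """Collect every entry_points listing advertised for each coordinates URI."""
    by_coords = defaultdict(dict)
    for name, coords, entry_points in sources:
        if entry_points is not None:
            by_coords[coords][name] = entry_points
    return by_coords


def find_conflicts(listings):
    """Return segment ids that are missing from, or differ in length between, listings."""
    all_ids = set()
    for segments in listings.values():
        all_ids.update(segments)
    conflicts = {}
    for seg_id in sorted(all_ids):
        lengths = {name: segs.get(seg_id) for name, segs in listings.items()}
        if len(set(lengths.values())) > 1:
            conflicts[seg_id] = lengths
    return conflicts


def entry_points_for(source_name, sources):
    """Use a source's own entry_points; borrow from a sibling only if unambiguous."""
    by_coords = group_entry_points(sources)
    for name, coords, entry_points in sources:
        if name != source_name:
            continue
        if entry_points is not None:
            return entry_points
        listings = by_coords.get(coords, {})
        if not listings:
            raise LookupError(f"no entry_points known for {coords}")
        conflicts = find_conflicts(listings)
        if conflicts:
            raise ValueError(
                f"ambiguous entry_points for {coords}: {sorted(conflicts)}"
            )
        return next(iter(listings.values()))
    raise KeyError(source_name)
```

With only V1 and V2 present, `entry_points_for("V2", SOURCES[:2])` borrows V1's listing, which is the case where the assumption works. With V3 added, the same call raises `ValueError`, because the two listings disagree both on a length and on the name of a segment. Refusing to guess is one possible policy; the sketch does not say which listing is authoritative, which is exactly the information the sources XML currently lacks.
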
Two of these versioned sources have entry_points capabilities:

A) http://hgwdev-gencode.cse.ucsc.edu/cgi-bin/das/hg18/entry_points
B) http://www.snpbox.org/cgi-box/das/SNPbox_human_44_36f/entry_points

However, these entry_points queries don't return the same thing. They agree on the naming of chromosome IDs, but for non-chromosomal sequences the naming starts to vary, for instance "M" vs "MT" for the mitochondrial DNA. More importantly, they disagree on the stop/length value for nearly every chromosome!

So I think the sequence URIs should be specified -- given the coordinates URI and the capability URIs of a versioned source, there should be a query mechanism that returns sequence info for the coordinates URI, and this info should include sequence URIs. As illustrated above, both the DAS1 entry_points and DAS2 segments queries currently seem too disconnected from the coordinates URIs to serve this purpose without some changes to the sources XML. One such change would be to add, to the entry_points and/or segments capabilities of "authoritative" versioned sources, a coordinates attribute: a relative URI reference to the coordinates for which that capability is the authoritative list of sequences. This attribute is actually in the RelaxNG schema for DAS2, but is currently commented out.

Gregg
_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das
