Re: [DAS] [Fwd: Re: Writeback implementation]

Andy Jenkinson Wed, 29 Oct 2008 15:26:15 -0700

Gregg Helt wrote:

Sorry for being imprecise about URIs, what I meant to say was that every feature in DAS/2.0 has a unique _absolute_ URI. Most IDs can be treated as relative URIs but not absolute URIs, and referring to relative URIs is not particularly useful outside their context.

By relative URI do you mean URN (e.g. SO:12345)? As opposed to the HTML definition (e.g. index.html). URNs are still useful since they allow us to solve this issue of identifying things that are the same. A resolvable URI (i.e. a URL) is undoubtedly "better", but this is semantic web territory and I'm not convinced it is necessary for DAS. Certainly I think it would be too much of a constraint to layer onto the existing spec in one increment. In fact even using URNs is not easy for everything - segment IDs cannot have colons.

Furthermore technically not all arbitrary ID strings can actually be relative URIs either. I thought this was mostly a theoretical issue until my Trellis/Ivy DAS1-->DAS2 proxy choked on such a case on only the third DAS1 data source I was testing, http://www.ebi.ac.uk/das-srv/genomicdas/das/batman_CD4. It returns features that derive their IDs from their genomic location, like "21:26029715,26029814". Which can't be any form of URI, because according to the URI syntax spec <http://tools.ietf.org/html/rfc3986> the appearance of the colon before any forward slash means the "21" should be treated as the URI scheme, but the scheme can't have a digit as the first character. This isn't just a rare instance either -- I count at least sixteen data sources like this (probably more) on ProServer servers for the latest human genome assembly alone.
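The colon rule described above can be sketched in a few lines of Python. This is only an illustration of that one rule from RFC 3986 (the scheme grammar in section 3.1 and the no-colon-in-first-segment rule for relative references in section 4.2), not a full URI validator and not what the Trellis/Ivy proxy actually does. The function name first_colon_is_legal is made up for this example.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )   (RFC 3986, section 3.1)
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")


def first_colon_is_legal(feature_id: str) -> bool:
    """Check only the colon rule; other illegal characters are not detected.

    Hypothetical helper for illustration, not part of any DAS library.
    """
    # The first path segment ends at the first "/", "?" or "#".
    head = re.split(r"[/?#]", feature_id, maxsplit=1)[0]
    if ":" not in head:
        # No colon in the first segment: may still be a relative reference.
        return True
    # A colon in the first segment forces the text before it to be a scheme.
    scheme = head.split(":", 1)[0]
    return SCHEME_RE.match(scheme) is not None


print(first_colon_is_legal("21:26029715,26029814"))    # the batman_CD4-style ID
print(first_colon_is_legal("SO:12345"))                # "SO" is a legal scheme name
print(first_colon_is_legal("./21:26029715,26029814"))  # "./" prefix, as RFC 3986 section 4.2 suggests
```

Prefixing the ID with "./" is the workaround RFC 3986 itself mentions for a relative path whose first segment contains a colon; whether that is an acceptable rewrite for a DAS1 feature ID is a separate question.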

In this case, the ID is the least verbose but still unique-to-the-server ID possible, used because the annotation has no natural identifier (the source has per-base annotations). Believe me, there are far worse implementations - some servers don't even try to generate a unique ID for this kind of data. Leaving it blank is something that can be rejected in validation, but it's very difficult to verify it's actually unique...
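As a rough sketch of the two checks just mentioned, assuming the feature IDs have already been pulled out of one or more server responses into a plain list of strings (find_id_problems is a hypothetical name, not an existing validator function):

```python
from collections import Counter


def find_id_problems(feature_ids):
    """Return (number of blank IDs, sorted list of IDs seen more than once).

    Hypothetical helper: it can only judge the IDs it is given.
    """
    def is_blank(fid):
        return not fid or not fid.strip()

    blank = sum(1 for fid in feature_ids if is_blank(fid))
    counts = Counter(fid for fid in feature_ids if not is_blank(fid))
    duplicates = sorted(fid for fid, n in counts.items() if n > 1)
    return blank, duplicates


blank, duplicates = find_id_problems(["a", "", "b", "a", "  "])
print(blank, duplicates)
```

Blank IDs are caught reliably, but a duplicate is only found if both copies happen to be in the sample, so a clean result here does not show that an ID is unique across the whole server.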

There is nothing wrong with this particular example w.r.t the 1.53 spec, since the spec says nothing about IDs having to be URIs, it simply says they must uniquely identify the feature on the server. But you have hit upon one of the reasons _resolvable_ URIs (i.e. URLs) will be difficult to implement - annotations that have no natural identifier such as those in the batman_CD4 source. Plus, having a unique identifier for every base in a genome for every experiment it appears in is always going to be verbose.

On a side note, I'm not sure if these IDs are legal DAS1.53 feature IDs either, since many of them will not be unique within their DAS server, and depending on how you interpret the 1.53 spec the colon may not be a legal ID character.

I don't think there's a problem with the colon - this is an illegal character for reference IDs but not for feature IDs as far as I can see.

The Trellis/Ivy proxy now deals with these cases, but checking each ID to see if it's a legal URI, and figuring out what to do if it's not, is definitely adding some performance overhead to the proxy.
This also points to the need for better validation of server responses, preferably as enhancements to the validation that the DAS1 registry already does. I doubt if the current DAS2 validator would catch these kinds of things either.

If you can give specific examples of things that could be targets for validation, I believe Jonathan will add them to his list so he can implement them... :)

_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das
