Gregg Helt wrote:

Sorry for being imprecise about URIs. What I meant to say was that every feature in DAS/2.0 has a unique _absolute_ URI. Most IDs can be treated as relative URIs but not as absolute URIs, and referring to relative URIs is not particularly useful outside their context.

By relative URI do you mean a URN (e.g. SO:12345), as opposed to the HTML definition (e.g. index.html)? URNs are still useful since they allow us to solve this issue of identifying things that are the same. A resolvable URI (i.e. a URL) is undoubtedly "better", but this is semantic web territory and I'm not convinced it is necessary for DAS. Certainly I think it would be too much of a constraint to layer onto the existing spec in one increment. In fact even using URNs is not easy for everything - segment IDs cannot have colons.

Furthermore, technically not all arbitrary ID strings can be relative URIs either. I thought this was mostly a theoretical issue until my Trellis/Ivy DAS1-->DAS2 proxy choked on such a case on only the third DAS1 data source I was testing, http://www.ebi.ac.uk/das-srv/genomicdas/das/batman_CD4. It returns features that derive their IDs from their genomic location, like "21:26029715,26029814". This can't be any form of URI: according to the URI syntax spec <http://tools.ietf.org/html/rfc3986>, a colon appearing before any forward slash means the "21" should be treated as the URI scheme, but a scheme can't have a digit as its first character. This isn't just a rare instance either -- I count at least sixteen data sources like this (probably more) on ProServer servers for the latest human genome assembly alone.
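
To make the scheme rule concrete, here is a minimal Python sketch of that one check. The function name classify_id and its three return labels are made up for illustration; this is not code from Trellis/Ivy, and it only tests the "colon in the first segment" rule from RFC 3986, not every character restriction in the spec.

```python
import re

# RFC 3986 section 3.1: scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*$")


def classify_id(feature_id):
    """Classify an ID string by the RFC 3986 scheme rule only.

    Hypothetical helper: returns "relative" if there is no colon in the
    first path segment, "has-scheme" if the text before the first colon
    is a syntactically valid scheme, and "invalid" otherwise.
    """
    # Only a colon before the first "/", "?" or "#" can delimit a scheme.
    first_segment = re.split(r"[/?#]", feature_id, maxsplit=1)[0]
    if ":" not in first_segment:
        return "relative"
    candidate_scheme = first_segment.split(":", 1)[0]
    if SCHEME_RE.match(candidate_scheme):
        return "has-scheme"
    # e.g. "21" starts with a digit, so it cannot be a scheme, and the
    # colon also rules the string out as a relative reference.
    return "invalid"
```

With this sketch, "21:26029715,26029814" is classified as "invalid", "SO:12345" as "has-scheme" and "index.html" as "relative".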

In this case, the ID is the least verbose but still unique-to-the-server ID possible, used because the annotation has no natural identifier (the source has per-base annotations). Believe me, there are far worse implementations - some servers don't even try to generate a unique ID for this kind of data. Leaving it blank is something that can be rejected in validation, but it's very difficult to verify it's actually unique...

There is nothing wrong with this particular example w.r.t the 1.53 spec, since the spec says nothing about IDs having to be URIs, it simply says they must uniquely identify the feature on the server. But you have hit upon one of the reasons _resolvable_ URIs (i.e. URLs) will be difficult to implement - annotations that have no natural identifier such as those in the batman_CD4 source. Plus, having a unique identifier for every base in a genome for every experiment it appears in is always going to be verbose.

On a side note, I'm not sure if these IDs are legal DAS 1.53 feature IDs either, since many of them will not be unique within their DAS server, and depending on how you interpret the 1.53 spec the colon may not be a legal ID character.

I don't think there's a problem with the colon - this is an illegal character for reference IDs but not for feature IDs as far as I can see.

The Trellis/Ivy proxy now deals with these cases, but checking each ID to see if it's a legal URI, and figuring out what to do if it's not, is definitely adding some performance overhead to the proxy.
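
For illustration only, one way such an ID could be turned into something usable as a relative URI reference is to percent-encode it. This is a sketch using Python's standard urllib.parse, and the helper names are invented; it is not a description of what Trellis/Ivy actually does. RFC 3986 section 4.2 also allows prefixing "./" to a first segment that contains a colon, which would be an alternative.

```python
from urllib.parse import quote, unquote


def encode_feature_id(feature_id):
    """Percent-encode an arbitrary ID (hypothetical helper).

    safe="" means every character except letters, digits and "_.-~"
    is escaped, so no colon is left to be mistaken for a scheme
    delimiter.
    """
    return quote(feature_id, safe="")


def decode_feature_id(encoded_id):
    """Recover the original DAS1 ID from its percent-encoded form."""
    return unquote(encoded_id)


encoded = encode_feature_id("21:26029715,26029814")
print(encoded)                     # 21%3A26029715%2C26029814
print(decode_feature_id(encoded))  # 21:26029715,26029814
```

Encoding every ID unconditionally avoids the per-ID legality check, at the cost of less readable IDs.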

This also points to the need for better validation of server responses, preferably as enhancements to the validation that the DAS1 registry already does. I doubt if the current DAS2 validator would catch these kinds of things either.

If you can give specific examples of things that could be targets for validation, I believe Jonathan will add them to his list so he can implement them... :)
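
As one possible concrete target, a validator could at least flag blank and repeated feature IDs within a single response. The Python sketch below assumes the IDs have already been extracted from the response into a list; find_id_problems is a hypothetical name, not part of the registry's validator. It cannot prove uniqueness across the whole server, which is the hard part mentioned above.

```python
from collections import Counter


def find_id_problems(feature_ids):
    """Report blank and duplicated feature IDs in one response.

    Hypothetical helper: feature_ids is a list of ID strings (or None
    for a missing ID) taken from a single server response.
    """
    def is_blank(fid):
        return fid is None or not fid.strip()

    blank = sum(1 for fid in feature_ids if is_blank(fid))
    counts = Counter(fid for fid in feature_ids if not is_blank(fid))
    duplicates = sorted(fid for fid, n in counts.items() if n > 1)
    return {"blank": blank, "duplicates": duplicates}
```

For example, find_id_problems(["a", "", "b", "a"]) reports one blank ID and "a" as a duplicate.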
_______________________________________________
DAS mailing list
[email protected]
http://lists.open-bio.org/mailman/listinfo/das
