On 9/05/2006, at 8:46 PM, Matthias Samwald wrote:
Hi Alan,
As far as I know there is no standard URI for a resource at NCBI. I
would like to propose that there be one, since we will all need
them to use when we refer to these resources in our RDF. (and I
need one *now*)
I think we should be aware that this could be a VERY important
decision for the further development of RDF in the life sciences.
The URI scheme we come up with during this project would probably
become THE standard for referencing resources at the NCBI. I guess
we should try to contact someone from the NCBI to make sure the
solution we come up with is acceptable to them. Otherwise they may
soon realize the need for URIs themselves and start creating their
own, conflicting URI scheme. The last thing the Semantic Web needs
is two different URIs for each of the many resources in
the Entrez databases.
Following other styles I've seen, I propose the following:
1. http://www.ncbi.nlm.nih.gov/2006/entrez/<DATABASE_GOES_HERE>/<IDENTIFIER_GOES_HERE>
or
2. http://www.ncbi.nlm.nih.gov/2006/entrez/<DATABASE_GOES_HERE>#<IDENTIFIER_GOES_HERE>
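To make the difference between the two forms concrete, here is a minimal sketch in Python. The base URI is the one proposed above (nothing NCBI has endorsed); the function names and the example database and identifier values are hypothetical.

```python
from urllib.parse import quote

# Base taken from the proposal above; it is a suggestion, not an NCBI-published URI.
ENTREZ_BASE = "http://www.ncbi.nlm.nih.gov/2006/entrez/"


def slash_uri(database, identifier):
    """Form 1: the identifier is a further path segment under the database."""
    return f"{ENTREZ_BASE}{quote(database, safe='')}/{quote(identifier, safe='')}"


def hash_uri(database, identifier):
    """Form 2: the identifier is a fragment on the database URI."""
    return f"{ENTREZ_BASE}{quote(database, safe='')}#{quote(identifier, safe='')}"


if __name__ == "__main__":
    # "protein" and "P12345" are placeholder values for illustration only.
    print(slash_uri("protein", "P12345"))
    print(hash_uri("protein", "P12345"))
```

One practical difference: the fragment in form 2 is not sent to the server on an HTTP GET, so a client dereferencing it retrieves the per-database document, whereas form 1 can be dereferenced per record.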
In my experience, leaving #identifier free for the
most fine-grained data resources gives the most readable URIs. I use
a composition rule:
If dataresource_a is composed of dataresource_b and dataresource_c,
and dataresource_b and dataresource_c cease to exist if
dataresource_a is destroyed, then the URI would be something like
<domainname>/<database>/dataresource_a#<identifier>, where <identifier>
would be dataresource_b or dataresource_c. The #identifier typically
appears in document-specific contexts, i.e. ids that are unique within
a particular document. But extending this to a database means that
these documents are likely to be quite dynamic, and the document
specificity of ids becomes blurred. That's why I'm trying composition
(not aggregation) as the rule for when something is a #id. Not too sure
on the results yet.
We should have a look at how applications (especially triplestores)
handle this. Do they know how to split namespace from identifier in
the first case? I remember that the current version of the
triplestore Sesame has some performance problems when handling
URNs, because it splits namespace and identifier in the wrong place
(creating a new namespace for almost every resource). I know that,
according to the RDF specification, the RDF ID is just an opaque
string, but applications do handle that differently.
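The splitting concern can be illustrated with a small Python sketch. The heuristic below (cut after the last '#' or '/') is an assumption chosen for illustration and is not Sesame's actual algorithm; the URN strings are made-up examples, as are the database and identifier values.

```python
def split_uri(uri):
    """Split after the last '#' or '/' and return (namespace, local_name).

    Hypothetical heuristic: if neither delimiter occurs, the whole string
    is treated as the namespace and the local name is empty.
    """
    cut = max(uri.rfind("#"), uri.rfind("/"))
    if cut == -1:
        return uri, ""
    return uri[:cut + 1], uri[cut + 1:]


def count_namespaces(uris):
    """Number of distinct namespaces the heuristic produces for a set of URIs."""
    return len({split_uri(u)[0] for u in uris})


if __name__ == "__main__":
    base = "http://www.ncbi.nlm.nih.gov/2006/entrez/"
    slash_form = [base + "protein/P12345", base + "protein/Q67890"]
    hash_form = [base + "protein#P12345", base + "protein#Q67890"]
    # Made-up URNs containing neither '#' nor '/'.
    urn_form = ["urn:example:protein:P12345", "urn:example:protein:Q67890"]

    for name, uris in [("slash", slash_form), ("hash", hash_form), ("urn", urn_form)]:
        print(name, count_namespaces(uris))
```

Under this heuristic both proposed HTTP forms yield a single namespace per database, while the delimiter-free URNs each end up as their own namespace, which is the kind of behaviour described above.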
Rationale: can use owl:sameAs to make them the same if we need to.
We can suggest a best practice if we want to preferentially use one
numbering system versus another. (I like the alphanumeric ones,
myself)
We would not be happy to have huge amounts of redundant resources
linked with owl:sameAs. owl:sameAs is nice when it only needs to be
used sparingly, but having two different naming schemes of a large
protein database linked through owl:sameAs would 'pollute' the
Semantic Web right from the beginning. We should seek to avoid this
when we are still in the position to do so.
I cannot see how this can be avoided. The bigger picture is that
different databases and the groups associated with them will use
different URI schemes for describing the same thing. Also, things
that were once deemed different may later come to be thought of as
the same. It is also impossible to predict what URI naming schemes will
make sense further down the track, or what factors various engines
(Swoogle, for instance) might play on. What I think is needed
is a combination of careful thought and tools for URI normalisation:
yes, there may come a time when a sameAs property is suddenly
defined for every database record, but anyone can then use a tool
to normalise to a preferred URI. Sort of like an agent's own
cache victim, but for semantic web services: you may query a
service with one URI, and if that is not a currently active version,
the web service would say "but that URI is also the sameAs of this
preferred one", so the agent can agree to update its URI and
re-perform the query.
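
A rough Python sketch of such a normalisation tool follows. It assumes the owl:sameAs statements are available as plain (uri, uri) pairs and that the caller supplies a set of preferred URIs; the function names and the example.org URIs are hypothetical, and a real tool would read the pairs from a triplestore.

```python
def build_normaliser(same_as_pairs, preferred):
    """Map every URI seen in the sameAs pairs to one URI per equivalence group.

    Groups are formed with a simple union-find, since owl:sameAs is
    symmetric and transitive. Within a group, a URI from `preferred`
    wins if present; otherwise the lexicographically smallest is used.
    """
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving
            u = parent[u]
        return u

    for a, b in same_as_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for u in list(parent):
        groups.setdefault(find(u), []).append(u)

    mapping = {}
    for members in groups.values():
        chosen = next((m for m in sorted(members) if m in preferred), min(members))
        for m in members:
            mapping[m] = chosen
    return mapping


def normalise(uri, mapping):
    """Return the preferred URI for `uri`, or `uri` itself if it is unknown."""
    return mapping.get(uri, uri)


if __name__ == "__main__":
    pairs = [
        ("http://a.example.org/protein/1", "http://b.example.org/protein#1"),
        ("http://b.example.org/protein#1", "http://c.example.org/p1"),
    ]
    mapping = build_normaliser(pairs, preferred={"http://b.example.org/protein#1"})
    print(normalise("http://c.example.org/p1", mapping))
```

A service could consult such a mapping when it receives a non-preferred URI and return the preferred one alongside its answer, so that the agent can update its own copy.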
kind regards,
Matthias Samwald
http://neuroscientific.net
Section on Medical Expert and Knowledge-Based Systems
Core Unit for Medical Statistics and Informatics
Medical University of Vienna/Austria
http://www.meduniwien.ac.at/mes/home_en.html