As background, it might help to know that one of the earliest requirements of the I3C's LSID group was that bio/pharma companies be able to copy large chunks of public databases internally so that they could access them repeatedly in private. They also wanted to be certain of exactly which version of the information they were using, since it changed frequently. The LSID URN scheme includes the means to do both of these seamlessly. Scientists can use LSIDs to name or identify objects received by mail from a colleague, resolved over the web, retrieved from a copy archive or simply copied out of a file system, without ambiguity, because the location of the object is unimportant and the naming syntax provides for versioning information. There are no doubts, and no further complex namespace rules from the provider of that data that need to be consulted. In short, a machine can easily be made to do it all. Perhaps it helps to think about LSIDs more in the "taking a couple of copies here and there" sense than in the "web proxy caching" sense.
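To make the naming syntax concrete, here is a minimal sketch in Python of pulling an LSID URN apart into authority, namespace, object and revision. The LSID used is made up for illustration, and this is not the official LSID client stack.

```python
# Illustrative sketch only: parse an LSID URN into its parts to show that the
# name carries authority, namespace, object id and an optional revision, but
# no network location and no protocol. The example LSID below is hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass
class LSID:
    authority: str            # naming authority, a DNS-style label
    namespace: str            # authority-local namespace
    object_id: str            # the named object
    revision: Optional[str]   # optional version component


def parse_lsid(urn: str) -> LSID:
    parts = urn.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("not an LSID URN: %r" % urn)
    revision = parts[5] if len(parts) > 5 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Two copies of the same data, whether mailed by a colleague, pulled from an
# archive or copied out of a file system, carry the same name; the name alone
# says nothing about where, or how, to fetch them.
print(parse_lsid("urn:lsid:example.org:proteins:P12345:2"))
```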


Anyway, a couple of embedded comments are included below:

[EMAIL PROTECTED] wrote on 07/07/2006 08:57:33 AM:

>
> http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Jun/0210.html
>
> > The root of the problem is that the URL
> > contains in it more than just a name. It also contains the network
> > location where the only copy of the named object can be found (this is the
> > hostname or ip address)
>
> Which URL is that? It's not true of all URLs. Take, for example,
>   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/
>
> That URL does not contain the network location where the only
> copy can be found; there are several copies on mirrors around the
> globe.
>
> $ host www.w3.org
> www.w3.org has address 128.30.52.46
> www.w3.org has address 193.51.208.69
> www.w3.org has address 193.51.208.70
> www.w3.org has address 128.30.52.31
> www.w3.org has address 128.30.52.45
>

Can you explain this a little further please, Dan? If you mean that the W3C has mirrors and a DNS server that responds with the appropriate IP addresses depending on where you are coming from, or on which servers are available or less loaded, then I agree that the URL points to multiple copies of the object; but any single access path to that object is always determined by the URL's issuing authority. I actually wrote the original code to do just this for IBM sports web sites in the mid-90s! I am sure, though, that you will appreciate that this is not at all the same thing as being able to actively source the named object from multiple places, where the client side chooses both the location and the means of access (the protocol), and where this can still be done if the original issuing authority for the object goes away. From the client's point of view, with a URL the protocol and the location are fixed, and if either disappears the client has no way to ask anyone else for a copy. In my original post my thoughts were about the second of these meanings, as the first has obviously been in practice for over a decade now. Sorry for not being explicit earlier.
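As a small illustration of the first meaning (standard-library Python only, against the real www.w3.org hostname quoted above): whichever address the lookup returns, it is still the owner of that hostname's DNS zone doing the choosing, and the client has nowhere else to turn if that zone goes away.

```python
# Sketch: a client resolving www.w3.org may get several candidate addresses,
# but every one of them is chosen by the DNS zone controlled by the URL's
# issuing authority; the client cannot substitute a different source or a
# different protocol on its own.
import socket

addresses = sorted({info[4][0]
                    for info in socket.getaddrinfo("www.w3.org", 80,
                                                   proto=socket.IPPROTO_TCP)})
for addr in addresses:
    print(addr)   # multiple mirrors, all selected by the hostname's owner
```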

>
>
> FYI, the TAG is working on a finding on URNs, Namespaces, and Registries;
> the current draft has a brief treatment of this issue of location
> (in)dependence...
> http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#loc_independent
>
>
> > as well as the only means by which one may
> > retrieve it (the protocol, usually http, https or ftp). The first question
> > to ask yourself here is that when you are uniquely naming (in all of space
> > and time!) a file/digital object which will be usefully copied far and
> > wide, does it make sense to include as an integral part of that name the
> > only protocol by which it can ever be accessed and the only place where
> > one can find that copy?
>
> If a better protocol comes along, odds are good that it will be usable
> with names starting with http:

I am not sure I understand how this can be possible. For an evolved HTTP, perhaps, but for protocols that have not yet been conceived I am not so sanguine.

>
> See section 2.3 Protocol Independence
> http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#protocol_independent
>

Hmm, I am not sure I can buy the argument at the above link yet. Is it even an argument? "...because myURIs always map to HTTP anyway, it is the same as if it were HTTP, so why bother?"

The main difference, as far as I can see, is that the mapping provides a level of indirection. That seems quite a significant difference and may be the point of having a myURI in the first place: the intention, no doubt, is to leave room for other protocols as they emerge, not tie a name to a single one, and provide flexibility for actual deployment. In my experience indirection is the great friend of people doing actual deployments. Also, in this case "protocol" includes not just the technical TCP socket connection, GET headers and so on, but also the issues surrounding domain ownership, which are part of the resolution process. While we may be reasonably certain about the technical issues, the uncertainties of tying one's long-term identifier to a hostname (even a virtual one like www.w3.org) are considerable, and in the face of this a layer of indirection begins to look quite prudent.
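A minimal sketch of what I mean by that layer of indirection, with entirely hypothetical names and endpoints: the name resolves to a list of (protocol, location) pairs rather than carrying one of each inside itself.

```python
# Hypothetical resolver table, sketch only: the client resolves a name to a
# *list* of access points and only then picks a protocol and a location,
# instead of having both baked into the identifier.
RESOLVER = {
    "urn:lsid:example.org:proteins:P12345:2": [
        ("http", "http://mirror-a.example.net/proteins/P12345-2"),
        ("ftp",  "ftp://archive.example.edu/pub/proteins/P12345-2"),
        ("file", "file:///wide-area-fs/proteins/P12345-2"),
    ],
}


def access_points(name):
    """Return every (protocol, location) pair currently known for a name."""
    return RESOLVER.get(name, [])


# A new protocol, or a replacement host after the original authority goes
# away, is just another row in the table; the name itself never changes.
for proto, where in access_points("urn:lsid:example.org:proteins:P12345:2"):
    print(proto, where)
```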

Also note that this is not just about pure naming, since retrieval of both data and metadata from multiple sources is explicitly intended. LSIDs are already mapped to multiple protocols (which would not be possible without indirection); certainly this includes HTTP URLs, but also FTP and file:// URLs for wide-area file systems, as well as SOAP (which itself supports multiple transport protocols). The LSID spec explicitly allows the client to accumulate metadata from multiple metadata stores, using multiple protocols, without duplication, using just the single URN.
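Along the same lines, here is a rough sketch of accumulating metadata for one LSID from several stores reached over different protocols, without duplication. The store URLs are hypothetical and fetch_metadata is a placeholder standing in for real metadata retrieval calls.

```python
# Sketch only: hypothetical store endpoints, and a toy de-duplication rule
# (set union over returned statements).
METADATA_STORES = [
    "http://metadata.example.org/authority/metadata",       # an HTTP store
    "ftp://mirror.example.edu/pub/metadata",                 # an FTP mirror
    "http://third-party.example.com/annotations",            # independent annotations
]


def fetch_metadata(store, lsid):
    """Placeholder: pretend each store returns a set of statements about the LSID."""
    return {"%s seenAt %s" % (lsid, store)}


def accumulate(lsid):
    """Union the statements from every store; the set drops exact duplicates."""
    statements = set()
    for store in METADATA_STORES:
        statements |= fetch_metadata(store, lsid)
    return statements


for statement in sorted(accumulate("urn:lsid:example.org:proteins:P12345:2")):
    print(statement)
```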
>
> > Unfortunately when it
> > comes to URLs there is no way to know that what is served one day will be
> > served out the next simply by looking at the URL string. There is no
> > social convention or technical contract to support the behavior that would
> > be required.
>
> Again, that's not true for all URLs. There are social and technical
> means to establish that
>
>   http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/
>
> can be cached for a long time.

Yes, but which URLs? My original post went on to say:

--
`One type of URL response may be happily cached, perhaps for ever, the other type probably should not, but to a machine program the URL looks the same and without recourse to an error prone set of heuristics it is extremely difficult perhaps impossible to programmatically tell the difference. Given that what we are designing is meant to be a machine readable web, it is vital to know which URLs behave the way we want and which do not. Can one programmatically tell the difference without actually accessing them? Can one programmatically tell the difference even after accessing them?`
--

Perhaps I should have written that `it is vital to know which URIs behave the way we want and which do not`. You fairly responded that HTTP has an Expires header, and that the social conventions around how to split up a URL [and what meaning the various parts of the substrings have, and what this implies for their shelf lives or versioning] can be written and published, perhaps even in a machine-readable form. But for automation one would need to dereference even this at some point and load it into the machine's understanding stack somehow. For the time being one would need to program in different heuristics for every individual data source. A long row to hoe, I think, and one that would likely defeat the objective of having a simple stack that can fetch the data given an identifier. We would be kidding ourselves if we did not acknowledge serious adoption problems. Note that I was optimistic in the `cached, perhaps for ever` statement, because, as you note, HTTP Expires only supports caching for a year. Does this mean that the named object could change after a year? (Who knows.) That would be a problem for this community, for both scientific and legal reasons. Of course, this is one area that has some reasonably easy fixes.
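To illustrate the "even after accessing them" problem, here is a small standard-library sketch of all a program really has to go on once it has fetched a URL: whatever Cache-Control, Expires or Last-Modified headers the server happens to send, none of which can promise more than the roughly one-year horizon mentioned above.

```python
# Sketch: nothing in the URL string itself says how long the body stays valid;
# after fetching, a program only sees freshness hints in the response headers.
import urllib.request


def cache_hints(url):
    """Fetch a URL and report the headers a cache could act on."""
    with urllib.request.urlopen(url) as resp:
        return {
            "Cache-Control": resp.headers.get("Cache-Control"),
            "Expires": resp.headers.get("Expires"),
            "Last-Modified": resp.headers.get("Last-Modified"),
        }


print(cache_hints("http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/"))
```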

Given both the mixed-up social contract surrounding HTTP URLs (is it an RPC transport or a permanent [!?] document name, how does one version it, and who owns the hostname now?) and a number of technical shortcomings relative to the community's requirements, it is not hard to see how the idea of a new URN with its own specialized technical and social contracts provided a fresh start while still mapping down onto existing internet infrastructure.


Kindest regards, Sean

--
Sean Martin
IBM Corp
