Re: [BioRDF] All about the LSID URI/URN

Jim Myers Fri, 07 Jul 2006 07:09:29 -0700

Adding to the identifier discussion - here are my notes summarizing discussions at/after the initial Semantic Web for Life Science workshop (primarily between myself, Sean Martin, and Eric Miller). Some of the points have come up in other posts and I haven't tried to go through and update this, but I thought it might have some utility still.

My bias, which comes through somewhat in the notes, comes from working with the WebDAV protocol (which allows setting text/XML metadata on any managed URL), which has shown me how powerful a 'universal' mechanism for associating metadata with 'any' URL can be. WebDAV has many limitations, but a URL identifier plus a URIQA-style extension to HTTP (get/set of RDF metadata a la WebDAV, referenced below) seems like it would overcome them. One could then associate metadata about persistence policy with the ID and treat persistent and transient objects uniformly in terms of their other properties (e.g. I use the same protocol to get the provenance of an intermediate data set in an ongoing computation as I would for data that will be kept for 5 years and for data that will be maintained forever).

Side note: The discussion on the list is great - please keep it up!

Jim

Issues:

The "LSID" name:

Are life science identifiers different enough that they need to be treated separately? Do we then need a physical science identifier, a computer science identifier, etc.?

LSID as a protocol as well as a name:

Similar issue, but one that can also be described as death-by-plugins - if everyone who wants to control a namespace for identifiers makes a new protocol requiring a plug-in...

Persistence policy as part of the name/protocol:

Is persistence such a unique and overriding piece of metadata that it should be part of the name and/or require a separate protocol? Does the name of data change when a researcher decides it is valid and should be kept forever? There seem to be problems analogous to the 'don't encode location in the name because it might move' issue.

Persistence policy as a binary option:

There are many shades of grey in persistence - How long is the guarantee? What happens to data with a 5, 10, or 50 year retention schedule, after which it is to be deleted? Is access also guaranteed or just unique naming? Is the guarantee best effort? Does it apply to bits or an equivalent (by whose definition) item, e.g. the PDF copy of an obsolete MS Word 1.0 document? Is persistence policy handled better as metadata defined by a schema(s)?

Metadata retrieval as part of a persistent identifier protocol:

Is metadata unique to persistent resources? Is there a reason to balkanize metadata access by tying the mechanism to a type of resource? Or should the semantic web provide a mechanism allowing metadata association with any resource, persistent or not, via a standard mechanism?

General Commentary:

1) A model for naming resources that a community can agree on is a good / powerful thing; LSID has defined such a model and has a large growing community behind it.

Yes, but the issues above could limit growth and lead to fragmentation of the community, as LSID raises awareness of what globally unique IDs can do and encourages other "my community's ID" protocols and/or modifications that attempt to get around the issues noted above. Will chemists all adopt LSID simply because some of the molecules they work on are related to biology rather than materials science? Will a pharmaceutical company adopt LSID for data with retention schedules?

2) Persistent identification and the ability to persistently resolve names are not artifacts of any technology; they are an organization / community investment. It is unclear what investment the LS community has made at this point in supporting resolution services (DNS, HTTP, or other).

Shouldn't expectations of persistence be managed by naming convention rather than protocol, e.g. http://persistent.my.org/ addresses or the use of Handle-style/meaning-free URLs (e.g. http://456.10123.name.org/myname - see below)? The convention of "www.*" for web servers seems to have worked very well for conveying the expectation that these machines support HTTP.

3) The non-http URI approach requires an extra level of infrastructure for resolving objects. For use in browsers this requires an additional plug-in. There seem to be very few available, and then only for certain browsers. Further, I don't think many realize that browsers are perhaps 1/10th of the applications that follow links (e.g. robots, etc.). This is a different issue completely, and one the DOI / publishers are unfortunately finding out about at this very moment.

A Handle-style proxy mechanism helps a bit here, but it is certainly not as clean/clear as specifying HTTP redirect as *the* resolution mechanism.

4) Non-http URIs put up barriers to adoption by other communities. There are reasons (sometimes) to do this, but has this been explored for LSID and are the implications understood?

And since science is becoming more interdisciplinary, the protocol really needs to be science-wide or pervasive even if namespaces are controlled by smaller orgs.

5) The LSID community has socially agreed that an LSID will point to an immutable resource - the thing one points at will be the same 5, 10, n years later. How can this be enforced socially or technically? What's the penalty for reusing an LSID? If the LSID, the bits to persist, and the hash are all owned by one organization, the bits and hash could be changed together.

This requirement is science-wide - it's been the argument against allowing any URLs as references in the literature, and everyone is moving to treat data in the same way. Life science is ahead in the number of individual data items to be tracked and in how large the community is that needs to persistently refer to things, hence it has the biggest problem right now, but everyone in science (and beyond) has it at some level. Socially, it isn't clear that LSID provides any more leverage than, for example, a naming convention as in #2. Technically, without a means to make name/hash pairs non-repudiable (e.g. by registering them with a neutral third party or using a digital signature), LSID cannot detect reuse of names.

6) It is unclear how best to use LSID; more specifically *when* to use it and when *not* to. There was talk at the meeting of using these for documents, reports, concepts declared on the Semantic Web, etc.

There's a slippery slope here and it will be hard to have a clear convention. I may want to name my raw data, the average of my raw data, a calibrated version of my data, my latest/best data, a graph of my data, the paper about the data, etc. From various discussions of versioning, it is clear that there are use cases that need to name/expose both the individual versions and the 'latest' version, whatever number that currently is, which means bit-level persistence will probably not meet all life-science needs, which may lead to 'abuse' of LSIDs with 0-byte data to refer to things with dynamics.

7) Is LSID bad?

No. The level of adoption of LSID is impressive (though it isn't clear how much of that is simply attaching LSIDs for future use versus actively producing and consuming them). While the discussions at the Semantic Web for Life Sciences workshop were negative at times, one should not criticize LSIDs without acknowledging that they are a step forward and are definitely enabling and educating the community. However, the semantic web and the life sciences will need more general mechanisms for naming and associating metadata with resources, and a means to provide more detailed persistence information; promoting LSIDs as a short-term solution may not be the best option if progress on these issues can be made quickly.
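
A rough illustration of the technical half of point 5: the sketch below shows how a neutral third party that records name/hash pairs could detect an LSID being reused for different bits. It is only a sketch under assumptions - the NameRegistry class, its methods, and the example LSID are all made up for illustration and are not part of the LSID spec.

```python
import hashlib


class NameRegistry:
    """Hypothetical neutral third-party registry of (name, content hash) pairs."""

    def __init__(self):
        self._digests = {}  # name -> SHA-256 hex digest recorded at first registration

    def register(self, name: str, data: bytes) -> str:
        """Bind a name to the hash of its bits; refuse to rebind it to other bits."""
        digest = hashlib.sha256(data).hexdigest()
        previous = self._digests.setdefault(name, digest)
        if previous != digest:
            raise ValueError(f"{name} is already bound to different content")
        return digest

    def verify(self, name: str, data: bytes) -> bool:
        """Check that the bits served today match what was registered."""
        return self._digests.get(name) == hashlib.sha256(data).hexdigest()


registry = NameRegistry()
lsid = "urn:lsid:example.org:data:123"  # made-up identifier
registry.register(lsid, b"original bits")
print(registry.verify(lsid, b"original bits"))  # bits unchanged
print(registry.verify(lsid, b"edited bits"))    # bits swapped by the owner
```

Because the registry is held by someone other than the owner of the name, the owner can no longer change the bits and the hash together without it being noticed.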

Potential Alternatives:

Naming:

The Handle System - similar to LSID, with its own protocol and resolution mechanism. Used in DOIs. Has a proxy mechanism so no plug-in is required - http://hdl.handle.net/<some-handle> will invoke a resolver service and redirect you to the resource. The Handle System has its own protocol with its own metadata methods and thus shares those issues with LSIDs, but its proxy and the fact that the protocol and namespaces are separate (i.e. the LSID community could organize part of handle space for themselves) seem like advantages over LSID. Handles are also being proposed as part of the Grid naming mechanism (see http://www.globusworld.org/program/abstract.php?id=33, https://forge.gridforum.org/projects/ogsa-wg/document/draft-charter-naming-wg/en ).

Persistent URLs - standard URLs maintained by authorities that use HTTP Redirect to provide access to resources. The PURL website has extensive documentation and FAQ information: http://purl.oclc.org

Naming convention only - Use standard URLs and DNS resolution. Resolvers/authorities could be identified via a convention such as addresses starting with uid, e.g. http://uid.my.org/. If URIs used as persistent names are meaning-free addresses, e.g. http://456.10123.name.org/myresourcename, it would be easy to transfer resolution duties between organizations, i.e. to reassign 10123.name.org from my organization to yours if my org doesn't want to maintain things anymore. Use redirects as a resolution mechanism.
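
To make the "naming convention only" option a bit more concrete, here is a minimal sketch of what a redirect-based resolver for meaning-free names might do. The AUTHORITIES table, the resolve function, and the target hosts are assumptions invented for this example; a real deployment would sit behind DNS and an HTTP server.

```python
from urllib.parse import urlsplit

# Hypothetical table: meaning-free authority host -> base URL of whoever holds the data now.
AUTHORITIES = {
    "456.10123.name.org": "http://data.my.org/archive",
}


def resolve(url, authorities=AUTHORITIES):
    """Return the (status, headers) a resolver could answer with for a persistent name."""
    parts = urlsplit(url)
    base = authorities.get(parts.hostname)
    if base is None:
        return 404, {}
    # The name itself never changes; only the redirect target does.
    return 302, {"Location": base + parts.path}


print(resolve("http://456.10123.name.org/myresourcename"))

# Transferring resolution duties to another organization is a one-entry change.
AUTHORITIES["456.10123.name.org"] = "http://repository.your.org/adopted"
print(resolve("http://456.10123.name.org/myresourcename"))
```

Existing references keep working after the handover because clients only ever see the meaning-free name and follow whatever Location the resolver currently returns.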

Metadata:

Protocols such as LSID and The Handle System have their own extensible metadata mechanisms. For URL-based options, there are proposals for ways to add metadata capabilities to URLs:

The Nokia MPUT/MGET/MDELETE methods proposed as part of their URI Query Agent Model (URIQA) (http://sw.nokia.com/uriqa/URIQA.html). URIQA defines the concept of a Concise Bounded Description of a resource as the set of RDF statements accessible via these methods.

Clark et al. propose an alternate mechanism using XPointer and HTTP in A Semantic Web Resource Protocol: XPointer and HTTP (http://www.mindswap.org/papers/swrp-iswc04.pdf).
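
As a rough idea of what the URIQA-style option looks like from a client, the sketch below builds (but does not send) an MGET request for the description of a resource, using only the Python standard library. The build_mget helper, the example URL, and the choice of Accept header are assumptions for illustration; a server would need to implement the URIQA methods for such a request to succeed.

```python
import urllib.request


def build_mget(resource_url):
    """Build an HTTP request that asks for metadata about a URL rather than the resource itself."""
    return urllib.request.Request(
        resource_url,
        method="MGET",  # URIQA extension method; ordinary servers will not understand it
        headers={"Accept": "application/rdf+xml"},  # assumed preference for RDF/XML
    )


request = build_mget("http://uid.my.org/dataset/42")  # made-up persistent URL
print(request.get_method(), request.full_url)
```

The point of the design is that the same URL serves as both the name of the data (GET) and the handle for its metadata (MGET), so no separate resolution protocol is needed.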

Persistence Policy:

With any of these naming and metadata combinations, persistence could be treated in the same way as other metadata: statements about persistence policy could be standardized and accessed via the same mechanism used to discover authors, type, creation date, etc.
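
A small sketch of the idea, assuming some metadata mechanism is already in place: persistence policy is just another set of properties, read through the same accessor as creator or creation date, for transient and permanent resources alike. The METADATA store, the property names, and the URLs are all hypothetical; no standard schema for persistence policy is implied.

```python
# Hypothetical metadata store keyed by resource URL.
METADATA = {
    "http://uid.my.org/run/17/intermediate": {
        "creator": "A. Researcher",
        "created": "2006-07-07",
        "retention_years": 5,        # deleted after the retention schedule ends
        "guarantee": "best effort",
        "scope": "bits",
    },
    "http://uid.my.org/dataset/42": {
        "creator": "A. Researcher",
        "created": "2006-07-07",
        "retention_years": None,     # None is used here to mean "maintained forever"
        "guarantee": "institutional commitment",
        "scope": "equivalent item",
    },
}


def get_property(url, name):
    """One accessor for every kind of metadata, persistence policy included."""
    return METADATA.get(url, {}).get(name)


for url in METADATA:
    print(url, get_property(url, "creator"), get_property(url, "retention_years"))
```

Nothing in the name or the protocol changes when a researcher decides an intermediate result should be kept forever; only the retention statement is updated.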

Additional URLs:
Handles: www.handle.net
Tim B-L musings on names from '96: http://www.w3.org/DesignIssues/NameMyth.html
Meaning-free DNS names: http://www.frankston.com/public/essays/DNSSafeHaven.asp
Comparison of Handles and PURLs (by a Handle advocate?): http://web.mit.edu/handle/www/purl-eval.html
LSID spec: http://www.omg.org/docs/dtc/04-05-01.pdf

Persistent Indentification (sic): A Key Component of an E-Government Infrastructure, Updated July 26, 2004 - discusses PURLs and Handles and other alternatives: http://cendi.dtic.mil/publications/04-2persist_id.html

At 07:57 AM 7/7/2006, Dan Connolly wrote:

http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Jun/0210.html

> The root of the problem is that the URL
> contains in it more than just a name. It also contains the network
> location where the only copy of the named object can be found (this is the
> hostname or ip address)

Which URL is that? It's not true of all URLs. Take, for example,
http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/

That URL does not contain the network location where the only
copy can be found; there are several copies on mirrors around the
globe.

$ host www.w3.org
www.w3.org has address 128.30.52.46
www.w3.org has address 193.51.208.69
www.w3.org has address 193.51.208.70
www.w3.org has address 128.30.52.31
www.w3.org has address 128.30.52.45

FYI, the TAG is working on a finding on URNs, Namespaces, and Registries;
the current draft has a brief treatment of this issue of location (in)dependence...
http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#loc_independent

> as well as the only means by which one may
> retrieve it (the protocol, usually http, https or ftp). The first question
> to ask yourself here is that when you are uniquely naming (in all of space
> and time!) a file/digital object which will be usefully copied far and
> wide, does it make sense to include as an integral part of that name the
> only protocol by which it can ever be accessed and the only place where
> one can find that copy?

If a better protocol comes along, odds are good that it will be usable
with names starting with http: .

See section 2.3 Protocol Independence
http://www.w3.org/2001/tag/doc/URNsAndRegistries-50.html#protocol_independent

> Unfortunately when it
> comes to URL's there is no way to know that what is served one day will be
> served out the next simply by looking at the URL string. There is no
> social convention or technical contract to support the behavior that would
> be required.

Again, that's not true for all URLs. There are social and technical
means to establish that

http://www.w3.org/TR/2006/WD-wsdl20-rdf-20060518/

can be cached for a long time.

The social mechanism includes published policies such as...

"As of this note, persistent resources include:
     1. ...
     2. Those which start "http://www.w3.org/TR/" immediately followed
        by four decimal digits."
--- http://www.w3.org/Consortium/Persistence

and the technical mechanisms include HTTP caching headers:
Expires: Sat, 07 Jul 2007 12:51:56 GMT

(a 1 year expiry time is the maximum time per rfc2616)
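
As an illustration of the technical mechanism, a server could compute such a header along these lines. This is only a sketch using the Python standard library; the expires_header function name and the 365-day default are choices made for this example.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def expires_header(now, days=365):
    """Format an Expires value `days` after `now` (which must be timezone-aware UTC)."""
    return format_datetime(now + timedelta(days=days), usegmt=True)


served_at = datetime(2006, 7, 7, 12, 51, 56, tzinfo=timezone.utc)
print("Expires:", expires_header(served_at))
```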

--
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541 0875 0F91 96DE 6E52 C29E

James D. Myers
Associate Director, Cyberenvironments and Technologies, NCSA
[EMAIL PROTECTED]
