Hi All,
First, I'd like to recommend two articles I believe are very relevant
to this discussion and may help provide us a clearer sense of how to
proceed here:
1) X. Wang, Robert Gorlitsky, and Jonas S Almeida, From XML to RDF:
how semantic web technologies will change the design of 'omic'
standards (2005) Nat. Biotech., v23, n9, p1099
(Xiaoshu is first author on this - I know this may be bringing
"coals to Newcastle," but if there are some on this list who have not
read this article, I'd strongly recommend they do.)
2) G.V. Gkoutos, E. C. J. Green, S. Greenway, A. Blank, A.-M.
Mallon, J.M. Hancock, CRAVE: A database, middleware, and
visualization system for phenotype ontologies (2005) Bioinformatics,
v21, n7, p1257.
I'll explain below where I see these fitting in to this discussion.
I would maintain that the "social" issue we are addressing here is the
shared, community view of the lower levels of the semantic graph,
which is very much related to what we need to have the machine
algorithms parse.
I would also maintain that the community efforts to produce a "formal
form" (of the semantics) in the relevant domains of biomedical
knowledge are very much separate from research fields focused on
analyzing natural language expressions. A "string of natural
language" is an instance of a lexical "view" of a formal form of
semantics - the formal semantic graph existing (somewhere) in the
author(s)' brain(s) - which may or may not conform to the shared
formal semantic frameworks being developed by the community to cover
specific knowledge domains in biomedicine. RDF is particularly good
at providing a formal way of making explicit the many semantic
relations (explicit and implied) in a phrase of natural language, but
that doesn't mean when we talk about semantic representation in RDF,
we are always talking about representing natural language
expressions. Dealing with natural language is critical when parsing
meaning from existing scientific articles, but here I believe we are
more focused on coming up with a means by which we can specify/
identify the semantic entities related to instances in data
repositories, rather than dealing only with what is extrapolated
from parsing the literature.
It is very important not to confuse the lexicon with an ontology.
The way I like to think of it, the lexicon is to an ontology as a
SQL VIEW is to your core data model. The "view" contains a subset of
the relational content consistent with the more complete abstract
model, but the goal of the view is to serve a particular application
requirement; it therefore makes compromises that can be very
application-specific and that do not necessarily reflect the
underlying model's assertions.
This is not just semantics. ;-)
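To make the analogy concrete, here is a minimal, hypothetical sketch (all table and term names are invented) of how a SQL VIEW exposes an application-facing subset of a core model, just as a lexicon exposes term strings without the ontology's full structure:

```python
import sqlite3

# Hypothetical "core model": concepts plus the lexical terms naming them.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE concept (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE term (term TEXT, concept_id INTEGER REFERENCES concept(id));
    INSERT INTO concept VALUES (1, 'cerebral cortex');
    INSERT INTO term VALUES ('cortex cerebri', 1);
    INSERT INTO term VALUES ('cerebral cortex', 1);

    -- The "lexicon": a flattened, application-facing projection that keeps
    -- only the term strings and drops the concept-level structure.
    CREATE VIEW lexicon AS SELECT term FROM term;
""")
terms = [row[0] for row in cur.execute("SELECT term FROM lexicon ORDER BY term")]
print(terms)
```

The view answers the application's lookup need, but the core model's assertion that both strings name the same concept is not recoverable from the view alone.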
As several people on this list could easily expound on much better
than I, there is a world of difference between the computational
linguistic fields focused on deducing semantic content from natural
language strings and the formal ontological efforts to derive a
foundational semantic framework for biomedical KR. One would hope
the Knowledge Extraction process performed by the computational
linguists can be made to converge with (or used to re-architect when
necessary) the shared community semantic graph, but the two are
certainly not synonymous.
I would agree with what I believe Alan pointed out - it is a very
complex issue to resolve the difference between the semantics
associated with a particular data instance (e.g., a somatically
recombined sequence in a specific patient that led, through a very
complex biochemical & morphogenetic process, to a specific neoplasm)
and the related, higher-level shared semantic descriptions (e.g., of
the gene in which that mutation took place). I don't know that we can
expect to resolve that issue given our current limited scope. I do
feel it's a critical issue to the overall goal of using semantic web
descriptions of resources (including primary data) to drive new
knowledge discovery.
As I mentioned in the phone call, one of the ways these issues can be
resolved is via the use of semantically-based mediation technology.
As someone else pointed out in the call, it is really untenable to
attempt to warehouse all the data needed for field-wide, higher-order
data repositories beyond the biomolecular. In many ways, even in the
biomolecular domain, we've outgrown warehousing. PDB, SwissProt,
GenBank - a great deal of bioinformatics work that focuses on
content in these warehouses is targeting integration across
repositories and links to other, newer emerging repositories. In
many ways, this is a task semantic web tech is most suited to
support (see the paper by Xiaoshu cited above). To make this work,
mediation technology requires an alignment of participating
repository schemas. This can rarely be done effectively without
referring to some shared, community semantic framework. Of course,
this also requires, as Xiaoshu has pointed out on this thread, a more
explicit statement of the "processing contract" - an extremely thorny
issue you cannot avoid when you are actually trying to do something
with RDF content.
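As a toy illustration of what such schema alignment might look like (all field names and URIs below are invented stand-ins, not BIRN's actual mappings):

```python
# Each repository registers a mapping from its local column names onto
# shared, community-level semantic entities. URIs are hypothetical.
SHARED = "http://example.org/neuro#"

lab_a_mapping = {"subj_dx": SHARED + "Diagnosis", "roi_vol": SHARED + "RegionVolume"}
lab_b_mapping = {"diagnosis": SHARED + "Diagnosis", "volume_mm3": SHARED + "RegionVolume"}

def align(local_field: str, mapping: dict) -> str:
    """Resolve a local schema field to its shared semantic entity."""
    return mapping[local_field]

# Two differently named columns resolve to the same shared entity -
# which is what lets a mediator treat them as comparable.
same_entity = align("subj_dx", lab_a_mapping) == align("diagnosis", lab_b_mapping)
print(same_entity)
```

The "processing contract" issue surfaces immediately even in a sketch this small: nothing here says what a mediator is entitled to do once two fields are declared to denote the same entity.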
Semantic data mediation is definitely required in the neuro domain.
In the BIRN project, we have a data mediator used to
link across the 40+ disparate research lab repositories of primary
and reduced data (Luis Marenco, Gordon, Perry Miller, and Kei are
also developing a mediator framework at Yale, I believe). There is a
BIRN mediator "registration" process each participant lab needs to go
through to link their data to the mediator. The goal is the mediator
would resolve queries made to the BIRN portal into sub-queries across
the participating "registered" databases. Though initially only
minimally tied to a semantic description of the source repositories,
it's now clear that a critical part of this process is for the source
databases to map the entities in their schemas to the appropriate
higher-level semantic entities to which they refer, drawing on a
shared semantic framework for the domain of neurodegenerative disease
being developed by the BIRN Ontology Task Force (I am a member of the
BIRN OTF).
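The registration and query fan-out pattern described above might be sketched roughly like this (the class, source, and query strings are all hypothetical, not the actual BIRN mediator API):

```python
# Hypothetical sketch: sources register a mapping from shared semantic
# entities to local queries; the mediator resolves one portal-level
# request into sub-queries across every registered source.
class Mediator:
    def __init__(self):
        self.sources = {}  # source name -> {shared entity: local query}

    def register(self, name, entity_to_query):
        """A participating lab links its data via a registration step."""
        self.sources[name] = entity_to_query

    def resolve(self, entity):
        """Fan a portal query out to each source able to answer it."""
        return {name: queries[entity]
                for name, queries in self.sources.items()
                if entity in queries}

m = Mediator()
m.register("lab_a", {"Diagnosis": "SELECT subj_dx FROM subjects"})
m.register("lab_b", {"Diagnosis": "SELECT diagnosis FROM patients"})
subqueries = m.resolve("Diagnosis")
print(sorted(subqueries))
```

A real mediator must also merge the sub-query results back into one answer, which is exactly where the shared semantic framework earns its keep.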
In the course of developing the BIRN shared semantic framework, we've
begun to establish a set of "best practices", at least for what we
are doing within BIRN, that appear to be specifically applicable to
the topic under discussion:
1) Re-use existing knowledge resources whenever possible. This
extends from flat term lists (gene names), to integrated lexical
graph indexes such as NeuroNames, on through to formally complete
ontologies such as the Foundational Model of Anatomy (FMA). We are
rarely able to use these "as is", yet it is clear that by making the
effort to examine where we need to adapt these resources, we expend
nearly an order of magnitude fewer resources than if we were to
refashion the resource from first principles ourselves. Often you
can still use a given domain ontology's formalism even when the
ontology itself doesn't provide the granularity you require. Using
the same - or a compatible - formalism at least holds out the
possibility of later integrating what you create into the community
resource.
2) Almost all of the semantic information we expect to expose to the
mediator can be reduced to an elemental view - that of measurements
made in the course of an investigation meant to specify
(quantitatively or qualitatively) phenotypic traits. This is true of
spatially-mapped CNS gene or protein expression data (e.g., Allen
Brain Atlas, GENSAT, Desmond Smith's "voxelized" microarray data
sets), as well as of the assays of behavior and cognition that
pervade the human-focused neuroimaging projects within BIRN. With
this in mind, we came to the understanding that it is important:
a) to use a shared foundational ontology (we are trying to use the
BFO model, which is beginning to be adopted by many biomedical
ontology efforts - e.g., FuGO, FMA) and a community-shared
collection of semantic relations (again, we are converging on the
OBO Relations ontology -
http://obo.sourceforge.net/relationship/ - another article worth
reading)
b) to develop a means of formal phenotypic attribute description
more flexible and capable of evolving than the current approaches in
use by the community - e.g., use of the Mammalian Phenotype Ontology
by the GO folks at the Jackson Labs (http://www.informatics.jax.org/
menus/vocab_menu.shtml). These "pre-coordinated" views of complex
knowledge domains are very useful when you are providing a user
interface for human literature curators (as GO & MGI do with MPO),
but they don't allow algorithms to re-combine the more elemental
semantic aspects represented in these "flattened" views. For both
disease and phenotype in general, there is also a need to tie
descriptions more specifically to the observations extrapolated from
the primary data. This is where the second citation above comes in
(CRAVE - an application using PATO). Using PATO with FuGO (once FuGO
spreads to cover assays, devices, & reagents outside of gene &
protein expression, as it is gradually moving toward), one can build
a semantically well-defined description of phenotype that maintains
the integrity of the semantic links both to the primary data AND to
the shared, higher-level semantic frameworks in the community.
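As a rough, hypothetical illustration of the kind of description this enables (the URIs and term names below are invented stand-ins, not actual PATO/FMA/OBO identifiers), a single measurement instance can point simultaneously at its primary data and at shared ontology terms:

```python
# Toy triple set: a phenotype observation linked both "down" to the
# primary data it derives from and "up" to shared ontology terms.
EX, PATO, FMA, OBO = ("http://example.org/data#", "PATO:", "FMA:", "OBO_REL:")

triples = [
    (EX + "obs42", "rdf:type", EX + "VolumeMeasurement"),
    (EX + "obs42", OBO + "inheres_in", FMA + "Hippocampus"),      # shared anatomy term
    (EX + "obs42", EX + "has_quality", PATO + "decreased_size"),  # shared quality term
    (EX + "obs42", EX + "derived_from", EX + "mri_scan_17"),      # link to primary data
]

def objects(subject, predicate):
    """Trivial triple lookup, standing in for a real RDF query."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(EX + "obs42", EX + "has_quality"))
```

Because the quality term and the anatomy term are elemental rather than pre-coordinated, an algorithm is free to re-combine them - e.g., to ask for all observations of decreased size in any brain region.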
It's important to note many of these efforts - use of BFO, FuGO,
PATO, etc. - are really quite new. PATO itself is so new, its
definition/specification is a bit of a moving target.
Having said this, the general approach outlined above takes into
account many hard-learned lessons accumulated over the last few
decades in the field of biomedical KR. It also appears, from our
vantage within BIRN, to be the best way to go. We are proceeding
with our in-house BIRN semantically-oriented efforts with the
expectation that these standards will be specified as needed in the
coming year. Where the semantic graphs are incomplete in the domains
we require, we are using what appears to be the emerging formalism
and filling out the graph ourselves, expecting these additions can
be incorporated into the community resource as it matures.
As I see it, all of this work can draw on semantic web technology for
many aspects of the implementation, if it can be used to construct
such graphs (which it appears it can).
Cheers,
Bill
On Jun 19, 2006, at 12:33 PM, Xiaoshu Wang wrote:
Alan,
URI http://www.example.com/gene;
You need to dereference the "gene" variable in order to understand
it and do something meaningful with it.
That's one way. You can also publish a paper that describes it, get
a bunch of people to agree to use it the same way, or supply formal
logical definitions, or a subset of them, in OWL.
The semantic web is designed for use by machines, for automated
processing of information. Once it touches the social aspect, it is
beyond RDF's capability, don't you think?
An analogous question is why we need to port controlled vocabularies
into RDF/OWL: in the former, the semantics are encoded as strings of
natural language, whereas in the latter they are encoded in a
machine language.
Answer to (1a): Of course, you can have "variables" that are not
intended to be dereferenced; in JavaScript, the type "undefined" is
similar to a "404".
(Please note, a 404 does not mean that the URI does not exist; it
just implies that, at the current time, it cannot be dereferenced.)
It is not wrong to define an "undefined" variable; it is just not of
much use.
(1b) A URI is just a name that refers to a location on the Web, so
it of course is a name.
It is a name that *sometimes* refers to the web. See my quote from
the RFC.
Yes, of course. There are two basic types of information on the web:
the information resource (IR) and the non-IR. For the former, the
entity's manifestation can be retrieved by dereferencing the URI -
for instance, a web page, a PDF document, an RDF document, etc. For
a non-IR, like me the person, dereferencing the URI would not give
you "me the person". Instead, I should offer a description of myself
at the URI that represents me, via a 303 redirect.
W3C knows nothing about Biology. They are good for defining
standards, but won't help us avoid one person using a gene database
entry identifier to refer to a protein in one place and a swissprot
name to refer to what they mean to be the same protein in another
place. That's what we have to work out.
Of course, W3C won't mandate what should be a URI. But I don't think
there should be a "standard" saying whether a URI that represents a
biological entity should be a database entry or not. You can achieve
this through a clear description of the URI. For instance, if I
declare a URI to represent a protein "foo", you can say:
http://www.example.com/foo a someontology:Protein .
http://www.example.com/foo http://www.example.com/dbentry (some URI
to access a database) .
This is semantically clear, right? Why do we need to design a
guideline to "implicitly" make http://www.example.com/foo represent
certain types of entity? I think one of the important keys to RDF is
its explicitness. If you add a lot of social guidelines to RDF, the
whole point of the SW will be lost.
Xiaoshu
Bill Bug
Senior Analyst/Ontological Engineer
Laboratory for Bioimaging & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA 19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)
Please Note: I now have a new email - [EMAIL PROTECTED]