Hi All,
First, I'd like to recommend two articles I believe are very relevant
to this discussion and may help provide us a clearer sense of how to
proceed here:
1) X. Wang, Robert Gorlitsky, and Jonas S Almeida, From XML to RDF:
how semantic web technologies will change the design of 'omic'
standards (2005) Nat. Biotech., v23, n9, p1099
(Xiaoshu is first author on this - I know this may be bringing
"coals to Newcastle," but if there are some on this list who have not
read this article, I'd strongly recommend they do.)
2) G.V. Gkoutos, E. C. J. Green, S. Greenway, A. Blank, A.-M.
Mallon, J.M. Hancock, CRAVE: A database, middleware, and
visualization system for phenotype ontologies (2005) Bioinformatics,
v21, n7, p1257.
I'll explain below where I see these fitting in to this discussion.
I would maintain that the "social" issue we are addressing here is the
shared, community view of the lower levels of the semantic graph,
which is very much related to what we need to have the machine
algorithms parse.
I would also maintain that the community efforts to produce a "formal
form" (of the semantics) in the relevant domains of biomedical
knowledge are very much separate from research fields focused on
analyzing natural language expressions. A "string of natural
language" is an instance of a lexical "view" of a formal form of
semantics - the formal semantic graph existing (somewhere) in the
author(s)' brain(s) - which may or may not conform to the shared
formal semantic frameworks being developed by the community to cover
specific knowledge domains in biomedicine. RDF is particularly good
at providing a formal way of making explicit the many semantic
relations (explicit and implied) in a phrase of natural language, but
that doesn't mean when we talk about semantic representation in RDF,
we are always talking about representing natural language
expressions. Dealing with natural language is critical when parsing
meaning from existing scientific articles, but here I believe we are
more focused on coming up with a means by which we can specify/
identify the semantic entities related to instances in data
repositories, rather than dealing only with what is extrapolated
from parsing the literature.
It is very important not to confuse the lexicon with an ontology.
The way I like to think of it, the lexicon is to an ontology as a
SQL VIEW is to your core data model. The "view" contains a subset of
the relational content consistent with the more complete abstract
model, but the goal of the view is to serve a particular application
requirement; it therefore makes compromises that can be very
application-specific and that do not necessarily reflect the
underlying model's assertions.
This is not just semantics. ;-)
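To make the analogy concrete, here is a minimal, hypothetical sketch (all table and term names are invented) of how a SQL VIEW exposes an application-facing subset of a core model, just as a lexicon exposes term strings without the ontology's full structure:

```python
import sqlite3

# Hypothetical "core model": concepts plus the lexical terms naming them.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE concept (id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE term (term TEXT, concept_id INTEGER REFERENCES concept(id));
    INSERT INTO concept VALUES (1, 'cerebral cortex');
    INSERT INTO term VALUES ('cortex cerebri', 1);
    INSERT INTO term VALUES ('cerebral cortex', 1);

    -- The "lexicon": a flattened, application-facing projection that keeps
    -- only the term strings and drops the concept-level structure.
    CREATE VIEW lexicon AS SELECT term FROM term;
""")
terms = [row[0] for row in cur.execute("SELECT term FROM lexicon ORDER BY term")]
print(terms)
```

The view answers the application's lookup need, but the core model's assertion that both strings name the same concept is not recoverable from the view alone.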
As several people on this list could easily expound on much better
than I, there is a world of difference between the computational
linguistic fields focused on deducing semantic content from natural
language strings and the formal ontological efforts to derive a
foundational semantic framework for biomedical KR. One would hope
the Knowledge Extraction process performed by the computational
linguists can be made to converge with (or used to re-architect when
necessary) the shared community semantic graph, but the two are
certainly not synonymous.
I would agree with what I believe Alan pointed out - it is a very
complex issue to resolve the difference between the semantics
associated with a particular data instance (e.g., a somatically
recombined sequence in a specific patient that led, through a very
complex biochemical & morphogenetic process, to a specific neoplasm)
and the related, higher-level shared semantic descriptions (e.g., of
the gene in which that mutation took place). I don't know that we can
expect to resolve that issue given our current limited scope. I do
feel it's a critical issue to the overall goal of using semantic web
descriptions of resources (including primary data) to drive new
knowledge discovery.
As I mentioned in the phone call, one of the ways these issues can be
resolved is via the use of semantically-based mediation technology.
As someone else pointed out in the call, it is really untenable to
attempt to warehouse all the data needed for field-wide, higher-order
data repositories beyond the biomolecular. In many ways, even in the
biomolecular domain, we've outgrown warehousing. PDB, SwissProt,
GenBank - a great deal of bioinformatics work that focuses on
content in these warehouses is targeting integration across
repositories and links to other, newer emerging repositories. In
many ways, this is a task semantic web tech is most suited to
support (see the paper by Xiaoshu cited above). To make this work,
mediation technology requires an alignment of participating
repository schemas. This can rarely be done effectively without
referring to some shared, community semantic framework. Of course,
this also requires, as Xiaoshu has pointed out on this thread, a more
explicit statement of the "processing contract" - an extremely thorny
issue you cannot avoid when you are actually trying to do something
with RDF content.
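As a toy illustration of what such schema alignment might look like (all field names and URIs below are invented stand-ins, not BIRN's actual mappings):

```python
# Each repository registers a mapping from its local column names onto
# shared, community-level semantic entities. URIs are hypothetical.
SHARED = "http://example.org/neuro#"

lab_a_mapping = {"subj_dx": SHARED + "Diagnosis", "roi_vol": SHARED + "RegionVolume"}
lab_b_mapping = {"diagnosis": SHARED + "Diagnosis", "volume_mm3": SHARED + "RegionVolume"}

def align(local_field: str, mapping: dict) -> str:
    """Resolve a local schema field to its shared semantic entity."""
    return mapping[local_field]

# Two differently named columns resolve to the same shared entity -
# which is what lets a mediator treat them as comparable.
same_entity = align("subj_dx", lab_a_mapping) == align("diagnosis", lab_b_mapping)
print(same_entity)
```

The "processing contract" issue surfaces immediately even in a sketch this small: nothing here says what a mediator is entitled to do once two fields are declared to denote the same entity.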
Semantic data mediation is definitely required in the neuro domain.
In the BIRN project, we have a data mediator used to
link across the 40+ disparate research lab repositories of primary
and reduced data (Luis Marenco, Gordon, Perry Miller, and Kei are
also developing a mediator framework at Yale, I believe). There is a
BIRN mediator "registration" process each participant lab needs to go
through to link their data to the mediator. The goal is the mediator
would resolve queries made to the BIRN portal into sub-queries across
the participating "registered" databases. Though initially only
minimally tied to a semantic description of the source repositories,
it's now clear that a critical part of this process is for the source
databases to map the entities in their schemas to the appropriate
higher-level semantic entities to which they refer, drawing on a
shared semantic framework for the domain of neurodegenerative disease
being developed by the BIRN Ontology Task Force (I am a member of the
BIRN OTF).
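The registration and query fan-out pattern described above might be sketched roughly like this (the class, source, and query strings are all hypothetical, not the actual BIRN mediator API):

```python
# Hypothetical sketch: sources register a mapping from shared semantic
# entities to local queries; the mediator resolves one portal-level
# request into sub-queries across every registered source.
class Mediator:
    def __init__(self):
        self.sources = {}  # source name -> {shared entity: local query}

    def register(self, name, entity_to_query):
        """A participating lab links its data via a registration step."""
        self.sources[name] = entity_to_query

    def resolve(self, entity):
        """Fan a portal query out to each source able to answer it."""
        return {name: queries[entity]
                for name, queries in self.sources.items()
                if entity in queries}

m = Mediator()
m.register("lab_a", {"Diagnosis": "SELECT subj_dx FROM subjects"})
m.register("lab_b", {"Diagnosis": "SELECT diagnosis FROM patients"})
subqueries = m.resolve("Diagnosis")
print(sorted(subqueries))
```

A real mediator must also merge the sub-query results back into one answer, which is exactly where the shared semantic framework earns its keep.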
In the course of developing the BIRN shared semantic framework, we've
begun to establish a set of "best practices", at least for what we
are doing within BIRN, that appear to be specifically applicable to
the topic under discussion:
1) Re-use existing knowledge resources whenever possible. This
extends from flat term lists (gene names), to integrated lexical
graph indexes such as NeuroNames, on through to formally complete
ontologies such as the Foundational Model of Anatomy (FMA). We are
rarely able to use these "as is", yet it is clear that by making the
effort to examine where we need to adapt these resources, we expend
nearly an order of magnitude fewer resources than if we were to
refashion the resource from first principles ourselves. Often you
can still use a given domain ontology's formalism even when the
ontology itself doesn't provide the granularity you require. Using
the same - or a compatible - formalism at least holds out the
possibility of later integrating what you create into the community
resource.
2) Almost all of the semantic information we expect to expose to the
mediator can be reduced to an elemental view - that of measurements
made in the course of an investigation meant to specify
(quantitatively or qualitatively) phenotypic traits. This is true of
spatially-mapped CNS gene or protein expression data (e.g., Allen
Brain Atlas, GENSAT, Desmond Smith's "voxelized" microarray data
sets), as well as of the assays of behavior and cognition that
pervade the human-focused neuroimaging projects within BIRN. With
this in mind, we came to the understanding that it is important:
a) to use a shared foundational ontology (we are trying to use the
BFO model, which is beginning to be adopted by many biomedical
ontology efforts - e.g., FuGO, FMA) and a community-shared
collection of semantic relations (again, we are converging on the
OBO Relations ontology -
http://obo.sourceforge.net/relationship/ - another article worth
reading)
b) to develop a means of formal phenotypic attribute description
more flexible and capable of evolving than the current approaches in
use by the community - e.g., use of the Mammalian Phenotype Ontology
by the GO folks at the Jackson Labs (http://www.informatics.jax.org/
menus/vocab_menu.shtml). These "pre-coordinated" views of complex
knowledge domains are very useful when you are providing a user
interface for human literature curators (as GO & MGI do with MPO),
but they don't allow algorithms to re-combine the more elemental
semantic aspects represented in these "flattened" views. For both
disease and phenotype in general, there is also a need to tie
descriptions more specifically to the observations extrapolated from
the primary data. This is where the second citation above comes in
(CRAVE - an application using PATO). Using PATO with FuGO (once FuGO
spreads to cover assays, devices, & reagents outside of gene &
protein expression, as it is gradually moving toward), one can build
a semantically well-defined description of phenotype that maintains
the integrity of the semantic links both to the primary data AND to
the shared, higher-level semantic frameworks in the community.
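As a rough, hypothetical illustration of the kind of description this enables (the URIs and term names below are invented stand-ins, not actual PATO/FMA/OBO identifiers), a single measurement instance can point simultaneously at its primary data and at shared ontology terms:

```python
# Toy triple set: a phenotype observation linked both "down" to the
# primary data it derives from and "up" to shared ontology terms.
EX, PATO, FMA, OBO = ("http://example.org/data#", "PATO:", "FMA:", "OBO_REL:")

triples = [
    (EX + "obs42", "rdf:type", EX + "VolumeMeasurement"),
    (EX + "obs42", OBO + "inheres_in", FMA + "Hippocampus"),      # shared anatomy term
    (EX + "obs42", EX + "has_quality", PATO + "decreased_size"),  # shared quality term
    (EX + "obs42", EX + "derived_from", EX + "mri_scan_17"),      # link to primary data
]

def objects(subject, predicate):
    """Trivial triple lookup, standing in for a real RDF query."""
    return [o for s, p, o in triples if s == subject and p == predicate]

print(objects(EX + "obs42", EX + "has_quality"))
```

Because the quality term and the anatomy term are elemental rather than pre-coordinated, an algorithm is free to re-combine them - e.g., to ask for all observations of decreased size in any brain region.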
It's important to note many of these efforts - use of BFO, FuGO,
PATO, etc. - are really quite new. PATO itself is so new, its
definition/specification is a bit of a moving target.
Having said this, the general approach outlined above takes into
account many hard-learned lessons accumulated over the last few
decades in the field of biomedical KR. It also appears, from our
vantage within BIRN, to be the best way to go. We are proceeding
with our in-house BIRN semantically-oriented efforts with the
expectation that these standards will be specified as needed in the
coming year. Where the semantic graphs are incomplete in the domains
we require, we are using what appears to be the emerging formalism
and filling out the graph ourselves, expecting these additions can
be incorporated into the community resource as it matures.
As I see it, all of this work can draw on semantic web technology for
many aspects of the implementation, if it can be used to construct
such graphs (which it appears it can).
Cheers,
Bill
On Jun 19, 2006, at 12:33 PM, Xiaoshu Wang wrote:
Alan,
URI http://www.example.com/gene;
You need to dereference the "gene" variable in order to understand
it and do something meaningful with it.
That's one way. You can also publish a paper that describes it, get
a bunch of people to agree to use it the same way, or supply formal
logical definitions, or a subset of them, in OWL.
The semantic web is designed for use by machines, for automated
processing of information. Once it touches the social aspect, it is
beyond RDF's capability, don't you think?
An analogous question is why we need to port controlled vocabularies
into RDF/OWL: in the former, the semantics are encoded as strings of
natural language, whereas in the latter they are encoded in a
machine language.
Answer to (1a): Of course, you can have "variables" that are not
intended to be dereferenced; in JavaScript, the type "undefined" is
similar to a "404".
(Please note, a 404 does not mean that the URI does not exist; it
just implies that, at the current time, it cannot be dereferenced.)
It is not wrong to define an "undefined" variable; it is just not of
much use.
(1b) A URI is just a name that refers to a location on the Web, so
it of course is a name.
It is a name that *sometimes* refers to the web. See my quote from
the RFC.
Yes, of course. There are two basic types of information on the web:
the information resource (IR) and the non-IR. For the former, the
entity's manifestation can be retrieved by dereferencing the URI -
for instance, a web page, a PDF document, an RDF document, etc. For
a non-IR, like me the person, dereferencing the URI would not give
you "me the person". Instead, I should offer a description of myself
at the URI that represents me, via a 303 redirect.
W3C knows nothing about Biology. They are good for defining
standards, but won't help us avoid one person using a gene database
entry identifier to refer to a protein in one place and a swissprot
name to refer to what they mean to be the same protein in another
place. That's what we have to work out.
Of course, W3C won't mandate what should be a URI. But I don't think
there should be a "standard" saying whether a URI that represents a
biological entity should be a database entry or not. You can achieve
this through a clear description of the URI. For instance, if I
declare a URI to represent a protein "foo", you can say:
http://www.example.com/foo a someontology:Protein .
http://www.example.com/foo http://www.example.com/dbentry (some URI
to access a database) .
This is semantically clear, right? Why do we need to design a
guideline to "implicitly" make http://www.example.com/foo represent
certain types of entity? I think one of the important keys to RDF is
its explicitness. If you add a lot of social guidelines to RDF, the
whole point of the SW will be lost.
Xiaoshu
Bill Bug
Senior Analyst/Ontological Engineer
Laboratory for Bioimaging & Anatomical Informatics
www.neuroterrain.org
Department of Neurobiology & Anatomy
Drexel University College of Medicine
2900 Queen Lane
Philadelphia, PA 19129
215 991 8430 (ph)
610 457 0443 (mobile)
215 843 9367 (fax)
Please Note: I now have a new email - [EMAIL PROTECTED]