I agree with Alan but feel sympathy for Eric as well. In the absence of
a universally accepted ontology for describing biological entities, Eric
has to develop something to start working on SW.
But please note, just because "http://purl.uniprot.org/core/Protein"
contains the string "Protein" does not make it the identifier for
*Protein*, unless everyone else agrees to it. In an open world
environment, which RDF is in, everything makes sense as long as there is
no contradiction. The ambiguity problem will only arise when the term
is to be aligned with other terms, which is not the case yet. The
development of SW will be an evolving process because it is impossible
to get things right at the very first try. I think the best-practice
guideline should encourage people to (1) reuse an existing ontology where
one exists and (2) build their own if none does. Eric's case obviously
falls into the second category. If more users agree on the UniProt
ontology, great: it can gradually evolve into a standard. If not, we
can learn some lessons from it.
That's my two cents,
Xiaoshu
Alan Ruttenberg wrote:
In that case, I would recommend that it is unwise to use Uniprot ids
as identifiers of protein classes on the semantic web. Doing so would
encourage exactly the kind of ambiguity that we need to avoid in order
to write statements that will not confuse semantic web agents
(including people).
I would suggest instead that Uniprot not claim that its ids represent
specific classes of proteins, and instead keep them as exactly what
they are: records containing information about diverse sets of
entities, which we all admit are very useful. If there is interest in
formalization for semantic web use at Uniprot, perhaps the focus can
be instead on the smaller entities on which these records collect
information.
Let others who are more interested in providing formal definitions for
proteins work on making definitions that carve out specific classes.
They can do so in part by pointing at information in the Uniprot
records and other sources.
-Alan
On Jul 17, 2007, at 4:33 AM, Eric Jain wrote:
Alan Ruttenberg wrote:
To clarify, no, I didn't mean this. I meant that the definition of
a Uniprot record is already broad, in the sense that sometimes
multiple splice variants are included in a single record, as are
population and disease-causing variants, according to Eric.
Basically I don't know what set of proteins people currently intend
to denote when they use a uniprot id as a protein, and I'm not
entirely certain what the curators intend. So step one would be an
English description of how to figure out what the curators' intent
is, and we could go on from there to write OWL definitions based on
that. I suspect that people currently using Uniprot ids may be using
them in both broader and narrower ways, but we could leave the
discovery of such cases to a reasoner once we had the basics in place.
People do indeed use UniProtKB identifiers in both broad and narrow
ways: The narrow way is to talk about the exact, main sequence that
is shown...
In any case, I'm not too optimistic about being able to define our
concepts in a strict, yet meaningful way, as often it's practical
criteria that are used to decide, e.g. here's what one of our
curators has to say on this:
"[Usually] we have one entry per gene. We have several entries for a
single gene when description of variations are too complicated to
describe in FT lines (of course, this criteria depends on the
annotator). For viruses, it is much more messy, due to ribosomal
frameshifts."
Formalize that! :-)