On Jul 16, 2007, at 10:29 AM, Eric Jain wrote:
Bijan Parsia wrote:
Eric, I would be very much interested in some more details about
the sort of rules used and how they are used. I personally tend to
distinguish between the use of rules in modeling and the use of
rules for data munging tasks. Obviously, where you draw this
boundary can be a matter of taste and situation, but it seems to
be a useful distinction. It's unclear to me where the rules you
describe fall.
There is some effort coming out of OWLED 2007 to improve the
infrastructure situation (from implementation to documentation)
with regard to rules and OWL, so any information you can give on
use patterns and needs would be very helpful. (Also, C&P has a
summer intern working on rule support in Pellet and having real
uses would be nicely motivating :)).
See http://expasy.org/sprot/hamap/unirules.html.
Example rule: http://expasy.org/unirules/MF_00344.
Implementing these kinds of rules using OWL, SWRL or some other logic
or non-logic based formalism would be a nice project - but I think
this is deviating somewhat from the original point. We seem to have
switched from definitions to rules. It's important to keep these
separate when we are talking OWL, despite the apparent similarities.
We have also switched from talk of defining specific proteins to
rules to automatically annotate protein records.
Alan:
I'm not advocating that we build definitions around protein
sequences, just that we build definitions, period.
And that we don't confuse a page of html with a definition.
The uniprot curators are great! They know what they are looking for
and they are skilled at finding it. Let's put work into formalizing
whatever we can about what they know so that the fruits of their
labor can be used effectively on the SW too!
We've got a SW language for making definitions - it's called OWL.
If we have class names and definitions even for broad classes of
proteins, then we can start to build new definitions by subclassing
them, for instance into specific classes of sequence and post-
translational variants. Lots of work goes on in the scientific
community to characterize specifics about these subclasses and we
need a place to anchor that knowledge in the SW.
Eric followed this with:
One thing I can say here is that there is the trend that curators
create rules (and check the outcome) instead of adding data
themselves directly. Unfortunately OWL is insufficient for the kind
of ugly rules they need to create; maybe SWRL will allow us to
distribute at least part of the rules.
Most of the rule-based annotation is done for microbial proteins at
the moment, simpler as you don't have to deal with alternative
splicing etc. Don't expect any neat rules that define what goes
together anytime soon!
(which led to Bijan's request above)
I think this is correct, but I don't think it quite follows from what
Alan was talking about.
In a sibling node in the same thread DAG, Phil said:
A uniprot record defines a class of proteins extensionally
...
It would be more satisfying for us to know intentionally what we mean by
"protein". It would be good to have a clear set of definitions. But,
ultimately, I think it would be mistaken. If we have the ability to express
"the class of protein molecules defined by the swissprot record OPSD_HUMAN",
then I think we have all we need.
If we make our own definitions, all that we have done is duplicate what the
uniprot team are already doing. And we will, almost inevitably, do it somewhat
differently. All we would do is create confusion. The only way that we ensure
that we do the same thing as uniprot is say "yeah, what they said".
Unsatisfying, maybe. Clear definitions are important. But interoperability,
and the lack of duplication are more so.
If I understand correctly, Alan is making two requests: one is
low-hanging fruit and the other is wildly ambitious.
The LHF first: Alan, being an optimist, would like OWL definitions
of the entities in reality denoted by UniProt/SwissProt entries with
names like OPSD_HUMAN. Phil, Newcastle Brown bottle half-empty,
thinks we can do no better than the circular "the class of protein
molecules defined by the swissprot record OPSD_HUMAN". I am with Alan
and think we can do a little better than this. There *is* an implicit
definition in UniProt entries that can be made explicit using a
logical language such as OWL.
Phil says "A uniprot record defines a class of proteins
extensionally". If we are using intensional/extensional in the set-
theoretic sense, I don't believe this is true. If the implicit
definition in a UniProt record is extensional, then the UniProt entry
for OPSD_HUMAN would list every particular spatiotemporal instance of
this protein - this would be rather a long record.
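To make the set-theoretic distinction concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the Molecule type, its fields and the truncated toy sequence "MNGT" are hypothetical stand-ins, not anything taken from UniProt. An extensional definition is an enumeration of members; an intensional one is a membership condition that also covers members nobody has listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """Hypothetical stand-in for one particular protein molecule."""
    molecule_id: int   # distinguishes individual molecules
    organism: str
    sequence: str      # toy, truncated sequence; not the real record


# Extensional definition: the class IS the enumeration of its members.
EXTENSIONAL = frozenset({Molecule(1, "Homo sapiens", "MNGT")})


# Intensional definition: the class is whatever satisfies the condition.
def is_member(m, organism="Homo sapiens", sequence="MNGT"):
    return m.organism == organism and m.sequence == sequence
```

A second molecule with the same organism and sequence satisfies is_member without ever having been added to EXTENSIONAL, which is the sense in which a record can define a class without listing its instances.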
There is an implicit intensional definition in OPSD_HUMAN: a protein
encoded by nuclear or mitochondrial DNA of a human cell that has a
linear sequence of amino acids commencing with an instance of
Methionine, followed by an instance of N, G, T, ..., E,T,S,Q,V,A,P,
and ending with an instance of Alanine.
Of course, a UniProt record tells us more than this, but for now we
are talking of definitions.
It would be possible to make various objections here: what about post-
translational modifications? Sequence variants? These are easily
accommodated. I'm going to duck other objections pertaining to
transgenic genes and so on since there is nothing here that doesn't
crop up all over the place. You'll notice I am also eliding the
nature of the relation that holds between the residues in the amino
acid sequence - I believe Michel Dumontier's work on formalizing
chemical structures is relevant here.
All this could be made explicit in OWL. I'm neutral as to how useful
it would be to do this for all of UniProt, and as to the
implementation details. But this would certainly seem to satisfy
Alan's requirement for formalizing some aspect of UniProt entries.
And it can be done in OWL-DL with no need for SWRL. I initially
thought this was what Alan was requesting.
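As a rough sketch of what generating such a definition automatically might look like, the Python function below emits a class frame loosely following OWL Manchester Syntax. The names Protein, encodedByGenomeOf and hasSequence are hypothetical, chosen only for illustration, and the sequence is treated as a plain string value, which deliberately sidesteps the question of how the residues relate to one another. The "MNGT" in the usage note is a truncated toy value, not the full sequence.

```python
def owl_class_for_entry(entry_name, organism_class, sequence):
    """Build a Manchester-Syntax-like class frame for one entry.

    All class and property names used here are hypothetical; no real
    UniProt vocabulary is assumed.
    """
    lines = [
        f"Class: {entry_name}",
        "    EquivalentTo:",
        "        Protein",
        f"        and (encodedByGenomeOf some {organism_class})",
        f'        and (hasSequence value "{sequence}")',
    ]
    return "\n".join(lines)
```

Calling owl_class_for_entry("OPSD_HUMAN", "HomoSapiens", "MNGT") returns a five-line frame that uses only an existential restriction and a value restriction, so nothing in it calls for SWRL.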
But in the snippet above, Alan says:
"I'm not advocating that we build definitions around protein
sequences" ... "If we have class names and definitions even for broad
classes of proteins, then we can start to build new definitions by
subclassing them, for instance into specific classes of sequence and
post-translational variants"
I read "broad classes of proteins" as being more inclusive than the
class denoted by OPSD_HUMAN in my interpretation: including, for
example, all human opsin proteins, all vertebrate opsins, ...
This is what I class as wildly ambitious. Besides, this seems outside
the scope of what is typically in a UniProt record.
To summarise: the hypothesis is that any UniProt entry can be
formally defined using OWL-DL in an automated fashion in a way that
is reasonably concordant with the intent of UniProt. There may well
be counter-examples that disprove this.
It doesn't follow from this that UniProt should necessarily serve OWL
of this form in response to any kind of identifier resolution
request. There may well be massive advantages to the current record-
oriented RDF that is returned. I have no strong opinions here, and it
seems both should be able to coexist.