On Jul 16, 2007, at 10:29 AM, Eric Jain wrote:


Bijan Parsia wrote:
Eric, I would be very much interested in some more details about the sort of rules used and how they are used. I personally tend to distinguish between the use of rules in modeling and the use of rules for data munging tasks. Obviously, where you draw this boundary can be a matter of taste and situation, but it seems to be a useful distinction. It's unclear to me where the rules you describe fall. There is some effort coming out of OWLED 2007 to improve the infrastructure situation (from implementation to documentation) with regard to rules and OWL, so any information you can give on use patterns and needs would be very helpful. (Also, C&P has a summer intern working on rule support in Pellet and having real uses would be nicely motivating :)).

See http://expasy.org/sprot/hamap/unirules.html.

Example rule: http://expasy.org/unirules/MF_00344.

Implementing these kinds of rules using OWL, SWRL or some other logic or non-logic based formalism would be a nice project - but I think this is deviating somewhat from the original point. We seem to have switched from definitions to rules. It's important to keep these separate when we are talking OWL, despite the apparent similarities.

We have also switched from talk of defining specific proteins to rules to automatically annotate protein records.

Alan:

I'm not advocating that we build definitions around protein sequences, just that we build definitions, period.
And that we don't confuse a page of html with a definition.

The uniprot curators are great! They know what they are looking for and they are skilled at finding it. Let's put work into formalizing whatever we can about what they know so that the fruits of their labor can be used effectively on the SW too!

We've got a SW language for making definitions - it's called OWL. If we have class names and definitions even for broad classes of proteins, then we can start to build new definitions by subclassing them, for instance into specific classes of sequence and post-translational variants. Lots of work goes on in the scientific community to characterize specifics about these subclasses and we need a place to anchor that knowledge in the SW.

Eric followed this with:

One thing I can say here is that there is the trend that curators create rules (and check the outcome) instead of adding data themselves directly. Unfortunately OWL is insufficient for the kind of ugly rules they need to create; maybe SWRL will allow us to distribute at least part of the rules.

Most of the rule-based annotation is done for microbial proteins at the moment, simpler as you don't have to deal with alternative splicing etc. Don't expect any neat rules that define what goes together anytime soon!

(which led to Bijan's request above)

I think this is correct, but I don't think it quite follows from what Alan was talking about.

In a sibling node in the same thread DAG, Phil said:

A uniprot record defines a class of proteins extensionally
...
It would be more satisfying for us to know intentionally what we mean by
"protein". It would be good to have a clear set of definitions. But,
ultimately, I think it would be mistaken. If we have the ability to express "the class of protein molecules defined by the swissprot record OPSD_HUMAN",
then I think we have all we need.

If we make our own definitions, all that we have done is duplicate what the uniprot team are already doing. And we will, almost inevitably, do it somewhat differently. All we would do is create confusion. The only way that we ensure
that we do the same thing as uniprot is say "yeah, what they said".

Unsatisfying, maybe. Clear definitions are important. But interoperability,
and the lack of duplication are more so.

If I understand correctly, Alan is making two requests: one is low-hanging fruit and the other is wildly ambitious.

The LHF first: Alan, being an optimist, would like OWL definitions of the entities in reality denoted by UniProt/SwissProt entries with names like OPSD_HUMAN. Phil, Newcastle Brown bottle half-empty, thinks we can do no better than the circular "the class of protein molecules defined by the swissprot record OPSD_HUMAN". I am with Alan and think we can do a little better than this. There *is* an implicit definition in UniProt entries that can be made explicit using a logical language such as OWL.

Phil says "A uniprot record defines a class of proteins extensionally". If we are using intensional/extensional in the set-theoretic sense, I don't believe this is true. If the implicit definition in a UniProt record were extensional, then the UniProt entry for OPSD_HUMAN would have to list every particular spatiotemporal instance of this protein - this would be rather a long record.
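The distinction can be sketched in a few lines of Python. This is only an illustration: `Molecule`, the toy sequence "MKA" and the membership condition are all made up for the example and are not taken from the OPSD_HUMAN record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """Hypothetical stand-in for one particular protein molecule."""
    organism: str
    sequence: str  # toy one-letter sequence, NOT a real UniProt sequence


# Extensional definition: the class is given by enumerating its members.
# A record written this way would need one line per molecule.
extensional_class = {
    Molecule("Homo sapiens", "MKA"),
}


# Intensional definition: the class is given by a membership condition.
def is_member(m: Molecule) -> bool:
    return m.organism == "Homo sapiens" and m.sequence == "MKA"


# A second molecule that happens not to be in the enumeration.
unlisted = Molecule("Homo sapiens", "MKA")
other_species = Molecule("Mus musculus", "MKA")

print(is_member(unlisted))       # condition applies to any molecule we meet
print(is_member(other_species))  # and rejects ones that fail it
```

The condition can be checked against any molecule, listed or not, which is the sense in which a record can define a class without enumerating it.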

There is an implicit intensional definition in OPSD_HUMAN: a protein encoded by nuclear or mitochondrial DNA of a human cell that has a linear sequence of amino acids commencing with an instance of Methionine, followed by instances of N, G, T, ..., E, T, S, Q, V, A, P, and ending with an instance of Alanine.

Of course, a UniProt record tells us more than this, but for now we are talking of definitions.

It would be possible to make various objections here: what about post-translational modifications? Sequence variants? These are easily accommodated. I'm going to duck other objections pertaining to transgenic genes and so on, since there is nothing here that doesn't crop up all over the place. You'll notice I am also eliding the nature of the relation that holds between the residues in the amino acid sequence - I believe Michel Dumontier's work on formalizing chemical structures is relevant here.

All this could be made explicit in OWL. I'm neutral as to how useful it would be to do this for all of UniProt, and as to the implementation details. But this would certainly seem to satisfy Alan's requirement for formalizing some aspect of UniProt entries. And it can be done in OWL-DL with no need for SWRL. I initially thought this was what Alan was requesting.

But in the snippet above, Alan says:

"I'm not advocating that we build definitions around protein sequences" ... "If we have class names and definitions even for broad classes of proteins, then we can start to build new definitions by subclassing them, for instance into specific classes of sequence and post-translational variants"

I read "broad classes of proteins" as covering classes more inclusive than the one denoted by OPSD_HUMAN in my interpretation: for example all human opsin proteins, all vertebrate opsins, ...

This is what I class as wildly ambitious. Besides, this seems outside the scope of what is typically in a UniProt record.

To summarise: the hypothesis is that any UniProt entry can be formally defined using OWL-DL in an automated fashion in a way that is reasonably concordant with the intent of UniProt. There may well be counter-examples that disprove this.
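As a rough sketch of what "in an automated fashion" might look like, the following Python turns an entry into a Manchester-style class definition. Everything here is an assumption made for illustration: the `Entry` fields, the class and property names (`Protein`, `encodedIn`, `hasSequence`), the toy entry, and the choice to treat the sequence as a plain string rather than modelling the relation between residues.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical minimal view of a UniProt-like entry."""
    name: str      # entry name, used here as the class name
    organism: str  # hypothetical organism label, e.g. "Human"
    sequence: str  # one-letter amino acid codes


def to_manchester(entry: Entry) -> str:
    """Render an equivalent-class definition as Manchester-style text.

    Property and class names are placeholders, not an existing ontology.
    """
    return "\n".join([
        f"Class: {entry.name}",
        "    EquivalentTo:",
        "        Protein",
        f"        and (encodedIn some {entry.organism}Cell)",
        f'        and (hasSequence value "{entry.sequence}")',
    ])


# Toy entry with a made-up three-residue sequence, not a real record.
toy = Entry(name="TOY_HUMAN", organism="Human", sequence="MKA")
definition = to_manchester(toy)
print(definition)
```

Sequence variants and post-translational modifications would need additional conjuncts or subclasses; the sketch deliberately leaves them out.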

It doesn't follow from this that UniProt should necessarily serve OWL of this form in response to any kind of identifier resolution request. There may well be massive advantages to the current record-oriented RDF that is returned. I have no strong opinions here, and it seems the two should be able to coexist.
