Re: Rules (was Re: Ambiguous names. was: Re: URL +1, LSID -1)

Chris Mungall Mon, 16 Jul 2007 20:28:17 -0700


On Jul 16, 2007, at 10:29 AM, Eric Jain wrote:

Bijan Parsia wrote:
Eric, I would be very much interested in some more details about the sort of rules used and how they are used. I personally tend to distinguish between the use of rules in modeling and the use of rules for data munging tasks. Obviously, where you draw this boundary can be a matter of taste and situation, but it seems to be a useful distinction. It's unclear to me where the rules you describe fall. There is some effort coming out of OWLED 2007 to improve the infrastructure situation (from implementation to documentation) with regard to rules and OWL, so any information you can give on use patterns and needs would be very helpful. (Also, C&P has a summer intern working on rule support in Pellet and having real uses would be nicely motivating :)).
See http://expasy.org/sprot/hamap/unirules.html.

Example rule: http://expasy.org/unirules/MF_00344.

Implementing these kinds of rules using OWL, SWRL or some other logic or non-logic based formalism would be a nice project - but I think this is deviating somewhat from the original point. We seem to have switched from definitions to rules. It's important to keep these separate when we are talking OWL, despite the apparent similarities.

We have also switched from talk of defining specific proteins to rules to automatically annotate protein records.


Alan:

I'm not advocating that we build definitions around protein sequences, just that we build definitions, period.
And that we don't confuse a page of html with a definition.
The uniprot curators are great! They know what they are looking for and they are skilled at finding it. Let's put work into formalizing whatever we can about what they know so that the fruits of their labor can be used effectively on the SW too!
We've got a SW language for making definitions - it's called OWL. If we have class names and definitions even for broad classes of proteins, then we can start to build new definitions by subclassing them, for instance into specific classes of sequence and post-translational variants. Lots of work goes on in the scientific community to characterize specifics about these subclasses and we need a place to anchor that knowledge in the SW.


Eric followed this with:

One thing I can say here is that there is the trend that curators create rules (and check the outcome) instead of adding data themselves directly. Unfortunately OWL is insufficient for the kind of ugly rules they need to create; maybe SWRL will allow us to distribute at least part of the rules.
Most of the rule-based annotation is done for microbial proteins at the moment, simpler as you don't have to deal with alternative splicing etc. Don't expect any neat rules that define what goes together anytime soon!


(which led to Bijan's request above)

I think this is correct, but I don't think it quite follows from what Alan was talking about.


In a sibling node in the same thread DAG, Phil said:

A uniprot record defines a class of proteins extensionally

...

It would be more satisfying for us to know intentionally what we mean by
"protein". It would be good to have a clear set of definitions. But,
ultimately, I think it would be mistaken. If we have the ability to express
"the class of protein molecules defined by the swissprot record OPSD_HUMAN",
then I think we have all we need.
If we make our own definitions, all that we have done is duplicate what the
uniprot team are already doing. And we will, almost inevitably, do it somewhat
differently. All we would do is create confusion. The only way that we ensure
that we do the same thing as uniprot is say "yeah, what they said".
Unsatisfying, maybe. Clear definitions are important. But interoperability,
and the lack of duplication are more so.

If I understand correctly, Alan is making two requests: one is low-hanging fruit and the other is wildly ambitious.

The LHF first: Alan, being an optimist, would like OWL definitions of the entities in reality denoted by UniProt/SwissProt entries with names like OPSD_HUMAN. Phil, Newcastle Brown bottle half-empty, thinks we can do no better than the circular "the class of protein molecules defined by the swissprot record OPSD_HUMAN". I am with Alan and think we can do a little better than this. There *is* an implicit definition in UniProt entries that can be made explicit using a logical language such as OWL.

Phil says "A uniprot record defines a class of proteins extensionally". If we are using intensional/extensional in the set-theoretic sense, I don't believe this is true. If the implicit definition in a UniProt record is extensional, then the UniProt entry for OPSD_HUMAN would list every particular spatiotemporal instance of this protein - this would be rather a long record.
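To make the set-theoretic distinction concrete, here is a minimal Python sketch. Everything in it is a hypothetical illustration: the `Molecule` type, the two membership functions and the toy sequence are assumptions of mine, not anything taken from UniProt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Molecule:
    """A hypothetical particular protein molecule (one spatiotemporal instance)."""
    molecule_id: str
    organism: str
    sequence: str


# Toy sequence for illustration only; this is NOT the real OPSD_HUMAN sequence.
TOY_SEQUENCE = "MNGTEA"

# Extensional definition: the class is fixed by enumerating its members.
extension = frozenset({
    Molecule("m1", "Homo sapiens", TOY_SEQUENCE),
    Molecule("m2", "Homo sapiens", TOY_SEQUENCE),
})


def in_class_extensionally(m: Molecule) -> bool:
    """Membership means being one of the listed individuals."""
    return m in extension


def in_class_intensionally(m: Molecule) -> bool:
    """Membership means satisfying the stated condition."""
    return m.organism == "Homo sapiens" and m.sequence == TOY_SEQUENCE


# A molecule synthesized after the list was written down.
new_molecule = Molecule("m3", "Homo sapiens", TOY_SEQUENCE)
```

Under these assumptions `new_molecule` satisfies the intensional test but is absent from the enumerated set, which is why an extensional record would have to keep growing with every new molecule.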

There is an implicit intensional definition in OPSD_HUMAN: a protein encoded by nuclear or mitochondrial DNA of a human cell that has a linear sequence of amino acids commencing with an instance of Methionine, followed by an instance of N, G, T, ..., E, T, S, Q, V, A, P, and ending with an instance of Alanine.

Of course, a UniProt record tells us more than this, but for now we are talking of definitions.

It would be possible to make various objections here: what about post-translational modifications? Sequence variants? These are easily accommodated. I'm going to duck other objections pertaining to transgenic genes and so on, since there is nothing here that doesn't crop up all over the place. You'll notice I am also eliding the nature of the relation that holds between the residues in the amino acid sequence - I believe Michel Dumontier's work on formalizing chemical structures is relevant here.

All this could be made explicit in OWL. I'm neutral as to how useful it would be to do this for all of UniProt, and as to the implementation details. But this would certainly seem to satisfy Alan's requirement for formalizing some aspect of UniProt entries. And it can be done in OWL-DL with no need for SWRL. I initially thought this was what Alan was requesting.
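As a rough sketch of what such an automated translation might look like, the Python below builds a Manchester-syntax-style class expression from a sequence string. The class names (`Protein`, `HumanDNA`, the residue classes) and the properties (`encoded_by`, `has_first_residue`, `followed_by`) are all hypothetical placeholders; in particular `followed_by` stands in for the residue-to-residue relation I elided above, and a single `HumanDNA` class stands in for "nuclear or mitochondrial DNA of a human cell".

```python
# Hypothetical residue class names, keyed by one-letter amino acid code.
# Only a handful are listed; a real mapping would cover all residues.
RESIDUE_CLASS = {
    "M": "Methionine",
    "N": "Asparagine",
    "G": "Glycine",
    "T": "Threonine",
    "A": "Alanine",
}


def residue_chain(sequence: str) -> str:
    """Nest 'followed_by some ...' restrictions, built from the last residue back."""
    expr = ""
    for letter in reversed(sequence):
        name = RESIDUE_CLASS[letter]
        # The innermost (last) residue is a bare class name.
        expr = name if not expr else f"({name} and (followed_by some {expr}))"
    return expr


def protein_class_expression(sequence: str, dna_class: str = "HumanDNA") -> str:
    """Return a class expression string for proteins with this sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return (
        f"Protein and (encoded_by some {dna_class}) "
        f"and (has_first_residue some {residue_chain(sequence)})"
    )
```

Calling `protein_class_expression("MNA")` nests one existential restriction per residue. Note that this sketch only uses conjunction and existential restrictions, and it does not state that the final residue has no successor; that closure, along with variants and modifications, would need further axioms.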


But in the snippet above, Alan says:

"I'm not advocating that we build definitions around protein sequences" ... "If we have class names and definitions even for broad classes of proteins, then we can start to build new definitions by subclassing them, for instance into specific classes of sequence and post-translational variants"

I read "broad classes of proteins" as being more inclusive than the class denoted by OPSD_HUMAN in my interpretation, and as including, for example, all human opsin proteins, all vertebrate opsins, ...

This is what I class as wildly ambitious. Besides, this seems outside the scope of what is typically in a UniProt record.

To summarise: the hypothesis is that any UniProt entry can be formally defined using OWL-DL in an automated fashion in a way that is reasonably concordant with the intent of UniProt. There may well be counter-examples that disprove this.

It doesn't follow from this that UniProt should necessarily serve OWL of this form in response to any kind of identifier resolution request. There may well be massive advantages to the current record-oriented RDF that is returned. I have no strong opinions here, and it seems both should be able to coexist.
