RE: blog: semantic dissonance in uniprot

Michel_Dumontier Sat, 21 Mar 2009 10:52:30 -0700

Eric and friends,


 I'm very sympathetic to the simplifying assumption of not distinguishing 
between a record and the molecular entity it represents, but there are some 
important considerations. First, we need to be cautious in the transformation 
of recorded facts (as they appear in these database records) to class 
restrictions on biomolecules in logic-based (e.g. OWL) ontologies. Initially, 
we might say that a class of biomolecules shares a particular molecular structure 
(or biopolymer sequence), but assertions of role, function, PTMs, and 
involvement in biological process (among others) are contextual or temporally 
qualified, and as such it may not be appropriate to generalize them to all 
instances. For example, some protein records list all of the _known_ PTMs, which is 
hardly a basis for generalizing that all instances will also have those PTMs at 
those positions at all (or any!) times. This is clearly a major knowledge 
representation challenge, for which we should explore different approaches to 
improve our representation. Class-based representations are necessary as there 
is a need to refer to specific real world instances, whether they be 
collections of molecules in a test tube, electron micrographs that show 
individual macromolecular complexes or atomic force microscopes that manipulate 
them. In the meantime,  we'll probably continue to model database records as 
instances of their corresponding entity.
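
A minimal sketch of the PTM concern above, in plain Python. Every name, PTM type and position here is hypothetical and invented for illustration; nothing is taken from a real record. It contrasts the universal reading of a recorded fact (every instance carries the PTM) with the existential reading (at least one instance does).

```python
# Hypothetical record: the PTM sites a database entry lists as *known*.
record_known_ptms = {("phospho", 10), ("phospho", 42), ("acetyl", 7)}

# Hypothetical observations: PTMs actually present on individual molecules
# (or on the same molecule at different times).
observed_instances = [
    {("phospho", 10)},
    {("phospho", 42), ("acetyl", 7)},
    set(),  # an unmodified molecule
]

def holds_for_all(ptm, instances):
    """Universal reading: every instance carries this PTM."""
    return all(ptm in inst for inst in instances)

def holds_for_some(ptm, instances):
    """Existential reading: at least one instance carries this PTM."""
    return any(ptm in inst for inst in instances)

for ptm in sorted(record_known_ptms):
    print(ptm,
          "all:", holds_for_all(ptm, observed_instances),
          "some:", holds_for_some(ptm, observed_instances))
```

Under these made-up observations each listed PTM is supported existentially but none universally, which is the gap between a record's list of known PTMs and a class restriction that would apply to all instances.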

 

 There is no doubt that it is challenging to devise a consistent naming scheme, 
and nearly every member of the steering group has worked out some way to do 
this (e.g. [1][2]). If the sharednames group wants to recommend a consensus 
approach on the _syntax_ of any given name, with appropriate rationale, then 
it's possible that more people will use it as a guiding principle. However, 
attempts to _control_ the naming process will undoubtedly meet an 
unreceptive audience. Will a registry of names prevent people from making 
similar or identical (literal) names? No. Establishing a self-registry of 
namespaces like bio2rdf [3] or lsrn.org is a more worthy goal. I, like several 
others, am interested to see how the committee will "make sure that its URIs 
... resolve to information that is useful". I expect that establishing this 
utility will be challenging, particularly in the context of a term 
contained in an expressive ontology.

 

 I applaud efforts to publish data in an open and linked manner. But, somewhat 
disconcertingly, I'm (controversially) sure we'll find ourselves in the 
awkward position of having too much meaningless linked data, which 
we'll have to sort into the useful, the less useful, the identical, the useless or, 
worse, the misleading or erroneous. It's not hard to imagine this happening. Applying the 
correct semantics to create meaningful relations is of fundamental importance 
for answering questions about our collective knowledge. Linking concepts or 
data with clearly defined semantic links (e.g. SKOS, RO, OWL) is indeed 
useful, and its utility goes beyond Linked Data. Eric's appeal, that we should 
be careful to (meaningfully) link to third party über-URIs, resonates for the 
same reason that you may want to say something about an entity that other 
people won't necessarily agree with. The truth is that we all have different 
perceptions of reality, and our knowledge about the world is in constant flux. 
We should be able to express our knowledge to our degree of satisfaction. In the 
competitive, distributed environment that is the web, people will choose terms 
and ontologies that best agree with their perception and with their 
requirements. As a nascent scientific community, so early in the game of 
designing accurate, expressive and meaningful ontologies, we should encourage 
new ideas and ensure competition among them.

 

-=Michel=-

 

http://dumontierlab.com

 

 

[1] http://bio2rdf.wiki.sourceforge.net/Banff+Manifesto

[2] http://sw.neurocommons.org/2007/uri-explanation.html 

[3] 

 

 

 

From: public-semweb-lifesci-requ...@w3.org 
[mailto:public-semweb-lifesci-requ...@w3.org] On Behalf Of eric neumann
Sent: Saturday, March 21, 2009 12:01 AM
To: marsh...@science.uva.nl
Cc: W3C HCLSIG hcls
Subject: Re: blog: semantic dissonance in uniprot

 

Scott,

 

Funny, I was just about to send a message on a very similar issue; maybe it's 
what you're referring to, but let me know either way...

 

After talking with many folks in industry over the last several months, it is 
becoming quite clear that when dealing with a molecular reference, such as 
uniprot or entrez-gene, we should also be treating it as a form of "proxy of 
the thing" with something akin to transitivity. Why? Because they are the best 
reference we have to a protein entity (exemplar). No wonder real-world 
scientists refer to these records as "the gene" or "the protein". I for one see 
keeping things from becoming unnecessarily complicated as key to successfully 
advancing the semantic web in LS.

 

Here are some points we should consider regarding this typing issue:

1.      There is no such thing as a referenceable instance of a specific 
instantiated molecule ("that specific molecule"); all gene, protein, and 
chemical records are about the category or group of exemplar molecules: SAME 
molecular structure, NOT SAME atoms (so we already aren't really talking about things in the 
real world ;-) ); all molecular databases are based on this asserted fact.
2.      Most users of molecular information aren't ignorant about the 
difference between a protein and a record of a protein; it's just that they 
don't want to deal with all the extra CS mechanics (that prevent getting their 
job done). And so an instance of a protein record in a database (or a reference 
to it from another database) is the closest thing to saying: "here's the 
protein".
3.      Different records exist for the same protein, which indeed has been a 
historic point of complication; but this is really a social issue, not a 
semantic one, and the key data authorities have already for years coordinated 
on this point by supplying cross-references to each other. Occasionally, when 
we realize a gene was incorrectly identified, the record is merged or 
deprecated, and one group fixes things usually before the other. It would 
appear that it's beneficial not to coerce the different authorities 
pre-emptively to point to any other third party über-gene URI; each should 
correct when it has sufficient evidence, and share that change so that 
references from each quarter can be corrected. This is also sound from a 
progression-of-science perspective; the different agencies through their 
interactions will eventually find the "better truth".
4.      If one creates a new node or URI for "the gene ABL-Human", and links all 
other data records to it,  it is by any definition 'also' a digital record 
(even without a URI); hence if one follows this logic to its formal conclusion, 
we have a system of references about records, that are about records, that are 
about records... and never quite get to the true instance of a gene. Voila! 
we've re-created Russell's Paradox using gene records! 
5.      The body that decides and creates "a higher form of protein record" 
that others must reference, is going to be suspect by all other authorities; if 
it is done by committee, I fear it will add a lot more unnecessary confusion; 
does it get annotated? By whom? How is this regulated by the community's 
experts and authorities? Do we allow open season for all annotators, but keep 
everything sequestered in local SW zones? I think this opens an interesting but 
entangled can of worms...

I believe it's therefore best not to define protein record types separate from 
proteins, at least for general consumption by informaticists. Some day this may 
indeed be easy and useful, but I don't see it being the right thing to invest 
in right now...

 

So what should we do for now? When should we think about proteins and when 
about protein records? Well, doesn't that really depend on whether you are a data 
source curator like SIB or a consumer of molecular information?  Using RDF 
typing, both can be asserted at the same time, as long as we don't build in any 
contradictions. EMBL, SIB and NCBI can treat all such records as special 
"curated record classes", but expose them outwardly as "Gene" or "Protein", or 
"micro RNA".
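
A minimal sketch of the dual-typing idea above, written in plain Python rather than with any RDF library. All prefixes, class names and the record identifier are hypothetical, chosen only for illustration. It shows one resource carrying both a curator-facing type and a consumer-facing type, plus a simple check for the kind of contradiction that would arise if the two classes were ever declared disjoint.

```python
# Hypothetical vocabulary; none of these names come from a real ontology.
RDF_TYPE = "rdf:type"

triples = {
    ("ex:abl_record", RDF_TYPE, "ex:CuratedProteinRecord"),  # curator's view
    ("ex:abl_record", RDF_TYPE, "ex:Protein"),               # consumer's view
}

def types_of(subject, graph):
    """Return every class the subject is asserted to be an instance of."""
    return sorted(o for s, p, o in graph if s == subject and p == RDF_TYPE)

def violates_disjointness(subject, graph, disjoint_pairs):
    """True if the subject is typed with two classes declared disjoint."""
    types = set(types_of(subject, graph))
    return any(a in types and b in types for a, b in disjoint_pairs)

print(types_of("ex:abl_record", triples))

# No disjointness axioms: both types coexist without contradiction.
print(violates_disjointness("ex:abl_record", triples, []))

# If someone declared the two classes disjoint, the same data would clash.
print(violates_disjointness(
    "ex:abl_record", triples,
    [("ex:Protein", "ex:CuratedProteinRecord")],
))
```

The sketch only checks asserted types; a real OWL reasoner would also consider inferred types and other axioms.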

 

For most of us who use such online information, this is something that really 
is not so complicated-- however, when writing new tools to handle new semantic 
complexities, one almost invariably experiences unpredicted side effects... 
it's the software that could become confused. I recommend we keep it simpler 
for now, and don't add semantic features that end-users cannot immediately 
benefit from and that make things more complicated to use.

 

cheers,

Eric

 

On Fri, Mar 20, 2009 at 1:35 PM, M. Scott Marshall <marsh...@science.uva.nl> 
wrote:

FYI:
http://i9606.blogspot.com/2009/02/semantic-dissonance-in-uniprot.html

I thought that the above blog entry would interest some of you (it apparently 
already has interested a few of you that have added comments :) ). The blog is 
from Benjamin Good (from Mark Wilkinson's Lab) and was referenced during a 
napkin discussion I had with Marco Roos and Ben about how one could best refer 
to a protein in text-mined triples. One of the best options seemed to be to use 
a PURL that referred to a record associated with the protein. Sound familiar? 
Those of you who have been with us for more than a year will think so. See 
http://sharednames.org for an attempt to approach the issue.

-Scott
