I'm not advocating that we build definitions around protein
sequences, just that we build definitions, period.
And that we don't confuse a page of HTML with a definition.
The UniProt curators are great! They know what they are looking for
and they are skilled at finding it. Let's put work into formalizing
whatever we can about what they know so that the fruits of their
labor can be used effectively on the SW too!
We've got a SW language for making definitions - it's called OWL. If
we have class names and definitions even for broad classes of
proteins, then we can start to build new definitions by subclassing
them, for instance into specific classes of sequence and
post-translational variants. Lots of work goes on in the scientific
community to characterize specifics about these subclasses and we
need a place to anchor that knowledge in the SW.
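To make the subclassing idea concrete, here is a rough sketch in Python rather than OWL. It only mimics the pattern: a broad class is a membership condition, and each subclass adds a further condition on top of its parent's. All names, sequences, and modification labels below (ExampleProtein, "MKT", "phospho-S7") are made up for illustration and do not come from any real resource.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Protein:
    """A single protein instance: a sequence plus any modifications."""
    sequence: str
    modifications: frozenset = frozenset()


@dataclass
class ProteinClass:
    """A named class defined by a membership condition."""
    name: str
    condition: Callable[[Protein], bool]
    parent: Optional["ProteinClass"] = None

    def has_member(self, protein: Protein) -> bool:
        # A member of a subclass must also be a member of every ancestor.
        if self.parent is not None and not self.parent.has_member(protein):
            return False
        return self.condition(protein)

    def subclass(self, name: str, condition: Callable[[Protein], bool]) -> "ProteinClass":
        # Build a narrower definition anchored on this one.
        return ProteinClass(name, condition, parent=self)


# Hypothetical broad class: anything starting with a given motif.
broad = ProteinClass("ExampleProtein", lambda p: p.sequence.startswith("MKT"))

# Hypothetical sequence variant: valine at the fifth residue.
variant = broad.subclass("ExampleProtein_V5", lambda p: p.sequence[4:5] == "V")

# Hypothetical post-translational variant of that sequence variant.
phospho = variant.subclass(
    "ExampleProtein_V5_phospho",
    lambda p: "phospho-S7" in p.modifications,
)
```

In OWL the conditions would be class expressions that a reasoner can classify, not opaque Python functions, so treat this purely as an illustration of where new knowledge about variants would attach.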
-Alan
On Jul 16, 2007, at 12:06 PM, Eric Jain wrote:
Alan Ruttenberg wrote:
I'm confused. I think we all would agree that there are instances
of proteins and we have a good idea of what they are. We also know
that there are groups of proteins that are built off the same
template and share certain properties. If we define classes using
such properties, then we can, in principle, decide whether these
proteins are members of a given class (subject to experimental
limitations). For instance we can define a class of proteins
that have a certain primary structure (aa sequence), and then,
via assay, measure what fraction of the proteins in some sample
have that structure.
One of the biggest (and perhaps most appreciated) jobs of our
curators is to review all the different sequences that have been
submitted, and figure out which is most likely to be the correct
sequence. But this means that what you get is our curators'
interpretation, which some may disagree with.
Note that you can build a database around sequence identity, but it
seems that this is of limited use (see http://beta.uniprot.org/uniparc/).
In order to make something that's more useful, we
aggregate information about minor (and sometimes less minor)
variations, and separate by organism (but not by strain, so far).
The aggregating, again, makes things a bit fuzzy...
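
The difference between the two groupings can be sketched roughly as follows. This is a toy illustration, not how UniParc or the curated entries are actually built: the records, the organism and gene labels, and the idea that a simple (organism, gene) key decides what gets aggregated are all assumptions. In practice that decision is a curator's judgment.

```python
from collections import defaultdict

# Hypothetical submitted records: (organism, gene label, sequence).
records = [
    ("human", "geneX", "MKTAALS"),
    ("human", "geneX", "MKTAVLS"),  # minor variant of the record above
    ("mouse", "geneX", "MKTAALS"),  # same sequence, different organism
]


def group_by_identity(records):
    """One group per exact sequence, regardless of organism."""
    groups = defaultdict(list)
    for organism, gene, sequence in records:
        groups[sequence].append((organism, gene))
    return dict(groups)


def aggregate_by_organism(records):
    """One group per (organism, gene), collecting all sequence variants."""
    groups = defaultdict(set)
    for organism, gene, sequence in records:
        groups[(organism, gene)].add(sequence)
    return dict(groups)
```

With these toy records, grouping by identity puts the human and mouse records with the same sequence together and splits off the human variant, while aggregating keeps the two human sequences in one entry and the mouse sequence in another. The second grouping is more useful, but its boundaries depend on the aggregation rule, which is where the fuzziness comes in.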