Alan did a great job of coordinating the previously separate ontologies of 
several HCLSIG participants into a coherent infrastructure for the Banff demo. 
As we all agree, we should try to keep the momentum going and keep the pace of 
ontology integration and data conversion that was driven by the deadline for 
the demo. However, as discussed in the previous BioRDF call, I think we should 
take a little time to review the ontological constructs that were created to 
make the demo possible. If we are going to keep extending our ontology 
infrastructure, we want to make sure that we all understand and agree on the 
foundations that the infrastructure is built upon, and that it does not contain 
minor glitches that were overlooked in the heat of demo preparation. 


1) What relations do we use to connect a biological entity with artificial 
entities describing it, e.g. 'protein records', 'sequence records', 'Pubmed 

In the current ontology, we use relations like in the following examples:
* 'Protein_1 has_peptide_sequence_described_by peptide_sequence_record_1'
* 'Protein_2 is_protein_gene_product_of_dna_described_by gene_record_1'
* 'Gene_record_1 describes_gene_or_gene_product_mentioned_by journal_article_1'

We are not only using these properties to link our classes of biological 
entities to certain database entries, but also to define our classes in cases 
where there is no accepted standard ontology (e.g. for proteins). For example, 
we can partly define the class 'insulin_protein' through several 
necessary&sufficient property restrictions that relate the protein class to 
some or all of the currently known protein sequence records describing insulin 
proteins (a very practical approach).
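To make the idea concrete, here is a minimal sketch in plain Python of what such a record-based class definition amounts to. It is only an illustration, not how the demo ontology is implemented: the record identifiers, 'Protein_3' and the function name are made up, and the check is a closed-world approximation, whereas an OWL reasoner works under the open-world assumption.

```python
# Hypothetical identifiers of sequence records known to describe insulin
# proteins; the real list would come from the sequence databases.
INSULIN_SEQUENCE_RECORDS = {
    "peptide_sequence_record_1",
    "peptide_sequence_record_2",
}

# Toy statements as (subject, property, object) triples.
TRIPLES = [
    ("Protein_1", "has_peptide_sequence_described_by", "peptide_sequence_record_1"),
    ("Protein_3", "has_peptide_sequence_described_by", "peptide_sequence_record_9"),
]


def is_insulin_protein(individual, triples=TRIPLES):
    """Closed-world stand-in for a necessary&sufficient restriction:
    the individual counts as insulin_protein if at least one of its
    peptide sequence records is a known insulin record."""
    return any(
        s == individual
        and p == "has_peptide_sequence_described_by"
        and o in INSULIN_SEQUENCE_RECORDS
        for s, p, o in triples
    )
```

In this toy data, 'Protein_1' would be classified as an insulin protein and 'Protein_3' would not.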

However, I think the properties we are using right now might be problematic in 
the long term, because:
*) The properties are somewhat redundant. Since we are using OWL, all of our 
resources are typed, which means that in a relation like 'Protein_1 
has_peptide_sequence_described_by peptide_sequence_record_1', we already know 
that we are dealing with a protein and a peptide sequence record. In most 
cases there is not much that we need to disambiguate: of course, the peptide 
sequence record describes the peptide sequence of the protein, and not its 
shape, colour or smell. The same statement could be made with a generic 
'described_by' relation without significant ambiguity. Certainly we might 
encounter database records where the situation is less clear, but these are in 
the minority. Even in such cases, we could usually still use our generic 
'described_by' relation, but we could NOT use it to define a class in a 
'necessary&sufficient' restriction. Not that bad.

*) When our ontologies are expanded to further fields of biology, the current 
approach leads to a large collection of properties that is hard to manage and 
query. If one simply wanted to query all the database entries describing a 
biological entity, one would need to enumerate the long list of relations in 
the query. Again, this is a good argument for creating a generic 'described_by' 
relation; if not as a replacement for our current properties, then at least as 
a superproperty that acts as an umbrella for all the other properties.

*) Clearly, we want to focus our attention on the description of biological 
reality, and not on the description of the database artefacts that needed to be 
created in the pre-Semantic Web era. With the current solution, we are moving 
some of the biological information into the realm of information entities, 
which counters our intentions. We should try to ground our descriptions in 
biological reality, as far as possible.
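The umbrella-superproperty idea from the second point can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the actual demo infrastructure: the property hierarchy below is hypothetical (it simply assumes that the two current properties would be declared subproperties of a generic 'described_by'), and the function names are made up.

```python
# Hypothetical property hierarchy: each specific property maps to its
# direct superproperty. 'described_by' is the assumed umbrella property.
SUBPROPERTY_OF = {
    "has_peptide_sequence_described_by": "described_by",
    "is_protein_gene_product_of_dna_described_by": "described_by",
}

# Toy statements as (subject, property, object) triples.
TRIPLES = [
    ("Protein_1", "has_peptide_sequence_described_by", "peptide_sequence_record_1"),
    ("Protein_2", "is_protein_gene_product_of_dna_described_by", "gene_record_1"),
    ("Protein_2", "encoded_by", "Gene_1"),
]


def is_subproperty(prop, ancestor):
    """Return True if prop is ancestor or reaches it by following SUBPROPERTY_OF."""
    seen = set()
    while prop is not None and prop not in seen:
        if prop == ancestor:
            return True
        seen.add(prop)
        prop = SUBPROPERTY_OF.get(prop)
    return False


def records_describing(entity, triples=TRIPLES):
    """All objects linked to entity by 'described_by' or any of its subproperties."""
    return sorted(
        o for s, p, o in triples
        if s == entity and is_subproperty(p, "described_by")
    )
```

With the hierarchy in place, a query for everything that describes 'Protein_2' needs to name only 'described_by' instead of enumerating every specific property.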

For example, 'Protein_2 is_protein_gene_product_of_dna_described_by 
gene_record_1' would better be described through two statements like

'Protein_2 encoded_by Gene_1'
'Gene_1 described_by gene_record_1'

This way, we can focus on describing biology, and have better opportunities to 
refine our statements later on (e.g. making statements about the gene itself). 
I know that Alan had some reasons why he did not want to introduce a gene 
class, but this should only serve as a specific example for a general design 
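
As a rough illustration of the two-statement form, the following plain-Python sketch stores 'encoded_by' and 'described_by' as separate triples and recovers the gene records for a protein by following both links. The helper names are hypothetical and the data is just the example from above.

```python
# The two separate statements from the example above.
TRIPLES = {
    ("Protein_2", "encoded_by", "Gene_1"),
    ("Gene_1", "described_by", "gene_record_1"),
}


def objects(subject, predicate, triples=TRIPLES):
    """All objects of statements with the given subject and predicate."""
    return sorted(o for s, p, o in triples if s == subject and p == predicate)


def gene_records_for_protein(protein, triples=TRIPLES):
    """Follow encoded_by, then described_by, to reach the gene records."""
    records = set()
    for gene in objects(protein, "encoded_by", triples):
        records.update(objects(gene, "described_by", triples))
    return sorted(records)
```

The information carried by the composite property is still recoverable, and 'Gene_1' is now a resource of its own that further statements can refer to.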


2) What is evidence?

In our demo, we are using the 'evidence codes ontology' with some small 
additions. The 'evidence codes' are subclasses of 'report', which is a subclass 
of 'textual_thing'. Examples are 'immunulogical_cross_reaction', 
'similar_substrate_specifity', 'inferred from genomic analysis', 'inferred from 
bioassay' etc.
Most of these classes would better be represented as processes, e.g. processes 
defined in an ontology of biological experimental procedures: the experiments 
and procedures 'immunological cross reaction', 'comparison of substrate 
specificities', 'genomic analysis', 'bioassay'.
Of course, evidence for the existence of a certain biological entity can also 
be seen in journal papers, books or similar things. I guess we should keep our 
constructs for the description of evidence relatively loose. However, as in 
the section above, it would again be preferable if we tried to introduce as 
few abstractions and artefacts as possible, and relied on direct 
descriptions of experimental procedures (processes) for evidence statements.


3) How are information resources (e.g. the very abstract 'database entry', or 
the slightly less abstract 'XML document associated with a database entry') 
best represented in BFO-friendly ontologies?

These entities seem to be in conflict with the realism of BFO-friendly 
ontologies, yet we need to represent them somehow. This is probably a 
discussion for the BFO Google Group, but I have not managed to get it started 
so far. Currently, we are classifying several such entities under bfo:Object, 
e.g. protein records, MeSH qualifiers, terms, notes and journal articles. I 
suspect that this might be a problem.


These issues will be discussed in the BioRDF (BioOnt?) teleconference tomorrow.

Matthias Samwald


