I see evidence as a special type of provenance for "facts",
"observations", and "conclusions" in a knowledgebase.
My motivation for evidence is the desire to represent information about
an experiment, such as its hypothesis. If we want to work with
hypotheses, then we need to represent hypothetical information. But how?
A uniform approach would treat all information as propositional or
hypothetical, rather than keeping a separate class, so that a
"hypothesis" could be promoted to a "fact" -- but I digress.. :)
However we represent it, we would like to know how our hypothetical
fact is supported by evidence, such as protocols and methods.
Alan Ruttenberg wrote:
Maybe we can bring this back to the main subject: What problems are we
trying to solve by recording evidence? What are the ways we would know
that we've made a mistake?
(I suspect that there will be a variety of answers to this, and I'm very
curious to hear what people think)
I'll try to answer this:
We want to record evidence in order to evaluate and weigh the quality of
data/information, as well as to steer and/or evaluate any conclusions
that are drawn on the basis of that data. This is especially important
in an environment for computational experiments. My test: if we can
apply our own criteria to evaluate our confidence in a given fact, even
when it is in someone else's knowledgebase, then we have succeeded with
our representation of the evidence. So, an example of how to represent
such a criterion, and how to reason with it about example evidence,
would be nice..
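To make that test concrete, here is a toy sketch of applying one's own
confidence criterion to a fact's evidence record. All the field names,
values, and weights are invented for illustration -- they are not from
any existing evidence vocabulary:

```python
# Toy sketch: score a fact's evidence on a 0-100 scale using our own
# (invented) criteria. The fields and weights are placeholders, not a
# proposed standard.

def confidence(evidence):
    """Return a 0-100 confidence score for a fact, given its evidence record."""
    score = 0
    # Trust manually curated facts more than text-mined ones.
    if evidence.get("method") == "manual_curation":
        score += 60
    elif evidence.get("method") == "text_mining":
        score += 30
    # A recorded protocol or workflow URL adds weight.
    if evidence.get("protocol"):
        score += 20
    # A named responsible person or agent adds weight.
    if evidence.get("agent"):
        score += 20
    return min(score, 100)

fact_evidence = {
    "method": "text_mining",
    "protocol": "http://example.org/workflow.wsdl",  # hypothetical URL
    "agent": None,
}
print(confidence(fact_evidence))  # 50
```

The point is that the scoring function is ours, while the evidence
record could come from anyone's knowledgebase -- if the record carries
enough provenance, we can apply our criteria to it.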
Evidence in Text mining
-----------------------
Suppose that we are trying to distill knowledge provided by a
scientific article into some representation. Example: "Is the article
about proteinX?" If so, "How relevant is proteinX to the article?", and
so forth. If the distillation process is carried out by a person, then
we want to know who. In the case of text mining, we might like to know
what algorithms and techniques, queries, pattern recognizers (Bayesian
or lexical patterns?), threshold values, etc. were used to extract the
knowledge. If a person used a text mining workflow to support the
distillation process, then we would like the URL of the workflow WSDL
(from which we can usually discover the other details) and to know who
the person was.
In general, we would like to know the resources involved in producing a
particular piece of data (or "fact"). We would like to know the actors,
roles, conditions, algorithms, program versions, what rules were fired,
and information resources.
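For instance, such a provenance record might be sketched as follows.
Every field name and value here is my own invention for illustration,
not drawn from any existing schema:

```python
# Sketch of a provenance record for one extracted "fact": the actors,
# roles, conditions, algorithms, versions, rules fired, and information
# resources behind it. All names and values are illustrative.

provenance = {
    "fact": "article-123 mentions proteinX",
    "agent": "A. Curator",                    # who ran the process
    "role": "curator",
    "algorithm": "bayesian-pattern-matcher",  # hypothetical name
    "program_version": "0.9.2",
    "rules_fired": ["mention-detector", "relevance-scorer"],
    "threshold": 0.85,
    "workflow_wsdl": "http://example.org/textmining.wsdl",  # hypothetical
    "conditions": {"corpus": "PubMed abstracts"},
}

# With records like this we can later ask, e.g., which facts were
# produced by a given algorithm:
def produced_by(records, algorithm):
    return [r["fact"] for r in records if r["algorithm"] == algorithm]

print(produced_by([provenance], "bayesian-pattern-matcher"))
```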
An important challenge in the future will be to combine results from
manual and automated processes. Most of us would tend to view "facts"
that result from an automated process as more hypothetical or
questionable than the same facts coming from a human expert. On the
road to automation, however, we should eventually reach the point where
the quality of "text mining"-supported (i.e. not generated!) annotation
is generally higher than that of manual-only annotation.
Evidence in Microarrays
-----------------------
I don't intend to start a debate about the particulars of microarrays,
but I think that evidence comes up in practice here throughout the
entire process of measurement and analysis. Gene expression, as measured
by microarrays, is actually a measurement of changes in mRNA levels at a
particular time, which *indicates* how much change in the process of
expression has occurred under *specific* *conditions*. So, already we
have an example of terminology that is not ontologically accurate when
applied to microarrays - technically, measuring mRNA levels is not
equivalent to measuring the quantity of protein product ("expression").
But the term has been in use for so long that it remains acceptable to
refer to microarray analysis as "expression analysis". :)
In the case of "gene expression", the statistical process of microarray
analysis only provides a probability that a gene is up- or
down-regulated (e.g. in the common reference model). However, there is
a series of decisions and conditions that leads up to the "call" (up,
down, unchanged) for a particular gene, and thus to the resulting set of
differentially expressed genes for the array. The following conditions
can all be relevant to deciding how much weight to give to the
resulting data:
* Experimental design - organism, conditions, disease, phenotype, ..
* Source of cells, enzymes, ..
* Materials handling (thawed? how often?)
* Protocols used such as RNA extraction
* Operator
* Array layout and design - including choice of oligos
* Instrumentation details - array spotter/printer, laser type and
calibration, ..
* Ozone levels (I'm not kidding!)
* Image analysis ("Feature Extraction") software and settings
* Type of normalization
* Criteria for discarding data as "outliers"
* Criteria for classifying gene as differentially expressed (p-value
cutoff, ANOVA, ..)
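To make the last two bullets concrete, here is a toy version of such a
"call" for one gene. The cutoff values are arbitrary placeholders, not
recommendations -- which is exactly why the chosen cutoffs belong in
the recorded evidence:

```python
# Toy "call" for one gene: up, down, or unchanged. The cutoffs below
# are arbitrary placeholders; a real analysis chooses them, and that
# choice is part of the evidence we would want to record.

P_CUTOFF = 0.05     # significance threshold (placeholder)
FOLD_CUTOFF = 1.0   # |log2 fold change| threshold (placeholder)

def call(log2_fold_change, p_value, p_cutoff=P_CUTOFF, fold_cutoff=FOLD_CUTOFF):
    """Classify one gene as 'up', 'down', or 'unchanged'."""
    if p_value >= p_cutoff or abs(log2_fold_change) < fold_cutoff:
        return "unchanged"
    return "up" if log2_fold_change > 0 else "down"

print(call(2.3, 0.01))   # up
print(call(-1.8, 0.04))  # down
print(call(0.4, 0.01))   # unchanged (fold change too small)
```

Change p_cutoff and a different set of genes ends up "differentially
expressed" -- the same measurements, a different conclusion.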
Again, the point that I'm trying to make about microarrays is that
evidence (as well as uncertainty) can be represented and used, even for
the measurements ("observations") themselves. But this is not done in
practice. Even if you wanted to simply "pool" microarray data (most
people don't), it is very difficult to do, because some of the most
important metadata (e.g. experimental design), if available at all, is
often in free-text form.
-scott
p.s. My introduction to HCLS summarizes the way that I look at evidence
a lot more succinctly than the above: ;)
http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Feb/0131.html
--
M. Scott Marshall
http://staff.science.uva.nl/~marshall
http://adaptivedisclosure.org