I see evidence as a special type of provenance for "facts", "observations", and "conclusions" in a knowledgebase.

The motivation for evidence is the desire to represent information about an experiment, such as its hypothesis. If we want to work with hypotheses, then we need to represent hypothetical information. But how? A uniform approach would treat all information as propositional or hypothetical, rather than keeping a separate class from which a "hypothesis" can be promoted to a "fact", but I digress.. :) However we represent it, we would like to know how our hypothetical fact is supported by evidence, such as protocols and methods.

Alan Ruttenberg wrote:
Maybe we can bring this back to the main subject: What problems are we trying to solve by recording evidence? What are the ways we would know that we've made a mistake?

(I suspect that there will be a variety of answers to this, and I'm very curious to hear what people think)

I'll try to answer this:
We want to record evidence in order to evaluate and weigh the quality of data/information, as well as to steer and/or evaluate any conclusions that are drawn on the basis of that data. This is especially important in an environment for computational experiments. My test: if we can apply our own criterion to evaluate our confidence in a given fact, even when it is in someone else's knowledgebase, then we have succeeded with our representation of the evidence. So, an example of how to represent such a criterion and reason with it about example evidence would be nice..
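As a rough illustration of that test, here is a minimal sketch in Python. It is purely my own invention - the names Fact, Evidence, and accept() are not from any agreed HCLS schema - but it shows one way to apply your own acceptance criterion to a fact that carries evidence, wherever that fact came from:

# Minimal, purely illustrative sketch: Fact, Evidence and accept() are
# invented names, not an agreed schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Evidence:
    source: str          # e.g. protocol, method, curator, or mining workflow
    kind: str            # "manual" or "automated"
    confidence: float    # a score supplied or derived elsewhere

@dataclass
class Fact:
    statement: str
    evidence: List[Evidence] = field(default_factory=list)

def accept(fact: Fact, min_confidence: float = 0.8, require_manual: bool = False) -> bool:
    """Apply *our own* criterion to someone else's fact."""
    if not fact.evidence:
        return False
    if require_manual and not any(e.kind == "manual" for e in fact.evidence):
        return False
    return max(e.confidence for e in fact.evidence) >= min_confidence

# A hypothetical fact from another group's knowledgebase
fact = Fact("proteinX is relevant to articleX",
            [Evidence("text-mining workflow v1.2", "automated", 0.7)])
print(accept(fact))                      # False under the default criterion
print(accept(fact, min_confidence=0.6))  # True under a laxer one

The point is only that the criterion lives with the consumer of the fact, not with the knowledgebase that published it.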

Evidence in Text Mining
-----------------------
Suppose that we are trying to distill knowledge provided by a scientific article into some representation. Example: "Is the article about proteinX?". If so, "How relevant is proteinX to the article?" and so forth. If the distillation process is carried out by a person, then who? In the case of text mining, we might like to know what algorithms and techniques, queries, pattern recognizers (Bayesian or lexical patterns?), threshold values, etc. were used to extract knowledge. If a person used a text mining workflow to support the distillation process, then we would like the URL to the workflow WSDL (from which we can usually discover the other details) and to know who the person was.

In general, we would like to know the resources involved in producing a particular piece of data (or "fact"). We would like to know the actors, roles, conditions, algorithms, program versions, what rules were fired, and information resources.
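For concreteness, here is one hypothetical shape such a provenance record could take, written as a Python dictionary. Every field name and value below (including the placeholder WSDL URL) is made up for illustration, not a proposed standard:

# Hypothetical provenance record for one text-mining-derived "fact";
# all field names and values are illustrative only.
provenance = {
    "fact": "proteinX is relevant to articleX",
    "actor": {"name": "a curator", "role": "curator"},
    "process": {
        "workflow_wsdl": "http://example.org/workflows/protein-mining?wsdl",  # placeholder URL
        "algorithm": "Bayesian pattern recognizer",
        "version": "2.1",
        "query": "proteinX AND expression",
        "threshold": 0.85,
        "rules_fired": ["gene-name-normalization", "species-filter"],
    },
    "information_resources": ["MEDLINE", "UniProt"],
}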

An important challenge in the future will be to combine results from manual and automated processes. Most of us would tend to view "facts" that result from an automated process as more hypothetical or questionable than the same facts coming from a human expert. On the road to automation, however, we should eventually reach the point where the quality of "text mining"-supported (i.e. not generated!) annotations is generally higher than that of manual-only annotation.

Evidence in Microarrays
-----------------------
I don't intend to start a debate about the particulars of microarrays, but I think that evidence comes up in practice here throughout the entire process of measurement and analysis. Gene expression, as measured by microarrays, is actually a measurement of changes in mRNA levels at a particular time, which *indicates* how much change in the process of expression has occurred under *specific* *conditions*. So, already we have an example of terminology that is not ontologically accurate when applied to microarrays - technically, measuring mRNA levels is not equivalent to measuring the quantity of protein product ("expression"). But the term has been in use for so long that it remains acceptable to refer to microarray analysis as "expression analysis". :)

In the case of "gene expression", the statistical process of microarray analysis only provides a probability that a gene is up- or down-regulated (e.g. in the common reference model). However, there is a series of decisions and conditions that leads up to the "call" (up, down, unchanged) for a particular gene and thus to the resulting set of differentially expressed genes for the array. The following conditions can all be relevant to deciding how much weight to give to the resulting data (a small sketch of attaching them as evidence follows the list):

* Experimental design - organism, conditions, disease, phenotype, ..
* Source of cells, enzymes, ..
* Materials handling (thawed? how often?)
* Protocols used such as RNA extraction
* Operator
* Array layout and design - including choice of oligos
* Instrumentation details - array spotter/printer, laser type and calibration, ..
* Ozone levels (I'm not kidding!)
* Image analysis ("Feature Extraction") software and settings
* Type of normalization
* Criteria for discarding data as "outliers"
* Criteria for classifying gene as differentially expressed (p-value cutoff, ANOVA, ..)
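As a sketch (again, purely illustrative - the field names and the weighting rule below are my own assumptions, not a community convention), one could attach such conditions as evidence for a single call and then apply one's own weighting criterion to it:

# Illustrative only: attaching some of the conditions above as evidence
# for a single "call", then applying one possible weighting criterion.
call_evidence = {
    "gene": "GENE_X",
    "call": "up",                            # up / down / unchanged
    "p_value": 0.003,
    "normalization": "quantile",
    "outlier_criterion": "spots flagged by image analysis",
    "protocol": "RNA extraction protocol, version unspecified",
    "operator": "lab technician A",
    "array_design": "in-house 60-mer oligo layout",
    "design_metadata_format": "free text",   # the usual situation, unfortunately
}

def weight(ev: dict) -> float:
    """Crude example criterion: trust calls with small p-values more, and
    discount calls whose experimental-design metadata is only free text."""
    w = 1.0 if ev["p_value"] <= 0.01 else 0.5
    if ev["design_metadata_format"] == "free text":
        w *= 0.8
    return w

print(weight(call_evidence))   # 0.8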

Again, the point that I'm trying to make about microarrays is that evidence (as well as uncertainty) can be represented and used, even for the measurements ("observations") themselves. But this is not done in practice. Even if you wanted to simply "pool" microarray data (most people don't), it is very difficult to do, because some of the most important metadata (e.g. the experimental design), when available at all, is often in free text format.

-scott

p.s. My introduction to HCLS summarizes the way that I look at evidence a lot more succinctly than the above: ;)
http://lists.w3.org/Archives/Public/public-semweb-lifesci/2006Feb/0131.html

--
M. Scott Marshall
http://staff.science.uva.nl/~marshall
http://adaptivedisclosure.org



