Hi Chris,
Thanks for pointing out the potential flaws of their method. It sounds
like there is room for improvement in both the accuracy of database
contents and the method of assessing database accuracy. Don't get me
wrong: I think highly of GO. :-)
I'm also thinking more about what "negative knowledge" really means.
Does it mean any or all of the following:
1. inconsistent knowledge
2. inaccurate knowledge
3. incomplete knowledge
4. knowledge with uncertainties
Can SW/ontologies help turn "negative knowledge" into "positive knowledge"?
-Kei
Chris Mungall wrote:
On Jul 4, 2007, at 8:27 PM, Kei Cheung wrote:
As a follow-up example, a study for estimating the error rate of
Gene Ontology (GO) was done:
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1892569#id2674403
The study estimated the GO term annotation error rate for the
GoSeqLite database at 13% to 18% for curated non-ISS annotations, 49%
for ISS annotations, and 28% to 30% for all curated annotations. (ISS
stands for "inferred from sequence similarity".) Despite these
findings, the authors concluded that GO is a comparatively high
quality source of information. Integrating databases with significant
error rates, however, can negatively impact the quality of science.
I have not yet properly digested this paper, but on a cursory reading
there appear to be a few serious flaws. First, a lack of
understanding of basic ontology principles - annotations to less
specific classes in the graph are treated as errors. Second, the
authors appear to make a lot of incorrect assumptions about how ISS
annotations are curated.
It's curious they predict such a high error rate yet don't provide
any examples.
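The first point above, that an annotation to a less specific class is
not in itself an error, can be illustrated with a minimal sketch. The
is_a hierarchy below is a simplified toy graph and the function names
are hypothetical; neither is taken from the paper or from the real GO
graph. The only assumption is that an annotation to a term also implies
annotation to all of that term's ancestors.

```python
# Toy is_a hierarchy (child -> parents). Simplified and hypothetical:
# this is NOT the real GO graph, only an illustration of the principle.
IS_A = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular_function"],
    "molecular_function": [],
}


def ancestors(term):
    """Return every term reachable from `term` via is_a, excluding `term`."""
    seen = set()
    stack = list(IS_A[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(IS_A[current])
    return seen


def compare_annotation(annotated, reference):
    """Compare an annotated term with a reference term for the same gene product."""
    if annotated == reference:
        return "exact"
    if annotated in ancestors(reference):
        # The reference term implies the annotated term, so the
        # annotation is correct, just less informative.
        return "less specific (still consistent)"
    if reference in ancestors(annotated):
        return "more specific (needs its own evidence)"
    return "unrelated"


print(compare_annotation("kinase activity", "protein kinase activity"))
```

Under this assumption, only the "unrelated" and possibly the "more
specific" cases are candidates for review; counting the "less specific"
case as an error would inflate the estimated error rate.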
-Kei
Kei Cheung wrote:
Hi Karen,
Your questions remind me of the following classic article written
by Robert Robbins on "Challenges in the Human Genome Project".
http://www.esp.org/umdnj.pdf
Although it doesn't directly answer the questions, in the
"Nomenclature Problems" section (pp. 20-21), it discusses the
significant problem of inconsistent knowledge representation. It
says that it's a mistake to believe that terminology fluidity is not
an issue in biological database design. It also says that many
biologists don't realize that, in a database built with 5% error in
the definition of individual concepts, a query that joins across 15
concepts has less than a 50% chance of returning an adequate answer.
The section also points out the importance of formal representation
of scientific knowledge in addressing the inconsistency and
nomenclature problems. Semantic Web and standard ontologies provide
a solution to these database problems. We can't simply convert
an existing database syntactically into a Semantic Web format; we
also need to do careful semantic conversion to eliminate as many
errors, ambiguities, and inconsistencies as possible in order to
reduce the costs of knowledge retrieval and discovery.
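The 15-concept figure above can be checked with a short calculation.
This is a sketch under two assumptions made here for illustration: each
concept definition is wrong independently with probability 5%, and the
query returns an adequate answer only if all 15 definitions are
correct. The function name is hypothetical.

```python
def adequate_answer_probability(error_rate, n_concepts):
    """Chance that every joined concept is correctly defined,
    assuming errors in different concepts are independent."""
    return (1.0 - error_rate) ** n_concepts


# 5% error per concept, query joining across 15 concepts.
p = adequate_answer_probability(0.05, 15)
print(f"{p:.3f}")  # prints 0.463
```

Under those assumptions the chance is 0.95^15, or about 46%, which is
consistent with the "less than 50%" statement.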
-Kei
Skinner, Karen (NIH/NIDA) [E] wrote:
Recently I read somewhere (on this list, a blog, a news story,
where...?) an assertion that struck me as an interesting passing
fact at the time. As I recall, it indicated that more websites
are accessed via a search engine than by typing a URL into a
browser web address bar.
Alas, I did not save the reference, and now I am looking for the
proverbial needle in a haystack. Namely, what is the exact
assertion, who asserted it, and where did they make it? If anyone
in the world has this information or knows how to get it, or
has related data, I imagine they would belong to this list. I
would be most grateful for any useful pointer.
In this same vein, if anyone has any statistics, data,
anecdotes or information related to the cost of
(1) "friction" arising from inefficient or inappropriate efforts
at information retrieval
and
(2) the cost of "negative knowledge" about an existing resource or
data,
these, too, would be helpful.
(For example, with respect to #2 above, we are all familiar with
comparison shopping for goods and services. We seek
data/information about prices and quality, but at what point does the
expenditure of that effort exceed the value of the information
learned?)
I am not looking for examples at the level of a philosophy or
economics Ph.D. thesis, but rather a few examples in the sciences
that can be used at the level of an "elevator speech."
Karen Skinner
Deputy Director for Science and Technology Development
Division of Basic Neuroscience and Behavior Research
National Institute on Drug Abuse/NIH