As a follow-up example, a study estimated the annotation error rate of
the Gene Ontology (GO):
http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=1892569#id2674403
The study estimated the GO term annotation error rate for
the GoSeqLite database at 13% to 18% for curated non-ISS
annotations, 49% for ISS annotations, and 28% to 30% for all curated
annotations. (ISS stands for "inferred from sequence similarity".)
Despite these findings, the authors concluded that GO is a comparatively
high-quality source of information. Integrating databases that have
significant error rates, however, can negatively affect the quality of
science.
-Kei
Kei Cheung wrote:
Hi Karen,
Your questions remind me of the following classic article written by
Robert Robbins on "Challenges in the Human Genome Project".
http://www.esp.org/umdnj.pdf
Although it doesn't directly answer the questions, in the
"Nomenclature Problems" section (p. 20-21), it discusses the
significant problem of inconsistent knowledge representation. It says
that it's a mistake to believe that terminology fluidity is not an
issue in biological database design. It also says that many biologists
don't realize that, in a database built with 5% error in the
definition of individual concepts, a query that joins across 15
concepts has less than a 50% chance of returning an adequate answer. The
section also points out the importance of formal representation of
scientific knowledge in addressing the inconsistency and nomenclature
problems. The Semantic Web and standard ontologies provide a solution to
these database problems. We should not simply convert an existing
database syntactically into a Semantic Web format; we also need to
do careful semantic conversion to eliminate as many errors,
ambiguities, and inconsistencies as possible in order to reduce the
costs of knowledge retrieval and discovery.
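
The 15-concept figure can be checked with a little arithmetic. The sketch below assumes that each concept definition is wrong independently with probability 0.05, and that a join is "adequate" only if every concept it touches is correct. Both are simplifying assumptions made here for illustration, not something the article is quoted as stating, and the function names are hypothetical.

```python
def join_success_probability(error_rate: float, n_concepts: int) -> float:
    """Chance that all n_concepts definitions used by a join are correct.

    Assumes (for illustration only) that errors are independent, so the
    per-concept success probabilities simply multiply.
    """
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error_rate must be strictly between 0 and 1")
    return (1.0 - error_rate) ** n_concepts


def concepts_until_below(error_rate: float, threshold: float = 0.5) -> int:
    """Smallest number of joined concepts whose success chance is below threshold."""
    n = 1
    while join_success_probability(error_rate, n) >= threshold:
        n += 1
    return n


if __name__ == "__main__":
    # 5% error per concept, query joining across 15 concepts
    print(f"{join_success_probability(0.05, 15):.3f}")  # prints 0.463
    # First join width at which the chance drops below 50%
    print(concepts_until_below(0.05))                   # prints 14
```

Under these assumptions, 0.95^15 is about 0.46, which is consistent with the "less than 50%" claim, and the chance already falls below one half at 14 concepts.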
-Kei
Skinner, Karen (NIH/NIDA) [E] wrote:
Recently I read somewhere (on this list, a blog, a news story,
where...?) an assertion that struck me as an interesting passing fact
at the time. As I recall, it indicated that more websites are
accessed via a search engine than by typing a URL into a browser web
address bar.
Alas, I did not save the reference, and now I am looking for the
proverbial needle in a haystack. Namely, what is the exact assertion,
who asserted it, and where did they make it? If anyone in the world
has this information or knows how to get it, or has related data,
I imagine they would belong to this list. I would be most grateful
for any useful pointer.
Along this same vein, if anyone has any statistics, data, anecdotes
or information related to the cost of
(1) "friction" arising from inefficient or inappropriate efforts at
information retrieval
and
(2) the cost of "negative knowledge" about an existing resource or data,
these, too, would be helpful.
(For example, with respect to #2 above, we are all familiar with
comparison shopping for goods and services. We seek data/information
about prices and quality, but at what point does the expenditure of
that effort exceed the value of the information learned?)
I am not looking for examples at the level of a philosophy or
economics Ph.D. thesis, but rather a few examples in the sciences that
can be used at the level of an "elevator speech."
Karen Skinner
Deputy Director for Science and Technology Development
Division of Basic Neuroscience and Behavior Research
National Institute on Drug Abuse/NIH