I thought this might be interesting to folks here, and as an input to
the RI note.
I haven't had a chance to read the paper yet, just this email. There
are some good analogies and disanalogies to be found (ok, it's "only"
one site, but Wikipedia behaves, in many ways, like the open web:
decentralized minting and evolution; "authority" however, is
weaker...the content of each page can have arbitrary authors and
while editors have some powers, there are strong rules against their
using those powers to overconstrain content; etc. etc.).
One big difference is that disagreement and variance can easily be
accumulated on the canonical page. (So, even for terms that are *not*
strictly synonyms, redirecting is ok; I worry a little about the
blithe comfort drawn from synonymy, since, e.g., connotation may be
obliterated.) There is a heavy use of disambiguation pages. That
would be an interesting tactic in building big ontologies: some
terms have "disambiguation axioms"...i.e., if there was controversy
or independent evolution, you could pop in some sense and conflict
disambiguation axioms (this really is just the same term; that really
is a different one; for Y's POV on term X see...) With a few simple
tools (SKOSiness; alignment axioms; good change representation...we
had this in swoop in prototype form wherein you could create a
"virtual version" by applying diffs to the canonical source) this
could work well.
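
The "virtual version" idea can be sketched as follows. This is only an illustration under stated assumptions, not how Swoop did it: axioms are modelled as plain string triples, a diff is a pair of added/removed axiom sets, and the `ex:` names, `Diff`, and `apply_diffs` are all hypothetical.

```python
from typing import FrozenSet, Iterable, NamedTuple, Tuple

# Assumption: an axiom is just a (subject, relation, object) triple of strings.
Axiom = Tuple[str, str, str]


class Diff(NamedTuple):
    """One party's change set against the canonical source (hypothetical type)."""
    added: FrozenSet[Axiom]
    removed: FrozenSet[Axiom]


def apply_diffs(canonical: Iterable[Axiom], diffs: Iterable[Diff]) -> FrozenSet[Axiom]:
    """Build a virtual version: start from the canonical axioms, apply diffs in order."""
    version = set(canonical)
    for diff in diffs:
        version -= diff.removed  # drop axioms this point of view rejects
        version |= diff.added    # add its own disambiguation/alignment axioms
    return frozenset(version)


# Hypothetical canonical source with one contested "same term" claim.
canonical = {
    ("ex:Jaguar", "ex:sameTermAs", "ex:JaguarCars"),
    ("ex:Jaguar", "ex:broader", "ex:BigCat"),
}

# Y's point of view: these are really different terms.
y_pov = Diff(
    added=frozenset({("ex:Jaguar", "ex:differentTermFrom", "ex:JaguarCars")}),
    removed=frozenset({("ex:Jaguar", "ex:sameTermAs", "ex:JaguarCars")}),
)

y_version = apply_diffs(canonical, [y_pov])
```

The canonical set is never mutated, so any number of points of view can be layered over the same source.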
(Btw, sameAs is almost always the wrong tool, imho ;) Not that we've
had much better....)
Cheers,
Bijan.
Begin forwarded message:
Resent-From: [EMAIL PROTECTED]
From: "Martin Hepp (UIBK)" <[EMAIL PROTECTED]>
Date: November 11, 2007 9:16:05 AM BST
To: Richard Cyganiak <[EMAIL PROTECTED]>
Cc: Chris Richard <[EMAIL PROTECTED]>, Kingsley Idehen
<[EMAIL PROTECTED]>, Chris Bizer <[EMAIL PROTECTED]>, Semantic
Web <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Subject: Re: dbpedia + disambiguation pages
Reply-To: [EMAIL PROTECTED]
Archived-At: <http://www.w3.org/mid/[EMAIL PROTECTED]>
Hi all:
> Note that redirects are often not synonyms, but artifacts of Wikipedia's
> evolution. Redirects contain things like misspelled names, names that
> adhere to older naming conventions (e.g. the original WikiWords
> CamelCase naming convention), instances where multiple articles were
> folded into one etc. Thus they make poor labels.
A bit late, but here is some related input: for our paper [1], we
did a quantitative analysis of redirects in the English Wikipedia.
Here are the results in a nutshell:
"Redirection Pages
• 78% of the redirection pages are obvious synonyms (in
particular spelling variants or changes in word order of composite
words),
• 12 % reflect pages for which the content was integrated into
other pages,
• for 10%, we could not quickly identify the semantic
relationship (we also did not try very hard ;-)).
With regard to the impact on our analysis, we can observe the
following: First, for the vast majority (78%) of all URI’s that
represent redirects, there is no semantic difference, since they
are synonyms. For 22% (10 + 12 %) of the redirects, semantic
differences between the original URI and the target of the redirect
cannot be excluded. In 12 % of the cases, the redirect points to a
page that incorporates the original content in a larger article."
See http://www.heppnetz.de/harvesting-wikipedia/ for more
information and [1] for the full paper (also available for download
on that page).
Best
Martin
[1] Martin Hepp, Katharina Siorpaes, Daniel Bachlechner: Harvesting
Wiki Consensus: Using Wikipedia Entries as Vocabulary for
Knowledge Management, IEEE Internet Computing, Vol. 11, No. 5, pp.
54-65, Sept-Oct 2007. Available at
http://www.heppnetz.de/harvesting-wikipedia/
----------------------------------
martin hepp, http://www.heppnetz.de
[EMAIL PROTECTED]
Richard Cyganiak wrote:
Chris,
Since your question is quite specific to DBpedia, let's continue
the discussion on the DBpedia mailing list (see
http://dbpedia.org/docs/#support and CC). Please consider removing
[EMAIL PROTECTED] from the CC list for further replies.
On 4 Aug 2007, at 20:18, Chris Richard wrote:
Have you done any thinking about extracting disambiguation
information from disambiguation pages?
No, we currently don't do any special processing for Wikipedia's
disambiguation pages.
The main focus of DBpedia is extraction of information about the
*things* described in Wikipedia articles, to enable domain queries
over this information. Disambiguation information isn't really
about those things, it's about the names we use to refer to those
things. (Specifically, when a single name could refer to more than
one of those things.) So it's more linguistic in nature, and
hasn't registered prominently on our priority list.
I was working on a similar project to extract structured info
from wikipedia.org to be used as the basis for a sem-web project
(until I came across dbpedia.org), and this is one thing I was
targeting that I couldn't find any mention of on dbpedia.org.
I extract all the list items from a particular disambiguation
page and perform some basic processing to try and determine the
disambiguated article/concept. The Apple disambiguation page is a
good example of some of the different styles of information you get:
1. Apple Brook, a British actress
Simple to extract a mapping between the ambiguous "Apple" and
Apple Brook, along with a potentially useful single sentence
abstract.
2. Apple (album), an album by Mother Love Bone
or
Ariane Passenger Payload Experiment, an Indian experimental
communication satellite with a C-Band transponder launched in 1981.
Multiple links, so it's not immediately obvious which one is the
disambiguated concept, but you can imagine heuristics to make
connections here.
I think that a large part of the disambiguation information could
be captured using relatively simple heuristics. There's no need to
capture everything, 80% might be “good enough”.
The DBpedia codebase has pluggable “Extractors” that produce RDF
triples from an article's source code; this would be yet another
extractor.
3. any of the computers made by Apple Inc. since 1976, notably
the Apple Macintosh
Somewhat unclear disambiguation, potentially difficult to extract
the correct relationship.
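
The three styles above suggest a simple first-link heuristic, sketched here. This is an illustration only, not the DBpedia extractor or Chris's actual code. It assumes the list items arrive as raw wikitext lines with `[[...]]` links (the markup in the examples is a guess at what the Apple page contains), and `Candidate`, `parse_item`, and the confidence labels are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
LINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")


class Candidate(NamedTuple):
    term: str              # the ambiguous name, e.g. "Apple"
    target: Optional[str]  # guessed article title, or None if no link at all
    gloss: str             # the list item as plain text (one-sentence abstract)
    confidence: str        # "high", "medium", "low" or "none"


def strip_markup(text: str) -> str:
    """Replace each wiki link with its label (or its target if unlabelled)."""
    return LINK.sub(lambda m: m.group(2) or m.group(1), text)


def parse_item(term: str, line: str) -> Candidate:
    """Guess the disambiguated article for one list item."""
    body = line.lstrip("*# ").strip()
    links = [m.group(1).strip() for m in LINK.finditer(body)]
    gloss = strip_markup(body)
    if not links:
        return Candidate(term, None, gloss, "none")
    if body.startswith("[["):
        # Style 1: a single leading link. Style 2: a leading link plus others.
        confidence = "high" if len(links) == 1 else "medium"
    else:
        # Style 3: links buried in running text, so the first is only a guess.
        confidence = "low"
    return Candidate(term, links[0], gloss, confidence)


simple = parse_item("Apple", "* [[Apple Brook]], a British actress")
multi = parse_item("Apple", "* [[Apple (album)]], an album by [[Mother Love Bone]]")
unclear = parse_item(
    "Apple",
    "* any of the computers made by [[Apple Inc.]] since 1976, "
    "notably the [[Apple Macintosh]]",
)
```

Keeping the low-confidence guesses, instead of discarding them, leaves room for a later pass or a human to sort out the unclear cases.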
I haven't done a lot of thinking about the proper way to
represent these relationships in RDF, I was just writing back to
a custom DB schema for now,
I don't know how to represent this in RDF. DBpedia defines one
resource from each Wikipedia article, assuming that the topic of
each article is some meaningful entity in the real world. This
certainly doesn't hold for disambiguation pages, whose topic is
not a single thing, but a multitude of things that happen to be
related to some name, word, or term.
but I think the information is highly valuable.
Can you give us some examples where you think this information
could be used?
Also, similar to this, but easier to extract, is the synonym
information stored in the redirect links; are you currently
extracting multiple rdfs:label-s based on these redirects?
The next update will include dbpedia:redirectsTo triples for
redirected articles.
Note that redirects are often not synonyms, but artifacts of
Wikipedia's evolution. Redirects contain things like misspelled
names, names that adhere to older naming conventions (e.g. the
original WikiWords CamelCase naming convention), instances where
multiple articles were folded into one etc. Thus they make poor
labels.
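
Keeping redirects as their own relation instead of as labels could look roughly like this. It is a sketch under assumptions, not the DBpedia extraction code: the property URI below is a placeholder (the real URI behind dbpedia:redirectsTo is not given here), the resource namespace and title-to-URI mapping are simplified, and `redirect_triples` is a hypothetical helper.

```python
from typing import Dict, Iterator
from urllib.parse import quote

# Assumption: DBpedia resource URIs are this prefix plus the article title.
RESOURCE_NS = "http://dbpedia.org/resource/"
# Placeholder property URI, NOT the real one; substitute the published URI.
REDIRECTS_TO = "http://example.org/dbpedia#redirectsTo"


def resource_uri(title: str) -> str:
    """Map an article title to a resource URI (simplified: spaces become underscores)."""
    return RESOURCE_NS + quote(title.replace(" ", "_"), safe="_(),.")


def redirect_triples(redirects: Dict[str, str]) -> Iterator[str]:
    """Yield one N-Triples line per redirect; no rdfs:label is emitted."""
    for source, target in sorted(redirects.items()):
        yield f"<{resource_uri(source)}> <{REDIRECTS_TO}> <{resource_uri(target)}> ."


# Hypothetical redirect table: one renamed article and one misspelling.
example = {"Apple Computer": "Apple Inc.", "Aple Inc.": "Apple Inc."}
lines = list(redirect_triples(example))
```

A consumer that wants labels can still derive them from these triples, after filtering out misspellings and merged articles by whatever rule it trusts.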
Cheers,
Richard
If you have a minute let me know your thoughts on this.
Chris