I thought this might be interesting to folks here, and as an input to the RI note.

I haven't had a chance to read the paper yet, just this email. There are some good analogies and disanalogies to be found. OK, it's "only" one site, but Wikipedia behaves, in many ways, like the open web: minting and evolution are decentralized. "Authority", however, is weaker: each page can have arbitrary authors, and while editors have some powers, there are strong rules against their using those powers to overconstrain content; etc.

One big difference is that disagreement and variance can easily be accumulated on the canonical page. (So, even for terms that are *not* strictly synonyms, redirecting is ok; I worry a little about the blithe comfort drawn from synonymy, since, e.g., connotation may be obliterated.) There is heavy use of disambiguation pages. That would be an interesting tactic in building big ontologies: some terms have "disambiguation axioms", i.e., if there was controversy or independent evolution, you could pop in some sense and conflict disambiguation axioms (this really is just the same term; that really is a different one; for Y's POV on term X see...). With a few simple tools (SKOSiness; alignment axioms; good change representation, which we had in Swoop in prototype form, wherein you could create a "virtual version" by applying diffs to the canonical source) this could work well.

(Btw, sameAs is almost always the wrong tool, imho ;) Not that we've had much better....)

Cheers,
Bijan.

Begin forwarded message:

Resent-From: [EMAIL PROTECTED]
From: "Martin Hepp (UIBK)" <[EMAIL PROTECTED]>
Date: November 11, 2007 9:16:05 AM BST
To: Richard Cyganiak <[EMAIL PROTECTED]>
Cc: Chris Richard <[EMAIL PROTECTED]>, Kingsley Idehen <[EMAIL PROTECTED]>, Chris Bizer <[EMAIL PROTECTED]>, Semantic Web <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Subject: Re: dbpedia + disambiguation pages
Reply-To: [EMAIL PROTECTED]
Archived-At: <http://www.w3.org/mid/[EMAIL PROTECTED]>


Hi all:

> Note that redirects are often not synonyms, but artifacts of Wikipedia's
> evolution. Redirects contain things like misspelled names, names that
> adhere to older naming conventions (e.g. the original WikiWords
> CamelCase naming convention), instances where multiple articles were
> folded into one etc. Thus they make poor labels.

A bit late, but some related input: in the course of our paper [1], we did a quantitative analysis of redirects in the English Wikipedia. Here are the results in a nutshell:

"Redirection Pages

• 78% of the redirection pages are obvious synonyms (in particular spelling variants or changes in word order of composite words),
• 12 % reflect pages for which the content was integrated into other pages,
• for 10%, we could not quickly identify the semantic relationship (we also did not try very hard ;-)).

With regard to the impact on our analysis, we can observe the following: First, for the vast majority (78%) of all URI’s that represent redirects, there is no semantic difference, since they are synonyms. For 22% (10 + 12 %) of the redirects, semantic differences between the original URI and the target of the redirect cannot be excluded. In 12 % of the cases, the redirect points to a page that incorporates the original content in a larger article."

See http://www.heppnetz.de/harvesting-wikipedia/ for more information and [1] for the full paper (also available for download on that page).

Best
Martin

[1] Martin Hepp, Katharina Siorpaes, Daniel Bachlechner: Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007. Available at http://www.heppnetz.de/harvesting-wikipedia/
----------------------------------
martin hepp, http://www.heppnetz.de
[EMAIL PROTECTED]



Richard Cyganiak wrote:
Chris,
Since your question is quite specific to DBpedia, let's continue the discussion at the DBpedia mailing list (see http://dbpedia.org/docs/#support and CC). Please consider removing [EMAIL PROTECTED] from the CC list for further replies.
On 4 Aug 2007, at 20:18, Chris Richard wrote:
Have you done any thinking about extracting disambiguation information from disambiguation pages?
No, we currently don't do any special processing for Wikipedia's disambiguation pages. The main focus of DBpedia is extraction of information about the *things* described in Wikipedia articles, to enable domain queries over this information. Disambiguation information isn't really about those things, it's about the names we use to refer to those things. (Specifically, when a single name could refer to more than one of those things.) So it's more linguistic in nature, and hasn't registered prominently on our priority list.
I was working on a similar project to extract structured info from wikipedia.org to be used as the basis for a sem-web project (until I came across dbpedia.org), and this is one thing I was targeting that I couldn't find any mention of on dbpedia.org.

I extract all the list items from a particular disambiguation page and perform some basic processing to try and determine the disambiguated article/concept. The Apple disambiguation page is a good example of some of the different styles of information you get:

1. Apple Brook, a British actress

Simple to extract a mapping between the ambiguous "Apple" and Apple Brook, along with a potentially useful single sentence abstract.

2.
Apple (album), an album by Mother Love Bone

or

Ariane Passenger Payload Experiment, an Indian experimental communication satellite with a C-Band transponder launched in 1981.

Multiple links, so it's not immediately obvious which one is the disambiguated concept, but you can imagine heuristics to make connections here.
I think that a large part of the disambiguation information could be captured using relatively simple heuristics. There's no need to capture everything, 80% might be “good enough”. The DBpedia codebase has pluggable “Extractors” that produce RDF triples from an article's source code; this would be yet another extractor.
3. any of the computers made by Apple Inc. since 1976, notably the Apple Macintosh

Somewhat unclear disambiguation, potentially difficult to extract the correct relationship.
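
The three cases above could be handled by a heuristic along the following lines. This is only a minimal sketch, not what DBpedia or Chris's project actually does. It assumes the list items arrive as raw wikitext of the form "* [[Target]], description", and the names LINK, Sense, and extract_senses are made up for illustration. The rule: a leading link is taken as the disambiguated concept; otherwise a link is accepted only if it is the single one whose title contains the ambiguous term; anything else is left unresolved.

```python
import re
from typing import List, NamedTuple, Optional

# Matches [[Target]], [[Target|label]] and [[Target#section|label]].
LINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")


class Sense(NamedTuple):
    term: str               # the ambiguous name, e.g. "Apple"
    target: Optional[str]   # guessed article title, or None if unclear
    abstract: str           # the list item with link markup removed


def strip_links(text: str) -> str:
    """Replace each wiki link by its label (or its target if unlabeled)."""
    return LINK.sub(lambda m: m.group(2) or m.group(1), text)


def extract_senses(term: str, wikitext: str) -> List[Sense]:
    """Guess one target article per list item of a disambiguation page."""
    senses = []
    for line in wikitext.splitlines():
        if not line.startswith("*"):
            continue  # only list items carry senses in this sketch
        item = line.lstrip("*").strip()
        target = None
        leading = LINK.match(item)
        if leading:
            # Cases 1 and 2: the item opens with a link, take that one.
            target = leading.group(1).strip()
        else:
            # Otherwise accept a link only if exactly one title
            # contains the ambiguous term.
            titles = [m.group(1).strip() for m in LINK.finditer(item)]
            matching = [t for t in titles if term.lower() in t.lower()]
            if len(matching) == 1:
                target = matching[0]
        senses.append(Sense(term, target, strip_links(item)))
    return senses


# Assumed wikitext for the examples quoted in this thread.
SAMPLE = """\
* [[Apple Brook]], a British actress
* [[Apple (album)]], an album by [[Mother Love Bone]]
* any of the computers made by [[Apple Inc.]] since 1976, notably the [[Apple Macintosh]]
"""

if __name__ == "__main__":
    for sense in extract_senses("Apple", SAMPLE):
        print(sense)
```

Leaving the third item unresolved (target is None) rather than guessing fits the "80% might be good enough" view: unclear items can be skipped or queued for review.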

I haven't done a lot of thinking about the proper way to represent these relationships in RDF, I was just writing back to a custom DB schema for now,
I don't know how to represent this in RDF. DBpedia defines one resource from each Wikipedia article, assuming that the topic of each article is some meaningful entity in the real world. This certainly doesn't hold for disambiguation pages, whose topic is not a single thing, but a multitude of things that happen to be related to some name, word, or term.
but I think the information is highly valuable.
Can you give us some examples where you think this information could be used?
Also, similar to this, but easier to extract, is the synonym information stored in the redirect links; are you currently extracting multiple rdfs:label-s based on these redirects?
The next update will include dbpedia:redirectsTo triples for redirected articles. Note that redirects are often not synonyms, but artifacts of Wikipedia's evolution. Redirects contain things like misspelled names, names that adhere to older naming conventions (e.g. the original WikiWords CamelCase naming convention), instances where multiple articles were folded into one etc. Thus they make poor labels.
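
To make the distinction concrete, here is a rough sketch of emitting redirect triples without turning the redirect titles into labels. It is not the DBpedia extractor. The namespace behind the dbpedia: prefix is not given in this thread, so DBPEDIA_NS below is a placeholder; RESOURCE_NS, the title-to-URI encoding, and the sample redirect pair are likewise assumptions for illustration.

```python
from typing import Iterable, Iterator, Tuple
from urllib.parse import quote

# Assumption: base URI for resources derived from article titles.
RESOURCE_NS = "http://dbpedia.org/resource/"
# Assumption: placeholder namespace standing in for the dbpedia: prefix.
DBPEDIA_NS = "http://example.org/dbpedia-vocab/"


def resource_uri(title: str) -> str:
    """Turn an article title into a URI (spaces become underscores)."""
    return RESOURCE_NS + quote(title.strip().replace(" ", "_"), safe="")


def redirect_triples(redirects: Iterable[Tuple[str, str]]) -> Iterator[str]:
    """Yield one N-Triples line per (redirect title, target title) pair.

    Only the redirect relation is recorded; no rdfs:label is derived
    from the redirect title, since redirects are often not synonyms.
    """
    for source, target in redirects:
        yield "<%s> <%sredirectsTo> <%s> ." % (
            resource_uri(source), DBPEDIA_NS, resource_uri(target))


if __name__ == "__main__":
    # Hypothetical redirect pair, for illustration only.
    for line in redirect_triples([("Apple Computer", "Apple Inc.")]):
        print(line)
```

A consumer that wants labels can still follow the redirect relation and decide for itself which redirect titles are real synonyms.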
Cheers,
Richard

If you have a minute let me know your thoughts on this.

Chris





