Fwd: dbpedia + disambiguation pages

Bijan Parsia Mon, 12 Nov 2007 02:58:31 -0800

I thought this might be interesting to folks here, and as an input to the RI note.

I haven't had a chance to read the paper yet, just this email. There are some good analogies and disanalogies to be found (ok, it's "only" one site, but wikipedia behaves, in many ways, like the open web: decentralized minting and evolution; "authority", however, is weaker...the content of each page can have arbitrary authors and while editors have some powers, there are strong rules against their using those powers to overconstrain content; etc. etc.).

One big difference is that disagreement and variance can easily be accumulated on the canonical page. (So, even for terms that are *not* strictly synonyms, redirecting is ok; I worry a little about the blithe comfort drawn from synonymy, since, e.g., connotation may be obliterated.) There is a heavy use of disambiguation pages. That would be an interesting tactic in building big ontologies: some terms have "disambiguation axioms"...i.e., if there was controversy or independent evolution, you could pop in some sense and conflict disambiguation axioms (this really is just the same term; that really is a different one; for Y's POV on term X see...). With a few simple tools (SKOSiness; alignment axioms; good change representation...we had this in swoop in prototype form wherein you could create a "virtual version" by applying diffs to the canonical source) this could work well.

(Btw, sameAs is almost always the wrong tool, imho ;) Not that we've had much better....)


Cheers,
Bijan.

Begin forwarded message:

Resent-From: [EMAIL PROTECTED]
From: "Martin Hepp (UIBK)" <[EMAIL PROTECTED]>
Date: November 11, 2007 9:16:05 AM BST
To: Richard Cyganiak <[EMAIL PROTECTED]>
Cc: Chris Richard <[EMAIL PROTECTED]>, Kingsley Idehen <[EMAIL PROTECTED]>, Chris Bizer <[EMAIL PROTECTED]>, SemanticWeb <[EMAIL PROTECTED]>, [EMAIL PROTECTED]
Subject: Re: dbpedia + disambiguation pages
Reply-To: [EMAIL PROTECTED]
Archived-At: <http://www.w3.org/mid/[EMAIL PROTECTED]>


Hi all:
> Note that redirects are often not synonyms, but artifacts of Wikipedia's
> evolution. Redirects contain things like misspelled names, names that
> adhere to older naming conventions (e.g. the original WikiWords
> CamelCase naming convention), instances where multiple articles were
> folded into one etc. Thus they make poor labels.
A bit late, but some related input: in the course of our paper [1], we did a quantitative analysis of redirects in Wikipedia (English). Here are the results in a nutshell:
"Redirection Pages
• 78% of the redirection pages are obvious synonyms (in particular spelling variants or changes in word order of composite words),
• 12 % reflect pages for which the content was integrated into other pages,
• for 10%, we could not quickly identify the semantic relationship (we also did not try very hard ;-)).
With regard to the impact on our analysis, we can observe the following: First, for the vast majority (78%) of all URI’s that represent redirects, there is no semantic difference, since they are synonyms. For 22% (10 + 12 %) of the redirects, semantic differences between the original URI and the target of the redirect cannot be excluded. In 12 % of the cases, the redirect points to a page that incorporates the original content in a larger article."
See http://www.heppnetz.de/harvesting-wikipedia/ for more information and [1] for the full paper (also available for download on that page).
Best
Martin
[1] Martin Hepp, Katharina Siorpaes, Daniel Bachlechner: Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management, IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007. Available at http://www.heppnetz.de/harvesting-wikipedia/
----------------------------------
martin hepp, http://www.heppnetz.de
[EMAIL PROTECTED]



Richard Cyganiak wrote:
Chris,
Since your question is quite specific to DBpedia, let's continue the discussion at the DBpedia mailing list (see http://dbpedia.org/docs/#support and CC). Please consider removing [EMAIL PROTECTED] from the CC list for further replies.
On 4 Aug 2007, at 20:18, Chris Richard wrote:
Have you done any thinking about extracting disambiguation information from disambiguation pages?
No, we currently don't do any special processing for Wikipedia's disambiguation pages. The main focus of DBpedia is extraction of information about the *things* described in Wikipedia articles, to enable domain queries over this information. Disambiguation information isn't really about those things, it's about the names we use to refer to those things. (Specifically, when a single name could refer to more than one of those things.) So it's more linguistic in nature, and hasn't registered prominently on our priority list.
I was working on a similar project to extract structured info from wikipedia.org to be used as the basis for a sem-web project (until I came across dbpedia.org), and this is one thing I was targeting that I couldn't find any mention of on dbpedia.org.
I extract all the list items from a particular disambiguation page and perform some basic processing to try and determine the disambiguated article/concept. The Apple disambiguation page is a good example of some of the different styles of information you get:
1. Apple Brook, a British actress
Simple to extract a mapping between the ambiguous "Apple" and Apple Brook, along with a potentially useful single sentence abstract.
2. Apple (album), an album by Mother Love Bone

or
Ariane Passenger Payload Experiment, an Indian experimental communication satellite with a C-Band transponder launched in 1981.
Multiple links, so it's not immediately obvious which one is the disambiguated concept, but you can imagine heuristics to make connections here.
I think that a large part of the disambiguation information could be captured using relatively simple heuristics. There's no need to capture everything, 80% might be “good enough”. The DBpedia codebase has pluggable “Extractors” that produce RDF triples from an article's source code; this would be yet another extractor.
3. any of the computers made by Apple Inc. since 1976, notably the Apple Macintosh
Somewhat unclear disambiguation, potentially difficult to extract the correct relationship.
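The three styles of list item above suggest what a first-pass heuristic could look like. The following is only a rough sketch and not the DBpedia extractor or Chris's code: it assumes the list items arrive as raw wikitext with `[[...]]` links (the markup shown in the comments is a guess at how the Apple entries might be written), and the function name and confidence labels are made up for illustration.

```python
import re

# Matches [[Title]], [[Title|label]] and [[Title#Section|label]]; group 1 is the title.
LINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def disambiguated_target(term, item):
    """Guess which linked article a disambiguation list item is about.

    Returns (title, confidence), or (None, "none") when the item has no links.
    The confidence labels are hypothetical, not part of any real schema.
    """
    links = [m.group(1).strip() for m in LINK.finditer(item)]
    if not links:
        return None, "none"

    text = item.lstrip("*# ").strip()
    # Style 1 and 2: the item opens with a link, which is very likely the target.
    if text.startswith("[["):
        return links[0], "high" if len(links) == 1 else "medium"

    # Style 3: links are buried in a description; prefer one that mentions the term.
    for title in links:
        if term.lower() in title.lower():
            return title, "low"
    return links[0], "low"


# Assumed wikitext for the three example styles:
print(disambiguated_target("Apple", "* [[Apple Brook]], a British actress"))
print(disambiguated_target("Apple", "* [[Apple (album)]], an album by [[Mother Love Bone]]"))
print(disambiguated_target(
    "Apple",
    "* any of the computers made by [[Apple Inc.]] since 1976, notably the [[Apple Macintosh]]",
))
```

Items of the third style come back with low confidence, so a caller could drop them and still keep the easy majority of cases.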
I haven't done a lot of thinking about the proper way to represent these relationships in RDF, I was just writing back to a custom DB schema for now,
I don't know how to represent this in RDF. DBpedia defines one resource from each Wikipedia article, assuming that the topic of each article is some meaningful entity in the real world. This certainly doesn't hold for disambiguation pages, whose topic is not a single thing, but a multitude of things that happen to be related to some name, word, or term.
but I think the information is highly valuable.
Can you give us some examples where you think this information could be used?
Also, similar to this, but easier to extract, is the synonym information stored in the redirect links; are you currently extracting multiple rdfs:label-s based on these redirects?
The next update will include dbpedia:redirectsTo triples for redirected articles. Note that redirects are often not synonyms, but artifacts of Wikipedia's evolution. Redirects contain things like misspelled names, names that adhere to older naming conventions (e.g. the original WikiWords CamelCase naming convention), instances where multiple articles were folded into one etc. Thus they make poor labels.
Cheers,
Richard
If you have a minute let me know your thoughts on this.

Chris
