Something I found out recently is that the page links dataset doesn't capture
links that are generated by macros. In particular, almost all of the links to
pages like

http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1

don't show up, because they are generated by the {cite} macro. These can
easily be extracted from the Wikipedia HTML, of course, which is what I did
to pull off this project:

http://blog.databaseanimals.com/the-top-most-cited-books-in-wikipedia
http://blog.databaseanimals.com/true-semantic-advertising

On Tue, Dec 3, 2013 at 4:32 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
> Hi Dario,
>
> the dataset you are using is extracted by the
> org.dbpedia.extraction.mappings.PageLinksExtractor [1].
> This extractor collects internal wiki links [2] from Wikipedia content
> articles (that is, Wikipedia pages which belong to the Main namespace [3])
> to other Wikipedia pages (please note I am not talking about content
> articles here, because links to pages in the File or Category
> namespaces are also collected).
>
> Each row - triple <subject> <predicate> <object> - in the Pagelinks
> dataset represents a directed link between two pages, e.g.
>
> <http://dbpedia.org/resource/Albedo>
> <http://dbpedia.org/ontology/wikiPageWikiLink>
> <http://dbpedia.org/resource/Latin> .
>
> means that an internal link to http://en.wikipedia.org/wiki/Latin was found
> in http://en.wikipedia.org/wiki/Albedo.
>
> You can check that this link exists here (first sentence) [6]
>
> Basically this can be modeled in a directed graph as an edge "Albedo ->
> Latin"
>
>
> The reason why you have 17M instances (I suppose you are counting the nodes
> in your graph) is that objects in each triple can be outside the Main
> namespace.
> As far as I remember, 4M articles are wiki pages which belong to the Main
> namespace and which are neither redirects [4] nor disambiguation pages [5].
>
> Hope this clarifies a bit :-)
>
> Cheers
> Andrea
>
> [1] https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/PageLinksExtractor.scala
> [2] https://en.wikipedia.org/wiki/Help:Link
> [3] https://en.wikipedia.org/wiki/Wikipedia:Main_namespace
> [4] https://en.wikipedia.org/wiki/Wikipedia:Redirect
> [5] https://en.wikipedia.org/wiki/Wikipedia:Disambiguation
> [6] https://en.wikipedia.org/wiki/Albedo
>
>
> 2013/12/2 Dario Garcia Gasulla <dar...@lsi.upc.edu>
>>
>> Hi,
>>
>> I'm Dario Garcia-Gasulla, an AI researcher at Barcelona Tech (UPC).
>>
>> I'm currently doing research on very large directed graphs and I am using
>> one of your datasets for testing. Concretely, I am using the "Wikipedia
>> Pagelinks" dataset as available on the DBpedia web site.
>>
>> Unfortunately the description of the dataset is not very detailed:
>>
>> Wikipedia Pagelinks
>>
>> Dataset containing internal links between DBpedia instances. The dataset
>> was created from the internal links between Wikipedia articles. The dataset
>> might be useful for structural analysis, data mining or for ranking DBpedia
>> instances using Page Rank or similar algorithms.
>>
>> I wonder if you could give me more information on how the dataset was
>> built and what composes it.
>> I understand Wikipedia has 4M articles and 31M pages, while this dataset
>> has 17M instances and 130M links (couldn't find the number of links of
>> Wikipedia).
>>
>> What's the relation between both? Could someone briefly explain the nature
>> of the Pagelinks dataset and the differences with the Wikipedia?
>>
>> Thank you for your time,
>> Dario.
>>
>>
>> ------------------------------------------------------------------------------
>> Rapidly troubleshoot problems before they affect your business. Most IT
>> organizations don't have a clear picture of how application performance
>> affects their revenue. With AppDynamics, you get 100% visibility into your
>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics
>> Pro!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> Dbpedia-discussion@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>

--
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254
paul.houle on Skype
ontol...@gmail.com
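
For anyone who wants to check the node and edge counts discussed in this thread, here is a minimal Python sketch of the model Andrea describes: each wikiPageWikiLink triple becomes one directed edge, and a node is any page that appears as a subject or an object. The function names (parse_edges, graph_stats) and the second sample triple (the File:Example.png target) are made up for illustration; only the Albedo -> Latin triple comes from the thread. The sketch assumes one triple per line with all three terms written as URIs in angle brackets, which may not hold for every line of the real dump.

```python
import re

# One N-Triples row of the form: <subject> <predicate> <object> .
TRIPLE_RE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')

WIKI_PAGE_LINK = "http://dbpedia.org/ontology/wikiPageWikiLink"
RESOURCE_PREFIX = "http://dbpedia.org/resource/"

# First row is the example from the thread; the second is hypothetical and
# stands in for a link whose target is outside the Main namespace.
SAMPLE = [
    "<http://dbpedia.org/resource/Albedo> "
    "<http://dbpedia.org/ontology/wikiPageWikiLink> "
    "<http://dbpedia.org/resource/Latin> .",
    "<http://dbpedia.org/resource/Albedo> "
    "<http://dbpedia.org/ontology/wikiPageWikiLink> "
    "<http://dbpedia.org/resource/File:Example.png> .",
]


def local_name(uri):
    """Strip the DBpedia resource prefix, if present."""
    if uri.startswith(RESOURCE_PREFIX):
        return uri[len(RESOURCE_PREFIX):]
    return uri


def parse_edges(lines):
    """Yield (source, target) pairs for every wikiPageWikiLink triple."""
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if match is None:
            continue  # comments, blank lines, rows this sketch cannot parse
        subject, predicate, obj = match.groups()
        if predicate == WIKI_PAGE_LINK:
            yield local_name(subject), local_name(obj)


def graph_stats(edges):
    """Count distinct nodes, distinct edges, and nodes with outgoing links."""
    edge_set = set(edges)
    sources = {source for source, _ in edge_set}
    targets = {target for _, target in edge_set}
    return {
        "nodes": len(sources | targets),
        "edges": len(edge_set),
        "sources": len(sources),
    }


if __name__ == "__main__":
    print(graph_stats(parse_edges(SAMPLE)))
```

On the two sample rows, only Albedo has outgoing links, yet the graph has three nodes, because link targets count as nodes even when they never appear as a subject. That is the same effect Andrea gives as the reason the dataset has more instances than Wikipedia has articles. For the full dump, passing an open file object to parse_edges should work the same way, since it only iterates over lines.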