Something I found out recently is that the page links dataset doesn't
capture links that are generated by macros. In particular, almost all of
the links to pages like

http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1

don't show up because they are generated by the {cite} macro. These
can be easily extracted from the Wikipedia HTML, of course, which is
what I did to pull off this project:

http://blog.databaseanimals.com/the-top-most-cited-books-in-wikipedia
http://blog.databaseanimals.com/true-semantic-advertising
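
For illustration, here is a minimal sketch of that kind of extraction. It is
not the code used for the project above; it only assumes that book citations
appear in the rendered HTML as anchors whose href contains
"Special:BookSources/" followed by the ISBN, as in the URL shown earlier. The
names BookSourceParser and extract_isbns are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# Assumption: cited books show up as links to Special:BookSources/<ISBN>.
BOOK_SOURCES = "Special:BookSources/"


class BookSourceParser(HTMLParser):
    """Collects ISBNs from <a href=".../Special:BookSources/..."> links."""

    def __init__(self):
        super().__init__()
        self.isbns = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        marker = href.find(BOOK_SOURCES)
        if marker == -1:
            return
        # Take the part after the marker and drop any query string or fragment.
        raw = unquote(href[marker + len(BOOK_SOURCES):])
        raw = raw.split("?")[0].split("#")[0]
        # Normalize: keep digits and the ISBN-10 check character X only.
        isbn = "".join(ch for ch in raw if ch.isdigit() or ch in "xX").upper()
        if isbn:
            self.isbns.append(isbn)


def extract_isbns(html):
    """Return the normalized ISBNs linked from one page of rendered HTML."""
    parser = BookSourceParser()
    parser.feed(html)
    parser.close()
    return parser.isbns


sample = (
    '<p><a href="/wiki/Special:BookSources/978-0-936389-27-1">'
    'ISBN 978-0-936389-27-1</a> and <a href="/wiki/Latin">Latin</a></p>'
)
print(extract_isbns(sample))  # prints ['9780936389271']
```

Counting the ISBNs returned across many pages would give a rough "most cited
books" ranking; the ordinary wiki link to Latin in the sample is ignored.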

On Tue, Dec 3, 2013 at 4:32 AM, Andrea Di Menna <ninn...@gmail.com> wrote:
> Hi Dario,
>
> the dataset you are using is extracted by the
> org.dbpedia.extraction.mappings.PageLinksExtractor [1].
> This extractor collects internal wiki links [2] from Wikipedia content
> articles (that is, wikipedia pages which belong to the Main namespace [3])
> to other wikipedia pages (please note I am not talking about content
> articles here, because also links to pages in the File or Category
> namespaces are collected).
>
> Each row - a triple <subject> <predicate> <object> - in the Pagelinks
> dataset represents a directed link between two pages, e.g.
>
> <http://dbpedia.org/resource/Albedo>
> <http://dbpedia.org/ontology/wikiPageWikiLink>
> <http://dbpedia.org/resource/Latin> .
>
> means that an internal link to http://en.wikipedia.org/wiki/Latin was found
> in http://en.wikipedia.org/wiki/Albedo.
>
> You can check this link exists here (first sentence) [6]
>
> Basically this can be modeled in a directed graph as an edge "Albedo ->
> Latin"
>
>
> The reason why you have 17M instances (I suppose you are counting the nodes
> in your graph) is because objects in each triple can be outside the Main
> namespace.
> As far as I remember, 4M articles are wiki pages which belong to the Main
> namespace and which are neither redirects [4] nor disambiguation pages [5].
>
> Hope this clarifies a bit :-)
>
> Cheers
> Andrea
>
> [1]
> https://github.com/dbpedia/extraction-framework/blob/master/core/src/main/scala/org/dbpedia/extraction/mappings/PageLinksExtractor.scala
> [2] https://en.wikipedia.org/wiki/Help:Link
> [3] https://en.wikipedia.org/wiki/Wikipedia:Main_namespace
> [4] https://en.wikipedia.org/wiki/Wikipedia:Redirect
> [5] https://en.wikipedia.org/wiki/Wikipedia:Disambiguation
> [6] https://en.wikipedia.org/wiki/Albedo
>
>
>
>
> 2013/12/2 Dario Garcia Gasulla <dar...@lsi.upc.edu>
>>
>> Hi,
>>
>> I'm Dario Garcia-Gasulla, an AI researcher at Barcelona Tech (UPC).
>>
>> I'm currently doing research on very large directed graphs and I am using
>> one of your datasets for testing. Concretely, I am using the "Wikipedia
>> Pagelinks" dataset as available in the DBpedia web site.
>>
>> Unfortunately the description of the dataset is not very detailed:
>>
>> Wikipedia Pagelinks
>>
>> Dataset containing internal links between DBpedia instances. The dataset
>> was created from the internal links between Wikipedia articles. The dataset
>> might be useful for structural analysis, data mining or for ranking DBpedia
>> instances using Page Rank or similar algorithms.
>>
>> I wonder if you could give me more information on how the dataset was
>> built and what composes it.
>> I understand Wikipedia has 4M articles and 31M pages, while this dataset
>> has 17M instances and 130M links (couldn't find the number of links of
>> Wikipedia).
>>
>> What's the relation between both? Could someone briefly explain the nature
>> of the Pagelinks dataset and the differences with the Wikipedia?
>>
>> Thank you for your time,
>> Dario.
>>
>>
>> ------------------------------------------------------------------------------
>> Rapidly troubleshoot problems before they affect your business. Most IT
>> organizations don't have a clear picture of how application performance
>> affects their revenue. With AppDynamics, you get 100% visibility into your
>> Java,.NET, & PHP application. Start your 15-day FREE TRIAL of AppDynamics
>> Pro!
>>
>> http://pubads.g.doubleclick.net/gampad/clk?id=84349351&iu=/4140/ostg.clktrk
>> _______________________________________________
>> Dbpedia-discussion mailing list
>> Dbpedia-discussion@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
>>
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   ontol...@gmail.com
