Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Paul Houle
The "DBpedia Way" of extracting the citations probably would be to build something that treats the citations the way infoboxes are treated. It's one way of doing things, and it has its own integrity, but it's not the way I do things. (DBpedia does it this way about as well as it can be done,

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Andrea Di Menna
@Paul, unfortunately HTML Wikipedia dumps are not released anymore (they are old static dumps as you said). This is a problem for a project like DBpedia, as you can easily understand. Moreover, I did not mean that it is not possible to crawl Wikipedia instances or load a dump into a private Mediawi

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Paul Houle
@Andrea, there are old static dumps available, but I can say that running the web crawler is not at all difficult. I got a list of topics by looking at the ?s for DBpedia descriptions and then wrote a very simple single-threaded crawler that took a few days to run on a micro instance in

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Andrea Di Menna
2013/12/4 Paul Houle > I think I could get this data out of some API, but there are great > HTML 5 parsing libraries now, so a link extractor from HTML can be > built as quickly as an API client. > > There are two big advantages of looking at links in HTML: (i) you can > use the same softwar

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-04 Thread Paul Houle
I think I could get this data out of some API, but there are great HTML 5 parsing libraries now, so a link extractor from HTML can be built as quickly as an API client. There are two big advantages of looking at links in HTML: (i) you can use the same software to analyze multiple sites, and
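
A minimal sketch of the kind of HTML link extractor described above. It assumes Python's standard-library html.parser rather than a dedicated HTML 5 parsing library, and the names LinkExtractor and extract_links are hypothetical, chosen only for illustration. Nothing in it is specific to Wikipedia, so the same code could be pointed at pages from other sites.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL.

    Hypothetical helper: stdlib html.parser stands in for a full
    HTML 5 parsing library here.
    """

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without a usable href attribute.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all link targets found in an HTML string, as absolute URLs."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Because this works on rendered HTML, links produced by templates appear in the output just like hand-written ones.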

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-03 Thread Andrea Di Menna
I guess Paul wanted to know which book is cited by one Wikipedia page (e.g. page A cites book x). If I am not wrong, by asking for template transclusions you only get the first part of the triple (page A). Paul, your use case is interesting. At the moment we are not dealing with the {{cite}} template n

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-03 Thread Tom Morris
On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle wrote: > Something I found out recently is that the page links don't capture > links that are generated by macros, in particular almost all of the > links to pages like > > http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1 > > don't sho

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-03 Thread Paul Houle
Something I found out recently is that the page links don't capture links that are generated by macros, in particular almost all of the links to pages like http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1 don't show up because they are generated by the {{cite}} macro. These can
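
As an illustration of what such a link carries, here is a small sketch that pulls the ISBN out of a Special:BookSources URL like the one above. The function name isbn_from_booksources and the prefix constant are hypothetical, and the sketch assumes the URL layout shown in the example (the ISBN is the last path component).

```python
from urllib.parse import unquote, urlparse

# Assumed path layout, taken from the example URL above.
BOOKSOURCES_PREFIX = "/wiki/Special:BookSources/"


def isbn_from_booksources(url):
    """Return the ISBN embedded in a Special:BookSources URL, or None."""
    path = unquote(urlparse(url).path)
    if not path.startswith(BOOKSOURCES_PREFIX):
        return None
    isbn = path[len(BOOKSOURCES_PREFIX):].strip()
    # An empty remainder means the URL named no ISBN at all.
    return isbn or None
```

Applied to every outgoing link of a page, this would give (page, ISBN) pairs, assuming the links are read from somewhere the template output is visible.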

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-03 Thread Dimitris Kontokostas
In addition to Andrea's reply, we also collect the "red links", that is, links to pages that do not exist (yet). On Tue, Dec 3, 2013 at 11:32 AM, Andrea Di Menna wrote: > Hi Dario, > > the dataset you are using is extracted by > the org.dbpedia.extraction.mappings.PageLinksExtractor [1]. > Thi

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-03 Thread Andrea Di Menna
Hi Dario, the dataset you are using is extracted by the org.dbpedia.extraction.mappings.PageLinksExtractor [1]. This extractor collects internal wiki links [2] from Wikipedia content articles (that is, Wikipedia pages which belong to the Main namespace [3]) to other Wikipedia pages (please note I
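
To make "internal wiki links" concrete, here is a rough sketch that finds [[Target]] and [[Target|label]] links in raw wikitext with a regular expression. This is an assumption-laden illustration, not how PageLinksExtractor is implemented; the names WIKILINK and internal_links are hypothetical, and the sketch does no namespace filtering, so targets such as File: or Category: pages would be returned too.

```python
import re

# Matches [[Target]], [[Target#Section]] and [[Target|label]].
# Simplified on purpose: nested links and templates are not handled.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def internal_links(wikitext):
    """Return the link targets written explicitly in a wikitext string."""
    targets = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        if target:
            targets.append(target)
    return targets
```

Note that a template call such as {{cite book|isbn=...}} contains no [[...]] markup, so any link the template renders is invisible to an approach that reads only the explicit links in the source text.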

[Dbpedia-discussion] Pagelinks dataset

2013-12-02 Thread Dario Garcia Gasulla
Hi, I'm Dario Garcia-Gasulla, an AI researcher at Barcelona Tech (UPC). I'm currently doing research on very large directed graphs and I am using one of your datasets for testing. Concretely, I am using the "Wikipedia Pagelinks" dataset as available on the DBpedia web site. Unfortunately the desc