The "DBpedia Way" of extracting the citations probably would be to
build something that treats the citations the way infoboxes are
treated.
It's one way of doing things, and it has its own integrity, but
it's not the way I do things. (DBpedia does it this way about as well
as it can be done,
@Paul,
unfortunately HTML wikipedia dumps are not released anymore (they are old
static dumps as you said).
This is a problem for a project like DBpedia, as you can easily understand.
Moreover, I did not mean that it is not possible to crawl Wikipedia
instances or load a dump into a private MediaWiki
@Andrea,
there are old static dumps available, but I can say that running
the web crawler is not at all difficult. I got a list of topics by looking
at the ?s for DBpedia descriptions and then wrote a very simple
single-threaded crawler that took a few days to run on a micro instance in
2013/12/4 Paul Houle
> I think I could get this data out of some API, but there are great
> HTML 5 parsing libraries now, so a link extractor from HTML can be
> built as quickly as an API client.
>
> There are two big advantages of looking at links in HTML: (i) you can
> use the same software
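The "very simple single-threaded crawler" described above could be sketched roughly as follows. This is a hypothetical illustration, not the actual crawler: the URL pattern, the injectable `fetch` callable, and the politeness delay are all assumptions.

```python
import time
from urllib.parse import quote

def crawl(titles, fetch, delay=1.0):
    """Fetch the Wikipedia page for each title in turn, single-threaded.

    `titles` could come from the ?s of DBpedia descriptions, as described
    above. `fetch` is any callable mapping a URL to page text (e.g. a thin
    wrapper around urllib.request.urlopen), kept injectable for testing.
    """
    pages = {}
    for title in titles:
        url = "http://en.wikipedia.org/wiki/" + quote(title)
        pages[title] = fetch(url)
        time.sleep(delay)  # be polite: pause between requests
    return pages
```

Run sequentially like this on a small machine, a crawl of a topic list naturally takes days, which matches the experience reported above.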
I think I could get this data out of some API, but there are great
HTML 5 parsing libraries now, so a link extractor from HTML can be
built as quickly as an API client.
There are two big advantages of looking at links in HTML: (i) you can
use the same software to analyze multiple sites, and
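A link extractor of the kind described needs nothing beyond a standard HTML parser. As a minimal sketch (using Python's stdlib `html.parser` rather than any particular HTML5 library), it collects every `<a href>` target; filtering for e.g. `/wiki/Special:BookSources/` links would come afterwards:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect every <a href="..."> target from an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links
```

Because the extractor only cares about anchor tags, the same code works unchanged on any site, which is the first advantage mentioned above.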
I guess Paul wanted to know which book is cited by one wikipedia page (e.g.
page A cites book x).
If I am not wrong, by asking for template transclusions you only get the
first part of the triple (page A).
Paul, your use case is interesting.
At the moment we are not dealing with the {{cite}} template n
On Tue, Dec 3, 2013 at 1:44 PM, Paul Houle wrote:
> Something I found out recently is that the page links don't capture
> links that are generated by macros, in particular almost all of the
> links to pages like
>
> http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1
>
> don't sho
Something I found out recently is that the page links don't capture
links that are generated by macros, in particular almost all of the
links to pages like
http://en.wikipedia.org/wiki/Special:BookSources/978-0-936389-27-1
don't show up because they are generated by the {{cite}} macro. These
can
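One way to recover the book citations that the pagelinks dataset misses is to look for ISBNs inside `{{cite}}` templates in the raw wikitext. The regexes below are purely illustrative (they are not the DBpedia extractor's logic, and real templates can nest in ways a simple pattern won't handle):

```python
import re

# Match a whole {{cite ...}} template (no nested braces handled).
CITE_RE = re.compile(r"\{\{\s*cite[^}]*\}\}", re.IGNORECASE | re.DOTALL)
# Match an isbn= parameter inside a template.
ISBN_RE = re.compile(r"isbn\s*=\s*([0-9Xx][0-9Xx-]+)", re.IGNORECASE)

def cited_isbns(wikitext):
    """Return the ISBNs cited via {{cite}} templates in a page's wikitext."""
    isbns = []
    for template in CITE_RE.findall(wikitext):
        match = ISBN_RE.search(template)
        if match:
            isbns.append(match.group(1))
    return isbns
```

Each (page, ISBN) pair then gives both parts of the triple discussed above: page A cites book x.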
In addition to Andrea's reply, we also collect the "red links", which
means links to pages that do not exist (yet).
On Tue, Dec 3, 2013 at 11:32 AM, Andrea Di Menna wrote:
> Hi Dario,
>
> the dataset you are using is extracted by
> the org.dbpedia.extraction.mappings.PageLinksExtractor [1].
> Thi
Hi Dario,
the dataset you are using is extracted by
the org.dbpedia.extraction.mappings.PageLinksExtractor [1].
This extractor collects internal wiki links [2] from Wikipedia content
articles (that is, wikipedia pages which belong to the Main namespace [3])
to other wikipedia pages (please note I
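In essence, a page-links extractor of this kind collects `[[internal link]]` targets from wikitext. The following is a rough illustration only (not DBpedia's actual `PageLinksExtractor` code), keeping just the target of piped links like `[[Target|label]]` and ignoring section anchors:

```python
import re

# Capture the target of an internal wiki link: stop at "|" (pipe/label),
# "#" (section anchor), or "]" (end of link).
WIKILINK_RE = re.compile(r"\[\[([^\]|#]+)")

def internal_links(wikitext):
    """Return the targets of [[...]] internal links in a page's wikitext."""
    return [target.strip() for target in WIKILINK_RE.findall(wikitext)]
```

Note that, as discussed earlier in the thread, links produced by template expansion (e.g. the Special:BookSources links from {{cite}}) never appear in the raw wikitext, so an extractor like this cannot see them.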
Hi,
I'm Dario Garcia-Gasulla, an AI researcher at Barcelona Tech (UPC).
I'm currently doing research on very large directed graphs and I am
using one of your datasets for testing. Concretely, I am using the
"Wikipedia Pagelinks" dataset as available on the DBpedia website.
Unfortunately the desc