I'd like to announce the preliminary availability of a fascinating dataset
http://blog.databaseanimals.com/wikipedia-pagelinks-in-amazon-s3
The gist is that the page counts contain hourly usage information for
every page in all the Wikipedias, Wiktionaries, Wikimedia Commons,
etc.
This 3T
I just released a version of Infovore that can do scalable
differencing of RDF data sets, producing output in the RDF Patch
format
http://afs.github.io/rdf-patch/
The tool is written up here
https://github.com/paulhoule/infovore/wiki/rdfDiff
I ran this against two different weeks of Freebase d
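RDF Patch represents a diff as one operation per row: `A` for an added triple, `D` for a deleted one. As a rough in-memory illustration of the idea (not Infovore's actual implementation, which does this differencing at scale), a set difference over N-Triples lines produces output of the same shape:

```python
def rdf_diff(old_triples, new_triples):
    """Set-difference two collections of N-Triples lines into RDF Patch rows.

    Triples present only in the old set become 'D' (delete) rows; triples
    present only in the new set become 'A' (add) rows, one operation per line.
    """
    old, new = set(old_triples), set(new_triples)
    patch = [f"D {t}" for t in sorted(old - new)]
    patch += [f"A {t}" for t in sorted(new - old)]
    return patch

old = ['<a> <p> "1" .', '<a> <p> "2" .']
new = ['<a> <p> "2" .', '<a> <p> "3" .']
for row in rdf_diff(old, new):
    print(row)
```

The real work in a scalable differ is keeping the two datasets in a canonical sort order so the comparison can stream instead of holding everything in memory.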
The "DBpedia Way" of extracting the citations probably would be to
build something that treats the citations the way infoboxes are
treated.
It's one way of doing things, and it has its own integrity, but
it's not the way I do things. (DBpedia does it this way about as well
as it can be done,
@Paul,
unfortunately, HTML Wikipedia dumps are not released anymore (the
available ones are old static dumps, as you said).
This is a problem for a project like DBpedia, as you can easily understand.
Moreover, I did not mean that it is not possible to crawl Wikipedia
instances or load the dumps into a private Mediawi
@Andrea,
there are old static dumps available, but I can say that running
the web crawler is not at all difficult. I got a list of topics by looking
at the ?s for DBpedia descriptions and then wrote a very simple
single-threaded crawler that took a few days to run on a micro instance in
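A crawler in that spirit fits in a few lines. The URL scheme and the fetch function below are assumptions for illustration, not the code actually used; the point is the single-threaded loop with a polite pause between requests:

```python
import time
import urllib.parse
import urllib.request

def page_url(title):
    # Assumed Wikipedia URL scheme: spaces become underscores, then percent-encode.
    return "https://en.wikipedia.org/wiki/" + urllib.parse.quote(title.replace(" ", "_"))

def crawl(titles, fetch=None, delay=1.0):
    """Fetch each page one at a time, sleeping between requests.

    Single-threaded on purpose: slow, but gentle enough to run for days
    on a small instance without hammering the servers.
    """
    if fetch is None:
        fetch = lambda url: urllib.request.urlopen(url).read()
    for title in titles:
        yield title, fetch(page_url(title))
        time.sleep(delay)
```

Passing `fetch` in as a parameter also makes the loop testable without network access.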
Hi,
I'm trying to retrieve the DBpedia Lexicalizations Dataset from
http://spotlight.dbpedia.org/download/datasets/
The file lexicalizations_en.nq.bz2 can be downloaded but fails to uncompress
(the file seems to be incomplete). Is there a mirror where I can get this
data?
I would like to match
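One quick way to confirm that a download like that is truncated (just a diagnostic sketch, assuming Python is at hand, not an official tool) is to stream the file through a bz2 decompressor and check whether it reaches a clean end-of-stream:

```python
import bz2

def bz2_is_complete(path, chunk_size=1 << 20):
    """Return True if the .bz2 file decompresses all the way to end-of-stream.

    A truncated download either raises an error partway through or simply
    never reaches the end-of-stream marker, so `decomp.eof` stays False.
    """
    decomp = bz2.BZ2Decompressor()
    try:
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                decomp.decompress(chunk)
        return decomp.eof
    except (EOFError, OSError):
        return False
```

Streaming in chunks keeps memory flat even for multi-gigabyte dumps.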
Thx for the quick answer. SimpleWikiParser is fine. I just noticed that it
does not remove '' or '''. Is that on purpose?
thx,
Karsten
On 4 December 2013 17:01, Dimitris Kontokostas wrote:
> Hi Karsten,
>
> DBpedia 3.9 used only SimpleWikiParser, but that is a stable branch. If you just
> want Simpl
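For what it's worth, the `''` and `'''` sequences are MediaWiki's italic and bold markers, so if the parser leaves them in, they can be stripped in post-processing. A hypothetical helper (not part of the extraction framework):

```python
import re

def strip_quote_markup(wikitext):
    # Remove runs of two or more apostrophes (MediaWiki italic/bold markers)
    # while leaving ordinary single apostrophes alone.
    return re.sub(r"''+", "", wikitext)

print(strip_quote_markup("'''bold''' and ''italic'' in O'Brien's text"))
# → bold and italic in O'Brien's text
```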
2013/12/4 Paul Houle
> I think I could get this data out of some API, but there are great
> HTML 5 parsing libraries now, so a link extractor from HTML can be
> built as quickly as an API client.
>
> There are two big advantages of looking at links in HTML: (i) you can
> use the same softwar
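As a sketch of such a link extractor, here is a version using only Python's stdlib `html.parser` (the HTML 5 parsing libraries mentioned above would be more robust against real-world markup, but the shape is the same):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags as the parser streams the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

parser = LinkExtractor()
parser.feed('<p><a href="/wiki/RDF">RDF</a> and <a href="/wiki/OWL">OWL</a></p>')
print(parser.links)
# → ['/wiki/RDF', '/wiki/OWL']
```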