[Dbpedia-discussion] Wikipedia Pagecounts Now in Amazon S3

2013-12-05 Thread Paul Houle
I'd like to announce the preliminary availability of a very fascinating dataset http://blog.databaseanimals.com/wikipedia-pagelinks-in-amazon-s3 The gist is that the page counts contain hourly usage information for every page in all the wikipedias, wiktionaries, wikimedia commons, etc. This 3T

[Dbpedia-discussion] parallel rdfDiff

2013-12-05 Thread Paul Houle
I just released a version of Infovore that can do scalable differencing of RDF data sets, producing output in the RDF Patch format http://afs.github.io/rdf-patch/ The tool is written up here https://github.com/paulhoule/infovore/wiki/rdfDiff I ran this against two different weeks of Freebase d

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Paul Houle
The "DBpedia Way" of extracting the citations probably would be to build something that treats the citations the way infoboxes are treated. It's one way of doing things, and it has its own integrity, but it's not the way I do things. (DBpedia does it this way about as well as it can be done,

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Andrea Di Menna
@Paul, unfortunately HTML Wikipedia dumps are not released anymore (the available ones are old static dumps, as you said). This is a problem for a project like DBpedia, as you can easily understand. Moreover, I did not mean that it is not possible to crawl Wikipedia instances or load a dump into a private Mediawi

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Paul Houle
@Andrea, there are old static dumps available, but I can say that running the web crawler is not at all difficult. I got a list of topics by looking at the ?s for DBpedia descriptions and then wrote a very simple single-threaded crawler that took a few days to run on a micro instance in

[Dbpedia-discussion] DBpedia Lexicalizations Dataset uncompress error

2013-12-05 Thread Rodrigo Baquero
Hi, I'm trying to retrieve the DBpedia Lexicalizations Dataset from http://spotlight.dbpedia.org/download/datasets/ The file lexicalizations_en.nq.bz2 can be downloaded but fails to uncompress (the file seems to be incomplete). Is there any mirror where I can get this data? I would like to match
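One way to confirm that a downloaded .bz2 file really is incomplete, before looking for a mirror, is to read it through to the end and see whether the decompressor hits a premature end of stream. The following is a minimal sketch using only Python's standard bz2 module; the function name is_complete_bz2 and the chunk size are assumptions made for illustration, not part of any DBpedia tooling.

```python
import bz2


def is_complete_bz2(path, chunk_size=1 << 20):
    """Return True if the bz2 file at `path` decompresses to the end.

    A truncated download raises EOFError while reading; corrupted
    data raises OSError. Both are reported as "not complete".
    """
    try:
        with bz2.open(path, "rb") as stream:
            # Read and discard the data in chunks so that a large
            # file never has to fit in memory.
            while stream.read(chunk_size):
                pass
    except (EOFError, OSError):
        return False
    return True
```

For example, is_complete_bz2("lexicalizations_en.nq.bz2") would return False for a download that was cut off partway through.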

Re: [Dbpedia-discussion] WikiParser is not working but explicit SimpleWikiParser

2013-12-05 Thread Karsten Jeschkies
Thx for the quick answer. SimpleWikiParser is fine. I just noticed that it does not remove '' or '''. Is that on purpose? thx, Karsten On 4 December 2013 17:01, Dimitris Kontokostas wrote: > Hi Karsten, > > DBpedia 3.9 used only SimpleWikiParser but is a stable branch. If you just > want Simpl

Re: [Dbpedia-discussion] Pagelinks dataset

2013-12-05 Thread Andrea Di Menna
2013/12/4 Paul Houle > I think I could get this data out of some API, but there are great > HTML 5 parsing libraries now, so a link extractor from HTML can be > built as quickly as an API client. > > There are two big advantages of looking at links in HTML: (i) you can > use the same softwar
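To illustrate the kind of link extractor from HTML discussed in this thread, here is a minimal sketch in Python. It uses the standard library's html.parser as a stand-in for a full HTML 5 parsing library, and the names LinkCollector and extract_links are hypothetical; it is not the code used by anyone in the thread.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against base_url."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors that have no href or an empty one.
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url=""):
    """Return all link targets found in an HTML string, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links
```

Passing the page's own URL as base_url turns relative links such as /wiki/Berlin into absolute ones, which makes the output easier to join against other datasets.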