Oh, right! Yes! Thanks, Max. I thought I was going crazy for suggesting
using Nutch, but now I remember why. Now I think I am going crazy for going
back on my suggestion of using Nutch. :) Damn, I'm doing too much stuff at
the same time! Great to have awesome people around that pick up the ball
Hey all,
I added this task to the Spotlight ideas. It's smallish, so it's maybe more
of a warm-up task:
For creating Spotlight models, we need instance_types.nt, redirects.nt and
disambiguations.nt. Since we want these to be from the same Wikipedia dump
as the one from which we create the
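A quick sanity check along these lines could catch mismatched dumps early. Rough Python sketch, untested; the file-naming convention it assumes (a YYYYMMDD dump date somewhere in each file name) is just a placeholder, not the actual DBpedia layout:

# Sketch: check that instance_types, redirects and disambiguations were
# produced from the same Wikipedia dump as the pages-articles file.
# The file naming (a YYYYMMDD date embedded in each name) is an assumption.
import os
import re
import sys

REQUIRED = ["instance_types", "redirects", "disambiguations"]

def dump_date(path):
    """Pull a YYYYMMDD dump date out of a file name, if there is one."""
    m = re.search(r"(\d{8})", os.path.basename(path))
    return m.group(1) if m else None

def check_same_dump(pages_articles, dataset_dir):
    expected = dump_date(pages_articles)
    if expected is None:
        sys.exit("no dump date found in %s" % pages_articles)
    for name in REQUIRED:
        matches = [f for f in os.listdir(dataset_dir) if f.startswith(name)]
        if not matches:
            print("missing dataset: %s" % name)
        for f in matches:
            if dump_date(f) != expected:
                print("%s comes from a different dump (%s != %s)"
                      % (f, dump_date(f), expected))

if __name__ == "__main__":
    check_same_dump(sys.argv[1], sys.argv[2])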
On Apr 16, 2013 2:41 PM, Shivani Poddar shivani.podda...@gmail.com
wrote:
JC,
Thanks a lot for your prompt reply.
Will fix the deprecated documentation in the updated wiki :)
Thank you!
Thanks a lot,
Shivani
Hi Jo,
This is a good interdisciplinary task ;)
About the extraction script, DBpedia now uses a predefined folder structure
for locating dumps / extracting data and follows the Wikipedia dumps
structure [1].
There are two options here:
1) Spotlight adapts the configuration to accommodate that
2)
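For illustration, the kind of layout I mean looks roughly like this; the path pattern below is written from memory, so treat it as an approximation and check [1] for the exact spec:

# Sketch of the dump folder layout the extraction framework expects
# (pattern is an approximation; see [1] for the authoritative description):
#   <base-dir>/<lang>wiki/<date>/<lang>wiki-<date>-pages-articles.xml.bz2
import os

def dump_path(base_dir, lang, date):
    """Build the expected location of a pages-articles dump under base_dir."""
    name = "%swiki-%s-pages-articles.xml.bz2" % (lang, date)
    return os.path.join(base_dir, "%swiki" % lang, date, name)

print(dump_path("/data/dbpedia/dumps", "en", "20130403"))
# /data/dbpedia/dumps/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2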
On Tue, Apr 16, 2013 at 3:38 PM, Shivani Poddar
shivani.podda...@gmail.com wrote:
On Tue, Apr 16, 2013 at 11:52 AM, Dimitris Kontokostas
kontokos...@informatik.uni-leipzig.de wrote:
Hi Shivani,
Like any PHP/MySQL app, this code reads from your (triple-store) database and
generates HTML
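The same idea in a rough Python sketch, with the public DBpedia endpoint standing in for your own triple store (endpoint URL and query are placeholders, not what the PHP code actually runs):

# Sketch: read from a triple store via SPARQL and emit a small HTML table.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://dbpedia.org/sparql")  # stand-in endpoint
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?s ?label WHERE {
      ?s rdfs:label ?label .
      FILTER (lang(?label) = "en")
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)
results = endpoint.query().convert()

rows = "".join(
    "<tr><td>%s</td><td>%s</td></tr>" % (b["s"]["value"], b["label"]["value"])
    for b in results["results"]["bindings"]
)
print("<table>%s</table>" % rows)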
Or we run DEF (DBpedia Extraction Framework) extraction on Hadoop. :)
Another task idea?
Cheers,
Pablo
On Tue, Apr 16, 2013 at 4:34 PM, Joachim Daiber daiber.joac...@gmail.com wrote:
Hey,
so far, we download the Wikipedia dumps straight into HDFS. For the
DBpedia extraction, we would store the dumps locally first, so
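For the "straight into HDFS" part, the idea is roughly the following (sketch only; the dump URL and HDFS path are placeholders), streaming the download through hadoop fs -put so nothing needs to be staged on local disk:

# Sketch: pipe a Wikipedia dump download directly into HDFS.
import subprocess

DUMP_URL = "http://dumps.wikimedia.org/enwiki/20130403/enwiki-20130403-pages-articles.xml.bz2"
HDFS_PATH = "/user/spotlight/dumps/enwiki-20130403-pages-articles.xml.bz2"

curl = subprocess.Popen(["curl", "-sL", DUMP_URL], stdout=subprocess.PIPE)
put = subprocess.Popen(["hadoop", "fs", "-put", "-", HDFS_PATH], stdin=curl.stdout)
curl.stdout.close()  # so curl gets SIGPIPE if the HDFS side dies
put.wait()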
Yes, crawling from one machine is not feasible. Nutch is hence a good
option if we really go through with extracting these mentions
ourselves, or we could use some other kind of parallel download, since we
don't need the crawler functionality. Common Crawl is another cool option.
In both cases we would need
We have the cluster set up, and last week I had already downloaded and
preprocessed the corpus there to get a subset of mentions of interest to me. I
was going to install and run Nutch myself, but other priorities came to the
top of my priority queue.
This is why I suggested that Zhiwei starts with a
Sounds good to me.
@Zhiwei, you could start with a script that downloads a couple of the pages
listed in the mention corpus and then extracts the mentions (we call them
occurrences) from them. I suggest you do it
regardless of what Google found and look for links to Wikipedia on the
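A very rough sketch of the kind of script I mean (the corpus file name and its one-URL-per-line format are assumptions, and the real occurrence extraction will need more than a regex):

# Sketch: fetch a few pages listed in the mention corpus and print the
# links to English Wikipedia found on each page (the "occurrences").
import re
import sys
from urllib.request import urlopen

WIKI_LINK = re.compile(r'href="(https?://en\.wikipedia\.org/wiki/[^"#]+)"')

def occurrences(page_url):
    """Return all links to English Wikipedia articles found on the page."""
    html = urlopen(page_url).read().decode("utf-8", errors="replace")
    return WIKI_LINK.findall(html)

if __name__ == "__main__":
    with open(sys.argv[1]) as corpus:          # e.g. mention_urls.txt (assumed format)
        urls = [line.strip() for line in corpus if line.strip()]
    for url in urls[:5]:                       # start with just a couple of pages
        for link in occurrences(url):
            print("%s\t%s" % (url, link))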