Re: [Dbpedia-gsoc] Woring on Idea Generalize input formats and add support for Google mention corpus

2013-04-16 Thread Pablo N. Mendes
Oh, right! Yes! Thanks, Max. I thought I was going crazy for suggesting using Nutch, but now I remember why. Now I think I am going crazy for going back on my suggestion of using Nutch. :) Damn, I'm doing too much stuff at the same time! Great to have awesome people around that pick up the ball

[Dbpedia-gsoc] Spotlight Task: Extract the necessary DBpedia data directly from the Wikipedia dump

2013-04-16 Thread Joachim Daiber
Hey all, I added this task to the Spotlight ideas, it's smallish, so it's maybe more of a warm-up task: For creating Spotlight models, we need instance_types.nt, redirects.nt and disambiguations.nt. Since we want these to be from the same Wikipedia dump as the one from which we create the

Re: [Dbpedia-gsoc] Documentation for Internationalization/Guide unclear

2013-04-16 Thread Jona Christopher Sahnwaldt
On Apr 16, 2013 2:41 PM, Shivani Poddar shivani.podda...@gmail.com wrote: JC, Thanks a lot for your prompt reply. Will fix the deprecated documentation in the updated wiki :) Thank you! Thanks a lot, Shivani On Tue, Apr 16, 2013 at 6:08 PM, Jona Christopher Sahnwaldt

Re: [Dbpedia-gsoc] Documentation for Internationalization/Guide unclear

2013-04-16 Thread Shivani Poddar
On Tue, Apr 16, 2013 at 6:14 PM, Jona Christopher Sahnwaldt j...@sahnwaldt.de wrote: On Apr 16, 2013 2:41 PM, Shivani Poddar shivani.podda...@gmail.com wrote: JC, Thanks a lot for your prompt reply. Will fix the deprecated documentation in the updated wiki :) Thank you!

Re: [Dbpedia-gsoc] Spotlight Task: Extract the necessary DBpedia data directly from the Wikipedia dump

2013-04-16 Thread Dimitris Kontokostas
Hi Jo, This is a good interdisciplinary task ;) About the extraction script, DBpedia now uses a predefined folder structure for locating dumps / extracting data and follows the Wikipedia dumps structure [1]. There are two options here: 1) Spotlight adapts the configuration to accommodate that 2)

Re: [Dbpedia-gsoc] Regarding the Idea Design a better / interactive display page. for GSoC 2013

2013-04-16 Thread Dimitris Kontokostas
On Tue, Apr 16, 2013 at 3:38 PM, Shivani Poddar shivani.podda...@gmail.com wrote: On Tue, Apr 16, 2013 at 11:52 AM, Dimitris Kontokostas kontokos...@informatik.uni-leipzig.de wrote: Hi Shivani, Like any PHP/MySQL application, this code reads from your (triple-store) database and generates an HTML

Re: [Dbpedia-gsoc] Spotlight Task: Extract the necessary DBpedia data directly from the Wikipedia dump

2013-04-16 Thread Pablo N. Mendes
Or we run DEF extraction on Hadoop. :) Another task idea? Cheers, Pablo On Tue, Apr 16, 2013 at 4:34 PM, Joachim Daiber daiber.joac...@gmail.com wrote: Hey, so far, we download the Wikipedia dumps straight into HDFS. For the DBpedia extraction, we would store the dumps locally first, so

Re: [Dbpedia-gsoc] Woring on Idea Generalize input formats and add support for Google mention corpus

2013-04-16 Thread Max Jakob
Yes, crawling from one machine is not feasible. Nutch is hence a good option if we really go through with extracting these mentions ourselves, or some other kind of parallel download because we don't need the crawler functionality. Common Crawl is another cool option. In both cases we would need

Re: [Dbpedia-gsoc] Woring on Idea Generalize input formats and add support for Google mention corpus

2013-04-16 Thread Pablo N. Mendes
We have the cluster set up and last week I had already downloaded and preprocessed the corpus there to get a subset of mentions of my interest. I was going to install and run Nutch myself, but other priorities came to the top of my priority queue. This is why I suggested that Zhiwei starts with a

Re: [Dbpedia-gsoc] Woring on Idea Generalize input formats and add support for Google mention corpus

2013-04-16 Thread Max Jakob
Sounds good to me. @Zhiwei, you could start by downloading, with a script, a couple of the pages that are listed in the mention corpus and then extract the mentions (we call them occurrences) from them. I suggest you do it regardless of what Google found and look for links to Wikipedia on the
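
As a rough illustration of the step suggested in the message above (extracting occurrences from an already downloaded page by looking for links to Wikipedia), here is a minimal sketch. It is not Spotlight code: the names `WikipediaLinkParser` and `extract_occurrences` are hypothetical, and it assumes an occurrence can be approximated as the anchor text plus the target URL of any link whose host is `wikipedia.org` or a subdomain of it. Downloading the pages is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class WikipediaLinkParser(HTMLParser):
    """Collects (anchor text, href) pairs for links that point to Wikipedia.

    Hypothetical helper for illustration only; not part of DBpedia Spotlight.
    """

    def __init__(self):
        super().__init__()
        self.occurrences = []
        self._href = None   # href of the Wikipedia link currently open, if any
        self._text = []     # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        host = urlparse(href).netloc.lower()
        # Assumption: any wikipedia.org host (any language edition) counts.
        if host == "wikipedia.org" or host.endswith(".wikipedia.org"):
            self._href = href
            self._text = []

    def handle_data(self, data):
        # Only record text while inside a Wikipedia link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            surface_form = "".join(self._text).strip()
            self.occurrences.append((surface_form, self._href))
            self._href = None


def extract_occurrences(html):
    """Return a list of (surface form, Wikipedia URL) pairs found in html."""
    parser = WikipediaLinkParser()
    parser.feed(html)
    parser.close()
    return parser.occurrences
```

Calling `extract_occurrences(page_html)` on the text of a downloaded page returns the pairs in document order; relative links and links to other hosts are ignored.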