Sounds good to me.

@Zhiwei, you could start by writing a script that downloads a couple of
the pages listed in the mention corpus and then extracts the mentions
(we call them occurrences) from them. I suggest you do this
independently of what Google found and simply look for links to
Wikipedia on the respective pages.
Please let us know if you would like to take up this task.
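As a minimal sketch of that extraction step, assuming plain HTML pages and using only the Python standard library (the mention corpus format itself is not shown here, so the example runs on an inline HTML snippet rather than a downloaded page):

```python
from html.parser import HTMLParser

class WikiLinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs for links pointing into Wikipedia."""

    def __init__(self):
        super().__init__()
        self.links = []    # extracted (href, anchor text) occurrences
        self._href = None  # href of the Wikipedia <a> we are inside, if any
        self._text = []    # anchor text fragments of the current link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if "wikipedia.org/wiki/" in href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

# Stand-in for a page downloaded from the corpus URL list.
sample = '<p>See <a href="http://en.wikipedia.org/wiki/Berlin">the capital</a>.</p>'
parser = WikiLinkExtractor()
parser.feed(sample)
print(parser.links)  # [('http://en.wikipedia.org/wiki/Berlin', 'the capital')]
```

For real pages one would feed in the body fetched per URL (e.g. via urllib) and probably run boilerplate removal first, as discussed below; the class and variable names here are illustrative, not part of any existing codebase.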

Cheers,
Max

On Tue, Apr 16, 2013 at 7:38 PM, Pablo N. Mendes <pablomen...@gmail.com> wrote:
> We have the cluster set up and last week I had already downloaded and
> preprocessed the corpus there to get a subset of mentions of my interest. I
> was going to install and run Nutch myself, but other priorities came to the
> top of my priority queue.
>
> This is why I suggested that Zhiwei start with a small set, so that he can
> test it in his single-machine setup. If it works, we can take whatever he has
> and run with a larger set on the cluster.
>
> Cheers,
> Pablo
>
>
> On Tue, Apr 16, 2013 at 6:49 PM, Max Jakob <max.ja...@gmail.com> wrote:
>>
>> Yes, crawling from one machine is not feasible. Nutch is hence a good
>> option if we really go through with extracting these mentions
>> ourselves; or some other kind of parallel downloader, since we don't
>> actually need the crawler functionality. Common Crawl is another cool option.
>> In both cases we would need some kind of OccurrenceSource for HTML;
>> boilerpipe is already there as a dependency anyway.
>>
>> Maybe it is worth pinging Sameer Singh, who maintains [2], to ask about
>> the timeline of the release of the complete context dataset? If it
>> will take long, we could start with the extraction of occurrences from
>> html and see if we can arrange a cluster somewhere to download the
>> pages.
>>
>> What do you guys think?
>>
>> Cheers,
>> Max
>>
>> [2] http://www.iesl.cs.umass.edu/data/wiki-links
>>
>>
>> On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
>> <daiber.joac...@gmail.com> wrote:
>> > Ah, good that you spotted this! Well, this might take a while to crawl
>> > :)
>> > Maybe we could also extract the relevant pages from the Common Crawl
>> > corpus
>> > if crawling ourselves takes too long.
>
>
>
>
> --
>
> Pablo N. Mendes
> http://pablomendes.com

_______________________________________________
Dbpedia-gsoc mailing list
Dbpedia-gsoc@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc
