Yes, crawling from one machine is not feasible. Nutch would hence be a
good option if we really go through with extracting these mentions
ourselves, or, since we don't actually need the crawler functionality,
some other kind of parallel downloader. Common Crawl is another cool
option. In either case we would need some kind of OccurrenceSource
that reads HTML; boilerpipe is already there as a dependency anyway.
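For illustration, here is a minimal sketch of what such an HTML-backed
occurrence source could look like, using boilerpipe's ArticleExtractor
to strip navigation and other boilerplate. The class and method names
(HtmlOccurrenceSource, findMentionContext) are placeholders I made up,
not Spotlight's actual OccurrenceSource API:

    import de.l3s.boilerpipe.BoilerpipeProcessingException;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    // Hypothetical sketch: turn raw HTML into plain text in which an
    // occurrence extractor can locate entity mentions.
    public class HtmlOccurrenceSource {

        // Strip navigation, ads and other boilerplate with boilerpipe,
        // keeping only the main article text.
        public String extractText(String html) throws BoilerpipeProcessingException {
            return ArticleExtractor.INSTANCE.getText(html);
        }

        // Placeholder: find a surface form (the mention string from the
        // wiki-links data) in the cleaned text and return a context window
        // of the given size around it, or null if it was removed along
        // with the boilerplate.
        public String findMentionContext(String html, String surfaceForm, int window)
                throws BoilerpipeProcessingException {
            String text = extractText(html);
            int pos = text.indexOf(surfaceForm);
            if (pos < 0) {
                return null;
            }
            int start = Math.max(0, pos - window);
            int end = Math.min(text.length(), pos + surfaceForm.length() + window);
            return text.substring(start, end);
        }
    }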

Maybe it is worth pinging Sameer Singh, who maintains [2], to ask
about the timeline for the release of the complete context dataset?
If it will take long, we could start with the extraction of
occurrences from HTML and see if we can arrange a cluster somewhere to
download the pages.
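If we do end up downloading the pages ourselves, even a plain
thread-pool downloader would get us quite far before we need full
crawler machinery. A rough sketch (the URL list file, pool size and
output layout are made up for illustration; no politeness or retry
logic):

    import java.io.InputStream;
    import java.net.URL;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical sketch: fetch a list of URLs in parallel and write
    // each page body to disk.
    public class ParallelDownloader {
        public static void main(String[] args) throws Exception {
            List<String> urls = Files.readAllLines(Paths.get("wiki-links-urls.txt"));
            ExecutorService pool = Executors.newFixedThreadPool(32);
            int i = 0;
            for (String url : urls) {
                final int id = i++;
                pool.submit(() -> {
                    try (InputStream in = new URL(url).openStream()) {
                        Path out = Paths.get("pages", id + ".html");
                        Files.createDirectories(out.getParent());
                        Files.copy(in, out);
                    } catch (Exception e) {
                        System.err.println("failed: " + url + " (" + e + ")");
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.DAYS);
        }
    }

On a cluster we could simply shard the URL list across machines and
run one instance of something like this per node.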

What do you guys think?

Cheers,
Max

[2] http://www.iesl.cs.umass.edu/data/wiki-links


On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber
<daiber.joac...@gmail.com> wrote:
> Ah, good that you spotted this! Well, this might take a while to crawl :)
> Maybe we could also extract the relevant pages from the Common Crawl corpus
> if crawling ourselves takes too long.
