Yes, crawling from one machine is not feasible. If we really go through with extracting these mentions ourselves, Nutch is a good option, or some other kind of parallel downloader, since we don't actually need the crawler functionality. Common Crawl is another cool option. In both cases we would need some kind of OccurrenceSource that reads HTML; boilerpipe is already there as a dependency anyway.
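To make the idea concrete, here is a rough sketch of what such an occurrence source could look like. This is only an illustration under my own assumptions, not a proposal for the actual API: boilerpipe is a Java library, and the class and field names below (HtmlOccurrenceExtractor, Occurrence, surface_form, target, context) are all made up. The sketch just pulls (surface form, Wikipedia target) pairs out of anchor tags, wiki-links style, keeping the running plain text as crude context:

```python
# Hedged sketch of an "occurrence source" over raw HTML.
# Uses only the Python stdlib; a real implementation would sit on top of
# boilerpipe's text extraction instead of the crude context handling here.
from dataclasses import dataclass
from html.parser import HTMLParser

# Assumed link prefix; wiki-links targets look like this.
WIKI_PREFIX = "http://en.wikipedia.org/wiki/"

@dataclass
class Occurrence:
    surface_form: str  # anchor text as it appears on the page
    target: str        # Wikipedia article the anchor points to
    context: str       # plain text seen so far, as rough context

class HtmlOccurrenceExtractor(HTMLParser):
    """Collects mentions from <a> tags that point into Wikipedia."""

    def __init__(self):
        super().__init__()
        self.occurrences = []
        self._text_parts = []
        self._current_target = None
        self._current_anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href", "")
            if href.startswith(WIKI_PREFIX):
                self._current_target = href[len(WIKI_PREFIX):]
                self._current_anchor = []

    def handle_data(self, data):
        self._text_parts.append(data)
        if self._current_target is not None:
            self._current_anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_target is not None:
            self.occurrences.append(Occurrence(
                surface_form="".join(self._current_anchor).strip(),
                target=self._current_target,
                context=" ".join("".join(self._text_parts).split()),
            ))
            self._current_target = None

def extract_occurrences(html: str):
    """Parse one HTML page and return the Wikipedia-linked mentions."""
    parser = HtmlOccurrenceExtractor()
    parser.feed(html)
    return parser.occurrences
```

Running this over a page like `<p>The <a href="http://en.wikipedia.org/wiki/Berlin">Berlin</a> wall fell.</p>` would yield one occurrence with surface form "Berlin" and target "Berlin". The same interface should work whether the pages come from our own crawl or from Common Crawl.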
Maybe it is worth pinging Sameer Singh, who maintains [2], to ask about the timeline for the release of the complete context dataset? If it will take long, we could start with the extraction of occurrences from HTML and see if we can arrange a cluster somewhere to download the pages. What do you guys think?

Cheers,
Max

[2] http://www.iesl.cs.umass.edu/data/wiki-links

On Tue, Apr 16, 2013 at 11:20 AM, Joachim Daiber <daiber.joac...@gmail.com> wrote:
> Ah, good that you spotted this! Well, this might take a while to crawl :)
> Maybe we could also extract the relevant pages from the common crawl corpus
> if crawling ourselves takes too long.

_______________________________________________
Dbpedia-gsoc mailing list
Dbpedia-gsoc@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-gsoc