Hello from a lurker. Powerset was able to get a dump of all content from en.wikipedia.org. Maybe the Wikimedia Foundation can help you as well.
Alternatively, I know that the Heritrix crawler has been modified to store content into HDFS; see http://corporate.zvents.com/developers/thelab.html (at the bottom of the page). Let that run long enough as an initialization step of your experiment(s) and you could build up an enormous corpus. A rough sketch of what writing crawl output into HDFS can look like is below the quoted message. Of course, the results of the above would be noisy and unlabeled. Another option, if labeled and preprocessed data is desired, is the KDD archive: http://kdd.ics.uci.edu/

- Andy

> From: Grant Ingersoll <[EMAIL PROTECTED]>
> Subject: FYI Cloud Computing Resources
> To: [email protected]
> Date: Wednesday, July 30, 2008, 8:26 AM
>
> http://research.yahoo.com/node/2328
>
> It _MAY_ (stressed, emphasized, etc.) be possible for Mahouters
> (or are we just Mahouts?) to get some access to these resources.
> One big question is where can we get some fairly large data sets
> (large, but not super large, I think, but am not sure).
>
> If you have ideas, etc. please let us know.
>
> -Grant
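The sketch below uses the plain Hadoop FileSystem API. It is not the Zvents Heritrix modification itself; the class name, HDFS path, and page bytes are made up for illustration, and the Configuration is assumed to already point at your namenode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCrawlSink {
        public static void main(String[] args) throws Exception {
            // Assumes fs.default.name in the loaded Configuration points at
            // the HDFS namenode, e.g. hdfs://namenode:9000.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical fetched page; a crawler writer would receive these
            // bytes from the fetch pipeline instead of hard-coding them.
            byte[] pageBytes = "<html>example fetched content</html>".getBytes("UTF-8");
            Path out = new Path("/crawl/raw/example.org/page-0001.html");

            // create() overwrites any existing file at the same path.
            FSDataOutputStream stream = fs.create(out);
            try {
                stream.write(pageBytes);
            } finally {
                stream.close();
            }
            fs.close();
        }
    }

In practice you would probably batch many small pages into Hadoop SequenceFiles rather than writing one HDFS file per page, since HDFS copes far better with a few large files than with millions of tiny ones.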
