Hello from a lurker. Powerset was able to get a dump of all content from en.wikipedia.org. Maybe the Wikimedia Foundation can help you as well.
Alternatively, I know that the Heritrix crawler has been modified to store content into HDFS; see http://corporate.zvents.com/developers/thelab.html (at the bottom of the page). Let that run long enough as an initialization step of your experiment(s) and you could build up an enormous corpus. A rough sketch of what writing crawl output into HDFS can look like is below the quoted message. Of course, the results of the above would be noisy and unlabeled. Another option, if labeled and preprocessed data is desired, is the KDD archive: http://kdd.ics.uci.edu/

- Andy

> From: Grant Ingersoll <[EMAIL PROTECTED]>
> Subject: FYI Cloud Computing Resources
> To: [email protected]
> Date: Wednesday, July 30, 2008, 8:26 AM
>
> http://research.yahoo.com/node/2328
>
> It _MAY_ (stressed, emphasized, etc.) be possible for Mahouters
> (or are we just Mahouts?) to get some access to these resources.
> One big question is where can we get some fairly large data sets
> (large, but not super large, I think, but am not sure).
>
> If you have ideas, etc. please let us know.
>
> -Grant
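The sketch below uses the plain Hadoop FileSystem API. It is not the Zvents Heritrix modification itself; the class name, HDFS path, and page bytes are made up for illustration, and the Configuration is assumed to already point at your namenode.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCrawlSink {
        public static void main(String[] args) throws Exception {
            // Assumes fs.default.name in the loaded Configuration points at
            // the HDFS namenode, e.g. hdfs://namenode:9000.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical fetched page; a crawler writer would receive these
            // bytes from the fetch pipeline instead of hard-coding them.
            byte[] pageBytes = "<html>example fetched content</html>".getBytes("UTF-8");
            Path out = new Path("/crawl/raw/example.org/page-0001.html");

            // create() overwrites any existing file at the same path.
            FSDataOutputStream stream = fs.create(out);
            try {
                stream.write(pageBytes);
            } finally {
                stream.close();
            }
            fs.close();
        }
    }

In practice you would probably batch many small pages into Hadoop SequenceFiles rather than writing one HDFS file per page, since HDFS copes far better with a few large files than with millions of tiny ones.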
