On Nov 2, 2011, at 1:01 PM, Jake Mannix wrote:

> On Wed, Nov 2, 2011 at 5:36 AM, Grant Ingersoll <gsing...@apache.org> wrote:
>
>> Alternatively, the ASF email data is license free. We could take and use
>> a chunk of that. You can pretty much have as much or as little as you
>> want. Since it's broken down by project, it has the rough look and feel of
>> 20newsgroups at much bigger scale.
>
> I like it. Where does that data live, can I download it easily?

Whole Enchilada on EC2: http://aws.amazon.com/datasets/7791434387204566

Small subset at: http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout

You can also log into people.a.o with your ASF creds and grab some of the public sets there. I think there is a link somewhere else to download the actual mbox files, but I can't find it at the moment. I can send you the location off list if you want.