These are data sets, not sample data for testing. If you have good examples of how to use one or more of these data sets for a realistic test case or demo, please speak up.

On Sat, Oct 5, 2013 at 6:46 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

> Amazon hosts some public data sets at
> http://aws.amazon.com/publicdatasets/ and http://aws.amazon.com/datasets
>
> On Oct 5, 2013, at 1:11 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> > I was asked to answer an anonymous question about the future of Mahout
> > on Quora and thought I should share the answer here as well.
> >
> > That really depends on where the community of users wants to take
> > Mahout.
> >
> > Some possibilities include:
> >
> > a) better classifiers. Mahout's capabilities in this respect include
> > Naive Bayes, Random Forest and logistic regression trained via single
> > threaded stochastic gradient descent (SGD). It would be good to have a
> > high quality parallel implementation of SGD and it would be good to
> > have some kind of deep learning as well. The random forest could also
> > use some work.
> >
> > b) faster horses. I think that the sparse matrices can be made
> > significantly faster even considering the cost-based optimizer versions
> > that we already have. The addition of JBLAS support for dense matrices
> > would also be interesting.
> >
> > c) better API interfaces. The clustering interfaces are a bit of a
> > shambles in spite of the cool capabilities available with streaming
> > k-means and friends.
> >
> > d) better human interfaces. It would be great to have products like
> > Dataiku drive Mahout capabilities. Dataiku does a really great job of
> > the cleansing end of machine learning and Mahout really has not much in
> > that area. It would also be nice to move forward with Dmitriy
> > Lyubimov's work on Scala bindings for Mahout.
> >
> > e) bigger community. There are some closely related communities like
> > the folks working on Spark with MLI. More cross fertilization would be
> > very cool.
> >
> > f) more data. Getting sample data for testing is very hard. Getting
> > data at scale is exceedingly hard. If people could suggest a good, big
> > and freely available dataset, that would be awesome.
> >
> > None of these possibilities matter, however, if somebody doesn't do
> > them. So the question to each reader of this answer is "What would you
> > like to see and how can you help make that happen"?