> These are data sets.
That's what you asked for but okay.

> On Oct 5, 2013, at 7:08 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>
> These are data sets. Not sample data for testing.
>
> If you have good examples of how to use one or more of these data sets for
> a realistic test case or demo, please speak up.
>
>
> On Sat, Oct 5, 2013 at 6:46 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
>
>> Amazon hosts some public data sets at
>> http://aws.amazon.com/publicdatasets/ and http://aws.amazon.com/datasets
>>
>>> On Oct 5, 2013, at 1:11 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>>
>>> I was asked to answer an anonymous question about the future of Mahout on
>>> Quora and thought I should share the answer here as well.
>>>
>>> That really depends on where the community of users wants to take Mahout.
>>>
>>> Some possibilities include:
>>>
>>> a) better classifiers. Mahout's capabilities in this respect include Naive
>>> Bayes, Random Forest and logistic regression trained via single threaded
>>> stochastic gradient descent (SGD). It would be good to have a high quality
>>> parallel implementation of SGD and it would be good to have some kind of
>>> deep learning as well. The random forest could also use some work.
>>>
>>> b) faster horses. I think that the sparse matrices can be made
>>> significantly faster even considering the cost-based optimizer versions
>>> that we already have. The addition of JBLAS support for dense matrices
>>> would also be interesting.
>>>
>>> c) better API interfaces. The clustering interfaces are a bit of a
>>> shambles in spite of the cool capabilities available with streaming k-means
>>> and friends.
>>>
>>> d) better human interfaces. It would be great to have products like
>>> Dataiku drive Mahout capabilities. Dataiku does a really great job of the
>>> cleansing end of machine learning and Mahout really has not much in that
>>> area. It would also be nice to move forward with Dmitriy Lyubimov's work
>>> on Scala bindings for Mahout.
>>>
>>> e) bigger community. There are some closely related communities like the
>>> folks working on Spark with MLI. More cross fertilization would be very
>>> cool.
>>>
>>> f) more data. Getting sample data for testing is very hard. Getting data
>>> at scale is exceedingly hard. If people could suggest a good, big and
>>> freely available dataset, that would be awesome.
>>>
>>> None of these possibilities matter, however, if somebody doesn't do them.
>>> So the question to each reader of this answer is "What would you like to
>>> see and how can you help make that happen"?