These are data sets, not sample data for testing.

If you have good examples of how to use one or more of these data sets for
a realistic test case or demo, please speak up.


On Sat, Oct 5, 2013 at 6:46 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

> Amazon hosts some public data sets at
> http://aws.amazon.com/publicdatasets/ and http://aws.amazon.com/datasets
>
> > On Oct 5, 2013, at 1:11 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> > I was asked to answer an anonymous question about the future of Mahout on
> > Quora and thought I should share the answer here as well.
> >
> > That really depends on where the community of users wants to take Mahout.
> >
> > Some possibilities include:
> >
> > a) better classifiers.  Mahout's capabilities in this respect include
> > Naive Bayes, Random Forest and logistic regression trained via
> > single-threaded stochastic gradient descent (SGD).  It would be good to
> > have a high-quality parallel implementation of SGD and it would be good
> > to have some kind of deep learning as well.  The random forest could
> > also use some work.
> >
> > b) faster horses.  I think that the sparse matrices can be made
> > significantly faster even considering the cost-based optimizer versions
> > that we already have.  The addition of JBLAS support for dense matrices
> > would also be interesting.
> >
> > c) better API interfaces.  The clustering interfaces are a bit of a
> > shambles in spite of the cool capabilities available with streaming
> > k-means and friends.
> >
> > d) better human interfaces.  It would be great to have products like
> > Dataiku drive Mahout capabilities.  Dataiku does a really great job of
> > the cleansing end of machine learning and Mahout really does not have
> > much in that area.  It would also be nice to move forward with Dmitriy
> > Lyubimov's work on Scala bindings for Mahout.
> >
> > e) bigger community.  There are some closely related communities like the
> > folks working on Spark with MLI.  More cross-fertilization would be very
> > cool.
> >
> > f) more data.  Getting sample data for testing is very hard.  Getting
> > data at scale is exceedingly hard.  If people could suggest a good, big
> > and freely available dataset, that would be awesome.
> >
> > None of these possibilities matter, however, if somebody doesn't do them.
> > So the question to each reader of this answer is "What would you like to
> > see and how can you help make that happen"?
>
