Re: Mahout's future

Andrew Musselman Sat, 05 Oct 2013 20:08:00 -0700

> These are data sets.


That's what you asked for, but okay.

> On Oct 5, 2013, at 7:08 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
> These are data sets.  Not sample data for testing.
> 
> If you have good examples of how to use one or more of these data sets for
> a realistic test case or demo, please speak up.
> 
> 
> On Sat, Oct 5, 2013 at 6:46 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:
> 
>> Amazon hosts some public data sets at
>> http://aws.amazon.com/publicdatasets/ and http://aws.amazon.com/datasets
>> 
>>> On Oct 5, 2013, at 1:11 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> 
>>> I was asked to answer an anonymous question about the future of Mahout on
>>> Quora and thought I should share the answer here as well.
>>> 
>>> That really depends on where the community of users wants to take Mahout.
>>> 
>>> Some possibilities include:
>>> 
>>> a) better classifiers.  Mahout's capabilities in this respect include
>>> Naive Bayes, Random Forest and logistic regression trained via
>>> single-threaded stochastic gradient descent (SGD).  It would be good to
>>> have a high-quality parallel implementation of SGD and it would be good
>>> to have some kind of deep learning as well.  The random forest could
>>> also use some work.
>>> 
>>> b) faster horses.  I think that the sparse matrices can be made
>>> significantly faster even considering the cost-based optimizer versions
>>> that we already have.  The addition of JBLAS support for dense matrices
>>> would also be interesting.
>>> 
>>> c) better API interfaces.  The clustering interfaces are a bit of a
>>> shambles in spite of the cool capabilities available with streaming
>>> k-means and friends.
>>> 
>>> d) better human interfaces.  It would be great to have products like
>>> Dataiku drive Mahout capabilities.  Dataiku does a really great job of
>>> the cleansing end of machine learning and Mahout really has not much in
>>> that area.  It would also be nice to move forward with Dmitriy Lyubimov's
>>> work on Scala bindings for Mahout.
>>> 
>>> e) bigger community.  There are some closely related communities like the
>>> folks working on Spark with MLI.  More cross-fertilization would be very
>>> cool.
>>> 
>>> f) more data.  Getting sample data for testing is very hard.  Getting
>>> data at scale is exceedingly hard.  If people could suggest a good, big
>>> and freely available dataset, that would be awesome.
>>> 
>>> None of these possibilities matter, however, if somebody doesn't do them.
>>> So the question to each reader of this answer is "What would you like to
>>> see and how can you help make that happen?"
>>
