I'll join Chandler w/ the downstream user input fwiw...
Pat's earlier email described our shop perfectly re: recommendations. We're a
large organization using Mahout's recommendation capability in several projects
but shy away from the other components. We have a half dozen business units and
several of them have had fits-and-starts with Mahout for
clustering/classification/some fpm but collectively we've started to share the
recommendation capability because it's approachable, has efficient data
requirements for input and is fairly well documented for our use cases. I think
the documentation ills have been captured extensively (esp. recently) on @user
and even here on @dev around some of the other components and I can vouch that
folks in our organization cite that as a reason they abandoned Mahout.
I share Chandler's desire (and that of others who have offered thoughts in this
direction in the past week or so on @dev) that whatever the roadmap is, it's
clear and I can plan around it for the next 24-36 months. We have H2O up
and I confess the potential of migrating our 'data science' activities to a
singular execution framework/interface/ride-along atop existing Hadoop clusters
is alluring. We have expansive sprawl wrt stats packages and some diversity in
ML libs/packages and for an organization our size that's extremely costly. Any
opportunity we have to consolidate capabilities in this space helps us
tremendously. Re: Spark we understand the diversification from MR is coming but
in many important areas of our business we're only now gaining traction with
leaders to implement MR-based solutions. We're a large ship and turn slowly, so
all I ask is that there's a long tail for deprecated MR capabilities because
we'll be slow to convert.
-Original Message-
From: Chandler Burgess [mailto:cburg...@icontrolesi.com]
Sent: Monday, April 07, 2014 4:11 PM
To: dev@mahout.apache.org
Subject: RE: Board Report
First, take my opinions with a grain of salt, as I'm sure most will. This is
basically an anecdote to back up Sean's and Pat's concerns.
I come from an industry (legal) where there is a huge demand for increased
analytics and machine learning applications. Our stack already includes
Lucene/Solr. I had heard about Mahout and was curious about applying it to some
of the things we wanted to do.
I spent around a month playing with Mahout, reading all the documentation and
articles I could, Mahout In Action, Taming Text, etc. After a month, I came
away highly disappointed. The documentation in general is very poor, some of
the drivers are buggy, others unusable because there is basically no
documentation, examples/potential applications are missing (what the hell can I
do with Lanczos SVD output? I just want LSI!), and, now, reading more about
Spark/H2O it leaves me uneasy that anything I write and use Mahout for will
change in the near future, not to mention another platform/technology
(potentially 2!) I have to learn.
It seems far, far away from a 1.0 release, which by all public indications is
next.
It was attractive from a licensing standpoint, and we will probably still use
it just for seq2sparse. And that will be about it. We're already putting a
stack together using other libraries which are better documented, from all
appearances more stable and feature rich, and faster (though maybe not as
scalable in some cases).
I have deadlines to meet, deliverables to produce, and other projects to work
on. As it is, I can't trust Mahout and the learning curve is too steep for
someone like me to apply this in a production environment without being in a
much bigger company with a lot more resources.
That said, my opinion would be that ONE direction needs to be chosen as the
main focus and efforts geared toward that. If it's moving to Spark, which
sounds awesome, then so be it. Otherwise, I fear Mahout will end up a toy for
hobbyists, people who are already vested in it, or relegated to the trash bin
while industry moves on to bigger and better things.
-Original Message-
From: Pat Ferrel [mailto:p...@occamsmachete.com]
Sent: Monday, April 07, 2014 1:03 PM
To: dev@mahout.apache.org
Subject: Re: Board Report
Mahout needs a reboot. Grant has the right perspective, but I'd take it
further. His #2 (two efforts) is not and never would be reasonable in anything
but a huge company.
I have never and would never take a team the size of Mahout (even with some new
committers) and split a reboot into two parts on two engines. No sane project
manager would allow this. Why do we think it will work here?
The recent Gigaom article left me sympathetic with how confused the readers
must be, let alone potential users or contributors.
Sean is not being nihilistic; two directions will not work for Mahout. Mahout
already has a bad reputation for being a poorly documented and poorly
integrated loose collection of code with a lot of technical debt. Honestly has