On Sat, Mar 1, 2014 at 2:05 PM, Sebastian Schelter <s...@apache.org> wrote:
> Hi,
>
> I think this is an important discussion to have and it's good that we
> have it. I wish I could say otherwise, but I have encountered a lot
> of the impressions that Sean mentioned. To be honest, I don't see
> Mahout being ready to move to 1.0 in its current state.
>
> I still see our main problem in failing to provide viable
> documentation and guidance to users. We cleaned up the wiki, but this
> is only a first step. I feel that it is extremely hard for people to
> use the majority of our algorithms, unless they understand the
> mathematical details and are willing to dig through the source code.
> I think Mahout contains a lot of "hidden gems" that make it unique
> (e.g. cooccurrence analysis with RowSimilarityJob, LDA with CVB,
> SSVD+PCA), but for the majority of users these gems are out of reach.
>
> Another important aspect is that machine learning on MapReduce will
> vanish very soon and there's no vision to move Mahout to more
> suitable platforms yet.

Before we can even work on supporting other platforms, we have to deal
with the Hadoop dependencies in the codebase. Perhaps we can slowly
but surely reduce the dependencies on Hadoop, or at least contain them
behind more abstraction. Only MR code should be using the Hadoop API,
IMO. For example, many classes depend on Hadoop solely for serializing
and deserializing models. Perhaps models could be written to and read
from some model store interface, with implementations for HDFS, the
local filesystem, or perhaps even a remote API. Take NeuralNetwork,
for instance: it depends on Hadoop only for reading and writing its
model to and from HDFS.
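To make that concrete, here is a rough sketch of the kind of
abstraction I have in mind. All the names below are made up for
illustration; nothing like this exists in the codebase today:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    // Models only know how to (de)serialize themselves to generic
    // streams, with no Hadoop types in sight.
    interface PersistableModel {
      void writeTo(DataOutput out) throws IOException;
      void readFrom(DataInput in) throws IOException;
    }

    // Storage lives behind one small interface ...
    interface ModelStore {
      void save(PersistableModel model) throws IOException;
      void load(PersistableModel model) throws IOException;
    }

    // ... and only the HDFS implementation touches the Hadoop API.
    class HdfsModelStore implements ModelStore {
      private final org.apache.hadoop.fs.FileSystem fs;
      private final org.apache.hadoop.fs.Path path;

      HdfsModelStore(org.apache.hadoop.fs.FileSystem fs,
                     org.apache.hadoop.fs.Path path) {
        this.fs = fs;
        this.path = path;
      }

      @Override
      public void save(PersistableModel model) throws IOException {
        // FSDataOutputStream is a DataOutput, so the model never
        // needs to know it is writing to HDFS.
        org.apache.hadoop.fs.FSDataOutputStream out = fs.create(path);
        try {
          model.writeTo(out);
        } finally {
          out.close();
        }
      }

      @Override
      public void load(PersistableModel model) throws IOException {
        // Likewise, FSDataInputStream is just a DataInput here.
        org.apache.hadoop.fs.FSDataInputStream in = fs.open(path);
        try {
          model.readFrom(in);
        } finally {
          in.close();
        }
      }
    }

A local filesystem implementation would be the same few lines against
java.io, and NeuralNetwork could then be handed a ModelStore instead
of calling HDFS directly.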
> I think our lack of documentation causes a lack of users, which
> stalls development and, together with the emergence of other
> platforms like Spark, makes it hard for us to attract new people.

Here is a radical idea: how about creating reference documentation,
i.e. a single PDF or HTML document? This can be generated with a Maven
Docbook plugin. If the docs are part of the code and generated from
it, users can contribute patches to the documentation because it sits
alongside the source code. We might even be able to generate algorithm
characteristics (sequential, MR) from the source code using a script,
perhaps through annotations. We would move the current wiki docs
inside the project and keep wiki pages only for logistical project
information about Mahout and Apache. Let me know what you think. I can
create tickets for these two issues if there is enough interest.

> I must say that I think that the architecture of Oryx is really what
> I would envision for Mahout. Provide a computation layer for training
> models and a serving layer with a REST API or Solr for deploying
> them. And then abstract the training in the computation layer to
> enable training in-memory, with Hadoop, Spark, Stratosphere, you name
> it. I was very emotional when we had the discussion after Oryx was
> announced as a separate project, because I felt that this is what
> Mahout should have become.

If Mahout has a well-designed Java API, a REST layer can easily be
added on top via other frameworks.

Frank

> Just my 2 cents,
> Sebastian
>
> On 02/28/2014 10:56 AM, Sean Owen wrote:
>
>> OK, your defeatism is my realism. Why has Negative Nancy intruded on
>> this conversation?
>>
>> I have a view into many large Hadoop users. The feedback from the
>> minority that have tried Mahout is that it is
>> inconsistent/unfinished ("a confederation of unrelated grad-school
>> projects", as one put it), buggy, and hard to use except as a few
>> copied snippets of code. Ouch!
>>
>> Only a handful that I'm aware of actually use it. Internally, there
>> is a perception that there is no community attention to most of the
>> code (see the JIRA backlog). As a result -- software problems,
>> community issues, little demand -- it is almost certainly not going
>> to be in our next major packaging release, and was almost not in the
>> current forthcoming one.
>>
>> Your Reality May Vary. This seems like yellow-flag territory for an
>> Apache project, though, if this is representative of a wider
>> reality. So a conversation about whole other projects' worth of new
>> functionality feels quite disconnected -- red-flag territory.
>>
>> To be constructive, here are four items that seem more important for
>> something like "1.0.0" and are even a lot less work:
>>
>> - Use the Hadoop .mapreduce API consistently
>> - Standardize input/output formats of all jobs
>> - Remove use of deprecated code
>> - Clear even a third of the open JIRA backlog
>>
>> (I still think it's fine to make separate projects for quite
>> different ideas. Hadoop has another ML project, and is about to have
>> yet another. These good ideas might well belong there. Here, I think
>> there is a big need for shoring up if the project is even going to
>> survive to 1.0.)
>>
>> On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sro...@gmail.com> wrote:
>>
>>> I think each of several of these other points is probably, on its
>>> own, several times the amount of work that has been put into this
>>> project over the past year, so I'm wondering if this is close to
>>> realistic as a to-do list for 1.0 of this project.
>>
>> That is mean. I think that everything on this list is possible in
>> relatively short order, but let's talk goals for a bit.
>>
>> What is missing here? What really doesn't matter?