Great step, thanks Frank. A few rough sketches inline below (and at the bottom) to try to make some of these ideas concrete.
> On Mar 1, 2014, at 10:29 AM, Frank Scholten <fr...@frankscholten.nl> wrote:
>
> I got inspired by the discussion, so I took a first step in reducing
> Hadoop dependencies in the naive bayes code.
>
> See my Github branch:
> https://github.com/frankscholten/mahout/tree/naivebayes-modelrepository
>
> I introduced a repository class for reading and writing the NaiveBayesModel
> to and from HDFS.
>
> Turns out we store the model in 2 ways: in an HDFS folder structure and in
> an HDFS file. The code I added makes this explicit.
>
> In this branch NaiveBayesModel only depends on Vector, Matrix and
> Preconditions, but no longer on Hadoop.
>
> If we apply this approach to the other models in Mahout we could get rid
> of a lot of Hadoop dependencies.
>
> Frank
>
> On Sat, Mar 1, 2014 at 5:32 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
>
>> On Sat, Mar 1, 2014 at 2:05 PM, Sebastian Schelter <s...@apache.org> wrote:
>>
>>> Hi,
>>>
>>> I think this is an important discussion to have and it's good that we
>>> are having it. I wish I could say otherwise, but I have encountered a
>>> lot of the impressions that Sean mentioned. To be honest, I don't see
>>> Mahout being ready to move to 1.0 in its current state.
>>>
>>> I still see our main problem as failing to provide viable documentation
>>> and guidance to users. We cleaned up the wiki, but this is only a first
>>> step. I feel that it is extremely hard for people to use a majority of
>>> our algorithms unless they understand the mathematical details and are
>>> willing to dig through the source code. I think Mahout contains a lot of
>>> "hidden gems" that make it unique (e.g. cooccurrence analysis with
>>> RowSimilarityJob, LDA with CVB, SSVD+PCA), but for the majority of users
>>> these gems are out of reach.
>>>
>>> Another important aspect is that machine learning on MapReduce will
>>> vanish very soon, and there's no vision yet to move Mahout to more
>>> suitable platforms.
>>
>> Before we can even work on supporting other platforms we have to handle
>> the Hadoop dependencies in the codebase. Perhaps we can start to slowly
>> but surely reduce the dependencies on Hadoop, or at least contain them by
>> adding more abstraction. Only MR code should be using the Hadoop API, IMO.
>>
>> For example, many classes depend on Hadoop for serializing and
>> deserializing models. Perhaps we can make it so a model can be written to
>> or read from some model interface, which can have implementations for
>> HDFS, the local filesystem, or perhaps even a remote API. Take
>> NeuralNetwork, for instance. It has dependencies on Hadoop, but only for
>> reading and writing the model to and from HDFS.
>>
>>> I think our lack of documentation causes a lack of users, which stalls
>>> the development and, together with the emergence of other platforms like
>>> Spark, makes it hard for us to attract new people.
>>
>> Here is a radical idea: how about creating reference documentation, i.e.
>> a single PDF or HTML document? This can be generated using Maven docbook.
>> If the docs are part of the code and generated, users can contribute
>> patches to the documentation because it sits alongside the source code.
>> We might even be able to generate algorithm characteristics (sequential,
>> MR) from the source code using a script, perhaps through annotations. We
>> move the current wiki docs inside the project and create wiki pages only
>> for logistical project information about Mahout and Apache.
>>
>> Let me know what you think. I can make tickets for these two issues if
>> there is enough interest.
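Both ideas sound good to me. To make the repository idea concrete, here is
roughly the shape I picture. The names below are my own guesses, not
necessarily what is in Frank's branch, and I'm assuming the model can
(de)serialize itself against plain streams, which is exactly what decoupling
it from Hadoop buys us (each type would live in its own file):

    import java.io.IOException;

    // A model-agnostic persistence interface; only implementations know
    // about Hadoop, the local filesystem, or a remote API.
    public interface ModelRepository<M> {
      void write(M model) throws IOException;
      M read() throws IOException;
    }

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // HDFS implementation: the only place that touches the Hadoop API.
    public class HdfsModelRepository implements ModelRepository<NaiveBayesModel> {

      private final Path path;
      private final Configuration conf;

      public HdfsModelRepository(Path path, Configuration conf) {
        this.path = path;
        this.conf = conf;
      }

      @Override
      public void write(NaiveBayesModel model) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        FSDataOutputStream out = fs.create(path);
        try {
          model.serialize(out); // hypothetical: writes to a plain DataOutput
        } finally {
          out.close();
        }
      }

      @Override
      public NaiveBayesModel read() throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        FSDataInputStream in = fs.open(path);
        try {
          return NaiveBayesModel.deserialize(in); // hypothetical factory method
        } finally {
          in.close();
        }
      }
    }

A LocalFileModelRepository and, later, a REST-backed one would implement the
same interface, so NeuralNetwork and friends never need to import anything
from org.apache.hadoop.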
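And on the generated-docs idea, the annotation could be as small as this
(again just a sketch, names invented):

    import java.lang.annotation.ElementType;
    import java.lang.annotation.Retention;
    import java.lang.annotation.RetentionPolicy;
    import java.lang.annotation.Target;

    // A marker the doc build could scan (via reflection or an annotation
    // processor) to generate the algorithm characteristics table.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface MahoutAlgorithm {

      enum Execution { SEQUENTIAL, MAPREDUCE }

      String name();
      Execution[] execution();
    }

    // Usage on an algorithm driver, e.g.:
    // @MahoutAlgorithm(name = "Naive Bayes",
    //                  execution = { Execution.SEQUENTIAL, Execution.MAPREDUCE })

A small script or annotation processor could then emit the per-algorithm
table straight into the docbook sources, so it never drifts from the code.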
>>> I must say that I think the architecture of Oryx is really what I would
>>> envision for Mahout: provide a computation layer for training models and
>>> a serving layer with a REST API or Solr for deploying them, and then
>>> abstract the training in the computation layer to enable training
>>> in-memory, with Hadoop, Spark, Stratosphere, you name it. I was very
>>> emotional when we had the discussion after Oryx was announced as a
>>> separate project, because I felt that this is what Mahout should have
>>> become.
>>
>> If Mahout has a well-designed Java API, a REST layer can be added easily
>> via other frameworks.
>>
>> Frank
>>
>>> Just my 2 cents,
>>> Sebastian
>>>
>>>> On 02/28/2014 10:56 AM, Sean Owen wrote:
>>>>
>>>> OK, your defeatism is my realism. Why has Negative Nancy intruded on
>>>> this conversation?
>>>>
>>>> I have a view into many large Hadoop users. The feedback from the
>>>> minority that have tried Mahout is that it is inconsistent and
>>>> unfinished ("a confederation of unrelated grad-school projects", as one
>>>> put it), buggy, and hard to use except as a few copied snippets of
>>>> code. Ouch!
>>>>
>>>> Only a handful that I'm aware of actually use it. Internally, there is
>>>> a perception that there is no community attention to most of the code
>>>> (see the JIRA backlog). As a result -- software problems, community
>>>> issues, little demand -- it is almost certainly not going to be in our
>>>> next major packaging release, and was almost not in the current
>>>> forthcoming one.
>>>>
>>>> Your Reality May Vary. This seems like yellow-flag territory for an
>>>> Apache project, though, if it is representative of a wider reality. So
>>>> a conversation about whole other projects' worth of new functionality
>>>> feels quite disconnected -- red-flag territory.
>>>>
>>>> To be constructive, here are four items that seem more important for
>>>> something like "1.0.0" and are even a lot less work:
>>>>
>>>> - Use the Hadoop .mapreduce API consistently
>>>> - Standardize input/output formats of all jobs
>>>> - Remove use of deprecated code
>>>> - Clear even a third of the open JIRA backlog
>>>>
>>>> (I still think it's fine to make different projects for quite different
>>>> ideas. Hadoop has another ML project, and is about to have yet another
>>>> one. These good ideas might well belong there. Here, I think there is a
>>>> big need for shoring up if the project is even going to survive to 1.0.)
>>>>
>>>> On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sro...@gmail.com> wrote:
>>>>
>>>>> I think each of several of these points is probably, on its own,
>>>>> several times the amount of work that has been put into this project
>>>>> over the past year, so I'm wondering if this is close to realistic as
>>>>> a to-do list for 1.0 of this project.
>>>>
>>>> That is mean. I think that everything on this list is possible in
>>>> relatively short order, but let's talk goals for a bit.
>>>>
>>>> What is missing here? What really doesn't matter?
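Coming back to Sebastian's point about the computation/serving split: once
models no longer drag in Hadoop, the training side could hide the execution
engine behind an equally small interface. A rough sketch, with invented
names:

    // Decouples "what a model is" from "how it is trained": adding Spark or
    // Stratosphere support would mean adding a Trainer, not touching models.
    public interface Trainer<D, M> {
      M train(D trainingData);
    }

    // e.g. (hypothetical implementations):
    //   class InMemoryNaiveBayesTrainer
    //       implements Trainer<Iterable<Vector>, NaiveBayesModel> { ... }
    //   class MapReduceNaiveBayesTrainer
    //       implements Trainer<Path, NaiveBayesModel> { ... }

The serving layer would then depend only on the model type plus a repository
to load it, which is exactly the seam where a REST layer from another
framework could plug in.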
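On Sean's first item: we still mix the old org.apache.hadoop.mapred API with
the new org.apache.hadoop.mapreduce one. Standardizing would mean every
driver looks roughly like this (class names are placeholders, and
FileInputFormat/FileOutputFormat come from org.apache.hadoop.mapreduce.lib.*,
never from mapred):

    Configuration conf = new Configuration();
    Job job = new Job(conf, "example-job"); // Job.getInstance(conf, ...) on Hadoop 2
    job.setJarByClass(ExampleDriver.class);
    job.setMapperClass(ExampleMapper.class);   // extends mapreduce.Mapper
    job.setReducerClass(ExampleReducer.class); // extends mapreduce.Reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VectorWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    boolean succeeded = job.waitForCompletion(true);

Settling on SequenceFiles of (Writable key, VectorWritable value) as the
standard job input and output would go a long way toward his second item as
well.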