Great step, thanks Frank

> On Mar 1, 2014, at 10:29 AM, Frank Scholten <fr...@frankscholten.nl> wrote:
> 
> I got inspired by the discussion, so I took a first step in reducing Hadoop
> dependencies in the Naive Bayes code.
> 
> See my Github branch:
> https://github.com/frankscholten/mahout/tree/naivebayes-modelrepository
> 
> I introduced a repository class for reading and writing the NaiveBayesModel
> to and from HDFS.
> 
> Turns out we store the model in 2 ways: in an HDFS folder structure and in
> an HDFS file. The code I added makes this explicit.
> 
> In this branch NaiveBayesModel only depends on Vector, Matrix and
> Preconditions but no longer on Hadoop.
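> 
> To sketch the idea (illustrative names; the branch may differ slightly):
> 
>   import java.io.IOException;
>   import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
> 
>   // Only repository implementations touch the Hadoop API; the model itself
>   // stays a plain object built from Vectors and Matrices.
>   public interface NaiveBayesModelRepository {
>     NaiveBayesModel read() throws IOException;
>     void write(NaiveBayesModel model) throws IOException;
>   }
> 
> Each storage layout then gets its own implementation, e.g. a folder-based
> and a file-based HDFS repository, and only those classes import
> org.apache.hadoop.fs.FileSystem and Path.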
> 
> If we apply this approach to the other models in Mahout we could get
> rid of a lot of Hadoop dependencies.
> 
> Frank
> 
> 
> On Sat, Mar 1, 2014 at 5:32 PM, Frank Scholten <fr...@frankscholten.nl> wrote:
> 
>> On Sat, Mar 1, 2014 at 2:05 PM, Sebastian Schelter <s...@apache.org> wrote:
>> 
>>> Hi,
>>> 
>>> I think this is an important discussion to have and it's good that we are
>>> having it. I wish I could say otherwise, but I have encountered a lot of the
>>> impressions that Sean mentioned. To be honest, I don't see Mahout being
>>> ready to move to 1.0 in its current state.
>>> 
>>> I still see our main problem as failing to provide viable documentation
>>> and guidance to users. We cleaned up the wiki, but this is only a first
>>> step. I feel that it is extremely hard for people to use the majority of our
>>> algorithms unless they understand the mathematical details and are
>>> willing to dig through the source code. I think Mahout contains a lot of
>>> "hidden gems" that make it unique (e.g. Cooccurrence Analysis with
>>> RowSimilarityJob, LDA with CVB, SSVD+PCA), but for the majority of users
>>> these gems are out of reach.
>>> 
>>> Another important aspect is that machine learning on MapReduce will
>>> vanish very soon, and there is no vision yet for moving Mahout to more
>>> suitable platforms.
>> 
>> 
>> Before we can even work on supporting other platforms we have to handle
>> the Hadoop dependencies in the codebase. Perhaps we can start to slowly but
>> surely reduce the dependencies on Hadoop or at least contain them by adding
>> more abstraction. Only MR code should be using the Hadoop API IMO.
>> 
>> For example, many classes depend on Hadoop only for serializing and
>> deserializing models. Perhaps we can introduce a model storage interface
>> that models are written to and read from, with implementations for HDFS,
>> the local filesystem, or perhaps even a remote API. Take NeuralNetwork for
>> instance: it depends on Hadoop solely for reading and writing the model
>> to and from HDFS.
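>> 
>> A generic form of that interface could look like this (hypothetical names,
>> just to illustrate the direction, not existing Mahout API):
>> 
>>   import java.io.IOException;
>> 
>>   // Models depend only on this interface, never on org.apache.hadoop.*;
>>   // the choice of storage becomes a constructor argument.
>>   public interface ModelStore<M> {
>>     M read() throws IOException;
>>     void write(M model) throws IOException;
>>   }
>> 
>> NeuralNetwork would then accept a ModelStore<NeuralNetwork> instead of a
>> Hadoop Path, with an HDFS-backed implementation next to a local-filesystem
>> one for tests and small deployments.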
>> 
>> 
>>> I think our lack of documentation causes a lack of users, which stalls
>>> development and, together with the emergence of other platforms like Spark,
>>> makes it hard for us to attract new people.
>> 
>> 
>> Here is a radical idea: how about creating reference documentation, i.e. a
>> single PDF or HTML document? This could be generated with a Maven Docbook
>> plugin. If the docs are part of the code and generated, users can contribute
>> patches to the documentation because it sits alongside the source code. We
>> might even be able to generate algorithm characteristics (sequential, MR)
>> from the source code using a script, perhaps through annotations. We could
>> move the current wiki docs into the project and keep wiki pages only for
>> logistical project information about Mahout and Apache.
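>> 
>> As a sketch of the annotation idea (nothing like this exists yet; the names
>> are made up):
>> 
>>   import java.lang.annotation.ElementType;
>>   import java.lang.annotation.Retention;
>>   import java.lang.annotation.RetentionPolicy;
>>   import java.lang.annotation.Target;
>> 
>>   // A build-time doc generator could scan for this annotation and emit a
>>   // table of algorithm characteristics into the reference documentation.
>>   @Retention(RetentionPolicy.RUNTIME)
>>   @Target(ElementType.TYPE)
>>   public @interface MahoutAlgorithm {
>>     enum Execution { SEQUENTIAL, MAPREDUCE }
>>     Execution[] value();
>>   }
>> 
>> An algorithm driver would then carry something like
>> @MahoutAlgorithm({MahoutAlgorithm.Execution.SEQUENTIAL,
>> MahoutAlgorithm.Execution.MAPREDUCE}).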
>> 
>> Let me know what you think. I can create tickets for these two issues if
>> there is enough interest.
>> 
>> 
>>> 
>>> I must say that I think the architecture of Oryx is really what I
>>> would envision for Mahout: provide a computation layer for training models
>>> and a serving layer with a REST API or Solr for deploying them, and then
>>> abstract the training in the computation layer to enable training
>>> in-memory, with Hadoop, Spark, Stratosphere, you name it. I was very
>>> emotional when we had the discussion after Oryx was announced as a separate
>>> project, because I felt that this is what Mahout should have become.
>> 
>> If Mahout has a well-designed Java API, a REST layer can easily be added
>> via other frameworks.
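>> 
>> For example, a JAX-RS resource wrapping an existing Recommender is roughly
>> this much code (a sketch, assuming the taste Recommender API; the resource
>> class itself is made up):
>> 
>>   import java.util.List;
>>   import javax.ws.rs.GET;
>>   import javax.ws.rs.Path;
>>   import javax.ws.rs.PathParam;
>>   import javax.ws.rs.Produces;
>>   import javax.ws.rs.core.MediaType;
>>   import org.apache.mahout.cf.taste.common.TasteException;
>>   import org.apache.mahout.cf.taste.recommender.RecommendedItem;
>>   import org.apache.mahout.cf.taste.recommender.Recommender;
>> 
>>   // Exposes recommendations over HTTP; Mahout itself stays a plain Java API.
>>   @Path("/recommend")
>>   public class RecommenderResource {
>>     private final Recommender recommender;
>> 
>>     public RecommenderResource(Recommender recommender) {
>>       this.recommender = recommender;
>>     }
>> 
>>     @GET
>>     @Path("/{userId}")
>>     @Produces(MediaType.APPLICATION_JSON)
>>     public List<RecommendedItem> recommend(@PathParam("userId") long userId)
>>         throws TasteException {
>>       return recommender.recommend(userId, 10);
>>     }
>>   }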
>> 
>> Frank
>> 
>> 
>>> Just my 2 cents,
>>> Sebastian
>>> 
>>> 
>>>> On 02/28/2014 10:56 AM, Sean Owen wrote:
>>>> 
>>>> OK, your defeatism is my realism. Why has Negative Nancy intruded on
>>>> this conversation?
>>>> 
>>>> I have a view into many large Hadoop users. The feedback from the
>>>> minority that have tried Mahout is that it is inconsistent/unfinished
>>>> ("a confederation of unrelated grad-school projects" as one put it),
>>>> buggy, and hard to use except as a few copied snippets of code. Ouch!
>>>> 
>>>> Only a handful that I'm aware of actually use it. Internally, there is
>>>> a perception that there is no community attention to most of the code
>>>> (see JIRA backlog). As a result -- software problems, community
>>>> issues, little demand -- it is almost certainly not going to be in our
>>>> next major packaging release, and was almost not in the current
>>>> forthcoming one.
>>>> 
>>>> Your Reality May Vary. This seems like yellow-flag territory for an
>>>> Apache project though, if this is representative of a wider reality.
>>>> So a conversation about whole other projects' worth of new
>>>> functionality feels quite disconnected -- red-flag territory.
>>>> 
>>>> To be constructive, here are four items that seem more important for
>>>> something like "1.0.0" and are even a lot less work:
>>>> 
>>>> - Use Hadoop .mapreduce API consistently (see the sketch after this list)
>>>> - Standardize input output formats of all jobs
>>>> - Remove use of deprecated code
>>>> - Clear even a third of the open JIRA backlog
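>>>> 
>>>> For the first item, that means extending org.apache.hadoop.mapreduce.Mapper
>>>> everywhere instead of implementing the deprecated org.apache.hadoop.mapred
>>>> interfaces, e.g. (illustrative class, not existing Mahout code):
>>>> 
>>>>   import java.io.IOException;
>>>>   import org.apache.hadoop.io.LongWritable;
>>>>   import org.apache.hadoop.io.Text;
>>>>   import org.apache.hadoop.mapreduce.Mapper;
>>>> 
>>>>   // New-style mapper: Context replaces the OutputCollector/Reporter pair
>>>>   // from the old org.apache.hadoop.mapred API.
>>>>   public class TokenCountMapper extends
>>>>       Mapper<LongWritable, Text, Text, LongWritable> {
>>>>     private static final LongWritable ONE = new LongWritable(1);
>>>> 
>>>>     @Override
>>>>     protected void map(LongWritable key, Text value, Context context)
>>>>         throws IOException, InterruptedException {
>>>>       context.write(new Text(value.toString().trim().toLowerCase()), ONE);
>>>>     }
>>>>   }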
>>>> 
>>>> (I still think it's fine to make different projects for quite
>>>> different ideas. Hadoop has another ML project, and is about to have
>>>> yet another ML project. These good ideas might well belong better
>>>> there. Here, I think there is a big need for shoring up if it's even
>>>> going to survive to 1.0.)
>>>> 
>>>> On Thu, Feb 27, 2014 at 5:25 PM, Sean Owen <sro...@gmail.com> wrote:
>>>> 
>>>>> I think each of several other of these points is probably, on its own,
>>>>> several times the amount of work that has been put into this project over
>>>>> the past year, so I'm wondering if this is close to realistic as a to-do
>>>>> list for 1.0 of this project.
>>>> That is mean. I think that everything on this list is possible in
>>>> relatively short order, but let's talk goals for a bit.
>>>> 
>>>> What is missing here?  What really doesn't matter?
>> 
