Re: [jira] [Commented] (MAHOUT-627) Parallelization of Baum-Welch Algorithm for HMM Training

Ted Dunning Thu, 24 Mar 2011 17:09:28 -0700

I would invert the order of 1 and 2 and estimate the times as 30%, 30%, and
40%. You can view this from a few points of view:


a) writing good tests is hard

b) writing good tests makes writing code much easier

c) writing any kind of decent documentation is VASTLY harder than most
people think.  It is also very important for user adoption, mentor
satisfaction, and personal marketing (for instance, a resume or portfolio).
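
The advice quoted below about separating command line argument parsing from
the execution path can be sketched roughly as follows. This is only an
illustration in plain Java: `TrainerCli`, `Options`, the `--states` and
`--iterations` flags, and their defaults are all hypothetical names, not part
of Mahout, and `run` is a placeholder loop rather than a real Baum-Welch job.

```java
/** Hypothetical driver; class, flags, and defaults are illustrative only. */
final class TrainerCli {

    /** Immutable holder for parsed options. */
    static final class Options {
        final int states;
        final int iterations;

        Options(int states, int iterations) {
            if (states <= 0 || iterations <= 0) {
                throw new IllegalArgumentException("values must be positive");
            }
            this.states = states;
            this.iterations = iterations;
        }
    }

    /** Parsing only: no I/O and no job submission, so it is easy to unit test. */
    static Options parse(String[] args) {
        int states = 2;       // assumed default
        int iterations = 10;  // assumed default
        for (int i = 0; i < args.length; i += 2) {
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("missing value for " + args[i]);
            }
            int value = Integer.parseInt(args[i + 1]);
            if ("--states".equals(args[i])) {
                states = value;
            } else if ("--iterations".equals(args[i])) {
                iterations = value;
            } else {
                throw new IllegalArgumentException("unknown flag " + args[i]);
            }
        }
        return new Options(states, iterations);
    }

    /** Execution only: placeholder loop; returns the number of passes completed. */
    static int run(Options options) {
        int completed = 0;
        for (int i = 0; i < options.iterations; i++) {
            // A real driver would launch one training pass here.
            completed++;
        }
        return completed;
    }

    public static void main(String[] args) {
        System.out.println("passes completed: " + run(parse(args)));
    }
}
```

With this split, a unit test can call `parse` with a plain string array and
`run` with a hand-built `Options`, without starting Hadoop at all.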

On Thu, Mar 24, 2011 at 4:41 PM, Dhruv Kumar <[email protected]> wrote:

> Thanks Ted, I'll start working on a proposal having the following sub tasks
> (I have given a rudimentary percent time estimate, please feel free to
> suggest alterations):
>
> 1. Implementing the BW on Map Reduce following the line of k-means. Focus on
> re-using as much of the existing k-means code as possible. (60%)
>
> 2. Unit testing the Mapper, Combiner, Reducer and testing the integration,
> in local and pseudo-distributed modes. I may be able to get access to a
> small cluster at UMass for unit testing in the real-distributed mode. (35%)
>
> 3. Writing clear documentation directing clients how to use the implemented
> library code for their needs. (5%)
>
>
>
> On Thu, Mar 24, 2011 at 6:45 PM, Ted Dunning <[email protected]>
> wrote:
>
> > On Thu, Mar 24, 2011 at 3:34 PM, Dhruv Kumar <[email protected]> wrote:
> >
> > > 2. Another very interesting possibility is to express the BW as a
> > > recursive join.  There's a very interesting offshoot of Hadoop, called
> > > Haloop (http://code.google.com/p/haloop/), which supports loop control
> > > and caching of the intermediate results on the mapper inputs, reducer
> > > inputs and reducer outputs to improve performance. The paper [1]
> > > describes this in more detail. They have implemented k-means as a
> > > recursive join.
> > >
> >
> > Until there is flexibility around execution model such as the recent
> > map-reduce 2.0 announcement from Yahoo and until that flexibility is
> > pretty much standard, it is hard to justify this.
> >
> > The exception is where such extended capabilities fit into standard
> > hadoop 0.20 environments.
> >
>
> >
> > > In either case, I want to clearly define the scope and task list. BW
> > > will be the core of the project but:
> > >
> > > 1. Does it make sense to implement the "counting method" for model
> > > discovery as well? It is clearly inferior, but would it be a good
> > > reference for comparison to the BW? Any added benefit?
> > >
> >
> > No opinion here except that increased scope decreases probability of even
> > partial success.
> >
> >
> > > 2. What has been the standard in the past GSoC Mahout projects
> > > regarding unit testing and documentation?
> > >
> >
> > Do it.
> >
> > Seriously.
> >
> > We use junit 4+ and very much prefer strong unit tests.  Nothing in what
> > you are proposing should require anything interesting in this regard.
> > Testing the mapper, combiner and reducer in isolation is good.  Testing
> > the integrated program in local mode or pseudo distributed mode should
> > suffice beyond that.  It is best if you can separate command line
> > argument parsing from execution path so that you can test them
> > separately.
> >
> > >
> > > In the meantime, I've been understanding more about Mahout, Map Reduce
> > > and Hadoop's internals. One of my course projects this semester is to
> > > implement the Bellman Iteration algorithm on Map Reduce and so far it
> > > has been coming along well.
> > >
> > > Any feedback is much appreciated.
> > >
> > > Dhruv
> > >
> >
>
