[
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146479#comment-13146479
]
Suneel Marthi commented on MAHOUT-627:
--------------------------------------
While reviewing the code in BaumWelchTrainer.java, I noticed that we have a number
of System.out.println() statements. The code needs some cleanup to replace these
with SLF4J logger calls.
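A minimal illustration of the intended kind of replacement (the class and message
below are placeholders, not the actual BaumWelchTrainer code):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class BaumWelchTrainerLoggingExample {
      private static final Logger log =
          LoggerFactory.getLogger(BaumWelchTrainerLoggingExample.class);

      void reportIteration(int iteration, double logLikelihood) {
        // before: System.out.println("Iteration " + iteration + ": " + logLikelihood);
        log.info("Iteration {}: log-likelihood = {}", iteration, logLikelihood);
      }
    }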
> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -----------------------------------------------------------------------------
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Affects Versions: 0.4, 0.5
> Reporter: Dhruv Kumar
> Assignee: Grant Ingersoll
> Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch,
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch,
> MAHOUT-627.patch
>
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov
> Model Training.
> Student Name: Dhruv Kumar
> Student E-mail: [email protected]
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Proposal Abstract:
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model
> because of its superior numerical stability and its guaranteed convergence to a
> locally maximal Maximum Likelihood Estimate, even in the presence of incomplete
> training data. Currently, Apache Mahout has only a sequential implementation of
> Baum-Welch, which cannot scale to training over large data sets. This restriction
> reduces the quality of training and constrains generalization of the learned
> model when it is used for prediction. This project proposes to extend Mahout's
> Baum-Welch to a parallel, distributed version using the Map-Reduce programming
> framework for enhanced model fitting over large data sets.
> Detailed Description:
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool
> for applications that generate temporal or spatial sequential data. Their
> relative simplicity of implementation, combined with their ability to discover
> latent domain knowledge, has made them very popular in diverse fields such as
> DNA sequence alignment, gene discovery, handwriting analysis, voice recognition,
> computer vision, language translation and parts-of-speech tagging.
> An HMM is defined as a tuple (S, O, Theta), where S is a finite set of
> unobservable, hidden states emitting symbols from a finite observable
> vocabulary set O according to a probabilistic model Theta. The parameters of
> the model Theta are defined by the tuple (A, B, Pi), where A is a stochastic
> transition matrix over the hidden states of size |S| x |S|. The elements
> a_(i,j) of A specify the probability of transitioning from state i to state
> j. Matrix B is a stochastic symbol emission matrix of size |S| x |O| whose
> elements b_(s, o) provide the probability that a symbol o will be emitted
> from the hidden state s. The elements pi_(s) of the |S| length vector Pi
> determine the probability that the system starts in the hidden state s. The
> transitions of hidden states are unobservable and follow the Markov property
> of memorylessness.
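> As a concrete illustration of the parameter tuple Theta = (A, B, Pi), a dense
> in-memory representation could look like the following minimal sketch. The class
> and field names are hypothetical and are not meant to mirror Mahout's existing
> HmmModel class.
>
>   // Hypothetical dense representation of an HMM's parameters Theta = (A, B, Pi).
>   // numStates = |S|, numSymbols = |O|; each row of A and B, and the vector Pi, sums to 1.
>   public class HmmParameters {
>     final double[][] transition; // A: |S| x |S|, transition[i][j] = P(next state j | state i)
>     final double[][] emission;   // B: |S| x |O|, emission[s][o] = P(symbol o | state s)
>     final double[]   initial;    // Pi: length |S|, initial[s] = P(first state is s)
>
>     HmmParameters(int numStates, int numSymbols) {
>       transition = new double[numStates][numStates];
>       emission = new double[numStates][numSymbols];
>       initial = new double[numStates];
>     }
>   }
>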
> Rabiner [1] defined three main problems for HMMs:
> 1. Evaluation: Given the complete model (S, O, Theta) and an observation
> sequence, determine the probability that the model generated the observed
> sequence. This is useful for evaluating the quality of the model and is
> solved using the so-called Forward algorithm (a minimal sketch follows this
> list).
> 2. Decoding: Given the complete model (S, O, Theta) and an observation
> sequence, determine the hidden state sequence that most likely generated the
> observed sequence. This can be viewed as an inference problem where the model
> and observed sequence are used to predict the values of the unobservable
> random variables. The Viterbi decoding algorithm is used for predicting the
> hidden state sequence.
> 3. Training: Given the set of hidden states S, the set of observation
> vocabulary O and the observation sequence, determine the parameters (A, B,
> Pi) of the model Theta. This problem can be viewed as a statistical machine
> learning problem of model fitting to a large set of training data. The
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and
> the Viterbi training algorithm are commonly used for model fitting.
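> For reference, a minimal, non-scaled sketch of the Forward algorithm from
> problem 1, written against the hypothetical HmmParameters class sketched above;
> a production implementation would need scaling or log-space arithmetic to avoid
> numerical underflow on long observation sequences.
>
>   // Forward algorithm: returns P(observations | model), without numerical scaling.
>   // 'observations' holds symbol indices into the columns of the emission matrix B.
>   static double forwardProbability(HmmParameters hmm, int[] observations) {
>     int numStates = hmm.initial.length;
>     double[] alpha = new double[numStates];
>     // Initialization: alpha_1(s) = pi_s * b_s(o_1)
>     for (int s = 0; s < numStates; s++) {
>       alpha[s] = hmm.initial[s] * hmm.emission[s][observations[0]];
>     }
>     // Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_(i,j)) * b_j(o_{t+1})
>     for (int t = 1; t < observations.length; t++) {
>       double[] next = new double[numStates];
>       for (int j = 0; j < numStates; j++) {
>         double sum = 0.0;
>         for (int i = 0; i < numStates; i++) {
>           sum += alpha[i] * hmm.transition[i][j];
>         }
>         next[j] = sum * hmm.emission[j][observations[t]];
>       }
>       alpha = next;
>     }
>     // Termination: P(O | Theta) = sum_s alpha_T(s)
>     double total = 0.0;
>     for (double a : alpha) {
>       total += a;
>     }
>     return total;
>   }
>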
> In general, the quality of HMM training can be improved by employing large
> training sets, but currently Mahout only supports sequential versions of HMM
> trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch
> training methods, the Baum-Welch algorithm is the more accurate and the
> better candidate for a parallel implementation, for two reasons:
> (1) The BW is numerically stable and guarantees convergence to a locally
> maximal Maximum Likelihood Estimate (MLE) of the model's parameters (Theta).
> In Viterbi training, the MLE is approximated in order to reduce computation
> time.
> (2) The BW belongs to the general class of Expectation Maximization (EM)
> algorithms, which naturally fit into the Map-Reduce framework [2], as
> illustrated by the existing Map-Reduce implementation of k-means in Mahout.
> Hence, this project proposes to extend Mahout's current sequential
> implementation of the Baum-Welch HMM trainer to a scalable, distributed case.
> Since the distributed version of the BW will use the sequential
> implementations of the Forward and the Backward algorithms to compute the
> alpha and the beta factors in each iteration, a lot of existing HMM training
> code will be reused. Specifically, the parallel implementation of the BW
> algorithm on Map-Reduce is elaborated at length in [3], which treats it as a
> specific case of the Expectation-Maximization algorithm, and that treatment
> will be followed for the implementation in this project.
> The BW EM algorithm iteratively refines the model's parameters and consists
> of two distinct steps in each iteration: Expectation and Maximization. In the
> distributed case, the Expectation step is computed by the mappers and the
> reducers, while the Maximization is handled by the reducers. Starting from an
> initial Theta^(0), in each iteration i the model parameter tuple Theta^(i) is
> input to the algorithm, and the end result Theta^(i+1) is fed to the next
> iteration i+1. The iteration stops on a user-specified convergence condition
> expressed as a fixpoint, or when the number of iterations exceeds a
> user-defined value.
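> A hedged sketch of that outer control loop follows; runEmIteration() and
> parameterDelta() are hypothetical placeholders for the proposed per-iteration
> MapReduce pass and the fixpoint test, not existing Mahout APIs.
>
>   // Outer EM control loop for the proposed distributed Baum-Welch trainer.
>   // runEmIteration() and parameterDelta() are hypothetical placeholders.
>   static HmmParameters trainUntilConverged(HmmParameters initial,
>                                            int maxIterations, double epsilon) {
>     HmmParameters theta = initial;
>     for (int i = 0; i < maxIterations; i++) {
>       HmmParameters next = runEmIteration(theta); // one distributed E + M pass
>       double delta = parameterDelta(theta, next); // largest change in any parameter
>       theta = next;
>       if (delta < epsilon) {                      // user-specified fixpoint reached
>         break;
>       }
>     }
>     return theta;
>   }
>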
> Expectation computes the posterior probability of each latent variable for
> each observed variable, weighted by the relative frequency of the observed
> variable in the input split. The mappers process independent training
> instances and emit expected state transition and emission counts using the
> Forward and Backward algorithms. The reducers finish the Expectation step by
> aggregating the expected counts. The input to a mapper consists of (k, v_o)
> pairs, where k is a unique key and v_o is a string of observed symbols. For
> each training instance, the mappers emit the same set of keys, corresponding
> to the following three optimization problems to be solved during the
> Maximization, with their values in a hash-map (a mapper sketch follows this
> list):
> (1) Expected number of times a hidden state is reached (Pi).
> (2) Number of times each observable symbol is generated by each hidden state
> (B).
> (3) Number of transitions between each pair of states in the hidden state
> space (A).
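> One possible shape of the proposed HmmMapper, assuming Hadoop's
> org.apache.hadoop.mapreduce API and a MapWritable of expected counts per key;
> everything except the Hadoop classes is a hypothetical sketch, and the
> expectation computations themselves are left abstract.
>
>   import java.io.IOException;
>   import org.apache.hadoop.io.LongWritable;
>   import org.apache.hadoop.io.MapWritable;
>   import org.apache.hadoop.io.Text;
>   import org.apache.hadoop.mapreduce.Mapper;
>
>   // Sketch: one training instance in, the same three keys out from every mapper,
>   // each carrying a map of expected counts for the corresponding M-step problem.
>   public abstract class HmmEStepMapperSketch
>       extends Mapper<LongWritable, Text, Text, MapWritable> {
>
>     @Override
>     protected void map(LongWritable key, Text value, Context context)
>         throws IOException, InterruptedException {
>       // Assumes the observation string is a space-separated list of symbol indices.
>       String[] tokens = value.toString().trim().split("\\s+");
>       int[] observations = new int[tokens.length];
>       for (int i = 0; i < tokens.length; i++) {
>         observations[i] = Integer.parseInt(tokens[i]);
>       }
>       // Shared keys let the reducers aggregate the counts per optimization problem.
>       context.write(new Text("PI"), expectedInitialCounts(observations));
>       context.write(new Text("B"), expectedEmissionCounts(observations));
>       context.write(new Text("A"), expectedTransitionCounts(observations));
>     }
>
>     // Placeholders for the Forward-Backward based expectation computations.
>     protected abstract MapWritable expectedInitialCounts(int[] observations);
>     protected abstract MapWritable expectedEmissionCounts(int[] observations);
>     protected abstract MapWritable expectedTransitionCounts(int[] observations);
>   }
>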
> The M step computes the updated Theta^(i+1) from the values generated during
> the E step. This involves aggregating the values (as hash-maps) for each key
> corresponding to one of the optimization problems. The aggregation summarizes
> the statistics necessary to compute a subset of the parameters for the next
> EM iteration. The optimal parameters for the next iteration are arrived at by
> computing the relative frequency of each event with respect to its expected
> count at the current iteration. The optimal parameters emitted by each
> reducer are written to HDFS and are fed to the mappers in the next
> iteration.
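> Setting the Hadoop plumbing aside, the core of that Maximization arithmetic is
> a relative-frequency normalization of the aggregated counts. A minimal sketch in
> plain Java (not tied to any existing Mahout class) is shown below; it applies to
> A and B alike, with the Pi vector as the one-row special case.
>
>   // M-step arithmetic: turn a matrix of aggregated expected counts into a
>   // row-stochastic parameter matrix by taking relative frequencies.
>   static double[][] normalizeRows(double[][] expectedCounts) {
>     double[][] parameters = new double[expectedCounts.length][];
>     for (int i = 0; i < expectedCounts.length; i++) {
>       double rowSum = 0.0;
>       for (double count : expectedCounts[i]) {
>         rowSum += count;
>       }
>       parameters[i] = new double[expectedCounts[i].length];
>       for (int j = 0; j < expectedCounts[i].length; j++) {
>         // Relative frequency of event j in row i; guard against an empty row.
>         parameters[i][j] = rowSum > 0.0 ? expectedCounts[i][j] / rowSum : 0.0;
>       }
>     }
>     return parameters;
>   }
>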
> The project can be subdivided into distinct tasks of programming, testing and
> documenting the driver, mapper, reducer and the combiner with the Expectation
> and Maximization parts split between them. For each of these tasks, a new
> class will be programmed, unit tested and documented within the
> org.apache.mahout.classifier.sequencelearning.hmm package. Since k-means is
> also an EM algorithm, particular attention will be paid to its code at each
> step for possible reuse.
> A list of milestones, associated deliverables and high-level implementation
> details is given below.
> Time-line: April 26 - Aug 15.
> Milestones:
> April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my
> mentor, refine the project's plan and requirements, understand the
> community's code styling requirements, and expand my knowledge of Hadoop and
> Mahout internals. Thoroughly familiarize myself with the classes within the
> classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and
> math packages.
> May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the
> class HmmDriver by extending the AbstractJob class and by reusing the code
> from the KMeansDriver class.
> June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the
> class HmmMapper. The HmmMapper class will include setup() and map() methods.
> The setup() method will read in the HmmModel and the parameter values
> obtained from the previous iteration. The map() method will call the
> HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm()
> and complete the Expectation step partially.
> July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document the
> class HmmReducer. The reducer will complete the Expectation step by summing
> over all the occurrences emitted by the mappers for the three optimization
> problems. Reuse the code from the HmmTrainer.trainBaumWelch() method if
> possible. The mid-term review also falls in this period.
> July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document
> the class HmmCombiner. The combiner will reduce network traffic and improve
> efficiency by locally aggregating the values for each of the three keys
> (one per optimization problem) before they reach the reducers for the
> Maximization stage. Look at the possibility of code reuse from the
> KMeansCombiner class.
> July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer,
> combiner and driver together. Give an example demonstrating the new parallel
> BW algorithm by employing the parts-of-speech tagger data set also used by
> the sequential BW [4]. Tidy up code and fix loose ends, finish wiki
> documentation.
> Additional Information:
> I am in the final stages of finishing my Master's degree in Electrical and
> Computer Engineering from the University of Massachusetts Amherst. Working
> under the guidance of Prof. Wayne Burleson, as part of my Master's research
> work, I have applied the theory of Markov Decision Process (MDP) to increase
> the duration of service of mobile computers. This semester I am working on
> two course projects that involve machine learning over large data sets. In the
> Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to learn the
> dependence of side chain geometry on a protein's secondary structure, and
> comparing it with the Dynamic Bayesian Network approach used in [6]. In
> another project for the Online Social Networks class, I am using
> reinforcement learning to build an online recommendation system by
> reformulating MDP optimal policy search as an EM problem [7] and employing
> Map Reduce (extending Mahout) to arrive at it in a scalable, distributed
> manner.
> I owe much to the open source community as all my research experiments have
> only been possible due to the freely available Linux distributions,
> performance analyzers, scripting languages and associated documentation.
> After joining the Apache Mahout's developer mailing list a few weeks ago, I
> have found the community extremely vibrant, helpful and welcoming. If
> selected, I feel that the GSOC 2011 project will be a great learning
> experience for me from both a technical and professional standpoint and will
> also allow me to contribute within my modest means to the overall spirit of
> open source programming and Machine Learning.
> References:
> [1] A tutorial on hidden Markov models and selected applications in speech
> recognition by Lawrence R. Rabiner. Proceedings of the IEEE, Vol. 77
> (1989), pp. 257-286.
> [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K.
> Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun.
> In NIPS (2006), pp. 281-288.
> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer.
> Morgan & Claypool 2010.
> [4] http://flexcrfs.sourceforge.net/#Case_Study
> [5] http://www.rcsb.org/pdb/home/home.do
> [6] Beyond rotamers: a generative, probabilistic model of side chains in
> proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE,
> Hamelryck T. BMC Bioinformatics. 2010 Jun 5.
> [7] Probabilistic inference for solving discrete and continuous state Markov
> Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.