[ https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015982#comment-13015982 ]
Dhruv Kumar commented on MAHOUT-627:
------------------------------------

Updated JIRA: added gsoc labels, edited the JIRA title to match the project's title.

Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training
-----------------------------------------------------------------------------

                 Key: MAHOUT-627
                 URL: https://issues.apache.org/jira/browse/MAHOUT-627
             Project: Mahout
          Issue Type: Task
          Components: Classification
    Affects Versions: 0.4
            Reporter: Dhruv Kumar
              Labels: gsoc, gsoc2011, mahout-gsoc-11

Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training
Student Name: Dhruv Kumar
Student E-mail: dku...@ecs.umass.edu
Organization/Project: Apache Mahout
Assigned Mentor:

Proposal Abstract:

The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of Baum-Welch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework to allow scalable Hidden Markov Model training.

Detailed Description:

Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.

An HMM is defined as a tuple (S, O, Theta), where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta.
The parameters of the model Theta are defined by the tuple (A, B, Pi), where A is a stochastic transition matrix of the hidden states of size |S| x |S|. The elements a_(i,j) of A specify the probability of transitioning from state i to state j. Matrix B is a stochastic symbol emission matrix of size |S| x |O| whose elements b_(s,o) give the probability that symbol o will be emitted from hidden state s. The elements pi_(s) of the |S|-length vector Pi determine the probability that the system starts in hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

Rabiner [1] defined three main problems for HMMs:

1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model, and is solved using the so-called Forward algorithm.

2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the values of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence.

3. Training: Given the set of hidden states S, the observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as the statistical machine learning problem of fitting a model to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.
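To make the evaluation problem concrete, here is a minimal, self-contained sketch of the Forward algorithm over the (A, B, Pi) parameterization above. The class and method names are illustrative assumptions, not Mahout's actual API.

```java
public class ForwardSketch {

    /**
     * Computes P(observations | model) by dynamic programming.
     * a[i][j] = probability of transitioning from state i to state j
     * b[i][o] = probability that state i emits symbol o
     * pi[i]   = probability of starting in state i
     * obs[t]  = index of the symbol observed at time t
     */
    static double forward(double[][] a, double[][] b, double[] pi, int[] obs) {
        int numStates = pi.length;
        double[] alpha = new double[numStates];

        // Initialization: alpha_0(i) = pi_i * b_i(obs_0)
        for (int i = 0; i < numStates; i++) {
            alpha[i] = pi[i] * b[i][obs[0]];
        }

        // Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(obs_t)
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[numStates];
            for (int j = 0; j < numStates; j++) {
                double sum = 0.0;
                for (int i = 0; i < numStates; i++) {
                    sum += alpha[i] * a[i][j];
                }
                next[j] = sum * b[j][obs[t]];
            }
            alpha = next;
        }

        // Termination: P(obs | model) = sum_i alpha_T(i)
        double p = 0.0;
        for (double v : alpha) {
            p += v;
        }
        return p;
    }

    public static void main(String[] args) {
        // Toy 2-state, 2-symbol model.
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.9, 0.1}, {0.2, 0.8}};
        double[] pi = {0.5, 0.5};
        System.out.println(forward(a, b, pi, new int[]{0, 1, 0}));
    }
}
```

A production implementation would work in log space (or rescale alpha at each step) to avoid underflow on long observation sequences; this sketch omits that for clarity.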
In general, the quality of HMM training can be improved by employing large training vectors, but currently Mahout only supports sequential versions of HMM trainers, which are incapable of scaling. Between the Viterbi and the Baum-Welch training methods, the Baum-Welch algorithm is slower but more accurate, and a better candidate for a parallel implementation, for two reasons:

(1) BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimate (MLE) of the model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.

(2) BW belongs to the general class of Expectation-Maximization (EM) algorithms, which naturally fit into the Map-Reduce framework [2].

Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to the scalable, distributed case. Since the distributed version of BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map-Reduce has been elaborated at great length in [3], which views it as a specific case of the Expectation-Maximization algorithm, and will be followed for the implementation in this project.

The BW EM algorithm iteratively refines the model's parameters, and consists of two distinct steps in each iteration: Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i the model parameter tuple Theta^(i) is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1.
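The iterative refinement scheme just described can be sketched, Hadoop-free and on a single machine, as a loop that feeds Theta^(i) through one E+M pass to produce Theta^(i+1) until the parameters reach a fixed point or an iteration cap is hit. The class name, the placeholder update rule and the single-machine structure are all illustrative assumptions; the real job would chain Map-Reduce passes per iteration.

```java
public class EmLoopSketch {

    /** Placeholder for one full E+M pass; here it just damps toward 0.5. */
    static double[] emIteration(double[] theta) {
        double[] next = new double[theta.length];
        for (int i = 0; i < theta.length; i++) {
            next[i] = 0.5 * (theta[i] + 0.5); // stand-in for re-estimation
        }
        return next;
    }

    static double[] train(double[] theta, double epsilon, int maxIterations) {
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = emIteration(theta);
            // Fixed-point test: stop when the largest parameter change
            // drops below epsilon.
            double delta = 0.0;
            for (int i = 0; i < theta.length; i++) {
                delta = Math.max(delta, Math.abs(next[i] - theta[i]));
            }
            theta = next;
            if (delta < epsilon) {
                break;
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        // Converges to {0.5, 0.5}, the fixed point of the placeholder update.
        double[] trained = train(new double[]{0.9, 0.1}, 1e-6, 100);
        System.out.println(java.util.Arrays.toString(trained));
    }
}
```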
The iteration stops on a user-specified convergence condition expressed as a fixed point, or when the number of iterations exceeds a user-defined value.

Expectation computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish the Expectation step by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs, where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys, corresponding to the following three optimization problems to be solved during the Maximization, with their values in a hash-map:

(1) Expected number of times a hidden state is reached (Pi).
(2) Number of times each observable symbol is generated by each hidden state (B).
(3) Number of transitions between each pair of states in the hidden state space (A).

The M step computes the updated Theta^(i+1) from the values generated during the E part. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and are fed to the mappers in the next iteration.

The project can be subdivided into the distinct tasks of programming, testing and documenting the driver, mapper, reducer and combiner, with the Expectation and Maximization parts split between them.
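The relative-frequency re-estimation at the heart of the M step can be sketched as follows: given expected counts aggregated by a reducer, each row is normalized into a probability distribution. The class and method names are illustrative assumptions, not Mahout's API.

```java
public class MaximizationSketch {

    /** Normalizes each row of an expected-count matrix so it sums to 1. */
    static double[][] normalizeRows(double[][] counts) {
        double[][] probs = new double[counts.length][];
        for (int i = 0; i < counts.length; i++) {
            double rowSum = 0.0;
            for (double c : counts[i]) {
                rowSum += c;
            }
            probs[i] = new double[counts[i].length];
            for (int j = 0; j < counts[i].length; j++) {
                // Relative frequency of event j with respect to the total
                // expected count for state i at the current iteration.
                probs[i][j] = counts[i][j] / rowSum;
            }
        }
        return probs;
    }

    public static void main(String[] args) {
        // Expected transition counts for a toy 2-state model, as a reducer
        // might have aggregated them from the mappers' E-step output.
        double[][] transitionCounts = {{6.0, 2.0}, {1.0, 3.0}};
        double[][] newA = normalizeRows(transitionCounts);
        // Each row of newA is now a stochastic distribution.
        System.out.println(java.util.Arrays.deepToString(newA));
    }
}
```

The same normalization applies to all three key groups: per-state transition counts yield A, per-state emission counts yield B, and initial-state counts yield Pi.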
For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, the associated deliverables and high-level implementation details is given below.

Time-line: April 26 - Aug 15.

Milestones:

April 26 - May 22 (4 weeks): This is the pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and expand my knowledge of Hadoop and Mahout internals. Become thoroughly familiar with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the existing Mahout code.

May 23 - June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing code from the KMeansDriver class.

June 3 - July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call HmmAlgorithms.backwardAlgorithm() and HmmAlgorithms.forwardAlgorithm() and partially complete the Expectation step.

July 1 - July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse code from the HmmTrainer.trainBaumWelch() method if possible.

July 15 - July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to the optimization problems for the Maximization stage in the reducers.
Look at the possibility of code reuse from the KMeansCombiner class.

July 29 - August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up the code, fix loose ends, and finish the wiki documentation.

Additional Information:

I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson as part of my Master's research work, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on a protein's secondary structure. In another project, for the Online Social Networks class, I am building an online recommendation system using MDPs.

I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSoC 2011 project will be a great learning experience for me from both a technical and a professional standpoint, and will also allow me to contribute, within my modest means, to the overall spirit of open source programming.

References:

[1] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.
[2] Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In NIPS (2006), pp. 281-288.
[3] Jimmy Lin, Chris Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool, 2010.
[4] http://flexcrfs.sourceforge.net/#Case_Study

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira