[ https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015982#comment-13015982 ]
Dhruv Kumar commented on MAHOUT-627:
------------------------------------

Updated JIRA: added gsoc labels, edited the JIRA title to match the project's title.

Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training
-----------------------------------------------------------------------------

                 Key: MAHOUT-627
                 URL: https://issues.apache.org/jira/browse/MAHOUT-627
             Project: Mahout
          Issue Type: Task
          Components: Classification
    Affects Versions: 0.4
            Reporter: Dhruv Kumar
              Labels: gsoc, gsoc2011, mahout-gsoc-11

Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training
Student Name: Dhruv Kumar
Student E-mail: dku...@ecs.umass.edu
Organization/Project: Apache Mahout
Assigned Mentor:

Proposal Abstract:

The Baum-Welch algorithm is commonly used for training a Hidden Markov Model because it is numerically stable and provides a guaranteed discovery of the Maximum Likelihood Estimator in the presence of incomplete data. Currently, Apache Mahout has a sequential implementation of Baum-Welch which cannot be scaled to train over large data sets. This project proposes to extend the sequential implementation of Baum-Welch to a parallel, distributed version using the Map-Reduce programming framework to allow scalable Hidden Markov Model training.

Detailed Description:

Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool for applications generating temporal or spatial sequential data. Their relative simplicity of implementation and their ability to discover latent domain knowledge have made them very popular in fields such as DNA sequence alignment, handwriting analysis, voice recognition, computer vision and parts-of-speech tagging.

An HMM is defined as a tuple (S, O, Theta), where S is a finite set of unobservable, hidden states emitting symbols from a finite observable vocabulary set O according to a probabilistic model Theta.
The parameters of the model Theta are defined by the tuple (A, B, Pi), where A is a stochastic transition matrix of the hidden states of size |S| x |S|. The elements a_(i,j) of A specify the probability of transitioning from state i to state j. Matrix B is a stochastic symbol emission matrix of size |S| x |O| whose elements b_(s,o) give the probability that symbol o will be emitted from hidden state s. The elements pi_(s) of the |S|-length vector Pi determine the probability that the system starts in hidden state s. The transitions of hidden states are unobservable and follow the Markov property of memorylessness.

Rabiner [1] defined three main problems for HMMs:

1. Evaluation: Given the complete model (S, O, Theta) and a subset of the observation sequence, determine the probability that the model generated the observed sequence. This is useful for determining the quality of the model, and is solved using the so-called Forward algorithm.

2. Decoding: Given the complete model (S, O, Theta) and an observation sequence, determine the hidden state sequence which generated the observed sequence. This can be viewed as an inference problem where the model and observed sequence are used to predict the values of the unobservable random variables. The Viterbi decoding algorithm is used for predicting the hidden state sequence.

3. Training: Given the set of hidden states S, the observation vocabulary O and the observation sequence, determine the parameters (A, B, Pi) of the model Theta. This problem can be viewed as the statistical machine learning problem of fitting a model to a large set of training data. The Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi training algorithm are commonly used for model fitting.
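To make the evaluation problem concrete, here is a minimal, self-contained sketch of the Forward algorithm over the (A, B, Pi) parameterization above. The class and method names are illustrative assumptions, not Mahout's actual API.

```java
public class ForwardSketch {

    /**
     * Computes P(observations | model) by dynamic programming.
     * a[i][j] = probability of transitioning from state i to state j
     * b[i][o] = probability that state i emits symbol o
     * pi[i]   = probability of starting in state i
     * obs[t]  = index of the symbol observed at time t
     */
    static double forward(double[][] a, double[][] b, double[] pi, int[] obs) {
        int numStates = pi.length;
        double[] alpha = new double[numStates];

        // Initialization: alpha_0(i) = pi_i * b_i(obs_0)
        for (int i = 0; i < numStates; i++) {
            alpha[i] = pi[i] * b[i][obs[0]];
        }

        // Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(obs_t)
        for (int t = 1; t < obs.length; t++) {
            double[] next = new double[numStates];
            for (int j = 0; j < numStates; j++) {
                double sum = 0.0;
                for (int i = 0; i < numStates; i++) {
                    sum += alpha[i] * a[i][j];
                }
                next[j] = sum * b[j][obs[t]];
            }
            alpha = next;
        }

        // Termination: P(obs | model) = sum_i alpha_T(i)
        double p = 0.0;
        for (double v : alpha) {
            p += v;
        }
        return p;
    }

    public static void main(String[] args) {
        // Toy 2-state, 2-symbol model.
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.9, 0.1}, {0.2, 0.8}};
        double[] pi = {0.5, 0.5};
        System.out.println(forward(a, b, pi, new int[]{0, 1, 0}));
    }
}
```

A production implementation would work in log space (or rescale alpha at each step) to avoid underflow on long observation sequences; this sketch omits that for clarity.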
In general, the quality of HMM training can be improved by employing large training vectors, but currently Mahout only supports sequential versions of HMM trainers, which are incapable of scaling. Between the Viterbi and the Baum-Welch training methods, the Baum-Welch algorithm is slower but more accurate, and a better candidate for a parallel implementation, for two reasons:

(1) BW is more numerically stable and provides a guaranteed local maximum of the Maximum Likelihood Estimate (MLE) of the model's parameters (Theta). In Viterbi training, the MLE is approximated in order to reduce computation time.

(2) BW belongs to the general class of Expectation-Maximization (EM) algorithms, which naturally fit into the Map-Reduce framework [2].

Hence, this project proposes to extend Mahout's current sequential implementation of the Baum-Welch HMM trainer to the scalable, distributed case. Since the distributed version of BW will use the sequential implementations of the Forward and the Backward algorithms to compute the alpha and the beta factors in each iteration, a lot of existing HMM training code will be reused. Specifically, the parallel implementation of the BW algorithm on Map-Reduce has been elaborated at great length in [3], which views it as a specific case of the Expectation-Maximization algorithm, and will be followed for the implementation in this project.

The BW EM algorithm iteratively refines the model's parameters, and consists of two distinct steps in each iteration: Expectation and Maximization. In the distributed case, the Expectation step is computed by the mappers and the reducers, while the Maximization is handled by the reducers. Starting from an initial Theta^(0), in each iteration i the model parameter tuple Theta^(i) is input to the algorithm, and the end result Theta^(i+1) is fed to the next iteration i+1.
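The iterative refinement scheme just described can be sketched, Hadoop-free and on a single machine, as a loop that feeds Theta^(i) through one E+M pass to produce Theta^(i+1) until the parameters reach a fixed point or an iteration cap is hit. The class name, the placeholder update rule and the single-machine structure are all illustrative assumptions; the real job would chain Map-Reduce passes per iteration.

```java
public class EmLoopSketch {

    /** Placeholder for one full E+M pass; here it just damps toward 0.5. */
    static double[] emIteration(double[] theta) {
        double[] next = new double[theta.length];
        for (int i = 0; i < theta.length; i++) {
            next[i] = 0.5 * (theta[i] + 0.5); // stand-in for re-estimation
        }
        return next;
    }

    static double[] train(double[] theta, double epsilon, int maxIterations) {
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] next = emIteration(theta);
            // Fixed-point test: stop when the largest parameter change
            // drops below epsilon.
            double delta = 0.0;
            for (int i = 0; i < theta.length; i++) {
                delta = Math.max(delta, Math.abs(next[i] - theta[i]));
            }
            theta = next;
            if (delta < epsilon) {
                break;
            }
        }
        return theta;
    }

    public static void main(String[] args) {
        // Converges to {0.5, 0.5}, the fixed point of the placeholder update.
        double[] trained = train(new double[]{0.9, 0.1}, 1e-6, 100);
        System.out.println(java.util.Arrays.toString(trained));
    }
}
```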
The iteration stops on a user-specified convergence condition expressed as a fixed point, or when the number of iterations exceeds a user-defined value.

Expectation computes the posterior probability of each latent variable for each observed variable, weighted by the relative frequency of the observed variable in the input split. The mappers process independent training instances and emit expected state transition and emission counts using the forward-backward algorithm. The reducers finish the Expectation step by aggregating the expected counts. The input to a mapper consists of (k, v_o) pairs, where k is a unique key and v_o is a string of observed symbols. For each training instance, the mappers emit the same set of keys, corresponding to the following three optimization problems to be solved during the Maximization, with their values in a hash-map:

(1) Expected number of times a hidden state is reached (Pi).
(2) Number of times each observable symbol is generated by each hidden state (B).
(3) Number of transitions between each pair of states in the hidden state space (A).

The M step computes the updated Theta^(i+1) from the values generated during the E part. This involves aggregating the values (as hash-maps) for each key corresponding to one of the optimization problems. The aggregation summarizes the statistics necessary to compute a subset of the parameters for the next EM iteration. The optimal parameters for the next iteration are arrived at by computing the relative frequency of each event with respect to its expected count at the current iteration. The optimal parameters emitted by each reducer are written to HDFS and are fed to the mappers in the next iteration.

The project can be subdivided into the distinct tasks of programming, testing and documenting the driver, mapper, reducer and combiner, with the Expectation and Maximization parts split between them.
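The relative-frequency re-estimation at the heart of the M step can be sketched as follows: given expected counts aggregated by a reducer, each row is normalized into a probability distribution. The class and method names are illustrative assumptions, not Mahout's API.

```java
public class MaximizationSketch {

    /** Normalizes each row of an expected-count matrix so it sums to 1. */
    static double[][] normalizeRows(double[][] counts) {
        double[][] probs = new double[counts.length][];
        for (int i = 0; i < counts.length; i++) {
            double rowSum = 0.0;
            for (double c : counts[i]) {
                rowSum += c;
            }
            probs[i] = new double[counts[i].length];
            for (int j = 0; j < counts[i].length; j++) {
                // Relative frequency of event j with respect to the total
                // expected count for state i at the current iteration.
                probs[i][j] = counts[i][j] / rowSum;
            }
        }
        return probs;
    }

    public static void main(String[] args) {
        // Expected transition counts for a toy 2-state model, as a reducer
        // might have aggregated them from the mappers' E-step output.
        double[][] transitionCounts = {{6.0, 2.0}, {1.0, 3.0}};
        double[][] newA = normalizeRows(transitionCounts);
        // Each row of newA is now a stochastic distribution.
        System.out.println(java.util.Arrays.deepToString(newA));
    }
}
```

The same normalization applies to all three key groups: per-state transition counts yield A, per-state emission counts yield B, and initial-state counts yield Pi.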
For each of these tasks, a new class will be programmed, unit tested and documented within the org.apache.mahout.classifier.sequencelearning.hmm package. A list of milestones, the associated deliverables and high-level implementation details is given below.

Time-line: April 26 - Aug 15.

Milestones:

April 26 - May 22 (4 weeks): This is the pre-coding stage. Open communication with my mentor, refine the project's plan and requirements, understand the community's code styling requirements, and expand my knowledge of Hadoop and Mahout internals. Become thoroughly familiar with the classes within the classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and math packages. Write unit tests against the existing Mahout code.

May 23 - June 3 (2 weeks): Implement, test and document the class HmmDriver by extending the AbstractJob class and by reusing code from the KMeansDriver class.

June 3 - July 1 (4 weeks): Implement, test and document the class HmmMapper. The HmmMapper class will include setup() and map() methods. The setup() method will read in the HmmModel and the parameter values obtained from the previous iteration. The map() method will call HmmAlgorithms.backwardAlgorithm() and HmmAlgorithms.forwardAlgorithm() and partially complete the Expectation step.

July 1 - July 15 (2 weeks): Implement, test and document the class HmmReducer. The reducer will complete the Expectation step by summing over all the occurrences emitted by the mappers for the three optimization problems. Reuse code from the HmmTrainer.trainBaumWelch() method if possible.

July 15 - July 29 (2 weeks): Implement, test and document the class HmmCombiner. The combiner will reduce network traffic and improve efficiency by aggregating the values for each of the three keys corresponding to the optimization problems for the Maximization stage in the reducers.
Look at the possibility of code reuse from the KMeansCombiner class.

July 29 - August 15 (2 weeks): Test the mapper, reducer, combiner and driver together. Give an example demonstrating the new parallel BW algorithm by employing the parts-of-speech tagger data set also used by the sequential BW [4]. Tidy up the code, fix loose ends, and finish the wiki documentation.

Additional Information:

I am in the final stages of finishing my Master's degree in Electrical and Computer Engineering at the University of Massachusetts Amherst. Working under the guidance of Prof. Wayne Burleson as part of my Master's research work, I have applied the theory of Markov Decision Processes (MDPs) to increase the duration of service of mobile computers. This semester I am involved with two course projects involving machine learning over large data sets. In a Bioinformatics class I am mining the RCSB Protein Data Bank to learn the dependence of side chain geometry on a protein's secondary structure. In another project, for the Online Social Networks class, I am building an online recommendation system using MDPs.

I owe much to the open source community, as all my research experiments have only been possible due to the freely available Linux distributions, performance analyzers, scripting languages and documentation. Since joining the Apache dev mailing list, I have found Apache Mahout's developer community vibrant, helpful and welcoming. If selected, I feel that the GSoC 2011 project will be a great learning experience for me from both a technical and a professional standpoint, and will also allow me to contribute, within my modest means, to the overall spirit of open source programming.

References:

[1] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.
[2] Cheng T. Chu, Sang K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In NIPS (2006), pp. 281-288.
[3] Jimmy Lin, Chris Dyer. Data-Intensive Text Processing with MapReduce. Morgan & Claypool, 2010.
[4] http://flexcrfs.sourceforge.net/#Case_Study

--
This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira