[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated MAHOUT-627:
-----------------------------------

    Fix Version/s:     (was: 0.8)
                   0.9
    
> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-627
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-627
>             Project: Mahout
>          Issue Type: Task
>          Components: Classification
>    Affects Versions: 0.4, 0.5
>            Reporter: Dhruv Kumar
>            Assignee: Grant Ingersoll
>              Labels: gsoc, gsoc2011, mahout-gsoc-11
>             Fix For: 0.9
>
>         Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: dku...@ecs.umass.edu 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model
> because of its superior numerical stability and its guarantee of finding a
> locally maximal Maximum Likelihood Estimate in the presence of incomplete
> training data. Currently, Apache Mahout has only a sequential implementation
> of Baum-Welch, which cannot be scaled to train over large data sets. This
> restriction reduces the quality of training and constrains generalization of
> the learned model when used for prediction. This project proposes to extend
> Mahout's Baum-Welch implementation to a parallel, distributed version using
> the Map-Reduce programming framework for enhanced model fitting over large
> data sets.
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool
> for applications generating temporal or spatial sequential data. Their
> relative simplicity of implementation, combined with their ability to
> discover latent domain knowledge, has made them very popular in diverse
> fields such as DNA sequence alignment, gene discovery, handwriting analysis,
> voice recognition, computer vision, language translation and part-of-speech
> tagging.
> An HMM is defined as a tuple (S, O, Theta) where S is a finite set of
> unobservable, hidden states emitting symbols from a finite observable
> vocabulary set O according to a probabilistic model Theta. The parameters of
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic
> transition matrix of the hidden states of size |S| x |S|. The elements
> a_(i,j) of A specify the probability of transitioning from state i to state
> j. Matrix B is a stochastic symbol emission matrix of size |S| x |O| whose
> elements b_(s,o) give the probability that symbol o will be emitted from
> hidden state s. The elements pi_(s) of the |S|-length vector Pi determine
> the probability that the system starts in hidden state s. The transitions of
> hidden states are unobservable and follow the Markov property of
> memorylessness.
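> As a concrete illustration of these definitions (a toy example, not code
> taken from Mahout), a two-state, two-symbol HMM can be written down directly
> as the tuple (A, B, Pi); the only requirement is that every row of A and B,
> and the vector Pi itself, sums to one:
>
>     // Illustrative two-state, two-symbol HMM; the concrete numbers are
>     // arbitrary, only the stochastic constraints matter.
>     public class ToyHmm {
>       // A: |S| x |S| transition matrix, a_(i,j) = P(state j | state i).
>       static final double[][] A = { {0.7, 0.3}, {0.4, 0.6} };
>       // B: |S| x |O| emission matrix, b_(s,o) = P(symbol o | state s).
>       static final double[][] B = { {0.8, 0.2}, {0.3, 0.7} };
>       // Pi: initial hidden state distribution of length |S|.
>       static final double[] PI = {0.5, 0.5};
>
>       public static void main(String[] args) {
>         for (double[] row : A) checkStochastic(row);
>         for (double[] row : B) checkStochastic(row);
>         checkStochastic(PI);
>         System.out.println("Toy HMM parameters are valid.");
>       }
>
>       // Every row of A and B, and Pi itself, must sum to one.
>       static void checkStochastic(double[] row) {
>         double sum = 0.0;
>         for (double p : row) sum += p;
>         if (Math.abs(sum - 1.0) > 1e-9) {
>           throw new IllegalStateException("Row does not sum to 1: " + sum);
>         }
>       }
>     }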
> Rabiner [1] defined three main problems for HMMs:
> 1. Evaluation: Given the complete model (S, O, Theta) and an observation
> sequence, determine the probability that the model generated the observed
> sequence. This is useful for evaluating the quality of the model and is
> solved using the so-called Forward algorithm (a sketch of the forward
> recursion is given after this list).
> 2. Decoding: Given the complete model (S, O, Theta) and an observation
> sequence, determine the hidden state sequence which most likely generated
> the observed sequence. This can be viewed as an inference problem where the
> model and the observed sequence are used to predict the values of the
> unobservable random variables. The Viterbi decoding algorithm, a dynamic
> programming procedure, is used for predicting the hidden state sequence.
> 3. Training: Given the set of hidden states S, the observation vocabulary O
> and one or more observation sequences, determine the parameters (A, B, Pi)
> of the model Theta. This can be viewed as a statistical machine learning
> problem of fitting a model to a large set of training data. The Baum-Welch
> (BW) algorithm (also called the Forward-Backward algorithm) and the Viterbi
> training algorithm are commonly used for model fitting.
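> For concreteness, an unscaled version of the forward recursion for the
> evaluation problem is sketched below; Mahout's existing sequential
> implementation lives in the HmmAlgorithms class, so this sketch is only
> meant to illustrate the recurrence
> alpha_t(j) = (sum_i alpha_(t-1)(i) * a_(i,j)) * b_(j, o_t):
>
>     // Unscaled forward algorithm: returns P(observations | A, B, Pi).
>     // A is |S|x|S|, B is |S|x|O|, pi has length |S|; obs holds symbol
>     // indices into the vocabulary O.
>     static double forward(double[][] A, double[][] B, double[] pi,
>                           int[] obs) {
>       int numStates = pi.length;
>       double[] alpha = new double[numStates];
>       // Initialization: alpha_1(i) = pi_i * b_(i, o_1).
>       for (int i = 0; i < numStates; i++) {
>         alpha[i] = pi[i] * B[i][obs[0]];
>       }
>       // Induction over the remaining observations.
>       for (int t = 1; t < obs.length; t++) {
>         double[] next = new double[numStates];
>         for (int j = 0; j < numStates; j++) {
>           double sum = 0.0;
>           for (int i = 0; i < numStates; i++) {
>             sum += alpha[i] * A[i][j];
>           }
>           next[j] = sum * B[j][obs[t]];
>         }
>         alpha = next;
>       }
>       // Termination: P(O | model) = sum_i alpha_T(i).
>       double prob = 0.0;
>       for (double a : alpha) prob += a;
>       return prob;
>     }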
> In general, the quality of HMM training can be improved by employing large
> training sets, but currently Mahout only supports sequential versions of the
> HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch
> training methods, the Baum-Welch algorithm is the more accurate and the
> better candidate for a parallel implementation, for two reasons:
> (1) BW is numerically stable and guarantees convergence to a locally maximal
> Maximum Likelihood Estimate (MLE) of the model's parameters (Theta). In
> Viterbi training, the MLE is approximated in order to reduce computation
> time.
> (2) BW belongs to the general class of Expectation-Maximization (EM)
> algorithms, which fit naturally into the Map-Reduce framework [2], as shown
> by the existing Map-Reduce implementation of k-means in Mahout.
> Hence, this project proposes to extend Mahout's current sequential
> implementation of the Baum-Welch HMM trainer to a scalable, distributed
> version. Since the distributed version of BW will use the sequential
> implementations of the Forward and Backward algorithms to compute the alpha
> and beta factors in each iteration, much of the existing HMM training code
> will be reused. A parallel formulation of the BW algorithm on Map-Reduce,
> which views it as a specific case of the Expectation-Maximization algorithm,
> is elaborated at length in [3] and will be followed in this project.
> The BW EM algorithm iteratively refines the model's parameters and consists
> of two distinct steps in each iteration--Expectation and Maximization. In
> the distributed case, the Expectation step is computed by the mappers and
> the reducers, while the Maximization is handled by the reducers. Starting
> from an initial Theta^(0), in each iteration i the model parameter tuple
> Theta^(i) is input to the algorithm and the resulting Theta^(i+1) is fed to
> iteration i+1. The iteration stops on a user-specified convergence condition
> expressed as a fixpoint, or when the number of iterations exceeds a
> user-defined value.
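> A minimal sketch of such a fixpoint-style stopping test, assuming the
> parameters are simply compared element-wise against a user-supplied epsilon
> (the helper name hasConverged is illustrative, not existing Mahout API):
>
>     // Returns true when no parameter moved by more than epsilon between
>     // two consecutive iterations; applied to A, B and (as a single row) Pi.
>     static boolean hasConverged(double[][] prev, double[][] next,
>                                 double epsilon) {
>       for (int i = 0; i < prev.length; i++) {
>         for (int j = 0; j < prev[i].length; j++) {
>           if (Math.abs(prev[i][j] - next[i][j]) > epsilon) {
>             return false;
>           }
>         }
>       }
>       return true;
>     }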
> Expectation computes the posterior probability of each latent variable for
> each observed variable, weighted by the relative frequency of the observed
> variable in the input split. The mappers process independent training
> instances and emit expected state transition and emission counts using the
> Forward and Backward algorithms. The reducers finish the Expectation step by
> aggregating the expected counts. The input to a mapper consists of (k, v_o)
> pairs where k is a unique key and v_o is a string of observed symbols. For
> each training instance, the mappers emit the same set of keys, corresponding
> to the following three optimization problems to be solved during
> Maximization, with their values in a hash-map (a per-sequence sketch of
> these expected-count computations is given after the list):
> (1) Expected number of times a hidden state is reached (Pi).
> (2) Number of times each observable symbol is generated by each hidden state 
> (B).
> (3) Number of transitions between each pair of states in the hidden state 
> space (A). 
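> The per-sequence computation behind these three sets of counts can be
> sketched as follows (unscaled and purely illustrative, following the
> treatment in [3]; the actual mapper would reuse Mahout's existing Forward
> and Backward routines rather than the plain arrays used here):
>
>     // E step for one observation sequence obs under the current model
>     // (A, B, pi): accumulates the expected counts a mapper would emit for
>     // the three keys described above.
>     static void expectedCounts(double[][] A, double[][] B, double[] pi,
>                                int[] obs,
>                                double[] piCounts,      // key (1): Pi
>                                double[][] emitCounts,  // key (2): B
>                                double[][] transCounts) // key (3): A
>     {
>       int S = pi.length;
>       int T = obs.length;
>
>       // Forward pass: alpha[t][i] = P(o_1..o_t, state_t = i).
>       double[][] alpha = new double[T][S];
>       for (int i = 0; i < S; i++) {
>         alpha[0][i] = pi[i] * B[i][obs[0]];
>       }
>       for (int t = 1; t < T; t++) {
>         for (int j = 0; j < S; j++) {
>           double sum = 0.0;
>           for (int i = 0; i < S; i++) {
>             sum += alpha[t - 1][i] * A[i][j];
>           }
>           alpha[t][j] = sum * B[j][obs[t]];
>         }
>       }
>
>       // Backward pass: beta[t][i] = P(o_(t+1)..o_T | state_t = i).
>       double[][] beta = new double[T][S];
>       for (int i = 0; i < S; i++) {
>         beta[T - 1][i] = 1.0;
>       }
>       for (int t = T - 2; t >= 0; t--) {
>         for (int i = 0; i < S; i++) {
>           double sum = 0.0;
>           for (int j = 0; j < S; j++) {
>             sum += A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j];
>           }
>           beta[t][i] = sum;
>         }
>       }
>
>       // Likelihood of the sequence under the current model.
>       double seqProb = 0.0;
>       for (int i = 0; i < S; i++) {
>         seqProb += alpha[T - 1][i];
>       }
>
>       // Accumulate expected counts for Pi, B and A.
>       for (int t = 0; t < T; t++) {
>         for (int i = 0; i < S; i++) {
>           // gamma = P(state_t = i | obs, model)
>           double gamma = alpha[t][i] * beta[t][i] / seqProb;
>           if (t == 0) {
>             piCounts[i] += gamma;
>           }
>           emitCounts[i][obs[t]] += gamma;
>           if (t < T - 1) {
>             for (int j = 0; j < S; j++) {
>               // xi = P(state_t = i, state_(t+1) = j | obs, model)
>               transCounts[i][j] += alpha[t][i] * A[i][j]
>                   * B[j][obs[t + 1]] * beta[t + 1][j] / seqProb;
>             }
>           }
>         }
>       }
>     }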
> The M step computes the updated Theta^(i+1) from the values generated during
> the E step. This involves aggregating the values (as hash-maps) for each key
> corresponding to one of the optimization problems. The aggregation
> summarizes the statistics necessary to compute a subset of the parameters
> for the next EM iteration. The optimal parameters for the next iteration are
> arrived at by computing the relative frequency of each event with respect to
> its expected count at the current iteration. The optimal parameters emitted
> by each reducer are written to HDFS and are fed to the mappers in the next
> iteration.
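> This relative-frequency update amounts to row-normalizing the aggregated
> counts; a minimal sketch (the uniform fallback for all-zero rows is an
> illustrative choice, not a requirement of the algorithm):
>
>     // M step (illustrative): turn aggregated expected counts into updated,
>     // row-stochastic parameters for Theta^(i+1). Applying the same
>     // normalization to the Pi count vector yields the updated initial
>     // state distribution.
>     static double[][] normalizeRows(double[][] counts) {
>       double[][] probs = new double[counts.length][];
>       for (int i = 0; i < counts.length; i++) {
>         double rowSum = 0.0;
>         for (double c : counts[i]) rowSum += c;
>         probs[i] = new double[counts[i].length];
>         for (int j = 0; j < counts[i].length; j++) {
>           // Guard against rows with no observed mass (states never reached).
>           probs[i][j] = rowSum > 0.0
>               ? counts[i][j] / rowSum
>               : 1.0 / counts[i].length;
>         }
>       }
>       return probs;
>     }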
> The project can be subdivided into distinct tasks of programming, testing and 
> documenting the driver, mapper, reducer and the combiner with the Expectation 
> and Maximization parts split between them. For each of these tasks, a new 
> class will be programmed, unit tested and documented within the 
> org.apache.mahout.classifier.sequencelearning.hmm package. Since k-means is 
> also an EM algorithm, particular attention will be paid to its code at each 
> step for possible reuse.
> A list of milestones, associated deliverables and high-level implementation
> details is given below.
> Time-line: April 26 - Aug 15. 
> Milestones: 
> April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with my
> mentor, refine the project's plan and requirements, understand the
> community's code styling requirements, and expand my knowledge of Hadoop and
> Mahout internals. Thoroughly familiarize myself with the classes within the
> classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and
> math packages.
> May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document the 
> class HmmDriver by extending the AbstractJob class and by reusing the code 
> from the KMeansDriver class. 
> June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document the
> class HmmMapper. The HmmMapper class will include setup() and map() methods.
> The setup() method will read in the HmmModel and the parameter values
> obtained from the previous iteration. The map() method will call
> HmmAlgorithms.backwardAlgorithm() and HmmAlgorithms.forwardAlgorithm() and
> partially complete the Expectation step.
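> A skeleton of what the proposed HmmMapper might look like is sketched below.
> This class does not exist yet; the Hadoop Mapper types are real, but the
> method bodies are only outlined in comments and the emitted MapWritable
> values are placeholders for the serialized expected counts:
>
>     import java.io.IOException;
>
>     import org.apache.hadoop.io.LongWritable;
>     import org.apache.hadoop.io.MapWritable;
>     import org.apache.hadoop.io.Text;
>     import org.apache.hadoop.mapreduce.Mapper;
>
>     public class HmmMapper
>         extends Mapper<LongWritable, Text, Text, MapWritable> {
>
>       // Current model parameters Theta^(i), loaded once per task in setup().
>       private double[][] transition; // A
>       private double[][] emission;   // B
>       private double[] initial;      // Pi
>
>       @Override
>       protected void setup(Context context)
>           throws IOException, InterruptedException {
>         // Read the HmmModel and parameter values produced by the previous
>         // iteration, e.g. from a location passed in the job configuration.
>         // (Loading code omitted in this sketch.)
>       }
>
>       @Override
>       protected void map(LongWritable key, Text value, Context context)
>           throws IOException, InterruptedException {
>         // 1. Parse the observation sequence from the input value.
>         // 2. Run the Forward and Backward passes to obtain the expected
>         //    counts for this training instance (see the E-step sketch
>         //    above).
>         // 3. Emit one (key, hash-map) pair per optimization problem; the
>         //    empty MapWritable instances below stand in for the counts.
>         context.write(new Text("pi"), new MapWritable());
>         context.write(new Text("b"), new MapWritable());
>         context.write(new Text("a"), new MapWritable());
>       }
>     }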
> July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and document
> the class HmmReducer. The reducer will complete the Expectation step by
> summing over all the occurrences emitted by the mappers for the three
> optimization problems. Reuse the code from the HmmTrainer.trainBaumWelch()
> method if possible. The mid-term review also falls in this period.
> July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and document
> the class HmmCombiner. The combiner will reduce network traffic and improve
> efficiency by aggregating the values for each of the three keys
> corresponding to the optimization problems before they reach the
> Maximization stage in the reducers. Look at the possibility of reusing code
> from the KMeansCombiner class.
> July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer,
> combiner and driver together. Give an example demonstrating the new parallel
> BW algorithm by employing the part-of-speech tagger data set also used by
> the sequential BW [4]. Tidy up the code, fix loose ends and finish the wiki
> documentation.
> Additional Information: 
> I am in the final stages of finishing my Master's degree in Electrical and
> Computer Engineering at the University of Massachusetts Amherst. Working
> under the guidance of Prof. Wayne Burleson, as part of my Master's research
> work, I have applied the theory of Markov Decision Processes (MDPs) to
> increase the duration of service of mobile computers. This semester I am
> involved in two course projects on machine learning over large data sets. In
> the Bioinformatics class, I am mining the RCSB Protein Data Bank [5] to
> learn the dependence of side-chain geometry on a protein's secondary
> structure, and comparing it with the Dynamic Bayesian Network approach used
> in [6]. In another project, for the Online Social Networks class, I am using
> reinforcement learning to build an online recommendation system by
> reformulating MDP optimal policy search as an EM problem [7] and employing
> Map-Reduce (extending Mahout) to arrive at it in a scalable, distributed
> manner.
> I owe much to the open source community, as all my research experiments have
> only been possible due to the freely available Linux distributions,
> performance analyzers, scripting languages and associated documentation.
> Since joining the Apache Mahout developer mailing list a few weeks ago, I
> have found the community extremely vibrant, helpful and welcoming. If
> selected, I feel that the GSoC 2011 project will be a great learning
> experience for me from both a technical and a professional standpoint, and
> will also allow me to contribute within my modest means to the overall
> spirit of open source programming and Machine Learning.
> References: 
> [1] A tutorial on hidden Markov models and selected applications in speech
> recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol. 77
> (1989), pp. 257-286.
> [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang K. 
> Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle Olukotun. 
> In NIPS (2006), pp. 281-288. 
> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris Dyer. 
> Morgan & Claypool 2010. 
> [4] http://flexcrfs.sourceforge.net/#Case_Study 
> [5] http://www.rcsb.org/pdb/home/home.do
> [6] Beyond rotamers: a generative, probabilistic model of side chains in 
> proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE, 
> Hamelryck T. BMC Bioinformatics. 2010 Jun 5.
> [7] Probabilistic inference for solving discrete and continuous state Markov 
> Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
