On Jan 4, 2012, at 2:01 PM, Dhruv Kumar wrote:

> Hi Grant,
> 
> Sorry for being MIA for a while. I have looked at the error you
> reported while running the POS tagger example and I have been able to
> reproduce it.
> 
> The error is a result of arithmetic underflow. For most typical HMM
> applications, where the number of observed and hidden states is not large,
> the trainer works correctly (both the regular and the log-scaled
> version). For this POS tagging example, however, the number of observed
> states is huge (40K), which yields extremely small probabilities in the
> transition and emission matrices. Although the example uses the log-scaled
> variant of the trainer, convergence is checked at the end of each
> Map-Reduce iteration by converting the probabilities back to real
> space and measuring their distance from the model obtained in the previous
> iteration. This is exactly where the problem occurs: once the numbers have
> been converted to real space, the model is rebuilt for the next round of
> processing and validated by asserting that its probabilities sum to 1.
> After a few rounds the underflowed real-space values no longer sum to 1,
> tripping the assert.
> 
> Since most use cases of HMM trainers have a small, manageable number of
> observed and hidden states, the trainer will work even in its current
> form: within the mappers, combiners and reducers, everything happens in log
> space. However, to improve accuracy, the code could be refactored to
> avoid the jump from log space to real space at the end of each iteration.
> This will require a couple of weekends of effort and should not be too
> hard. I can work on it in the coming weeks.
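>
> As a rough illustration only (this is not the current Mahout code, and the
> names are made up), the convergence check could stay entirely in log space,
> for example:
>
>   // Hypothetical sketch: compare two log-space transition matrices
>   // without converting back to real space. Zero-probability entries
>   // (log value of negative infinity) would need special handling.
>   static double maxAbsLogDiff(double[][] prevLogA, double[][] currLogA) {
>     double maxDiff = 0.0;
>     for (int i = 0; i < prevLogA.length; i++) {
>       for (int j = 0; j < prevLogA[i].length; j++) {
>         maxDiff = Math.max(maxDiff, Math.abs(prevLogA[i][j] - currLogA[i][j]));
>       }
>     }
>     return maxDiff;
>   }
>
>   // Hypothetical sketch: normalization check in log space; the result
>   // should be close to 0 for a properly normalized row of probabilities.
>   static double logSumExp(double[] logRow) {
>     double max = Double.NEGATIVE_INFINITY;
>     for (double v : logRow) {
>       max = Math.max(max, v);
>     }
>     if (max == Double.NEGATIVE_INFINITY) {
>       return max;
>     }
>     double sum = 0.0;
>     for (double v : logRow) {
>       sum += Math.exp(v - max);
>     }
>     return max + Math.log(sum);
>   }
>
> The same approach would apply to the emission matrix and the initial
> state vector.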

Thanks!

> 
> Please let me know if there is anything else which needs improvement in the
> code.

Will do.  I think it will be ready to go for 0.7.

> 
> --Dhruv
> 
> On Wed, Jan 4, 2012 at 8:09 AM, Grant Ingersoll (Updated) (JIRA) <
> [email protected]> wrote:
> 
>> 
>>    [
>> https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>> 
>> Grant Ingersoll updated MAHOUT-627:
>> -----------------------------------
>> 
>>   Fix Version/s:     (was: 0.6)
>>                  0.7
>> 
>> Marking this as 0.7, as much as I would love to get it in for 0.6.
>> 
>>> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model
>> Training.
>>> 
>> -----------------------------------------------------------------------------
>>> 
>>>                Key: MAHOUT-627
>>>                URL: https://issues.apache.org/jira/browse/MAHOUT-627
>>>            Project: Mahout
>>>         Issue Type: Task
>>>         Components: Classification
>>>   Affects Versions: 0.4, 0.5
>>>           Reporter: Dhruv Kumar
>>>           Assignee: Grant Ingersoll
>>>             Labels: gsoc, gsoc2011, mahout-gsoc-11
>>>            Fix For: 0.7
>>> 
>>>        Attachments: MAHOUT-627.patch, MAHOUT-627.patch,
>> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch,
>> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>>> 
>>> 
>>> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden
>> Markov Model Training.
>>> Student Name: Dhruv Kumar
>>> Student E-mail: [email protected]
>>> Organization/Project: Apache Mahout
>>> Assigned Mentor:
>>> Proposal Abstract:
>>> The Baum-Welch algorithm is commonly used for training a Hidden Markov
>> Model because of its superior numerical stability and its ability to
>> guarantee the discovery of a locally optimal Maximum Likelihood
>> Estimate, even in the presence of incomplete training data. Currently, Apache
>> Mahout has a sequential implementation of Baum-Welch which cannot
>> scale to train over large data sets. This restriction reduces the quality
>> of training and constrains generalization of the learned model when used
>> for prediction. This project proposes to extend Mahout's Baum-Welch to a
>> parallel, distributed version using the Map-Reduce programming framework
>> for enhanced model fitting over large data sets.
>>> Detailed Description:
>>> Hidden Markov Models (HMMs) are widely used as a probabilistic inference
>> tool for applications generating temporal or spatial sequential data.
>> Their relative simplicity of implementation, combined with their ability to
>> discover latent domain knowledge, has made them very popular in diverse
>> fields such as DNA sequence alignment, gene discovery, handwriting
>> analysis, voice recognition, computer vision, language translation and
>> parts-of-speech tagging.
>>> An HMM is defined as a tuple (S, O, Theta) where S is a finite set of
>> unobservable, hidden states emitting symbols from a finite observable
>> vocabulary set O according to a probabilistic model Theta. The parameters
>> of the model Theta are defined by the tuple (A, B, Pi) where A is a
>> stochastic transition matrix of the hidden states of size |S| x |S|. The
>> elements a_(i,j) of A specify the probability of transitioning from state
>> i to state j. Matrix B is a |S| x |O| stochastic symbol emission
>> matrix whose elements b_(s,o) give the probability that symbol o will
>> be emitted from the hidden state s. The elements pi_(s) of the |S|-length
>> vector Pi determine the probability that the system starts in the hidden
>> state s. The transitions of hidden states are unobservable and follow the
>> Markov property of memorylessness.
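>>> To make the notation concrete, a purely illustrative container for the
>> parameter tuple Theta = (A, B, Pi) might look as follows (this is only a
>> sketch, not Mahout's existing HmmModel class):
>>
>>   // Hypothetical holder for the HMM parameters described above.
>>   class HmmParameters {
>>     double[][] transitionMatrix;    // A, |S| x |S|; each row sums to 1
>>     double[][] emissionMatrix;      // B, |S| x |O|; each row sums to 1
>>     double[] initialProbabilities;  // Pi, length |S|; entries sum to 1
>>   }
>>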
>>> Rabiner [1] defined three main problems for HMMs:
>>> 1. Evaluation: Given the complete model (S, O, Theta) and an
>> observation sequence, determine the probability that the model
>> generated the observed sequence. This is useful for evaluating the quality
>> of the model and is solved using the so-called Forward algorithm (a small
>> illustrative sketch follows after this list).
>>> 2. Decoding: Given the complete model (S, O, Theta) and an observation
>> sequence, determine the hidden state sequence which most likely generated
>> the observed sequence. This can be viewed as an inference problem where the
>> model and observed sequence are used to predict the values of the
>> unobservable random variables. The Viterbi decoding algorithm is used for
>> predicting the hidden state sequence.
>>> 3. Training: Given the set of hidden states S, the set of observation
>> vocabulary O and the observation sequence, determine the parameters (A, B,
>> Pi) of the model Theta. This problem can be viewed as a statistical machine
>> learning problem of model fitting to a large set of training data. The
>> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and
>> the Viterbi training algorithm are commonly used for model fitting.
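>>> As a small illustrative sketch of problem 1 above (plain real-space
>> arithmetic for clarity; a robust version works in log space to avoid the
>> underflow that arises for large state spaces):
>>
>>   // Hypothetical Forward pass: a, b and pi are the A, B and Pi defined
>>   // earlier, and o is the observed symbol sequence (symbol indices).
>>   static double forward(double[][] a, double[][] b, double[] pi, int[] o) {
>>     int numStates = pi.length;
>>     double[] alpha = new double[numStates];
>>     for (int s = 0; s < numStates; s++) {
>>       alpha[s] = pi[s] * b[s][o[0]];          // initialization at t = 0
>>     }
>>     for (int t = 1; t < o.length; t++) {
>>       double[] next = new double[numStates];
>>       for (int j = 0; j < numStates; j++) {
>>         double sum = 0.0;
>>         for (int i = 0; i < numStates; i++) {
>>           sum += alpha[i] * a[i][j];          // all ways of reaching state j
>>         }
>>         next[j] = sum * b[j][o[t]];           // emit symbol o[t] from state j
>>       }
>>       alpha = next;
>>     }
>>     double probability = 0.0;
>>     for (int s = 0; s < numStates; s++) {
>>       probability += alpha[s];                // P(observation sequence | model)
>>     }
>>     return probability;
>>   }
>>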
>>> In general, the quality of HMM training can be improved by employing
>> large training sets, but currently Mahout only supports sequential
>> versions of HMM trainers, which are incapable of scaling. Of the Viterbi
>> and Baum-Welch training methods, the Baum-Welch algorithm is the more
>> accurate and the better candidate for a parallel implementation, for two
>> reasons:
>>> (1) The BW is numerically stable and provides a guaranteed discovery of
>> a locally optimal Maximum Likelihood Estimate (MLE) of the model's
>> parameters Theta. In Viterbi training, the MLE is approximated in order
>> to reduce computation time.
>>> (2) The BW belongs to the general class of Expectation Maximization (EM)
>> algorithms, which fit naturally into the Map-Reduce framework [2], as
>> demonstrated by the existing Map-Reduce implementation of k-means in Mahout.
>>> Hence, this project proposes to extend Mahout's current sequential
>> implementation of the Baum-Welch HMM trainer to a scalable, distributed
>> case. Since the distributed version of the BW will use the sequential
>> implementations of the Forward and the Backward algorithms to compute the
>> alpha and the beta factors in each iteration, a lot of existing HMM
>> training code will be reused. Specifically, the parallel implementation of
>> the BW algorithm on Map-Reduce, viewed as a special case of the
>> Expectation-Maximization algorithm, has been elaborated at length in [3]
>> and will be followed in this project.
>>> The BW EM algorithm iteratively refines the model's parameters and
>> consists of two distinct steps in each iteration--Expectation and
>> Maximization. In the distributed case, the Expectation step is computed by
>> the mappers and the reducers, while the Maximization is handled by the
>> reducers. Starting from an initial Theta^(0), in each iteration i, the
>> model parameter tuple Theta^i is input to the algorithm, and the end result
>> Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a
>> user-specified convergence condition, expressed as a fixpoint, or when the
>> number of iterations exceeds a user-defined value.
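>>> Purely as an outline (method names are placeholders, not the proposed
>> HmmDriver API), the iteration control looks like this, using the
>> illustrative HmmParameters type above:
>>
>>   // Hypothetical sketch of the outer EM loop run by the driver; the
>>   // single-iteration job and the model distance are placeholders.
>>   static HmmParameters train(HmmParameters initial, int maxIterations,
>>                              double convergenceDelta) {
>>     HmmParameters theta = initial;                        // Theta^(0)
>>     for (int i = 0; i < maxIterations; i++) {
>>       // One Map-Reduce pass computes Theta^(i+1) from Theta^(i).
>>       HmmParameters next = runSingleEmIteration(theta);   // placeholder
>>       boolean converged = modelDistance(theta, next) < convergenceDelta;
>>       theta = next;                                       // feed iteration i+1
>>       if (converged) {
>>         break;                                            // fixpoint reached
>>       }
>>     }
>>     return theta;
>>   }
>>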
>>> Expectation computes the posterior probability of each latent variable
>> for each observed variable, weighted by the relative frequency of the
>> observed variable in the input split. The mappers process independent
>> training instances and emit expected state transition and emission counts
>> using the Forward and Backward algorithms. The reducers finish Expectation
>> by aggregating the expected counts. The input to a mapper consists of (k,
>> v_o) pairs where k is a unique key and v_o is a string of observed symbols.
>> For each training instance, the mappers emit the same set of keys,
>> corresponding to the following three optimization problems to be solved
>> during the Maximization, with their values packaged in a hash-map:
>>> (1) Expected number of times a hidden state is reached (Pi).
>>> (2) Number of times each observable symbol is generated by each hidden
>> state (B).
>>> (3) Number of transitions between each pair of states in the hidden
>> state space (A).
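>>> The shape of a mapper's output can be sketched as below (key names,
>> types and the omitted count computation are illustrative assumptions, not
>> the final HmmMapper design):
>>
>>   // Hypothetical map() body: emit one record per optimization problem,
>>   // with the expected counts for this training instance packed in a map.
>>   protected void map(Text uniqueKey, Text observedSymbols, Context context)
>>       throws IOException, InterruptedException {
>>     // Forward/Backward passes over observedSymbols would fill these
>>     // expected-count maps (omitted in this sketch).
>>     MapWritable expectedInitialCounts = new MapWritable();     // for Pi (1)
>>     MapWritable expectedEmissionCounts = new MapWritable();    // for B  (2)
>>     MapWritable expectedTransitionCounts = new MapWritable();  // for A  (3)
>>     context.write(new Text("INITIAL"), expectedInitialCounts);
>>     context.write(new Text("EMISSION"), expectedEmissionCounts);
>>     context.write(new Text("TRANSITION"), expectedTransitionCounts);
>>   }
>>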
>>> The M step computes the updated Theta^(i+1) from the values generated
>> during the E step. This involves aggregating the values (as hash-maps) for
>> each key corresponding to one of the optimization problems. The aggregation
>> summarizes the statistics necessary to compute a subset of the parameters
>> for the next EM iteration. The optimal parameters for the next iteration
>> are obtained by computing the relative frequency of each event with respect
>> to its expected count at the current iteration. The optimal parameters
>> emitted by each reducer are written to HDFS and fed to the mappers in the
>> next iteration.
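>>> The relative-frequency computation itself is simple; as an illustrative
>> sketch (not the final reducer code):
>>
>>   // Hypothetical M-step helper: turn the aggregated expected counts for
>>   // one hidden state (one row of A or B, or the Pi vector) into
>>   // probabilities. Assumes the total count is positive.
>>   static double[] toRelativeFrequencies(double[] expectedCounts) {
>>     double total = 0.0;
>>     for (double count : expectedCounts) {
>>       total += count;
>>     }
>>     double[] probabilities = new double[expectedCounts.length];
>>     for (int k = 0; k < expectedCounts.length; k++) {
>>       probabilities[k] = expectedCounts[k] / total;  // row now sums to 1
>>     }
>>     return probabilities;
>>   }
>>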
>>> The project can be subdivided into distinct tasks of programming,
>> testing and documenting the driver, mapper, reducer and the combiner with
>> the Expectation and Maximization parts split between them. For each of
>> these tasks, a new class will be programmed, unit tested and documented
>> within the org.apache.mahout.classifier.sequencelearning.hmm package. Since
>> k-means is also an EM algorithm, particular attention will be paid to its
>> code at each step for possible reuse.
>>> A list of milestones, associated deliverables and high-level
>> implementation details is given below.
>>> Time-line: April 26 - Aug 15.
>>> Milestones:
>>> April 26 - May 22 (4 weeks): Pre-coding stage. Open communication with
>> my mentor, refine the project's plan and requirements, understand the
>> community's code styling conventions, and expand my knowledge of Hadoop and
>> Mahout internals. Thoroughly familiarize myself with the classes within the
>> classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer and
>> math packages.
>>> May 23 - June 3 (2 weeks): Work on Driver. Implement, test and document
>> the class HmmDriver by extending the AbstractJob class and by reusing the
>> code from the KMeansDriver class.
>>> June 3 - July 1 (4 weeks): Work on Mapper. Implement, test and document
>> the class HmmMapper. The HmmMapper class will include setup() and map()
>> methods. The setup() method will read in the HmmModel and the parameter
>> values obtained from the previous iteration. The map() method will call the
>> HmmAlgorithms.backwardAlgorithm() and the HmmAlgorithms.forwardAlgorithm()
>> and complete the Expectation step partially.
>>> July 1 - July 15 (2 weeks): Work on Reducer. Implement, test and
>> document the class HmmReducer. The reducer will complete the Expectation
>> step by summing over all the occurrences emitted by the mappers for the
>> three optimization problems. Reuse the code from the
>> HmmTrainer.trainBaumWelch() method if possible. Also, mid-term review.
>>> July 15 - July 29 (2 weeks): Work on Combiner. Implement, test and
>> document the class HmmCombiner. The combiner will reduce the network
>> traffic and improve efficiency by aggregating the values for each of the
>> three keys corresponding to each of the optimization problems for the
>> Maximization stage in reducers. Look at the possibility of code reuse from
>> the KMeansCombiner class.
>>> July 29 - August 15 (2 weeks): Final touches. Test the mapper, reducer,
>> combiner and driver together. Give an example demonstrating the new
>> parallel BW algorithm by employing the parts-of-speech tagger data set also
>> used by the sequential BW [4]. Tidy up code and fix loose ends, finish wiki
>> documentation.
>>> Additional Information:
>>> I am in the final stages of finishing my Master's degree in Electrical
>> and Computer Engineering from the University of Massachusetts Amherst.
>> Working under the guidance of Prof. Wayne Burleson, as part of my Master's
>> research work, I have applied the theory of Markov Decision Process (MDP)
>> to increase the duration of service of mobile computers. This semester I am
>> involved with two course projects involving machine learning over large
>> data sets. In the Bioinformatics class, I am mining the RCSB Protein Data
>> Bank [5] to learn the dependence of side chain geometry on a protein's
>> secondary structure, and comparing it with the Dynamic Bayesian Network
>> approach used in [6]. In another project for the Online Social Networks
>> class, I am using reinforcement learning to build an online recommendation
>> system by reformulating MDP optimal policy search as an EM problem [7] and
>> employing Map Reduce (extending Mahout) to arrive at it in a scalable,
>> distributed manner.
>>> I owe much to the open source community as all my research experiments
>> have only been possible due to the freely available Linux distributions,
>> performance analyzers, scripting languages and associated documentation.
>> After joining the Apache Mahout developer mailing list a few weeks ago,
>> I have found the community extremely vibrant, helpful and welcoming. If
>> selected, I feel that the GSOC 2011 project will be a great learning
>> experience for me from both a technical and professional standpoint and
>> will also allow me to contribute within my modest means to the overall
>> spirit of open source programming and Machine Learning.
>>> References:
>>> [1] A tutorial on hidden markov models and selected applications in
>> speech recognition by Lawrence R. Rabiner. In Proceedings of the IEEE, Vol.
>> 77 (1989), pp. 257-286.
>>> [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang
>> K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle
>> Olukotun. In NIPS (2006), pp. 281-288.
>>> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris
>> Dyer. Morgan & Claypool 2010.
>>> [4] http://flexcrfs.sourceforge.net/#Case_Study
>>> [5] http://www.rcsb.org/pdb/home/home.do
>>> [6] Beyond rotamers: a generative, probabilistic model of side chains in
>> proteins by Harder T, Boomsma W, Paluszewski M, Frellsen J, Johansson KE,
>> Hamelryck T. BMC Bioinformatics. 2010 Jun 5.
>>> [7] Probabilistic inference for solving discrete and continuous state
>> Markov Decision Processes by M. Toussaint and A. Storkey. ICML, 2006.
>> 
>> --
>> This message is automatically generated by JIRA.
>> If you think it was sent incorrectly, please contact your JIRA
>> administrators:
>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>> 
>> 
>> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com


