[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dhruv Kumar updated MAHOUT-627:
-------------------------------

    Description: 
Among the unsupervised learning methods available in the current sequential 
implementation of HMM training (MAHOUT-396), the Baum-Welch (BW) algorithm is 
an attractive candidate for a parallel, MapReduce implementation. Although 
slower than the Viterbi training algorithm, the BW is more numerically stable 
and provides guaranteed discovery of Maximum Likelihood Estimator (MLE). In 
Viterbi training, the MLE is approximated in order to reduce computation time.

A parallel, MapReduce implementation of BW will allow scalable model learning 
over large data sets. The resulting model can be used for prediction using the 
current sequential implementation of the Viterbi decoding algorithm. 

BW is a general case of the Expectation Maximization (EM) algorithm and it is 
shown that all EM algorithms are candidates for a MapReduce implementation: 
http://www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf. Also, the 
k-means clustering is an EM algorithm and has an existing MapReduce 
implementation in Mahout which hints at the possibility of code reuse to aid 
parallel BW development. The other possibility for a parallel implementation 
would be to use the notion of matrix probability as shown by Turin: 
http://tagh.de/tom/wp-content/uploads/turin-1998.pdf. 

Although non trivial, a parallel BW implementation would greatly add to the 
existing set of Mahout's HMM functionality. 

--------- Proposal Begin ---------

Proposal Title: Implementation of the Baum-Welch Algorithm for parallel
Hidden Markov Model training on Map-Reduce.

Student Name: Dhruv Kumar

Student E-mail: [email protected]

Organization/Project: Apache Mahout

Assigned Mentor:

Proposal Abstract:

The Baum-Welch algorithm is commonly used for training a Hidden Markov
Model as it is numerically stable and provides a guaranteed discovery of
the Maximum Likelihood Estimator in the presence of incomplete data.
Currently, Apache Mahout has a sequential implementation of the
Baum-Welch which cannot be scaled to train over large data-sets. This
project proposes to extend the sequential implementation of the
Baum-Welch to a parallel, distributed version using the Map Reduce
programming framework to allow scalable Hidden Markov Model training.

Detailed Description:

Hidden Markov Models (HMM) are widely used as a probabilistic inference
tool for applications generating temporal or spatial sequential data.
Their relative simplicity of implementation and their ability to
discover latent domain knowledge have made them very popular in fields
such as DNA sequence alignment, handwriting analysis, voice recognition,
computer vision and parts-of-speech tagging.

A HMM is defined as a tuple (S, O, Theta) where S is a finite set of
unobservable, hidden states emitting symbols from a finite observable
vocabulary set O according to a probabilistic model Theta. The
parameters of the model Theta are defined by the tuple (A, B, Pi) where
A is a stochastic transition matrix of the hidden states of size S X
S. The elements a_(i,j) of A specify the probability of transitioning
from a state i to state j. Matrix B is a size S X O stochastic
symbol emission matrix whose elements b_(s, o) provide the probability
that a symbol o will be emitted from the hidden state s. The elements
pi_(s) of the S length vector Pi determine the probability that the
system starts in the hidden state s. The transitions of hidden states
are unobservable and follow the Markov property of memorylessness.

Rabiner [1] defined three main problems for HMMs:

1. Evaluation: Given the complete model (S, O, Theta) and a subset of
the observation sequence, determine the probability that the model
generated the observed sequence. This is useful for determining the
quality of the model and is solved using the so called Forward
algorithm.

2. Decoding: Given the complete model (S, O, Theta) and an observation
sequence, determine the hidden state sequence which generated the
observed sequence. This can be viewed as an inference problem where the
model and observed sequence are used to predict the value of the
unobservable random variables. The backward algorithm, also known as the
Viterbi decoding algorithm is used for predicting the hidden state
sequence.

3. Training: Given the set of hidden states S, the set of observation
vocabulary O and the observation sequence, determine the parameters (A,
B, Pi) of the model Theta. This problem can be viewed as a statistical
machine learning problem of model fitting to a large set of training
data. The Baum-Welch (BW) algorithm (also called the Forward-Backward
algorithm) and the Viterbi training algorithm are commonly used for
model fitting.

Since most problems modeled by HMMs have a modest number of hidden and
observable states, the sequential versions of the Forward and the
Viterbi algorithms (currently implemented in Mahout) are sufficient for
the evaluation and decoding purposes. However, often the training data
is so large that a single compute node is incapable of handling it.
Attempts to prune the training data can lead to lower quality model
fitting and may miss out on crucial domain knowldege because of inferior
posterior probabilities. Currently, Mahout only supports sequential HMM
trainers--Viterbi and Baum Welch which are incapable of scaling to large
data sets.

This project proposes to extend Mahout's current sequential
implementation of the Baum Welch HMM trainer to a scalable, distributed
case. The distributed version of the BW will use the sequential
implementations of the Forward and the Backward algorithms in each
iteration, hence a lot of exisitng HMM training code will be reused.

Among the two sequential HMM training methods, the Baum-Welch (BW) or
the Forward-Backward algorithm is superiand a better candidate for a
parallel implementation for two reasons. (1) The BW is more numerically
stable and provides guaranteed discovery of Maximum Likelihood Estimator
(MLE) (albiet a local maximum) unlike Viterbi training where the MLE is
approximated in order to reduce computation time. (2) The BW belongs to
the general class of Expectation Maximization (EM) algorithms which
naturally fit into the Map Reduce framework [2]. Specifically, the
parallel implementation of the BW algorithm on Map Reduce has been
elaborated at great length in [3] and will be followed in the remainder
of this proposal.

The BW EM algorithm iteratively refines the model's parameters and
consists of two distinct steps in each iteration--Expectation (E) and
Maximization (M). In the distributed case, the E step is computed by the
mappers and the reducers, while the M is handled by the reducers.
Starting from an initial Theta^(0), in each iteration i, the model
parameter tuple Theta^i is input to the algorithm, and the end result
Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a
user specified convergence condition expressed as a fixpoint or when the
number of iterations exceeds a user defined value.

The E step computes the posterior probability of each latent varaiable
for each observed variable, weighed by the relative frequency of the
observered variable in the input split. The mappers process independent
training instances and emit expected state transition and emission
counts using the forward-backward algorithm. The reducers finish E by
aggregating the expected counts. The input to a mapper consists of (k,
v_o) pairs where k is a unique key and v_o is a string of observerd
symbols of length n. For each training instance, the mappers emit the
same set of keys corresponding to the following three optimization
problems to be solved during the M: (1) expected number of times a
hidden state is reached, (2) the number of times each observable symbol
is generated by each hidden state and, (3) the number of transitions
between each pair of states in the hidden state space.

The M step computes the updated Theta^(i+1) from the values generated
during the E part. This involves aggregating the values obtained in the
E step for each key corresponding to one of the optimization problems.
The aggregation summarizes the statistics necessary to compute a subset
of the parameters for the next EM iteration. The optimal parameters for
the next iteration are arrived by computing the relative frequency of
each event with respect to its expected count at the current iteration.
The emitted optimal parameters by each reducer are written to the HDFS
and are fed to the mappers in the next iteration.

The project can be subdivided into distinct tasks of programming,
testing and documenting the driver, mapper, reducer and the combiner. A
list of milestones with their associated deliverables is given below.

Timeline: April 26 - Aug 15.

Milestones:

April 26 - May 22 (4 weeks): Pre coding tasks. Open communication with
my mentor, refine the project's plan and requirements, understand the
code styling requirements, expand the knowledge on Hadoop and Mahout's
internals. Thoroughly familiarize with the classes within the
classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer
and math packages.

May 23 - June 3 (2 weeks): Driver. Implement and test the class
HmmDriver by extending the AbstractJob class and by reusing the code
from the KMeansDriver class.

June 3 - July 1 (4 weeks): Mapper. Implement and test the class
HmmMapper. The HmmMapper class will include setup() and map() methods.
The setup() method will read in the HmmModel and the parameter values
obtained from the previous iteration. The map() method will call the
HmmAlgorithms.backwardAlgorithm() and the
HmmAlgorithms.forwardAlgorithm() to compute the E step partially.

July 1 - July 15 (2 weeks): Reducer. Implement and test the class
HmmReducer. The reducer will complete the E step by summing over all the
occurences emitted by the mappers for the three optimization problems.
Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.

July 15 - July 29 (2 weeks): Combiner. Implement and test the class
HmmCombiner. The combiner will reduce the network traffic and improve
efficiency by aggregating the values for each of the three keys
corresponding to each of the optimization problems for the M stage. Look
at the possibility of code reuse from KMeansCombiner.

July 29 - August 15 (2 weeks): Give an example demonstrating the new
parallel BW algorithm by employing the parts-of-speech tagger data set
also used by the sequential BW. Tidy up code and fix loose ends, conduct
final tests, finish any remaining documentation.

Additional Information:

I am in the final stages of finishing my Master's degree in Electrical
and Computer Engineering from the University of Massachusetts Amherst.
Working under the guidance of Prof. Wayne Burleson, as part of my
Master's research work, I have applied the theory of Markov Decision
Process (MDP) to increase the duration of service of mobile computers.
This semester I am involved with two course projects involving machine
learning over large data sets. In a Bioinformatics class I am mining the
RCSB Protein Data Bank to learn the dependence of side chain geometry on
the protein's secondary structure. In another project for the Online
Social Networks class, I am building an online recommendation system by
using the MDP theory and recasting the problem to be an instance of
sequential optimal control.

I owe much to the open source community as all my research experiments
have only been possible due to the freely available Linux distributions,
open source performance analyzers, scripting languages such as Python
and the suite of GNU tools. I have found the Apache Mahout's developer
community vibrant, helpful and welcoming. If selected, I feel that the
GSOC 2011 project will be a great learning experience for me from both a
technical and professional standpoint and allow me to contribute within
my modest means to the overall spirit of open coding.

References:

[1] A tutorial on hidden markov models and selected
applications in speech recognition by Lawrence R. Rabiner. In
Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.

[2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang
K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle
Olukotun. In NIPS (2006), pp. 281-288.

[3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris
Dyer. Morgan & Claypool 2010.


--------- Proposal End -------------

  was:
Among the unsupervised learning methods available in the current sequential 
implementation of HMM training (MAHOUT-396), the Baum-Welch (BW) algorithm is 
an attractive candidate for a parallel, MapReduce implementation. Although 
slower than the Viterbi training algorithm, the BW is more numerically stable 
and provides guaranteed discovery of Maximum Likelihood Estimator (MLE). In 
Viterbi training, the MLE is approximated in order to reduce computation time.

A parallel, MapReduce implementation of BW will allow scalable model learning 
over large data sets. The resulting model can be used for prediction using the 
current sequential implementation of the Viterbi decoding algorithm. 

BW is a general case of the Expectation Maximization (EM) algorithm and it is 
shown that all EM algorithms are candidates for a MapReduce implementation: 
http://www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf. Also, the 
k-means clustering is an EM algorithm and has an existing MapReduce 
implementation in Mahout which hints at the possibility of code reuse to aid 
parallel BW development. The other possibility for a parallel implementation 
would be to use the notion of matrix probability as shown by Turin: 
http://tagh.de/tom/wp-content/uploads/turin-1998.pdf. 

Although non trivial, a parallel BW implementation would greatly add to the 
existing set of Mahout's HMM functionality. 


> Parallelization of Baum-Welch Algorithm for HMM Training 
> ---------------------------------------------------------
>
>                 Key: MAHOUT-627
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-627
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.4
>            Reporter: Dhruv Kumar
>            Priority: Minor
>
> Among the unsupervised learning methods available in the current sequential 
> implementation of HMM training (MAHOUT-396), the Baum-Welch (BW) algorithm is 
> an attractive candidate for a parallel, MapReduce implementation. Although 
> slower than the Viterbi training algorithm, the BW is more numerically stable 
> and provides guaranteed discovery of Maximum Likelihood Estimator (MLE). In 
> Viterbi training, the MLE is approximated in order to reduce computation time.
> A parallel, MapReduce implementation of BW will allow scalable model learning 
> over large data sets. The resulting model can be used for prediction using 
> the current sequential implementation of the Viterbi decoding algorithm. 
> BW is a general case of the Expectation Maximization (EM) algorithm and it is 
> shown that all EM algorithms are candidates for a MapReduce implementation: 
> http://www-cs.stanford.edu/~ang/papers/nips06-mapreducemulticore.pdf. Also, 
> the k-means clustering is an EM algorithm and has an existing MapReduce 
> implementation in Mahout which hints at the possibility of code reuse to aid 
> parallel BW development. The other possibility for a parallel implementation 
> would be to use the notion of matrix probability as shown by Turin: 
> http://tagh.de/tom/wp-content/uploads/turin-1998.pdf. 
> Although non trivial, a parallel BW implementation would greatly add to the 
> existing set of Mahout's HMM functionality. 
> --------- Proposal Begin ---------
> Proposal Title: Implementation of the Baum-Welch Algorithm for parallel
> Hidden Markov Model training on Map-Reduce.
> Student Name: Dhruv Kumar
> Student E-mail: [email protected]
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Proposal Abstract:
> The Baum-Welch algorithm is commonly used for training a Hidden Markov
> Model as it is numerically stable and provides a guaranteed discovery of
> the Maximum Likelihood Estimator in the presence of incomplete data.
> Currently, Apache Mahout has a sequential implementation of the
> Baum-Welch which cannot be scaled to train over large data-sets. This
> project proposes to extend the sequential implementation of the
> Baum-Welch to a parallel, distributed version using the Map Reduce
> programming framework to allow scalable Hidden Markov Model training.
> Detailed Description:
> Hidden Markov Models (HMM) are widely used as a probabilistic inference
> tool for applications generating temporal or spatial sequential data.
> Their relative simplicity of implementation and their ability to
> discover latent domain knowledge have made them very popular in fields
> such as DNA sequence alignment, handwriting analysis, voice recognition,
> computer vision and parts-of-speech tagging.
> A HMM is defined as a tuple (S, O, Theta) where S is a finite set of
> unobservable, hidden states emitting symbols from a finite observable
> vocabulary set O according to a probabilistic model Theta. The
> parameters of the model Theta are defined by the tuple (A, B, Pi) where
> A is a stochastic transition matrix of the hidden states of size S X
> S. The elements a_(i,j) of A specify the probability of transitioning
> from a state i to state j. Matrix B is a size S X O stochastic
> symbol emission matrix whose elements b_(s, o) provide the probability
> that a symbol o will be emitted from the hidden state s. The elements
> pi_(s) of the S length vector Pi determine the probability that the
> system starts in the hidden state s. The transitions of hidden states
> are unobservable and follow the Markov property of memorylessness.
> Rabiner [1] defined three main problems for HMMs:
> 1. Evaluation: Given the complete model (S, O, Theta) and a subset of
> the observation sequence, determine the probability that the model
> generated the observed sequence. This is useful for determining the
> quality of the model and is solved using the so called Forward
> algorithm.
> 2. Decoding: Given the complete model (S, O, Theta) and an observation
> sequence, determine the hidden state sequence which generated the
> observed sequence. This can be viewed as an inference problem where the
> model and observed sequence are used to predict the value of the
> unobservable random variables. The backward algorithm, also known as the
> Viterbi decoding algorithm is used for predicting the hidden state
> sequence.
> 3. Training: Given the set of hidden states S, the set of observation
> vocabulary O and the observation sequence, determine the parameters (A,
> B, Pi) of the model Theta. This problem can be viewed as a statistical
> machine learning problem of model fitting to a large set of training
> data. The Baum-Welch (BW) algorithm (also called the Forward-Backward
> algorithm) and the Viterbi training algorithm are commonly used for
> model fitting.
> Since most problems modeled by HMMs have a modest number of hidden and
> observable states, the sequential versions of the Forward and the
> Viterbi algorithms (currently implemented in Mahout) are sufficient for
> the evaluation and decoding purposes. However, often the training data
> is so large that a single compute node is incapable of handling it.
> Attempts to prune the training data can lead to lower quality model
> fitting and may miss out on crucial domain knowldege because of inferior
> posterior probabilities. Currently, Mahout only supports sequential HMM
> trainers--Viterbi and Baum Welch which are incapable of scaling to large
> data sets.
> This project proposes to extend Mahout's current sequential
> implementation of the Baum Welch HMM trainer to a scalable, distributed
> case. The distributed version of the BW will use the sequential
> implementations of the Forward and the Backward algorithms in each
> iteration, hence a lot of exisitng HMM training code will be reused.
> Among the two sequential HMM training methods, the Baum-Welch (BW) or
> the Forward-Backward algorithm is superiand a better candidate for a
> parallel implementation for two reasons. (1) The BW is more numerically
> stable and provides guaranteed discovery of Maximum Likelihood Estimator
> (MLE) (albiet a local maximum) unlike Viterbi training where the MLE is
> approximated in order to reduce computation time. (2) The BW belongs to
> the general class of Expectation Maximization (EM) algorithms which
> naturally fit into the Map Reduce framework [2]. Specifically, the
> parallel implementation of the BW algorithm on Map Reduce has been
> elaborated at great length in [3] and will be followed in the remainder
> of this proposal.
> The BW EM algorithm iteratively refines the model's parameters and
> consists of two distinct steps in each iteration--Expectation (E) and
> Maximization (M). In the distributed case, the E step is computed by the
> mappers and the reducers, while the M is handled by the reducers.
> Starting from an initial Theta^(0), in each iteration i, the model
> parameter tuple Theta^i is input to the algorithm, and the end result
> Theta^(i+1) is fed to the next iteration i+1. The iteration stops on a
> user specified convergence condition expressed as a fixpoint or when the
> number of iterations exceeds a user defined value.
> The E step computes the posterior probability of each latent varaiable
> for each observed variable, weighed by the relative frequency of the
> observered variable in the input split. The mappers process independent
> training instances and emit expected state transition and emission
> counts using the forward-backward algorithm. The reducers finish E by
> aggregating the expected counts. The input to a mapper consists of (k,
> v_o) pairs where k is a unique key and v_o is a string of observerd
> symbols of length n. For each training instance, the mappers emit the
> same set of keys corresponding to the following three optimization
> problems to be solved during the M: (1) expected number of times a
> hidden state is reached, (2) the number of times each observable symbol
> is generated by each hidden state and, (3) the number of transitions
> between each pair of states in the hidden state space.
> The M step computes the updated Theta^(i+1) from the values generated
> during the E part. This involves aggregating the values obtained in the
> E step for each key corresponding to one of the optimization problems.
> The aggregation summarizes the statistics necessary to compute a subset
> of the parameters for the next EM iteration. The optimal parameters for
> the next iteration are arrived by computing the relative frequency of
> each event with respect to its expected count at the current iteration.
> The emitted optimal parameters by each reducer are written to the HDFS
> and are fed to the mappers in the next iteration.
> The project can be subdivided into distinct tasks of programming,
> testing and documenting the driver, mapper, reducer and the combiner. A
> list of milestones with their associated deliverables is given below.
> Timeline: April 26 - Aug 15.
> Milestones:
> April 26 - May 22 (4 weeks): Pre coding tasks. Open communication with
> my mentor, refine the project's plan and requirements, understand the
> code styling requirements, expand the knowledge on Hadoop and Mahout's
> internals. Thoroughly familiarize with the classes within the
> classifier.sequencelearning.hmm, clustering.kmeans, common, vectorizer
> and math packages.
> May 23 - June 3 (2 weeks): Driver. Implement and test the class
> HmmDriver by extending the AbstractJob class and by reusing the code
> from the KMeansDriver class.
> June 3 - July 1 (4 weeks): Mapper. Implement and test the class
> HmmMapper. The HmmMapper class will include setup() and map() methods.
> The setup() method will read in the HmmModel and the parameter values
> obtained from the previous iteration. The map() method will call the
> HmmAlgorithms.backwardAlgorithm() and the
> HmmAlgorithms.forwardAlgorithm() to compute the E step partially.
> July 1 - July 15 (2 weeks): Reducer. Implement and test the class
> HmmReducer. The reducer will complete the E step by summing over all the
> occurences emitted by the mappers for the three optimization problems.
> Reuse the code from the HmmTrainer.trainBaumWelch() method if possible.
> July 15 - July 29 (2 weeks): Combiner. Implement and test the class
> HmmCombiner. The combiner will reduce the network traffic and improve
> efficiency by aggregating the values for each of the three keys
> corresponding to each of the optimization problems for the M stage. Look
> at the possibility of code reuse from KMeansCombiner.
> July 29 - August 15 (2 weeks): Give an example demonstrating the new
> parallel BW algorithm by employing the parts-of-speech tagger data set
> also used by the sequential BW. Tidy up code and fix loose ends, conduct
> final tests, finish any remaining documentation.
> Additional Information:
> I am in the final stages of finishing my Master's degree in Electrical
> and Computer Engineering from the University of Massachusetts Amherst.
> Working under the guidance of Prof. Wayne Burleson, as part of my
> Master's research work, I have applied the theory of Markov Decision
> Process (MDP) to increase the duration of service of mobile computers.
> This semester I am involved with two course projects involving machine
> learning over large data sets. In a Bioinformatics class I am mining the
> RCSB Protein Data Bank to learn the dependence of side chain geometry on
> the protein's secondary structure. In another project for the Online
> Social Networks class, I am building an online recommendation system by
> using the MDP theory and recasting the problem to be an instance of
> sequential optimal control.
> I owe much to the open source community as all my research experiments
> have only been possible due to the freely available Linux distributions,
> open source performance analyzers, scripting languages such as Python
> and the suite of GNU tools. I have found the Apache Mahout's developer
> community vibrant, helpful and welcoming. If selected, I feel that the
> GSOC 2011 project will be a great learning experience for me from both a
> technical and professional standpoint and allow me to contribute within
> my modest means to the overall spirit of open coding.
> References:
> [1] A tutorial on hidden markov models and selected
> applications in speech recognition by Lawrence R. Rabiner. In
> Proceedings of the IEEE, Vol. 77 (1989), pp. 257-286.
> [2] Map-Reduce for Machine Learning on Multicore by Cheng T. Chu, Sang
> K. Kim, Yi A. Lin, Yuanyuan Yu, Gary R. Bradski, Andrew Y. Ng, Kunle
> Olukotun. In NIPS (2006), pp. 281-288.
> [3] Data-Intensive Text Processing with MapReduce by Jimmy Lin, Chris
> Dyer. Morgan & Claypool 2010.
> --------- Proposal End -------------

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to