[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-10-24 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13804475#comment-13804475
 ] 

Suneel Marthi commented on MAHOUT-627:
--

Moving this to Backlog per email from Grant.

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: Backlog
>
> Attachments: ASF.LICENSE.NOT.GRANTED--screenshot.png, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: dku...@ecs.umass.edu 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
> because of its superior numerical stability and its guaranteed convergence to 
> a locally optimal Maximum Likelihood Estimate, even in the presence of 
> incomplete training data. Currently, Apache Mahout has a sequential 
> implementation of the Baum-Welch algorithm which cannot be scaled to train 
> over large data sets. This restriction reduces the quality of training and 
> constrains generalization of the learned model when it is used for 
> prediction. This project proposes to extend Mahout's Baum-Welch trainer to a 
> parallel, distributed version using the Map-Reduce programming framework for 
> enhanced model fitting over large data sets. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
> for applications generating temporal or spatial sequential data. Their 
> relative simplicity of implementation, combined with their ability to 
> discover latent domain knowledge, has made them very popular in diverse 
> fields such as DNA sequence alignment, gene discovery, handwriting analysis, 
> voice recognition, computer vision, language translation and part-of-speech 
> tagging. 
> An HMM is defined as a tuple (S, O, Theta), where S is a finite set of 
> unobservable, hidden states emitting symbols from a finite observable 
> vocabulary set O according to a probabilistic model Theta. The parameters of 
> the model Theta are defined by the tuple (A, B, Pi), where A is a stochastic 
> transition matrix of the hidden states of size |S| X |S|. The elements 
> a_(i,j) of A specify the probability of transitioning from state i to state 
> j. Matrix B is a stochastic symbol emission matrix of size |S| X |O| whose 
> elements b_(s,o) give the probability that symbol o will be emitted from 
> hidden state s. The elements pi_(s) of the |S|-length vector Pi determine 
> the probability that the system starts in hidden state s. The transitions of 
> hidden states are unobservable and follow the Markov property of 
> memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the probability that the model generated the observed 
> sequence. This is useful for evaluating the quality of the model and is 
> solved using the so-called Forward algorithm. 
> 2. Decoding: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the hidden state sequence which most likely generated 
> the observed sequence. This can be viewed as an inference problem where the 
> model and observed sequence are used to predict the values of the 
> unobservable random variables. The Viterbi decoding algorithm is used for 
> predicting the most likely hidden state sequence. 
> 3. Training: Given the set of hidden states S, the set of observation 
> vocabulary O and the observation sequence, determine the parameters (A, B, 
> Pi) of the model Theta. This problem can be viewed as a statistical machine 
> learning problem of model fitting to a large set of training data. The 
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
> the Viterbi training algorithm are commonly used for model fitting. 
> In general, the quality of HMM training can be improved by employing large 
> training vectors, but currently Mahout only supports sequential versions of 
> HMM trainers, which are incapable of scaling. Among the Viterbi and the 
> Baum-Welch training methods, the Baum-Welch algorithm is superior, accurate, 
> and a better candidate for a parallel implementation.
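
To make the Evaluation problem above concrete, here is a minimal, 
self-contained Java sketch. It uses plain arrays and hypothetical names rather 
than Mahout's HMM classes: a toy two-state, two-symbol model is defined by its 
(A, B, Pi) parameters and P(O | model) is computed with the Forward algorithm.

{code}
// Toy HMM and the Forward algorithm for the Evaluation problem.
// Self-contained sketch with hypothetical names; it does not use Mahout's API.
public class ForwardExample {

  // alpha[t][i] = P(o_0 .. o_t, state at time t = i | A, B, Pi)
  static double forwardProbability(double[][] a, double[][] b, double[] pi, int[] obs) {
    int numStates = pi.length;
    double[][] alpha = new double[obs.length][numStates];

    // Initialization: alpha[0][i] = pi[i] * b[i][o_0]
    for (int i = 0; i < numStates; i++) {
      alpha[0][i] = pi[i] * b[i][obs[0]];
    }

    // Induction: alpha[t][j] = (sum_i alpha[t-1][i] * a[i][j]) * b[j][o_t]
    for (int t = 1; t < obs.length; t++) {
      for (int j = 0; j < numStates; j++) {
        double sum = 0.0;
        for (int i = 0; i < numStates; i++) {
          sum += alpha[t - 1][i] * a[i][j];
        }
        alpha[t][j] = sum * b[j][obs[t]];
      }
    }

    // Termination: P(O | model) = sum_i alpha[T-1][i]
    double p = 0.0;
    for (int i = 0; i < numStates; i++) {
      p += alpha[obs.length - 1][i];
    }
    return p;
  }

  public static void main(String[] args) {
    double[][] a  = {{0.7, 0.3}, {0.4, 0.6}};  // transition matrix A, |S| X |S|
    double[][] b  = {{0.9, 0.1}, {0.2, 0.8}};  // emission matrix B, |S| X |O|
    double[]   pi = {0.6, 0.4};                // initial distribution Pi
    int[] obs = {0, 1, 0};                     // observed symbol indices
    System.out.println("P(O | model) = " + forwardProbability(a, b, pi, obs));
  }
}
{code}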

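One standard way to parallelize the Training problem with Map-Reduce is to have 
mappers run the Forward-Backward pass over individual observation sequences and 
emit expected sufficient statistics, and to have a reducer sum those statistics 
and renormalize them into the updated (A, B, Pi). The fragment below is a 
hedged sketch of only the reduce-side re-estimation of the transition matrix; 
the class and method names are hypothetical and are not taken from the attached 
patches. In this scheme, one Map-Reduce pass corresponds to one iteration of 
the sequential EM loop.

{code}
// Hedged sketch of the reduce-side re-estimation step of Baum-Welch.
// Assumes mappers have already emitted, per observation sequence, the expected
// transition counts xi[i][j]; the reducer sums them element-wise and then
// renormalizes each row into the updated transition matrix A. The emission
// matrix B and the initial distribution Pi are re-estimated analogously from
// the expected emission and initial-state counts.
public final class BaumWelchReduceSketch {

  // Element-wise accumulation of one mapper's expected counts into the total.
  static void accumulate(double[][] total, double[][] partial) {
    for (int i = 0; i < total.length; i++) {
      for (int j = 0; j < total[i].length; j++) {
        total[i][j] += partial[i][j];
      }
    }
  }

  // Row-normalize the summed expected transition counts into a stochastic matrix.
  static double[][] reestimateTransitions(double[][] summedXi) {
    int n = summedXi.length;
    double[][] a = new double[n][n];
    for (int i = 0; i < n; i++) {
      double rowSum = 0.0;
      for (int j = 0; j < n; j++) {
        rowSum += summedXi[i][j];
      }
      for (int j = 0; j < n; j++) {
        // expected transitions i -> j divided by expected transitions out of i;
        // fall back to uniform if state i was never visited in the data
        a[i][j] = rowSum > 0.0 ? summedXi[i][j] / rowSum : 1.0 / n;
      }
    }
    return a;
  }
}
{code}
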
[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-10-16 Thread Suneel Marthi (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797541#comment-13797541
 ] 

Suneel Marthi commented on MAHOUT-627:
--

Hi Dhruv,

 Will this be ready for 0.9 (tentatively mid-November)? It would be great if 
you could address Grant's comments above.


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-07-30 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13724029#comment-13724029
 ] 

Grant Ingersoll commented on MAHOUT-627:


Dhruv,

Any chance this can get done?


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-03 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673689#comment-13673689
 ] 

Grant Ingersoll commented on MAHOUT-627:


Hi Dhruv,

Thanks for the response.  We are trying to get 0.8 out in the next week or 
two.  Any help on a short example, as well as updating the code to trunk, 
would be awesome.

Thanks,
Grant


Re: [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-03 Thread Ted Dunning
On Mon, Jun 3, 2013 at 12:11 PM, Dhruv Kumar (JIRA)  wrote:

> When is 0.8 due? I can chip away at this issue in the evenings for the next
> few days and hunt for a short example from the book mentioned above. It
> should take at least a week or two before I can sign off from my side.
>
> There are also unit tests with the trainer which demonstrate that it
> works--the results of Map Reduce based training are identical to the ones
> obtained in the sequential version.
>

Code freeze is the 10th.  If you run, you might make it.


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-03 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673255#comment-13673255
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Grant,

As I understand it, the only blocker for this issue is a small, self-contained 
example which users can run in a reasonable amount of time to see the results. 
The part-of-speech tagger example which I originally adapted for this trainer 
can take hours to converge, and it sometimes fails with arithmetic underflow 
due to the unusually large set of observed states (the observed states are the 
words of the corpus in the POS tagger's model). 

When is 0.8 due? I can chip away at this issue in the evenings for the next 
few days and hunt for a short example from the book mentioned above. It should 
take at least a week or two before I can sign off from my side. 

There are also unit tests with the trainer which demonstrate that it works--the 
results of Map Reduce based training are identical to the ones obtained in the 
sequential version.
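
The arithmetic underflow mentioned above arises because the forward and 
backward variables shrink geometrically with the sequence length and with the 
size of the observation vocabulary. A common remedy, shown here only as an 
illustrative, hypothetical helper and not as part of the MAHOUT-627 patch, is 
to keep probabilities in log space and combine them with a log-sum-exp.

{code}
// Hypothetical helper illustrating log-space arithmetic to avoid the
// underflow described above; not part of the MAHOUT-627 patch.
final class LogSpace {

  // log(exp(x) + exp(y)) computed without leaving log space.
  static double logSum(double x, double y) {
    if (x == Double.NEGATIVE_INFINITY) return y;
    if (y == Double.NEGATIVE_INFINITY) return x;
    double max = Math.max(x, y);
    return max + Math.log(Math.exp(x - max) + Math.exp(y - max));
  }

  // One induction step of the forward recursion in log space:
  // logAlphaNext[j] = logSumExp_i(logAlpha[i] + log a[i][j]) + log b[j][o]
  static double[] forwardStep(double[] logAlpha, double[][] logA, double[][] logB, int symbol) {
    int n = logAlpha.length;
    double[] next = new double[n];
    for (int j = 0; j < n; j++) {
      double acc = Double.NEGATIVE_INFINITY;
      for (int i = 0; i < n; i++) {
        acc = logSum(acc, logAlpha[i] + logA[i][j]);
      }
      next[j] = acc + logB[j][symbol];
    }
    return next;
  }
}
{code}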


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2013-06-02 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13672495#comment-13672495
 ] 

Grant Ingersoll commented on MAHOUT-627:


Dhruv, can you give an update by any chance?  Otherwise, I will remove this from 0.8.


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2012-09-13 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13454823#comment-13454823
 ] 

Grant Ingersoll commented on MAHOUT-627:


Cool, thanks, Dhruv.  Any luck on the examples?


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2012-09-10 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13452295#comment-13452295
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Tim,

Thanks a lot for trying out my patch and providing these examples. Please let 
me know if you need any help or clarification about the API. As I've mentioned 
above, I need a good example to demonstrate the capability, so I'll look at 
your link to see if it fits the need here.


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2012-09-08 Thread Tim Schultz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451334#comment-13451334
 ] 

Tim Schultz commented on MAHOUT-627:


Hi all - I'm hoping I can revive this discussion.

I'm not yet well-versed in Mahout, especially as applied to HMMs, but I find 
the idea quite interesting, especially since I have massive amounts of data 
that would be ideal for it.

I'm more familiar with R, and I know of some example datasets that can be used 
for testing (below).  I've applied Dhruv's patch and am currently rebuilding 
Mahout.  I will see if I can get some of these examples working on my local 
Hadoop instance, but there will be a slight learning curve.

http://134.76.173.220/hmm-with-r/data/

README: http://134.76.173.220/hmm-with-r/data/00_README.txt

Thoughts?

Hope this helps.

-Tim


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2012-06-27 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13402116#comment-13402116
 ] 

Grant Ingersoll commented on MAHOUT-627:


bq. Does anyone else have a recommendation for a HMM training example I can use?

Perhaps look at other HMM packages?


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2012-06-23 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13400037#comment-13400037
 ] 

Dhruv Kumar commented on MAHOUT-627:


Sorry for being MIA for a while. I have relocated to SF and was extremely busy 
coming up to speed with my new job. That being said, I do want to work on this, 
maintain it and make sure that this feature makes it to Mahout's trunk. 

This example is not entirely suitable for demonstrating the MR version of HMM 
training. The state space is very large, which causes underflow very easily. 

I'm searching for a good example for this feature. 

Does anyone else have a recommendation for a HMM training example I can use? 


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2012-02-28 Thread Grant Ingersoll (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13218074#comment-13218074
 ] 

Grant Ingersoll commented on MAHOUT-627:


Dhruv, any luck on the example issue?


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-12-19 Thread Grant Ingersoll (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13172658#comment-13172658
 ] 

Grant Ingersoll commented on MAHOUT-627:


Once I get past the example issue, I think this is ready to go for the most 
part.

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-12-18 Thread Grant Ingersoll (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13171950#comment-13171950
 ] 

Grant Ingersoll commented on MAHOUT-627:


I'm getting
{quote}
11/12/18 17:21:02 WARN mapred.LocalJobRunner: job_local_0010
java.lang.IllegalArgumentException: The output state probability from hidden 
state 0 to output state 2 is negative
at 
com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
at 
org.apache.mahout.classifier.sequencelearning.hmm.HmmUtils.validate(HmmUtils.java:187)
at 
org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BaumWelchMapper.setup(BaumWelchMapper.java:111)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
11/12/18 17:21:03 INFO mapred.JobClient:  map 0% reduce 0%
11/12/18 17:21:03 INFO mapred.JobClient: Job complete: job_local_0010
11/12/18 17:21:03 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: Baum-Welch Iteration 
failed processing tmp/output/model-9
at 
org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BaumWelchDriver.runIteration(BaumWelchDriver.java:315)
at 
org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BaumWelchDriver.runBaumWelchMR(BaumWelchDriver.java:253)
at 
org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BWPosTagger.trainModel(BWPosTagger.java:293)
at 
org.apache.mahout.classifier.sequencelearning.hmm.hadoop.BWPosTagger.main(BWPosTagger.java:364)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:188)
{quote}
when running the example.  Otherwise, all tests pass.
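
For context, the check that fails in HmmUtils.validate is a Guava Preconditions 
guard over the emission matrix. The following is an illustrative reconstruction 
of a check of the same shape (not the actual Mahout code); it shows what the 
"output state probability ... is negative" message above means: some entry of 
the emission matrix B handed to the mapper's setup is below zero.

{code}
import com.google.common.base.Preconditions;

public class EmissionMatrixCheck {

  /** bMatrix[s][o] = probability of emitting symbol o from hidden state s. */
  static void validateEmissions(double[][] bMatrix) {
    for (int s = 0; s < bMatrix.length; s++) {
      for (int o = 0; o < bMatrix[s].length; o++) {
        Preconditions.checkArgument(bMatrix[s][o] >= 0,
            "The output state probability from hidden state %s to output state %s is negative",
            s, o);
      }
    }
  }

  public static void main(String[] args) {
    // A non-negative (row-stochastic) emission matrix passes the check.
    double[][] b = {{0.5, 0.3, 0.2}, {0.1, 0.9, 0.0}};
    validateEmissions(b);
    System.out.println("Emission matrix is valid.");
  }
}
{code}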

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-11-08 Thread Suneel Marthi (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146479#comment-13146479
 ] 

Suneel Marthi commented on MAHOUT-627:
--

While reviewing the code in BaumWelchTrainer.java, I noticed that we have a 
bunch of System.out.println() statements. The code needs some cleanup to 
replace these with SLF4J logger calls.
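
For illustration, the change would look like the sketch below; the reporting 
method and its message are hypothetical, not taken from the patch, and only 
show the System.out.println-to-SLF4J substitution.

{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public final class BaumWelchTrainer {

  private static final Logger log = LoggerFactory.getLogger(BaumWelchTrainer.class);

  // Hypothetical reporting method, shown only to illustrate the cleanup.
  private void reportIteration(int iteration, double logLikelihood) {
    // Was: System.out.println("Iteration " + iteration + ": " + logLikelihood);
    log.info("Iteration {}: log-likelihood {}", iteration, logLikelihood);
  }
}
{code}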

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-10-31 Thread Grant Ingersoll (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140149#comment-13140149
 ] 

Grant Ingersoll commented on MAHOUT-627:


I'm going to look to commit this soon after ApacheCon (or perhaps during)

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-09-09 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101614#comment-13101614
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Grant,

Sorry, I was caught up with job interviews and turning in the graduation 
documents.

Here is the patch with the changes that you wanted:

1. Created a new shell script which automatically downloads the training and 
test sets. If the sets are already present, it skips the download (a small 
sketch of this behavior follows below).

2. Modified the POS tagger example code to avoid the download and to accept the 
training and test sets via command line arguments. These arguments are passed 
in by the script from item 1.
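
The actual artifact is a shell script; the Java sketch below only illustrates 
the "download only if absent" behavior described in item 1, with a placeholder 
local path and URL.

{code}
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FetchIfAbsent {

  public static void main(String[] args) throws Exception {
    Path target = Paths.get("tmp/pos-training-set.txt");              // hypothetical local path
    URL source = new URL("http://example.org/pos-training-set.txt");  // placeholder URL

    if (Files.exists(target)) {
      System.out.println("Already present, skipping download: " + target);
      return;
    }
    Files.createDirectories(target.getParent());
    try (InputStream in = source.openStream()) {
      Files.copy(in, target);
    }
  }
}
{code}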

Please let me know if you need further changes to make it commit ready!

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-09-07 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13099283#comment-13099283
 ] 

Grant Ingersoll commented on MAHOUT-627:


Dhruv, any progress on the last pieces here?  I'd like to see this get 
committed relatively soon.

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-24 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13090249#comment-13090249
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Grant,

Thanks for the feedback and for fixing the code style.

I will create a script for the example and have it download the test data only 
if it is not already present. 

In your testing, if you come across any corner case that my testing has missed, 
please let me know. I can add a test for it and refactor the code to eliminate 
the bug. I am traveling until Saturday for a job interview in Seattle, but I 
should be able to roll out the patch soon after that!

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-22 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088828#comment-13088828
 ] 

Dhruv Kumar commented on MAHOUT-627:


In its current form, one should follow these steps to use the trainer:

1) Encode the hidden and emitted state tags to integer ids and store them as 
MapWritable SequenceFiles. The MapWritableCache.Save() method comes in handy 
here.
2) Convert the input sequence to the integer ids given by the emitted states 
map from step 1. Wrap the input sequence as a VectorWritable.
3) Store the VectorWritable obtained in step 2 in a sequence file whose key is 
an arbitrary LongWritable and whose value is the integer sequence (see the 
sketch after this list). This forms the input to the trainer.
4) (Optional) Use the BaumWelchUtils.BuildHmmModelFromDistributions() method to 
store an initial model with the given distributions. The model is stored as a 
SequenceFile containing MapWritables as the distributions.
5) Invoke the trainer via the command line or through the API by calling the 
driver's run() method. Users can select the log-scaled variant by setting 
logScaled to true, and can specify the convergence delta, the maximum number of 
iterations, etc. If step 4 is omitted, users must ask the program to create a 
random initial model by setting buildRandom to true. This starts the iterative 
training using Maximum Likelihood Estimation.
6) At the end of training, an HmmModel is stored as a MapWritable with 
probability distributions encoded as DoubleWritables. The utility method 
BaumWelchUtils.CreateHmmModel(Path) can be used to decode the result and obtain 
the HmmModel.
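
A minimal sketch of step 3 under the input layout described above, using the 
standard Hadoop and Mahout writables. The paths and the example symbol ids are 
placeholders, and the surrounding driver invocation is omitted.

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.VectorWritable;

public class ObservationSequenceWriter {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path("tmp/input/observations"); // placeholder path

    // Observation symbols already mapped to integer ids (steps 1 and 2).
    double[] encodedSequence = {0, 2, 1, 1, 3, 0, 2};

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, input, LongWritable.class, VectorWritable.class);
    try {
      // The key is arbitrary; the value is the whole encoded observation sequence.
      writer.append(new LongWritable(0L),
          new VectorWritable(new DenseVector(encodedSequence)));
    } finally {
      writer.close();
    }
  }
}
{code}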


---Design Discussion---

The design uses MapWritables and SequenceFiles to freely convert between the 
legacy HmmModel and a serializable variant which also encodes the probability 
distributions. This design choice had the following advantages:

1. I could leverage a lot of the existing functionality of the legacy 
sequential HMM code by writing utility methods to encode and decode (the 
BaumWelchUtils class was made for this purpose).

2. The users ultimately get the legacy HmmModel at the end of step 6, and can 
then use it to decode the test sequence using the 
HmmAlgorithms.decodeSequence() method. They also have at their disposal all the 
other methods provided by the legacy code.

3. Since the trained model is persisted in a SequenceFile, one can store these 
models for future reference and use BaumWelchUtils.CreateHmmModel(Path) later 
to decode them and compare them with other trained models (possibly trained 
with different initial seed values).


> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-22 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13088789#comment-13088789
 ] 

Dhruv Kumar commented on MAHOUT-627:


First MapReduce based open-source Baum-Welch HMM Trainer!

I have attached the patch for inclusion into the trunk, in keeping with the 
"firm pencils down" date.

It is complete with all the deliverables listed in the project's timeline: unit 
tests, documentation, and a POS example. Specifically, the following 
improvements have taken place in the last 5 days:

1. Created a new log-scaled training variant by refactoring the mapper, 
combiner, reducer and driver classes (and added a unit test for the same). The 
option for log scaling can be invoked via the command line and causes the 
trainer to operate in log space between mapper -> combiner -> reducer (see the 
log-space sketch after this list). This should provide numerical stability for 
extremely long sequences or large state spaces (or both).

2. Added a scalable Map-Reduce based parts-of-speech tagger which uses the 
log-scaled training.

3. (Minor) Changed the input format from IntArrayWritable to Mahout's 
VectorWritable.
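
A minimal numerical sketch (not code from the patch) of why the log-scaled 
variant helps: partial expected counts are combined with log-sum-exp so that 
quantities which would underflow in linear probability space remain 
representable across the mapper, combiner and reducer.

{code}
public final class LogSpaceMath {

  /** Returns log(exp(logA) + exp(logB)) without leaving log space. */
  static double logAdd(double logA, double logB) {
    if (logA == Double.NEGATIVE_INFINITY) {
      return logB;
    }
    if (logB == Double.NEGATIVE_INFINITY) {
      return logA;
    }
    double max = Math.max(logA, logB);
    return max + Math.log(Math.exp(logA - max) + Math.exp(logB - max));
  }

  public static void main(String[] args) {
    // Two partial expected counts far too small to represent in linear space.
    double logCount1 = -745.0;
    double logCount2 = -746.0;
    System.out.println("combined (log space): " + logAdd(logCount1, logCount2));
  }
}
{code}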

It will be awesome to get some feedback on the code, functionality, design, 
etc.

I'm eager to keep improving the trainer, fixing any bugs as they arise, and 
making it more useful for users!

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-16 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13085921#comment-13085921
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Grant,

I have uploaded the first candidate patch for this issue's resolution and it 
will be great to get some feedback on it from you and the dev community. It 
contains:

1. Complete individual unit tests for the mapper, combiner and reducer to 
verify accurate summarization, normalization, and probability matrix and vector 
lengths (see the sketch after this list).
2. Unit tests for the overall trainer. The trained model's probability values 
are validated against Mahout's sequential HMM implementation (which in turn 
used the R and Matlab HMM packages for validation).
3. Documentation for each of the 8 classes under the new 
classifier.sequencelearning.baumwelchmapreduce package.
4. Command-line Baum-Welch MapReduce training utilities, which can be invoked 
using "bin/mahout hmmBaumWelchMapReduce ". The driver.classes.props file was 
modified for the same.
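
A toy check in the spirit of the normalization tests in item 1 (not the patch's 
test code), assuming row-stochastic transition and emission matrices: after 
normalization, every row of expected counts must sum to 1 within a small 
tolerance.

{code}
public class StochasticRowCheck {

  /** Asserts that every row of the matrix sums to 1 within epsilon. */
  static void assertRowStochastic(double[][] matrix, double epsilon) {
    for (double[] row : matrix) {
      double sum = 0.0;
      for (double p : row) {
        sum += p;
      }
      if (Math.abs(sum - 1.0) > epsilon) {
        throw new AssertionError("Row does not sum to 1: " + sum);
      }
    }
  }

  public static void main(String[] args) {
    double[][] transition = {{0.7, 0.3}, {0.4, 0.6}};
    assertRowStochastic(transition, 1e-6);
    System.out.println("All rows are stochastic.");
  }
}
{code}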


On my system, which is an aging Pentium 4, the unit tests for 
baumwelchmapreduce took 57 seconds to complete.

Please let me know what you think and where things can be improved. I will be 
refactoring this based on your and others' feedback until the firm pencils-down 
date next week, on Monday the 22nd.

Thank you.
Dhruv


> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch, MAHOUT-627.patch
>
>

Re: [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-14 Thread Ted Dunning
Post a patch early so you can get feedback before it is too late to make use of 
it. 

Sent from my iPad

On Aug 14, 2011, at 5:17 AM, Dhruv Kumar  wrote:

> Very good actually, just tying up some loose ends and I will be putting up
> the patch with all the unit tests and the documentation soon!
> 
> I have to just debug the slf4j issue which I have discussed on the mailing
> list with Sean and Ted...
> 
> I still need the rest of this week to tidy up the documentation a little
> bit.
> 
> On Sun, Aug 14, 2011 at 7:25 AM, Grant Ingersoll (JIRA) 
> wrote:
> 
>> 
>>   [
>> https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084820#comment-13084820]
>> 
>> Grant Ingersoll commented on MAHOUT-627:
>> 
>> 
>> Hey Dhruv, nearing pencils down, how are we doing?
>> 
>>> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model
>> Training.
>>> 
>> -
>>> 
>>>Key: MAHOUT-627
>>>URL: https://issues.apache.org/jira/browse/MAHOUT-627
>>>Project: Mahout
>>> Issue Type: Task
>>> Components: Classification
>>>   Affects Versions: 0.4, 0.5
>>>   Reporter: Dhruv Kumar
>>>   Assignee: Grant Ingersoll
>>> Labels: gsoc, gsoc2011, mahout-gsoc-11
>>>Fix For: 0.6
>>> 
>>>Attachments: MAHOUT-627.patch, MAHOUT-627.patch,
>> MAHOUT-627.patch, MAHOUT-627.patch
>>> 
>>> 

Re: [jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-14 Thread Dhruv Kumar
Very good actually, just tying up some loose ends and I will be putting up
the patch with all the unit tests and the documentation soon!

I have to just debug the slf4j issue which I have discussed on the mailing
list with Sean and Ted...

I still need the rest of this week to tidy up the documentation a little
bit.

On Sun, Aug 14, 2011 at 7:25 AM, Grant Ingersoll (JIRA) wrote:

>
>[
> https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084820#comment-13084820]
>
> Grant Ingersoll commented on MAHOUT-627:
> 
>
> Hey Dhruv, nearing pencils down, how are we doing?
>
> > Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model
> Training.
> >
> -
> >
> > Key: MAHOUT-627
> > URL: https://issues.apache.org/jira/browse/MAHOUT-627
> > Project: Mahout
> >  Issue Type: Task
> >  Components: Classification
> >Affects Versions: 0.4, 0.5
> >Reporter: Dhruv Kumar
> >Assignee: Grant Ingersoll
> >  Labels: gsoc, gsoc2011, mahout-gsoc-11
> > Fix For: 0.6
> >
> > Attachments: MAHOUT-627.patch, MAHOUT-627.patch,
> MAHOUT-627.patch, MAHOUT-627.patch
> >
> >

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-14 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13084820#comment-13084820
 ] 

Grant Ingersoll commented on MAHOUT-627:


Hey Dhruv, nearing pencils down, how are we doing?

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-08 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080981#comment-13080981
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Grant,

I am finishing up some documentation and a few tests. The unit testing has led 
to a lot of refactoring in the BaumWelchMapper and the BaumWelchUtils classes, 
which was somewhat expected. I should be able to wrap this up before the 
pencils-down deadline, though, with an example of POS tagging to follow.



> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-08-08 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13080901#comment-13080901
 ] 

Grant Ingersoll commented on MAHOUT-627:


Dhruv,

Any luck yet on the unit tests?

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-07-13 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065022#comment-13065022
 ] 

Dhruv Kumar commented on MAHOUT-627:


Uploaded a new patch with refactorings and miscellaneous improvements. This 
concludes the chain's implementation and testing with manual inputs. The 
trainer works and provides a scalable variant of Baum-Welch. The next phase of 
the project will entail more testing of the chain via unit tests and 
implementation of the log-scaled variant.

The next patch will contain more documentation and unit tests for some of the 
methods of the trainer.
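As a side note on what the log-scaled variant involves (a minimal sketch of the 
standard technique, not code from the patch): the forward recursion can be 
carried out entirely in log space with a log-sum-exp over the predecessor 
states, which avoids the numerical underflow that plain probabilities suffer on 
long observation sequences. Plain double[][] arrays stand in here for the 
HmmModel matrices.

  // Illustrative log-space forward pass; logA[i][j], logB[s][o] and logPi[s] are the
  // log-transition, log-emission and log-initial probabilities, obs is the encoded sequence.
  static double[][] logForward(double[][] logA, double[][] logB, double[] logPi, int[] obs) {
    int numStates = logPi.length;
    double[][] alpha = new double[obs.length][numStates];
    for (int s = 0; s < numStates; s++) {
      alpha[0][s] = logPi[s] + logB[s][obs[0]];
    }
    for (int t = 1; t < obs.length; t++) {
      for (int j = 0; j < numStates; j++) {
        // log-sum-exp over predecessor states to avoid underflow
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < numStates; i++) {
          max = Math.max(max, alpha[t - 1][i] + logA[i][j]);
        }
        if (max == Double.NEGATIVE_INFINITY) {
          alpha[t][j] = Double.NEGATIVE_INFINITY;
          continue;
        }
        double sum = 0.0;
        for (int i = 0; i < numStates; i++) {
          sum += Math.exp(alpha[t - 1][i] + logA[i][j] - max);
        }
        alpha[t][j] = max + Math.log(sum) + logB[j][obs[t]];
      }
    }
    return alpha;
  }

The backward pass and the per-state expected counts follow the same pattern.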

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch, 
> MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-07-11 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063373#comment-13063373
 ] 

Dhruv Kumar commented on MAHOUT-627:


Uploaded a new patch after a week's worth of testing:

- Bug fixes for a few corner cases.
- Refactoring of the BaumWelchUtils and BaumWelchMapper classes.
- Added verbose loggers for debugging.




> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-06-28 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056923#comment-13056923
 ] 

Dhruv Kumar commented on MAHOUT-627:


Thanks, Sergey.

I ended up creating a serialized version of HmmModel too. However, for the time 
being I'm not going to use it, per my design, which relies heavily on 
MapWritables for storing the distributions with a large key space equal to 
2|S| + 1, where |S| is the number of hidden states.

The sequential command-line utils are convenient for validating the results 
against the MapReduce variant, so they are definitely useful in my case.
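For that validation workflow, a small helper along these lines (hypothetical, 
not part of the patch) can load the Text -> MapWritable distribution records 
written by the MapReduce trainer so they can be compared entry by entry with 
the matrices produced by the sequential trainer:

  import java.io.IOException;
  import java.util.HashMap;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class ModelValidationSketch {

    // Hypothetical validation helper: read one Sequence File of distribution records
    // (one MapWritable per hidden state plus the initial distribution) into memory.
    public static Map<String, MapWritable> loadDistributions(Path modelFile, Configuration conf)
        throws IOException {
      FileSystem fs = FileSystem.get(conf);
      Map<String, MapWritable> distributions = new HashMap<String, MapWritable>();
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, modelFile, conf);
      try {
        Text key = new Text();
        MapWritable value = new MapWritable();
        while (reader.next(key, value)) {
          distributions.put(key.toString(), value);
          // allocate fresh instances so each record keeps its own contents
          key = new Text();
          value = new MapWritable();
        }
      } finally {
        reader.close();
      }
      return distributions;
    }
  }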

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-06-28 Thread Sergey Bartunov (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056800#comment-13056800
 ] 

Sergey Bartunov commented on MAHOUT-627:


Hey, Dhruv. I submitted some code in 
https://issues.apache.org/jira/browse/MAHOUT-734 which contains an HmmModel 
serialization utility and command-line tools for the sequential HMM 
functionality; it could be integrated into your code.

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-06-28 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13056782#comment-13056782
 ] 

Dhruv Kumar commented on MAHOUT-627:


Uploaded a new patch:

1. Added a new method for building a random HmmModel, wrapping the distribution 
matrices' rows as MapWritable types, and writing them to the specified directory 
in Sequence File format (see the sketch below).

2. Updated common/DefaultOptionCreator for the new option in #1. Also added an 
option for the user to specify the directory containing a pre-written HmmModel 
object (as a Sequence File containing MapWritable values).

3. Updated the driver class to accommodate #1 and #2.
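A minimal sketch of the model-writing step in #1, with plain double[][] 
matrices standing in for the HmmModel accessors and purely illustrative key 
names (the patch itself may structure this differently):

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.MapWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class ModelWriterSketch {

    // Persist each distribution row as a Text -> MapWritable record in a Sequence File.
    public static void writeModel(double[][] transition, double[][] emission, double[] initial,
                                  Configuration conf, Path out) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, Text.class, MapWritable.class);
      try {
        for (int s = 0; s < transition.length; s++) {
          writer.append(new Text("TRANSIT_" + s), rowToMap(transition[s]));
          writer.append(new Text("EMIT_" + s), rowToMap(emission[s]));
        }
        writer.append(new Text("INITIAL"), rowToMap(initial));
      } finally {
        writer.close();
      }
    }

    // Pack one distribution row into a MapWritable keyed by column index.
    private static MapWritable rowToMap(double[] row) {
      MapWritable map = new MapWritable();
      for (int i = 0; i < row.length; i++) {
        map.put(new IntWritable(i), new DoubleWritable(row[i]));
      }
      return map;
    }
  }

The key names are placeholders; the important part is one MapWritable per 
distribution row, giving the 2|S| + 1 records per model.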

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch, MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-06-26 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13055329#comment-13055329
 ] 

Dhruv Kumar commented on MAHOUT-627:


Hi Grant, 

Since my last update, I have finished the first round of implementation of the 
end-to-end functionality, resulting in 7 new classes under the 
classifier.sequencelearning.baumwelchmapreduce package. The attached patch 
contains the following:

1. BaumWelchDriver, BaumWelchMapper, BaumWelchCombiner and BaumWelchReducer.
2. MapWritableCache, a general class to load MapWritable files from HDFS. 
3. BaumWelchUtils, a utility class for constructing the legacy HmmModel objects 
from a given HDFS directory containing the probability distributions (emission, 
transition and initial) as MapWritable types, stored as Sequence Files. 
4. BaumWelchModel, a serializable version of HmmModel.

In addition to these classes, the patch also adds support for command-line HMM 
training using the Baum-Welch MapReduce variant by modifying 
common/commandline/DefaultOptionCreator.java and the 
src/conf/driver.classes.props files.

In order to maximize parallelization, a total of 2|S| + 1 keys are emitted, 
where |S| is the number of hidden states. There are |S| keys, one for each 
hidden state, whose values contain the expected transition counts as a 
MapWritable object. Similarly, there are |S| keys with corresponding 
MapWritable values encoding the emission probability distribution. Finally, 
there is one key with a MapWritable value encoding the initial probability 
distribution vector. The large key space permits the use of more reducers, 
with each key being processed separately by one reducer in the best case.
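To make the key layout concrete, here is a skeletal illustration (class name, 
key strings and value packing are placeholders; the actual BaumWelchMapper 
computes the expected counts with a forward-backward pass before emitting 
anything):

  import java.io.IOException;

  import org.apache.hadoop.io.ArrayWritable;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.MapWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Skeleton showing only the 2|S| + 1 key layout; the expected-count computation is elided.
  public class KeyLayoutSketch extends Mapper<LongWritable, ArrayWritable, Text, MapWritable> {

    // Emits |S| transition records, |S| emission records and one initial-distribution record.
    void emitExpectedCounts(double[][] transitionCounts, double[][] emissionCounts,
                            double[] initialCounts, Context context)
        throws IOException, InterruptedException {
      int numStates = initialCounts.length;
      for (int s = 0; s < numStates; s++) {
        context.write(new Text("TRANSIT_" + s), rowToMap(transitionCounts[s]));
        context.write(new Text("EMIT_" + s), rowToMap(emissionCounts[s]));
      }
      context.write(new Text("INITIAL"), rowToMap(initialCounts));
    }

    // Pack one row of expected counts into a MapWritable keyed by column index.
    private static MapWritable rowToMap(double[] row) {
      MapWritable map = new MapWritable();
      for (int i = 0; i < row.length; i++) {
        map.put(new IntWritable(i), new DoubleWritable(row[i]));
      }
      return map;
    }
  }

Since each key can be routed to a different reducer, up to 2|S| + 1 reducers 
can work on a single model update in parallel.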

The input and output formats are of Sequence File type. The input keys are 
LongWritable and the values are ArrayWritable objects containing the int[] 
observations, where each int in the observed sequence is a user-defined mapping 
of the training set's tokens. The reducers write the distributions as Sequence 
Files with keys of type Text and values of type MapWritable. I have extensively 
re-used the current sequential variant of HMM training, and a lot of my design 
decisions w.r.t. input and output types were guided by the legacy code's API.

Now that the driver-mapper-combiner-reducer chain's preliminary implementation 
is complete, the rest of the time will be actively spent in testing, debugging 
and refinement of the new trainer's features. In particular, I'm looking at 
alternative types to ArrayWritable for wrapping the observation sequence given 
to the mappers. For text mining, a simple mapping which encodes tokens in the 
text corpus as integer symbols could be performed, as the current design 
expects. However, I'm not sure whether this approach is the best w.r.t. 
scalability, or whether it is applicable at all to domains other than 
Information Retrieval that require scalable HMM training. I'm aware that a lot 
of other algorithms in Mahout require the input in the form of Vectors packed 
into a Sequence File, and it will be useful to get feedback on this issue.
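As one concrete way to prepare trainer input in the format described above 
(the dictionary handling and helper names here are hypothetical, not the 
patch's code):

  import java.io.IOException;
  import java.util.Map;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.ArrayWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;

  public class ObservationInputSketch {

    // Write each observation sequence as a (LongWritable id, ArrayWritable of IntWritable)
    // record, mapping tokens to integer symbols through a user-supplied dictionary.
    // Assumes every token is present in the dictionary.
    public static void writeSequences(Iterable<String[]> tokenizedSequences,
                                      Map<String, Integer> dictionary,
                                      Configuration conf, Path out) throws IOException {
      FileSystem fs = FileSystem.get(conf);
      SequenceFile.Writer writer =
          SequenceFile.createWriter(fs, conf, out, LongWritable.class, ArrayWritable.class);
      try {
        long id = 0;
        for (String[] tokens : tokenizedSequences) {
          IntWritable[] encoded = new IntWritable[tokens.length];
          for (int i = 0; i < tokens.length; i++) {
            encoded[i] = new IntWritable(dictionary.get(tokens[i]));
          }
          writer.append(new LongWritable(id++), new ArrayWritable(IntWritable.class, encoded));
        }
      } finally {
        writer.close();
      }
    }
  }

One caveat with plain ArrayWritable: reading the values back requires a 
Writable value class with a no-argument constructor, so in practice a small 
ArrayWritable subclass (or a VectorWritable, if the input were switched to 
Mahout Vectors as other jobs expect) would be used; the sketch only shows the 
writing side.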

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
> Attachments: MAHOUT-627.patch
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-06-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13054871#comment-13054871
 ] 

Grant Ingersoll commented on MAHOUT-627:


Hi Dhruv,

How goes progress on this?

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4, 0.5
>Reporter: Dhruv Kumar
>Assignee: Grant Ingersoll
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
> Fix For: 0.6
>
>

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-04-06 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016505#comment-13016505
 ] 

Dhruv Kumar commented on MAHOUT-627:


Updated the abstract.

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4
>Reporter: Dhruv Kumar
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: dku...@ecs.umass.edu 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
> because of its superior numerical stability and its ability to guarantee the 
> discovery of a locally maximum,  Maximum Likelihood Estimator, in the 
> presence of incomplete training data. Currently, Apache Mahout has a 
> sequential implementation of the Baum-Welch which cannot be scaled to train 
> over large data sets. This restriction reduces the quality of training and 
> constrains generalization of the learned model in production environments. 
> This project proposes to extend Mahout's sequential implementation of the 
> Baum-Welch to a parallel, distributed version using the Map-Reduce 
> programming framework to allow training at a large scale for enhanced model 
> fitting. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
> for applications generating temporal or spatial sequential data. Relative 
> simplicity of implementation, combined with their ability to discover latent 
> domain knowledge have made them very popular in diverse fields such as DNA 
> sequence alignment, gene discovery, handwriting analysis, voice recognition, 
> computer vision, language translation and parts-of-speech tagging. 
> A HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
> unobservable, hidden states emitting symbols from a finite observable 
> vocabulary set O according to a probabilistic model Theta. The parameters of 
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
> transition matrix of the hidden states of size |S| X |S|. The elements 
> a_(i,j) of A specify the probability of transitioning from a state i to state 
> j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
> elements b_(s, o) provide the probability that a symbol o will be emitted 
> from the hidden state s. The elements pi_(s) of the |S| length vector Pi 
> determine the probability that the system starts in the hidden state s. The 
> transitions of hidden states are unobservable and follow the Markov property 
> of memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the probability that the model generated the observed 
> sequence. This is useful for evaluating the quality of the model and is 
> solved using the so-called Forward algorithm (a small sketch of the forward 
> recursion appears after this list). 
> 2. Decoding: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the hidden state sequence which generated the observed 
> sequence. This can be viewed as an inference problem where the model and 
> observed sequence are used to predict the value of the unobservable random 
> variables. The Viterbi decoding algorithm is used for predicting the most 
> likely hidden state sequence. 
> 3. Training: Given the set of hidden states S, the set of observation 
> vocabulary O and the observation sequence, determine the parameters (A, B, 
> Pi) of the model Theta. This problem can be viewed as a statistical machine 
> learning problem of model fitting to a large set of training data. The 
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
> the Viterbi training algorithm are commonly used for model fitting. 
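
As an aside on problem 1, the Forward algorithm is a short dynamic-programming 
recursion over "alpha" values. The sketch below is a minimal, unscaled version 
built on the hypothetical SimpleHmm structure shown earlier; it is illustrative 
only, and a real implementation would rescale alpha at every step (or work in 
log space) to avoid floating-point underflow on long sequences. 

```java
// Minimal, unscaled forward algorithm: returns P(observations | model).
// Illustrative only; production code rescales alpha or uses log space.
public final class HmmEvaluation {

  public static double forwardProbability(SimpleHmm hmm, int[] observations) {
    int numStates = hmm.numStates();
    double[] alpha = new double[numStates];

    // Initialization: alpha_1(i) = pi_i * b_i(o_1)
    for (int i = 0; i < numStates; i++) {
      alpha[i] = hmm.pi[i] * hmm.b[i][observations[0]];
    }

    // Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for (int t = 1; t < observations.length; t++) {
      double[] next = new double[numStates];
      for (int j = 0; j < numStates; j++) {
        double sum = 0.0;
        for (int i = 0; i < numStates; i++) {
          sum += alpha[i] * hmm.a[i][j];
        }
        next[j] = sum * hmm.b[j][observations[t]];
      }
      alpha = next;
    }

    // Termination: P(O | Theta) = sum_i alpha_T(i)
    double total = 0.0;
    for (double v : alpha) {
      total += v;
    }
    return total;
  }
}
```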
> In general, the quality of HMM training can be improved by employing large 
> training data sets, but currently Mahout only supports sequential versions of 
> HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch 
> training methods, the Baum-Welch algorithm is the more accurate and the 
> better candidate for a parallel implementation, for two reasons: 
> (1) The BW is numerically stable and provides a guaranteed discovery of a 
> locally maximal Maximum Likelihood Estimator (MLE) for the model's parameters 
> (Theta). In Viterbi training, the MLE is approximated in order to reduce 
> computation time. 

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-04-06 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016424#comment-13016424
 ] 

Dhruv Kumar commented on MAHOUT-627:


Updated JIRA proposal: added references, fixed one typo, made the task list 
clearer. Updated on GSoC.

> Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.
> -
>
> Key: MAHOUT-627
> URL: https://issues.apache.org/jira/browse/MAHOUT-627
> Project: Mahout
>  Issue Type: Task
>  Components: Classification
>Affects Versions: 0.4
>Reporter: Dhruv Kumar
>  Labels: gsoc, gsoc2011, mahout-gsoc-11
>
> Proposal Title: Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov 
> Model Training. 
> Student Name: Dhruv Kumar 
> Student E-mail: dku...@ecs.umass.edu 
> Organization/Project: Apache Mahout 
> Assigned Mentor: 
> Proposal Abstract: 
> The Baum-Welch algorithm is commonly used for training a Hidden Markov Model 
> as it is numerically stable and provides a guaranteed discovery of the 
> Maximum Likelihood Estimator in the presence of incomplete data. Currently, 
> Apache Mahout has a sequential implementation of the Baum-Welch algorithm, 
> which cannot be scaled to train over large data sets. This project proposes 
> to extend the 
> sequential implementation of the Baum-Welch to a parallel, distributed 
> version using the Map Reduce programming framework to allow scalable Hidden 
> Markov Model training. 
> Detailed Description: 
> Hidden Markov Models (HMMs) are widely used as a probabilistic inference tool 
> for applications generating temporal or spatial sequential data. Their 
> relative simplicity of implementation and their ability to discover latent 
> domain knowledge have made them very popular in fields such as DNA sequence 
> alignment, handwriting analysis, voice recognition, computer vision and 
> parts-of-speech tagging. 
> An HMM is defined as a tuple (S, O, Theta) where S is a finite set of 
> unobservable, hidden states emitting symbols from a finite observable 
> vocabulary set O according to a probabilistic model Theta. The parameters of 
> the model Theta are defined by the tuple (A, B, Pi) where A is a stochastic 
> transition matrix of the hidden states of size |S| X |S|. The elements 
> a_(i,j) of A specify the probability of transitioning from a state i to state 
> j. Matrix B is a size |S| X |O| stochastic symbol emission matrix whose 
> elements b_(s, o) provide the probability that a symbol o will be emitted 
> from the hidden state s. The elements pi_(s) of the |S|-length vector Pi 
> determine the probability that the system starts in the hidden state s. The 
> transitions of hidden states are unobservable and follow the Markov property 
> of memorylessness. 
> Rabiner [1] defined three main problems for HMMs: 
> 1. Evaluation: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the probability that the model generated the observed 
> sequence. This is useful for determining the quality of the model and is 
> solved using the so-called Forward algorithm. 
> 2. Decoding: Given the complete model (S, O, Theta) and an observation 
> sequence, determine the hidden state sequence which generated the observed 
> sequence. This can be viewed as an inference problem where the model and 
> observed sequence are used to predict the value of the unobservable random 
> variables. The Viterbi decoding algorithm is used for predicting the most 
> likely hidden state sequence (a small sketch of the Viterbi recursion appears 
> after this list). 
> 3. Training: Given the set of hidden states S, the set of observation 
> vocabulary O and the observation sequence, determine the parameters (A, B, 
> Pi) of the model Theta. This problem can be viewed as a statistical machine 
> learning problem of model fitting to a large set of training data. The 
> Baum-Welch (BW) algorithm (also called the Forward-Backward algorithm) and 
> the Viterbi training algorithm are commonly used for model fitting. 
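
As an aside on problem 2, decoding is solved by the Viterbi recursion. The sketch 
below is a minimal version built on the hypothetical SimpleHmm structure from the 
earlier sketch; it is illustrative only and, like the forward sketch, ignores the 
log-space arithmetic a robust implementation would use. 

```java
// Minimal Viterbi decoding: returns the most likely hidden state sequence
// for the given observations. Illustrative only; real code works in log space.
public final class HmmDecoding {

  public static int[] viterbiDecode(SimpleHmm hmm, int[] observations) {
    int numStates = hmm.numStates();
    int length = observations.length;
    double[][] delta = new double[length][numStates]; // best path probability so far
    int[][] backPointer = new int[length][numStates]; // argmax predecessor state

    // Initialization: delta_1(i) = pi_i * b_i(o_1)
    for (int i = 0; i < numStates; i++) {
      delta[0][i] = hmm.pi[i] * hmm.b[i][observations[0]];
    }

    // Recursion: delta_t(j) = max_i (delta_{t-1}(i) * a_ij) * b_j(o_t)
    for (int t = 1; t < length; t++) {
      for (int j = 0; j < numStates; j++) {
        double best = -1.0;
        int bestState = 0;
        for (int i = 0; i < numStates; i++) {
          double candidate = delta[t - 1][i] * hmm.a[i][j];
          if (candidate > best) {
            best = candidate;
            bestState = i;
          }
        }
        delta[t][j] = best * hmm.b[j][observations[t]];
        backPointer[t][j] = bestState;
      }
    }

    // Backtrack from the most likely final state.
    int[] path = new int[length];
    int last = 0;
    for (int i = 1; i < numStates; i++) {
      if (delta[length - 1][i] > delta[length - 1][last]) {
        last = i;
      }
    }
    path[length - 1] = last;
    for (int t = length - 1; t > 0; t--) {
      path[t - 1] = backPointer[t][path[t]];
    }
    return path;
  }
}
```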
> In general, the quality of HMM training can be improved by employing large 
> training data sets, but currently Mahout only supports sequential versions of 
> HMM trainers, which are incapable of scaling. Of the Viterbi and Baum-Welch 
> training methods, the Baum-Welch algorithm is slower but more accurate, and a 
> better candidate for a parallel implementation, for two reasons:
> (1) The BW is more numerically stable and provides a guaranteed local maximum 
> of the Maximum Likelihood Estimator (MLE) for the model's parameters (Theta). 
> In Viterbi training, the MLE is approximated in order to reduce computation 
> time. 
> (2) The BW belongs to the general class of Expectation Maximization (EM) 
> algorithms, which naturally fit into the Map Reduce framework [2] (a sketch of 
> this decomposition follows below). 
> Hence, this project proposes to extend the sequential implementation of the 
> Baum-Welch to a parallel, distributed version using the Map Reduce programming 
> framework to allow scalable Hidden Markov Model training. 
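
To make the EM-to-Map-Reduce mapping in point (2) concrete, one distributed 
Baum-Welch iteration can be organized so that each map task runs forward-backward 
over its share of the observation sequences and emits partial expected counts (the 
E-step), while the reduce side sums those counts and renormalizes them into a new 
(A, B, Pi) (the M-step). The plain-Java sketch below shows only the shape of that 
decomposition; it is not the mapper/reducer code from the attached patches, and 
all names are hypothetical (it reuses the SimpleHmm sketch from the first quoted 
proposal). 

```java
// Shape of one distributed Baum-Welch iteration (illustrative only):
//   map phase    : each mapper runs forward-backward on its sequences and
//                  emits a BaumWelchCounts of partial expected counts (E-step).
//   reduce phase : a reducer add()s the partial counts and calls toModel()
//                  to renormalize them into a new (A, B, Pi) (M-step).
public class BaumWelchCounts {
  final double[][] transition; // expected transitions from state i to state j
  final double[][] emission;   // expected emissions of symbol o from state s
  final double[] initial;      // expected number of starts in state s

  public BaumWelchCounts(int numStates, int numSymbols) {
    transition = new double[numStates][numStates];
    emission = new double[numStates][numSymbols];
    initial = new double[numStates];
  }

  // Reduce-side combination: partial counts from different mappers simply add.
  public void add(BaumWelchCounts other) {
    for (int i = 0; i < initial.length; i++) {
      initial[i] += other.initial[i];
      for (int j = 0; j < transition[i].length; j++) {
        transition[i][j] += other.transition[i][j];
      }
      for (int o = 0; o < emission[i].length; o++) {
        emission[i][o] += other.emission[i][o];
      }
    }
  }

  // M-step: normalize the accumulated counts into a new stochastic model.
  // Assumes every state has nonzero expected counts (no smoothing shown).
  public SimpleHmm toModel() {
    int numStates = initial.length;
    int numSymbols = emission[0].length;
    double[][] a = new double[numStates][numStates];
    double[][] b = new double[numStates][numSymbols];
    double[] pi = new double[numStates];

    double piSum = 0.0;
    for (double v : initial) {
      piSum += v;
    }
    for (int i = 0; i < numStates; i++) {
      pi[i] = initial[i] / piSum;
      double aSum = 0.0;
      for (double v : transition[i]) {
        aSum += v;
      }
      double bSum = 0.0;
      for (double v : emission[i]) {
        bSum += v;
      }
      for (int j = 0; j < numStates; j++) {
        a[i][j] = transition[i][j] / aSum;
      }
      for (int o = 0; o < numSymbols; o++) {
        b[i][o] = emission[i][o] / bSum;
      }
    }
    return new SimpleHmm(a, b, pi);
  }
}
```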

[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-04-05 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015993#comment-13015993
 ] 

Dhruv Kumar commented on MAHOUT-627:


Submitted to the GSoC 2011 website under the Apache Software Foundation.


[jira] [Commented] (MAHOUT-627) Baum-Welch Algorithm on Map-Reduce for Parallel Hidden Markov Model Training.

2011-04-05 Thread Dhruv Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/MAHOUT-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13015982#comment-13015982
 ] 

Dhruv Kumar commented on MAHOUT-627:


Updated JIRA: added gsoc labels, edited the JIRA title to match the project's title.
