[
https://issues.apache.org/jira/browse/SPARK-17716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-17716:
---------------------------------
Labels: bulk-closed (was: )
> Hidden Markov Model (HMM)
> -------------------------
>
> Key: SPARK-17716
> URL: https://issues.apache.org/jira/browse/SPARK-17716
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Reporter: Xiangrui Meng
> Assignee: Runxin Li
> Priority: Major
> Labels: bulk-closed
>
> Had an offline chat with [~Lil'Rex], who implemented HMM on Spark at
> https://github.com/apache/spark/compare/master...lilrex:sequence. I asked him
> to list popular HMM applications, describe the public API (params, input/output
> schemas), and compare its API with existing HMM implementations.
> h1. Hidden Markov Model (HMM) Design Doc
> h2. Overview
> h3. Introduction to HMM
> A Hidden Markov Model is a statistical machine learning model that assumes a
> sequence of observations is generated by a Markov process with hidden states.
> The model has three main components (or two, depending on the implementation):
> * *Transition Probability*: describes the probability distribution of
> transitions from each state to other states (including self) in the Markov
> process
> * *Emission Probability*: describes the probability distribution for an
> observation associated with hidden states
> * *Initial/Start Probability* (optional): represents the prior probability of
> each state at the beginning of the observation sequence
> _Note: some implementations merge the Initial Probability into Transition
> Probability by adding an arbitrary Start state before the first observation
> point._
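As a concrete illustration of the three components, here is a minimal Python/NumPy sketch of a toy two-state HMM. The state names, observation alphabet, and all probabilities are invented for illustration only and are not part of this proposal.

```python
import numpy as np

# A toy 2-state HMM ("Rainy"/"Sunny") with 3 discrete observations
# ("walk"/"shop"/"clean"). All names and numbers are illustrative.
states = ["Rainy", "Sunny"]
observations = ["walk", "shop", "clean"]

# Initial/Start Probability: prior over states at time 0 (sums to 1).
start_prob = np.array([0.6, 0.4])

# Transition Probability: entry (i, j) = P(state j at t+1 | state i at t);
# each row sums to 1.
trans_prob = np.array([[0.7, 0.3],
                       [0.4, 0.6]])

# Emission Probability (discrete/Multinomial case): entry (i, k) =
# P(observation k | state i); each row sums to 1.
emit_prob = np.array([[0.1, 0.4, 0.5],
                      [0.6, 0.3, 0.1]])
```

In the Multinomial case the emission model is itself a matrix, as above; for continuous observations it would instead be, e.g., per-state Gaussian parameters.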
> h3. HMM Models and Algorithms
> Given a limited number of states, most HMM models share the same form of
> Transition Probability: a matrix in which each element _(i, j)_ represents the
> probability of transitioning from state _i_ to state _j_. The Initial
> Probability usually takes the simple form of a probability vector.
> The Emission Probability, on the other hand, can be represented in many
> different ways, depending on the nature of the observations (continuous vs.
> discrete) and on the model assumptions (e.g. a single Gaussian vs. a Gaussian
> mixture).
> There are three main problems associated with HMMs, each with canonical
> algorithms:
> # *Evaluation*: What is the probability of a given observation sequence under
> the model? This is usually computed with the *Forward* or *Backward* algorithm
> # *Decoding*: What is the most likely state sequence, given the observation
> sequence and the model? This is usually solved with *Viterbi* decoding
> # *Learning*: How are the parameters of the model trained from observation
> sequences? *Baum-Welch* (Forward-Backward) is usually used as part of the
> *EM* algorithm for unsupervised training
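The Evaluation and Decoding problems above can be sketched in a few lines of Python/NumPy. This is an illustrative toy implementation of the Forward and Viterbi algorithms, not the proposed Spark implementation; all matrices and numbers are made up for the example.

```python
import numpy as np

def forward(start_prob, trans_prob, emit_prob, obs):
    """Evaluation: P(observation sequence | model) via the Forward algorithm."""
    alpha = start_prob * emit_prob[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ trans_prob) * emit_prob[:, o]
    return alpha.sum()

def viterbi(start_prob, trans_prob, emit_prob, obs):
    """Decoding: most likely hidden state sequence via Viterbi (in log space)."""
    n, n_states = len(obs), len(start_prob)
    delta = np.log(start_prob) + np.log(emit_prob[:, obs[0]])
    back = np.zeros((n, n_states), dtype=int)  # back-pointers
    for t in range(1, n):
        scores = delta[:, None] + np.log(trans_prob)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(emit_prob[:, obs[t]])
    # Trace the best path backwards from the most likely final state.
    path = [int(delta.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example (illustrative numbers only).
start = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
obs = [0, 1, 2]  # indices into the observation alphabet
seq_prob = forward(start, A, B, obs)
best_path = viterbi(start, A, B, obs)
```

The Forward pass sums over all state paths, while Viterbi maximizes over them; Baum-Welch reuses the forward and backward messages to form expected counts in the EM E-step.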
> h3. Popular Applications of HMM
> * Speech Recognition
> * Part-of-speech Tagging
> * Named Entity Recognition
> * Machine Translation
> * Gene Prediction
> h2. Alternate Libraries
> [Mahout|https://mahout.apache.org/users/classification/hidden-markov-models.html]
> * Assume emission probability to be an m-by-n matrix
> * Use Baum-Welch algorithm for training and Viterbi algorithm for prediction
> * API (command line)
> ** training
> {{$ echo "0 1 2 2 2 1 1 0 0 3 3 3 2 1 2 1 1 1 1 2 2 2 0 0 0 0 0 0 2 2 2 0 0 0
> 0 0 0 2 2 2 3 3 3 3 3 3 2 3 2 3 2 3 2 1 3 0 0 0 1 0 1 0 2 1 2 1 2 1 2 3 3 3 3
> 2 2 3 2 1 1 0" > hmm-input}}
> {{$ export MAHOUT_LOCAL=true}}
> {{$ $MAHOUT_HOME/bin/mahout baumwelch -i hmm-input -o hmm-model -nh 3 -no 4
> -e .0001 -m 1000}}
> ** prediction
> {{$ $MAHOUT_HOME/bin/mahout hmmpredict -m hmm-model -o hmm-predictions -l 10}}
> {{$ cat hmm-predictions}}
> [Mallet|http://mallet.cs.umass.edu/api/cc/mallet/fst/HMM.html]
> * Treat HMM as a Finite State Transducer (FST)
> * Theoretically can go beyond first-order Markov assumption by setting an
> arbitrary order
> * Limited to text data, i.e. discrete observation sequences with a
> Multinomial emission model assumption
> * Supervised training only
> * API:
> ** Training:
> {{HMM hmm = new HMM(pipe, null);}}
> {{hmm.addStatesForLabelsConnectedAsIn(trainingInstances);}}
> {{HMMTrainerByLikelihood trainer = new HMMTrainerByLikelihood(hmm);}}
> {{trainer.train(trainingInstances, 10);}}
> ** Testing:
> {{evaluator.evaluate(trainer);}}
> [hmmlearn|https://github.com/hmmlearn/hmmlearn]
> * Previously part of scikit-learn
> * Algo:
> ** Standard HMM unsupervised training algorithm
> ** Three types of emission models: GMM, Gaussian and Multinomial
> * API:
> ** Training:
> {{from hmmlearn import hmm}}
> {{model = hmm.GaussianHMM(n_components=3, covariance_type="full")}}
> {{model.fit(X)}}
> ** Testing:
> {{hidden_states = model.predict(X)}}
> h2. API
> h3. Design Goals
> * Build the foundation for general Sequential Tagging models (HMM, CRF, etc.)
> * Support multiple Emission Probability models such as “Multinomial” and
> “Gaussian Mixture”
> * Keep both supervised and unsupervised learning for HMM in mind
> h3. Proposed API
> _Note: This is written for the spark.ml API._
> Decoder API
> {code:title=Decoder.scala}
> trait DecoderParams extends Params {
> def featureCol: DataType // column of sequential features, e.g. MatrixUDT
> def predictionCol: DoubleType // column of prediction
> def labelCol: DataType // column of sequential labels, e.g. VectorUDT
> }
> abstract class Decoder extends Estimator with DecoderParams {
> def extractLabeledSequences(dataset: Dataset[_]): RDD[LabeledSequence]
> }
> abstract class DecodingModel extends Model with DecoderParams {
> def numFeatures: Int
> def decode(features: FeaturesType): Vector
> }
> {code}
> Tagger API
> {code:title=Tagger.scala}
> trait TaggerParams extends DecoderParams with HasRawPredictionCol {
> def rawPredictionCol: MatrixUDT // column for all predicted label sequences
> }
> abstract class Tagger extends Decoder with TaggerParams
> abstract class TaggingModel extends DecodingModel with TaggerParams {
> def numClasses: Int
> def decodeRaw(features: FeaturesType): Array[(Double, Vector)]
> def raw2prediction(rawPrediction: Array[(Double, Vector)]): Vector
> }
> {code}
> MarginalTagger (Probabilistic Tagger) API
> {code:title=MarginalTagger.scala}
> trait MarginalTaggerParams extends TaggerParams with HasProbabilityCol with
> HasThreshold {
> def probabilityCol: DoubleType // column for probability
> }
> abstract class MarginalTagger extends Tagger with MarginalTaggerParams
> abstract class MarginalTaggingModel extends TaggingModel with
> MarginalTaggerParams {
> def getMargin(features: FeaturesType): Double
> }
> {code}
> HiddenMarkovModel API
> {code:title=HiddenMarkovModel.scala}
> trait HiddenMarkovModelParams extends MarginalTaggerParams with HasMaxIter
> with HasTol with HasStandardization with HasThreshold {
> def smoothing: DoubleParam
> def emissionType: Param[String] //can be either Multinomial or Gaussian
> }
> class HiddenMarkovModel extends MarginalTagger with HiddenMarkovModelParams {
> def initialModel: Option[HMMModel] //initial model before training
> def trainSupervised(dataset: Dataset[_]): Option[HMMModel]
> def trainUnsupervised(dataset: Dataset[_]): HMMModel
> }
> abstract class HMMModel extends MarginalTaggingModel with
> HiddenMarkovModelParams {
> def initialProb: Vector // initial probability for states
> def transitionProb: DenseMatrix // transition probability between states
> // compute feature scores for all states in all frames of a sequence;
> // differs across emission models, e.g. Multinomial vs. GMM
> def precomputeEmission(features: Matrix): List[Array[Double]]
> // accumulate sufficient statistics for emission model
> def getEmissionStats(features: Matrix, gammas: List[Array[Double]]):
> DenseMatrix
> // forward algorithm
> def forward(emissions: Traversable[Array[Double]]): List[Array[Double]]
> // backward algorithm
> def backward(emissions: Traversable[Array[Double]]): List[Array[Double]]
> }
> class MultinomialHMMModel extends HMMModel {
> def emissionProb: Matrix // emission probability from states to observations
> }
> object MultinomialHMMModel extends MLReadable[MultinomialHMMModel]
> class HMMModelReader extends MLReader[MultinomialHMMModel]
> {code}
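For intuition on how the {{forward}} and {{backward}} methods of {{HMMModel}} combine during Baum-Welch training, here is a rough Python/NumPy sketch of computing per-frame state posteriors (the gammas that would feed {{getEmissionStats}}). It mirrors the proposed Scala methods only loosely, and all matrices below are invented for illustration.

```python
import numpy as np

def forward_messages(start_prob, trans_prob, emissions):
    """Scaled forward messages alpha[t, i] (analogue of HMMModel.forward).

    emissions[t, i] = P(observation at frame t | state i), as would be
    produced by precomputeEmission."""
    T, n = emissions.shape
    alpha = np.zeros((T, n))
    alpha[0] = start_prob * emissions[0]
    alpha[0] /= alpha[0].sum()  # rescale each frame for numerical stability
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans_prob) * emissions[t]
        alpha[t] /= alpha[t].sum()
    return alpha

def backward_messages(trans_prob, emissions):
    """Scaled backward messages beta[t, i] (analogue of HMMModel.backward)."""
    T, n = emissions.shape
    beta = np.ones((T, n))
    for t in range(T - 2, -1, -1):
        beta[t] = trans_prob @ (emissions[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()
    return beta

def gammas(start_prob, trans_prob, emissions):
    """E-step posteriors gamma[t, i] = P(state i at frame t | all observations).

    Per-frame rescaling of alpha and beta cancels under the row
    normalization, so the result is the exact posterior."""
    g = forward_messages(start_prob, trans_prob, emissions) \
        * backward_messages(trans_prob, emissions)
    return g / g.sum(axis=1, keepdims=True)

# Toy inputs (illustrative numbers only).
start = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
emissions = np.array([[0.1, 0.6], [0.4, 0.3], [0.5, 0.1]])
g = gammas(start, A, emissions)
```

In the M-step, these gammas are accumulated into sufficient statistics for the emission model (cf. {{getEmissionStats}}) and re-estimates of the initial and transition probabilities.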
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)