[ https://issues.apache.org/jira/browse/MAHOUT-396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878319#action_12878319 ]
Qiuyan Xu commented on MAHOUT-396:
----------------------------------

Marc and I will take on the second task: find real-world examples that make use of the HMM implementation, to demonstrate both its usage and its usefulness.

Proposal for Implementing Hidden Markov Model
---------------------------------------------

Key: MAHOUT-396
URL: https://issues.apache.org/jira/browse/MAHOUT-396
Project: Mahout
Issue Type: New Feature
Affects Versions: 0.4
Reporter: Max Heimel
Priority: Minor
Fix For: 0.4
Attachments: MAHOUT-396.diff, MAHOUT-396.diff, MAHOUT-396.diff, MAHOUT-396.diff

h4. Overview
This is a proposal for a summer-term university project to write a (sequential) HMM implementation for Mahout. Five students will work on the project as part of a course mentored by Isabel Drost.

h4. Abstract
Hidden Markov Models are used in multiple areas of machine learning, such as speech recognition, handwritten letter recognition, and natural language processing. A Hidden Markov Model (HMM) is a statistical model of a process consisting of two (in our case discrete) random variables O and Y, which change their states sequentially. The variable Y with states {y_1, ..., y_n} is called the "hidden variable", since its state is not directly observable. The state of Y changes sequentially according to a so-called Markov property, in our case of first order: the transition probability of Y depends only on its current state and does not change over time. Formally, we write P(Y(t+1)=y_i | Y(0), ..., Y(t)) = P(Y(t+1)=y_i | Y(t)) = P(Y(2)=y_i | Y(1)). The variable O with states {o_1, ..., o_m} is called the "observable variable", since its state can be directly observed. O does not have a Markov property, but its state probability depends statistically on the current state of Y.

Formally, an HMM is defined as a tuple M = (n, m, P, A, B), where n is the number of hidden states, m is the number of observable states, P is an n-dimensional vector containing the initial hidden-state probabilities, A is the nxn-dimensional "transition matrix" containing the transition probabilities such that A[i,j] = P(Y(t)=y_i | Y(t-1)=y_j), and B is the mxn-dimensional "observation matrix" containing the observation probabilities such that B[i,j] = P(O=o_i | Y=y_j).

Rabiner [[1|#reference1]] defined three main problems for HMMs:
# Evaluation: Given a sequence O of observations and a model M, what is the probability P(O|M) that sequence O was generated by model M? The evaluation problem can be solved efficiently using the forward algorithm (see the sketch below).
# Decoding: Given a sequence O of observations and a model M, what is the most likely sequence Y* = argmax_Y P(O|M,Y) of hidden variables to generate this sequence? The decoding problem can be solved efficiently using the Viterbi algorithm (a companion sketch appears after the test data sets at the end of this message).
# Learning: Given a sequence O of observations, what is the most likely model M* = argmax_M P(O|M) to generate this sequence? The learning problem can be solved efficiently using the Baum-Welch algorithm.
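To make problem 1 concrete, here is a minimal plain-Java sketch of the forward algorithm. It is an illustration only, not the proposed Mahout API: the class name, the toy two-state model, and all numbers are hypothetical. The matrix conventions follow the definition above, i.e. a[i][j] = P(Y(t)=y_i | Y(t-1)=y_j) and b[k][j] = P(O=o_k | Y=y_j).

{code:java}
/** Hypothetical sketch of the forward algorithm (problem 1: evaluation); not the proposed Mahout API. */
public class ForwardSketch {

  /**
   * Computes P(O|M) for an observation sequence obs, where
   * pi[i]   = initial probability of hidden state i,
   * a[i][j] = P(Y(t)=y_i | Y(t-1)=y_j)  (column-stochastic, as defined above),
   * b[k][j] = P(O=o_k | Y=y_j).
   */
  static double forward(double[] pi, double[][] a, double[][] b, int[] obs) {
    int n = pi.length;
    double[] alpha = new double[n];
    // Initialization: alpha_0(i) = pi_i * b(o_0 | y_i)
    for (int i = 0; i < n; i++) {
      alpha[i] = pi[i] * b[obs[0]][i];
    }
    // Induction: alpha_t(i) = (sum_j a[i][j] * alpha_{t-1}(j)) * b(o_t | y_i)
    for (int t = 1; t < obs.length; t++) {
      double[] next = new double[n];
      for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) {
          sum += a[i][j] * alpha[j];
        }
        next[i] = sum * b[obs[t]][i];
      }
      alpha = next;
    }
    // Termination: P(O|M) = sum_i alpha_T(i)
    double p = 0.0;
    for (double v : alpha) {
      p += v;
    }
    return p;
  }

  public static void main(String[] args) {
    // Toy model with hypothetical numbers: two hidden states, three observable states.
    double[] pi = {0.6, 0.4};
    double[][] a = {{0.7, 0.4},   // a[i][j] = P(next state = i | previous state = j)
                    {0.3, 0.6}};
    double[][] b = {{0.1, 0.6},   // b[k][j] = P(observation = k | state = j)
                    {0.4, 0.3},
                    {0.5, 0.1}};
    int[] obs = {0, 1, 2};
    System.out.println("P(O|M) = " + forward(pi, a, b, obs));
  }
}
{code}

A real implementation on top of the Mahout Math package would presumably use its Vector and Matrix types instead of raw arrays, and log-space or scaled arithmetic to avoid numerical underflow on long observation sequences.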
The target of each milestone is the implementation of the stated goals, together with the respective documentation and unit tests.

h4. Timeline
Mid-May 2010 to mid-July 2010.

h4. Milestones
I) Define an HMM class based on the Apache Mahout Math package, offering interfaces to set model parameters, perform consistency checks, and perform output prediction.
1 week, from May 18th till May 25th.
II) Write sequential implementations of the forward (cf. problem 1 [[1|#reference1]]) and backward algorithms.
2 weeks, from May 25th till June 8th.
III) Write a sequential implementation of the Viterbi algorithm (cf. problem 2 [[1|#reference1]]), based on the existing forward algorithm implementation.
2 weeks, from June 8th till June 22nd.
IV) Have a running sequential implementation of the Baum-Welch algorithm for model parameter learning (application II [ref]), based on the existing forward/backward algorithm implementation.
2 weeks, from June 8th till June 22nd.
V) Provide a usage example of the HMM implementation, demonstrating all three problems.
2 weeks, from June 22nd till July 6th.
VI) Finalize documentation and implementation, clean up open ends.
1 week, from July 6th till July 13th.

h4. References
{anchor:reference1} [[1|http://www.cs.ubc.ca/~murphyk/Bayes/rabiner.pdf]] Lawrence R. Rabiner (February 1989). "A tutorial on Hidden Markov Models and selected applications in speech recognition". Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.

Potential test data sets:
[http://www.cnts.ua.ac.be/conll2000/chunking/]
[http://archive.ics.uci.edu/ml/datasets/Character+Trajectories]
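As a companion to the forward sketch above, here is an equally hypothetical plain-Java sketch of the Viterbi algorithm (problem 2, milestone III). It reuses the toy model and the matrix conventions of the forward sketch; the class and method names are made up, and this is not the proposed Mahout API.

{code:java}
import java.util.Arrays;

/** Hypothetical sketch of the Viterbi algorithm (problem 2: decoding); not the proposed Mahout API. */
public class ViterbiSketch {

  /**
   * Returns the most likely hidden state sequence for obs, with
   * a[i][j] = P(Y(t)=y_i | Y(t-1)=y_j) and b[k][j] = P(O=o_k | Y=y_j).
   */
  static int[] viterbi(double[] pi, double[][] a, double[][] b, int[] obs) {
    int n = pi.length;
    int T = obs.length;
    double[][] delta = new double[T][n]; // delta[t][i]: probability of the best path ending in state i at time t
    int[][] psi = new int[T][n];         // psi[t][i]: best predecessor of state i at time t
    // Initialization: delta_0(i) = pi_i * b(o_0 | y_i)
    for (int i = 0; i < n; i++) {
      delta[0][i] = pi[i] * b[obs[0]][i];
    }
    // Recursion: delta_t(i) = max_j (delta_{t-1}(j) * a[i][j]) * b(o_t | y_i)
    for (int t = 1; t < T; t++) {
      for (int i = 0; i < n; i++) {
        int best = 0;
        double bestP = delta[t - 1][0] * a[i][0];
        for (int j = 1; j < n; j++) {
          double p = delta[t - 1][j] * a[i][j];
          if (p > bestP) {
            bestP = p;
            best = j;
          }
        }
        delta[t][i] = bestP * b[obs[t]][i];
        psi[t][i] = best;
      }
    }
    // Termination and backtracking: start from the most likely final state.
    int[] path = new int[T];
    for (int i = 1; i < n; i++) {
      if (delta[T - 1][i] > delta[T - 1][path[T - 1]]) {
        path[T - 1] = i;
      }
    }
    for (int t = T - 1; t > 0; t--) {
      path[t - 1] = psi[t][path[t]];
    }
    return path;
  }

  public static void main(String[] args) {
    // Same toy model as in the forward sketch (hypothetical numbers).
    double[] pi = {0.6, 0.4};
    double[][] a = {{0.7, 0.4}, {0.3, 0.6}};
    double[][] b = {{0.1, 0.6}, {0.4, 0.3}, {0.5, 0.1}};
    int[] obs = {0, 1, 2};
    System.out.println("Most likely hidden path: " + Arrays.toString(viterbi(pi, a, b, obs)));
  }
}
{code}

Like the forward pass, a production version would presumably work in log space, since products of many probabilities underflow quickly at realistic sequence lengths.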