[ 
https://issues.apache.org/jira/browse/MAHOUT-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119348#comment-13119348
 ] 

Isabel Drost commented on MAHOUT-652:
-------------------------------------

Would be great if you could add this information to our wiki. I think the HMM 
page https://cwiki.apache.org/MAHOUT/hidden-markov-models.html would be the 
best place to put this - maybe also add some information on when to prefer the 
parallel implementation over the sequential one (a rough rule of thumb is 
sufficient).

Patch still applies cleanly, compiles, tests still running.

Overall a very useful patch that is in good shape already. More detailed 
comments below.


Some comments that need a bit more work to further improve your patch:

* I did not find any integration of your work into our mahout script - that 
would be really nice to have.
* I only found an integration test. Thinking about having to change your code 
at some point in the future, it would be great if you could also provide more 
fine-grained tests that check smaller units of work. Did you consider using 
MRUnit? (See the sketch after this list.)
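
For illustration, a minimal MRUnit-style sketch - the key/value types and the 
two fixture helpers are hypothetical placeholders for whatever your reducer 
actually consumes and emits:

    import java.util.Arrays;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mrunit.mapreduce.ReduceDriver;
    import org.junit.Test;

    public class ForwardViterbiReducerTest {

      @Test
      public void reducesSingleChunk() throws Exception {
        // Hypothetical key/value types -- substitute the real writables.
        ReduceDriver<IntWritable, ViterbiDataWritable,
                     IntWritable, ViterbiDataWritable> driver =
            new ReduceDriver<IntWritable, ViterbiDataWritable,
                             IntWritable, ViterbiDataWritable>()
                .withReducer(new ForwardViterbiReducer());
        // chunkFixture() and expectedProbabilitiesAndPaths() are hypothetical
        // test helpers building one chunk and the expected reducer output.
        driver.withInput(new IntWritable(0), Arrays.asList(chunkFixture()))
              .withOutput(new IntWritable(0), expectedProbabilitiesAndPaths())
              .runTest();
      }
    }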


A few simpler comments:

* For most of your classes I'd prefer some more documentation on the package, 
class and method level. I am not sure that people unfamiliar with your code 
can easily understand what's going on.
* Please provide a list of all configuration options you use - either in the 
javadoc or on the wiki page mentioned above.
* In cases where you catch an exception in a map/reduce job (e.g. in 
PrepareChunks) it might make sense to at least increment a counter for each 
exception that is swallowed (see the first sketch after this list).
* You use log.info quite often, even within loops. As a general rule of thumb 
- to reduce the amount of logging to the minimum needed to still spot problems 
- I tend to prefer seeing an info statement just once per method.
* Please consider re-using the default options Mahout provides, e.g. for 
input and output paths (IIRC org.apache.mahout.common.commandline; see the 
second sketch after this list).
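
Something along these lines for the counter (a minimal sketch - the mapper and 
the enum are hypothetical, the pattern is what matters):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class ChunkMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

      // Hypothetical counter enum; any group/name pair will do.
      enum ViterbiCounters { SWALLOWED_EXCEPTIONS }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        try {
          // ... decode the chunk and emit it ...
        } catch (RuntimeException e) {
          // Still swallowed, but now at least visible in the job UI.
          context.getCounter(ViterbiCounters.SWALLOWED_EXCEPTIONS).increment(1);
        }
      }
    }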
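
And re-using the default options via AbstractJob would look roughly like this 
(method names quoted from memory - please double check against trunk):

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.common.AbstractJob;

    public class ParallelViterbiDriver extends AbstractJob {

      @Override
      public int run(String[] args) throws Exception {
        addInputOption();   // the standard -i/--input option
        addOutputOption();  // the standard -o/--output option
        if (parseArguments(args) == null) {
          return -1;        // help was printed or parsing failed
        }
        // getInputPath()/getOutputPath() now hold the parsed paths.
        // ... configure and submit the chunk jobs ...
        return 0;
      }

      public static void main(String[] args) throws Exception {
        ToolRunner.run(new ParallelViterbiDriver(), args);
      }
    }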

Class level comments:

PrepareChunks:
* Please wrap anything between opening and closing readers/writers etc. in a 
try{}finally{} block and close the streams in the finally block. Using 
org.apache.commons.io.IOUtils.closeQuietly might prove useful - see the sketch 
below.
* Don't add commented-out code to the patch.
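
The pattern I have in mind, sketched (the method and its arguments are just an 
example):

    import java.io.IOException;
    import org.apache.commons.io.IOUtils;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    void writeChunk(FileSystem fs, Path chunkPath) throws IOException {
      FSDataOutputStream out = fs.create(chunkPath);
      try {
        // ... write the chunk data ...
      } finally {
        // Runs even when the write throws, so the stream never leaks.
        IOUtils.closeQuietly(out);
      }
    }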

ViterbiDataWritable
* There is an array of type Class named classes - either give it a speaking 
name or at least document the general strategy (or reference the upstream 
writable strategy you adopted; see the sketch below).
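
If this follows Hadoop's GenericWritable idea, even a short note in this 
direction would help (sketch, assuming that is indeed the strategy):

    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.Writable;

    /**
     * Tagged union of everything that flows between the Viterbi jobs,
     * following the Hadoop GenericWritable pattern: the position of the
     * concrete class in getTypes() is serialized before the instance, so
     * the order of this array is part of the wire format - never reorder it.
     */
    public class ViterbiDataWritable extends GenericWritable {
      @Override
      protected Class<? extends Writable>[] getTypes() {
        // Hypothetical member types, for illustration only.
        return new Class[] {ObservedSequenceWritable.class /* , ... */};
      }
    }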

ObservedSequenceWritable
* NullArgumentException - I'd prefer using guava's Preconditions concept here 
for consistency with existing code.
* Maybe take a look at the facilities guava provides that make it easier to 
write hashCode/equals and friends?
* Add a toString? (A sketch covering all three points follows below.)
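
With guava that could look like this (the field names are hypothetical):

    import java.util.Arrays;
    import com.google.common.base.Objects;
    import com.google.common.base.Preconditions;

    public class ObservedSequenceWritable /* implements Writable */ {

      private int[] data;       // hypothetical fields, for illustration
      private int chunkNumber;

      public void setData(int[] data) {
        // Fails fast with a clear message, consistent with existing code.
        this.data = Preconditions.checkNotNull(data, "data must not be null");
      }

      @Override
      public boolean equals(Object o) {
        if (!(o instanceof ObservedSequenceWritable)) {
          return false;
        }
        ObservedSequenceWritable other = (ObservedSequenceWritable) o;
        return chunkNumber == other.chunkNumber && Arrays.equals(data, other.data);
      }

      @Override
      public int hashCode() {
        return Objects.hashCode(chunkNumber, Arrays.hashCode(data));
      }

      @Override
      public String toString() {
        return Objects.toStringHelper(this)
            .add("chunkNumber", chunkNumber)
            .add("data", Arrays.toString(data))
            .toString();
      }
    }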

ForwardViterbiReducer
* I do not really understand why it has so many static functions.

ViterbiEvaluator
* I'd rather not see .* imports

HmmOnlineViterbi
* You are referencing a thesis in the javadoc comment - please complete the 
reference (e.g. the title and year are missing).
* The class seems pretty long - maybe factor the tree and node into their own 
separate classes?

OnlineViterbiReducer:
* I did not find more detailed documentation on the expected input/output 
keys/values - that might help others to re-use the class. (See the javadoc 
sketch below.)
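
Even a short class-level javadoc along these lines would already help (the 
contents here are hypothetical, of course):

    /**
     * Reducer stage of the online Viterbi computation.
     *
     * Input:  key   - index of the observation sequence
     *         value - the next chunk of that sequence plus the optimal-path
     *                 probabilities carried over from the previous chunk
     * Output: key   - the same sequence index
     *         value - updated probabilities and the decoded path prefix
     */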



> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
>                 Key: MAHOUT-652
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-652
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.5
>            Reporter: Sergey Bartunov
>            Assignee: Isabel Drost
>              Labels: classification, gsoc, gsoc11, hmm, mahout-gsoc-11, 
> sequence-learning
>             Fix For: 0.6
>
>         Attachments: GSoC performance.csv, parallel-viterbi.patch
>
>
> Proposal Title: Parallel Viterbi algorithm for HMM
> Student Name: Sergey Bartunov
> Student E-mail: sbos....@gmail.com
>  
> Organization/Project: Apache Mahout
> Assigned Mentor:
> Proposal Abstract:
> The Viterbi algorithm is an evaluation algorithm for Hidden Markov Models 
> [1]. It estimates the most likely sequence of hidden states, called the 
> "Viterbi path", from a given sequence of observed states. The Viterbi 
> algorithm is also used in Viterbi training, which is less numerically stable 
> than the Baum-Welch algorithm but faster, and can be adjusted to reach nearly 
> the same accuracy [2]. Hidden Markov Models are widely used for speech and 
> gesture recognition, part-of-speech tagging, bioinformatics, etc. At the 
> moment Apache Mahout contains only sequential HMM functionality, and this 
> project is intended to extend it by implementing a Map-Reduce version of the 
> Viterbi algorithm, which would make Mahout able to evaluate HMMs on large 
> amounts of data in parallel. 
> Detailed Description:
> The Viterbi algorithm is a quite common dynamic programming algorithm and is 
> well described in many sources such as [3], [4], [5]. Being popular and 
> widely needed, the Viterbi algorithm already has parallel versions, which 
> have become the subject of several studies - for example, this paper about an 
> implementation for the GPU [6]. Some of these papers may contain useful ideas 
> for a map-reduce implementation.
> There are several possible strategies for parallelization (which one is 
> better is an open question for the project) that I identified and posted to 
> the mailing list; see [this 
> thread|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%3CBANLkTin7kZm=6taxcko80vvm6ps7vb5...@mail.gmail.com%3E].
> The most reasonable one is:
> Assume we have O, the set of observed state sequences (that's our input 
> data); $O_i$ is the i-th sequence with length len($O_i$), and K is the number 
> of hidden states in our model. $O_{i,t}$ is the observed state of the i-th 
> sequence at time t. Let N be the maximum of len($O_i$).
> Let's write all the sequences one above the other. We get a matrix whose 
> rows are the input sequences. Then take the first L columns of this matrix 
> and call them the "first chunk", then the next L columns, and so on. Thus we 
> divide our input into chunks of width at most L, where L is a parameter of 
> the algorithm. The n-th chunk therefore contains the subsequences with $t$ in 
> the interval [L*n, L*(n+1)). 
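> For example, with chunk width L = 100 a sequence of length 250 yields the 
> subsequences covering t in [0, 100), [100, 200) and [200, 250). A minimal 
> sketch of the division (names hypothetical):
>
>     // Chunk n of one sequence holds the observations with t in [L*n, L*(n+1)).
>     static int numChunks(int len, int chunkWidth) {
>       return (len + chunkWidth - 1) / chunkWidth;  // last chunk may be shorter
>     }
>
>     static int[] chunk(int[] sequence, int n, int chunkWidth) {
>       int from = chunkWidth * n;
>       int to = Math.min(chunkWidth * (n + 1), sequence.length);
>       return java.util.Arrays.copyOfRange(sequence, from, to);
>     }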
> The key idea is to process each chunk with the standard sequential Viterbi 
> algorithm (the existing Mahout code could be reused here), handling the 
> sequences within a chunk in parallel and the chunks one after another, each 
> chunk using the results of the previous one.
> This will require about O((M/P) * K^2) time, where M is the sum of all 
> sequence lengths, if most of the input sequences have nearly the same length 
> (the good case). P here is the number of "cores" / computational nodes. This 
> time complexity means that such a strategy scales well. 
> Here is the Map-Reduce scheme for the good case:
> * Map emits a (subsequence, probabilities, paths) vector for each subsequence 
> in the chunk, where the probabilities are the initial probabilities in case 
> of the first chunk and the optimal-path probabilities in all other cases, and 
> the paths are the optimal paths to the first observations of each 
> subsequence. 
> * The Reduce function just performs the sequential Viterbi algorithm on these 
> data and returns (probabilities, paths).
> * The Driver class runs Map and then Reduce for the first chunk, then Map and 
> Reduce for the second, and so on, and finally provides the results (probably 
> using another map-reduce combination).
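> A rough sketch of the Driver loop (prepareChunkJob is a hypothetical helper 
> that configures SimpleMapper and SimpleReducer for chunk n):
>
>     void runSimpleCase(int numChunks) throws Exception {
>       // One Map+Reduce round per chunk; round n consumes the
>       // (probabilities, paths) output of round n-1.
>       for (int n = 0; n < numChunks; n++) {
>         org.apache.hadoop.mapreduce.Job job = prepareChunkJob(n);
>         if (!job.waitForCompletion(true)) {
>           throw new IllegalStateException("processing of chunk " + n + " failed");
>         }
>       }
>     }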
> It is probably possible to use lazy Viterbi [7,8] instead of the standard 
> Viterbi for chunk processing.
> Let's now focus on all other cases, where the first T chunks contain 
> subsequences of nearly the same length and the remaining N-T chunks do not. 
> These can be processed using the following technique (which is strategy 
> number 2 in [the 
> list|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%3CBANLkTin7kZm=6taxcko80vvm6ps7vb5...@mail.gmail.com%3E]):
> First, denote by $V_{t,k}$ the probability that the optimal hidden state 
> path goes through hidden state $k$ at time $t$. We need these probabilities 
> to select the most likely (or optimal) path, so we need to compute $V_{t,k}$ 
> for every $t$ and $k$ (that is part of standard Viterbi). I omit the sequence 
> index $i$ here, because each sequence is a separate task.
> 1) Process each chunk separately. Here $t$ lies in the interval [L*n, 
> L*(n+1)), where n is the chunk number. To compute $V_{t,k}$ we need to know 
> $V_{L*n-1, k1}$ for each $k1$, but these values belong to the previous chunk 
> (number n-1), which is processed separately. We need to obtain the max and 
> arg-max of $V_{L*n-1, k1}$ over all $k1$. 
> 2) Since we don't know the maxima and arg-maxima, let's just assume that the 
> optimal path goes through k1=1, then through k1=2, and so on up to K, with 
> $V_{L*n-1, k1}$ = 1 (or 0 if we add log-probabilities instead of multiplying 
> probabilities) and arg-max k1 = 1, 2, ..., K. Later we may multiply 
> $V_{L*(n+1)-1, k1}$ by the precise probability max{ $V_{L*n-1, k1}$ } (or add 
> its log). Thus we get K^2 possibly optimal paths instead of K (written out in 
> formulas after step 3 below).
> 3) After we have processed all the chunks, we start the "connecting" phase. 
> The first of these N-T chunks can be processed with the usual Viterbi 
> procedure, since we know the results for the previous "good" chunk. All 
> others need to be connected to each other, starting by connecting the second 
> with the first, then the third with the fourth, and so on. During the 
> connection process we throw away all non-optimal paths (now we know which 
> are optimal and which are not) and keep just K possibly optimal paths.
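> Written out (roughly): for a fixed guess $k1$, chunk n is processed with 
> $V_{L*n-1, k1}$ = 1 and the other entries zeroed, giving candidate values 
> $V_{L*(n+1)-1, k}$ for that guess. Once the true $V_{L*n-1, k1}$ is known, 
> each candidate is rescaled as $V_{L*(n+1)-1, k} \cdot V_{L*n-1, k1}$ (or its 
> log is added), and maximizing over $k1$ collapses the K^2 candidate paths 
> back to K.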
> To handle such difficult cases, the Driver class should know the sequence 
> lengths so it can recognize which Map-Reduce combination to use. The 
> resulting time complexity for the difficult case is 
> O(((T*L/N) / P) * K^2 + (((N-T)*L/N) / P) * K^3). The first term of this sum 
> is the time for computing the first T chunks using the simple-case approach, 
> and the second term is the time for computing the rest using the 
> difficult-case approach. When T -> N the overall time tends to O((M/P) * 
> K^2). T*L/N is roughly the total length of the subsequences in the first T 
> chunks, and (N-T)*L/N is the total length of the rest.
> The only problem not discussed here is how to deal with the obtained most 
> likely paths. A path is as long as its sequence, so it's not a good idea to 
> emit optimal paths in the Map stage. They could be stored separately in 
> files or in HBase, since they are only required when returning the results 
> to the user.
> This implementation should work (or be compatible) with 
> org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
> As you can see, the Viterbi algorithm uses a very common dynamic-programming 
> approach: it computes the next layer of the computation by combining values 
> from the previous one. That's all we need to know to implement it in 
> Map-Reduce, and it shares the mentioned difficulties with all other such 
> dynamic-programming algorithms, so the implementation may be highly reusable.
> Timeline:
> 1. 1 week. Discover Mahout internals, the sequential HMM code, and coding 
> best practices. Think about the path-storage problem. I'll actually work on 
> this stage before GSoC starts, but I leave 1 week here just to be sure.
> 2. 5 days. Write the chunk division routines. 
> 3. 5 days. Write the SimpleMapper class for the simple, good case of data.
> 4. 1 week. Write the SimpleReducer class for the simple, good case.
> 5. 1 week. Write the Driver class for the simple, good case, which will use 
> SimpleMapper and SimpleReducer. 
> 6. 1 week. Collect a debug dataset, write tests for the code, find possible 
> problems, and fix them.
> 7. 10 days. Write the HardReducer class for the difficult, bad case and 
> rewrite the Driver class to handle such cases. Perform tests and ensure that 
> everything works.
> 8. 10 days. Try to get a large dataset to test performance, scalability, 
> etc., and write and perform such a test (community help might be needed 
> here). If there are any problems, discuss and analyze them, then fix.
> 9. 1 week. Try to use an optimized Viterbi implementation (e.g. lazy 
> Viterbi) in the chunk processing phase and measure the speed improvement, 
> especially for difficult cases.
> 10. 1 week. Write documentation for all classes, a description of the 
> algorithm in the wiki, and a complete usage example.
> Additional Information:
> The project is related to [this 
> proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about 
> implementing a parallel Baum-Welch training algorithm for HMMs.
> My motivation for this project is that I need such functionality as a 
> regular Mahout user, since I encounter HMMs often in my research work. I'm 
> also very interested in which kinds or classes of algorithms can be 
> efficiently implemented in Map-Reduce; I think this is an important question 
> for the community, and such experience is significant.
> One important detail is that I have exams at university until 16 June. After 
> that I will be ready to work on the project.
> References:
> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models 
> and selected applications in speech 
> recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf]. 
> Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626. 
> [2] [Adjusted Viterbi Training. A proof of 
> concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
> [3] 
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> [5] http://www.kanungo.com/software/hmmtut.pdf
> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
> [7] http://www.scribd.com/doc/48568001/Lazy-Viterbi-Slides
> [8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1040367

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
