[ https://issues.apache.org/jira/browse/MAHOUT-652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119348#comment-13119348 ]
Isabel Drost commented on MAHOUT-652:
-------------------------------------

Would be great if you could add this information to our wiki. I think the HMM page https://cwiki.apache.org/MAHOUT/hidden-markov-models.html would be the best place to put this - maybe also add some information on when to prefer the parallel implementation over the sequential one (a rough rule of thumb is sufficient).

Patch still applies cleanly, compiles, tests still running. Overall a very useful patch that is in good shape already. More detailed comments below.

Some comments that need a bit more work to further improve your patch:

* I did not find any integration of your work into our mahout script - would be really nice to have.
* I only found an integration test. When I think about having to change your code at some point in the future, it would be great if you could also provide more fine-grained tests that check smaller units of work. Did you consider using mr-unit?

A few simpler comments:

* For most of your classes I'd prefer some more documentation at the package, class and method level. I'm not sure that people unfamiliar with your code can easily understand what's going on.
* Please provide a list of all configuration options you use - either in javadoc or on the wiki page above.
* In cases where you catch an exception in a map/reduce job (e.g. PrepareChunks), it might make sense to at least increment a counter for each exception that is swallowed.
* You use log.info quite often, even within loops. As a general rule of thumb - to reduce the amount of logging to the minimum needed to spot problems - I prefer to see an info statement at most once per method.
* Please consider re-using the default options provided by Mahout, e.g. for input and output paths (IIRC org.apache.mahout.common.commandline).

Class level comments:

PrepareChunks:
* Please wrap anything between opening readers/writers etc. in a try{}finally{} block and close the streams in the finally block. Using org.apache.commons.io.IOUtils.closeQuietly might prove useful (see the first sketch after these comments).
* Don't add code to the patch that is commented out.

ViterbiDataWritable:
* There is an array of type Class named "classes" - either give it a speaking name or at least document the general strategy (or reference the upstream writable strategy that you adopted).

ObservedSequenceWritable:
* NullArgumentException - I'd prefer using Guava's Preconditions concept here, for consistency with existing code.
* Maybe take a look at the facilities provided by Guava that make it easier to write hashCode/equals and friends (see the second sketch after these comments)?
* Add a toString?

ForwardViterbiReducer:
* I do not really understand why it has so many static functions?

ViterbiEvaluator:
* I'd rather not see .* imports.

HmmOnlineViterbi:
* You are referencing a thesis in the javadoc comment - please complete the reference (e.g. title and year are missing).
* The class seems pretty long - maybe factor tree and node into their own separate classes?

OnlineViterbiReducer:
* I did not find more detailed documentation on the expected input/output keys/values - that might help others to re-use it.
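To make the PrepareChunks points concrete, a minimal sketch of the pattern I have in mind - the stream handling, mapper types and counter names here are illustrative only, not taken from the patch:

{code:java}
import java.io.IOException;
import org.apache.commons.io.IOUtils;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ChunksMapperSketch extends Mapper<LongWritable, Text, LongWritable, Text> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    try {
      context.write(key, value); // ... real parsing/emitting would go here ...
    } catch (NumberFormatException e) {
      // don't swallow silently: make the failure visible in the job counters
      context.getCounter("PrepareChunks", "SWALLOWED_EXCEPTIONS").increment(1);
    }
  }

  /** The try/finally pattern for streams opened around reads/writes. */
  static void writeChunk(FileSystem fs, Path path, byte[] data) throws IOException {
    FSDataOutputStream out = null;
    try {
      out = fs.create(path);
      out.write(data);
    } finally {
      IOUtils.closeQuietly(out); // closed even if create() or write() throws
    }
  }
}
{code}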
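And a sketch of the Guava route for ObservedSequenceWritable - Preconditions instead of NullArgumentException, plus the hashCode/equals/toString helpers; the field layout is made up for the example and is not the actual class:

{code:java}
import java.util.Arrays;
import com.google.common.base.Objects;
import com.google.common.base.Preconditions;

public final class ObservedSequenceSketch {
  private final int[] data;

  public ObservedSequenceSketch(int[] data) {
    // fails fast with a clear message instead of NullArgumentException
    this.data = Preconditions.checkNotNull(data, "data must not be null");
  }

  @Override
  public boolean equals(Object o) {
    return o instanceof ObservedSequenceSketch
        && Arrays.equals(data, ((ObservedSequenceSketch) o).data);
  }

  @Override
  public int hashCode() {
    return Arrays.hashCode(data);
  }

  @Override
  public String toString() {
    return Objects.toStringHelper(this).add("data", Arrays.toString(data)).toString();
  }
}
{code}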
> [GSoC Proposal] Parallel Viterbi algorithm for HMM
> --------------------------------------------------
>
>                 Key: MAHOUT-652
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-652
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.5
>            Reporter: Sergey Bartunov
>            Assignee: Isabel Drost
>              Labels: classification, gsoc, gsoc11, hmm, mahout-gsoc-11, sequence-learning
>             Fix For: 0.6
>         Attachments: GSoC performance.csv, parallel-viterbi.patch
>
>
> Proposal Title: Parallel Viterbi algorithm for HMM
> Student Name: Sergey Bartunov
> Student E-mail: sbos....@gmail.com
>
> Organization/Project: Apache Mahout
> Assigned Mentor:
>
> Proposal Abstract:
>
> The Viterbi algorithm is an evaluation algorithm for Hidden Markov Models [1]. It estimates the most likely sequence of hidden states, called the "Viterbi path", from a given sequence of observed states. The Viterbi algorithm is also used in Viterbi training, which is less numerically stable than the Baum-Welch algorithm but faster, and can be adjusted to reach nearly the same accuracy [2]. Hidden Markov Models are widely used for speech and gesture recognition, part-of-speech tagging, bioinformatics, etc. At the moment Apache Mahout contains only sequential HMM functionality, and this project is intended to extend it by implementing a Map-Reduce version of the Viterbi algorithm, which would make Mahout able to evaluate HMMs on large amounts of data in parallel.
>
> Detailed Description:
>
> The Viterbi algorithm is a quite common dynamic programming algorithm; it is well described in many sources such as [3], [4], [5]. Being popular and needed, the Viterbi algorithm already has parallel versions, which have become the subject of several studies, for example this paper about an implementation for GPUs [6]. Some of these papers may contain useful ideas for a map-reduce implementation.
>
> There are several possible strategies for parallelization (which one is better is an open question for the project) that I discovered and posted to the mailing list. See [this thread|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%3CBANLkTin7kZm=6taxcko80vvm6ps7vb5...@mail.gmail.com%3E].
>
> The most reasonable one is:
>
> Assume we have O - the set of observed state sequences (that's our input data) - where $O_i$ is the i-th sequence with length len($O_i$), K is the number of hidden states in our model, and $O_{i,t}$ is the observed state of the i-th sequence at the moment of time t. Let N be the maximum of len($O_i$).
>
> Let's write all the sequences one above the other. We get a matrix whose rows are the input sequences. Then take the first L columns of this matrix and name them the "first chunk", then the next L columns, and so on. Thus we divide our input into chunks of size at most L, where L is a parameter of the algorithm. So the n-th chunk contains the subsequences with $t$ in the interval [L*n, L*(n+1)).
>
> The key idea is to process each chunk with the standard sequential Viterbi algorithm (the existing Mahout code could be reused here), handling the subsequences within a chunk in parallel, and the chunks one by one, each using the results of the previous chunk's processing (a per-chunk sketch follows below).
>
> It will require about O((M/P) * K^2) time, where M is the sum of all sequence lengths, if most of the input sequences have nearly the same length (the good case). P here is the number of "cores" / computational nodes. This time complexity means that such a strategy scales well.
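> A minimal sketch of this per-chunk step, in log space, seeded with the scores handed over from the previous chunk; the array layout (logA transitions, logB emissions) is assumed for the example only, not the patch's data format:
>
> {code:java}
> // Log-space Viterbi over one chunk. logIn holds the scores for the last
> // column of chunk n-1; logOut receives the scores handed to chunk n+1.
> // For the very first chunk, logIn would hold the initial log-probabilities
> // and the t = 0 transition step would be skipped (omitted for brevity).
> static int[][] chunkStep(double[] logIn, int[] obs, double[][] logA,
>                          double[][] logB, double[] logOut) {
>   int K = logIn.length;                  // number of hidden states
>   int[][] backPtr = new int[obs.length][K];
>   double[] cur = logIn.clone();
>   for (int t = 0; t < obs.length; t++) {
>     double[] next = new double[K];
>     for (int k = 0; k < K; k++) {
>       int best = 0;
>       double bestScore = cur[0] + logA[0][k];
>       for (int j = 1; j < K; j++) {
>         double s = cur[j] + logA[j][k];
>         if (s > bestScore) { bestScore = s; best = j; }
>       }
>       next[k] = bestScore + logB[k][obs[t]];
>       backPtr[t][k] = best;              // per-chunk back pointers
>     }
>     cur = next;
>   }
>   System.arraycopy(cur, 0, logOut, 0, K);
>   return backPtr;
> }
> {code}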
> Here is the Map-Reduce scheme for the good case:
>
> * Map emits a (subsequence, probabilities, paths) vector for each subsequence in the chunk, where the probabilities are the initial probabilities in the case of the first chunk and the optimal-path probabilities in all other cases, and the paths are the optimal paths to the first observations of each subsequence.
> * The Reduce function just performs the sequential Viterbi algorithm on these data and returns (probabilities, paths).
> * The Driver class runs Map and then Reduce for the first chunk, then Map and Reduce for the second, and so on, and finally delivers the results (probably using another map-reduce combination). A schematic driver loop is sketched after this description.
>
> Probably it's possible to use lazy Viterbi [7,8] instead of standard Viterbi for the chunk-processing phase.
>
> Let's now focus on all other cases, where the first T chunks contain subsequences of nearly the same length and the other N-T chunks do not. They could be processed using the following technique (which is strategy number 2 in [the list|http://mail-archives.apache.org/mod_mbox/mahout-dev/201104.mbox/%3CBANLkTin7kZm=6taxcko80vvm6ps7vb5...@mail.gmail.com%3E]):
>
> First, denote by $V_{t,k}$ the probability that the optimal hidden state path goes through hidden state $k$ at moment $t$. We need such probabilities to select the most likely (optimal) path, so we need to compute $V_{t,k}$ for every $t$ and $k$ (that is part of standard Viterbi). I omit the sequence index $i$ here, because each sequence is a separate task.
>
> 1) Process each chunk separately, with $t$ in the interval [L*n, L*(n+1)), where n is the number of the chunk. To compute $V_{t,k}$ we need to know $V_{L*n-1,k1}$ for each $k1$, but this value belongs to the previous chunk (number n-1), which is processed separately. We need to obtain the max and arg-max of $V_{L*n-1,k1}$ over $k1$.
>
> 2) Since we don't know the maxes and arg-maxes, let's just assume that the optimal path goes through k1=1, then through k1=2, and so on up to K, with $V_{L*n-1,k1}$ = 1 (or 0 if we add log-probabilities instead of multiplying probabilities) and arg-max k1 = 1, 2, ..., K. Later we may multiply $V_{L*(n+1)-1,k}$ by the precise probability max{ $V_{L*n-1,k1}$ } (or add its log). Thus we get K^2 possibly-optimal paths instead of K.
>
> 3) After we process all the chunks, we start the "connecting" phase. The first of these N-T chunks can be processed with the usual Viterbi step, since we know the results for the previous "good" chunk. All the others need to be connected to each other, connecting the second with the first, then the third with the fourth, and so on. During the connection process we throw away all non-optimal paths (now we know which are optimal and which are not) and keep just K possibly-optimal paths (a sketch of one connecting step is given below as well).
>
> To handle such difficult cases, the Driver class should know the sequence lengths, in order to recognize which Map-Reduce combination it should use. The resulting time complexity for the difficult case is O((T*L/N)/P * K^2 + ((N-T)*L/N)/P * K^3). The first term of this sum is the time for computing the first T chunks using the simple-case approach, and the second term is the time for computing the rest using the difficult-case approach. As T -> N, the overall time tends to O((M/P) * K^2). T*L/N is about the total length of the subsequences in the first T chunks, and (N-T)*L/N is the total length of the rest.
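> The following is a rough sketch of the good-case driver loop - one MapReduce job per chunk, assuming the Hadoop 0.20 Job API; the mapper/reducer are empty identity placeholders and the chunks/probs path layout is made up for illustration:
>
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.io.LongWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Job;
> import org.apache.hadoop.mapreduce.Mapper;
> import org.apache.hadoop.mapreduce.Reducer;
> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
>
> public class ViterbiDriverSketch {
>   // Identity placeholders: real ones would emit/consume the
>   // (subsequence, probabilities, paths) records described above.
>   static class ChunkMapper extends Mapper<LongWritable, Text, LongWritable, Text> { }
>   static class ChunkReducer extends Reducer<LongWritable, Text, LongWritable, Text> { }
>
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     int numChunks = Integer.parseInt(args[0]);
>     for (int n = 0; n < numChunks; n++) {   // chunks must run in order
>       Job job = new Job(conf, "viterbi-chunk-" + n);
>       job.setJarByClass(ViterbiDriverSketch.class);
>       job.setMapperClass(ChunkMapper.class);
>       job.setReducerClass(ChunkReducer.class);
>       job.setOutputKeyClass(LongWritable.class);
>       job.setOutputValueClass(Text.class);
>       FileInputFormat.addInputPath(job, new Path("chunks/" + n));
>       if (n > 0) {  // feed in the probabilities produced for chunk n-1
>         FileInputFormat.addInputPath(job, new Path("probs/" + (n - 1)));
>       }
>       FileOutputFormat.setOutputPath(job, new Path("probs/" + n));
>       if (!job.waitForCompletion(true)) {
>         throw new IllegalStateException("chunk " + n + " failed");
>       }
>     }
>   }
> }
> {code}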
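> And a small sketch of one connecting step for the difficult case, under the same assumed log-space layout: leftBest[k1] is the best score of the left part ending in state k1, and rightGiven[k1][k2] is the right chunk's best score ending in k2 when it assumed entry state k1. All names are illustrative:
>
> {code:java}
> /** Reduces the K*K candidate paths at one chunk boundary back to K. */
> static double[] connect(double[] leftBest, double[][] rightGiven, int[] argEntry) {
>   int K = leftBest.length;
>   double[] joined = new double[K];
>   for (int k2 = 0; k2 < K; k2++) {
>     joined[k2] = Double.NEGATIVE_INFINITY;
>     for (int k1 = 0; k1 < K; k1++) {
>       double s = leftBest[k1] + rightGiven[k1][k2]; // log-probabilities add
>       if (s > joined[k2]) {
>         joined[k2] = s;
>         argEntry[k2] = k1; // remember which entry state was optimal
>       }
>     }
>   }
>   return joined;
> }
> {code}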
> The only problem not discussed here is how to deal with the obtained most likely paths. The path length is equal to the length of the sequence, so it's not a good idea to emit an optimal path in the Map stage. The paths could be stored separately, in files or in HBase, since they are only required when delivering the results to the user.
>
> This implementation should work with (or be compatible with) org.apache.mahout.classifier.sequencelearning.hmm.HmmModel.
>
> As you see, the Viterbi algorithm uses a very common dynamic programming approach: it computes the next layer of the computation by combining values from the previous one. That's all we need to know to implement it in Map-Reduce, and it shares the mentioned difficulties with all other such dynamic programming algorithms, so the implementation may be highly reusable.
>
> Timeline:
>
> 1. 1 week. Explore Mahout internals, the sequential HMM code and coding best practices. Think about the path-storage problem. I'll actually work on this stage before GSoC starts, but I leave 1 week here just to be sure.
> 2. 5 days. Write the chunk-division routines.
> 3. 5 days. Write the SimpleMapper class for the simple good case of data.
> 4. 1 week. Write the SimpleReducer class for the simple good case.
> 5. 1 week. Write the Driver class for the simple good case, which will use SimpleMapper and SimpleReducer.
> 6. 1 week. Collect a debug dataset, write tests for the code, find possible problems and fix them.
> 7. 10 days. Write the HardReducer class for the difficult bad case and rewrite the Driver class to handle such cases. Perform tests, ensure that everything works.
> 8. 10 days. Try to get a large dataset to test performance, scalability etc., and write and perform such a test (community help might be needed here). If there are any problems, discuss and analyze them, then fix.
> 9. 1 week. Try to use an optimized Viterbi implementation (e.g. lazy Viterbi) in the chunk-processing phase, and measure the speed improvement, especially for the difficult cases.
> 10. 1 week. Write documentation for all classes, a description of the algorithm in the wiki and a complete usage example.
>
> Additional Information:
>
> The project relates to [this proposal|https://issues.apache.org/jira/browse/MAHOUT-627], which is about implementing a parallel Baum-Welch training algorithm for HMM.
>
> My motivation for this project is that I need this functionality as a regular Mahout user, since I often face HMMs in my research work. I'm also very interested in which kinds or classes of algorithms can be implemented efficiently in Map-Reduce; I think this is an important question for the community, and such experience is significant.
>
> One important detail is that I have university exams until the 16th of June. After that I will be ready to work on the project.
>
> References:
>
> [1] Lawrence R. Rabiner (February 1989). ["A tutorial on Hidden Markov Models and selected applications in speech recognition"|http://www.cs.cornell.edu/courses/cs481/2004fa/rabiner.pdf]. Proceedings of the IEEE 77 (2): 257-286. doi:10.1109/5.18626.
> [2] [Adjusted Viterbi Training. A proof of concept.|http://personal.rhul.ac.uk/utah/113/VA/AVTPC.pdf]
> [3] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.118.2081&rep=rep1&type=pdf
> [4] http://www.cambridge.org/resources/0521882672/7934_kaeslin_dynpro_new.pdf
> [5] http://www.kanungo.com/software/hmmtut.pdf
> [6] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=5470903
> [7] http://www.scribd.com/doc/48568001/Lazy-Viterbi-Slides
> [8] http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1040367

--
This message is automatically generated by JIRA.