[CONF] Apache Mahout > Parallel Viterbi

confluence Sun, 04 Dec 2011 04:30:29 -0800

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Parallel Viterbi 
(https://cwiki.apache.org/confluence/display/MAHOUT/Parallel+Viterbi)



Edited by Sergey Bartunov:
---------------------------------------------------------------------
Viterbi algorithm is known as inference algorithm (synonyms: segmentation, 
decoding etc) for Hidden Markov Model \[1\] which finds the most likely 
sequence of hidden states by given sequence of observed states.

Apache Mahout has both [sequential|Hidden Markov Models] and parallel (that's 
what you're reading about) implementations of the algorithm.

Detailed presentation about parallel viterbi implementation could be found in 
[there|http://modis.ispras.ru/seminar/wp-content/uploads/2011/11/Mahout_Viterbi.pdf]
 (in russian)

h3. Parallelization strategy

is quite straightforward and based on data parallelizm. There are some studies 
on Viterbi (and Belief Propogation which is inference algorithm for loop-less 
Markov Random Fields and is quite similar to Viterbi) parallelization, but at 
the moment of writing this article none of them seem to be applyable for 
MapReduce paradigm.

For example, forward pass of Viterbi could be represented in terms of matrix 
computations (as being dynamic programming algorithm) an thus essentially 
paralleled, but overhead for MapReduce would be greater than profit for 
parallel matrix multiplication.

Input sequences of observed variables are supposed to be divided into the 
chunks of some length, enough to store O(N*K) data in main memory. A set of all 
chunks number N is called a "serie number N". The algorithm process the data 
from serie number N-1 to serie number N (or vice versa), performing forward and 
backward Viterbi passes independently for each chunk (and consequently for each 
sequence) in reducers. Only data that is nescessary for computation of next 
serie is being outputed by direct output of reducers, all other data is 
collected in background. For example, when performing forward Viterbi pass only 
probabilities of last hidden state are nescessary for the next step, 
backpointers tables could be written&nbsp;

If all the sequences are of the same length approximately and the number of 
sequences to decode is much more that number of reducers, O(N*M/K) time is 
required to decode them in parallel (N is number of each sequence, M is number 
of all sequences, K is number of reducers).

h3. Data format

Each sequence of observed states must be stored in sequence files, where key is 
the name of the sequence and value is&nbsp;ObservedSequenceWritable where 
number of chunk, data length and data itself are stored. At the moment it is 
hardcoded requirement, but it seems to be easy to implement any input file 
format that will output this information.

The easiest way to get adjust plain text files with space-delimeted numbers of 
observed states to this format is to use "bin/mahout hmmchunks".

After parallel Viterbi is ended, decoded sequences will be stored in sequence 
files, one for each serie

h3. Usage

Run "bin/mahout pviterbi" and see what it wants from you. That is:&nbsp;

* serialized HmmModel (i.e. by LossyHmmModelSerializer class)
* input data (observed sequences) in the format described above

*References*

# [Wikipedia article|http://en.wikipedia.org/wiki/Viterbi_algorithm]

Change your notification preferences: 
https://cwiki.apache.org/confluence/users/viewnotifications.action

[CONF] Apache Mahout > Parallel Viterbi

Reply via email to