This post is related to https://issues.apache.org/jira/browse/MAHOUT-652

As I mentioned before, I have been thinking about possible parallelization
strategies, and here is what I have at the moment. NB: I'm from Russia and
my English is far from ideal. I'll try to explain my thoughts as clearly as
I can, but if you misunderstand something, please just ask.

The standard sequential Viterbi algorithm is well described in many
sources (for example, in Wikipedia,
http://en.wikipedia.org/wiki/Viterbi_algorithm), so I won't explain it
here.
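For reference, here is a minimal sketch of the sequential algorithm in log
space (the function and parameter names are mine, just for illustration):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely hidden-state path for one observation sequence.

    log_start[k]        : log P(first hidden state is k)
    log_trans[(k1, k)]  : log P(k | k1)
    log_emit[(k, sym)]  : log P(sym | k)
    """
    # V[t][k] = log-probability of the best path ending in state k at time t
    V = [{k: log_start[k] + log_emit[(k, obs[0])] for k in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for k in states:
            # Best predecessor of state k at time t
            k1 = max(states, key=lambda s: V[t - 1][s] + log_trans[(s, k)])
            V[t][k] = V[t - 1][k1] + log_trans[(k1, k)] + log_emit[(k, obs[t])]
            back[t][k] = k1
    # Backtrack from the best final state
    last = max(states, key=lambda k: V[-1][k])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Wikipedia's rainy/sunny example
states = ['Rainy', 'Sunny']
log_start = {'Rainy': math.log(0.6), 'Sunny': math.log(0.4)}
log_trans = {('Rainy', 'Rainy'): math.log(0.7), ('Rainy', 'Sunny'): math.log(0.3),
             ('Sunny', 'Rainy'): math.log(0.4), ('Sunny', 'Sunny'): math.log(0.6)}
log_emit = {('Rainy', 'walk'): math.log(0.1), ('Rainy', 'shop'): math.log(0.4),
            ('Rainy', 'clean'): math.log(0.5), ('Sunny', 'walk'): math.log(0.6),
            ('Sunny', 'shop'): math.log(0.3), ('Sunny', 'clean'): math.log(0.1)}
print(viterbi(['walk', 'shop', 'clean'], states, log_start, log_trans, log_emit))
# -> ['Sunny', 'Rainy', 'Rainy']
```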

Here is what I discovered about possible parallelization of this algorithm.

Suppose we have D, the set of observed state sequences, an HMM with
known parameters, K, the number of hidden states, and N_i, the length
of the i-th sequence.

1. The first idea that comes to mind is that we could move through the
sequences and parallelize the computation of the optimal path to the next
step. That is, we would parallelize over the hidden states.

In terms of the Wikipedia article, that is parallelizing the
computation of $V_{t,k}$ for each $k$.

Since the number of states is usually not large (for the part-of-speech
tagging task it is about 6 for English), such parallelization is not
reasonable in the map-reduce paradigm.

2. Well, then we could divide all input sequences into chunks and process
them independently.

The first chunk of each sequence is computed as usual. The other
chunks cannot be processed as easily.

Let's divide any given sequence into chunks of size L, and let s[m] be
chunk number m (starting from zero).
We need to compute $V_{t,k}$ for every t in [L*m, L*(m+1)) and every
k. But to compute $V_{L*m,k}$ we need to know $V_{L*m-1,k1}$ for each
k1, that is, a value from the previous chunk, which is computed
independently. Actually, we do not need to know $V_{L*m-1,k1}$
exactly to get the optimal path; it's enough to know which k1
maximizes the equation. But we don't even know that, so we can compute
an optimal path for each k1 by assuming that each k1 is the argmax
and that $V_{L*m-1,k1} = 1$ (or 0 if we add the logs of the probabilities).
Later, at the "connecting" step, we just have to multiply $V_{L*(m+1) -
1,k}$ by the actual value of the argmax equation (or add it) and obtain the
optimal sub-path.

This step will require O(L*K^3) time for each chunk.

After all chunks have been processed, we know for certain the optimal
paths in the first chunk, so we can easily throw away all paths in the
second chunk that are known to be non-optimal, then do the same for the
third and so on.
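To make the chunk scheme above concrete, here is a rough sketch (names and
structure are my own, just for illustration): every chunk after the first is
solved once per possible entry state k1, and those independent sub-problems,
which are the part a map-reduce job could distribute, are then joined at the
"connecting" step by adding the real boundary scores back in:

```python
def run_chunk(obs_chunk, states, log_trans, log_emit, entry):
    """Viterbi within one chunk, assuming the previous chunk ended in state
    `entry` with score 0 (log space), as described above.  Returns, for each
    end state k, the best conditional log-score and state path."""
    V = {k: log_trans[(entry, k)] + log_emit[(k, obs_chunk[0])] for k in states}
    paths = {k: [k] for k in states}
    for sym in obs_chunk[1:]:
        newV, newpaths = {}, {}
        for k in states:
            k1 = max(states, key=lambda s: V[s] + log_trans[(s, k)])
            newV[k] = V[k1] + log_trans[(k1, k)] + log_emit[(k, sym)]
            newpaths[k] = paths[k1] + [k]
        V, paths = newV, newpaths
    return V, paths

def chunked_viterbi(obs, states, log_start, log_trans, log_emit, L):
    """Split obs into chunks of size L; solve each later chunk once per
    possible entry state (the O(L*K^3) part), then connect the chunks."""
    chunks = [obs[i:i + L] for i in range(0, len(obs), L)]
    # First chunk: ordinary Viterbi from the start distribution.
    V = {k: log_start[k] + log_emit[(k, chunks[0][0])] for k in states}
    paths = {k: [k] for k in states}
    for sym in chunks[0][1:]:
        newV, newpaths = {}, {}
        for k in states:
            k1 = max(states, key=lambda s: V[s] + log_trans[(s, k)])
            newV[k] = V[k1] + log_trans[(k1, k)] + log_emit[(k, sym)]
            newpaths[k] = paths[k1] + [k]
        V, paths = newV, newpaths
    for chunk in chunks[1:]:
        # Independent sub-problems, one per assumed entry state.
        per_entry = {e: run_chunk(chunk, states, log_trans, log_emit, e)
                     for e in states}
        # Connecting step: add the real boundary score V[e] back in and keep,
        # for each end state k, the best combination.
        newV, newpaths = {}, {}
        for k in states:
            e = max(states, key=lambda s: V[s] + per_entry[s][0][k])
            newV[k] = V[e] + per_entry[e][0][k]
            newpaths[k] = paths[e] + per_entry[e][1][k]
        V, paths = newV, newpaths
    return paths[max(states, key=lambda k: V[k])]
```

For any chunk size L this returns the same path as the sequential algorithm,
because the exact boundary scores are restored at the connecting step.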

In parallel mode it will take O((M/P) * K^3) vs. O(M * K^2) for
sequential Viterbi, where P is the number of "cores" and M is the sum
of all the N_i. It's bad that there is K cubed, but since K is usually
small and P may be much greater than K, this will still give a speed-up.
At least it scales.

3. We could just process every chunk in parallel for
O((M/P) * K^2), without "connecting" the chunks to each other. But this
will work well only if we have a lot of sequences of about the same
length. If not, well, I need to think about this more.
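One possible way to cope with unequal lengths (only a guess on my part, not
something I have tried) is to assign whole sequences to the P workers
greedily, longest first, so each worker runs plain sequential Viterbi on a
roughly equal share of observations:

```python
import heapq

def balance(sequences, P):
    """Greedy longest-first assignment of whole sequences to P workers.
    Each worker then runs ordinary sequential Viterbi on its share, so the
    per-worker cost stays O(K^2) per observation, with no connecting step.
    Returns a list of P buckets of sequence indices."""
    # Min-heap of (current load, worker id); load = observations assigned so far.
    heap = [(0, w) for w in range(P)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(P)]
    order = sorted(range(len(sequences)),
                   key=lambda i: len(sequences[i]), reverse=True)
    for i in order:
        load, w = heapq.heappop(heap)        # least-loaded worker so far
        buckets[w].append(i)
        heapq.heappush(heap, (load + len(sequences[i]), w))
    return buckets
```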

That's not all of it, but it's all for today.
