On Wed, Jun 25, 2014 at 3:40 AM, Jan Krämer <kraemer...@googlemail.com> wrote: > right in time for Midtermevalutaions I have a bigger update for you. > Viterbi Decoder is now fully Volkified, but results are a bit meh :/ > > You can read more on my blog > http://spectrejan.blogspot.de/
Hi Jan, I only recently read your blog, so I may not be familiar with the overall objectives. But I have some questions and comments. How essential is it for you to compute full Euclidean metrics? The general case for convolutional decoding does call for Euclidean metrics, but most practical FEC use cases will use binary codes and equal power symbols to which the distance calculation reduces to a basic dot product. Furthermore, using sign changes, the branch metric unit simplifies to a max-sum operation and the metrics can be computed without the use of multipliers. If you look at the SPIRAL code that Tom mentioned, you'll see that that multiplication is not used. But, those are specific cases, and if the objective necessitates full Euclidean metrics, than one needs to compute Euclidean metrics. for (int k = 0; k < _PACKETLENGTH; ++k) { obsValLSB = _mm_load1_ps(obsPtr++); distLB = _mm_sub_ps(obsValLSB,tValLSB); distLB = _mm_mul_ps(distLB, distLB); obsValMSB = _mm_load1_ps(obsPtr++); distHB = _mm_sub_ps(obsValMSB,tValMSB); distHB = _mm_mul_ps(distHB, distHB); metricVal = _mm_add_ps(distHB, distLB); _mm_store_ps(metricPtr, metricVal); metricPtr += 4; } Regarding the code, the main performance limitation you'll run into with this segment is pipeline utilization. If you look at the argument dependencies, and therefore register dependencies at the instruction level, each math intrinsic is dependent on the results of the prior instruction. This creates read-after-write (RAW) hazards, and will inevitably lead to a pipeline stall at some point. Both the compiler and the CPU will aggressively attempt to reorder instructions to avoid this scenario, but there's very little to reorder in this case. What may help, then, is to give the compiler and cpu some help by unrolling the loop such that there is more code between branch points. That will allow instructions to be rearranged to alleviate the argument dependencies. Looking again at the SPIRAL example, the entire vertical pass through the 64-state trellis is fully unrolled leaving only a single conditional for normalization purposes. SPIRAL is an extreme example, but the above code will likely benefit from a longer loop section. Hope that helps. -TT _______________________________________________ Discuss-gnuradio mailing list Discuss-gnuradio@gnu.org https://lists.gnu.org/mailman/listinfo/discuss-gnuradio