> Rasmus, your version has ANDing by mask, and resetting the mask at each
> iteration of main loop. I think we can avoid it. What do you think on next?
Yes, that's basically what I proposed (modulo checking for zero size and my buggy LAST_WORD_MASK).

But two unconditional instructions in the loop are awfully minor; it's the loads and conditional branches that cost. The reset of the mask can be done in parallel with other operations; it's only the AND that actually takes a cycle.

I can definitely see the argument that, for code that's not used often enough to stay resident in the L1 cache, any speedup has to win by at least one L2 cache access to be worth taking another cache line. For Ivy Bridge, those numbers are 32 KB (L1 cache size) and 12 cycles (L2 access latency).
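
For reference, here is a minimal sketch of the loop shape under discussion: the mask is ANDed into every word and unconditionally reset to all-ones on each iteration. The function name sketch_find_next_bit and the helper lowest_set_bit are made up for illustration, and this is not the code from either patch; it only assumes the usual semantics of returning nbits when no set bit is found.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical helper: index of the lowest set bit; word must be nonzero. */
static unsigned long lowest_set_bit(unsigned long word)
{
	unsigned long n = 0;

	while (!(word & 1UL)) {
		word >>= 1;
		n++;
	}
	return n;
}

/*
 * Hypothetical sketch, not the kernel's find_next_bit().
 * Returns the index of the first set bit at or after 'start',
 * or 'nbits' if there is none.
 */
unsigned long sketch_find_next_bit(const unsigned long *addr,
				   unsigned long nbits,
				   unsigned long start)
{
	unsigned long idx, mask, word, bit;

	if (start >= nbits)		/* also covers nbits == 0 */
		return nbits;

	idx = start / BITS_PER_LONG;
	mask = ~0UL << (start % BITS_PER_LONG);	/* hide bits below start */

	for (;;) {
		word = addr[idx] & mask;	/* the unconditional AND */
		mask = ~0UL;			/* the unconditional reset; it does
						 * not depend on the AND's result */
		if (word)
			break;
		idx++;
		if (idx * BITS_PER_LONG >= nbits)
			return nbits;
	}

	/* Bits past nbits in the last word may be garbage; clamp. */
	bit = idx * BITS_PER_LONG + lowest_set_bit(word);
	return bit < nbits ? bit : nbits;
}
```

The only per-iteration costs that sit on the critical path here are the load, the AND, and the two branches; the mask reset has no dependency on the loaded word, which is why it can overlap with the rest.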