On Wed, May 31, 2017 at 03:46:17PM -0700, Jonathan M Davis via Digitalmars-d-learn wrote:
> On Wednesday, May 31, 2017 12:13:04 H. S. Teoh via Digitalmars-d-learn
> wrote:
> > I did some digging around, and it seems that wc is using glibc's
> > memchr, which is highly-optimized, whereas std.algorithm.count just
> > uses a simplistic loop. Which is strange, because I'm pretty sure
> > somebody optimized std.algorithm some time ago to use memchr()
> > instead of a loop when searching for a byte value in an array.
> > Whatever happened to that??
>
> I don't know, but memchr wouldn't work with CTFE, so someone might
> have removed it to make it work in CTFE (though that could be done
> with a different branch for CTFE). Or maybe it never made it into
> std.algorithm for one reason or another.
[...]
I checked the Phobos code again, and it appears that my memory deceived me. Somebody *did* add the memchr optimization to find() and its friends, but not to count(). CTFE compatibility is not a problem, since we can just if(__ctfe) the optimized block away.

I'm currently experimenting with a memchr-optimized version of count(), but I'm getting mixed results: on small arrays, or large arrays densely packed with matching elements, the memchr version runs rather slowly, because it incurs a function call into the C library per matching element. On large arrays only sparsely populated with matching elements, though, the memchr-optimized version beats the current code by about an order of magnitude.

Since it wouldn't be wise to assume sparsity of matches in Phobos, I decided to do a little more digging, and looked up the glibc implementation of memchr. The main optimization is that it iterates over the array not by byte, as a naïve loop would, but by ulongs. (Of course, the first n bytes and last n bytes that are not ulong-aligned are checked with a per-byte loop, so for very short arrays it doesn't lose out to the naïve loop.) In each ulong iteration, it performs the bit-twiddling hack alluded to by Nitram to detect the presence of matching bytes, upon which it breaks out to the closing per-byte loop to find the first match. For short arrays, or arrays where a match is found early, it's comparable in performance to the naïve loop; for large arrays where the match is not found until later, it easily outperforms the naïve loop.

My current thought is to adopt the same approach: iterate over size_t or some such larger unit, and adapt the bit-twiddling hack to count the number of matches in each size_t. This is turning out to be trickier than I'd like, though, because there is a case where carry propagation makes it unclear how to derive the number of matches without iterating over the bytes a second time.
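For the curious, the word-at-a-time scan can be sketched in C roughly like this. This is a simplified illustration of the glibc-style technique, not glibc's or Phobos's actual code; the names (has_byte, memchr_sketch) are mine, and the alignment handling of the head/tail is deliberately reduced to a plain byte loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Broadcast the needle byte into every byte of a 64-bit word. */
static uint64_t broadcast(unsigned char c) {
    return 0x0101010101010101ULL * c;
}

/* Classic bit-twiddling test: nonzero iff some byte of w equals c.
   Note that once a borrow propagates across bytes, the individual
   result bits are no longer exact per byte; this only answers
   "is there a match somewhere in this word?". */
static uint64_t has_byte(uint64_t w, unsigned char c) {
    uint64_t x = w ^ broadcast(c);            /* matching bytes -> 0x00 */
    return (x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL;
}

/* Simplified memchr-style scan: stride by 64-bit words, then fall
   back to a per-byte loop for the tail or once a candidate word is
   found. */
static const void *memchr_sketch(const void *s, unsigned char c, size_t n) {
    const unsigned char *p = s;
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t w;
        memcpy(&w, p + i, 8);                 /* avoids unaligned-access UB */
        if (has_byte(w, c))
            break;                            /* match is in this word */
    }
    for (; i < n; ++i)                        /* byte loop for tail/hit */
        if (p[i] == c)
            return p + i;
    return NULL;
}
```

The memcpy into a local word is the portable stand-in for glibc's aligned loads; a compiler will turn it into a single load.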
But this may not be a big problem: since size_t.sizeof is relatively small, I can probably loop over individual bytes once one or more matches are detected, and a sufficiently capable optimizer like ldc or gdc should be able to unroll this into a series of sete + add instructions, with no branches that might stall the CPU pipeline. For densely-matching arrays this should still have performance comparable to the naïve loop; for sparsely-matching arrays it should show significant speedups.

T

-- 
My program has no bugs! Only unintentional features...
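For what it's worth, there is a known variant of the hack that sidesteps the carry-propagation problem entirely: mask off the high bit of every byte before the add, so no carry can cross a byte boundary, and the match count then falls out of a single popcount. A hedged C sketch (the helper name is mine, and __builtin_popcountll is a GCC/Clang intrinsic, so this is one possible formulation rather than anything in Phobos):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Count the bytes of w equal to c, exactly, with no per-byte loop.
   Masking with 0x7f before adding keeps every carry confined to its
   own byte, so no cross-byte borrow can corrupt the count. */
static unsigned count_bytes(uint64_t w, unsigned char c) {
    const uint64_t lo7 = 0x7f7f7f7f7f7f7f7fULL;
    uint64_t x = w ^ (0x0101010101010101ULL * c); /* matching bytes -> 0x00 */
    uint64_t y = ((x & lo7) + lo7) | x;           /* high bit set iff byte != 0 */
    y = ~(y | lo7);                               /* 0x80 in each matching byte */
    /* popcount of y == number of zero bytes of x == number of matches */
    return (unsigned)__builtin_popcountll(y);
}
```

Per byte: a zero byte of x yields 0x7f after the masked add, leaving its high bit clear, while any nonzero byte (including 0x80 itself, caught by the final `| x`) ends with its high bit set; complementing and popcounting the high bits therefore counts the matches exactly, endian-independently.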