I have done more experiments. On a Celeron CPU, the movntps instruction seems to give a performance gain only when the array is longer than about 200_000 ints. For arrays of 25 to 200_000 ints, movaps is faster (and also faster than C memset). For n < 25 the best thing I've found is just an inlined loop.
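The size-based dispatch can be sketched roughly like this. It is written in C with SSE intrinsics rather than D inline asm, so it is only an illustration of the idea, not the code I benchmarked. The function name fill_ints and the two constants are hypothetical, and the thresholds are just the approximate values measured on the Celeron, so they will need tuning on other CPUs. _mm_store_ps compiles to movaps and _mm_stream_ps to movntps; the latter writes around the cache, which is presumably why it only pays off for large arrays.

```c
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 intrinsics; also pulls in the SSE ones */

/* Thresholds in ints, taken from the Celeron measurements (assumed, tune per CPU). */
#define SMALL_LIMIT  25
#define STREAM_LIMIT 200000

/* Hypothetical helper: store `value` into n ints starting at dst. */
static void fill_ints(int *dst, size_t n, int value)
{
    size_t i = 0;

    /* Very short arrays: a plain loop, which the compiler can inline. */
    if (n < SMALL_LIMIT) {
        for (; i < n; i++)
            dst[i] = value;
        return;
    }

    /* Scalar stores until dst + i is 16-byte aligned (needed by movaps/movntps). */
    while (i < n && ((uintptr_t)(dst + i) & 15) != 0)
        dst[i++] = value;

    /* Four copies of the int bit pattern in one XMM register. */
    __m128 v = _mm_castsi128_ps(_mm_set1_epi32(value));

    if (n > STREAM_LIMIT) {
        /* Large arrays: non-temporal stores (movntps), bypassing the cache. */
        for (; i + 4 <= n; i += 4)
            _mm_stream_ps((float *)(dst + i), v);
        _mm_sfence();  /* make the streamed stores visible before returning */
    } else {
        /* Medium arrays: ordinary aligned stores (movaps). */
        for (; i + 4 <= n; i += 4)
            _mm_store_ps((float *)(dst + i), v);
    }

    /* Remaining 0..3 ints at the tail. */
    for (; i < n; i++)
        dst[i] = value;
}
```

Usage is the same as a typed memset, for example fill_ints(arr, len, 0).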
I'll post some code in the main D newsgroup shortly. Bye, bearophile