> I would still suggest Salsa20 or ChaCha. My measurements with naive C
> code suggest that, if you buffer the output for short outputs, these
> take on average 40-50 Ivy Bridge cycles per request. (If you don't
> buffer the output, it's 300 cycles.) Long requests get ~4 cpb. In
> contrast, libc random(3) takes on average 50-60 Ivy Bridge cycles per
> request, and long requests get ~13 cpb.
It is also possible to use reduced-round versions of these. The default is 20 rounds, and 20, 12, and 8 are the suggested numbers of rounds. I believe the best known attacks still reach only 8 rounds of Salsa20 and 7 of ChaCha, and those attacks (on Salsa20/8 and ChaCha/7) are not that good, so it may well be safer to use ChaCha/8 than some other algorithm.

-Matt
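
To make the buffering and round-count points concrete, here is a rough, unoptimized sketch in C of a ChaCha-based generator that produces one 64-byte block at a time and hands it out word by word. It is only an illustration, not the code behind the timings quoted above. The names `chacha_rng`, `chacha_rng_init`, and `chacha_rng_next` are made up for this example; the state layout assumes the original ChaCha arrangement (64-bit block counter in words 12-13, nonce in words 14-15, left at zero here), and proper seeding of the key is out of scope.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* ChaCha quarter-round on four state words. */
#define QR(a, b, c, d)                      \
    do {                                    \
        a += b; d ^= a; d = ROTL32(d, 16);  \
        c += d; b ^= c; b = ROTL32(b, 12);  \
        a += b; d ^= a; d = ROTL32(d, 8);   \
        c += d; b ^= c; b = ROTL32(b, 7);   \
    } while (0)

/* Hypothetical generator state: ChaCha input block plus one buffered output block. */
struct chacha_rng {
    uint32_t in[16];  /* constants, key, 64-bit block counter, nonce */
    uint32_t buf[16]; /* most recently generated block */
    unsigned used;    /* number of words of buf already handed out */
    unsigned rounds;  /* 20, 12, or 8 */
};

/* One ChaCha block with a configurable (even) number of rounds. */
static void chacha_block(const uint32_t in[16], uint32_t out[16], unsigned rounds)
{
    uint32_t x[16];
    unsigned i;

    memcpy(x, in, sizeof x);
    for (i = 0; i < rounds; i += 2) {
        /* column round */
        QR(x[0], x[4], x[8],  x[12]);
        QR(x[1], x[5], x[9],  x[13]);
        QR(x[2], x[6], x[10], x[14]);
        QR(x[3], x[7], x[11], x[15]);
        /* diagonal round */
        QR(x[0], x[5], x[10], x[15]);
        QR(x[1], x[6], x[11], x[12]);
        QR(x[2], x[7], x[8],  x[13]);
        QR(x[3], x[4], x[9],  x[14]);
    }
    for (i = 0; i < 16; i++)
        out[i] = x[i] + in[i];
}

static void chacha_rng_init(struct chacha_rng *r, const uint32_t key[8], unsigned rounds)
{
    /* "expand 32-byte k" as four little-endian words */
    static const uint32_t sigma[4] = {
        0x61707865, 0x3320646e, 0x79622d32, 0x6b206574
    };

    memcpy(r->in, sigma, sizeof sigma);
    memcpy(r->in + 4, key, 8 * sizeof(uint32_t));
    memset(r->in + 12, 0, 4 * sizeof(uint32_t)); /* counter = 0, nonce = 0 */
    r->used = 16;                                /* force a refill on first use */
    r->rounds = rounds;
}

/* Return one 32-bit word; the block function runs only once per 16 requests. */
static uint32_t chacha_rng_next(struct chacha_rng *r)
{
    if (r->used == 16) {
        chacha_block(r->in, r->buf, r->rounds);
        if (++r->in[12] == 0) /* carry into the high counter word */
            ++r->in[13];
        r->used = 0;
    }
    return r->buf[r->used++];
}
```

With this shape, a short request costs an array read and an increment in the common case, and the full block computation is amortized over 16 requests. Passing 8 or 12 instead of 20 as `rounds` to `chacha_rng_init` selects a reduced-round variant.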