Actually, DJB just made a very relevant suggestion. As I've mentioned, the 32-bit performance problems are an x86-specific problem. ARM does very well, and other processors aren't bad at all.
SipHash fits very nicely (and runs very fast) in the MMX registers. They're 64 bits, and there are 8 of them, so the integer registers can be reserved for pointers and loop counters and all that. And there's reference code available. How much does kernel_fpu_begin()/kernel_fpu_end() cost? Although there are a lot of pre-MMX x86es in embedded control applications, I don't think anyone is worried about their networking performance. (Specifically, all of this affects only connection setup, not throughput on established connections.)