I agree. I don't know exactly how the CPU handles misaligned floats, but from what I understand, it does two loads to fetch the two word-aligned parts of the float and then assembles them. This may be what's causing the slowdown.
Removing the `align(1)` changes nothing, though: it's not even 1 ms slower or faster, unfortunately.
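
To make the two-load idea concrete, here is a rough C sketch of it done in software. It only models my understanding and is not a claim about what any particular CPU does internally. The helper name `load_float_split` is made up for illustration, and the code assumes a little-endian machine and a 32-bit `float`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any library): read a 32-bit float that
 * starts at byte offset `off` inside a word-aligned buffer, using only
 * aligned 32-bit loads. Assumes little-endian byte order and that
 * words[off / 4 + 1] is readable whenever `off` is not a multiple of 4. */
static float load_float_split(const uint32_t *words, size_t off)
{
    size_t idx = off / 4;                      /* aligned word holding the first byte */
    unsigned shift = (unsigned)(off % 4) * 8;  /* bit offset of the float in that word */
    uint32_t bits;

    if (shift == 0) {
        /* Aligned case: a single load is enough. */
        bits = words[idx];
    } else {
        /* Misaligned case: two aligned loads, then stitch the halves. */
        uint32_t lo = words[idx];
        uint32_t hi = words[idx + 1];
        bits = (lo >> shift) | (hi << (32 - shift));
    }

    /* Reinterpret the assembled bit pattern as a float without aliasing issues. */
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The misaligned path does an extra load plus some shifting, which is the kind of overhead I was guessing at. Since removing `align(1)` made no difference in my timings, that guess may simply not apply here.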