> The swap could be done without temporaries, but I assume you're also trying to match the look of the pseudocode?
It would be interesting to see how fast the code can get without significantly altering its look, or alternatively how much one would have to change to achieve speedups. I profiled the code on a 500 × 500 random matrix and, IIRC, the swaps took about 0.5% of the execution time, so I'm not too concerned with those particular lines.
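
For reference, here is a minimal sketch of what a swap without explicit temporaries could look like. It assumes Python and a list-of-lists matrix, neither of which is taken from the code under discussion; `swap_rows` and `swap_entries` are hypothetical names used only for illustration.

```python
def swap_rows(matrix, i, j):
    """Swap rows i and j in place using tuple assignment (no named temporary)."""
    # The right-hand side is evaluated fully before either target is assigned.
    matrix[i], matrix[j] = matrix[j], matrix[i]


def swap_entries(matrix, r1, c1, r2, c2):
    """Swap two individual entries in place, again without a named temporary."""
    matrix[r1][c1], matrix[r2][c2] = matrix[r2][c2], matrix[r1][c1]
```

Whether this reads closer to the pseudocode than a three-line swap is a matter of taste, and at roughly 0.5% of the runtime it is unlikely to change the overall timing much either way.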