This suddenly reminds me of some of the speedup assembly I was writing for wideint, but it seems I've lost my code. Too bad: the 128-bit multiply had already been sped up, and the division still needed some work.
I'm a taker if you have an algorithm that reuses the 32-bit divide in wideint division instead of scanning bits :)
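
To make concrete what I mean by reusing the hardware divide: the only case I know to be easy is when the divisor itself fits in 32 bits. Then it is plain schoolbook long division over 32-bit limbs, with one 64-by-32 divide per limb and no bit scanning. A rough sketch in C, not wideint code; the name `limbs_divmod_u32` and the little-endian limb layout are my own assumptions for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of wideint): divides a little-endian array
 * of 32-bit limbs in place by a nonzero 32-bit divisor d and returns the
 * remainder. Works from the most significant limb down, carrying the
 * remainder into the next step like pencil-and-paper long division. */
uint32_t limbs_divmod_u32(uint32_t *limbs, size_t n, uint32_t d)
{
    uint64_t rem = 0;
    for (size_t i = n; i-- > 0; ) {
        /* rem < d, so cur < d * 2^32 and the quotient fits in 32 bits. */
        uint64_t cur = (rem << 32) | limbs[i];
        limbs[i] = (uint32_t)(cur / d);
        rem = cur % d;
    }
    return (uint32_t)rem;
}
```

For a 128-bit value, `n` would be 4. This does not cover the part I am really after, a divisor wider than 32 bits; as far as I know that needs something like Knuth's Algorithm D, where each quotient limb is estimated from the top limbs and then corrected.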