Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: The problem is the final use, where Q2 is added, with carry, to a different register. It's tempting to replace adc Q1I, Q2 with sbb Q2, Q1I and negated Q2, but I'm afraid that will get the sense of the carry

Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund writes:
> I looked at the logic following this:
>
>   sbb U2, U2 C 7 13
>
> You negate the U2 copy in Q2. It seems that three adc by sbb
> could avoid the neg.
The problem is the final use, where Q2 is added, with carry, to a different register. It's temptin
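The concern about the sense of the carry can be checked with a small model. The sketch below is only an illustration in Python, not the actual div_qr_1 code: `adc`, `sbb` and `neg` are hypothetical helpers modelling the 64-bit x86 instructions, and the operand values are made up. It compares adding a value with carry against subtracting its negation with the same carry flag.

```python
MASK = (1 << 64) - 1  # model 64-bit registers

def adc(dst, src, cf):
    """Model of x86 adc: dst + src + CF. Returns (result, carry_out)."""
    total = dst + src + cf
    return total & MASK, total >> 64

def sbb(dst, src, cf):
    """Model of x86 sbb: dst - src - CF. Returns (result, borrow_out)."""
    total = dst - src - cf
    return total & MASK, 1 if total < 0 else 0

def neg(x):
    """Model of x86 neg: two's complement negation modulo 2^64."""
    return (-x) & MASK

def compare(a, b, cf):
    """Return (a + b + cf, a - (-b) - cf), both reduced modulo 2^64."""
    with_adc, _ = adc(a, b, cf)
    with_sbb, _ = sbb(a, neg(b), cf)
    return with_adc, with_sbb

if __name__ == "__main__":
    for cf in (0, 1):
        # Hypothetical operand values, chosen only for illustration.
        print(cf, compare(5, 7, cf))
```

With the carry clear the two forms agree. With the carry set, adc adds one while sbb subtracts one, so the results differ by two; the carry enters with the opposite sign.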

Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
I looked at the logic following this:
  sbb U2, U2 C 7 13
You negate the U2 copy in Q2. It seems that replacing three adc by sbb could avoid the neg. It might also be possible to replace the early loop "and" stuff by cmov. Note that the carry flag survives dec, although that causes a pi
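As a rough illustration of the "and" versus cmov remark, here is a Python model of the two ways to select a value based on the carry flag. This is a sketch under assumptions: `sbb_same`, `select_with_and` and `select_with_cmov` are hypothetical names, and the real loop may use the mask differently.

```python
MASK = (1 << 64) - 1  # model 64-bit registers

def sbb_same(cf):
    """Model of `sbb r, r`: r - r - CF, which is 0 or all ones."""
    return (0 - cf) & MASK

def select_with_and(value, cf):
    """Mask-based select: build the mask with sbb, then apply `and`."""
    return value & sbb_same(cf)

def select_with_cmov(value, cf):
    """cmov-style select: start from zero, move `value` in if carry is set."""
    result = 0
    if cf:
        result = value
    return result

if __name__ == "__main__":
    v = 0x123456789ABCDEF0  # hypothetical operand
    for cf in (0, 1):
        print(cf, hex(select_with_and(v, cf)), hex(select_with_cmov(v, cf)))
```

Both forms give the same value. The mask form needs the sbb plus an and, while the cmov form consumes the carry flag directly.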

Re: div_qr_1 interface

2013-10-21 Thread Niels Möller
Torbjorn Granlund writes:
> On Intel chips, op-to-mem is expensive. Even op-from-memory is often
> slower than load+op. (I understand the register shortage problem.)
The following (untested) variant needs one register too many.
  UP, QP, UN: Load, store, loop counter.
  DINV, B2, B2

Re: div_qr_1 interface

2013-10-21 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Will try that. I think one could also try to delay the quotient store one iteration, keeping "Q1" in a register until the next iteration. Then one gets rid of the adc Q2,8(QP, UN, 8) in the loop, using only a single store per it
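The delayed-store idea can be sketched abstractly. The Python model below is an assumption-laden illustration, not the div_qr_1 algorithm: it assumes each iteration yields a new limb `q1` and an adjustment `q2` that applies only to the limb produced by the previous iteration, that the first iteration's adjustment is handled elsewhere, and that any carry out of the adjusted limb is dropped. The function names are hypothetical.

```python
MASK = (1 << 64) - 1  # model 64-bit limbs

def store_then_adjust(steps):
    """Store q1 at once; adjust the previously stored limb in memory."""
    out = []
    for q1, q2 in steps:
        if out:
            # Read-modify-write on memory, like an adc with a memory operand.
            out[-1] = (out[-1] + q2) & MASK
        out.append(q1)
    return out

def delayed_store(steps):
    """Keep q1 in a 'register' and store it one iteration later."""
    out = []
    pending = None
    for q1, q2 in steps:
        if pending is not None:
            # Adjustment happens in the register; one plain store per iteration.
            out.append((pending + q2) & MASK)
        pending = q1
    if pending is not None:
        out.append(pending)  # final store after the loop
    return out

if __name__ == "__main__":
    steps = [(1, 0), (2, 1), (MASK, 0), (4, 1)]  # made-up (q1, q2) pairs
    print(store_then_adjust(steps) == delayed_store(steps))
```

Under these assumptions both variants write the same limbs. The delayed form trades the in-loop memory update for one extra live register and a store after the loop.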