Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: * The code is no win for AMD k10/k8 (although close to 10 c/l might well be possible) I tried replacing one masking op by cmov, as you suggested. We then get down to 11.25 c/l on K10. I put this modified version in the k10 subdirectory, since it was
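For readers unfamiliar with the trade-off mentioned above, here is a minimal C sketch of the two ways to write a branch-free conditional add. It is not the GMP assembly under discussion; the function names `cond_add_mask` and `cond_add_select` are hypothetical, and whether a compiler actually emits `cmov` for the second form depends on the compiler and target.

```c
#include <stdint.h>

/* Masking form: build an all-ones or all-zero mask from the condition,
   then AND it with the addend. Costs a negate/AND on the critical path. */
static uint64_t cond_add_mask(uint64_t r, uint64_t d, int cond)
{
    uint64_t mask = -(uint64_t)(cond != 0);  /* 0 or 0xFFFF...FFFF */
    return r + (d & mask);
}

/* Select form: compute the sum unconditionally, then pick one value.
   On x86-64 a compiler will often (not always) lower this to cmov. */
static uint64_t cond_add_select(uint64_t r, uint64_t d, int cond)
{
    uint64_t t = r + d;
    return cond ? t : r;
}
```

Both functions return the same value for every input; which one is faster depends on the micro-architecture, which is the point of the per-CPU cycle counts quoted in the thread.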

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: Torbjorn Granlund t...@gmplib.org writes: * The code is no win for AMD k10/k8 (although close to 10 c/l might well be possible) I tried replacing one masking op by cmov, as you suggested. We then get down to 11.25 c/l on K10. I put

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
It turned out the code was a bit slower on k8. This patch changes that. With it applied, things take 11 c/l on both pipelines. This is also a 2 c/l improvement for piledriver. I have not tested that this is correct. If you like the patch, please consider putting the result in the k8 subdir.

Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: It turned out the code was a bit slower on k8. This patch changes that. With it applied, things take 11 c/l on both pipelines. This is also a 2 c/l improvement for piledriver. Cool. I have not tested that this is correct. If you like the patch,

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
I played more with the code, now trying to break the add-adc-sbb-cmov chain, for the benefit of most Intel processors. But I lack unit testing code for the function, making hacking quite cumbersome. I don't feel safe hacking *any* GMP assembly code without tests/devel/try.c's function and access

Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes: But I lack unit testing code for the function, making hacking quite cumbersome. I don't feel safe hacking *any* GMP assembly code without tests/devel/try.c's function and access checks. tests/mpn/t-div.c includes tests for mpn_div_qr_1, including

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: But sure, support also in try.c would be good. Added now. Please have a look if the changes are sane. I use the second source for the uh input, and I added a DATA_DIV_QR_1 to get it in the

Re: div_qr_1 interface

2013-10-22 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: But sure, support also in try.c would be good. Added now. And sure enough, it detects some bugs in the new assembly code. For size n==1, there's a missing mov. I'll add that shortly. Then there's another
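The kind of reference check that catches small-size bugs such as the n==1 and n==2 cases above can be sketched as follows. This is not the code in tests/devel/try.c or tests/mpn/t-div.c; `ref_div_1` and `check_div_1` are hypothetical names, and the sketch assumes 32-bit limbs (least significant first) so that `uint64_t` can hold two-limb intermediates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference: divide the n-limb number np[0..n-1] by d (d != 0),
   store the n quotient limbs in qp, and return the remainder. */
static uint32_t ref_div_1(uint32_t *qp, const uint32_t *np, size_t n, uint32_t d)
{
    uint64_t r = 0;
    for (size_t i = n; i-- > 0; ) {
        uint64_t cur = (r << 32) | np[i];  /* r < d, so cur / d fits in a limb */
        qp[i] = (uint32_t)(cur / d);
        r = cur % d;
    }
    return (uint32_t)r;
}

/* Verify q*d + r == np by multiplying back; returns 1 if the result is valid. */
static int check_div_1(const uint32_t *qp, uint32_t r,
                       const uint32_t *np, size_t n, uint32_t d)
{
    uint64_t carry = r;
    if (r >= d)
        return 0;                          /* remainder must be reduced */
    for (size_t i = 0; i < n; i++) {
        uint64_t t = (uint64_t)qp[i] * d + carry;
        if ((uint32_t)t != np[i])
            return 0;
        carry = t >> 32;
    }
    return carry == 0;
}
```

An optimized routine under test would be run on the same inputs for every small n, with its quotient and remainder passed to `check_div_1` or compared against `ref_div_1`.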

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: And sure enough, it detects some bugs in the new assembly code. For size n==1, there's a missing mov. I'll add that shortly. Then there's another problem with n==2, which needs a bit more debugging. Good. So now you have debugged the new try.c

Re: div_qr_1 interface

2013-10-22 Thread Torbjorn Granlund
I added data for the new code at http://gmplib.org/devel/asm.html. There is a line for div_qr_1u_pi1 as well, since that will also be needed. It might actually be more common that the divisor is not normalised. I should try to wrap up div_qr_1n_pi2 and div_qr_1u_pi2 as well, and then add