Re: ARM public key benchmark

2013-04-03 Thread Niels Möller
ni...@lysator.liu.se (Niels Möller) writes:

> So it should be doable with the addmul_1 loop and two additional, non-recurrency, not instructions per limb, and then maybe some extra logic for the return value. One could aim for 4.25 c/l, I guess.

The below seems to give correct results. But

Re: ARM public key benchmark

2013-04-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes:

> 1. I guess one can expect submul_1 to always be a bit slower than addmul_1, since submul_1 needs additional arithmetic besides the umaal?

One could perhaps do some negations on the fly, a - b C = - ((-a) + b C), maybe that

Re: ARM public key benchmark

2013-04-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: For large operands, it's strictly between add_n and addmul_1, which I guess is as expected. For small sizes, I had a look at the loop setup for add_n, which checks bit 0 and 1 of n separately. If that's faster, maybe one could borrow that logic.

Re: ARM public key benchmark

2013-04-03 Thread Torbjorn Granlund
ni...@lysator.liu.se (Niels Möller) writes: ni...@lysator.liu.se (Niels Möller) writes: So it should be doable with the addmul_1 loop and two additional, non-recurrency, not instructions per limb, and then maybe some extra logic for the return value. One could aim for 4.25 c/l, I

Re: ARM public key benchmark

2013-04-03 Thread Niels Möller
Torbjorn Granlund t...@gmplib.org writes:

> Have you considered complementing C instead?

Not until now. Actually looks nice:

  A - b C = A + b (~C) + b - b B^n

So this saves one not instruction, and we have to add and subtract the scalar b from incoming and outgoing carry.

Regards,
/Niels
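A small sketch of the complement identity above, for checking the arithmetic only. It is not the ARM assembly under discussion. It assumes 32-bit limbs stored least significant first, and the names `addmul_1` and `submul_1_via_complement` are hypothetical Python helpers, not GMP's `mpn` functions. Since ~C = B^n - 1 - C, feeding b in as the initial carry and subtracting the outgoing carry from b recovers A - b C.

```python
LIMB_BITS = 32            # assumed limb size, for illustration only
B = 1 << LIMB_BITS        # limb base
MASK = B - 1


def addmul_1(rp, up, v, carry=0):
    """Return (low n limbs of rp + up*v + carry, carry-out limb)."""
    out = []
    for r, u in zip(rp, up):
        t = r + u * v + carry      # at most B*B - 1, so it fits in two limbs
        out.append(t & MASK)
        carry = t >> LIMB_BITS
    return out, carry


def submul_1_via_complement(rp, up, v):
    """Return (low n limbs of rp - up*v, borrow limb) using only addmul_1.

    Uses A - b*C = A + b*(~C) + b - b*B^n: complement each limb of C,
    pass b as the incoming carry, and subtract the outgoing carry from b.
    """
    comp = [u ^ MASK for u in up]              # limb-wise ~C
    out, carry = addmul_1(rp, comp, v, carry=v)
    return out, v - carry                      # borrow is in 0..b


def to_int(limbs):
    """Interpret a little-endian limb list as a non-negative integer."""
    return sum(limb << (LIMB_BITS * i) for i, limb in enumerate(limbs))
```

With `out, borrow = submul_1_via_complement(rp, up, v)`, the relation `to_int(out) - borrow * B**len(rp) == to_int(rp) - v * to_int(up)` should hold for any equal-length inputs with limbs and `v` below `B`.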

New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David,

First mul_1, renamed again, now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, then there are 2 simple things to try, indicated in a comment.

Attachments: sparct34-mul_1-3c.asm, sparct34-mul_1-6c.asm

Re: New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes: First mul_1, renamed again, now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, then there are 2 simple things to try, indicated in a comment. Looks exciting, I'll play around with this

Re: New T3/T4 code batch

2013-04-03 Thread David Miller
From: Torbjorn Granlund t...@gmplib.org Date: Thu, 04 Apr 2013 02:40:58 +0200 David Miller da...@davemloft.net writes: First mul_1, renamed again, now encoding the load scheduling. Only the 6c variant is new. Please time it. If it doesn't run at 3 c/l, then there are 2 simple

Re: New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes:

> Please don't do this, you checked in code that doesn't even compile again.

Easy to fix. Please pull again.

> I was just starting to work on getting the information for you so this is very disappointing. :-/

Well, bugs happen.

-- 
Torbjörn

Re: New T3/T4 code batch

2013-04-03 Thread Torbjorn Granlund
David Miller da...@davemloft.net writes:

> I can tell by looking at the commit that it's still broken, can you please stop jumping the gun and simply be patient enough for me to test things out?

Since I am wrapping up, I wanted to push things and clean out unfinished things. Why is is