Hi Magnus,
hi Rob,

a while ago I made the same observation you did. On a 166 MHz m68k-nommu the 
RSA exchange took almost forever. After some profiling I found that the comba 
multiply routine in libtommath was eating most of the time; gcc seems to 
produce quite inefficient code there. Libtommath also resizes its large 
integers while calculating, which means extra work for user-space memory 
management. I therefore converted dropbear to use libtomsfastmath, which works 
on fixed-size integers instead. That helped a lot, at the expense of a larger 
memory footprint. After porting some parts to assembler (which libtomsfastmath 
has special hooks for) I cut the time down to 10 seconds, which is IMHO much 
better. 

The version I did was more a proof of concept and is neither polished nor 
properly packaged, but it will compile; maybe you could have a look at it. 
(http://peter.turczak.de/dropbear-tfm.tgz)

Rob is right in a way, but OpenSSL uses assembler all along. Furthermore, a 
missing L1 cache will contribute to slowing the key exchange to a crawl. 

Best regards,

Peter

On Mar 15, 2011, at 10:25 PM, Rob Landley wrote:

> On 03/15/2011 08:02 AM, Magnus Nilsson wrote:
>> Sorry, I was unclear - it's only 100% busy during those 45s.
>> 
>> This is what it looks like if I first start the load monitor (-r outputs
>> 1 sample/second), then start to log in from a remote ssh client:
>> # cpu -r
>> CPU:  busy 0%  (system=0% user=0% nice=0% idle=100%)
>> CPU:  busy 24%  (system=4% user=19% nice=0% idle=75%)
>> CPU:  busy 100%  (system=1% user=98% nice=0% idle=0%)
>> CPU:  busy 100%  (system=0% user=100% nice=0% idle=0%)
>> <39 repeats of the above busy 100%>
>> CPU:  busy 100%  (system=2% user=97% nice=0% idle=0%)
>> CPU:  busy 100%  (system=8% user=91% nice=0% idle=0%)
>> CPU:  busy 100%  (system=22% user=77% nice=0% idle=0%)
>> CPU:  busy 100%  (system=0% user=100% nice=0% idle=0%)
>> CPU:  busy 100%  (system=0% user=100% nice=0% idle=0%)
>> CPU:  busy 67%  (system=8% user=58% nice=0% idle=32%)
>> CPU:  busy 0%  (system=0% user=0% nice=0% idle=100%)
>> 
>> Thanks for the tip on the prebuilt busybox, Rob, but wouldn't I need it
>> in flat format? I don't think arm-elf-elf2flt can do that without reloc
>> info, can it? And from the above I don't think it would add much info.
>> 
>> My question is:
>> Is 45s reasonable on a 192MHz cpu,
> 
> No.  I had a 200mhz celeron that did 3.2 ssh logins per second ten years
> ago.  (I did a VPN built on top of ssh, dynamic port forwarding, and
> netcat, and had to benchmark it.)  Going from i686 to arm could cost you
> some performance (ever since the pentium it's had multiple execution
> cores, speculative execution, instruction reordering and such), but
> there's no _way_ that's more than an order of magnitude in performance.
> I could see 4 seconds, but 45 seconds is pathological.  Something
> is wrong.
> 
> My next step would be "stick printfs in the source code and see where
> the big delay is".
> 
> Hmmm...  Do they still _make_ CPUs with no L1 cache?
> 
> Rob
