Hi again,

On Mon, Apr 10, 2017 at 8:36 AM, Mateusz Viste <mate...@nospam.viste.fr> wrote:
> On Mon, 10 Apr 2017 00:56:17 -0500, Rugxulo wrote:
>>
>> It would be interesting to see some benchmark numbers for that (for
>> various specific tools, 8086, 386, etc).
>
> Just for the fun of it, I did some quick measures on my 386SX PC,
> computing various checksums of a 2 MiB file. Results below.

Very interesting ....

> CRC32 (by Colin Plumb)  : 26.7s  (22%)
> MD5 (by Colin Plumb)    : 52.9s  (11%)
> SHA1 (by Colin Plumb)   : 85.7s   (7%)

Blair's (16-bit, FD) MD5SUM can do all of those hashes as well. Not
sure if it'd be faster, though.

> BSUM is the fastest, which is no surprise since the algorithm is
> extremely simple (4 CPU instructions). The CRC32 computation by Joe
> Forster is surprisingly fast as well. It's 30% slower than bsum and the
> binary is 4x times larger (and I suppose the memory usage is also much
> higher) but that's still quite impressive for a 32-bit checksum.

"30% slower" is machine specific, and I'm quite sure it can be
improved. Although his tool does seem to use a fairly big (64512 byte)
buffer.

***
If extremely bored, check out these "modern" (CRC32C, aka Castagnoli)
implementations, which I don't grok:

http://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software
http://www.drdobbs.com/parallel/fast-parallelized-crc-computation-using/229401411
***

He also combines the (unused) decimal output routine with the (used)
hex output routine, so it always goes through the slow DIV, which you
don't need at all for converting to hex. Granted, he only calls that
routine once, at the end. The result would be much worse if it were
called more often (e.g. hundreds of times). I've made the same mistake
in the past, too.
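
To illustrate the "no DIV needed" point, here is a minimal C sketch of
hex output using only shifts and masks. It is not his code; the function
name u32_to_hex and the lookup-table approach are my own assumptions.

```c
#include <stdint.h>

/* Hypothetical helper: write the 8 hex digits of `value` plus a
   terminating NUL into `out` (at least 9 bytes). No division is used. */
void u32_to_hex(uint32_t value, char out[9])
{
    static const char digits[] = "0123456789ABCDEF";
    int i;

    for (i = 0; i < 8; i++) {
        /* Take the top nibble first so digits come out in reading order. */
        out[i] = digits[(value >> (28 - 4 * i)) & 0x0F];
    }
    out[8] = '\0';
}
```

Each digit costs one shift, one mask and one table lookup, so the cost
stays small even if the routine were called thousands of times.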

"4x times larger" is only in raw bytes. On disk a file always occupies
whole clusters (as you well know), so even a 256 byte .COM still uses
at least one cluster (e.g. 512 bytes on a 1.44 MB floppy). So 1024 isn't
really much worse than 512.   ;-)    Believe me, shrinking size is
fairly easy, but you pay for it in accidental errors, readability,
and speed.
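
The cluster arithmetic can be sketched in C like this. The function name
disk_usage is made up for illustration, and it only rounds a file size
up to whole clusters (it ignores directory entries and FAT overhead).

```c
/* Hypothetical helper: bytes actually allocated for a file of `size`
   bytes when the filesystem allocates in units of `cluster` bytes. */
unsigned long disk_usage(unsigned long size, unsigned long cluster)
{
    unsigned long clusters = (size + cluster - 1) / cluster;  /* round up */

    return clusters * cluster;
}
```

With 512-byte clusters, disk_usage(256, 512) gives 512 and
disk_usage(1024, 512) gives 1024. With a larger cluster size, say 2048
bytes (an assumed hard disk value), both files would occupy the same
single cluster.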

>> Splurge on the memory, give it 32 kb or so. It'll "probably" be faster
>> with a bigger buffer.
>
> At the cost of reducing the number of platforms it would be able to run on.
> Currently bsum uses an 8K memory buffer to optimize disk reads. Using a
> buffer of 64KB increases the overall speed by 10%. Not that much, for a
> 700% increase of memory usage.

Don't you have an 8086 machine? How much RAM does it have? I had
thought most had at least 64 kb of RAM, but I guess that's not
accounting for the DOS + shell overhead. Honestly, I wrote several
simple hexdump variants in recent months, and the biggest slowdown was
my small buffer (only 16 bytes in the .ASM version). The C version is
larger but always well-buffered, so it's the fastest. I even got a 2x
speedup (and a noticeable size decrease) by avoiding printf entirely
and using my own outhex routine.
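
For what it's worth, the buffer size question is easy to experiment with.
Below is a rough C sketch that reads a file in chunks of a caller-chosen
size. The name sum_file is hypothetical, and the plain byte sum is only a
stand-in checksum, NOT the BSUM algorithm.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in checksum: plain sum of all bytes (not BSUM).
   Reads `fp` in chunks of `bufsize` bytes.
   Returns 0 on success, -1 on failure. */
int sum_file(FILE *fp, size_t bufsize, unsigned long *result)
{
    unsigned char *buf;
    unsigned long sum = 0;
    size_t n, i;

    if (bufsize == 0)
        return -1;
    buf = malloc(bufsize);
    if (buf == NULL)
        return -1;

    /* One fread per chunk: a bigger buffer means fewer read calls. */
    while ((n = fread(buf, 1, bufsize, fp)) > 0) {
        for (i = 0; i < n; i++)
            sum += buf[i];
    }
    free(buf);

    if (ferror(fp))
        return -1;
    *result = sum;
    return 0;
}
```

Note that C stdio keeps its own buffer underneath fread, so the effect of
bufsize here is probably smaller than in an .ASM program that calls DOS
directly for every read.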

Okay, so let me break down your source and give some (trivial)
comments here. I assume that's okay with you!  ;-)

Irrelevant aesthetics:   lines too long (shouldn't be more than 80
columns), not enough indentation (instructions vs. labels), redundant
"jz short ..." (the "short" form of a conditional jump is the only one
available under "cpu 8086").

"section .data align=1" is probably what you intended here. (No need
to comment it out entirely. I think default is align=4 or some such,
that's probably what you didn't like.)

"buff resb 8192" and "mov cx, 8192" should be moved to EQU for clarity
(and, even better, as "1024 * 8" constant expression).

The program's output does not end in a CR+LF pair, so the last line is
incomplete. Not a huge deal but still (sometimes) noticeable.

"int 21h // xchg ax, bp // int 21h" is repeated several times. If you
really want to save space, put "msgquit:" before the first one and
"jmp short msgquit" for the others (since this is quitting the program
anyways).

BTW, most asm devs actively avoid "loop" in favor of "dec // jnz". Not
sure if this would really be worth it, even for your 8086.

"shl bx, cl" (where CL=4) is also shunned, AFAIK, on 8086 machines, in
favor of the speedier "shl bx,1" repeated four times. But if it's only
done extremely rarely then it won't add up to much difference. You'd
only notice if it were done thousands of times.

Converting hex nibble to ASCII shouldn't need a jump at all. On the
8086 all jumps are very slow. Best to avoid them entirely if possible.
Here you can easily use the old "cmp al, 0Ah // sbb al, 69h // das"
trick instead. But since you're only printing hex one time (instead of
thousands), you probably don't care.
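
In case the trick looks like magic, here is a small C sketch that models
what those three instructions do to AL and the flags. The function name
nibble_to_ascii is hypothetical, and the carry (CF) and auxiliary carry
(AF) flags are emulated by hand, so treat it as an illustration only.

```c
/* Emulate "cmp al, 0Ah / sbb al, 69h / das" for one nibble (0..15).
   Returns the ASCII hex digit. CF and AF are modelled by hand. */
unsigned char nibble_to_ascii(unsigned char nibble)
{
    unsigned al = nibble & 0x0Fu;
    unsigned cf, af, old_al, old_cf;

    /* cmp al, 0Ah: CF (borrow) is set when AL < 10 */
    cf = (al < 0x0Au);

    /* sbb al, 69h: AL = AL - 69h - CF */
    af = ((al & 0x0Fu) < (0x09u + cf));   /* borrow out of the low nibble */
    old_cf = cf;
    cf = (al < 0x69u + old_cf);           /* borrow out of the whole byte */
    al = (al - 0x69u - old_cf) & 0xFFu;

    /* das: decimal adjust after subtraction */
    old_al = al;
    if ((al & 0x0Fu) > 9u || af)
        al = (al - 0x06u) & 0xFFu;
    if (old_al > 0x99u || cf)
        al = (al - 0x60u) & 0xFFu;

    return (unsigned char)al;
}
```

For 0..9 the SBB leaves 96h..9Fh, and DAS subtracts 66h to give 30h..39h
('0'..'9'). For 10..15 the SBB leaves A1h..A6h, and DAS subtracts only
60h to give 41h..46h ('A'..'F'). No jump anywhere.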

Okay, just wanted to add my $0.02 in case it was (accidentally) helpful.   :-)
