On 04-Jul-12 18:58, Johannes Pfau wrote:
Code: https://github.com/D-Programming-Language/phobos/pull/646 Docs: http://dl.dropbox.com/u/24218791/d/phobos/std_hash_hash.html http://dl.dropbox.com/u/24218791/d/phobos/std_hash_crc.html http://dl.dropbox.com/u/24218791/d/phobos/std_hash_md.html http://dl.dropbox.com/u/24218791/d/phobos/std_hash_sha.htmlI just had another look at my initial std.hash design, and I realized that the API could be simplified a little: There's a reset function that's implemented in every hash. For sha1, md5, crc32 it only forwards to the start function though. So I'm not sure how useful this function is or if it should be dropped. Advantages of keeping it: * 'reset' better documents what's done than 'start' if the hash has already processed data * Are there hashes which can implement a reset function in a faster way than calling start again? Cons: * Adds an additional function which probably isn't necessary The start function is probably not needed as well. Tango doesn't have a start function or something similar, but it could use constructors for this (I only looked at docs, not code).
The only thing I can think of that would require start function is using unconventional initial vectors.
We can't use constructors, so a start function would be necessary for advanced initialization. But do we actually need that advanced initialization? SHA1, MD5 and CRC32 just do a "this = typeof(this).init" so a start function isn't necessary here. Advantages of keeping it: * Are there hash algorithms which need some sort of complex initialization which can't be done with .init / default values? * If we drop both start and reset the only way to reset the internal state is calling finish. This might be a little less efficient than a start/reset method. Advantages of dropping it: * Using hashes is easier, no need to call 'start' before hashing data I think someone more familiar with hash functions than me needs to answer the "do we need start/reset functions" questions. API question: CRC32 sums are usually presented as a uint, not a ubyte[4]. To fit the rest of the API ubyte[4] is used. Now there's a small annoying detail: The CRC32 should be printed in LSB-first order.
You probably meant MSB first.
When printing an uint like this, that works well:
writefln("%#x", 4157704578); //0xf7d18982
but this doesn't:
toHexString(*cast(ubyte[4]*)&4157704578); //8289D1F7
There is no problem it's just order of printing that at fault. So I suggest to *stop* doing a bswap.
It's just that printing something as an array of ubytes does it from least significant byte to most significant. You could try to add MSB/LSB first options to toHexString.
I can't change toHexString as it's used for all hashes and it's correct for SHA1, MD5, ... So I currently use bswap in the CRC32 finish() implementation to fix this issue.
no-no-no see the above ;)
Now the question is should I provide an additional finishUint function which avoids the bswap? Implementation issue: The current implementation of SHA1 and MD5 uses memcpy which doesn't work in CTFE IIRC and which also prevents the code from being pure. I could replace those memcpy calls with array copying but I'm not sure if memcpy was used for performance, so I'd like to keep it as long as we have no performance tests.
Replace memcpy with and array ops: ptr1[x..y] = ptr2[x2..y2];note that it's better to have them be pointers as it avoid bounds check & D runtime magic.
If need be I can provide benchmarks but I'm certain from the days of optimizing std.regex that it's faster or on par with memcpy.
-- Dmitry Olshansky
