[ https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14002713#comment-14002713 ]
Philip (flip) Kromer commented on DATAFU-47: -------------------------------------------- Supplying the seed value is kinda annoying, because Java. And any time I see the word "cryptographic" anywhere near a thing I want an implementation that is as ungenerous as possible. I believe the following is unambiguous and ungenerous and will work the same anywhere: * You must supply a string with exactly the right number of hex digits: length 8 for the 32-bit murmur3 seeds, length 32 for the 128-bit sip24 seed. That should prevent a value that is out of range of the unsigned type. * for sip24, take the first 16 characters to be k0 and the other 16 to be k1. * use new BigInteger(seed_str, 16) to parse as hexadecimal. This will give the correct unsigned value, and raise a NumberFormatException if the string has any weirdness. * use [longValue()|http://docs.oracle.com/javase/6/docs/api/java/math/BigInteger.html#longValue()] or intValue() to get the right return value. The [docs say|http://docs.oracle.com/javase/specs/jls/se7/html/jls-5.html#jls-5.1.3] "Note that this conversion can lose information about the overall magnitude of the BigInteger value as well as return a result with the opposite sign", which I believe means that we can trust that on any machine the BigInteger 0xffffffff will become the int -1 which will become the bits for 0xffffffff for the purpose of the algorithm. ----------- h4. Why this is hard sip24 asks for a 128-bit seed supplied as two longs; murmur3 for a 32-bit seed supplied as an int. Both of them are happy to accept signed values and just pay attention to the bits. So we "just" have to smuggle either two longs or one int from Pig to the constructor, (which AFAIK Pig only allows string constructor args) but the following constraints exist: * We want to supply the seed as an unsigned hexadecimal string of exact length, because that's how somebody testing for correctness thinks, and because it's the best way to convince yourself endian-ness is right, and because that's the most ungenerous way to send in the seed. * Java's Long.parseLong() and Integer.parseInteger() _will not parse hexadecimal strings with values greater than 0x7ffff... -- only signed values may be supplied. * The behavior should be identical on any platform (This is the hard part) * That value should be interpreted identically to that of a C or any other implementation. (This is the other hard part.) * Under no circumstances should the parsing ** be clever ** require me or a future reader to worry about Endian-ness. ** have lots of exceptions to handle ** offer flexibility * It's best if the seed arg is treated similarly for each algorithm. ** I don't want to have a one-constructor-arg version and a two-constructor-arg version for reasons of readability and future-proofing h4. Alternatives: * as above, but declare as _illegal_, for both algorithms, seed values greater than maxint (i.e. strings that are not exactly 8 hexadecimal characters with a first character in 0..7). This still gives you 2 billion seeds to choose from; anyone peeved by that can necromance this discussion and conduct a proper review. * as above, but declare as _undefined behavior_ seed values greater than maxint * Instead of the above, make the user jump through hoops for seed values of greater than 0x7ff.... ** murmur3 accept any string that Integer.parseString() will handle. For unsigned values greater than 0x7ff... you have to back into it. ** sip24 accepts a colon-separated string (matching '^[^:]+:[^:]+$') where each half is any thing Long.parseString() accepts. For me this is right at the line of clever and just over the line of generous. > UDF for Murmur3 (and other) Hash functions > ------------------------------------------ > > Key: DATAFU-47 > URL: https://issues.apache.org/jira/browse/DATAFU-47 > Project: DataFu > Issue Type: Improvement > Reporter: Philip (flip) Kromer > Labels: Guava, Hash, UDF > Attachments: > 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch, > 0001-UDF-for-Murmur3-and-other-Hash-functions.patch > > > Datafu should offer the murmur3 hash. > The attached patch uses Guava to add murmur3 (a fast hash with good > statistical properties), SipHash-2-4 (a fast cryptographically secure hash), > crc32, adler32, md5 and sha. > From the javadoc: > * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a > [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. > Murmur3 is fast, with has exceptionally good statistical properties; it's a > good choice if all you need is good mixing of the inputs. It is _not_ > cryptographically secure; that is, given an output value from murmur3, there > are efficient algorithms to find an input yielding the same output value. > Supply the seed as a string that > [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)] > can handle. > * 'sip24', [optional seed]: Returns a [64-bit > SipHash-2-4|https://131002.net/siphash/]. SipHash is competitive in > performance with Murmur3, and is simpler and faster than the cryptographic > algorithms below. When used with a seed, it can be considered > cryptographically secure: given the output from a sip24 instance but not the > seed used, we cannot efficiently craft a message yielding the same output > from that instance. > * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to > Java's Adler32 Checksum > * 'crc32': Returns a CRC-32 checksum (32 hash bits) by delegating to Java's > CRC32 Checksum. > * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 > MessageDigest. > * 'sha1': Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 > MessageDigest. > * 'sha256': Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 > MessageDigest. > * 'sha512': Returns a SHA-512 hash (160 hash bits) using Java's SHA-512 > MessageDigest. > * 'good-(integer number of bits)': Returns a general-purpose, > non-cryptographic-strength, streaming hash function that produces hash codes > of length at least minimumBits. Users without specific compatibility > requirements and who do not persist the hash codes are encouraged to choose > this hash function. (Cryptographers, like dieticians and fashionistas, > occasionally realize that We've Been Doing it Wrong This Whole Time. Using > 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To > (Wear|Eat|Hash With) this Fall.) Values for this hash will change from run to > run. > Examples: > {code} > define DefaultH datafu.pig.hash.Hasher(); > define GoodH datafu.pig.hash.Hasher('good-32'); > define BetterH datafu.pig.hash.Hasher('good-127'); > define MurmurH32 datafu.pig.hash.Hasher('murmur3-32'); > define MurmurH32A datafu.pig.hash.Hasher('murmur3-32', '0x0'); > define MurmurH32B datafu.pig.hash.Hasher('murmur3-32', '0x56789abc'); > define MurmurH128 datafu.pig.hash.Hasher('murmur3-128'); > define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0'); > define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678'); > define MD5H datafu.pig.hash.Hasher('md5'); > define SHA1H datafu.pig.hash.Hasher('sha1'); > define SHA256H datafu.pig.hash.Hasher('sha256'); > define SHA512H datafu.pig.hash.Hasher('sha512'); > > data_in = LOAD 'input' as (val:chararray); > > data_out = FOREACH data_in GENERATE > DefaultH(val), GoodH(val), BetterH(val), > MurmurH32(val), MurmurH32A(val), MurmurH32B(val), > MurmurH128(val), MurmurH128A(val), MurmurH128B(val), > SHA1H(val), SHA256H(val), SHA512H(val), > MD5H(val) > ; > STORE data_out INTO 'output'; > {code} > In practice: > {code} > -- Consistent shuffle of large dataset with only one full-table reduce > step. > -- Every pig run with the same seed will generate sorted output in the same > order > define MurmurH32 datafu.pig.hash.Hasher('murmur3-32'); > -- Force each file to go in whole to a single mapper (or in the LOAD use > -tagSplit, to be added in future Pig version) > SET mapred.max.split.size 1099511627776; > -- -tagPath option labels each file > data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS > (path:chararray, val:chararray); > data_numbered = RANK data_in; > data_ided = FOREACH numbered GENERATE > MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS > shuffle_key, > val AS val; > data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val; > STORE data_shuffled INTO 'data_shuffled'; > {code} > Important notes about this patch: > * It should be applied _after_ the patch for DATAFU-46 and DATAFU-48. > * -(It expands the dependence on Guava. Does [pull req > 75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to > de-Guava datafu?)- > * -(The patch has (commented out) code that shows what life would be like if > the sip24, crc32 and adler32 hashes were available. On your advice, I will > either (a) put in a patch removing the spurious comments or (b) file a > separate bug to update guava, push in a patch for that, and put in a patch > restoring to glory the extra hashes.)- -- This message was sent by Atlassian JIRA (v6.2#6252)