[ 
https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip (flip) Kromer updated DATAFU-47:
---------------------------------------

    Attachment: 0001-UDF-for-Murmur3-and-other-Hash-functions.patch

Sorry, had to prep the patch.

You can also supply the string 'rand' as the seed to the seedable hashes. This
gives you a different random sequence on every run without depending heavily
on the random number generator.
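
As a rough illustration of the seed handling (this is not the code in the
patch), here is a minimal Java sketch of how a seed string could be resolved.
The class name SeedParserSketch and the method parseSeed are hypothetical, and
it assumes 'rand' means one draw from java.util.Random, with every other
string going through Integer.decode:

```java
import java.util.Random;

class SeedParserSketch {
    // Hypothetical helper, not taken from the patch: resolves a seed string to an int.
    static int parseSeed(String seed) {
        if ("rand".equals(seed)) {
            // Assumption: a single draw from the RNG; the hash does the rest of the mixing.
            return new Random().nextInt();
        }
        // Accepts decimal, hex (0x...) and octal (leading 0) literals, with an optional sign.
        return Integer.decode(seed);
    }

    public static void main(String[] args) {
        System.out.println(parseSeed("0x0"));
        System.out.println(parseSeed("0x56789abc"));
        System.out.println(parseSeed("-12345678"));
        System.out.println(parseSeed("rand"));
    }
}
```

One thing to keep in mind with Integer.decode: it parses a signed magnitude,
so a hex seed above 0x7fffffff (for example '0xffffffff') throws
NumberFormatException and has to be written as a negative number instead.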

> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>
>                 Key: DATAFU-47
>                 URL: https://issues.apache.org/jira/browse/DATAFU-47
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>              Labels: Guava, Hash, UDF
>         Attachments: 0001-UDF-for-Murmur3-and-other-Hash-functions.patch
>
>
> Datafu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (and also handles md5 and sha*). 
> If you're willing to upgrade Guava, we'd get sip24 (a fast cryptographically 
> secure hash), crc32 (in case you want that) and adler32 (another 
> checksum-style hash). 
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
> [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. 
> Murmur3 is fast and has exceptionally good statistical properties; it's a 
> good choice if all you need is good mixing of the inputs. It is _not_ 
> cryptographically secure; that is, given an output value from murmur3, there 
> are efficient algorithms to find an input yielding the same output value. 
> Supply the seed as a string that 
> [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
>  can handle.
> * 'good-{integer number of bits}': Returns a general-purpose, 
> non-cryptographic-strength, streaming hash function that produces hash codes 
> at least that many bits long. Users without specific compatibility 
> requirements and who do not persist the hash codes are encouraged to choose 
> this hash function. (Cryptographers, like dieticians and fashionistas, 
> occasionally realize that We've Been Doing it Wrong This Whole Time. Using 
> 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To 
> (Wear|Eat|Hash With) this Fall. Therefore, expect that the values returned 
> with this choice may change from version to version.)
> * 'md5':     Returns an MD5 hash (128 hash bits) using Java's MD5 
> MessageDigest.
> * 'sha1':    Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
> MessageDigest.
> * 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
> MessageDigest.
> * 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
> MessageDigest.
> Examples: 
> {code}
>   define DefaultH    datafu.pig.hash.Hasher();
>   define GoodH       datafu.pig.hash.Hasher('good-32');
>   define BetterH     datafu.pig.hash.Hasher('good-127');
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
>   define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
>   define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
>   define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
>   define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
>   define MD5H        datafu.pig.hash.Hasher('md5');
>   define SHA1H       datafu.pig.hash.Hasher('sha1');
>   define SHA256H     datafu.pig.hash.Hasher('sha256');
>   define SHA512H     datafu.pig.hash.Hasher('sha512');
>   
>   data_in = LOAD 'input' as (val:chararray);
>   
>   data_out = FOREACH data_in GENERATE
>     DefaultH(val),   GoodH(val),       BetterH(val),
>     MurmurH32(val),  MurmurH32A(val),  MurmurH32B(val),
>     MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
>     SHA1H(val),       SHA256H(val),    SHA512H(val),
>     MD5H(val)
>     ;
>   STORE data_out INTO 'output';
> {code}
> In practice: 
> {code}
>   -- Consistent shuffle of a large dataset with only one full-table reduce
>   -- step. Every Pig run with the same seed will generate sorted output in
>   -- the same order.
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   -- Force each file to go in whole to a single mapper (or in the LOAD use
>   -- -tagSplit, to be added in a future Pig version)
>   SET mapred.max.split.size 1099511627776;
>   -- -tagPath option labels each file
>   data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS
>     (path:chararray, val:chararray);
>   data_numbered = RANK data_in;
>   data_ided = FOREACH data_numbered GENERATE
>     MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in))
>       AS shuffle_key,
>     val AS val;
>   data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
>   STORE data_shuffled INTO 'data_shuffled';
> {code}
> Important notes about this patch:
> * It should be applied _after_ the patch for DATAFU-46
> * It expands the dependence on Guava. Does [pull req 
> 75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to 
> de-Guava datafu?
> * The patch has (commented out) code that shows what life would be like if 
> the sip24, crc32 and adler32 hashes were available. On your advice, I will 
> either (a) put in a patch removing the spurious comments or (b) file a 
> separate bug to update guava, push in a patch for that, and put in a patch 
> restoring to glory the extra hashes.
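
For anyone who wants to see the idea behind the "In practice" shuffle outside
of Pig, here is a small Java sketch under stated assumptions. It keys each
record by a hash of "path#rank" and sorts on that key, as the Pig script does.
Since murmur3 is not in the JDK, it swaps in java.util.zip.CRC32 as a stand-in
hash; the class and method names (HashShuffleSketch, shuffleKey, shuffle) and
the file path are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;

class HashShuffleSketch {
    // Stand-in hash: the JDK's CRC32, NOT murmur3 (which would need Guava).
    static long shuffleKey(String path, long rank) {
        CRC32 crc = new CRC32();
        crc.update((path + "#" + rank).getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Orders the values of one file by the hash of "path#rank"; rank is 1-based.
    static List<String> shuffle(String path, List<String> vals) {
        List<Integer> ranks = new ArrayList<>();
        for (int i = 1; i <= vals.size(); i++) {
            ranks.add(i);
        }
        // Sort by hash; break any ties by rank so the order is total.
        ranks.sort(Comparator
            .comparingLong((Integer r) -> shuffleKey(path, r))
            .thenComparingInt(r -> r));
        List<String> out = new ArrayList<>();
        for (int r : ranks) {
            out.add(vals.get(r - 1));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> vals = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(shuffle("input/part-00000", vals));
    }
}
```

Because the key depends only on the path and the rank, rerunning it over the
same input gives the same order every time, which is the property the Pig
version relies on.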



--
This message was sent by Atlassian JIRA
(v6.2#6252)
