[ 
https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-47:
-------------------------------
    Attachment: DATAFU-47-new.patch

I looked at the review board for this issue, and fixed the merge conflicts in 
HashTests and addressed the comments that were left. It depends on DATAFU-50, 
which was reopened, but I put a new patch there so that we can proceed with 
both.

Since I didn't create the review, I can't upload a new diff there, but I've 
attached it to the Jira issue, and commented in the review board where 
appropriate.

Tests pass, and I've run the content of "hasherTest" on a cluster using the 
assembled DataFu jar to make sure that the autojarring of the new Guava version 
works properly.

I'll respond to the review board comments later.

> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>
>                 Key: DATAFU-47
>                 URL: https://issues.apache.org/jira/browse/DATAFU-47
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>              Labels: Guava, Hash, UDF
>         Attachments: 
> 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch, 
> 0001-UDF-for-Murmur3-and-other-Hash-functions.patch, DATAFU-47-new.patch
>
>
> DataFu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (a fast hash with good 
> statistical properties), SipHash-2-4 (a fast cryptographically secure hash), 
> crc32, adler32, md5 and sha.
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
> [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. 
> Murmur3 is fast and has exceptionally good statistical properties; it's a 
> good choice if all you need is good mixing of the inputs. It is _not_ 
> cryptographically secure; that is, given an output value from murmur3, there 
> are efficient algorithms to find an input yielding the same output value. 
> Supply the seed as a string that 
> [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
>  can handle.
> * 'sip24', [optional seed]: Returns a [64-bit 
> SipHash-2-4|https://131002.net/siphash/]. SipHash is competitive in 
> performance with Murmur3, and is simpler and faster than the cryptographic 
> algorithms below. When used with a seed, it can be considered 
> cryptographically secure: given the output from a sip24 instance but not the 
> seed used, we cannot efficiently craft a message yielding the same output 
> from that instance.
> * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to 
> Java's Adler32 Checksum.
> * 'crc32':   Returns a CRC-32 checksum (32 hash bits) by delegating to Java's 
> CRC32 Checksum.
> * 'md5':     Returns an MD5 hash (128 hash bits) using Java's MD5 
> MessageDigest.
> * 'sha1':    Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
> MessageDigest.
> * 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
> MessageDigest.
> * 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
> MessageDigest.
> * 'good-(integer number of bits)': Returns a general-purpose, 
> non-cryptographic-strength, streaming hash function that produces hash codes 
> of length at least minimumBits. Users without specific compatibility 
> requirements and who do not persist the hash codes are encouraged to choose 
> this hash function. (Cryptographers, like dieticians and fashionistas, 
> occasionally realize that We've Been Doing it Wrong This Whole Time. Using 
> 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To 
> (Wear|Eat|Hash With) this Fall.) Values for this hash will change from run to 
> run.
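
The bit widths quoted in the list above can be checked against the JDK classes the description says the UDF delegates to. This is a minimal sketch using only java.security.MessageDigest and java.util.zip; the class and method names (HashWidths, digestBits, crc32, adler32) are made up for illustration and are not part of the patch.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.zip.Adler32;
import java.util.zip.CRC32;

// Hypothetical helper class, not taken from the DATAFU-47 patch.
class HashWidths {
    /** Returns the width in bits of the digest a JDK MessageDigest produces. */
    static int digestBits(String algorithm, byte[] input) {
        try {
            return MessageDigest.getInstance(algorithm).digest(input).length * 8;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalArgumentException("Unknown digest: " + algorithm, e);
        }
    }

    /** CRC-32 checksum; the result occupies the low 32 bits of the long. */
    static long crc32(byte[] input) {
        CRC32 crc = new CRC32();
        crc.update(input, 0, input.length);
        return crc.getValue();
    }

    /** Adler-32 checksum; the result occupies the low 32 bits of the long. */
    static long adler32(byte[] input) {
        Adler32 adler = new Adler32();
        adler.update(input, 0, input.length);
        return adler.getValue();
    }

    public static void main(String[] args) {
        byte[] input = "hello".getBytes(StandardCharsets.UTF_8);
        for (String alg : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(alg + ": " + digestBits(alg, input) + " bits");
        }
        System.out.printf("crc32:   %08x%n", crc32(input));
        System.out.printf("adler32: %08x%n", adler32(input));
    }
}
```

Whether the Hasher UDF returns these values as hex strings or in some other encoding is not stated in the description, so the sketch only reports widths and raw checksum values.
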
> Examples: 
> {code}
>   define DefaultH    datafu.pig.hash.Hasher();
>   define GoodH       datafu.pig.hash.Hasher('good-32');
>   define BetterH     datafu.pig.hash.Hasher('good-127');
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
>   define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
>   define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
>   define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
>   define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
>   define MD5H        datafu.pig.hash.Hasher('md5');
>   define SHA1H       datafu.pig.hash.Hasher('sha1');
>   define SHA256H     datafu.pig.hash.Hasher('sha256');
>   define SHA512H     datafu.pig.hash.Hasher('sha512');
>   
>   data_in = LOAD 'input' as (val:chararray);
>   
>   data_out = FOREACH data_in GENERATE
>     DefaultH(val),   GoodH(val),       BetterH(val),
>     MurmurH32(val),  MurmurH32A(val),  MurmurH32B(val),
>     MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
>     SHA1H(val),       SHA256H(val),    SHA512H(val),
>     MD5H(val)
>     ;
>   STORE data_out INTO 'output';
> {code}
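
The seed strings in the examples above ('0x0', '0x56789abc', '-12345678') are the kind of input Integer.decode accepts. The sketch below assumes the UDF hands the seed string straight to Integer.decode, as the javadoc excerpt suggests; SeedDecoding, decodeSeed and isValidSeed are hypothetical names used only here.

```java
// Hypothetical helper class illustrating Integer.decode on seed strings.
class SeedDecoding {
    /** Parses a seed as Integer.decode does: signed decimal, 0x/0X/# hex, or leading-0 octal. */
    static int decodeSeed(String seed) {
        return Integer.decode(seed);
    }

    /** Returns true if Integer.decode accepts the string as a signed 32-bit value. */
    static boolean isValidSeed(String seed) {
        try {
            Integer.decode(seed);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] seeds = {"0x0", "0x56789abc", "-12345678", "010", "0xFFFFFFFF"};
        for (String s : seeds) {
            String shown = isValidSeed(s) ? String.valueOf(decodeSeed(s)) : "rejected";
            System.out.println(s + " -> " + shown);
        }
    }
}
```

Two things to watch for with this parsing: a leading zero means octal, so '010' decodes to 8, and hex literals are read as signed values, so '0xFFFFFFFF' is out of range and is rejected rather than wrapping to -1.
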
> In practice: 
> {code}
>   -- Consistent shuffle of large dataset with only one full-table reduce step.
>   -- Every pig run with the same seed will generate sorted output in the same order
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   -- Force each file to go in whole to a single mapper (or in the LOAD use
>   -- -tagSplit, to be added in future Pig version)
>   SET mapred.max.split.size 1099511627776;
>   -- -tagPath option labels each file
>   data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS
>     (path:chararray, val:chararray);
>   data_numbered = RANK data_in;
>   data_ided = FOREACH data_numbered GENERATE
>     MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS shuffle_key,
>     val AS val;
>   data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
>   STORE data_shuffled INTO 'data_shuffled';
> {code}
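
The idea behind the script above (hash a stable per-record key, then sort by the hash) can be sketched outside Pig as follows. Note the substitution: this uses the JDK's CRC32 as a stand-in for murmur3-32, since murmur3 requires Guava, so the resulting order will not match the UDF's. ConsistentShuffle, hashKey and shuffle are illustrative names only.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical sketch; CRC-32 stands in for the murmur3-32 hash used in the Pig script.
class ConsistentShuffle {
    /** Hashes a key string (UTF-8 bytes) to a 32-bit value held in a long. */
    static long hashKey(String key) {
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /**
     * Reorders values by the hash of "path#rank", where rank is the 1-based
     * position of the value in the input, mirroring the RANK step in the script.
     * Ties on the hash are broken by original position so the order is total.
     */
    static List<String> shuffle(String path, List<String> values) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            order.add(i);
        }
        order.sort(Comparator
                .comparingLong((Integer i) -> hashKey(path + "#" + (i + 1)))
                .thenComparingInt(i -> i));
        List<String> shuffled = new ArrayList<>();
        for (int i : order) {
            shuffled.add(values.get(i));
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> values = List.of("a", "b", "c", "d", "e");
        // The same path and input always produce the same output order.
        System.out.println(shuffle("input/part-00000", values));
        System.out.println(shuffle("input/part-00000", values));
    }
}
```

Because the key depends only on the file path and the record's rank, rerunning over the same input reproduces the same order, which is the property the comments in the script describe.
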
> Important notes about this patch:
> * It should be applied _after_ the patches for DATAFU-46 and DATAFU-48.
> * -(It expands the dependence on Guava. Does [pull req 
> 75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to 
> de-Guava datafu?)- 
> * -(The patch has (commented out) code that shows what life would be like if 
> the sip24, crc32 and adler32 hashes were available. On your advice, I will 
> either (a) put in a patch removing the spurious comments or (b) file a 
> separate bug to update guava, push in a patch for that, and put in a patch 
> restoring to glory the extra hashes.)-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
