[jira] [Updated] (DATAFU-47) UDF for Murmur3 (and other) Hash functions

2017-12-05 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-47:
---
Attachment: DATAFU-47-new.patch

I looked at the review board for this issue, and fixed the merge conflicts in 
HashTests and addressed the comments that were left. It depends on DATAFU-50, 
which was reopened, but I put a new patch there so that we can proceed with 
both.

Since I didn't create the review, I can't upload a new diff there, but I've 
attached it to the Jira issue, and commented in the review board where 
appropriate.

Tests pass, and I've run the content of "hasherTest" on a cluster using the 
assembled DataFu jar to make sure that the autojarring of the new Guava version 
works properly.

I'll respond to the review board comments later.

> UDF for Murmur3 (and other) Hash functions
> --
>
> Key: DATAFU-47
> URL: https://issues.apache.org/jira/browse/DATAFU-47
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>  Labels: Guava, Hash, UDF
> Attachments: 
> 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch, 
> 0001-UDF-for-Murmur3-and-other-Hash-functions.patch, DATAFU-47-new.patch
>
>
> Datafu should offer the murmur3 hash.
> The attached patch uses Guava to add murmur3 (a fast hash with good 
> statistical properties), SipHash-2-4 (a fast cryptographically secure hash), 
> crc32, adler32, md5 and sha.
> From the javadoc:
> * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
> [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. 
> Murmur3 is fast and has exceptionally good statistical properties; it's a 
> good choice if all you need is good mixing of the inputs. It is _not_ 
> cryptographically secure; that is, given an output value from murmur3, there 
> are efficient algorithms to find an input yielding the same output value. 
> Supply the seed as a string that 
> [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
>  can handle.
> * 'sip24', [optional seed]: Returns a [64-bit 
> SipHash-2-4|https://131002.net/siphash/]. SipHash is competitive in 
> performance with Murmur3, and is simpler and faster than the cryptographic 
> algorithms below. When used with a seed, it can be considered 
> cryptographically secure: given the output from a sip24 instance but not the 
> seed used, we cannot efficiently craft a message yielding the same output 
> from that instance.
> * 'adler32': Returns an Adler-32 checksum (32 hash bits) by delegating to 
> Java's Adler32 Checksum
> * 'crc32':   Returns a CRC-32 checksum (32 hash bits) by delegating to Java's 
> CRC32 Checksum.
> * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 
> MessageDigest.
> * 'sha1':Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
> MessageDigest.
> * 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
> MessageDigest.
> * 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
> MessageDigest.
> * 'good-(integer number of bits)': Returns a general-purpose, 
> non-cryptographic-strength, streaming hash function that produces hash codes 
> of length at least minimumBits. Users without specific compatibility 
> requirements and who do not persist the hash codes are encouraged to choose 
> this hash function. (Cryptographers, like dieticians and fashionistas, 
> occasionally realize that We've Been Doing it Wrong This Whole Time. Using 
> 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To 
> (Wear|Eat|Hash With) this Fall.) Values for this hash will change from run to 
> run.
> Examples: 
> {code}
>   define DefaultH    datafu.pig.hash.Hasher();
>   define GoodH   datafu.pig.hash.Hasher('good-32');
>   define BetterH datafu.pig.hash.Hasher('good-127');
>   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
>   define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
>   define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
>   define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
>   define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
>   define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
>   define MD5H    datafu.pig.hash.Hasher('md5');
>   define SHA1H   datafu.pig.hash.Hasher('sha1');
>   define SHA256H datafu.pig.hash.Hasher('sha256');
>   define SHA512H datafu.pig.hash.Hasher('sha512');
>   
>   data_in = LOAD 'input' as (val:chararray);
>   
>   data_out = FOREACH data_in GENERATE
>     DefaultH(val),   GoodH(val),   BetterH(val),
>     MurmurH32(val),  MurmurH32A(val),  MurmurH32B(val),
>     MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
>     SHA1H(val),   SHA256H(val),

[jira] [Updated] (DATAFU-47) UDF for Murmur3 (and other) Hash functions

2014-05-19 Thread Philip (flip) Kromer (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philip (flip) Kromer updated DATAFU-47:
---

Attachment: 
0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch

Updated patch to add the full range of hash functions Guava offers. Depends 
therefore on DATAFU-48.

 UDF for Murmur3 (and other) Hash functions
 --

 Key: DATAFU-47
 URL: https://issues.apache.org/jira/browse/DATAFU-47
 Project: DataFu
  Issue Type: Improvement
Reporter: Philip (flip) Kromer
  Labels: Guava, Hash, UDF
 Attachments: 
 0001-DATAFU-47-UDF-for-Murmur3-SipHash-2-4-and-other-Hash-functions.patch, 
 0001-UDF-for-Murmur3-and-other-Hash-functions.patch


 Datafu should offer the murmur3 hash.
 The attached patch uses Guava to add murmur3 (and also handles md5 and sha*). 
 If you're willing to upgrade Guava, we'd get sip24 (a fast cryptographically 
 secure hash), crc32 (in case you want that) and adler32 (another 
 checksum-style hash). 
 From the javadoc:
 * 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
 [murmur3 hash|https://code.google.com/p/smhasher/] of the given length. 
 Murmur3 is fast and has exceptionally good statistical properties; it's a 
 good choice if all you need is good mixing of the inputs. It is _not_ 
 cryptographically secure; that is, given an output value from murmur3, there 
 are efficient algorithms to find an input yielding the same output value. 
 Supply the seed as a string that 
 [Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
  can handle.
 * 'good-{integer number of bits}': Returns a general-purpose, 
 non-cryptographic-strength, streaming hash function that produces hash codes 
 of length at least minimumBits. Users without specific compatibility 
 requirements and who do not persist the hash codes are encouraged to choose 
 this hash function. (Cryptographers, like dieticians and fashionistas, 
 occasionally realize that We've Been Doing it Wrong This Whole Time. Using 
 'good-*' lets you track What the Experts From (Milan|NIH|IEEE) Say To 
 (Wear|Eat|Hash With) this Fall. Therefore, expect that values returned with 
 this choice may change from version to version.)
 * 'md5': Returns an MD5 hash (128 hash bits) using Java's MD5 
 MessageDigest.
 * 'sha1':Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
 MessageDigest.
 * 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
 MessageDigest.
 * 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
 MessageDigest.
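
The digest sizes listed above can be checked directly against the JDK. The following is only a minimal sketch, not code from the patch: the class name `DigestBits` and the helper `digestBits` are hypothetical, and it calls `java.security.MessageDigest` directly rather than going through Guava or the Hasher UDF.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration; not part of DataFu.
final class DigestBits {
    /** Returns the output size, in bits, of a JDK MessageDigest algorithm. */
    static int digestBits(String algorithm) {
        try {
            MessageDigest md = MessageDigest.getInstance(algorithm);
            // The digest length is fixed per algorithm, so any input works here.
            byte[] digest = md.digest("datafu".getBytes(StandardCharsets.UTF_8));
            return digest.length * 8;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("algorithm not available: " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        for (String alg : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(alg + ": " + digestBits(alg) + " bits");
        }
    }
}
```

Running `main` prints one line per algorithm with its size in bits.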
 Examples: 
 {code}
   define DefaultH    datafu.pig.hash.Hasher();
   define GoodH   datafu.pig.hash.Hasher('good-32');
   define BetterH datafu.pig.hash.Hasher('good-127');
   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
   define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
   define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
   define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
   define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
   define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
   define MD5H    datafu.pig.hash.Hasher('md5');
   define SHA1H   datafu.pig.hash.Hasher('sha1');
   define SHA256H datafu.pig.hash.Hasher('sha256');
   define SHA512H datafu.pig.hash.Hasher('sha512');
   
   data_in = LOAD 'input' as (val:chararray);
   
   data_out = FOREACH data_in GENERATE
     DefaultH(val),   GoodH(val),   BetterH(val),
     MurmurH32(val),  MurmurH32A(val),  MurmurH32B(val),
     MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
     SHA1H(val),   SHA256H(val),   SHA512H(val),
     MD5H(val)
     ;
   STORE data_out INTO 'output';
 {code}
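
The seed strings used in the examples ('0x0', '0x56789abc', '-12345678') are all forms that `Integer.decode` accepts. As a rough illustration of what that implies for valid seeds, here is a small sketch; the class `SeedParsing` and its methods are hypothetical names, and how the patch itself wires the parsed seed into the hash function is not shown here.

```java
// Hypothetical helper class for illustration; not part of DataFu.
final class SeedParsing {
    /** Parses a seed as Integer.decode does: decimal, 0x/0X/# hex, or leading-0 octal. */
    static int parseSeed(String seed) {
        return Integer.decode(seed);
    }

    /** True if Integer.decode can handle the string as a 32-bit int. */
    static boolean isValidSeed(String seed) {
        try {
            Integer.decode(seed);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0x0", "0x56789abc", "-12345678", "010", "0xffffffff"}) {
            System.out.println(s + " -> "
                + (isValidSeed(s) ? String.valueOf(parseSeed(s)) : "rejected"));
        }
    }
}
```

Two consequences worth knowing: a leading `0` selects octal, so '010' decodes to 8, and hex values above `0x7fffffff` are rejected, so a seed with the high bit set has to be written as a negative number.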
 In practice: 
 {code}
   -- Consistent shuffle of large dataset with only one full-table reduce step.
   -- Every pig run with the same seed will generate sorted output in the same order.
   define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
   -- Force each file to go in whole to a single mapper (or in the LOAD use
   -- -tagSplit, to be added in future Pig version)
   SET mapred.max.split.size 1099511627776;
   -- -tagPath option labels each file
   data_in = LOAD 'input' USING PigStorage('\t', '-tagPath') AS
     (path:chararray, val:chararray);
   data_numbered = RANK data_in;
   data_ided = FOREACH data_numbered GENERATE
     MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS shuffle_key,
     val AS val;
   data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
   STORE data_shuffled INTO 'data_shuffled';
 {code}
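
The idea behind the script above (key each record by a hash of its file path and rank, then sort by that key) can be sketched outside Pig as well. The version below is an assumption-laden illustration: `HashShuffle` and its methods are hypothetical names, and it substitutes the JDK's `CRC32` for murmur3 only because the JDK has no murmur3 implementation; CRC32 mixes less well and is used here just to show the mechanics.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.zip.CRC32;

// Hypothetical helper class for illustration; not part of DataFu.
final class HashShuffle {
    /** Hashes "path#rank", mirroring the CONCAT in the Pig script (CRC32 stands in for murmur3). */
    static long shuffleKey(String path, long rank) {
        CRC32 crc = new CRC32();
        byte[] bytes = (path + "#" + rank).getBytes(StandardCharsets.UTF_8);
        crc.update(bytes, 0, bytes.length);
        return crc.getValue();
    }

    /** Reorders values by the hash of (path, 1-based rank); same input gives the same order. */
    static List<String> shuffle(String path, List<String> values) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            order.add(i);
        }
        // List.sort is stable, so equal keys keep their original relative order.
        order.sort(Comparator.comparingLong((Integer i) -> shuffleKey(path, i + 1)));
        List<String> shuffled = new ArrayList<>();
        for (int i : order) {
            shuffled.add(values.get(i));
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> values = List.of("a", "b", "c", "d", "e");
        System.out.println(shuffle("part-00000", values));
    }
}
```

Because the key depends only on the path and the rank, rerunning over the same input reproduces the same ordering, which is the property the Pig example relies on.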
 Important notes about this patch:
 * It should be applied _after_ the patch for DATAFU-46
 * It expands the dependence on Guava. Does [pull req