[ https://issues.apache.org/jira/browse/DATAFU-47?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Philip (flip) Kromer updated DATAFU-47:
---------------------------------------

    Description: 
Datafu should offer the murmur3 hash.

The attached patch uses Guava to add murmur3 (and also handles md5 and sha*). 
If you're willing to upgrade Guava, we'd get sip24 (a fast cryptographically 
secure hash), crc32 (in case you want that) and adler32 (another checksum-style 
hash). 

From the javadoc:

* 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
[murmur3 hash|https://code.google.com/p/smhasher/] of the given length. Murmur3 
is fast and has exceptionally good statistical properties; it's a good choice 
if all you need is good mixing of the inputs. It is _not_ cryptographically 
secure; that is, given an output value from murmur3, there are efficient 
algorithms to find an input yielding the same output value. Supply the seed as 
a string that 
[Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
 can handle.
* 'good-{integer number of bits}': Returns a general-purpose, 
non-cryptographic-strength, streaming hash function that produces hash codes of 
length at least minimumBits. Users without specific compatibility requirements 
and who do not persist the hash codes are encouraged to choose this hash 
function. (Cryptographers, like dieticians and fashionistas, occasionally 
realize that We've Been Doing it Wrong This Whole Time. Using 'good-*' lets you 
track What the Experts From (Milan|NIH|IEEE) Say To (Wear|Eat|Hash With) this 
Fall. Therefore, expect that values returned with this choice may change from 
version to version.)
* 'md5':     Returns an MD5 hash (128 hash bits) using Java's MD5 MessageDigest.
* 'sha1':    Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
MessageDigest.
* 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
MessageDigest.
* 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
MessageDigest.
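
As a quick sanity check on the seed formats and digest sizes listed above, here is a minimal Java sketch that uses only the JDK. The class name `HasherNotes` and its helper methods are hypothetical and are not part of the patch; the sketch only exercises `Integer.decode` and `MessageDigest`, which the javadoc names, and leaves out the Guava-backed murmur3 and 'good-*' functions.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration; not part of the DATAFU-47 patch.
class HasherNotes {
    // Parse a seed string as the javadoc describes: anything Integer.decode accepts
    // (decimal, 0x-prefixed hex, or 0-prefixed octal, with an optional sign).
    static int parseSeed(String seed) {
        return Integer.decode(seed);
    }

    // Size in bits of the digest produced by a JDK MessageDigest algorithm.
    static int digestBits(String algorithm) {
        try {
            return MessageDigest.getInstance(algorithm).getDigestLength() * 8;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("JDK does not provide " + algorithm, e);
        }
    }

    public static void main(String[] args) {
        // The three seed strings used in the Pig examples.
        for (String seed : new String[] {"0x0", "0x56789abc", "-12345678"}) {
            System.out.println(seed + " -> " + parseSeed(seed));
        }
        // The four MessageDigest-backed algorithms from the list above.
        for (String algorithm : new String[] {"MD5", "SHA-1", "SHA-256", "SHA-512"}) {
            System.out.println(algorithm + ": " + digestBits(algorithm) + " bits");
        }
    }
}
```

Note that `Integer.decode` rejects values outside the signed 32-bit range, so a seed such as '0xffffffff' would have to be written as '-1'.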

Examples: 
{code}

  define DefaultH    datafu.pig.hash.Hasher();
  define GoodH       datafu.pig.hash.Hasher('good-32');
  define BetterH     datafu.pig.hash.Hasher('good-127');
  define MurmurH32   datafu.pig.hash.Hasher('murmur3-32');
  define MurmurH32A  datafu.pig.hash.Hasher('murmur3-32', '0x0');
  define MurmurH32B  datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
  define MurmurH128  datafu.pig.hash.Hasher('murmur3-128');
  define MurmurH128A datafu.pig.hash.Hasher('murmur3-128', '0x0');
  define MurmurH128B datafu.pig.hash.Hasher('murmur3-128', '-12345678');
  define MD5H        datafu.pig.hash.Hasher('md5');
  define SHA1H       datafu.pig.hash.Hasher('sha1');
  define SHA256H     datafu.pig.hash.Hasher('sha256');
  define SHA512H     datafu.pig.hash.Hasher('sha512');
  
  data_in = LOAD 'input' as (val:chararray);
  
  data_out = FOREACH data_in GENERATE
    DefaultH(val),   GoodH(val),       BetterH(val),
    MurmurH32(val),  MurmurH32A(val),  MurmurH32B(val),
    MurmurH128(val), MurmurH128A(val), MurmurH128B(val),
    SHA1H(val),      SHA256H(val),     SHA512H(val),
    MD5H(val)
    ;
  
  STORE data_out INTO 'output';

  -- Consistent shuffle of large dataset
  data_in = LOAD 'input' USING PigStorage('\n', '-tagPath') AS (path:chararray, val:chararray);
  data_numbered = RANK data_in;
  data_ided = FOREACH data_numbered GENERATE
    MurmurH32(CONCAT((chararray)path, '#', (chararray)rank_data_in)) AS shuffle_key,
    val AS val;
  data_shuffled = FOREACH (ORDER data_ided BY shuffle_key) GENERATE val;
  STORE data_shuffled INTO 'data_shuffled';
{code}
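
The idea behind the "consistent shuffle" above is that each record gets a key derived only from its source path and its rank, so sorting by that key gives a scrambled order that is identical on every run. A rough Java sketch of the same idea follows, using only the JDK. It rests on several assumptions: the class `ConsistentShuffle` and its methods are hypothetical, the first 8 bytes of an MD5 digest stand in for the murmur3-32 hash (murmur3 would need Guava), and the 1-based list position stands in for the rank column.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration; not part of the DATAFU-47 patch.
class ConsistentShuffle {
    // ASSUMPTION: MD5 is a stand-in for the murmur3-32 hash used in the Pig script.
    // The key depends only on (path, rank), so it is the same on every run.
    static long shuffleKey(String path, long rank) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                .digest((path + "#" + rank).getBytes(StandardCharsets.UTF_8));
            return ByteBuffer.wrap(digest).getLong(); // first 8 bytes of the digest
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Reorder values by the shuffle key of their 1-based position.
    static List<String> shuffle(String path, List<String> values) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            positions.add(i);
        }
        positions.sort(Comparator.comparingLong((Integer i) -> shuffleKey(path, i + 1)));
        List<String> shuffled = new ArrayList<>();
        for (int i : positions) {
            shuffled.add(values.get(i));
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "c", "d", "e");
        // Prints the same permutation on every run for a given path.
        System.out.println(shuffle("part-00000", values));
    }
}
```

Because the key is a pure function of the path and rank, rerunning the job over the same input reproduces the same order, which is not the case when sorting by a call to a random number generator.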

Important notes about this patch:

* It should be applied _after_ the patch for DATAFU-46
* It expands the dependence on Guava. Does [pull req 
75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to 
de-Guava datafu?
* The patch has (commented out) code that shows what life would be like if the 
sip24, crc32 and adler32 hashes were available. On your advice, I will either 
(a) put in a patch removing the spurious comments or (b) file a separate bug to 
update guava, push in a patch for that, and put in a patch restoring to glory 
the extra hashes.


  was:
Datafu should offer the murmur3 hash.

The attached patch uses Guava to add murmur3 (and also handles md5 and sha*). 
If you're willing to upgrade Guava, we'd get sip24 (a fast cryptographically 
secure hash), crc32 (in case you want that) and adler32 (another checksum-style 
hash). 

From the javadoc:

* 'murmur3-32', [optional seed] or 'murmur3-128', [optional seed]: Returns a 
[murmur3 hash|https://code.google.com/p/smhasher/] of the given length. Murmur3 
is fast and has exceptionally good statistical properties; it's a good choice 
if all you need is good mixing of the inputs. It is _not_ cryptographically 
secure; that is, given an output value from murmur3, there are efficient 
algorithms to find an input yielding the same output value. Supply the seed as 
a string that 
[Integer.decode|http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#decode(java.lang.String)]
 can handle.
* 'good-{integer number of bits}': Returns a general-purpose, 
non-cryptographic-strength, streaming hash function that produces hash codes of 
length at least minimumBits. Users without specific compatibility requirements 
and who do not persist the hash codes are encouraged to choose this hash 
function. (Cryptographers, like dieticians and fashionistas, occasionally 
realize that We've Been Doing it Wrong This Whole Time. Using 'good-*' lets you 
track What the Experts From (Milan|NIH|IEEE) Say To (Wear|Eat|Hash With) this 
Fall. Therefore, expect that values returned with this choice may change from 
version to version.)
* 'md5':     Returns an MD5 hash (128 hash bits) using Java's MD5 MessageDigest.
* 'sha1':    Returns a SHA-1 hash (160 hash bits) using Java's SHA-1 
MessageDigest.
* 'sha256':  Returns a SHA-256 hash (256 hash bits) using Java's SHA-256 
MessageDigest.
* 'sha512':  Returns a SHA-512 hash (512 hash bits) using Java's SHA-512 
MessageDigest.

Examples: 
{code}
datafu.pig.hash.Hasher('murmur3-32', '0x56789abc');
datafu.pig.hash.Hasher('murmur3-32', '-12345678');
{code}

Important notes about this patch:

* It should be applied _after_ the patch for DATAFU-46
* It expands the dependence on Guava. Does [pull req 
75|https://github.com/linkedin/datafu/pull/75] mean there's momentum to 
de-Guava datafu?
* The patch has (commented out) code that shows what life would be like if the 
sip24, crc32 and adler32 hashes were available. On your advice, I will either 
(a) put in a patch removing the spurious comments or (b) file a separate bug to 
update guava, push in a patch for that, and put in a patch restoring to glory 
the extra hashes.



> UDF for Murmur3 (and other) Hash functions
> ------------------------------------------
>
>                 Key: DATAFU-47
>                 URL: https://issues.apache.org/jira/browse/DATAFU-47
>             Project: DataFu
>          Issue Type: Improvement
>            Reporter: Philip (flip) Kromer
>              Labels: Guava, Hash, UDF



--
This message was sent by Atlassian JIRA
(v6.2#6252)
