Matthew Hayes created DATAFU-46:
-----------------------------------

             Summary: Hash UDFs should return zero-padded strings of uniform 
length even when leading bits are zero
                 Key: DATAFU-46
                 URL: https://issues.apache.org/jira/browse/DATAFU-46
             Project: DataFu
          Issue Type: Bug
            Reporter: Matthew Hayes
             Fix For: 1.3.0


Reported by Philip Kromer here:

https://github.com/linkedin/datafu/issues/93

Details reported there by Philip:

---------------------

The Hash UDFs in 'hex' mode do not always return a string of the same length,
because BigInteger.toString() omits leading zeros. So in a stream where about
94% of the strings have the full length, 1/16th are shorter by one or more
characters, 1/256th by two or more, and in the unlikely case that an MD5 hash's
value were 124 bits of zeros followed by 4 bits of ones, it would return the
one-character string 'f'.
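
To make the truncation concrete, here is a minimal Java sketch of the behavior
and of one possible zero-padding fix. The class and method names (PaddedHex,
toPaddedHex) are hypothetical and are not taken from the DataFu source; the
sketch assumes the digest arrives as a byte array and is hex-encoded through
BigInteger.toString(16).

```java
import java.math.BigInteger;

class PaddedHex {
    // Hypothetical helper: hex-encode a digest and left-pad with zeros so the
    // result always has two characters per input byte.
    static String toPaddedHex(byte[] digest) {
        // Signum 1 treats the bytes as an unsigned magnitude; toString(16)
        // drops any leading zero digits.
        String hex = new BigInteger(1, digest).toString(16);
        int width = digest.length * 2;
        StringBuilder sb = new StringBuilder(width);
        for (int i = hex.length(); i < width; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        // 124 zero bits followed by four one bits, as in the example above.
        byte[] mostlyZero = new byte[16];
        mostlyZero[15] = 0x0f;

        // Unpadded: prints the one-character string "f".
        System.out.println(new BigInteger(1, mostlyZero).toString(16));
        // Padded: prints 31 zeros followed by "f" (32 characters in total).
        System.out.println(toPaddedHex(mostlyZero));
    }
}
```

Padding to twice the byte length rather than to a fixed 32 keeps the helper
usable for digests of other sizes.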

This is surprising behavior, and a trap for those practicing the frequent trick
of generating a hash and keeping only the number of bits they need:

{code}
-- returns one-fifteenth, not one-sixteenth, of the input.
sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY 
(STARTSWITH(digest, 'f'));
{code}
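
The one-fifteenth figure can be checked by brute force. The sketch below is
only an illustration: it uses 16-bit values (four hex digits) as a stand-in
for 128-bit MD5 digests, assumes uniformly distributed values, and the class
name PrefixSampling is hypothetical.

```java
class PrefixSampling {
    // Count the 16-bit values whose hex representation starts with 'f',
    // either zero-padded to four digits or with leading zeros dropped.
    static int countStartingWithF(boolean padded) {
        int count = 0;
        for (int v = 0; v < (1 << 16); v++) {
            String hex = padded ? String.format("%04x", v) : Integer.toHexString(v);
            if (hex.startsWith("f")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int total = 1 << 16; // 65536 possible values
        // Unpadded: 1 + 16 + 256 + 4096 = 4369 matches, about 1/15 of the input.
        System.out.println("unpadded: " + countStartingWithF(false) + " of " + total);
        // Padded: 4096 matches, exactly 1/16 of the input.
        System.out.println("padded:   " + countStartingWithF(true) + " of " + total);
    }
}
```

Without padding, the shorter strings 'f', 'fX' and 'fXX' also match the
prefix, which is where the extra matches beyond one-sixteenth come from.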




--
This message was sent by Atlassian JIRA
(v6.2#6252)
