Matthew Hayes created DATAFU-46:
-----------------------------------

             Summary: Hash UDFs should return zero-padded strings of uniform length even when leading bits are zero
                 Key: DATAFU-46
                 URL: https://issues.apache.org/jira/browse/DATAFU-46
             Project: DataFu
          Issue Type: Bug
            Reporter: Matthew Hayes
             Fix For: 1.3.0

Reported by Philip Kromer here:

https://github.com/linkedin/datafu/issues/93

Details reported there by Philip:

---------------------

The Hash UDFs in 'hex' mode currently do not always return a string of the same length, because BigInteger.toString() omits leading zeros. So in a stream where about 94% (15/16) of the strings have the full length, 1/16th are shorter by one or more characters, 1/256th by two or more, and in the unlikely case that an MD5 hash's value were 124 bits of zeros followed by 4 bits of ones, the UDF would return the one-character-long string 'f'.

This is surprising behavior, and a trap for those practicing the frequent trick of generating a hash and chopping off just the number of bits you need:

{code}
-- returns one-fifteenth, not one-sixteenth, of the input.
sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY (STARTSWITH(digest, 'f'));
{code}

With leading zeros stripped, the first character of the string is never '0', so 'f' is one of only 15 equally likely leading characters rather than one of 16.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
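
A minimal Java sketch of the behavior described above and of one possible way to zero-pad the result. The class and method names (PaddedHex, unpaddedHex, paddedHex, md5Hex) are hypothetical and are not taken from the DataFu source. The sketch assumes the digest bytes are converted with new BigInteger(1, digest), and it is not necessarily the fix that was applied for 1.3.0.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper class for illustration only; not part of DataFu.
final class PaddedHex {

    // Reproduces the reported behavior: BigInteger.toString(16) drops leading zero digits.
    static String unpaddedHex(byte[] digest) {
        return new BigInteger(1, digest).toString(16);
    }

    // One possible fix: left-pad with zeros to two hex digits per digest byte.
    static String paddedHex(byte[] digest) {
        int width = digest.length * 2;
        return String.format("%0" + width + "x", new BigInteger(1, digest));
    }

    // MD5 of a UTF-8 string as a fixed-width (32-character) hex string.
    static String md5Hex(String value) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            return paddedHex(md.digest(value.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected to happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The extreme case from the report: 124 zero bits followed by 4 one bits.
        byte[] digest = new byte[16];
        digest[15] = 0x0f;

        System.out.println(unpaddedHex(digest).length()); // 1
        System.out.println(paddedHex(digest).length());   // 32
        System.out.println(md5Hex("hello").length());     // 32
    }
}
```

With fixed-width output, the leading character is drawn from all 16 hex digits, so a filter such as STARTSWITH(digest, 'f') keeps about one-sixteenth of the input, as the sampling trick expects.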