[ https://issues.apache.org/jira/browse/DATAFU-46?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matthew Hayes updated DATAFU-46:
--------------------------------
    Assignee: Philip (flip) Kromer

> Hash UDFs should return zero-padded strings of uniform length even when
> leading bits are zero
> ---------------------------------------------------------------------------------------------
>
>                 Key: DATAFU-46
>                 URL: https://issues.apache.org/jira/browse/DATAFU-46
>             Project: DataFu
>          Issue Type: Bug
>            Reporter: Matthew Hayes
>            Assignee: Philip (flip) Kromer
>             Fix For: 1.3.0
>
>         Attachments:
> 0001-Hash-UDFs-return-zero-padded-strings-of-uniform-leng.patch
>
>
> Reported by Philip Kromer here:
> https://github.com/linkedin/datafu/issues/93
> Details reported there by Philip:
> ---------------------
> The Hash UDFs in 'hex' mode currently do not always return a string of the
> same length, because BigInteger.toString() omits leading zeros. So amidst a
> stream of 94% strings the same length, 1/16th are shorter by one or more
> characters, 1/256th by two or more, and in the unlikely case that an MD5
> hash's value was 124 bits of zeros and 4 bits of ones it would return the
> one-character-long string 'f'.
> This is surprising behavior, and a trap for those practicing the frequent
> trick of generating a hash and chopping off just the number of bits you need:
> {code}
> -- returns one-fifteenth, not one-sixteenth, of the input.
> sampled_lines = FILTER(FOREACH lines GENERATE MD5(val) AS digest, val) BY
> (STARTSWITH(digest, 'f'));
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)
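
The behavior described in the report can be reproduced with a short Java sketch. This is an illustration only, not the attached patch: the class name HexDigestDemo and both helper methods are hypothetical, and the sketch assumes the UDFs convert the digest bytes to hex through BigInteger as the report states. It contrasts the unpadded conversion with one that left-pads to two hex characters per digest byte.

```java
import java.math.BigInteger;

public class HexDigestDemo {
    // Unpadded conversion: BigInteger.toString(16) drops leading zeros,
    // so the result can be shorter than two characters per digest byte.
    static String unpaddedHex(byte[] digest) {
        return new BigInteger(1, digest).toString(16);
    }

    // Padded conversion: left-pad with '0' up to two hex characters per byte.
    static String paddedHex(byte[] digest) {
        String hex = new BigInteger(1, digest).toString(16);
        StringBuilder sb = new StringBuilder();
        for (int i = hex.length(); i < digest.length * 2; i++) {
            sb.append('0');
        }
        return sb.append(hex).toString();
    }

    public static void main(String[] args) {
        // A 128-bit value with 124 zero bits followed by 4 one bits,
        // the extreme case mentioned in the report.
        byte[] digest = new byte[16];
        digest[15] = 0x0f;
        System.out.println(unpaddedHex(digest)); // the one-character string "f"
        System.out.println(paddedHex(digest));   // 32 characters, ending in "f"
    }
}
```

With padding in place, a prefix test such as STARTSWITH(digest, 'f') sees the true leading hex digit of every hash, so it should select about one-sixteenth of uniformly distributed input rather than one-fifteenth.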