HashPartitioner is not a safe partitioner for non-prime number of reducers, 
particularly bad for 2^n, which seems to be a common use
------------------------------------------------------------------------------------------------------------------------------------

                 Key: PIG-2116
                 URL: https://issues.apache.org/jira/browse/PIG-2116
             Project: Pig
          Issue Type: Bug
    Affects Versions: 0.8.1, 0.9.0
            Reporter: Woody Anderson


the implementation of hashCode should not be assumed to be good.
in particular, the hashCode of String and List (used by Tuple) are very bad for 
modulus 2^n.

we propose to add an additional perturbation of the int before doing the "% 
reducers" bucketing.
HashMap.java uses this to prevent the String.hashCode from causing massive 
bucket collisions etc. but that perturbation is targeted explicitly for a 2^n 
number of buckets, which Pig is not doing in general.

we propose possibly using the final mixing step from murmur3.

here is some discussion of this issue for context:

This has some amusing implications: this hash is terrible for
2,4,8,16,31, and 32 reducers, so even in normal situations that's pretty
bad, especially if pig happens to pick 31 reducers because it has
104-106 mappers * 0.3.

31 is congruent to -1 mod 2^k for all 2 <= k <= 5, so in that case the hash is
effectively:

t[0]*(-1)^(n-1) + t[1]*(-1)^(n-2) + ... + t[n-2]*(-1) + t[n-1]

= (for odd n) t[0] - t[1] + t[2] - t[3] + t[4] + ...

So for example the string "mississippim" hashes to 0 (mod 2^32), as
every even input character is cancelled out by an equal odd input
elsewhere.

H = 0
for c in "mississippim":
  H = H*31 + ord(c)
  print "%c: H=%d (mod 32)" % (c, H%32)

m: H=13 (mod 32)
i: H=28 (mod 32)
s: H=23 (mod 32)
s: H=28 (mod 32)
i: H=13 (mod 32)
s: H=6 (mod 32)
s: H=13 (mod 32)
i: H=28 (mod 32)
p: H=20 (mod 32)
p: H=28 (mod 32)
i: H=13 (mod 32)
m: H=0 (mod 32)

Similarly with exactly 31 reducers, the hash function cancels out
entirely (31 is 0 mod 31, so everything but the last item is multiplied
by 0^i) and the result is simply the value of the last item.

A simple fix is to add a post-hash mixing step that nontrivially affects
the bits in the state over all other bits in the hash output, ideally
with probability 1/2 for all bits.  That way the modulo doesn't
distribute across the whole function back to the input, and the internal
state of the hash above whatever modulus has some effect.

H = 0
for c in "mississippim":
  H = H*31 + ord(c)
  # these &0xffffffff ops are to simulate unsigned 32-bit math in python
  H = H&0xffffffff
  Hout = (H + (H<<3))&0xffffffff
  Hout = Hout ^ (Hout>>11)
  Hout = (Hout + (Hout<<15))&0xffffffff
  print "%c: H=%08x === %d (mod 32)" % (c, Hout, Hout%32)

m: H=01ea83d5 === 21 (mod 32)
i: H=3d39fa73 === 19 (mod 32)
s: H=6c78d8d4 === 20 (mod 32)
s: H=3c76f555 === 21 (mod 32)
i: H=0abb25ff === 31 (mod 32)
s: H=40df81c9 === 9 (mod 32)
s: H=cfc8a427 === 7 (mod 32)
i: H=cea62c2b === 11 (mod 32)
p: H=4594d493 === 19 (mod 32)
p: H=f14b432a === 10 (mod 32)
i: H=169be0b0 === 16 (mod 32)
m: H=7d57b59c === 28 (mod 32)

The mixing step only needs to be done once at the end.  The one I
inserted was stolen from Bob Jenkins' hash site, which is required
reading for anyone who decides to implement their own hashing.

Or you could use a real (good, fast, tested) hash function like murmur3.

-Andy


On Thu, Jun 02, 2011 at 03:37:56PM -0700, Woody Anderson wrote:
> This caught me off guard the other day, so i figured i'd pass it along:
> 
> the hashCode implementation of Tuple and String have very specific expansions 
> which do not provide a lot of hashCode variance mod 2^k when the elements are 
> all equal.
> 
> string:
>  t[0]*31^(n-1) + t[1]*31^(n-2) + ... + t[n-1]
> tuple:
> ..(((31 + t[0])*31 + t[1])*31 + t[2])*31 + t[4]..
> 
> this expansion modulo powers of 2 is degenerate if t[i] are all equal.
> eg. you group by (n0, n1) to do some work, and there are an unusually high 
> number of tuples where n0 == n1, the value of n0/n1 makes no difference. this 
> will equal 1 mod 16.
> the same goes if you're grouping by strings, and have a lot of "a", "aa", 
> "aaaa", "b", "bb", "bbb", etc. type data
> this results in all the data ending up in a single reducer/part file. which 
> is either a waste or going to kill your job.
> so, if you use 2^k reducers then that's a terrible group-by. and it's not 
> going to be good (in general) for any non-prime.
> 
> under 'normal' circumstances you probably won't notice this being a factor. I 
> didn't notice until i used string.hashCode as part of a group-by to both 
> group by my string an produce a semi-randomized output ordering (sherpa 
> requirement); this completely blew up when simply grouping by the string 
> hadn't.
> 
> so, if you have highly varied data elements, this this is less of an issue, 
> though a prime will usually generalize better, and you won't suddenly wonder 
> about the bad dispersal you're getting.
> -w

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to