> I want to override the partitionByHash function on Flink in the same way
> as DBY on Hive. I am working on implementing a benchmark system for these
> two systems, which could be a contribution to Hive as well.
I would be very disappointed if Flink failed to outperform Hive with a DISTRIBUTE BY, because the Hive version is about 5-10x slower than it could be with Tez.

MapReduce forces a full sort of the output data, so the Hive version will potentially be O(N*log(N)) by default, while Flink should be able to do O(N).

Assuming you don't turn on any of the compatibility modes, the hash code generated is a murmur hash computed after encoding the data into a byte[] using BinarySortableSerDe, and the data is then sorted using key = (murmur_hash(byte[]) % n-reducers). The reducers then pull the data and merge-sort it on disk, which is entirely wasted CPU.

If you or anyone else is interested in fixing this for Tez, I have a JIRA open to fix the hash-only shuffle:

https://issues.apache.org/jira/browse/HIVE-11858

Cheers,
Gopal
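For illustration, the partition-assignment step described above can be sketched roughly as follows. This is a toy sketch, not Hive's actual code: the real path runs Murmur3 over the BinarySortableSerDe-encoded bytes, while here Java's `Arrays.hashCode` stands in for the murmur hash and a plain UTF-8 encoding stands in for the SerDe:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HashShuffleSketch {
    // Toy stand-in for murmur_hash(byte[]); Hive actually uses Murmur3 here.
    static int toyHash(byte[] encoded) {
        return Arrays.hashCode(encoded);
    }

    // key = (hash % n-reducers); the mask keeps the result non-negative
    // so every record lands in a valid reducer bucket [0, numReducers).
    static int partitionFor(byte[] encoded, int numReducers) {
        return (toyHash(encoded) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 8;
        for (String key : new String[] {"alice", "bob", "carol"}) {
            // Stand-in for the BinarySortableSerDe encoding step.
            byte[] encoded = key.getBytes(StandardCharsets.UTF_8);
            System.out.println(key + " -> reducer " + partitionFor(encoded, numReducers));
        }
    }
}
```

The point of the JIRA is that this partition assignment alone is all a DISTRIBUTE BY needs; the per-key sort that MapReduce bolts on afterwards is what wastes the CPU.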