> I want to override the partitionByHash function on Flink in the same way
> as DBY on Hive. I am working on implementing a benchmark system for these
> two systems, which could be a contribution to Hive as well.
I would be very disappointed if Flink failed to outperform Hive with a DISTRIBUTE BY, because the Hive version is about 5-10x slower than it could be with Tez.

MapReduce forces a full sort of the output data, so the Hive version will potentially be O(N*log(N)) by default, while Flink should be able to do O(N).

Assuming you don't turn on any of the compatibility modes, the hash code generated is a murmur hash computed after encoding the data into a byte[] using BinarySortableSerDe, and the data is then sorted using key = (murmur_hash(byte[]) % n-reducers). The reducers then pull the data and merge-sort it on disk, which is entirely wasted CPU.

If you or anyone else is interested in fixing this for Tez, I have a JIRA open to fix the hash-only shuffle:

https://issues.apache.org/jira/browse/HIVE-11858

Cheers,
Gopal
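For illustration, the partition-assignment step described above can be sketched roughly as follows. This is a toy sketch, not Hive's actual code: the real path runs Murmur3 over the BinarySortableSerDe-encoded bytes, while here Java's `Arrays.hashCode` stands in for the murmur hash and a plain UTF-8 encoding stands in for the SerDe:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class HashShuffleSketch {
    // Toy stand-in for murmur_hash(byte[]); Hive actually uses Murmur3 here.
    static int toyHash(byte[] encoded) {
        return Arrays.hashCode(encoded);
    }

    // key = (hash % n-reducers); the mask keeps the result non-negative
    // so every record lands in a valid reducer bucket [0, numReducers).
    static int partitionFor(byte[] encoded, int numReducers) {
        return (toyHash(encoded) & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 8;
        for (String key : new String[] {"alice", "bob", "carol"}) {
            // Stand-in for the BinarySortableSerDe encoding step.
            byte[] encoded = key.getBytes(StandardCharsets.UTF_8);
            System.out.println(key + " -> reducer " + partitionFor(encoded, numReducers));
        }
    }
}
```

The point of the JIRA is that this partition assignment alone is all a DISTRIBUTE BY needs; the per-key sort that MapReduce bolts on afterwards is what wastes the CPU.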