This should get you on the right path: https://issues.apache.org/jira/browse/HIVE-7121
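For orientation while reading the source, here is a minimal sketch of hash-modulo partitioning. The formula in `partitionFor` is the one Hadoop's default `HashPartitioner` uses. Whether your Hive version applies exactly this to DISTRIBUTE BY, and which hash it computes over the distribute columns, is an assumption to verify against the ticket above and the code. The class name, method names, and the mixing hash are all hypothetical and only for illustration. The point is that the reducer is chosen from a hash of the value, not the value itself, so two distinct UDF outputs can share a reducer while another reducer gets nothing.

```java
import java.util.Arrays;
import java.util.function.IntUnaryOperator;

// Hypothetical illustration only; these are not Hive's actual classes.
class PartitionSketch {
    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit, then take the remainder by the reducer count.
    static int partitionFor(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    // For the values 0..numReducers-1, count how many land on each reducer
    // when `hash` is applied to the value before partitioning.
    static int[] histogram(IntUnaryOperator hash, int numReducers) {
        int[] counts = new int[numReducers];
        for (int v = 0; v < numReducers; v++) {
            counts[partitionFor(hash.applyAsInt(v), numReducers)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int n = 12;
        // Identity hash: every value maps to the reducer with the same index.
        System.out.println(Arrays.toString(histogram(v -> v, n)));
        // An assumed mixing hash (not Hive's): the spread is no longer
        // guaranteed to be one value per reducer.
        System.out.println(Arrays.toString(histogram(v -> v * 0x9E3779B9, n)));
    }
}
```

If the partitioner in your version turns out to behave like the identity case, having the UDF return the target reducer index directly may be enough. If it mixes the hash, the UDF would need to emit values whose hash lands on the intended reducer.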
From: Connell Donaghy [mailto:cdona...@pinterest.com]
Sent: Monday, July 13, 2015 2:50 PM
To: user@hive.apache.org
Subject: DISTRIBUTE BY question

Hey!

I'm trying to write a tool that uses a storage handler to store HFiles, using a specific partition function. To do this, I have been trying to use DISTRIBUTE BY with a UDF that takes the key column and the number of reducers (which becomes the number of partitions, as each reducer creates its own HFile).

However, I have noticed that sometimes two UDF values (say 0 and 11) will both go to reducer 0, while reducer 11 does not get any input.

Could you point me to the place in your source code where you implement the partitioning for the map/reduce job and DISTRIBUTE BY, so that I can try to reverse-engineer it and ensure the keys go to the right partition? If my question doesn't make sense, just pointing me to where DISTRIBUTE BY is implemented would be very helpful. Thank you so much for your time!