In RDBMS (say Oracle or Sybase) a hash join comes to play when you tend to join two tables with a large set of information (large being a relative term). The smaller of two tables is hashed into memory.
For Hive, I have two tables one called t with 1.7 million rows and another called smallt with 100 rows. Using three ways of joining the two tables on object_id column 1) WHERE EXISTS 0: jdbc:hive2://rhes564:10010/default> select count(1) from t WHERE EXISTS (select 1 from smallt where t.object_id = smallt.object_id); +------+--+ | _c0 | +------+--+ | 100 | +------+--+ 1 row selected (66.369 seconds) 2) IN 0: jdbc:hive2://rhes564:10010/default> select count(1) from t where t.object_id IN (select smallt.object_id from smallt); +------+--+ | _c0 | +------+--+ | 100 | +------+--+ 1 row selected (66.158 seconds) 3) Classic Join 0: jdbc:hive2://rhes564:10010/default> select count(1) from t, smallt where t.object_id = smallt.object_id; +------+--+ | _c0 | +------+--+ | 100 | +------+--+ 1 row selected (68.978 seconds) You can see the results and judge for yourself HTH Mich Talebzadeh Sybase ASE 15 Gold Medal Award 2008 A Winning Strategy: Running the most Critical Financial Data on ASE 15 <http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf> http://login.sybase.com/files/Product_Overviews/ASE-Winning-Strategy-091908.pdf Author of the books "A Practitioner’s Guide to Upgrading to Sybase ASE 15", ISBN 978-0-9563693-0-7. co-author "Sybase Transact SQL Guidelines Best Practices", ISBN 978-0-9759693-0-4 Publications due shortly: Complex Event Processing in Heterogeneous Environments, ISBN: 978-0-9563693-3-8 Oracle and Sybase, Concepts and Contrasts, ISBN: 978-0-9563693-1-4, volume one out shortly <http://talebzadehmich.wordpress.com/> http://talebzadehmich.wordpress.com NOTE: The information in this email is proprietary and confidential. This message is for the designated recipient only, if you are not the intended recipient, you should destroy it immediately. Any information in this message shall not be understood as given or endorsed by Peridale Technology Ltd, its subsidiaries or their employees, unless expressly so stated. It is the responsibility of the recipient to ensure that this email is virus free, therefore neither Peridale Ltd, its subsidiaries nor their employees accept any responsibility. From: Raviv Murciano-Goroff [mailto:ravi...@gmail.com] Sent: 29 August 2015 00:50 To: user@hive.apache.org Subject: Join vs. Where...In Hi, I often have the following situation: I have a small table with a list of unique IDs and a very large table of events associated with the IDs. I want to perform some aggregation including only events associated with IDs from the small table. Is there a rule of thumb for whether performing a JOIN on the unique ID is faster or slower than using WHERE id IN (SELECT id FROM small_table...)? Thank you for your advice, Raviv