Gopal V created HIVE-20624:
------------------------------
Summary: Vectorization: Fast Hash table should not double after
certain size, instead should grow
Key: HIVE-20624
URL: https://issues.apache.org/jira/browse/HIVE-20624
Project: Hive
Issue Type: Bug
Components: Vectorization
Reporter: Gopal V
The reason to use a power of two is to simplify the inner loop of the hash
function, but this becomes a significant memory issue when dealing with somewhat
larger hashtables like the customer or customer_address hashtables in TPC-DS.
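To illustrate why power-of-two sizing simplifies the inner loop: the slot lookup reduces to a single bitwise AND instead of a general modulo. A minimal sketch (illustrative only, not Hive's actual code; the capacity and key are made up):

```java
public class PowerOfTwoIndex {
    public static void main(String[] args) {
        int capacity = 1 << 20;      // power-of-two table size (hypothetical)
        int mask = capacity - 1;     // all low bits set
        int hash = "customer_sk".hashCode();
        int slotByMask = hash & mask;                    // cheap: single AND
        int slotByMod = Math.floorMod(hash, capacity);   // general modulo
        // For power-of-two capacities the two are identical, even for negative hashes
        System.out.println(slotByMask == slotByMod);     // prints "true"
    }
}
```

The AND works for any sign of the hash because the low bits of the two's-complement representation coincide with the non-negative residue when the capacity is a power of two.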
This doubling is pretty bad when the hashtable load factor is 0.75 and the
expected key count is 65M (for customer/customer_address):
{code}
long worstCaseNeededSlots = 1L << DoubleMath.log2(numRows / hashTableLoadFactor, RoundingMode.UP);
{code}
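For 65M keys at a 0.75 load factor, the power-of-two rounding in that estimate is stark. A small sketch of the sizes involved, substituting a plain ceil-log2 helper for Guava's DoubleMath.log2(x, RoundingMode.UP):

```java
public class SlotEstimate {
    // ceil(log2(x)) for positive x, standing in for Guava's
    // DoubleMath.log2(x, RoundingMode.UP)
    static int log2Up(long x) {
        return 64 - Long.numberOfLeadingZeros(x - 1);
    }

    public static void main(String[] args) {
        long numRows = 65_000_000L;
        double hashTableLoadFactor = 0.75;
        long needed = (long) (numRows / hashTableLoadFactor);
        long worstCaseNeededSlots = 1L << log2Up(needed);
        System.out.println(needed);               // 86666666 (~86.7M slots needed)
        System.out.println(worstCaseNeededSlots); // 134217728 (128M, ~1.5x the need)
    }
}
```

So the power-of-two rounding alone inflates ~86.7M needed slots to 128M allocated slots, before any rehash-time duplication is counted.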
That estimate actually matches the implementation, but after accumulating 65M
items in a single array, rehashing requires a temporary growth to 65M + 128M
while the rehash is in progress, all to fit exactly those 65M items back into it.
Fixing the estimate to match the implementation produced a number of
regressions in query runtimes, so the part that actually needs fixing is the
doubling implementation.
The obvious solution is to grow by 4M slots every time and use a modulo function
or Lemire's multiply + shift operation [1], but more on that in the comments.
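The multiply + shift reduction maps a 32-bit hash into [0, range) with one multiplication and one shift, without requiring the range to be a power of two, which is what makes non-doubling growth steps viable. A sketch in Java (the 69M range is an assumed example of a table grown in 4M steps):

```java
public class FastRange {
    // Lemire-style multiply-then-shift: maps a 32-bit hash into [0, range)
    // without a division and without requiring range to be a power of two.
    static int fastRange(int hash, int range) {
        return (int) ((Integer.toUnsignedLong(hash) * range) >>> 32);
    }

    public static void main(String[] args) {
        int range = 69_000_000; // hypothetical table sized by +4M growth steps
        int slot = fastRange("customer_address".hashCode(), range);
        System.out.println(slot >= 0 && slot < range); // prints "true"
    }
}
```

The product of a 32-bit unsigned hash and a sub-2^31 range fits in a signed 64-bit long, so the high 32 bits after the shift are always a valid slot index.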
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)