tyronecai commented on PR #15779:
URL: https://github.com/apache/lucene/pull/15779#issuecomment-3996369175
> But, I still think a zero-hash impl should work too! (Using the opto that stuffs some of the hash bits into `ids` I linked above).
```
Oooh I see, I think that should work? You discard the old hash map (ids[])
immediately (replace with 2X larger one), so surge is just 2X not 3X the prior
array.
```
—— Yes, just build the new `ids` from all terms pointed to by `bytesStart[]`.
```
You should also be able to iterate via the byte[] blocks? They are just
concatenated byte[] with prefix 1 or 2 byte vInt. Then the access would be
entirely sequential --> CPU, caches, RAM happier? Maybe even single pass is OK
even mixed with random-writes into new hash?
```
—— Yes, but this is actually the same access pattern as the loop below.
```java
for (int id = 0; id < count; id++) {
  hashcode = code = pool.hash(bytesStart[id]);
```
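To make that rebuild concrete, here is a standalone sketch (illustrative only, not Lucene's actual code: `RehashSketch` is a made-up name, and the precomputed `hashes[id]` array stands in for calling `pool.hash(bytesStart[id])` per term):

```java
import java.util.Arrays;

// Standalone sketch of rebuilding a 2x-larger ids[] table by walking every term id.
// In Lucene the per-term hash would come from pool.hash(bytesStart[id]); here it is
// passed in precomputed so the sketch is self-contained.
final class RehashSketch {
  static int[] rehash(int count, int[] hashes, int newHashSize) {
    int mask = newHashSize - 1;          // newHashSize must be a power of two
    int[] newIds = new int[newHashSize];
    Arrays.fill(newIds, -1);
    for (int id = 0; id < count; id++) {
      int code = hashes[id];
      int hashPos = code & mask;
      while (newIds[hashPos] != -1) {    // linear probe; no duplicates during rehash
        code++;
        hashPos = code & mask;
      }
      newIds[hashPos] = id;
    }
    return newIds;
  }
}
```

The old `ids[]` is never read here, which is exactly why the old table can be discarded up front and the transient memory surge stays at 2X rather than 3X.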
```
But, I still think a zero-hash impl should work too! (Using the opto that
stuffs some of the hash bits into ids I linked above).
We know the lower bits of the hash (position in the current ids array), we
know more bits from [the recent
opto](https://github.com/apache/lucene/commit/2f66c8f6668622c0c82b720d47ccd43b57e32edb),
don't we have enough bits? We need just one more bit for the initial hash slot
... on linear probe it just increments...
```
—— No, we don't have enough bits in the current code, because
`position in the current ids array` != `the lower bits of the hashcode`
(linear probing shifts colliding entries away from their home slot).
According to [the commit](https://github.com/apache/lucene/commit/2f66c8f6668622c0c82b720d47ccd43b57e32edb):
```java
hashSize = capacity;
hashHalfSize = hashSize >> 1;
hashMask = hashSize - 1;
highMask = ~hashMask;
```
When `capacity` is 16: `hashSize` is 16, `hashHalfSize` is 8, `hashMask` is 15 (0xf), and `highMask` is -16 (0xfffffff0).

```java
ids[hashPos] = e | (hashcode & highMask);
```

Each `ids` entry therefore packs the original id (in the low 4 bits) together with the high 28 bits of the hashcode.
```java
int hashPos = code & hashMask;
while (e != -1 ....) {
  code++;
  hashPos = code & hashMask;
  e = ids[hashPos];
}
```
When we look up `ids`, we use the lower 4 bits of the hashcode and linearly probe for an unused slot `e`.
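Here is a tiny standalone model of that packing scheme (illustrative only; `FingerprintSketch`, `insert`, and `probe` are made-up names, not Lucene's real API):

```java
import java.util.Arrays;

// Illustrative model of stuffing the high hash bits into ids[] (simplified from the
// linked commit). hashSize must be a power of two; ids hold -1 when empty.
final class FingerprintSketch {
  final int hashMask;   // hashSize - 1: selects the bucket
  final int highMask;   // ~hashMask: the bits reused as a fingerprint
  final int[] ids;

  FingerprintSketch(int hashSize) {
    hashMask = hashSize - 1;
    highMask = ~hashMask;
    ids = new int[hashSize];
    Arrays.fill(ids, -1);
  }

  void insert(int e, int hashcode) {
    // The id fits in the low bits (in Lucene the table grows at hashHalfSize,
    // so e < hashSize), leaving the high bits free for the fingerprint.
    int code = hashcode;
    int hashPos = code & hashMask;
    while (ids[hashPos] != -1) {  // linear probe to a free slot
      code++;
      hashPos = code & hashMask;
    }
    ids[hashPos] = e | (hashcode & highMask);
  }

  // Returns the id on a fingerprint match, or -1 (empty slot or mismatch,
  // i.e. no need to run the expensive full-bytes comparison).
  int probe(int hashPos, int hashcode) {
    int packed = ids[hashPos];
    if (packed == -1 || (packed & highMask) != (hashcode & highMask)) {
      return -1;
    }
    return packed & hashMask;
  }
}
```

A mismatching fingerprint lets the probe loop skip `pool.equals(...)` entirely; only a matching fingerprint forces the full byte comparison.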
We could completely change the logic of `ids` and store the lower N bits of the hashcode in each `id` (instead of the higher N bits, as today). That way we would not need to recalculate the hashcode during rehashing.
I analyzed the benefits and potential problems using CodeX.
The current design uses the low k bits for the bucket location and stores the high bits as a fingerprint, so the two pieces of information are largely independent.
If we instead store the low bits as the fingerprint, its first k bits overlap with the bucket location, essentially wasting k bits of information. This degradation worsens rapidly as `hashSize` increases and causes `findHash` to fall through to `pool.equals(...)` more frequently.
• The current scheme has approximately 32-k effective fingerprint bits.
• The low-bit scheme contributes only approximately 32-2k bits of new information for collisions within the same bucket.
When k >= 16 (`hashSize` >= 65536), there are almost no effective fingerprint bits left, and `findHash` falls through to `pool.equals(...)` far more often.
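A quick standalone simulation (made-up class, fixed seed; with k = 12 the fingerprint is 20 bits wide under either scheme) illustrates this: among hashes that share a bucket, low-bit fingerprints collapse to at most 2^(32-2k) distinct values, while high-bit fingerprints stay diverse:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Entries in the same bucket share their low k hash bits, so a low-bit
// fingerprint of width 32-k carries only 32-2k discriminating bits.
final class FingerprintEntropy {
  static int distinctFingerprints(boolean lowBits, int k, int samples, long seed) {
    Random r = new Random(seed);
    int bucketMask = (1 << k) - 1;
    int bucket = 0x5A5 & bucketMask;                 // arbitrary fixed bucket
    int width = 32 - k;                              // fingerprint width, both schemes
    Set<Integer> seen = new HashSet<>();
    for (int i = 0; i < samples; i++) {
      int hash = (r.nextInt() & ~bucketMask) | bucket;   // force the same bucket
      int fp = lowBits
          ? hash & ((1 << width) - 1)                // proposed: low 32-k bits
          : hash >>> k;                              // current: high 32-k bits
      seen.add(fp);
    }
    return seen.size();
  }

  public static void main(String[] args) {
    int k = 12, n = 4096;
    // low-bit fingerprints can take at most 2^(32-2k) = 256 values here
    System.out.println("high=" + distinctFingerprints(false, k, n, 42L)
        + " low=" + distinctFingerprints(true, k, n, 42L));
  }
}
```

With 4096 same-bucket hashes, the high-bit scheme keeps thousands of distinct fingerprints while the low-bit scheme is capped at 256, so far more probes would fall through to the byte comparison.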
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]