tyronecai commented on PR #15779:
URL: https://github.com/apache/lucene/pull/15779#issuecomment-3996369175
> But, I still think a zero-hash impl should work too! (Using the opto that stuffs some of the hash bits into `ids` I linked above).
```
Oooh I see, I think that should work? You discard the old hash map (ids[])
immediately (replace with 2X larger one), so surge is just 2X not 3X the prior
array.
```
—— Yes, just build the new `ids` from all terms pointed to by `bytesStart[]`.
```
You should also be able to iterate via the byte[] blocks? They are just
concatenated byte[] with prefix 1 or 2 byte vInt. Then the access would be
entirely sequential --> CPU, caches, RAM happier? Maybe even single pass is OK
even mixed with random-writes into new hash?
```
—— Yes, but this is actually the same access pattern as the loop below.
```java
for (int id = 0; id < count; id++) {
  hashcode = code = pool.hash(bytesStart[id]);
```
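To make that rebuild concrete, here is a standalone sketch (illustrative only, not Lucene's actual code: `RehashSketch` is a made-up name, and the precomputed `hashes[id]` array stands in for calling `pool.hash(bytesStart[id])` per term):

```java
import java.util.Arrays;

// Standalone sketch of rebuilding a 2x-larger ids[] table by walking every term id.
// In Lucene the per-term hash would come from pool.hash(bytesStart[id]); here it is
// passed in precomputed so the sketch is self-contained.
final class RehashSketch {
  static int[] rehash(int count, int[] hashes, int newHashSize) {
    int mask = newHashSize - 1;          // newHashSize must be a power of two
    int[] newIds = new int[newHashSize];
    Arrays.fill(newIds, -1);
    for (int id = 0; id < count; id++) {
      int code = hashes[id];
      int hashPos = code & mask;
      while (newIds[hashPos] != -1) {    // linear probe; no duplicates during rehash
        code++;
        hashPos = code & mask;
      }
      newIds[hashPos] = id;
    }
    return newIds;
  }
}
```

The old `ids[]` is never read here, which is exactly why the old table can be discarded up front and the transient memory surge stays at 2X rather than 3X.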
```
But, I still think a zero-hash impl should work too! (Using the opto that
stuffs some of the hash bits into ids I linked above).
We know the lower bits of the hash (position in the current ids array), we
know more bits from [the recent
opto](https://github.com/apache/lucene/commit/2f66c8f6668622c0c82b720d47ccd43b57e32edb),
don't we have enough bits? We need just one more bit for the initial hash slot
... on linear probe it just increments...
```
—— No, we don't have enough bits in the current code, because
`position in the current ids array` != `the lower bits of the hashcode`
(linear probing shifts colliding entries away from their home slot).
According to [the commit](https://github.com/apache/lucene/commit/2f66c8f6668622c0c82b720d47ccd43b57e32edb):
```java
hashSize = capacity;
hashHalfSize = hashSize >> 1;
hashMask = hashSize - 1;
highMask = ~hashMask;
```
When `capacity` is 16: `hashSize` is 16, `hashHalfSize` is 8, `hashMask` is 15 (0xf), and `highMask` is -16 (0xfffffff0).

```java
ids[hashPos] = e | (hashcode & highMask);
```

Each `ids` entry therefore packs the original id (in the low 4 bits) together with the high 28 bits of the hashcode.
```java
int hashPos = code & hashMask;
while (e != -1 ....) {
  code++;
  hashPos = code & hashMask;
  e = ids[hashPos];
}
```
When we look up `ids`, we use the lower 4 bits of the hashcode and linearly probe for an unused slot `e`.
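Here is a tiny standalone model of that packing scheme (illustrative only; `FingerprintSketch`, `insert`, and `probe` are made-up names, not Lucene's real API):

```java
import java.util.Arrays;

// Illustrative model of stuffing the high hash bits into ids[] (simplified from the
// linked commit). hashSize must be a power of two; ids hold -1 when empty.
final class FingerprintSketch {
  final int hashMask;   // hashSize - 1: selects the bucket
  final int highMask;   // ~hashMask: the bits reused as a fingerprint
  final int[] ids;

  FingerprintSketch(int hashSize) {
    hashMask = hashSize - 1;
    highMask = ~hashMask;
    ids = new int[hashSize];
    Arrays.fill(ids, -1);
  }

  void insert(int e, int hashcode) {
    // The id fits in the low bits (in Lucene the table grows at hashHalfSize,
    // so e < hashSize), leaving the high bits free for the fingerprint.
    int code = hashcode;
    int hashPos = code & hashMask;
    while (ids[hashPos] != -1) {  // linear probe to a free slot
      code++;
      hashPos = code & hashMask;
    }
    ids[hashPos] = e | (hashcode & highMask);
  }

  // Returns the id on a fingerprint match, or -1 (empty slot or mismatch,
  // i.e. no need to run the expensive full-bytes comparison).
  int probe(int hashPos, int hashcode) {
    int packed = ids[hashPos];
    if (packed == -1 || (packed & highMask) != (hashcode & highMask)) {
      return -1;
    }
    return packed & hashMask;
  }
}
```

A mismatching fingerprint lets the probe loop skip `pool.equals(...)` entirely; only a matching fingerprint forces the full byte comparison.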
We could completely change the logic of `ids` and store the lower N bits of the hashcode in each `id` (instead of the higher N bits, as today). That way we would not need to recalculate the hashcode during rehashing.
I analyzed the benefits and potential problems using CodeX.
The current design uses the low k bits for the bucket location and stores the high bits as a fingerprint, so the two pieces of information are largely independent.
If we instead store the low bits as the fingerprint, its first k bits overlap with the bucket location, essentially wasting k bits of information. This degradation worsens rapidly as `hashSize` increases and causes `findHash` to fall through to `pool.equals(...)` more frequently.
• The current scheme has approximately 32-k effective fingerprint bits.
• The low-bit scheme contributes only approximately 32-2k bits of new information for collisions within the same bucket.
When k >= 16 (`hashSize` >= 65536), there are almost no effective fingerprint bits left, and `findHash` falls through to `pool.equals(...)` far more often.
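A quick standalone simulation (made-up class, fixed seed; with k = 12 the fingerprint is 20 bits wide under either scheme) illustrates this: among hashes that share a bucket, low-bit fingerprints collapse to at most 2^(32-2k) distinct values, while high-bit fingerprints stay diverse:

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

// Entries in the same bucket share their low k hash bits, so a low-bit
// fingerprint of width 32-k carries only 32-2k discriminating bits.
final class FingerprintEntropy {
  static int distinctFingerprints(boolean lowBits, int k, int samples, long seed) {
    Random r = new Random(seed);
    int bucketMask = (1 << k) - 1;
    int bucket = 0x5A5 & bucketMask;                 // arbitrary fixed bucket
    int width = 32 - k;                              // fingerprint width, both schemes
    Set<Integer> seen = new HashSet<>();
    for (int i = 0; i < samples; i++) {
      int hash = (r.nextInt() & ~bucketMask) | bucket;   // force the same bucket
      int fp = lowBits
          ? hash & ((1 << width) - 1)                // proposed: low 32-k bits
          : hash >>> k;                              // current: high 32-k bits
      seen.add(fp);
    }
    return seen.size();
  }

  public static void main(String[] args) {
    int k = 12, n = 4096;
    // low-bit fingerprints can take at most 2^(32-2k) = 256 values here
    System.out.println("high=" + distinctFingerprints(false, k, n, 42L)
        + " low=" + distinctFingerprints(true, k, n, 42L));
  }
}
```

With 4096 same-bucket hashes, the high-bit scheme keeps thousands of distinct fingerprints while the low-bit scheme is capped at 256, so far more probes would fall through to the byte comparison.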
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]