HUSTERGS opened a new pull request, #16201:
URL: https://github.com/apache/lucene/pull/16201
### Description
This change adds `ReadOnlyDenseIntObjectMap`, a compact read-only
representation for `IntObjectHashMap` instances whose keys are non-negative
dense integers. It stores values directly in an `Object[]` indexed by key,
avoiding the separate `int[]` keys table used by `IntObjectHashMap`.
This is useful for codec metadata maps keyed by `FieldInfo.number`. These
maps are built while reading segment metadata and then only queried afterwards.
The new representation is selected only through `maybeWrap(...)`, and only when
it removes meaningful table slack. By default, wrapping requires the dense array
to need at least 30% fewer value slots than the hash table. If the keys are
sparse or negative, or if any value is null, the original `IntObjectHashMap` is
kept.
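The selection rule described above can be sketched roughly as follows. This is an illustration, not the code in this PR: the class name `DenseIntObjectMapSketch`, the `hashTableSlots` and `minSavedFraction` parameters, and the use of `java.util.Map` as a stand-in for `IntObjectHashMap` are all assumptions made so the example compiles without Lucene on the classpath.

```java
import java.util.Map;

/** Illustrative sketch only; names and signatures here are hypothetical. */
final class DenseIntObjectMapSketch<V> {
  // values[key] holds the value for key; a null slot means the key is absent.
  private final Object[] values;

  private DenseIntObjectMapSketch(Object[] values) {
    this.values = values;
  }

  /**
   * Returns a dense copy of source, or null when the caller should keep the original map.
   * hashTableSlots is the length of the original values table (assumed to be known by the
   * caller); minSavedFraction is the required slot reduction, for example 0.30.
   */
  static <V> DenseIntObjectMapSketch<V> maybeWrap(
      Map<Integer, V> source, int hashTableSlots, double minSavedFraction) {
    int maxKey = -1;
    for (Map.Entry<Integer, V> e : source.entrySet()) {
      int key = e.getKey();
      if (key < 0 || e.getValue() == null) {
        return null; // negative keys and null values cannot be represented
      }
      maxKey = Math.max(maxKey, key);
    }
    long denseSlots = (long) maxKey + 1;
    if (denseSlots > (1.0 - minSavedFraction) * hashTableSlots) {
      return null; // keys are too sparse for the dense array to pay off
    }
    Object[] dense = new Object[(int) denseSlots];
    for (Map.Entry<Integer, V> e : source.entrySet()) {
      dense[e.getKey()] = e.getValue();
    }
    return new DenseIntObjectMapSketch<>(dense);
  }

  @SuppressWarnings("unchecked")
  V get(int key) {
    return key >= 0 && key < values.length ? (V) values[key] : null;
  }

  int slots() {
    return values.length;
  }
}
```

Because null marks an absent slot in this sketch, rejecting null values up front keeps `get` unambiguous, which is one plausible reason for the null-value restriction.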
### Motivation
Several codec readers keep per-field metadata maps keyed by
`FieldInfo.number`. After previous changes from field-name keyed maps to
`IntObjectHashMap`, these maps no longer retain field name strings as keys, but
they still keep an open-addressed hash table with both an `int[]` keys table
and an `Object[]` values table. Other PRs have tried to reduce the heap usage
of these maps, such as #13961, #13327, and #13368.
This patch is motivated by a huge cluster in production. On one node we can
have around 20k open segments, and each segment has 400+ fields. Most of these
fields are keyword-like fields, so they are both indexed and have doc values.
For 400+ fields, `IntObjectHashMap` typically allocates a 1024-slot table,
plus the extra slot used for key `0`, so both arrays have 1025 entries. If
field numbers are dense enough that `maxFieldNumber + 1` is around half of the
hash table size, the dense read-only representation replaces:
* `int[1025]`
* `Object[1025]`
with approximately:
* `Object[512]`
Assuming compressed object pointers, this is roughly:
* `int[1025]`: ~4.0 KB
* `Object[1025]`: ~4.0 KB
* `Object[512]`: ~2.0 KB
So the saving is about 6 KB (4.0 + 4.0 - 2.0) per converted map, excluding the
referenced values themselves.
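The arithmetic above can be reproduced with a small back-of-the-envelope calculation. The constants are assumptions rather than measurements: a 16-byte array header (typical for 64-bit HotSpot with compressed class pointers), 4 bytes per slot, a load factor of 0.75, and no object-alignment padding. The class and method names are hypothetical.

```java
/** Rough sizing estimate; every constant here is an assumption, not a measured value. */
final class MapFootprintEstimate {
  static final int ARRAY_HEADER_BYTES = 16; // assumed array header size
  static final int SLOT_BYTES = 4; // an int, or a compressed object reference

  /** Smallest power-of-two table that holds the given entries at the given load factor. */
  static int tableSlots(int entries, double loadFactor) {
    int needed = (int) Math.ceil(entries / loadFactor);
    int slots = Integer.highestOneBit(needed);
    return slots == needed ? slots : slots << 1;
  }

  /** Approximate bytes for an int[] or Object[] of the given length, ignoring alignment. */
  static long arrayBytes(int length) {
    return ARRAY_HEADER_BYTES + (long) SLOT_BYTES * length;
  }

  /** Bytes saved by replacing int[n + 1] and Object[n + 1] with Object[denseSlots]. */
  static long savedBytes(int entries, double loadFactor, int denseSlots) {
    int hashSlots = tableSlots(entries, loadFactor) + 1; // +1 for the slot used for key 0
    return 2 * arrayBytes(hashSlots) - arrayBytes(denseSlots);
  }
}
```

Under these assumptions, `tableSlots(400, 0.75)` gives 1024 and `savedBytes(400, 0.75, 512)` gives 6168 bytes, in line with the roughly 6 KB figure.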
In this workload, the main maps affected per segment are:
* `Lucene103BlockTreeTermsReader.fieldMap` for indexed fields
* `PerFieldDocValuesFormat.FieldsReader.fields` for doc-values fields
* the populated `Lucene90DocValuesProducer` metadata map for the dominant
  doc-values type (for example, sorted/sorted-set keyword fields)
This gives a rough estimate of about
`3 maps * 6 KB * 20,000 segments ~= 360,000 KB`, or around `350 MB` of heap
reduction on such a node.
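The node-level figure is a straight multiplication, written out here so the inputs are easy to vary. The helper name is hypothetical, and the per-map saving is the rounded estimate from above, not a measured value.

```java
/** Node-level estimate; inputs are the rough figures quoted in this description. */
final class NodeSavingEstimate {
  /** Total heap saved in MB (1 MB = 1024 KB) for the given per-map saving in KB. */
  static double savedMb(int mapsPerSegment, double savedKbPerMap, int segments) {
    return mapsPerSegment * savedKbPerMap * segments / 1024.0;
  }
}
```

With 3 maps, 6 KB per map, and 20,000 segments this evaluates to about 351.6 MB, which matches the rounded 350 MB above.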
The exact saving depends on field-number density and on how fields are
distributed across doc-values types. If field numbers are denser, the saving
can be slightly higher; if fields are split across multiple smaller maps or
field numbers are sparse, `maybeWrap` keeps the original `IntObjectHashMap`.