[PR] Replace IntObjectHashMap with dense array for field related map to reduce heap usage [lucene]

via GitHub Thu, 04 Jun 2026 09:45:53 -0700


HUSTERGS opened a new pull request, #16201:
URL: https://github.com/apache/lucene/pull/16201


   ### Description
   
   This change adds `ReadOnlyDenseIntObjectMap`, a compact read-only 
representation for `IntObjectHashMap` instances whose keys are non-negative 
dense integers. It stores values directly in an `Object[]` indexed by key, 
avoiding the separate `int[]` keys table used by `IntObjectHashMap`.
   
   This is useful for codec metadata maps keyed by `FieldInfo.number`. These 
maps are built while reading segment metadata and then only queried afterwards.
   
   The new representation is only selected through `maybeWrap(...)` when it 
removes meaningful table slack. By default, wrapping requires at least 30% 
fewer value slots. If keys are sparse or negative, or if any value is null, the 
original `IntObjectHashMap` is kept.
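   
   As a rough illustration of the selection rule described above, here is a 
minimal self-contained sketch. It is not the PR's actual class: the name 
`DenseIntObjectMapSketch`, the use of `java.util.Map` as a stand-in for 
`IntObjectHashMap`, and the `hashSlots` parameter (standing in for the length 
of the hash map's values array) are all assumptions made for the example.

```java
import java.util.Map;

/** Hypothetical sketch of a dense, read-only int-keyed map; not the PR's code. */
final class DenseIntObjectMapSketch<V> {
  // values[key] holds the value for that key, or null when the key is absent.
  private final Object[] values;

  private DenseIntObjectMapSketch(Object[] values) {
    this.values = values;
  }

  /**
   * Returns a dense copy when it needs at least 30% fewer value slots than the
   * source hash table, or null to signal "keep the original map".
   *
   * @param hashSlots assumed stand-in for the hash map's values-array length
   */
  static <V> DenseIntObjectMapSketch<V> maybeWrap(Map<Integer, V> source, int hashSlots) {
    int maxKey = -1;
    for (Map.Entry<Integer, V> e : source.entrySet()) {
      if (e.getKey() < 0 || e.getValue() == null) {
        return null; // negative keys and null values cannot be represented
      }
      maxKey = Math.max(maxKey, e.getKey());
    }
    long denseSlots = (long) maxKey + 1;
    // Require denseSlots <= 70% of hashSlots, using integer arithmetic.
    if (denseSlots * 10 > (long) hashSlots * 7) {
      return null; // too sparse: the dense array would not save enough
    }
    Object[] values = new Object[(int) denseSlots];
    for (Map.Entry<Integer, V> e : source.entrySet()) {
      values[e.getKey()] = e.getValue();
    }
    return new DenseIntObjectMapSketch<>(values);
  }

  /** Looks up a key by direct array indexing; out-of-range keys yield null. */
  @SuppressWarnings("unchecked")
  V get(int key) {
    return key >= 0 && key < values.length ? (V) values[key] : null;
  }
}
```

   A caller would fall back to the original map whenever `maybeWrap` returns 
null. Because absent keys are stored as null slots, null values cannot be told 
apart from missing keys, which is why the sketch rejects them up front.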
   
   ### Motivation
   
   Several codec readers keep per-field metadata maps keyed by 
`FieldInfo.number`. After previous changes from field-name keyed maps to 
`IntObjectHashMap`, these maps no longer retain field name strings as keys, but 
they still keep an open-addressed hash table with both an `int[]` keys table 
and an `Object[]` values table. I've seen other PRs that try to reduce the heap 
usage of these maps, such as #13961, #13327, and #13368.
   
   This patch is motivated by a huge cluster in production. On one node we can 
have around 20k open segments, and each segment has 400+ fields. Most of these 
fields are keyword-like fields, so they are both indexed and have doc values. 
   
   For 400+ fields, `IntObjectHashMap` typically allocates a 1024-slot table, 
plus the extra slot used for key `0`, so both arrays have 1025 entries. If 
field numbers are dense enough that `maxFieldNumber + 1` is around half of the 
hash table size, the dense read-only representation replaces:
   
   * `int[1025]`
   * `Object[1025]`
   
   with approximately:
   
   * `Object[512]`
   
   Assuming compressed object pointers, this is roughly:
   
   * `int[1025]`: ~4.0 KB
   * `Object[1025]`: ~4.0 KB
   * `Object[512]`: ~2.0 KB
   
   So the saving is about 6 KB per converted map, excluding the referenced 
values themselves.
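   
   The per-map figure can be reproduced with a small calculation. The sketch 
below assumes a 64-bit HotSpot JVM with compressed object pointers, that is a 
16-byte array header, 4-byte references, and 8-byte object alignment. These 
constants and the class name `ArrayFootprintSketch` are assumptions for 
illustration, not values taken from the patch.

```java
/** Hypothetical helper estimating shallow array sizes; constants are assumed. */
final class ArrayFootprintSketch {
  static final int ARRAY_HEADER_BYTES = 16; // assumed: compressed oops enabled
  static final int REF_BYTES = 4;           // assumed: compressed references
  static final int INT_BYTES = 4;

  /** Shallow size of an array, rounded up to 8-byte alignment. */
  static long arrayBytes(int length, int elementBytes) {
    long raw = ARRAY_HEADER_BYTES + (long) length * elementBytes;
    return (raw + 7) & ~7L;
  }

  /** Bytes saved by replacing int[hashSlots] + Object[hashSlots] with Object[denseSlots]. */
  static long savingBytes(int hashSlots, int denseSlots) {
    long before = arrayBytes(hashSlots, INT_BYTES) + arrayBytes(hashSlots, REF_BYTES);
    long after = arrayBytes(denseSlots, REF_BYTES);
    return before - after;
  }

  public static void main(String[] args) {
    System.out.println(savingBytes(1025, 512) + " bytes saved per map");
  }
}
```

   Under these assumptions, each 1025-entry array is about 4 KB and the 
512-entry array about 2 KB, which matches the roughly 6 KB saving quoted above.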
   
   In this workload, the main maps affected per segment are:
   
   * `Lucene103BlockTreeTermsReader.fieldMap` for indexed fields
   * `PerFieldDocValuesFormat.FieldsReader.fields` for doc-values fields
   * the populated `Lucene90DocValuesProducer` metadata map for the dominant 
doc-values type (for example, sorted/sorted-set keyword fields)
   
   This gives a rough estimate of about:
   
   `3 maps * 6 KB * 20,000 segments ~= 360,000 KB`, or around `350 MB` of heap 
reduction on such a node.
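   
   For readers who want to plug in their own numbers, the node-level estimate 
is a simple product. The helper below is a hypothetical sketch (the class and 
method names are invented for this example) that takes 1 MB as 1024 KB.

```java
/** Hypothetical helper for the node-level heap estimate. */
final class NodeSavingEstimate {
  /** Total saving in MB for a node, taking 1 MB = 1024 KB. */
  static double totalSavingMb(int mapsPerSegment, double savingPerMapKb, int openSegments) {
    return mapsPerSegment * savingPerMapKb * openSegments / 1024.0;
  }

  public static void main(String[] args) {
    // Inputs from the description: 3 maps, ~6 KB each, ~20,000 open segments.
    System.out.printf("%.1f MB%n", totalSavingMb(3, 6.0, 20_000));
  }
}
```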
   
   The exact saving depends on field-number density and on how fields are 
distributed across doc-values types. If field numbers are denser, the saving 
can be slightly higher; if fields are split across multiple smaller maps or 
field numbers are sparse, `maybeWrap` keeps the original `IntObjectHashMap`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

