[jira] [Updated] (HDDS-15335) Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo lookups

Siyao Meng (Jira) Wed, 20 May 2026 18:09:08 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-15335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Siyao Meng updated HDDS-15335:
------------------------------
    Description: 
HDDS-15335. Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo 
lookups.

NSSummaryTask.process() processes every batch of OM update events Recon
ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket layout)
it has two avoidable costs: every event triggers a fresh
getBucketTable().getSkipCache(...) RocksDB point read even though
bucket layout and objectID never change; and the three sub-tasks
(FSO / Legacy / OBS) iterate the event list sequentially even though
they operate on disjoint slices and write to disjoint NSSummary
entries.

This patch makes three changes:

  1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a
     field-level Map. After the first lookup for a bucket, subsequent
     lookups become HashMap.get() calls.

  2. NSSummaryTask.process() submits the three sub-tasks to a 3-thread
     pool and joins on all three. The threads see the same event list;
     each only processes events whose (table, bucket layout) matches
     its target. Target NSSummary entries are disjoint across
     sub-tasks so no cross-thread synchronization is needed, and the
     TaskResult contract is unchanged.

  3. The OBS UPDATE path drops a redundant getKeyParentID(oldKeyInfo)
     call: the parent of an OBS key is its bucket, and an UPDATE event
     cannot move a key between buckets.
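
The fan-out in change 2 can be sketched roughly as follows. This is a minimal illustration, not the patch itself: the Layout enum, the Event class, and the runSubTask/processAll helpers are hypothetical stand-ins, and the real sub-tasks update NSSummary entries rather than counting events.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Illustrative only: hypothetical stand-in types, not the Recon classes. */
final class ParallelSubTaskSketch {

  /** The three sub-task targets named in the description. */
  enum Layout { FSO, LEGACY, OBS }

  /** Hypothetical event reduced to the one attribute the filter needs. */
  static final class Event {
    final Layout layout;

    Event(Layout layout) {
      this.layout = layout;
    }
  }

  /** One sub-task: scans the whole list, skips events for other targets. */
  static int runSubTask(List<Event> events, Layout target) {
    int handled = 0;
    for (Event e : events) {
      if (e.layout != target) {
        continue; // not ours: bail without touching any shared state
      }
      handled++; // a real sub-task would update its own NSSummary entries
    }
    return handled;
  }

  /** Submits one sub-task per layout to a 3-thread pool and joins on all. */
  static Map<Layout, Integer> processAll(List<Event> events) {
    ExecutorService pool = Executors.newFixedThreadPool(3);
    try {
      List<Callable<Integer>> tasks = new ArrayList<>();
      for (Layout target : Layout.values()) {
        tasks.add(() -> runSubTask(events, target));
      }
      // invokeAll blocks until every task finishes; futures keep task order.
      List<Future<Integer>> futures = pool.invokeAll(tasks);
      Map<Layout, Integer> handled = new EnumMap<>(Layout.class);
      for (int i = 0; i < futures.size(); i++) {
        handled.put(Layout.values()[i], futures.get(i).get());
      }
      return handled;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while joining sub-tasks", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("a sub-task failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }

  public static void main(String[] args) {
    List<Event> events = Arrays.asList(
        new Event(Layout.FSO), new Event(Layout.OBS),
        new Event(Layout.LEGACY), new Event(Layout.OBS));
    System.out.println(processAll(events)); // prints {FSO=1, LEGACY=1, OBS=2}
  }
}
```

The sketch only reads the shared event list and keeps per-task results separate, which is the property that lets the sub-tasks run without locks.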

Throughput on Intel Xeon Silver 4416+ (40 cores / 80 threads), OpenJDK
17, at 500k events plus 500k preloaded keys, RATIS replication, mixed
60/30/10 create/update/delete:

  | Code                       | events/sec | vs vanilla |
  | Vanilla                    |     78,098 |      1.00x |
  | + change 1 (cache)         |    672,172 |      8.61x |
  | + changes 1 and 2          |    918,550 |     11.76x |
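
The ratios in the table can be re-derived from the events/sec column. The short check below (class and method names are illustrative) also computes the incremental gain of change 2 over change 1 alone.

```java
import java.util.Locale;

/** Re-derives the speedup ratios from the events/sec figures in the table. */
final class SpeedupCheck {

  /** Throughput ratio of the "after" run to the "before" run. */
  static double ratio(double after, double before) {
    return after / before;
  }

  static String fmt(double value) {
    // Locale.ROOT keeps the decimal separator a dot on any machine.
    return String.format(Locale.ROOT, "%.2fx", value);
  }

  public static void main(String[] args) {
    double vanilla = 78_098;
    double cached = 672_172;
    double cachedParallel = 918_550;

    System.out.println(fmt(ratio(cached, vanilla)));         // prints 8.61x
    System.out.println(fmt(ratio(cachedParallel, vanilla))); // prints 11.76x
    System.out.println(fmt(ratio(cachedParallel, cached)));  // prints 1.37x
  }
}
```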

Change 1 is the dominant lever: it removes about 1.5M
getSkipCache(bucketDBKey) RocksDB Gets per process() call (3 sub-task
scans of 500k events, each scan doing one or more bucket lookups
before bailing or processing). Change 2 gives a further ~1.37x via JIT
specialization and instruction-cache locality on per-thread hot loops.
Change 3 is below measurement noise.
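
The effect of change 1 can be illustrated with a small memoizing lookup. This is a sketch under assumptions, not the code in the patch: BucketInfo, the Function standing in for the RocksDB point read, and the read counter are all hypothetical. It also assumes each cache instance is used by a single thread; a cache shared across sub-task threads would need something like ConcurrentHashMap instead of HashMap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Illustrative only: memoizes bucket lookups in front of a slower store. */
final class BucketInfoCacheSketch {

  /** Hypothetical immutable bucket metadata (objectID and layout). */
  static final class BucketInfo {
    final long objectId;
    final String layout;

    BucketInfo(long objectId, String layout) {
      this.objectId = objectId;
      this.layout = layout;
    }
  }

  private final Map<String, BucketInfo> cache = new HashMap<>();
  private final Function<String, BucketInfo> store; // stands in for the DB read
  private long storeReads;

  BucketInfoCacheSketch(Function<String, BucketInfo> store) {
    this.store = store;
  }

  /** Returns the cached value, reading the store only on the first miss. */
  BucketInfo lookup(String bucketDbKey) {
    BucketInfo info = cache.get(bucketDbKey);
    if (info == null) {
      storeReads++;
      info = store.apply(bucketDbKey);
      if (info != null) {
        cache.put(bucketDbKey, info); // safe only because the value is immutable
      }
    }
    return info;
  }

  long storeReads() {
    return storeReads;
  }

  public static void main(String[] args) {
    BucketInfoCacheSketch cache = new BucketInfoCacheSketch(
        key -> new BucketInfo(key.hashCode(), "OBJECT_STORE"));
    for (int i = 0; i < 500_000; i++) {
      cache.lookup("/vol1/bucket" + (i % 4)); // 500k events over 4 buckets
    }
    System.out.println("store reads: " + cache.storeReads()); // prints "store reads: 4"
  }
}
```

With the cache in place, the number of store reads scales with the number of distinct buckets rather than with the number of events, which is why the saving grows with batch size.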

Heap pressure is reduced because change 1 stops allocating a transient
OmBucketInfo per RocksDB Get. At 1M events / 1M preloaded keys with an
8 GB heap, total stop-the-world pause time dropped 25% (1137 ms to
850 ms) and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB)
over the lifetime of the benchmark run.

On a 100% FSO workload (fileTable / dirTable / deletedDirTable),
change 1 is a no-op because the FSO sub-task reads
keyInfo.getParentObjectID() directly without a bucket lookup. Change 2
still takes the Legacy and OBS scans, which walk the event list only
to skip every event at the table-name check, off the critical path.
That cost is small relative to FSO's own processing, so the wall-clock
speedup on FSO-heavy workloads is correspondingly smaller. In any
case, the patch does not regress performance.

The reproduction harness (NSSummaryProcessTimingTest under -Pbench) is
provided as a companion patch on this JIRA.

All existing TestNSSummaryTask* unit tests pass. Two regression tests
are added to TestNSSummaryTask: one exercises the OBS sub-task path
end-to-end (previously only FSO + Legacy events were sent through
process()), and one asserts the returned TaskResult reports success
and contains a seek position for each of FSO, LEGACY, and OBS.

  was:
NSSummaryTask.process() processes every batch of OM update events Recon
ingests. On keyTable workloads (LEGACY or OBJECT_STORE bucket layout)
it has two avoidable costs: every event triggers a fresh
getBucketTable().getSkipCache(...) RocksDB point read even though
bucket layout and objectID never change; and the three sub-tasks
(FSO / Legacy / OBS) iterate the event list sequentially even though
they operate on disjoint slices and write to disjoint NSSummary
entries.

This patch makes three changes:

  1. NSSummaryTaskDbEventHandler caches OmBucketInfo lookups in a
     field-level Map. After the first lookup for a bucket, subsequent
     lookups become HashMap.get() calls.

  2. NSSummaryTask.process() submits the three sub-tasks to a 3-thread
     pool and joins on all three. The threads see the same event list;
     each only processes events whose (table, bucket layout) matches
     its target. Target NSSummary entries are disjoint across
     sub-tasks so no cross-thread synchronization is needed, and the
     TaskResult contract is unchanged.

  3. The OBS UPDATE path drops a redundant getKeyParentID(oldKeyInfo)
     call: the parent of an OBS key is its bucket, and an UPDATE event
     cannot move a key between buckets.

Throughput on Intel Xeon Silver 4416+, 80 CPUs, OpenJDK 17, at 500k
events plus 500k preloaded keys, RATIS replication, mixed 60/30/10
create/update/delete:

  | Code                       | events/sec | vs vanilla |
  | Vanilla                    |     78,098 |      1.00x |
  | + change 1 (cache)         |    672,172 |      8.61x |
  | + changes 1 and 2          |    918,550 |     11.76x |

Change 1 is the dominant lever: it removes about 1.5M
getSkipCache(bucketDBKey) RocksDB Gets per process() call (3 sub-task
scans of 500k events, each scan doing one or more bucket lookups
before bailing or processing). Change 2 gives a further ~1.37x via JIT
specialization and instruction-cache locality on per-thread hot loops.
Change 3 is below measurement noise.

Heap pressure is reduced because change 1 stops allocating a transient
OmBucketInfo per RocksDB Get. At 1M events / 1M preloaded keys with an
8 GB heap, total stop-the-world pause dropped 25% (1137 ms to 850 ms)
and cumulative bytes reclaimed dropped 52% (522 GB to 249 GB) across
the bench lifetime.

On a 100% FSO workload (fileTable / dirTable / deletedDirTable),
change 1 is a no-op because the FSO sub-task reads
keyInfo.getParentObjectID() directly without a bucket lookup. Change 2
still saves the bail-loop cost of Legacy and OBS scanning the event
list to skip at the table-name check, but that cost is small relative
to FSO's own processing, so the wall-clock speedup on FSO-heavy
workloads is correspondingly smaller. The patch is non-regressive in
any case.

The reproduction harness (NSSummaryProcessTimingTest under -Pbench) is
provided as a companion patch on this JIRA.

All 81 existing TestNSSummaryTask* unit tests pass.


> Recon: parallelize NSSummaryTask sub-tasks and cache OmBucketInfo lookups
> -------------------------------------------------------------------------
>
>                 Key: HDDS-15335
>                 URL: https://issues.apache.org/jira/browse/HDDS-15335
>             Project: Apache Ozone
>          Issue Type: Improvement
>          Components: Ozone Recon
>            Reporter: Siyao Meng
>            Assignee: Siyao Meng
>            Priority: Major
>         Attachments: 
> 0002-HDDS-15335.-Recon-add-JMH-and-timing-benchmark-harne.patch
>
>



