devmadhuu opened a new pull request, #9258:
URL: https://github.com/apache/ozone/pull/9258
## What changes were proposed in this pull request?
**ContainerHealthTaskV2:** SCM-Based Container Health Monitoring with Batch
Processing
**Overview**
Introduces ContainerHealthTaskV2, a new implementation that uses SCM's
ReplicationManager as the single source of truth for container health states,
replacing the legacy dual-tracking approach.
**Key Improvements Over Legacy**
1. **Single Source of Truth**
- V2: Queries SCM directly for authoritative health status
- Legacy: Maintains separate health calculations in Recon, leading to
inconsistencies
2. **Bidirectional Synchronization**
- Validates all Recon containers against SCM
- Discovers and tracks containers known to SCM but missing in Recon
- Ensures no unhealthy containers are missed
3. **Batch Processing**
- Processes containers in batches of 1000 for database operations
- Reduces database round-trips by ~99% for large container sets
- Batches both inserts and deletes in a single transaction
4. **REPLICA_MISMATCH Detection**
- Continues to track checksum mismatches locally (SCM doesn't track this)
- Separate batch operations for SCM-tracked vs Recon-tracked states
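The bidirectional synchronization in item 2 can be pictured as two set comparisons over container IDs. The sketch below is only an illustration under stated assumptions: the class `ContainerSyncSketch` and its methods are hypothetical names, not code from this PR, and containers are reduced to plain `long` IDs.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the PR: models the two sync directions
// as set differences over container IDs.
final class ContainerSyncSketch {

  private ContainerSyncSketch() {
  }

  // Containers that SCM knows about but Recon does not track yet.
  static Set<Long> scmOnly(Set<Long> reconIds, Set<Long> scmIds) {
    Set<Long> result = new TreeSet<>(scmIds); // copy, inputs stay untouched
    result.removeAll(reconIds);
    return result;
  }

  // Containers that Recon tracks but SCM does not report (assumed here to be
  // candidates for closer validation; the PR does not spell this out).
  static Set<Long> reconOnly(Set<Long> reconIds, Set<Long> scmIds) {
    Set<Long> result = new TreeSet<>(reconIds);
    result.removeAll(scmIds);
    return result;
  }
}
```

With Recon tracking IDs {1, 2, 3} and SCM reporting {2, 3, 4, 5}, `scmOnly` yields {4, 5} and `reconOnly` yields {1}; both passes together cover every container known to either side.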
**Configuration**
Enable V2 implementation via feature flag:
<property>
  <name>ozone.recon.container.health.use.scm.report</name>
  <value>true</value>
</property>
Default: false (uses legacy implementation)
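As a rough illustration of how such a flag could gate the choice of implementation, the following sketch reads the key from a plain `Map` with a default of `false`. The class `HealthTaskSelector`, the `Impl` enum, and the use of a `Map` instead of Ozone's configuration classes are assumptions made to keep the example self-contained; the actual wiring in the PR may differ.

```java
import java.util.Map;

// Hypothetical selector, not part of the PR: picks an implementation
// based on the feature flag, defaulting to the legacy task.
final class HealthTaskSelector {

  // Key taken from the PR description.
  static final String USE_SCM_REPORT_KEY =
      "ozone.recon.container.health.use.scm.report";

  enum Impl { LEGACY, V2 }

  private HealthTaskSelector() {
  }

  static Impl select(Map<String, String> conf) {
    // A missing key behaves like "false", matching the documented default.
    boolean useScmReport =
        Boolean.parseBoolean(conf.getOrDefault(USE_SCM_REPORT_KEY, "false"));
    return useScmReport ? Impl.V2 : Impl.LEGACY;
  }
}
```

An empty configuration selects `Impl.LEGACY`; setting the key to `true` selects `Impl.V2`.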
**Technical Details**
- New Table: UNHEALTHY_CONTAINERS_V2 (independent from legacy table)
- Database: Supports Derby and other jOOQ-compatible databases
- States Tracked: MISSING, UNDER_REPLICATED, OVER_REPLICATED,
MIS_REPLICATED, REPLICA_MISMATCH
- Batch Size: 1000 records per transaction (configurable via DB_BATCH_SIZE)
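To make the batch size concrete, here is a minimal sketch of splitting a list of records into chunks of `DB_BATCH_SIZE`. The class `BatchSketch` and the `partition` method are hypothetical; the PR's own batching code, and how it hands each chunk to jOOQ, are not shown in this description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR: splits records into fixed-size
// batches so that each batch can be written in one database operation.
final class BatchSketch {

  // Value taken from the PR description.
  static final int DB_BATCH_SIZE = 1000;

  private BatchSketch() {
  }

  static <T> List<List<T>> partition(List<T> records, int batchSize) {
    if (batchSize <= 0) {
      throw new IllegalArgumentException("batchSize must be positive");
    }
    List<List<T>> batches = new ArrayList<>();
    for (int start = 0; start < records.size(); start += batchSize) {
      int end = Math.min(start + batchSize, records.size());
      // Copy the view so each batch is independent of the source list.
      batches.add(new ArrayList<>(records.subList(start, end)));
    }
    return batches;
  }
}
```

For 2,500 records this produces three batches (1000, 1000, and 500), so the number of write operations scales with the number of batches rather than the number of records.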
**Testing**
- 5 comprehensive unit tests covering all scenarios
- Fixed Derby schema configuration for test environment
**Migration Path**
Both implementations can run in parallel, allowing gradual rollout and
comparison before full migration.
## What is the link to the Apache JIRA?
https://issues.apache.org/jira/browse/HDDS-13891
## How was this patch tested?
Added JUnit test cases and tested on a local Docker cluster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]