errose28 opened a new pull request, #10602:
URL: https://github.com/apache/ozone/pull/10602

   ## What changes were proposed in this pull request?
   
   Changes generated by Claude Code with a spec, reviews, and edits by me.
   
   ### Metrics
   
   - Added build info: revision (git hash/string), component (string), and 
version (string)
     - This information was already present in every build, but was not exposed 
as metrics.
     - To expose the strings, they are added as labels to a gauge with a 
constant value of 1. Thanks @octachoron for the tip.
   - Added S3 gateway client version
     - S3 Gateway is stateless, so it has no software or apparent version. It 
does contain an Ozone client, so the client version metric is exposed 
instead.
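   
   The constant-gauge pattern described above can be sketched in plain Java. 
This is a minimal illustration, not the PR's `BuildInfoMetrics` class: the 
class name `BuildInfoGaugeSketch`, the metric name `build_info`, and the sample 
label values are all assumptions made for the example. It only shows how string 
properties end up as labels on a sample whose value is always 1, in the 
Prometheus text exposition format.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;
   import java.util.StringJoiner;

   /** Hypothetical sketch; not the BuildInfoMetrics class from this PR. */
   final class BuildInfoGaugeSketch {

     /** Escapes backslash, double quote, and newline in a label value. */
     static String escape(String value) {
       return value.replace("\\", "\\\\")
           .replace("\"", "\\\"")
           .replace("\n", "\\n");
     }

     /** Renders one sample line of the form name{k="v",...} 1. */
     static String render(String metricName, Map<String, String> labels) {
       StringJoiner joined = new StringJoiner(",", "{", "}");
       for (Map.Entry<String, String> e : labels.entrySet()) {
         joined.add(e.getKey() + "=\"" + escape(e.getValue()) + "\"");
       }
       // The value carries no information; the labels identify the build.
       return metricName + joined + " 1";
     }

     public static void main(String[] args) {
       // Label values below are made up for illustration.
       Map<String, String> labels = new LinkedHashMap<>();
       labels.put("component", "om");
       labels.put("revision", "abc1234");
       labels.put("version", "2.1.0");
       // prints: build_info{component="om",revision="abc1234",version="2.1.0"} 1
       System.out.println(render("build_info", labels));
     }
   }
   ```
   
   Because the value is constant, queries can group or count by label (for 
example, by `version`) without the gauge value affecting the result.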
   
   Software and apparent version metrics for all relevant components were 
already published.
   
   This PR removes one `@Metric` annotation as a workaround for #10523, which 
is still pending merge on master. We should be able to merge either PR 
independently and reconcile them once both have landed.
   
   ### Dashboard
   
   A Grafana dashboard was added to assist admins as they orchestrate an 
upgrade. Because it depends on the new metrics, it will only be usable when 
upgrading from the initial version that supports ZDU (just like the ZDU feature 
itself). However, once the metrics are present, it could be helpful even for a 
non-rolling upgrade.
   
   Since this dashboard was designed with admins in mind, it does not expose 
the software version, apparent version, or client versions, which are internal 
to the cluster. It only exposes admin-facing properties, such as "finalized" as 
a boolean state and the build version string. Internal version info is still 
accessible with PromQL for more dev-focused debugging as needed.
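   
   As a rough illustration of how a boolean "finalized" state relates to the 
internal versions, the sketch below assumes a node counts as finalized once its 
apparent version has caught up with its software version. That rule, the class 
name `FinalizedStateSketch`, and the sample version numbers are assumptions for 
the example, not code or values taken from this PR.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.Map;

   /** Hypothetical sketch; the finalization rule here is an assumption. */
   final class FinalizedStateSketch {

     /** Assumed rule: finalized when apparent equals software version. */
     static boolean isFinalized(int apparentVersion, int softwareVersion) {
       return apparentVersion == softwareVersion;
     }

     /** Counts nodes whose {apparent, software} pair is not finalized. */
     static long countUnfinalized(Map<String, int[]> versionsByNode) {
       return versionsByNode.values().stream()
           .filter(v -> !isFinalized(v[0], v[1]))
           .count();
     }

     public static void main(String[] args) {
       // Made-up {apparent, software} pairs for three nodes.
       Map<String, int[]> versions = new LinkedHashMap<>();
       versions.put("om1", new int[] {7, 9});
       versions.put("scm1", new int[] {8, 9});
       versions.put("dn1", new int[] {9, 9});
       // prints: unfinalized nodes: 2
       System.out.println("unfinalized nodes: " + countUnfinalized(versions));
     }
   }
   ```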
   
   All panels were designed to account for large clusters, so the dashboard 
remains readable even when there are 1000+ nodes. The tables are paginated and 
all other values are aggregates. The selectors at the top of the dashboard 
support drilling down to specific components as needed, and a banner at the 
top indicates when a filtered view is active.
   
   <img width="1600" height="1016" alt="image" 
src="https://github.com/user-attachments/assets/5aadb684-c897-4e20-8930-21578de8350b"
 />
   
   <img width="1605" height="943" alt="Screenshot 2026-06-24 at 7 00 16 PM" 
src="https://github.com/user-attachments/assets/aef34d3a-3e03-4a9f-be46-650d5d675fdf"
 />
   
   <img width="1608" height="1008" alt="Screenshot 2026-06-24 at 7 01 16 PM" 
src="https://github.com/user-attachments/assets/45dc1d10-bd6d-473f-b13d-c64d8f9b9c2d"
 />
   
   ## What is the link to the Apache JIRA
   
   HDDS-14825
   
   ## How was this patch tested?
   
   Unit tests for the new metrics were added.
   
   The dashboard can be viewed manually in Grafana in a local Docker 
environment:
   ```
   cd hadoop-ozone/dist/target/ozone-*/compose/ozone
   COMPOSE_FILE=docker-compose.yaml:monitoring.yaml docker compose up --scale datanode=3 -d
   # Go to http://localhost:3000/dashboards and select "Ozone - Rolling Upgrade"
   # To tear down:
   COMPOSE_FILE=docker-compose.yaml:monitoring.yaml docker compose down
   ```
   The dashboard will need a few seconds to populate its values. Also narrow 
the time range to the last few minutes, since the default 30 minute window 
will be hard to read when the cluster has only been live for a few seconds.
   
   By default, this will run with all nodes finalized and on the same version. 
To see the dashboard with a simulated in-progress upgrade, build Ozone with the 
following patch applied:
   ```patch
   diff --git b/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java a/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
   index f8ce4cbdc0..68aace6544 100644
   --- b/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
   +++ a/hadoop-hdds/framework/src/main/java/org/apache/hadoop/hdds/server/http/BuildInfoMetrics.java
   @@ -74,9 +74,18 @@ public static synchronized BuildInfoMetrics create(String component) {
      public void getMetrics(MetricsCollector collector, boolean all) {
        MetricsRecordBuilder builder = collector.addRecord(RECORD_NAME)
            .add(new MetricsTag(
   -            Interns.info("component", "Ozone component name"), component)).add(new MetricsTag(Interns.info("revision", "Source control revision"), revision))
   -        .add(new MetricsTag(Interns.info("version", "Ozone build version"), version))
   +            Interns.info("component", "Ozone component name"), component))
   +        .add(new MetricsTag(
   +            Interns.info("revision", "Source control revision"), revision))
            .addGauge(Interns.info("BuildInfo", "Always 1; identifying info is in labels"), 1L);
   +
   +    if (component.equals("hddsDatanode")) {
   +      builder.add(new MetricsTag(Interns.info("version", "Ozone build version"), "2.1.0-TEST"));
   +    } else {
   +      builder.add(new MetricsTag(Interns.info("version", "Ozone build version"), version));
   +    }
   +
   +
        builder.endRecord();
      }
    }
   diff --git b/hadoop-ozone/dist/src/main/compose/ozone/docker-config a/hadoop-ozone/dist/src/main/compose/ozone/docker-config
   index ecca3a971c..0c16691f2d 100644
   --- b/hadoop-ozone/dist/src/main/compose/ozone/docker-config
   +++ a/hadoop-ozone/dist/src/main/compose/ozone/docker-config
   @@ -67,3 +67,8 @@ no_proxy=om,scm,s3g,recon,kdc,localhost,127.0.0.1
    
    # Explicitly enable filesystem snapshot feature for this Docker compose cluster
    OZONE-SITE.XML_ozone.filesystem.snapshot.enabled=true
   +
   +# Testing overrides for ZDU dashboard verification: start with apparent < software
   +# to demonstrate divergence rendering. Revert before running acceptance tests.
   +OZONE-SITE.XML_testing.ozone.om.init.apparent.version=7
   +OZONE-SITE.XML_testing.hdds.scm.init.apparent.version=8
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

