[jira] [Updated] (HDDS-14829) Make snap diff job status tracking more reliable

Wei-Chiu Chuang (Jira) Tue, 19 May 2026 10:32:14 -0700


     [ https://issues.apache.org/jira/browse/HDDS-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Wei-Chiu Chuang updated HDDS-14829:
-----------------------------------
    Release Note: 
  The snapshot diff command now separates job submission from report retrieval 
to prevent RPC timeouts during large diff operations.

   * Submit Job: ozone sh snapshot diff <path> <snap1> <snap2>
   * Get Report: ozone sh snapshot diff -r <path> <snap1> <snap2> (--get-report can be used in place of -r)

  Clients will automatically fall back to the legacy synchronous behavior when 
connecting to older servers.
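
  A minimal client-side sketch of the submit-then-get flow described above, 
written in Java. Everything in it is hypothetical: SnapDiffPollSketch, 
JobStatus, and submitThenPoll are illustration names, not Ozone client APIs, 
and the submit and get-report calls are passed in as plain callbacks.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch: submit a diff job once, then poll a read-only report call. */
final class SnapDiffPollSketch {
  enum JobStatus { IN_PROGRESS, DONE, FAILED }

  /**
   * Calls submit exactly once, then polls getReport until it returns a
   * terminal status or maxPolls is reached. Polling never resubmits the job.
   */
  static JobStatus submitThenPoll(Runnable submit, Supplier<JobStatus> getReport,
                                  int maxPolls, long waitMillis) {
    submit.run();
    JobStatus status = JobStatus.IN_PROGRESS;
    for (int i = 0; i < maxPolls; i++) {
      status = getReport.get();
      if (status != JobStatus.IN_PROGRESS) {
        return status; // DONE or FAILED: stop polling and report it
      }
      try {
        Thread.sleep(waitMillis); // wait before asking again
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return status;
      }
    }
    return status; // still IN_PROGRESS after maxPolls attempts
  }

  public static void main(String[] args) {
    // Scripted server responses standing in for a real get-report call.
    Iterator<JobStatus> scripted =
        List.of(JobStatus.IN_PROGRESS, JobStatus.IN_PROGRESS, JobStatus.DONE).iterator();
    int[] submits = {0};
    JobStatus result = submitThenPoll(() -> submits[0]++, scripted::next, 5, 10L);
    System.out.println(result + " after " + submits[0] + " submission(s)");
  }
}
```

  The point of the sketch is only that polling is a read: a FAILED status is 
returned to the caller instead of triggering another submission.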


  was:refactors the snapshot diff workflow to decouple job submission from 
report retrieval by introducing a dedicated SubmitSnapshotDiff


> Make snap diff job status tracking more reliable
> ------------------------------------------------
>
>                 Key: HDDS-14829
>                 URL: https://issues.apache.org/jira/browse/HDDS-14829
>             Project: Apache Ozone
>          Issue Type: Improvement
>            Reporter: Saketa Chalamchala
>            Assignee: Saketa Chalamchala
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.2.0
>
>
> Current snapshot diff config defaults: 
> {code:java}
> ozone.om.snapshot.diff.cleanup.service.run.interval = 1m  (Interval at which 
> the snapshot diff cleanup service runs.)
> ozone.om.snapshot.diff.job.default.wait.time = 1m  (Default wait time 
> returned to the client before it retries the snap diff request.)
> {code}
> The following scenario can happen:
> {code:java}
> T0: A replication job submits a snapshot diff job and is asked to check back 
> in 1 min for the report.
> T0.2: The job fails immediately and its status is updated to FAILED in the DB.
> T0.5: Immediately afterwards, the cleanup service runs and removes the job 
> from the DB.
> T1: After a minute the client runs the snapshot diff again, as instructed at 
> T0, but since the job has disappeared from the DB a new job is submitted.
> T1.2: The new job fails immediately and its status is updated to FAILED in 
> the DB.
> T1.5: Again the cleanup service runs right afterwards and removes the job 
> from the DB.
> T2: After another minute the client runs the snapshot diff again, and once 
> more a new job is submitted because the previous one is gone from the DB.
> ... the above pattern keeps repeating
> {code}
> - We should make the snap diff response more reliable: the client should be 
> told when the job has failed and that it should retry. If we want to keep 
> the status of the job for more than 1 min, then separating the submit and 
> get APIs might be a good idea. 
> - Improve auditing of snapshot diff jobs.
> Current response: 
> {code:java}
> Snapshot diff job is IN_PROGRESS. Please retry after 60000 ms.
> {code}
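
The retry loop in the scenario above can be modelled with a small in-memory 
sketch in Java. It is a simplified model under stated assumptions, not Ozone 
code: SnapDiffRaceSketch, submitOrGet, getStatus, and the NOT_FOUND status are 
hypothetical names, and the job table is a plain map standing in for the DB.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical in-memory model of the submit/fail/cleanup race; not Ozone APIs. */
final class SnapDiffRaceSketch {
  enum Status { IN_PROGRESS, FAILED, NOT_FOUND }

  // Stand-in for the snap diff job table in the DB.
  private final Map<String, Status> jobTable = new HashMap<>();
  private int submissions = 0;

  /** Combined call: returns the existing job's status, or silently submits a new job. */
  Status submitOrGet(String jobKey) {
    Status existing = jobTable.get(jobKey);
    return existing != null ? existing : submit(jobKey);
  }

  /** Explicit submission: always records a new IN_PROGRESS job. */
  Status submit(String jobKey) {
    submissions++;
    jobTable.put(jobKey, Status.IN_PROGRESS);
    return Status.IN_PROGRESS;
  }

  /** Separated read-only call: never creates a job. */
  Status getStatus(String jobKey) {
    return jobTable.getOrDefault(jobKey, Status.NOT_FOUND);
  }

  /** Marks an existing job as FAILED (no effect if the job is absent). */
  void failJob(String jobKey) {
    jobTable.replace(jobKey, Status.FAILED);
  }

  /** Models a cleanup run that removes failed jobs right away. */
  void cleanup() {
    jobTable.values().removeIf(s -> s == Status.FAILED);
  }

  int submissions() {
    return submissions;
  }

  /** Replays the T0, T1, T2, ... pattern with the combined call. */
  static int runCombinedRounds(int rounds) {
    SnapDiffRaceSketch om = new SnapDiffRaceSketch();
    for (int i = 0; i < rounds; i++) {
      om.submitOrGet("snap1-snap2"); // client only ever sees IN_PROGRESS
      om.failJob("snap1-snap2");     // job fails right after
      om.cleanup();                  // cleanup removes it before the next retry
    }
    return om.submissions();
  }

  public static void main(String[] args) {
    System.out.println("combined call, jobs submitted in 3 rounds: " + runCombinedRounds(3));

    SnapDiffRaceSketch om = new SnapDiffRaceSketch();
    om.submit("snap1-snap2");
    om.failJob("snap1-snap2");
    System.out.println("separate get before cleanup: " + om.getStatus("snap1-snap2"));
    om.cleanup();
    System.out.println("separate get after cleanup: " + om.getStatus("snap1-snap2"));
  }
}
```

In this model the combined call submits one new job per round and the client 
never observes FAILED, while a read-only get returns FAILED before cleanup and 
an explicit NOT_FOUND afterwards, so the decision to resubmit stays with the 
client. How long a failed job should be retained before cleanup is a separate 
question that the sketch does not address.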



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

