[
https://issues.apache.org/jira/browse/HDDS-14829?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Wei-Chiu Chuang updated HDDS-14829:
-----------------------------------
Release Note:
The snapshot diff command now separates job submission from report retrieval
to prevent RPC timeouts during large diff operations.
* Submit Job: ozone sh snapshot diff <path> <snap1> <snap2>
* Get Report: ozone sh snapshot diff -r <path> <snap1> <snap2> (or --get-report)
Clients will automatically fall back to the legacy synchronous behavior when
connecting to older servers.
was: Refactors the snapshot diff workflow to decouple job submission from
report retrieval by introducing a dedicated SubmitSnapshotDiff
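
The submit/get-report split described in the release note can be pictured as a client that submits once and afterwards only polls for the report. The following is a minimal sketch under stated assumptions, not the Ozone client API: SnapDiffService, JobStatus, FakeService and the path "/vol1/bucket1" are hypothetical names introduced only for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Hypothetical client-side sketch of "submit once, then poll for the report". */
class SubmitThenPoll {
    enum JobStatus { IN_PROGRESS, DONE, FAILED }

    /** Stand-in for the two server calls; NOT the real Ozone client API. */
    interface SnapDiffService {
        void submit(String path, String fromSnap, String toSnap);
        JobStatus getReport(String path, String fromSnap, String toSnap);
    }

    /** Submits exactly once, then polls until a terminal status or maxPolls is reached. */
    static JobStatus submitAndWait(SnapDiffService svc, String path, String fromSnap,
                                   String toSnap, int maxPolls, long waitMs) {
        svc.submit(path, fromSnap, toSnap);
        JobStatus status = JobStatus.IN_PROGRESS;
        for (int i = 0; i < maxPolls; i++) {
            status = svc.getReport(path, fromSnap, toSnap);
            if (status != JobStatus.IN_PROGRESS) {
                return status; // DONE or FAILED: stop polling, never resubmit here
            }
            try {
                Thread.sleep(waitMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return status;
            }
        }
        return status;
    }

    /** In-memory fake that replays a scripted sequence of statuses. */
    static final class FakeService implements SnapDiffService {
        int submitCalls = 0;
        private final Deque<JobStatus> script;

        FakeService(JobStatus... statuses) {
            this.script = new ArrayDeque<>(Arrays.asList(statuses));
        }

        @Override
        public void submit(String path, String fromSnap, String toSnap) {
            submitCalls++;
        }

        @Override
        public JobStatus getReport(String path, String fromSnap, String toSnap) {
            // Once the script is exhausted, keep reporting IN_PROGRESS.
            return script.isEmpty() ? JobStatus.IN_PROGRESS : script.poll();
        }
    }

    public static void main(String[] args) {
        FakeService svc = new FakeService(JobStatus.IN_PROGRESS, JobStatus.FAILED);
        JobStatus result = submitAndWait(svc, "/vol1/bucket1", "snap1", "snap2", 5, 10);
        System.out.println(result + " after " + svc.submitCalls + " submit call(s)");
        // prints "FAILED after 1 submit call(s)"
    }
}
```

Because the polling call never submits, a FAILED status reaches the caller instead of silently turning into a fresh job.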
> Make snap diff job status tracking more reliable
> ------------------------------------------------
>
> Key: HDDS-14829
> URL: https://issues.apache.org/jira/browse/HDDS-14829
> Project: Apache Ozone
> Issue Type: Improvement
> Reporter: Saketa Chalamchala
> Assignee: Saketa Chalamchala
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.2.0
>
>
> Current Snapshot diff config defaults:
> {code:java}
> ozone.om.snapshot.diff.cleanup.service.run.interval = 1m (Interval at which
> the snapshot diff cleanup service runs.)
> ozone.om.snapshot.diff.job.default.wait.time = 1m (Default wait time
> returned to the client before it retries the snap diff request.)
> {code}
> The following scenario can happen:
> {code:java}
> T0: A replication job submits a snapshot diff job and is asked to check back
> in 1 min for the report.
> T0.2: The job fails immediately and its status is updated to FAILED in the
> DB.
> T0.5: Immediately after that, cleanup runs and removes the job from the DB.
> T1: After a minute the user runs the snapshot diff again as instructed at
> T0, but since the job has disappeared from the DB, a new job is submitted.
> T1.2: The job fails immediately and its status is updated to FAILED in the
> DB.
> T1.5: Immediately after that, cleanup runs and removes the job from the DB.
> T2: After a minute the user runs the snapshot diff again as instructed at
> T1, but since the job has disappeared from the DB, a new job is submitted.
> ... the above pattern keeps repeating
> {code}
> - We should make the snap diff response more reliable. We should let the
> client know if the job has failed and that they should retry. If we want to
> keep the status of the job for longer than 1 min, then separating the
> submit/get API might be a good idea.
> - Improve auditing of snapshot diff jobs
> Current response:
> {code:java}
> Snapshot diff job is IN_PROGRESS. Please retry after 60000 ms.
> {code}
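
The retry loop in the scenario above can be reproduced with a toy timeline simulation. This is a sketch under stated assumptions, not Ozone code: the class and method names are hypothetical, the 60 s values mirror the two 1m defaults quoted in the issue, and 12 s and 30 s are assumed readings of "T0.2" and "T0.5". The retentionSec parameter is also an assumption, standing in for "keep the status of the job for longer than 1 min".

```java
/** Toy model of the cleanup race described in the issue; none of this is Ozone code. */
class SnapDiffCleanupRace {
    enum Status { IN_PROGRESS, FAILED }

    /** Outcome of one simulated run. */
    static final class Result {
        final int submissions;
        final boolean clientSawFailure;

        Result(int submissions, boolean clientSawFailure) {
            this.submissions = submissions;
            this.clientSawFailure = clientSawFailure;
        }
    }

    // Timings in seconds (see the assumptions stated in the lead-in).
    static final int WAIT_SEC = 60;
    static final int CLEANUP_INTERVAL_SEC = 60;
    static final int FAIL_AFTER_SEC = 12;
    static final int CLEANUP_OFFSET_SEC = 30;

    /** Simulates a client that checks back `polls` times, once every WAIT_SEC seconds. */
    static Result simulate(int retentionSec, int polls) {
        Status status = null; // null means "no job record in the DB"
        int submittedAt = 0;
        int failedAt = 0;
        int submissions = 0;

        for (int t = 0; t <= WAIT_SEC * (polls - 1); t++) {
            // The job fails shortly after it was submitted.
            if (status == Status.IN_PROGRESS && t - submittedAt >= FAIL_AFTER_SEC) {
                status = Status.FAILED;
                failedAt = t;
            }
            // Cleanup removes FAILED jobs that are at least retentionSec old.
            boolean cleanupTick = t % CLEANUP_INTERVAL_SEC == CLEANUP_OFFSET_SEC;
            if (cleanupTick && status == Status.FAILED && t - failedAt >= retentionSec) {
                status = null;
            }
            // The client comes back every WAIT_SEC seconds.
            if (t % WAIT_SEC == 0) {
                if (status == null) {
                    status = Status.IN_PROGRESS; // record is gone, so a new job is submitted
                    submittedAt = t;
                    submissions++;
                } else if (status == Status.FAILED) {
                    return new Result(submissions, true); // the failure is visible to the client
                }
            }
        }
        return new Result(submissions, false);
    }

    public static void main(String[] args) {
        Result noRetention = simulate(0, 4);
        Result withRetention = simulate(120, 4);
        System.out.println("retention=0s: submissions=" + noRetention.submissions
            + ", clientSawFailure=" + noRetention.clientSawFailure);
        // prints "retention=0s: submissions=4, clientSawFailure=false"
        System.out.println("retention=120s: submissions=" + withRetention.submissions
            + ", clientSawFailure=" + withRetention.clientSawFailure);
        // prints "retention=120s: submissions=1, clientSawFailure=true"
    }
}
```

With no retention the client resubmits on every visit and never learns that the job failed; once FAILED records outlive the client wait time, the first check-back already reports the failure.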