[GitHub] [cloudstack] GutoVeronezi commented on issue #6836: Volume Snapshots Failing After 3600 Seconds

GitBox Thu, 20 Oct 2022 13:23:54 -0700


GutoVeronezi commented on issue #6836:
URL: https://github.com/apache/cloudstack/issues/6836#issuecomment-1286100015


   Thanks for the feedback, @whitetiger264.
   
   Given your network setup, and considering some oscillation, transferring 
gigabytes of data is expected to take a long time. However, as seen in 
@slavkap's tests 
(https://github.com/apache/cloudstack/issues/6836#issuecomment-1285873500), 
the `Files.copy` executed by ACS takes longer to transfer the data than 
`pv`/`cp` (in your case, around twice as long). Therefore, we also have the 
`Files.copy` factor.
   
   As the snapshot is defined by the delta creation, the copy speed does not 
affect the snapshot per se. With that said, knowing the network setup and that 
the process ACS executes (`Files.copy`) takes more time than the `pv` (or 
`cp`) command, for now we can adjust the configurations to appropriate values. 
   
   For 15GB, it took a bit more than 1 hour (according to 
https://github.com/apache/cloudstack/issues/6836#issuecomment-1283521520, it 
took 1 hour and 2 minutes, i.e. 3720 seconds); therefore, we can budget around 
this value for each 15GB. For example, if you work with 50GB volumes, 
consider using around 12400 seconds (3720 * 50 / 15) as the base time of the 
snapshot operations. 
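
   As a rough illustration of the scaling above, here is a minimal sketch. 
It assumes the copy time grows linearly with the volume size; the class and 
method names are hypothetical and are not part of ACS.

```java
// Hypothetical helper, not part of ACS: scales a measured copy time
// linearly with the volume size.
final class SnapshotTimeEstimate {

    // measuredGb / measuredSeconds come from one real transfer
    // (15GB in 3720 seconds in this thread).
    static long estimateSeconds(long volumeGb, long measuredGb, long measuredSeconds) {
        // Integer division rounded up, so the estimate never undershoots.
        return (measuredSeconds * volumeGb + measuredGb - 1) / measuredGb;
    }

    public static void main(String[] args) {
        long seconds = estimateSeconds(50, 15, 3720);
        System.out.println(seconds + " seconds"); // prints "12400 seconds"
    }
}
```

   The result is only a base value; adding some margin on top of it is 
probably wise, given the oscillation in the network.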
   
   The error in the MS logs says `Job is cancelled as it has been blocking 
others for too long`:
   ```
   2022-10-20 21:29:38,753 ERROR [o.a.c.a.c.u.s.CreateSnapshotCmd] 
(API-Job-Executor-1:ctx-c1c52096 job-1021 ctx-113be2b1) (logid:3f94e4e5) Failed 
to create snapshot due to an internal error creating snapshot for volume 
635fd5b5-bf14-4cf4-979c-be8a649d5189
   com.cloud.utils.exception.CloudRuntimeException: Unable to serialize: Job is 
cancelled as it has been blocking others for too long
   ```
   
   This message is generated when a job is cancelled because it has been 
executing for longer than the threshold:
   
   
https://github.com/apache/cloudstack/blob/17fe98432daa0f1c67414c6ea69efa56eb30e195/framework/jobs/src/main/java/org/apache/cloudstack/framework/jobs/impl/AsyncJobManagerImpl.java#L877-L895
   
   The threshold is defined in the global configuration 
`job.cancel.threshold.minutes`:
   
   
https://github.com/apache/cloudstack/blob/17fe98432daa0f1c67414c6ea69efa56eb30e195/framework/jobs/src/main/java/org/apache/cloudstack/framework/jobs/impl/AsyncJobManagerImpl.java#L104-L105
   
   Analyzing the logs, it seems that this configuration is set to the 
default (60 minutes), as the message is printed after one hour of 
processing. Can you confirm whether it is set to `60` minutes? If so, could 
you increase the value and test the snapshot again?
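
   To illustrate why the default value is too low here, a minimal sketch of 
such a threshold check follows. It is NOT the code in `AsyncJobManagerImpl`; 
the class and method names are hypothetical and only model "job running 
longer than N minutes".

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative only: this is not the ACS implementation.
final class JobThresholdSketch {

    // True when the job has been running for longer than the threshold.
    static boolean exceedsThreshold(Instant startedAt, Instant now, long thresholdMinutes) {
        Duration running = Duration.between(startedAt, now);
        return running.compareTo(Duration.ofMinutes(thresholdMinutes)) > 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.EPOCH;          // arbitrary start time
        Instant now = start.plusSeconds(3720);  // 1 hour and 2 minutes later

        // A copy that needs 3720 seconds is over the default of 60 minutes...
        System.out.println(exceedsThreshold(start, now, 60));  // prints "true"
        // ...but within 207 minutes (3720 * 50 / 15 seconds, rounded up to minutes).
        System.out.println(exceedsThreshold(start, now, 207)); // prints "false"
    }
}
```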
   
   ---
   
   @nvazquez, @slavkap, and @weizhouapache, as seen in the tests, `Files.copy` 
takes more time to transfer a file than a simple `pv`/`cp`. For this situation, 
we can look for another way of copying the file instead of using `Files.copy`. 
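
   As one candidate to benchmark, a minimal sketch using 
`FileChannel.transferTo` from the JDK is shown below. This is only an 
assumption about what an alternative could look like, not a proposed patch; 
whether it is actually faster than `Files.copy` on a given setup would need to 
be measured. The class name is hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not ACS code: file-to-file copy via FileChannel.transferTo.
final class ChannelCopySketch {

    // Copies source to target and returns the number of bytes transferred.
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }

    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("copy-src", ".bin");
        Path dst = Files.createTempFile("copy-dst", ".bin");
        try {
            Files.write(src, new byte[1 << 20]); // 1 MiB of test data
            System.out.println(copy(src, dst) + " bytes copied");
        } finally {
            Files.deleteIfExists(src);
            Files.deleteIfExists(dst);
        }
    }
}
```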
   
   Regarding @nvazquez's comment 
(https://github.com/apache/cloudstack/issues/6836#issuecomment-1285684241), I 
have two points:
   - production environments are expected to have large bandwidths. 
Considering that, and the fix in the copy process, big volumes will still take 
some time to transfer; however, it will not be that long.
   - the full snapshot is a long-running process, which affects the VM's memory 
and disks. While this process is running, the VM is frozen due to the memory 
snapshot, i.e., users are unable to manipulate their VM for a long time (if the 
amount of memory is huge). Considering that, and that the copy to the secondary 
storage could also cause the same kind of problem (job timeout/cancel), I do 
not see a full snapshot as a feasible option. Instead, I think supporting 
incremental snapshots would be a better option, as proposed in the future works 
of issue #5124, which we will probably start at the end of this year.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
