GutoVeronezi commented on issue #6836: URL: https://github.com/apache/cloudstack/issues/6836#issuecomment-1286100015
Thanks for the feedback, @whitetiger264. Given your network setup, and allowing for some oscillation, transferring several GBs is expected to take a long time. However, as seen in @slavkap's tests (https://github.com/apache/cloudstack/issues/6836#issuecomment-1285873500), the `Files.copy` executed by ACS takes longer to transfer the data than `pv` or `cp` (in your case, around twice as long). Therefore, we also have the `Files.copy` factor. As the snapshot is defined by the delta creation, the copy speed does not affect the snapshot per se.

With that said, knowing the network setup and that the process ACS executes (`Files.copy`) takes more time than the `pv` (or `cp`) command, for now we can adjust the configurations to appropriate values. For 15GB, it took a bit more than 1 hour (according to https://github.com/apache/cloudstack/issues/6836#issuecomment-1283521520, it took 1 hour and 2 minutes, i.e. 3720 seconds); therefore, we can use this value as a baseline for each 15GB. E.g., if you will work with 50GB volumes, consider using around 12400 seconds (3720 * 50 / 15) as the base time of the snapshot operations.
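
The scaling above can be written out as a small helper. This is only a sketch: the class and method names are hypothetical (they are not ACS APIs), and it assumes the copy time grows linearly with volume size on the same network, using the 15GB / 3720 seconds measurement from this thread as the reference.

```java
/** Rough timeout estimate; all names here are illustrative, not part of ACS. */
final class SnapshotTimeoutEstimate {
    // Reference measurement from this thread: 15 GB copied in 1 h 2 min.
    static final long REFERENCE_SIZE_GB = 15;
    static final long REFERENCE_SECONDS = 3720;

    /** Linearly scales the reference copy time to the given volume size, rounding up. */
    static long estimateSeconds(long volumeSizeGb) {
        if (volumeSizeGb < 0) {
            throw new IllegalArgumentException("volume size must not be negative");
        }
        // Multiply before dividing so integer division does not drop precision early.
        long scaled = REFERENCE_SECONDS * volumeSizeGb;
        return (scaled + REFERENCE_SIZE_GB - 1) / REFERENCE_SIZE_GB;
    }

    /** Converts seconds to whole minutes, rounding up (for minute-based settings). */
    static long toWholeMinutes(long seconds) {
        return (seconds + 59) / 60;
    }

    public static void main(String[] args) {
        long seconds = estimateSeconds(50);
        System.out.println("50 GB -> " + seconds + " s (~" + toWholeMinutes(seconds) + " min)");
    }
}
```

The result is only a base value; some margin on top of it is advisable, since the network throughput oscillates.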

The error in the MS logs says `Job is cancelled as it has been blocking others for too long`:

```
2022-10-20 21:29:38,753 ERROR [o.a.c.a.c.u.s.CreateSnapshotCmd] (API-Job-Executor-1:ctx-c1c52096 job-1021 ctx-113be2b1) (logid:3f94e4e5) Failed to create snapshot due to an internal error creating snapshot for volume 635fd5b5-bf14-4cf4-979c-be8a649d5189
com.cloud.utils.exception.CloudRuntimeException: Unable to serialize: Job is cancelled as it has been blocking others for too long
```

This message is generated when a job is cancelled because it has been executing for longer than the threshold:

https://github.com/apache/cloudstack/blob/17fe98432daa0f1c67414c6ea69efa56eb30e195/framework/jobs/src/main/java/org/apache/cloudstack/framework/jobs/impl/AsyncJobManagerImpl.java#L877-L895

The threshold is defined in the global configuration `job.cancel.threshold.minutes`:

https://github.com/apache/cloudstack/blob/17fe98432daa0f1c67414c6ea69efa56eb30e195/framework/jobs/src/main/java/org/apache/cloudstack/framework/jobs/impl/AsyncJobManagerImpl.java#L104-L105

Analyzing the logs, it seems that this configuration is set to its default value (60 minutes), as the message is printed after one hour of processing. Can you confirm whether it is set to `60` minutes? If it is, could you increase the value and test the snapshot again?

---

@nvazquez, @slavkap, and @weizhouapache, as seen in the tests, `Files.copy` takes more time to transfer a file than a simple `pv`/`cp`. For this situation, we can find another way of copying the file instead of using `Files.copy`.

Regarding @nvazquez's comment (https://github.com/apache/cloudstack/issues/6836#issuecomment-1285684241), I have two points:

- production environments are expected to have large bandwidths. Considering that, and the fix in the copy process, big volumes will still take some time to transfer; however, it will not be that long.
- the full snapshot is a long-running process, which affects the VM's memory and disks. While this process is running, the VM is frozen due to the memory snapshot; i.e., users are unable to manipulate their VM for a long time (if the amount of memory is huge). Considering that, and that the copy to the secondary storage could also cause the same kind of problem (job timeout/cancel), I do not see a full snapshot as a feasible option. Instead, I think supporting incremental snapshots would be a better option, as proposed in the future works of issue #5124, which we will probably start at the end of this year.
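
On the point about finding another way of copying the file instead of `Files.copy`, one candidate that could be benchmarked is a channel-based copy with `FileChannel.transferTo`. The sketch below is only an illustration: the class and method names are hypothetical, it is not what ACS currently does, and whether it is actually faster on a given storage/network setup would have to be measured.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Channel-based file copy; a candidate to benchmark against Files.copy (hypothetical helper). */
final class ChannelCopy {

    /** Copies source to target, overwriting target, and returns the number of bytes copied. */
    static long copy(Path source, Path target) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target,
                     StandardOpenOption.CREATE,
                     StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may move fewer bytes than requested, so loop until done.
            while (position < size) {
                long transferred = in.transferTo(position, size - position, out);
                if (transferred <= 0) {
                    break; // source shrank or nothing more could be transferred
                }
                position += transferred;
            }
            return position;
        }
    }
}
```

Timing `ChannelCopy.copy(src, dst)` against `Files.copy(src, dst)` on the same file and the same secondary storage mount would show whether the difference seen with `pv`/`cp` carries over.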
