Anthony, I implemented the threshold logic in the API layer, in the SyncQueueJob manager. In other words, before submitting the job for execution, we need to know which host the job will go to first; that host is the object we synchronize on. For createSnapshot it is always the host where the VM 1) is running (for a Running VM) or 2) last ran (for a Stopped VM). Only when the command fails on the initial host do we retry on other hosts in the cluster. So it would work like this:
1) The API call is made.
2) Before submitting the async job to the queue, we figure out the host id (getHostIdForSnapshotOperation method in SnapshotManagerImpl). Let's say the id of the host is 1.
3) The job is submitted with object to sync on = "host id=1".
4) Once the job is ready to execute, it goes to the snapshot manager, which sends the command to host id=1 first. If it fails for some reason, it gets resent to another host in the cluster (if one exists).

In this failure scenario we don't do any synchronization. We've decided not to handle this error case because it won't happen in most cases.

I've checked the code for the other commands you mentioned; the host is always picked randomly from the list of hosts in the cluster. So we can't apply the same logic unless we fix the code to pick the same host in step 2) and step 4) without making callbacks from SnapshotManager to the SyncQueueManager. I would appreciate any suggestions on how to implement it.

Thank you,
Alena.

From: Anthony Xu <[email protected]>
Reply-To: [email protected]
To: [email protected]
Subject: RE: FS on cloudStack createSnapshot synchronization improvement

There are several commands that need this kind of threshold, e.g. move volume and create template from snapshot, so this is a common requirement, not only for createSnapshot. Can we add a threshold mechanism to the host command queue to resolve this issue?

Anthony

-----Original Message-----
From: Edison Su [mailto:[email protected]]
Sent: Thursday, October 11, 2012 4:42 PM
To: [email protected]
Subject: RE: FS on cloudStack createSnapshot synchronization improvement

I only have one comment: can we move this snapshot improvement code out of SnapshotManager?
-----Original Message-----
From: Alena Prokharchyk [mailto:[email protected]]
Sent: Tuesday, October 09, 2012 11:51 AM
To: [email protected]
Subject: FS on cloudStack createSnapshot synchronization improvement

Hi All,

I'm planning to introduce some changes to the create snapshot behavior for the future cloudStack release (the changes will go to the asf/master branch). The fix addresses the problem described below:

"With the current code for snapshots, cloudStack always creates the snapshot on the host where the VM is running (for VMs in Running state) or on the host where the VM last ran (for VMs in Stopped state). As the createSnapshot commands are not synchronized on the agent side, multiple commands sent to the backend at the same time can lead to performance issues on the hypervisor side. As a result, there is a high possibility that the createSnapshot command will time out on the Xen side.

The solution is to limit the number of concurrent snapshots on a per-host basis. The threshold should be configurable, as the customer usually knows how many snapshots at a time the backend can handle. While the concurrent snapshots are being processed by the backend, all subsequent snapshot commands scheduled for execution on the same host should wait in the queue."

The feature FS is available for review here:

https://cwiki.apache.org/confluence/display/CLOUDSTACK/Snapshot+improvements+FS

If you have any comments, suggestions, or questions on the implementation, please let me know.

-Alena.
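
For readers of this thread, the per-host threshold discussed above could be sketched roughly as follows. This is only an illustration, not the code that went into CloudStack: the class name HostSnapshotThrottle and its methods submit, complete, and runningOn are hypothetical, and the sketch assumes the host id has already been resolved before submission (as in step 2 of the reply). Jobs beyond the configured limit wait in a FIFO queue for that host.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not CloudStack code): caps the number of snapshot
 * jobs running concurrently on each host and queues the rest in FIFO order.
 */
class HostSnapshotThrottle {
    private final int maxConcurrentPerHost; // the configurable threshold
    private final Map<Long, Integer> running = new HashMap<>();
    private final Map<Long, Deque<Long>> waiting = new HashMap<>();

    HostSnapshotThrottle(int maxConcurrentPerHost) {
        if (maxConcurrentPerHost < 1) {
            throw new IllegalArgumentException("threshold must be at least 1");
        }
        this.maxConcurrentPerHost = maxConcurrentPerHost;
    }

    /** Returns true if the job may start now, false if it was queued for the host. */
    synchronized boolean submit(long hostId, long jobId) {
        int active = running.getOrDefault(hostId, 0);
        if (active < maxConcurrentPerHost) {
            running.put(hostId, active + 1);
            return true;
        }
        waiting.computeIfAbsent(hostId, k -> new ArrayDeque<>()).addLast(jobId);
        return false;
    }

    /**
     * Marks one running job on the host as finished. If another job is waiting,
     * it takes over the freed slot and its id is returned; otherwise returns null.
     */
    synchronized Long complete(long hostId) {
        int active = running.getOrDefault(hostId, 0);
        if (active == 0) {
            throw new IllegalStateException("no running job on host " + hostId);
        }
        Deque<Long> queue = waiting.get(hostId);
        Long next = (queue == null) ? null : queue.pollFirst();
        if (next == null) {
            running.put(hostId, active - 1); // slot is simply released
        }
        // Otherwise the slot passes to 'next', so the running count is unchanged.
        return next;
    }

    /** Number of jobs currently counted as running on the host. */
    synchronized int runningOn(long hostId) {
        return running.getOrDefault(hostId, 0);
    }
}
```

Note that the sketch keys everything on the host chosen at submission time, so it shares the limitation raised in the reply: a retry on a different host after a failure would bypass the limit.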
