On Tue, Feb 11, 2014 at 8:43 AM, Thomas Thrainer <[email protected]>wrote:
> > > > On Fri, Feb 7, 2014 at 2:55 PM, Hrvoje Ribicic <[email protected]> wrote: > >> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]>wrote: >> >>> >>> >>> >>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote: >>> >>>> This patch adds a design document exploring zeroing and lock reduction >>>> as options for the improved performance and parallelism of >>>> cross-cluster instance moves. >>>> >>>> Signed-off-by: Hrvoje Ribicic <[email protected]> >>>> --- >>>> doc/design-move-instance-improvements.rst | 182 >>>> +++++++++++++++++++++++++++++ >>>> 1 file changed, 182 insertions(+) >>>> create mode 100644 doc/design-move-instance-improvements.rst >>>> >>>> diff --git a/doc/design-move-instance-improvements.rst >>>> b/doc/design-move-instance-improvements.rst >>>> new file mode 100644 >>>> index 0000000..22b4bf5 >>>> --- /dev/null >>>> +++ b/doc/design-move-instance-improvements.rst >>>> @@ -0,0 +1,182 @@ >>>> +======================================== >>>> +Cross-cluster instance move improvements >>>> +======================================== >>>> + >>>> +.. contents:: :depth: 3 >>>> + >>>> +To move instances across clusters, Ganeti provides the move-instance >>>> tool. It >>>> +uses the RAPI to create new instances in the destination cluster, >>>> ready to >>>> +import data from instances in the source cluster. >>>> + >>>> +The tool works correctly and reliably, but depending on bandwidth and >>>> priority, >>>> +an instance disk of considerable size requires a long time to >>>> transfer. This is >>>> +inconvenient at best, and can be remedied by either reducing the >>>> length of the >>>> +transfers, or allowing more operations to run in parallel with >>>> instance moves. >>>> + >>>> +The former can be achieved through the zeroing of empty space on >>>> instance disks >>>> +and compressing them prior to transfer, and the latter by reducing the >>>> amount of >>>> +locking happening during an instance move. As the approaches aim to >>>> tackle two >>>> +different aspects of the problem, they do not exclude each other and >>>> will be >>>> +presented independently. >>>> + >>>> +Zeroing instance disks >>>> +====================== >>>> + >>>> +Support for disk compression during instance moves was partially >>>> present before, >>>> +but cleaned up and explicitly added as the --compress option only as >>>> of Ganeti >>>> +2.10. While compression lowers the amount of data sent, further gains >>>> can be >>>> +achieved by taking advantage of the structure of the disk - namely, >>>> sending only >>>> +used disk sectors. >>>> + >>>> +There is no direct way to achieve this, as it would require that the >>>> +move-instance tool is aware of the structure of the file system. >>>> Mounting the >>>> +filesystem is not an option, primarily due to security issues. A disk >>>> primed to >>>> +take advantage of a disk driver exploit could cause an attacker to >>>> breach >>>> +instance isolation and gain control of a Ganeti node. >>>> + >>>> +An indirect way for this performance gain to be achieved is the >>>> zeroing of the >>>> +empty hard disk space. Sequences of zeroes can be compressed and thus >>>> +transferred very efficiently, all without the host knowing that these >>>> are empty >>>> +space. This approach can also be dangerous if a sparse disk is zeroed >>>> in this >>>> +way, causing ballooning. As Ganeti does not seem to make special >>>> concessions for >>>> +moving sparse disks, the only difference should be the disk space >>>> utilization >>>> +on the current node. >>>> + >>>> +Zeroing approaches >>>> +++++++++++++++++++ >>>> + >>>> +Zeroing is a feasible approach, but the node cannot perform it as it >>>> cannot >>>> +mount the disk. Only virtualization-based options remain, and of >>>> those, using >>>> +Ganeti's own virtualization capabilities makes the most sense. There >>>> are two >>>> +ways of doing this - creating a new helper instance, temporary or >>>> persistent, or >>>> +reusing the target instance. >>>> + >>>> +Both approaches have their disadvantages. Creating a new helper >>>> instance >>>> +requires managing its lifecycle, taking special care to make sure no >>>> helper >>>> +instance remains left over due to a failed operation. Even if this >>>> were to be >>>> +taken care of, disks are not yet separate entities in Ganeti, making >>>> the >>>> +temporary transfer of disks between instances hard to implement and >>>> even harder >>>> +to make robust. The reuse can be done by modifying the OS running on >>>> the >>>> +instance to perform the zeroing itself when notified via the new >>>> instance >>>> +communication mechanism, but this approach is neither generic, nor >>>> particularly >>>> +safe. There is no guarantee that the zeroing operation will not >>>> interfere with >>>> +the normal operation of the instance, nor that it will be completed if >>>> a >>>> +user-initiated shutdown occurs. >>>> + >>>> +A better solution can be found by combining the two approaches - >>>> re-using the >>>> +virtualized environment, but with a specifically crafted OS image. >>>> With the >>>> +instance shut down as it should be in preparation for the move, it can >>>> be >>>> +extended with an additional disk with the OS image on it. By >>>> prepending the >>>> +disk and changing some instance parameters, the instance can boot from >>>> it. The >>>> +OS can be configured to perform the zeroing on startup, attempting to >>>> mount any >>>> +partitions with a filesystem present, and creating and deleting a >>>> zero-filled >>>> +file on them. After the zeroing is complete, the OS should shut down, >>>> and the >>>> +master should note the shutdown and restore the instance to its >>>> previous state. >>>> + >>>> +Note that the requirements above are very similar to the notion of a >>>> helper VM >>>> +suggested in the OS install document. Some potentially unsafe actions >>>> are >>>> +performed within a virtualized environment, acting on disks that >>>> belong or will >>>> +belong to the instance. The mechanisms used will thus be developed >>>> with both >>>> +approaches in mind. >>>> + >>>> +Implementation >>>> +++++++++++++++ >>>> + >>>> +There are two components to this solution - the Ganeti changes needed >>>> to boot >>>> +the OS, and the OS image used for the zeroing. Due to the variety of >>>> filesystems >>>> +and architectures that instances can use, no single ready-to-run disk >>>> image can >>>> +satisfy the needs of all the Ganeti users. Instead, the >>>> instance-debootstrap >>>> +scripts can be used to generate a zeroing-capable OS image. This might >>>> not be >>>> +ideal, as there are lightweight distributions that take up less space >>>> and boot >>>> +up more quickly. Generating those with the right set of drivers for the >>>> +virtualization platform of choice is not easy. Thus we do not provide >>>> a script >>>> +for this purpose, but the user is free to provide any OS image which >>>> performs >>>> +the necessary steps: zero out all virtualization-provided devices on >>>> startup, >>>> +shutdown immediately. The cluster-wide parameter controlling the image >>>> to be >>>> +used would be called zeroing-image. >>>> + >>>> +The modifications to Ganeti code needed are minor. The zeroing >>>> functionality >>>> +should be implemented as an extension of the instance export, and >>>> exposed as the >>>> +--zero-free-space option. Prior to beginning the export, the instance >>>> +configuration is temporarily extended with a new read-only disk of >>>> sufficient >>>> +size to host the zeroing image, and the changes necessary for the >>>> image to be >>>> +used as the boot drive. The temporary nature of the configuration >>>> changes >>>> +requires that they are not propagated to other nodes. While this would >>>> normally >>>> +not be feasible with an instance using a disk template offering >>>> multi-node >>>> +redundancy, experiments with the code have shown that the restriction >>>> on >>>> +diverse disk templates can be bypassed to temporarily allow a plain >>>> +disk-template disk to host the zeroing image. The image is dumped to >>>> the disk, >>>> +and the instance is started up. >>>> + >>>> +Once the instance is started up, the zeroing will proceed until >>>> completion, when >>>> +a self-initiated shutdown will occur. The instance-shutdown detection >>>> +capabilities of 2.11 should prevent the watcher from restarting the >>>> instance >>>> +once this happens, allowing the host to take it as a sign the zeroing >>>> was >>>> +completed. Either way, the host waits until the instance is shut down, >>>> or a >>>> +user-defined timeout has been reached and the instance is forcibly >>>> shut down. >>>> >>> >>> This timeout should be dependent on the size of the disks of the >>> instance. Zeroing 300GB can take some time, and such instances could >>> happily exist next to 10GB ones... >>> >>> >> >> A valid point, but I am a bit suspicious whether the user can provide a >> good guess for the size factor, and shutting down too early has >> consequences, as discussed in the document. >> >> The point of the timeout would be to kill the VM after enough time has >> passed that the user is sure that something has gone wrong, and wishes to >> end the attempt. This is the only way to do it, as the current version of >> Ganeti cannot end running jobs. There are plans for this to change, but for >> the time being, some mechanism has to be provided. >> >> Additionally, a fixed timeout is necessary - the zeroing image can be >> user-provided, and there's no way of telling how long startup will take, as >> this may include setting up whatever mechanisms are needed for the instance >> communication. >> >> That said, I am not against having two timeout parameters - the fixed one >> and one size factor. I would just suggest that the default is zero for the >> size factor and a very conservative value for the fixed one. >> With instance communication in place, the size factor should be ignored >> in favor of the real-time reports. >> > > Ok, makes sense. Just keep in mind that there are instances with >8TB of > disk around, and choosing a conservative fixed timeout which also works for > those might be a bit difficult. > > Would it be possible to monitor the VM if it performs disk operations or not? > >> >>> + >>>> +Better progress monitoring can be implemented with the instance-host >>>> +communication channel proposed by the OS install design document. The >>>> first >>>> +version will most likely use only the shutdown detection, and will be >>>> improved >>>> +to account for the available communication channel at a later time. >>>> + >>>> +After the shutdown, the temporary disk is destroyed and the instance >>>> +configuration is reverted to its original state. The very same action >>>> is done if >>>> +any error is encountered during the zeroing process. In the case that >>>> the >>>> +zeroing is interrupted while the zero-filled file is being written, >>>> there is >>>> +little that can be done to recover. One precautionary measure is to >>>> place the >>>> +file in the /tmp directory on Unix systems, if one exists and can be >>>> identified >>>> +as such. Even if TmpFS is mounted there, it is the most likely >>>> location to be >>>> +cleaned up in case of failure. >>>> >>> >>> If TmpFS is mounted there, it would hide the zero-file from the user and >>> making it thus harder to recover manually from such a problem. Also, if the >>> filesystem is not the root filesystem of the guest but usually mounted >>> under e.g. /home, there wouldn't be a /tmp directory... Anyway, both >>> approaches have advantages and disadvantages, so I would personally go for >>> the easier one. >>> >>> >> Not to mention that this might be used to move another type of OS, and >> there putting things into /tmp might be considered an obfuscation :) >> No /tmp it is. >> >> >>> Another note: the OS image could/should also zero all swap partition >>> completely in order to save some more space. >>> >> >> Ack, will include it in the doc. >> >> >>> >>> Something I'm missing in this part of the design is a discussion of >>> compression-methods (maybe with a lot of zeros something really fast can be >>> used) and/or a (semi-) automated way of figuring out if zeroing+compression >>> is faster than just sending the whole data. I agree that this is a bit out >>> of scope for now, but the user should at least have the option to enable or >>> disable zeroing. For future work, move-instance could get a rough >>> measurement of the throughput between the clusters and could then decide >>> based on the size of the instance disks and some heuristics if zeroing >>> makes sense. >>> >> >> I am completely in favor of allowing the user to enable or disable >> zeroing through the --zero-free-space option, but I should probably make >> that more clear in the design document. >> >> I think that the heuristic would be troublesome because the transfer >> speed is dependent on: >> >> - compression / decompression speed - algorithm dependent >> - encryption / decryption speed - algorithm dependent >> - bandwidth >> >> and the overall duration is then altered by the size of the compressed >> file. We could choose the best value, but for that we would need to supply >> or perform measurements of the performance of compression and encryption >> algorithms, and the free space ratio for which these were recorded. >> >> I'd much rather leave the choice of compression algorithm to the user, >> and provide a decent default based on what we use. >> Anyone who performs enough instance moves to care about performance will >> probably be in a position to perform some benchmarks and set the best >> parameters on their own. >> >> Maybe a --perform-zeroing-under-free-space-ratio parameter, unset by >> default, would be a good compromise? Or some sort of hook present for the >> ExportInstance opcode/LU? >> > > I would list a couple of those ideas as further work and see what's > actually asked for. A complete automatic approach which measures the > link/parameters first and then performs some magic would be cool, but > really hard to implement in a very robust way... BTW, a > --perform-zeroing-under-free-space-ratio might be hard to implement, > because the only the zeroing VM knows about the free space on the disk, but > this VM could be user supplied. > > >> >> >>> Another thing missing is the discussion of encryption algorithms. The >>> method to encrypt the data sent from one cluster to the other can be >>> configured and plays quite a big role throughput-wise. We could give users >>> the choice to use another (possibly weaker) encryption if they want more >>> speed and/or review the choice we've made. >>> >> >> I focused on the cross-cluster case in this document, and there I'd be >> surprised if encryption trumped bandwidth as the limiting factor. For >> inter-cluster moves, certainly, and I'd guess some users would appreciate >> the "none" option as well. Will add this to the document. >> > > I guess we should just leave he choice to the user. If the data on the VM > is not sensitive, no encryption might be good enough if the data resides in > one data-center (but not in the same cluster). > > >> >>> >>>> + >>>> +Lock reduction >>>> +============== >>>> + >>>> +An instance move as executed by the move-instance tool consists of >>>> several >>>> +preparatory RAPI calls, leading up to two long-lasting opcodes: >>>> OpCreateInstance >>>> +and OpBackupExport. While OpBackupExport locks only the instance, the >>>> locks of >>>> +OpCreateInstance require more attention. >>>> + >>>> +When executed, this opcode attempts to lock all nodes on which the >>>> instance may >>>> +be created and obtain shared locks on the groups they belong to. In >>>> the case >>>> +that an IAllocator is used, this means all nodes must be locked. Any >>>> operation >>>> +that requires a node lock to be present can delay the move operation, >>>> and there >>>> +is no shortage of these. >>>> + >>>> +The concept of opportunistic locking has been introduced to remedy >>>> exactly this >>>> +situation, allowing the IAllocator to grab as many node locks as >>>> possible. >>>> +Depending on how many nodes were available, the operation either >>>> proceeds as >>>> +expected, or fails noting that it is temporarily infeasible. The >>>> failure case >>>> +is unacceptable for the move-instance tool, which is expected to fail >>>> only if >>>> +the move is impossible. To yield the benefits of opportunistic locking >>>> yet >>>> +satisfy this constraint, the move-instance tool can be extended with >>>> the >>>> +--opportunistic-tries and --opportunistic-try-delay options. A number >>>> of >>>> +opportunistic instance creations are attempted, with a delay between >>>> attempts. >>>> +Should they all fail, a normal and blocking instance creation is >>>> requested. >>>> + >>>> +While it may seem excessive to grab so many node locks, the early >>>> release >>>> +mechanism is used to make the situation less dire, releasing all nodes >>>> that were >>>> +not chosen as candidates for allocation. This is taken to the extreme >>>> as all the >>>> +locks acquired are released prior to the start of the transfer, >>>> barring the >>>> +newly-acquired lock over the new instance. This works because all >>>> operations >>>> +that alter the node in a way which could affect the transfer: >>>> + >>>> +* are prevented by the instance lock or instance presence, e.g. >>>> gnt-node remove, >>>> + gnt-node evacuate, >>>> + >>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as >>>> + unallocatable, and the transfer still proceeds as expected, >>>> + >>>> +* do not care, e.g. a gnt-node powercycle explicitly ignores all locks. >>>> + >>>> +This is an invariant to be kept in mind for future development, but at >>>> the >>>> +current time, no additional locks are needed. >>>> + >>>> +Introduction of changes >>>> +======================= >>>> + >>>> +Both the instance zeroing and the lock reduction will be implemented >>>> as a part >>>> +of Ganeti 2.12, in the way described in the previous chapters. They >>>> will be >>>> +implemented as separate changes, first the lock reduction, and then >>>> the instance >>>> +zeroing due to the implementation overlapping and benefitting from the >>>> changes >>>> +needed for the OS installation improvements. >>>> -- >>>> 1.7.10.4 >>>> >>>> >>> Would it make sense to share this design doc as well with the SRE's? I >>> know that climent@ filed the bug about instance moves, but he's not >>> working on it any more. So ganeti-sre@ or ganeti-team@ might be >>> appropriate. >>> >>> Cheers, >>> Thomas >>> >>> >>> -- >>> Thomas Thrainer | Software Engineer | [email protected] | >>> >>> Google Germany GmbH >>> Dienerstr. 12 >>> 80331 München >>> >>> Registergericht und -nummer: Hamburg, HRB 86891 >>> Sitz der Gesellschaft: Hamburg >>> Geschäftsführer: Graham Law, Christine Elizabeth Flores >>> >> >> > > > -- > Thomas Thrainer | Software Engineer | [email protected] | > > Google Germany GmbH > Dienerstr. 12 > 80331 München > > Registergericht und -nummer: Hamburg, HRB 86891 > Sitz der Gesellschaft: Hamburg > Geschäftsführer: Graham Law, Christine Elizabeth Flores >
