Re: [RFC master] Add move instance improvements design document

Petr Pudlák Mon, 10 Feb 2014 23:49:55 -0800

On Tue, Feb 11, 2014 at 8:43 AM, Thomas Thrainer <[email protected]>wrote:


>
>
>
> On Fri, Feb 7, 2014 at 2:55 PM, Hrvoje Ribicic <[email protected]> wrote:
>
>> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]>wrote:
>>
>>>
>>>
>>>
>>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote:
>>>
>>>> This patch adds a design document exploring zeroing and lock reduction
>>>> as options for the improved performance and parallelism of
>>>> cross-cluster instance moves.
>>>>
>>>> Signed-off-by: Hrvoje Ribicic <[email protected]>
>>>> ---
>>>>  doc/design-move-instance-improvements.rst |  182
>>>> +++++++++++++++++++++++++++++
>>>>  1 file changed, 182 insertions(+)
>>>>  create mode 100644 doc/design-move-instance-improvements.rst
>>>>
>>>> diff --git a/doc/design-move-instance-improvements.rst
>>>> b/doc/design-move-instance-improvements.rst
>>>> new file mode 100644
>>>> index 0000000..22b4bf5
>>>> --- /dev/null
>>>> +++ b/doc/design-move-instance-improvements.rst
>>>> @@ -0,0 +1,182 @@
>>>> +========================================
>>>> +Cross-cluster instance move improvements
>>>> +========================================
>>>> +
>>>> +.. contents:: :depth: 3
>>>> +
>>>> +To move instances across clusters, Ganeti provides the move-instance
>>>> tool. It
>>>> +uses the RAPI to create new instances in the destination cluster,
>>>> ready to
>>>> +import data from instances in the source cluster.
>>>> +
>>>> +The tool works correctly and reliably, but depending on bandwidth and
>>>> priority,
>>>> +an instance disk of considerable size requires a long time to
>>>> transfer. This is
>>>> +inconvenient at best, and can be remedied by either reducing the
>>>> length of the
>>>> +transfers, or allowing more operations to run in parallel with
>>>> instance moves.
>>>> +
>>>> +The former can be achieved through the zeroing of empty space on
>>>> instance disks
>>>> +and compressing them prior to transfer, and the latter by reducing the
>>>> amount of
>>>> +locking happening during an instance move. As the approaches aim to
>>>> tackle two
>>>> +different aspects of the problem, they do not exclude each other and
>>>> will be
>>>> +presented independently.
>>>> +
>>>> +Zeroing instance disks
>>>> +======================
>>>> +
>>>> +Support for disk compression during instance moves was partially
>>>> present before,
>>>> +but cleaned up and explicitly added as the --compress option only as
>>>> of Ganeti
>>>> +2.10. While compression lowers the amount of data sent, further gains
>>>> can be
>>>> +achieved by taking advantage of the structure of the disk - namely,
>>>> sending only
>>>> +used disk sectors.
>>>> +
>>>> +There is no direct way to achieve this, as it would require that the
>>>> +move-instance tool is aware of the structure of the file system.
>>>> Mounting the
>>>> +filesystem is not an option, primarily due to security issues. A disk
>>>> primed to
>>>> +take advantage of a disk driver exploit could cause an attacker to
>>>> breach
>>>> +instance isolation and gain control of a Ganeti node.
>>>> +
>>>> +An indirect way for this performance gain to be achieved is the
>>>> zeroing of the
>>>> +empty hard disk space. Sequences of zeroes can be compressed and thus
>>>> +transferred very efficiently, all without the host knowing that these
>>>> are empty
>>>> +space. This approach can also be dangerous if a sparse disk is zeroed
>>>> in this
>>>> +way, causing ballooning. As Ganeti does not seem to make special
>>>> concessions for
>>>> +moving sparse disks, the only difference should be the disk space
>>>> utilization
>>>> +on the current node.
>>>> +
>>>> +Zeroing approaches
>>>> +++++++++++++++++++
>>>> +
>>>> +Zeroing is a feasible approach, but the node cannot perform it as it
>>>> cannot
>>>> +mount the disk. Only virtualization-based options remain, and of
>>>> those, using
>>>> +Ganeti's own virtualization capabilities makes the most sense. There
>>>> are two
>>>> +ways of doing this - creating a new helper instance, temporary or
>>>> persistent, or
>>>> +reusing the target instance.
>>>> +
>>>> +Both approaches have their disadvantages. Creating a new helper
>>>> instance
>>>> +requires managing its lifecycle, taking special care to make sure no
>>>> helper
>>>> +instance remains left over due to a failed operation. Even if this
>>>> were to be
>>>> +taken care of, disks are not yet separate entities in Ganeti, making
>>>> the
>>>> +temporary transfer of disks between instances hard to implement and
>>>> even harder
>>>> +to make robust. The reuse can be done by modifying the OS running on
>>>> the
>>>> +instance to perform the zeroing itself when notified via the new
>>>> instance
>>>> +communication mechanism, but this approach is neither generic, nor
>>>> particularly
>>>> +safe. There is no guarantee that the zeroing operation will not
>>>> interfere with
>>>> +the normal operation of the instance, nor that it will be completed if
>>>> a
>>>> +user-initiated shutdown occurs.
>>>> +
>>>> +A better solution can be found by combining the two approaches -
>>>> re-using the
>>>> +virtualized environment, but with a specifically crafted OS image.
>>>> With the
>>>> +instance shut down as it should be in preparation for the move, it can
>>>> be
>>>> +extended with an additional disk with the OS image on it. By
>>>> prepending the
>>>> +disk and changing some instance parameters, the instance can boot from
>>>> it. The
>>>> +OS can be configured to perform the zeroing on startup, attempting to
>>>> mount any
>>>> +partitions with a filesystem present, and creating and deleting a
>>>> zero-filled
>>>> +file on them. After the zeroing is complete, the OS should shut down,
>>>> and the
>>>> +master should note the shutdown and restore the instance to its
>>>> previous state.
>>>> +
>>>> +Note that the requirements above are very similar to the notion of a
>>>> helper VM
>>>> +suggested in the OS install document. Some potentially unsafe actions
>>>> are
>>>> +performed within a virtualized environment, acting on disks that
>>>> belong or will
>>>> +belong to the instance. The mechanisms used will thus be developed
>>>> with both
>>>> +approaches in mind.
>>>> +
>>>> +Implementation
>>>> +++++++++++++++
>>>> +
>>>> +There are two components to this solution - the Ganeti changes needed
>>>> to boot
>>>> +the OS, and the OS image used for the zeroing. Due to the variety of
>>>> filesystems
>>>> +and architectures that instances can use, no single ready-to-run disk
>>>> image can
>>>> +satisfy the needs of all the Ganeti users. Instead, the
>>>> instance-debootstrap
>>>> +scripts can be used to generate a zeroing-capable OS image. This might
>>>> not be
>>>> +ideal, as there are lightweight distributions that take up less space
>>>> and boot
>>>> +up more quickly. Generating those with the right set of drivers for the
>>>> +virtualization platform of choice is not easy. Thus we do not provide
>>>> a script
>>>> +for this purpose, but the user is free to provide any OS image which
>>>> performs
>>>> +the necessary steps: zero out all virtualization-provided devices on
>>>> startup,
>>>> +shutdown immediately. The cluster-wide parameter controlling the image
>>>> to be
>>>> +used would be called zeroing-image.
>>>> +
>>>> +The modifications to Ganeti code needed are minor. The zeroing
>>>> functionality
>>>> +should be implemented as an extension of the instance export, and
>>>> exposed as the
>>>> +--zero-free-space option. Prior to beginning the export, the instance
>>>> +configuration is temporarily extended with a new read-only disk of
>>>> sufficient
>>>> +size to host the zeroing image, and the changes necessary for the
>>>> image to be
>>>> +used as the boot drive. The temporary nature of the configuration
>>>> changes
>>>> +requires that they are not propagated to other nodes. While this would
>>>> normally
>>>> +not be feasible with an instance using a disk template offering
>>>> multi-node
>>>> +redundancy, experiments with the code have shown that the restriction
>>>> on
>>>> +diverse disk templates can be bypassed to temporarily allow a plain
>>>> +disk-template disk to host the zeroing image. The image is dumped to
>>>> the disk,
>>>> +and the instance is started up.
>>>> +
>>>> +Once the instance is started up, the zeroing will proceed until
>>>> completion, when
>>>> +a self-initiated shutdown will occur. The instance-shutdown detection
>>>> +capabilities of 2.11 should prevent the watcher from restarting the
>>>> instance
>>>> +once this happens, allowing the host to take it as a sign the zeroing
>>>> was
>>>> +completed. Either way, the host waits until the instance is shut down,
>>>> or a
>>>> +user-defined timeout has been reached and the instance is forcibly
>>>> shut down.
>>>>
>>>
>>> This timeout should be dependent on the size of the disks of the
>>> instance. Zeroing 300GB can take some time, and such instances could
>>> happily exist next to 10GB ones...
>>>
>>>
>>
>> A valid point, but I am a bit suspicious whether the user can provide a
>> good guess for the size factor, and shutting down too early has
>> consequences, as discussed in the document.
>>
>> The point of the timeout would be to kill the VM after enough time has
>> passed that the user is sure that something has gone wrong, and wishes to
>> end the attempt. This is the only way to do it, as the current version of
>> Ganeti cannot end running jobs. There are plans for this to change, but for
>> the time being, some mechanism has to be provided.
>>
>> Additionally, a fixed timeout is necessary - the zeroing image can be
>> user-provided, and there's no way of telling how long startup will take, as
>> this may include setting up whatever mechanisms are needed for the instance
>> communication.
>>
>> That said, I am not against having two timeout parameters - the fixed one
>> and one size factor. I would just suggest that the default is zero for the
>> size factor and a very conservative value for the fixed one.
>> With instance communication in place, the size factor should be ignored
>> in favor of the real-time reports.
>>
>
> Ok, makes sense. Just keep in mind that there are instances with >8TB of
> disk around, and choosing a conservative fixed timeout which also works for
> those might be a bit difficult.
>
>

Would it be possible to monitor the VM if it performs disk operations or
not?


>
>>
>>>  +
>>>> +Better progress monitoring can be implemented with the instance-host
>>>> +communication channel proposed by the OS install design document. The
>>>> first
>>>> +version will most likely use only the shutdown detection, and will be
>>>> improved
>>>> +to account for the available communication channel at a later time.
>>>> +
>>>> +After the shutdown, the temporary disk is destroyed and the instance
>>>> +configuration is reverted to its original state. The very same action
>>>> is done if
>>>> +any error is encountered during the zeroing process. In the case that
>>>> the
>>>> +zeroing is interrupted while the zero-filled file is being written,
>>>> there is
>>>> +little that can be done to recover. One precautionary measure is to
>>>> place the
>>>> +file in the /tmp directory on Unix systems, if one exists and can be
>>>> identified
>>>> +as such. Even if TmpFS is mounted there, it is the most likely
>>>> location to be
>>>> +cleaned up in case of failure.
>>>>
>>>
>>> If TmpFS is mounted there, it would hide the zero-file from the user and
>>> making it thus harder to recover manually from such a problem. Also, if the
>>> filesystem is not the root filesystem of the guest but usually mounted
>>> under e.g. /home, there wouldn't be a /tmp directory... Anyway, both
>>> approaches have advantages and disadvantages, so I would personally go for
>>> the easier one.
>>>
>>>
>> Not to mention that this might be used to move another type of OS, and
>> there putting things into /tmp might be considered an obfuscation :)
>> No /tmp it is.
>>
>>
>>> Another note: the OS image could/should also zero all swap partition
>>> completely in order to save some more space.
>>>
>>
>>  Ack, will include it in the doc.
>>
>>
>>>
>>> Something I'm missing in this part of the design is a discussion of
>>> compression-methods (maybe with a lot of zeros something really fast can be
>>> used) and/or a (semi-) automated way of figuring out if zeroing+compression
>>> is faster than just sending the whole data. I agree that this is a bit out
>>> of scope for now, but the user should at least have the option to enable or
>>> disable zeroing. For future work, move-instance could get a rough
>>> measurement of the throughput between the clusters and could then decide
>>> based on the size of the instance disks and some heuristics if zeroing
>>> makes sense.
>>>
>>
>> I am completely in favor of allowing the user to enable or disable
>> zeroing through the --zero-free-space option, but I should probably make
>> that more clear in the design document.
>>
>> I think that the heuristic would be troublesome because the transfer
>> speed is dependent on:
>>
>> - compression / decompression speed - algorithm dependent
>> - encryption / decryption speed - algorithm dependent
>> - bandwidth
>>
>> and the overall duration is then altered by the size of the compressed
>> file. We could choose the best value, but for that we would need to supply
>> or perform measurements of the performance of compression and encryption
>> algorithms, and the free space ratio for which these were recorded.
>>
>> I'd much rather leave the choice of compression algorithm to the user,
>> and provide a decent default based on what we use.
>> Anyone who performs enough instance moves to care about performance will
>> probably be in a position to perform some benchmarks and set the best
>> parameters on their own.
>>
>> Maybe a --perform-zeroing-under-free-space-ratio parameter, unset by
>> default, would be a good compromise? Or some sort of hook present for the
>> ExportInstance opcode/LU?
>>
>
> I would list a couple of those ideas as further work and see what's
> actually asked for. A complete automatic approach which measures the
> link/parameters first and then performs some magic would be cool, but
> really hard to implement in a very robust way... BTW, a
> --perform-zeroing-under-free-space-ratio might be hard to implement,
> because the only the zeroing VM knows about the free space on the disk, but
> this VM could be user supplied.
>
>
>>
>>
>>> Another thing missing is the discussion of encryption algorithms. The
>>> method to encrypt the data sent from one cluster to the other can be
>>> configured and plays quite a big role throughput-wise. We could give users
>>> the choice to use another (possibly weaker) encryption if they want more
>>> speed and/or review the choice we've made.
>>>
>>
>> I focused on the cross-cluster case in this document, and there I'd be
>> surprised if encryption trumped bandwidth as the limiting factor. For
>> inter-cluster moves, certainly, and I'd guess some users would appreciate
>> the "none" option as well. Will add this to the document.
>>
>
> I guess we should just leave he choice to the user. If the data on the VM
> is not sensitive, no encryption might be good enough if the data resides in
> one data-center (but not in the same cluster).
>
>
>>
>>>
>>>> +
>>>> +Lock reduction
>>>> +==============
>>>> +
>>>> +An instance move as executed by the move-instance tool consists of
>>>> several
>>>> +preparatory RAPI calls, leading up to two long-lasting opcodes:
>>>> OpCreateInstance
>>>> +and OpBackupExport. While OpBackupExport locks only the instance, the
>>>> locks of
>>>> +OpCreateInstance require more attention.
>>>> +
>>>> +When executed, this opcode attempts to lock all nodes on which the
>>>> instance may
>>>> +be created and obtain shared locks on the groups they belong to. In
>>>> the case
>>>> +that an IAllocator is used, this means all nodes must be locked. Any
>>>> operation
>>>> +that requires a node lock to be present can delay the move operation,
>>>> and there
>>>> +is no shortage of these.
>>>> +
>>>> +The concept of opportunistic locking has been introduced to remedy
>>>> exactly this
>>>> +situation, allowing the IAllocator to grab as many node locks as
>>>> possible.
>>>> +Depending on how many nodes were available, the operation either
>>>> proceeds as
>>>> +expected, or fails noting that it is temporarily infeasible. The
>>>> failure case
>>>> +is unacceptable for the move-instance tool, which is expected to fail
>>>> only if
>>>> +the move is impossible. To yield the benefits of opportunistic locking
>>>> yet
>>>> +satisfy this constraint, the move-instance tool can be extended with
>>>> the
>>>> +--opportunistic-tries and --opportunistic-try-delay options. A number
>>>> of
>>>> +opportunistic instance creations are attempted, with a delay between
>>>> attempts.
>>>> +Should they all fail, a normal and blocking instance creation is
>>>> requested.
>>>> +
>>>> +While it may seem excessive to grab so many node locks, the early
>>>> release
>>>> +mechanism is used to make the situation less dire, releasing all nodes
>>>> that were
>>>> +not chosen as candidates for allocation. This is taken to the extreme
>>>> as all the
>>>> +locks acquired are released prior to the start of the transfer,
>>>> barring the
>>>> +newly-acquired lock over the new instance. This works because all
>>>> operations
>>>> +that alter the node in a way which could affect the transfer:
>>>> +
>>>> +* are prevented by the instance lock or instance presence, e.g.
>>>> gnt-node remove,
>>>> +  gnt-node evacuate,
>>>> +
>>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as
>>>> +  unallocatable, and the transfer still proceeds as expected,
>>>> +
>>>> +* do not care, e.g. a gnt-node powercycle explicitly ignores all locks.
>>>> +
>>>> +This is an invariant to be kept in mind for future development, but at
>>>> the
>>>> +current time, no additional locks are needed.
>>>> +
>>>> +Introduction of changes
>>>> +=======================
>>>> +
>>>> +Both the instance zeroing and the lock reduction will be implemented
>>>> as a part
>>>> +of Ganeti 2.12, in the way described in the previous chapters. They
>>>> will be
>>>> +implemented as separate changes, first the lock reduction, and then
>>>> the instance
>>>> +zeroing due to the implementation overlapping and benefitting from the
>>>> changes
>>>> +needed for the OS installation improvements.
>>>> --
>>>> 1.7.10.4
>>>>
>>>>
>>> Would it make sense to share this design doc as well with the SRE's? I
>>> know that climent@ filed the bug about instance moves, but he's not
>>> working on it any more. So ganeti-sre@ or ganeti-team@ might be
>>> appropriate.
>>>
>>> Cheers,
>>> Thomas
>>>
>>>
>>> --
>>> Thomas Thrainer | Software Engineer | [email protected] |
>>>
>>> Google Germany GmbH
>>> Dienerstr. 12
>>> 80331 München
>>>
>>> Registergericht und -nummer: Hamburg, HRB 86891
>>> Sitz der Gesellschaft: Hamburg
>>> Geschäftsführer: Graham Law, Christine Elizabeth Flores
>>>
>>
>>
>
>
> --
> Thomas Thrainer | Software Engineer | [email protected] |
>
> Google Germany GmbH
> Dienerstr. 12
> 80331 München
>
> Registergericht und -nummer: Hamburg, HRB 86891
> Sitz der Gesellschaft: Hamburg
> Geschäftsführer: Graham Law, Christine Elizabeth Flores
>

Re: [RFC master] Add move instance improvements design document

Reply via email to