On Fri, Feb 7, 2014 at 10:17 AM, Petr Pudlák <[email protected]> wrote:

>
>
>
> On Thu, Feb 6, 2014 at 9:47 AM, Thomas Thrainer <[email protected]> wrote:
>
>>
>>
>>
>> On Tue, Feb 4, 2014 at 6:05 PM, Hrvoje Ribicic <[email protected]> wrote:
>>
>>> This patch adds a design document exploring zeroing and lock reduction
>>> as options for the improved performance and parallelism of
>>> cross-cluster instance moves.
>>>
>>> Signed-off-by: Hrvoje Ribicic <[email protected]>
>>> ---
>>>  doc/design-move-instance-improvements.rst |  182 +++++++++++++++++++++++++++++
>>>  1 file changed, 182 insertions(+)
>>>  create mode 100644 doc/design-move-instance-improvements.rst
>>>
>>> diff --git a/doc/design-move-instance-improvements.rst b/doc/design-move-instance-improvements.rst
>>> new file mode 100644
>>> index 0000000..22b4bf5
>>> --- /dev/null
>>> +++ b/doc/design-move-instance-improvements.rst
>>> @@ -0,0 +1,182 @@
>>> +========================================
>>> +Cross-cluster instance move improvements
>>> +========================================
>>> +
>>> +.. contents:: :depth: 3
>>> +
>>> +To move instances across clusters, Ganeti provides the move-instance
>>> +tool. It uses the RAPI to create new instances in the destination
>>> +cluster, ready to import data from instances in the source cluster.
>>> +
>>> +The tool works correctly and reliably, but depending on bandwidth
>>> +and priority, an instance disk of considerable size requires a long
>>> +time to transfer. This is inconvenient at best, and can be remedied
>>> +either by reducing the duration of the transfers, or by allowing
>>> +more operations to run in parallel with instance moves.
>>> +
>>> +The former can be achieved by zeroing the empty space on instance
>>> +disks and compressing them prior to transfer, and the latter by
>>> +reducing the amount of locking during an instance move. As the two
>>> +approaches tackle different aspects of the problem, they do not
>>> +exclude each other and will be presented independently.
>>> +
>>> +Zeroing instance disks
>>> +======================
>>> +
>>> +Support for disk compression during instance moves was partially
>>> +present before, but it was cleaned up and explicitly exposed as the
>>> +--compress option only as of Ganeti 2.10. While compression lowers
>>> +the amount of data sent, further gains can be achieved by taking
>>> +advantage of the structure of the disk - namely, by sending only the
>>> +used disk sectors.
>>> +
>>> +There is no direct way to achieve this, as it would require the
>>> +move-instance tool to be aware of the structure of the file system.
>>> +Mounting the filesystem is not an option, primarily due to security
>>> +issues. A disk primed to take advantage of a disk driver exploit
>>> +could allow an attacker to breach instance isolation and gain
>>> +control of a Ganeti node.
>>> +
>>> +An indirect way to achieve this performance gain is to zero the
>>> +empty hard disk space. Sequences of zeroes can be compressed and
>>> +thus transferred very efficiently, all without the host knowing that
>>> +they represent empty space. This approach can also be dangerous if a
>>> +sparse disk is zeroed in this way, as the disk balloons to its full
>>> +size. As Ganeti does not seem to make special concessions for moving
>>> +sparse disks, the only difference should be the disk space
>>> +utilization on the current node.
>>> +
>>> +Zeroing approaches
>>> +++++++++++++++++++
>>> +
>>> +Zeroing is a feasible approach, but the node cannot perform it as it
>>> +cannot mount the disk. Only virtualization-based options remain, and
>>> +of those, using Ganeti's own virtualization capabilities makes the
>>> +most sense. There are two ways of doing this - creating a new helper
>>> +instance, temporary or persistent, or reusing the target instance.
>>> +
>>> +Both approaches have their disadvantages. Creating a new helper
>>> +instance requires managing its lifecycle, taking special care that
>>> +no helper instance is left over due to a failed operation. Even if
>>> +this were taken care of, disks are not yet separate entities in
>>> +Ganeti, making the temporary transfer of disks between instances
>>> +hard to implement and even harder to make robust. Reusing the target
>>> +instance requires modifying the OS running on it to perform the
>>> +zeroing itself when notified via the new instance communication
>>> +mechanism, but this approach is neither generic nor particularly
>>> +safe. There is no guarantee that the zeroing operation will not
>>> +interfere with the normal operation of the instance, nor that it
>>> +will be completed if a user-initiated shutdown occurs.
>>> +
>>> +A better solution can be found by combining the two approaches -
>>> +reusing the virtualized environment, but with a specifically crafted
>>> +OS image. With the instance shut down, as it should be in
>>> +preparation for the move, it can be extended with an additional disk
>>> +holding the OS image. By prepending the disk and changing some
>>> +instance parameters, the instance can boot from it. The OS can be
>>> +configured to perform the zeroing on startup, attempting to mount
>>> +any partitions with a filesystem present, and creating and then
>>> +deleting a zero-filled file on them. After the zeroing is complete,
>>> +the OS should shut down, and the master should note the shutdown and
>>> +restore the instance to its previous state.
>>> +
>>> +Note that the requirements above are very similar to the notion of a
>>> +helper VM suggested in the OS install document. Some potentially
>>> +unsafe actions are performed within a virtualized environment,
>>> +acting on disks that belong or will belong to the instance. The
>>> +mechanisms used will thus be developed with both approaches in mind.
>>> +
>>> +Implementation
>>> +++++++++++++++
>>> +
>>> +There are two components to this solution - the Ganeti changes
>>> +needed to boot the OS, and the OS image used for the zeroing. Due to
>>> +the variety of filesystems and architectures that instances can use,
>>> +no single ready-to-run disk image can satisfy the needs of all
>>> +Ganeti users. Instead, the instance-debootstrap scripts can be used
>>> +to generate a zeroing-capable OS image. This might not be ideal, as
>>> +there are lightweight distributions that take up less space and boot
>>> +more quickly, but generating those with the right set of drivers for
>>> +the virtualization platform of choice is not easy. Thus we do not
>>> +provide a script for this purpose; the user is free to provide any
>>> +OS image which performs the necessary steps: zero out all
>>> +virtualization-provided disks on startup, then shut down
>>> +immediately. The cluster-wide parameter controlling the image to be
>>> +used will be called zeroing-image.
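
For illustration, the startup script inside such an image could be a
sketch along these lines (the device patterns and mount point are my
assumptions; a real image would also have to skip the disk it booted
from, and could zero swap partitions as well):

  #!/bin/sh
  # Mount every guest-visible partition that carries a filesystem,
  # fill it with a zero-filled file, then power off. Skipping the
  # image's own boot disk is omitted here for brevity.
  mkdir -p /mnt/zero
  for part in /dev/vd?[0-9]* /dev/sd?[0-9]*; do
    [ -b "$part" ] || continue
    if mount "$part" /mnt/zero 2>/dev/null; then
      dd if=/dev/zero of=/mnt/zero/zeros bs=1M 2>/dev/null
      rm -f /mnt/zero/zeros
      umount /mnt/zero
    fi
  done
  poweroff
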
>>> +
>>> +The modifications needed in the Ganeti code are minor. The zeroing
>>> +functionality should be implemented as an extension of the instance
>>> +export, and exposed as the --zero-free-space option. Prior to
>>> +beginning the export, the instance configuration is temporarily
>>> +extended with a new read-only disk of sufficient size to host the
>>> +zeroing image, and with the changes necessary for the image to be
>>> +used as the boot drive. The temporary nature of the configuration
>>> +changes requires that they are not propagated to other nodes. While
>>> +this would normally not be feasible with an instance using a disk
>>> +template offering multi-node redundancy, experiments with the code
>>> +have shown that the restriction on mixing disk templates can be
>>> +bypassed to temporarily allow a plain disk-template disk to host the
>>> +zeroing image. The image is dumped to the disk, and the instance is
>>> +started up.
>>> +
>>> +Once the instance is started up, the zeroing will proceed until
>>> +completion, when a self-initiated shutdown will occur. The
>>> +instance-shutdown detection capabilities of 2.11 should prevent the
>>> +watcher from restarting the instance once this happens, allowing the
>>> +host to take it as a sign that the zeroing is complete. In either
>>> +case, the host waits until the instance shuts down on its own, or
>>> +until a user-defined timeout is reached and the instance is forcibly
>>> +shut down.
>>>
>>
>> This timeout should be dependent on the size of the disks of the
>> instance. Zeroing 300GB can take some time, and such instances could
>> happily exist next to 10GB ones...
>>
>>
>>> +
>>> +Better progress monitoring can be implemented with the instance-host
>>> +communication channel proposed by the OS install design document.
>>> +The first version will most likely use only the shutdown detection,
>>> +and will be improved to account for the available communication
>>> +channel at a later time.
>>> +
>>> +After the shutdown, the temporary disk is destroyed and the instance
>>> +configuration is reverted to its original state. The very same
>>> +actions are performed if any error is encountered during the zeroing
>>> +process. If the zeroing is interrupted while the zero-filled file is
>>> +being written, there is little that can be done to recover. One
>>> +precautionary measure is to place the file in the /tmp directory on
>>> +Unix systems, if one exists and can be identified as such. Even if
>>> +TmpFS is mounted there, it is the most likely location to be cleaned
>>> +up in case of failure.
>>>
>>
>> If TmpFS is mounted there, it would hide the zero-file from the user,
>> thus making it harder to recover manually from such a problem. Also,
>> if the filesystem is not the root filesystem of the guest but is
>> usually mounted under e.g. /home, there wouldn't be a /tmp
>> directory... Anyway, both approaches have advantages and
>> disadvantages, so I would personally go for the easier one.
>>
>
> I wouldn't be so sure about /tmp being cleaned up if it's a mount-point
> for TmpFS or another separate partition. I guess an OS first mounts
> partitions and only then cleans up /tmp.
>
>
Ack - that part will be removed.


>
>> Another note: the OS image could/should also zero all swap partitions
>> completely in order to save some more space.
>>
>>
>> Something I'm missing in this part of the design is a discussion of
>> compression methods (maybe with a lot of zeroes something really fast
>> can be used)
>>
>
> I've had a good experience with lzop:
> http://en.wikipedia.org/wiki/Lzop
> It's _very_ fast compared to other compression tools, so it definitely
> wouldn't be a bottleneck, and for blocks of zeroes it would work just
> as well as any other algorithm. I tried to compress 1GB of zeroes; it
> took 2.5s and got compressed into 4.5MB:
>
>  dd bs=1MB count=1024 if=/dev/zero | lzop | wc --bytes
> 1024+0 records in
> 1024+0 records out
> 1024000000 bytes (1.0 GB) copied, 2.50511 s, 409 MB/s
> 4668035
>
>

I actually do not know if 1GB of zeroes is a good benchmark for a
compression tool in this case. With an ext? filesystem, the empty space
is likely to be very fragmented, with pockets of zeroed free space
scattered amongst files. My hunch is also that speed rules, as the
compression ratio will be just about the same for all tools, but I would
like to do some testing on a more realistic-looking drive first. The
choice of compression tool would certainly be added as an option.
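
To get a more realistic sample, something along the following lines
could be used - a rough sketch, assuming root privileges and loop-mount
support; the paths and sizes are arbitrary:

  # Build a 1 GiB ext4 image, fragment its free space by deleting
  # every other file, zero the free space the way the zeroing image
  # would, then compare the compressors' output sizes.
  dd if=/dev/zero of=/tmp/frag.img bs=1M count=1024
  mkfs.ext4 -F /tmp/frag.img
  mkdir -p /mnt/frag
  mount -o loop /tmp/frag.img /mnt/frag
  for i in $(seq 0 199); do
    dd if=/dev/urandom of=/mnt/frag/f$i bs=1M count=2 2>/dev/null
  done
  for i in $(seq 0 2 199); do rm /mnt/frag/f$i; done
  dd if=/dev/zero of=/mnt/frag/zeros bs=1M 2>/dev/null || true
  rm /mnt/frag/zeros
  umount /mnt/frag
  for tool in lzop gzip; do
    printf '%s: ' $tool; $tool -c /tmp/frag.img | wc --bytes
  done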


>> and/or a (semi-) automated way of figuring out if zeroing+compression is
>> faster than just sending the whole data. I agree that this is a bit out of
>> scope for now, but the user should at least have the option to enable or
>> disable zeroing. For future work, move-instance could get a rough
>> measurement of the throughput between the clusters and could then decide
>> based on the size of the instance disks and some heuristics if zeroing
>> makes sense.
>>
>> Another thing missing is the discussion of encryption algorithms. The
>> method to encrypt the data sent from one cluster to the other can be
>> configured and plays quite a big role throughput-wise. We could give users
>> the choice to use another (possibly weaker) encryption if they want more
>> speed and/or review the choice we've made.
>>
>
> It'd be interesting to make some tests and measure the impact of various
> encryption algorithms. I remember using Blowfish with SSH to reduce CPU
> load and speed up transfers, but perhaps nowadays with faster CPUs and
> optimizations in encryption algorithms the difference isn't so large.
>

I guess that for cross-cluster transfers, the limiting factor is the
bandwidth and not the speed of encryption, but I might be wrong. Only
testing will tell :)
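
A quick way to get numbers - as a local sanity check rather than a real
cross-cluster test; 'otherhost' below is just a placeholder:

  # Raw cipher throughput as reported by OpenSSL:
  openssl speed -evp aes-128-cbc
  openssl speed bf-cbc
  # End-to-end effect of the SSH cipher choice on a bulk transfer:
  dd if=/dev/zero bs=1M count=1024 \
    | ssh -c aes128-cbc otherhost 'cat > /dev/null'
  dd if=/dev/zero bs=1M count=1024 \
    | ssh -c blowfish-cbc otherhost 'cat > /dev/null'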


>
>
>>
>>
>>> +
>>> +Lock reduction
>>> +==============
>>> +
>>> +An instance move as executed by the move-instance tool consists of
>>> +several preparatory RAPI calls, leading up to two long-lasting
>>> +opcodes: OpCreateInstance and OpBackupExport. While OpBackupExport
>>> +locks only the instance, the locks of OpCreateInstance require more
>>> +attention.
>>> +
>>> +When executed, this opcode attempts to lock all nodes on which the
>>> +instance may be created, and to obtain shared locks on the groups
>>> +they belong to. In the case that an IAllocator is used, this means
>>> +all nodes must be locked. Any operation that requires a node lock
>>> +can delay the move operation, and there is no shortage of these.
>>> +
>>> +The concept of opportunistic locking has been introduced to remedy
>>> +exactly this situation, allowing the IAllocator to grab as many node
>>> +locks as possible. Depending on how many nodes were available, the
>>> +operation either proceeds as expected, or fails, noting that it is
>>> +temporarily infeasible. The failure case is unacceptable for the
>>> +move-instance tool, which is expected to fail only if the move is
>>> +impossible. To gain the benefits of opportunistic locking yet
>>> +satisfy this constraint, the move-instance tool can be extended with
>>> +the --opportunistic-tries and --opportunistic-try-delay options. A
>>> +number of opportunistic instance creations are attempted, with a
>>> +delay between attempts.
>>>
>>
> Definitely the delays should be randomized to avoid inadvertently
> synchronized simultaneous attempts by multiple jobs.
>

Ack.


>
>
>>> +Should they all fail, a normal and blocking instance creation is
>>> +requested.
>>>
>>
> I don't fully understand this. Does it mean that if opportunistic locking
> using an IAllocator fails, it'd fall back to just trying to pick up any
> node (or any two nodes) available?
>

No, it'd fall back to a non-opportunistic use of an IAllocator, blocking
the execution of the move until all the node locks on the target cluster
can be acquired. Will rewrite.

>
>
>>> +
>>> +While it may seem excessive to grab so many node locks, the early
>>> +release mechanism is used to make the situation less dire, releasing
>>> +all nodes that were not chosen as candidates for allocation. This is
>>> +taken to the extreme as all the locks acquired are released prior to
>>> +the start of the transfer, barring the newly-acquired lock over the
>>> +new instance. This works because all operations that alter the node
>>> +in a way which could affect the transfer:
>>> +
>>> +* are prevented by the instance lock or instance presence, e.g.
>>> +  gnt-node remove, gnt-node evacuate,
>>> +
>>> +* do not interrupt the transfer, e.g. a PV on the node can be set as
>>> +  unallocatable, and the transfer still proceeds as expected,
>>> +
>>> +* do not care, e.g. a gnt-node powercycle explicitly ignores all
>>> +  locks.
>>> +
>>> +This is an invariant to be kept in mind for future development, but
>>> +at the current time, no additional locks are needed.
>>>
>>
> I'm a bit confused about what the conclusion of this section is. Does
> it propose any lock changes (reduction)? Or does it just propose adding
> retries for instance creation if opportunistic locking fails?
>
There is no general reduction in the lock types acquired, nor can locks
be released earlier. Opportunistic locking may result in earlier
execution of operations, but it is just a matter of using it, as the
feature is already present. I will rewrite this to improve clarity.


> Perhaps we should rather aim to improve opportunistic locking in
> general, allowing these parameters for all LUs that use it; instance
> creation is not the only one.
>

That is a good point, but the scope of that change would be much greater
than the one proposed in this design document. When retrying, the
move-instance tool can simply issue another creation job, identical to
the previous one. Adding the option to the LU itself would mean
introducing a mechanism for the automatic retrying of LUs. While this
can and probably should be done, it is a much greater refactoring of the
job handling in Ganeti and should be undertaken separately.
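
As a sketch of the intended retry behaviour (the option names are the
proposed ones; try_opportunistic_create and blocking_create are
placeholders for the corresponding RAPI instance-creation calls):

  # What move-instance would do internally, as shell-like pseudocode.
  # The randomized delay keeps parallel jobs from retrying in
  # lockstep, per Petr's comment above.
  tries=3; delay=60                      # --opportunistic-tries/-try-delay
  for i in $(seq 1 $tries); do
    try_opportunistic_create && exit 0   # succeeded with partial locks
    sleep $(( delay + RANDOM % delay ))  # randomized back-off
  done
  blocking_create                        # final attempt, waits for locks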


>
>>> +
>>> +Introduction of changes
>>> +=======================
>>> +
>>> +Both the instance zeroing and the lock reduction will be implemented
>>> +as part of Ganeti 2.12, in the way described in the previous
>>> +chapters. They will be implemented as separate changes: first the
>>> +lock reduction, and then the instance zeroing, since its
>>> +implementation overlaps with and benefits from the changes needed
>>> +for the OS installation improvements.
>>> --
>>> 1.7.10.4
>>>
>>>
>> Would it make sense to share this design doc with the SREs as well? I
>> know that climent@ filed the bug about instance moves, but he's not
>> working on it any more. So ganeti-sre@ or ganeti-team@ might be
>> appropriate.
>>
>> Cheers,
>> Thomas
>>
>>
>> --
>> Thomas Thrainer | Software Engineer | [email protected] |
>>
>> Google Germany GmbH
>> Dienerstr. 12
>> 80331 München
>>
>> Registergericht und -nummer: Hamburg, HRB 86891
>> Sitz der Gesellschaft: Hamburg
>> Geschäftsführer: Graham Law, Christine Elizabeth Flores
>>
>
>
