> On 19 Jun 2019, at 20:12, Max Reitz <mre...@redhat.com> wrote:
> 
> On 05.06.19 14:17, Sam Eiderman wrote:
>> Until ESXi 6.5 VMware used the vmfsSparse format for snapshots (VMDK3 in
>> QEMU).
>> 
>> This format was lacking in the following:
>> 
>>    * Grain directory (L1) and grain table (L2) entries were 32-bit,
>>      allowing access to only 2TB (slightly less) of data.
>>    * The grain size (default) was 512 bytes - leading to data
>>      fragmentation and many grain tables.
>>    * For space reclamation purposes, it was necessary to find all the
>>      grains which are not pointed to by any grain table - so a reverse
>>      mapping of "offset of grain in vmdk" to "grain table" must be
>>      constructed - which takes large amounts of CPU/RAM.
>> 
>> The format specification can be found in VMware's documentation:
>> https://www.vmware.com/support/developer/vddk/vmdk_50_technote.pdf
>> 
>> In ESXi 6.5, to support snapshot files larger than 2TB, a new format was
>> introduced: SESparse (Space Efficient).
>> 
>> This format fixes the above issues:
>> 
>>    * All entries are now 64-bit.
>>    * The grain size (default) is 4KB.
>>    * Grain directory and grain tables are now located at the beginning
>>      of the file.
>>      + seSparse format reserves space for all grain tables.
>>      + Grain tables can be addressed using an index.
>>      + Grains are located at the end of the file and can also be
>>        addressed with an index.
>>      - seSparse vmdks of large disks (64TB) have huge preallocated
>>        headers - mainly due to L2 tables, even for empty snapshots.
>>    * The header contains a reverse mapping ("backmap") of "offset of
>>      grain in vmdk" to "grain table" and a bitmap ("free bitmap") which
>>      specifies for each grain - whether it is allocated or not.
>>      Using these data structures we can implement space reclamation
>>      efficiently.
>>    * Due to the fact that the header now maintains two mappings:
>>        * The regular one (grain directory & grain tables)
>>        * A reverse one (backmap and free bitmap)
>>      These data structures can lose consistency upon crash and result
>>      in a corrupted VMDK.
>>      Therefore, a journal is also added to the VMDK and is replayed
>>      when VMware reopens the file after a crash.
>> 
>> Since ESXi 6.7 - SESparse is the only snapshot format available.
>> 
>> Unfortunately, VMware does not provide documentation regarding the new
>> seSparse format.
>> 
>> This commit is based on black-box research of the seSparse format.
>> Various in-guest block operations and their effect on the snapshot file
>> were tested.
>> 
>> The only VMware provided source of information (regarding the underlying
>> implementation) was a log file on the ESXi:
>> 
>>    /var/log/hostd.log
>> 
>> Whenever an seSparse snapshot is created - the log is being populated
>> with seSparse records.
>> 
>> Relevant log records are of the form:
>> 
>> [...] Const Header:
>> [...]  constMagic     = 0xcafebabe
>> [...]  version        = 2.1
>> [...]  capacity       = 204800
>> [...]  grainSize      = 8
>> [...]  grainTableSize = 64
>> [...]  flags          = 0
>> [...] Extents:
>> [...]  Header         : <1 : 1>
>> [...]  JournalHdr     : <2 : 2>
>> [...]  Journal        : <2048 : 2048>
>> [...]  GrainDirectory : <4096 : 2048>
>> [...]  GrainTables    : <6144 : 2048>
>> [...]  FreeBitmap     : <8192 : 2048>
>> [...]  BackMap        : <10240 : 2048>
>> [...]  Grain          : <12288 : 204800>
>> [...] Volatile Header:
>> [...] volatileMagic     = 0xcafecafe
>> [...] FreeGTNumber      = 0
>> [...] nextTxnSeqNumber  = 0
>> [...] replayJournal     = 0
>> 
>> The sizes that are seen in the log file are in sectors.
>> Extents are of the following format: <offset : size>
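
To make the units concrete, here is a minimal sketch that converts an
"<offset : size>" pair from the log into bytes. It assumes 512-byte
sectors (consistent with "8 sectors (4KB)" below). SketchExtent and the
sketch_* helpers are hypothetical names for illustration, not
identifiers from block/vmdk.c.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SECTOR_SIZE 512ULL /* assumption: 512-byte sectors */

/* Hypothetical type: one "<offset : size>" pair, both in sectors. */
typedef struct {
    uint64_t offset;
    uint64_t size;
} SketchExtent;

/* Convert a sector count to bytes, guarding against overflow. */
static uint64_t sketch_sectors_to_bytes(uint64_t sectors)
{
    assert(sectors <= UINT64_MAX / SKETCH_SECTOR_SIZE);
    return sectors * SKETCH_SECTOR_SIZE;
}

/* Byte offset of the start of the extent within the file. */
static uint64_t sketch_extent_byte_offset(SketchExtent e)
{
    return sketch_sectors_to_bytes(e.offset);
}

/* Length of the extent in bytes. */
static uint64_t sketch_extent_byte_size(SketchExtent e)
{
    return sketch_sectors_to_bytes(e.size);
}
```

With the values from the log above, the GrainDirectory extent
<4096 : 2048> starts 2 MiB into the file and is 1 MiB long.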
>> 
>> This commit is a strict implementation which enforces:
>>    * magics
>>    * version number 2.1
>>    * grain size of 8 sectors  (4KB)
>>    * grain table size of 64 sectors
>>    * zero flags
>>    * extent locations
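
The checks listed above could be sketched roughly as follows. This is
an illustration only: the struct is a hypothetical in-memory view whose
fields mirror the names printed in hostd.log, not the on-disk layout,
and the version is split into major/minor integers purely for
readability. The volatile magic and the extent locations are left out
to keep the sketch short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, already-parsed view of the "Const Header" log record. */
typedef struct {
    uint64_t magic;
    uint32_t version_major;
    uint32_t version_minor;
    uint64_t capacity;         /* in sectors */
    uint64_t grain_size;       /* in sectors */
    uint64_t grain_table_size; /* in sectors */
    uint64_t flags;
} SketchConstHeader;

/* Return true only if every strictly enforced field has its expected value. */
static bool sketch_const_header_ok(const SketchConstHeader *h)
{
    assert(h != NULL);
    return h->magic == 0xcafebabeULL &&
           h->version_major == 2 &&
           h->version_minor == 1 &&
           h->grain_size == 8 &&        /* 8 sectors = 4KB */
           h->grain_table_size == 64 &&
           h->flags == 0;
}
```

Rejecting anything outside these values is deliberately conservative:
an image with a different grain size or non-zero flags may use features
that were never observed during the black-box research.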
>> 
>> Additionally, this commit provides only a subset of the functionality
>> offered by seSparse's format:
>>    * Read-only
>>    * No journal replay
>>    * No space reclamation
>>    * No unmap support
>> 
>> Hence, journal header, journal, free bitmap and backmap extents are
>> unused, only the "classic" (L1 -> L2 -> data) grain access is
>> implemented.
>> 
>> However there are several differences in the grain access itself.
>> Grain directory (L1):
>>    * Grain directory entries are indexes (not offsets) to grain
>>      tables.
>>    * Valid grain directory entries have their highest nibble set to
>>      0x1.
>>    * Since grain tables are always located at the beginning of the
>>      file - the index can fit into 32 bits - so we can use its low
>>      part if it's valid.
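
A minimal sketch of the grain directory rules above, assuming the entry
has already been read from disk and converted to host byte order. The
function name is hypothetical. The sketch only tests the highest nibble
and does not check whether bits 32..59 are zero, which a stricter
implementation might want to do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a 64-bit grain directory (L1) entry.
 * Returns true and stores the grain table index in *index if the
 * highest nibble is 0x1; returns false otherwise.
 */
static bool sketch_gd_entry_to_index(uint64_t entry, uint32_t *index)
{
    assert(index != NULL);
    if ((entry >> 60) != 0x1) {
        return false; /* not marked as a valid entry */
    }
    /* The index fits into 32 bits, so keep only the low part. */
    *index = (uint32_t)(entry & 0xffffffffULL);
    return true;
}
```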
>> Grain table (L2):
>>    * Grain table entries are indexes (not offsets) to grains.
>>    * If the highest nibble of the entry is:
>>        0x0:
>>            The grain is not allocated.
>>            The rest of the bytes are 0.
>>        0x1:
>>            The grain is unmapped - guest sees a zero grain.
>>            The rest of the bits point to the previously mapped grain,
>>            see 0x3 case.
>>        0x2:
>>            The grain is zero.
>>        0x3:
>>            The grain is allocated - to get the index calculate:
>>            ((entry & 0x0fff000000000000) >> 48) |
>>            ((entry & 0x0000ffffffffffff) << 12)
>>    * The difference between 0x1 and 0x2 is that 0x1 is an unallocated
>>      grain which results from the guest using sg_unmap to unmap the
>>      grain - but the grain itself still exists in the grain extent - a
>>      space reclamation procedure should delete it.
>>      Unmapping a zero grain has no effect (0x2 will not change to 0x1)
>>      but unmapping an unallocated grain will (0x0 to 0x1) - naturally.
>> 
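
The grain table rules above can be sketched as two small helpers: one
classifies an entry by its highest nibble, the other applies the index
formula for the 0x3 case. The enum and function names are hypothetical,
and the entry is assumed to be in host byte order already. Nibble
values above 0x3 are not described above, so the sketch reports them as
invalid.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the four documented entry states. */
typedef enum {
    SKETCH_GRAIN_UNALLOCATED = 0x0,
    SKETCH_GRAIN_UNMAPPED    = 0x1, /* guest sees zeros, old grain remains */
    SKETCH_GRAIN_ZERO        = 0x2,
    SKETCH_GRAIN_ALLOCATED   = 0x3,
    SKETCH_GRAIN_INVALID     = 0xf  /* any nibble not listed above */
} SketchGrainState;

/* Classify a 64-bit grain table (L2) entry by its highest nibble. */
static SketchGrainState sketch_gt_entry_state(uint64_t entry)
{
    switch (entry >> 60) {
    case 0x0: return SKETCH_GRAIN_UNALLOCATED;
    case 0x1: return SKETCH_GRAIN_UNMAPPED;
    case 0x2: return SKETCH_GRAIN_ZERO;
    case 0x3: return SKETCH_GRAIN_ALLOCATED;
    default:  return SKETCH_GRAIN_INVALID;
    }
}

/* Grain index of an allocated (0x3) entry, using the formula above. */
static uint64_t sketch_gt_entry_grain_index(uint64_t entry)
{
    assert(sketch_gt_entry_state(entry) == SKETCH_GRAIN_ALLOCATED);
    return ((entry & 0x0fff000000000000ULL) >> 48) |
           ((entry & 0x0000ffffffffffffULL) << 12);
}
```

Note the bit shuffle: bits 48..59 of the entry become the low 12 bits
of the index, and bits 0..47 of the entry supply the bits above them.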
>> In order to implement seSparse some fields had to be changed to support
>> both 32-bit and 64-bit entry sizes.
>> 
>> Reviewed-by: Karl Heubaum <karl.heub...@oracle.com>
>> Reviewed-by: Eyal Moscovici <eyal.moscov...@oracle.com>
>> Reviewed-by: Arbel Moshe <arbel.mo...@oracle.com>
>> Signed-off-by: Sam Eiderman <shmuel.eider...@oracle.com>
>> ---
>> block/vmdk.c | 357 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
>> 1 file changed, 341 insertions(+), 16 deletions(-)
>> 
>> diff --git a/block/vmdk.c b/block/vmdk.c
>> index 931eb2759c..4377779635 100644
>> --- a/block/vmdk.c
>> +++ b/block/vmdk.c
> 
> [...]
> 
>> +static int vmdk_open_se_sparse(BlockDriverState *bs,
>> +                               BdrvChild *file,
>> +                               int flags, Error **errp)
>> +{
>> +    int ret;
>> +    VMDKSESparseConstHeader const_header;
>> +    VMDKSESparseVolatileHeader volatile_header;
>> +    VmdkExtent *extent;
>> +
>> +    if (flags & BDRV_O_RDWR) {
>> +        error_setg(errp, "No write support for seSparse images available");
>> +        return -ENOTSUP;
>> +    }
> Kind of works for me, but why not bdrv_apply_auto_read_only() like I had
> proposed?  The advantage is that this would make the node read-only if
> the user has specified auto-read-only=on instead of failing.
> 

Ah, I had not realized that bdrv_apply_auto_read_only() is preferred.
I’ll send a v3.
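
To spell out the behavioural difference, here is a toy model. It is not
QEMU code: ToyNode, the TOY_O_* flags and both functions are
hypothetical stand-ins, and the error codes are only illustrative. It
contrasts the check in this version of the patch with a fallback that
downgrades the node to read-only when the user asked for
auto-read-only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the open flags; values are arbitrary. */
enum {
    TOY_O_RDWR        = 1 << 0,
    TOY_O_AUTO_RDONLY = 1 << 1
};

/* Hypothetical stand-in for a block node. */
typedef struct {
    int open_flags;
    bool read_only;
} ToyNode;

/* Behaviour of the hunk quoted above: any read-write open fails. */
static int toy_open_strict(const ToyNode *n)
{
    assert(n != NULL);
    return (n->open_flags & TOY_O_RDWR) ? -ENOTSUP : 0;
}

/* Fallback behaviour: downgrade to read-only if the user allowed it. */
static int toy_open_auto_read_only(ToyNode *n)
{
    assert(n != NULL);
    if (!(n->open_flags & TOY_O_RDWR)) {
        return 0;                   /* already read-only, nothing to do */
    }
    if (!(n->open_flags & TOY_O_AUTO_RDONLY)) {
        return -EACCES;             /* user insisted on read-write */
    }
    n->open_flags &= ~TOY_O_RDWR;   /* silently fall back to read-only */
    n->read_only = true;
    return 0;
}
```

In this model, a node opened with both flags set fails under the strict
check but opens read-only under the fallback.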

Sam

> Max
