On Thu, Aug 24, 2023 at 6:40 PM pslawek83 <[email protected]> wrote:
>
> Hi Marko, thanks for the response ... so for the DOCS part
>
> > https://mariadb.com/kb/en/innodb-page-compression/
> > When InnoDB page compression is used, InnoDB may still write the compressed 
> > page to the tablespace file with the original size of the uncompressed 
> > page, which would be equivalent to the value of the innodb_page_size system 
> > variable. This is done by design, because when InnoDB's I/O code needs to 
> > read the page from disk, it can only read the full page size. However, this 
> > is obviously not optimal.
> > This punch hole technique allows InnoDB to read the compressed page from 
> > disk as the full page size, even though the compressed page really takes up 
> > less space on the file system.
>
> So as far as I understand, if compression is enabled the server will write e.g. a 
> 40% utilized page which will later be decompressed when read, but this 
> compression/decompression won't have any advantage if there's no punch hole 
> support. The only difference will be that the block will be filled with zeros 
> for the empty space at the end. Or is it possible for PAGE_COMPRESSION to 
> still be used without punch hole (just not that effective, for example)?

The actual savings will depend on the implementation details and
configuration parameters of the underlying file system. The file
system that I have most everyday experience with is Linux ext4fs. If
you had suitable storage that can support atomic 16KiB writes (or
whatever your innodb_page_size is), you might want to do mke2fs -b
16384 and use innodb_flush_method=O_DIRECT, to safely set
innodb_doublewrite=OFF. But, with this setting, I am pretty sure that
you would not be able to get any savings from using PAGE_COMPRESSED,
because any FALLOC_FL_PUNCH_HOLE request that is smaller than the file
system block size would simply be ignored.
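
To make the granularity issue concrete, here is a minimal sketch of how
a hole would be punched on Linux with fallocate(2). This is not taken
from the InnoDB code base; the file name and sizes are just examples.
The point is that the kernel only deallocates whole file system blocks
inside the requested range, so a request smaller than one block leaves
the allocation unchanged:

#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
  enum { PAGE_SIZE = 16384, PAYLOAD = 6554 }; /* ~40% of a 16 KiB page */
  int fd = open("pagefile", O_CREAT | O_RDWR | O_TRUNC, 0600);
  if (fd < 0) { perror("open"); return 1; }

  /* Write one full 16 KiB page of data. */
  char page[PAGE_SIZE];
  memset(page, 0, sizeof page);
  if (pwrite(fd, page, sizeof page, 0) != (ssize_t) sizeof page)
  { perror("pwrite"); return 1; }

  /* Punch a hole over the unused tail of the page. Only whole file
     system blocks within this range are deallocated; partial blocks
     are merely zeroed. */
  if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                PAYLOAD, PAGE_SIZE - PAYLOAD))
    perror("fallocate(FALLOC_FL_PUNCH_HOLE)");

  struct stat st;
  if (fstat(fd, &st) == 0)
    printf("logical size %lld, allocated %lld bytes\n",
           (long long) st.st_size, (long long) st.st_blocks * 512);
  close(fd);
  return 0;
}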

The ext4 file system block size must be a power of 2 between 1 and 64
kilobytes. 40% of a 16384-byte page would be 6554 bytes. If the file
system was created with the minimum 1024-byte block size, then the
page would occupy 7 KiB, or 43.75% of the logical size. With a block
size of 2048, 4096, or 8192 bytes, the allocation would be rounded up
to 8 KiB, or 50%; with a 16 KiB block size there would be no savings
at all, as noted above.
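
The arithmetic is just a round-up of the payload to a whole number of
file system blocks; a tiny illustration:

#include <stdio.h>

int main(void)
{
  const unsigned page_size = 16384, payload = 6554; /* 40% of 16 KiB */
  for (unsigned block = 1024; block <= 16384; block *= 2)
  {
    /* Allocation is the payload rounded up to whole blocks. */
    unsigned allocated = (payload + block - 1) / block * block;
    printf("block size %5u: %5u bytes allocated (%.2f%% of the page)\n",
           block, allocated, 100.0 * allocated / page_size);
  }
  return 0;
}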

I do not know how much space the fragmentation of files occupies in
the file system metadata, but it does slow down file access, as you
can see based on reports like
https://jira.mariadb.org/browse/MDEV-19554 (for ext4fs). I think that
there has been a similar report for NTFS as well.

I do not know if there are any Linux file systems that would
categorically reject FALLOC_FL_PUNCH_HOLE. My common sense would
suggest that file systems that support snapshots or are implemented as
copy-on-write will have some limitations regarding sparse files. In
fact, there is the open bug https://jira.mariadb.org/browse/MDEV-25734
where a difference between ext4fs and xfs was demonstrated.

> Also tested this row compression, and yes page compression seems much better.

Actually, the InnoDB ROW_FORMAT=COMPRESSED that I implemented between
2005 and 2007 does not compress individual rows or records, but pages
of them. The name is somewhat misleading. Because at that time the SQL
parser was outside our control, we had to reuse some existing syntax,
which I think had been used for read-only MyISAM files that had been
prepared with some tool. A design constraint of that format was that
the fill factor of the uncompressed B-tree pages (which are in a
format equivalent to ROW_FORMAT=DYNAMIC) would be limited in order to
avoid overflows of the corresponding compressed pages.

I wanted to remove the code to deal with ROW_FORMAT=COMPRESSED,
because it complicates the buffer pool data structures. In fact, if
both the adaptive hash index and ROW_FORMAT=COMPRESSED could be
removed, the size of buffer pool page descriptors could be shrunk to
100 or 104 bytes, from the current size of 160 bytes on 64-bit ISAs.
Also, recently I learned about cases where some original design
decisions turned out to be wrong, causing overflow of the compressed
pages. The way I resolved https://jira.mariadb.org/browse/MDEV-30882
is only a partial fix that is not guaranteed to work in all cases.
There are users who seem to be happy with this format, maybe because
there are so many issues with PAGE_COMPRESSED at so many levels, and
because compressed BLOBs
(https://jira.mariadb.org/browse/MDEV-11371), which the InnoDB
ROW_FORMAT=COMPRESSED would handle in a better way, have not been
stable enough for production use.

>Is it possible to get rid of punch hole by just re-mapping the pages inside 
>these innodb files?

Back in 2004 or 2005, I was searching for information on how or
whether Stacker or equivalent tools had solved this problem. In
https://en.wikipedia.org/wiki/DriveSpace some corruption related to
fragmented files is mentioned. Maybe it was not designed for any
update-in-place usage. I concluded back then that also compression
algorithms would have a hard time to support any update-in-place
operations. Even if such algorithms existed, they would likely have to
maintain some ephemeral data structures, which can be much larger than
the uncompressed input or the compressed output. If I remember
correctly, zlib could require more than 256 kilobytes of memory for a
single compression context, even though its input buffer size is only
32 kilobytes. If there are many actively modified pages, you would
need much more RAM for the compression algorithm than the actual
uncompressed or compressed data. One thing has not changed over the
years: RAM is much more expensive than persistent storage. Some people
are using SSDs as a less expensive alternative to RAM.
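
For reference, zlib's zconf.h documents the deflate memory footprint as
roughly (1 << (windowBits + 2)) + (1 << (memLevel + 9)) bytes per
stream, plus a few kilobytes of fixed state; with the defaults
windowBits=15 (the 32 KiB window) and memLevel=8 that is about 256 KiB
per compression context. A small sketch of that calculation:

#include <stdio.h>
#include <zlib.h>

/* Approximate per-stream deflate memory use, per the formula in
   zlib's zconf.h: (1 << (windowBits+2)) + (1 << (memLevel+9)) bytes,
   plus a few kilobytes for small objects. */
static unsigned long deflate_mem(int window_bits, int mem_level)
{
  return (1UL << (window_bits + 2)) + (1UL << (mem_level + 9));
}

int main(void)
{
  z_stream strm = {0};
  /* Defaults: windowBits=15 (32 KiB sliding window), memLevel=8. */
  if (deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                   15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
    return 1;
  printf("approx. %lu KiB per deflate context\n", deflate_mem(15, 8) >> 10);
  deflateEnd(&strm);
  return 0;
}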

To avoid the need to recompress pages on every small modification
(which would be extremely slow), I came up with two ideas. Some fields
in the uncompressed page (page header, parts of record headers,
transaction IDs, BLOB pointers, child page numbers) are also stored
uncompressed in the compressed page, so that they can simply be
updated in place. Modifications to the rest of the data may be covered
by a log that will be stored in an otherwise unused area of the
compressed page. Only when the modification log runs out of space do
we recompress the entire page.
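
As a rough illustration of the second idea, one can think of the
compressed page as a compressed stream growing from one end and a
modification log growing from the other; only when the two would meet
does the whole page get recompressed. The following is a simplified
sketch in C, not the actual InnoDB page format or code:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

enum { ZIP_SIZE = 8192 };  /* hypothetical compressed page size */

/* Simplified model: compressed data grows from the start of the frame,
   the modification log ("mlog") grows from the end. */
typedef struct {
  uint8_t  frame[ZIP_SIZE];
  uint32_t data_len;   /* bytes used by the compressed stream */
  uint32_t mlog_len;   /* bytes used by the modification log */
} zip_page;

/* Append a small change record to the mlog instead of recompressing.
   Returns false when the log would collide with the compressed data;
   in that case the caller must recompress the entire page. */
static bool zip_page_log(zip_page *page, const void *rec, uint32_t len)
{
  if (page->data_len + page->mlog_len + len > ZIP_SIZE)
    return false;                    /* out of space: recompress */
  page->mlog_len += len;
  memcpy(page->frame + ZIP_SIZE - page->mlog_len, rec, len);
  return true;
}

int main(void)
{
  zip_page page = {{0}, 1234, 0};    /* pretend 1234 bytes are compressed */
  const char rec[] = "example change record";
  return zip_page_log(&page, rec, sizeof rec) ? 0 : 1;
}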

>I was thinking about this... if it's possible to change the data structure to 
>divide the file into 512b blocks

While a 512-byte logical block size is still rather common, the
underlying physical block size is often 4096 bytes. Users of MariaDB
Server 10.11 should benefit
from MDEV-14425, which finally allows the redo log to be written
efficiently on such storage.
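
One way to see what the storage reports on Linux is the BLKSSZGET and
BLKPBSZGET ioctls on the block device (the same numbers are also in
/sys/block/*/queue/); a small sketch, with the device path being just
an example:

#include <fcntl.h>
#include <linux/fs.h>   /* BLKSSZGET, BLKPBSZGET */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
  const char *dev = argc > 1 ? argv[1] : "/dev/sda"; /* example device */
  int fd = open(dev, O_RDONLY);
  if (fd < 0) { perror(dev); return 1; }

  int logical = 0;
  unsigned int physical = 0;
  if (ioctl(fd, BLKSSZGET, &logical) || ioctl(fd, BLKPBSZGET, &physical))
    perror("ioctl");
  else
    printf("%s: logical %d bytes, physical %u bytes\n",
           dev, logical, physical);
  close(fd);
  return 0;
}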

I think that it would be better to design a format that wastes less
space to begin with. At the same time, operations on the keys and
access to records should remain reasonably fast, because we would not
want to waste expensive RAM by introducing an additional layer of a
record cache.

As Gordan has said in this thread, you might just let the file
system handle compression if you need it. But, there is no free lunch.
I suppose that ZFS would not support O_DIRECT. In any case, with file
system compression, page writes would become more than a simple matter
of sending the data to a DMA controller. You could also let the
storage layer handle compression. I was really impressed by the
performance of ScaleFlux when we tested it some years ago.

Marko
-- 
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc
_______________________________________________
discuss mailing list -- [email protected]
To unsubscribe send an email to [email protected]
