Hi Marko, thanks for the response... So for the docs part, from
https://mariadb.com/kb/en/innodb-page-compression/:

> When InnoDB page compression is used, InnoDB may still write the
> compressed page to the tablespace file with the original size of the
> uncompressed page, which would be equivalent to the value of the
> innodb_page_size system variable. This is done by design, because when
> InnoDB's I/O code needs to read the page from disk, it can only read
> the full page size. However, this is obviously not optimal.

> This "punch hole" technique allows InnoDB to read the compressed page
> from disk as the full page size, even though the compressed page
> really takes up less space on the file system.

So as far as I understand, if compression is enabled the server will
write e.g. a 40% utilized page, which will later be decompressed when
read, but this compression/decompression won't have any advantage if
there's no punch hole support. The only difference will be that the
block will be filled with zeros for the empty space at the end. Or is it
possible for PAGE_COMPRESSION to still be used without punch hole (just
not as effective, for example)?
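
Just to check my mental model of the write path, this is roughly how I
picture a page_compressed write with punch hole on Linux (a minimal
sketch using plain pwrite()/fallocate(), not MariaDB's actual I/O code;
the function name and the 16 KiB page / 4 KiB filesystem block sizes are
my own assumptions):

/* Minimal sketch, not MariaDB code: write the compressed payload at the
 * page's usual 16 KiB slot, then release the unused tail of the slot
 * with FALLOC_FL_PUNCH_HOLE so the filesystem stops storing it. */
#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* (with _GNU_SOURCE) */
#include <unistd.h>     /* pwrite() */
#include <stdint.h>
#include <stdio.h>

enum { PAGE_SIZE = 16384, FS_BLOCK = 4096 };   /* assumed sizes */

int write_compressed_page(int fd, uint64_t page_no,
                          const void *buf, size_t comp_len)
{
    off_t offset = (off_t)page_no * PAGE_SIZE;

    if (pwrite(fd, buf, comp_len, offset) != (ssize_t)comp_len)
        return -1;

    /* Round the used part up to a filesystem block; everything after it
     * inside the 16 KiB slot can become a hole.  KEEP_SIZE leaves the
     * logical file size unchanged, so a later read of the full page
     * still works and the hole simply reads back as zeros. */
    off_t used = (off_t)((comp_len + FS_BLOCK - 1) / FS_BLOCK) * FS_BLOCK;
    if (used < PAGE_SIZE &&
        fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  offset + used, PAGE_SIZE - used) != 0) {
        /* No punch-hole support: the page keeps its full 16 KiB of
         * allocated space on disk, so no space is actually saved. */
        perror("fallocate(PUNCH_HOLE)");
        return 1;
    }
    return 0;
}

Is that roughly right, i.e. without punch-hole support the page would
still decompress fine on read, just without any space saving?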
Also, I tested this row compression, and yes, page compression seems
much better. Is it possible to get rid of punch hole by just re-mapping
the pages inside these InnoDB files? I was thinking about this... if
it's possible to change the data structure to divide the file into
512-byte blocks, where the beginning of each block marks the page number
(0xffffffff for blocks which are a continuation of the previous page).
To grow a page, we'd write it again as a new run of blocks with the same
page number at the end of the file; later, the contiguous recycled
blocks can be added to a free list and used to store smaller pages. To
make this atomic, the whole block would first be allocated and written
with 0xfffffffe as a dummy id, and then this placeholder could be
replaced with the real page number. The problem with this could be that
the file would get 100% fragmented over time (but it would probably
still work OK on NVMe). The mapping could be rebuilt by reading the
whole datafile during server start, and it would need about 125 MB of
memory per 1 TB of compressed data to keep the mapping in memory for
fast access.
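
Roughly what I have in mind, as a plain C sketch (completely
hypothetical; the names and the 4-byte header are my own assumptions,
nothing like this exists in the current InnoDB code):

/* Hypothetical layout sketch: the data file is divided into 512-byte
 * blocks, and the first 4 bytes of each block say which page it belongs
 * to.  Blocks that continue the previous page carry 0xffffffff, and
 * freshly allocated blocks carry 0xfffffffe until their real page
 * number is written, to keep allocation atomic. */
#include <stdint.h>

enum {
    BLOCK_SIZE    = 512,
    BLOCK_HDR     = 4,
    BLOCK_PAYLOAD = BLOCK_SIZE - BLOCK_HDR
};

#define CONTINUATION 0xffffffffu   /* block continues the previous page */
#define PLACEHOLDER  0xfffffffeu   /* allocated, page number not yet set */

struct block {
    uint32_t page_no;              /* or CONTINUATION / PLACEHOLDER     */
    uint8_t  payload[BLOCK_PAYLOAD];
};

/* On server start, scan the whole file once to rebuild the in-memory
 * page -> first-block mapping (this is the part I estimated at roughly
 * 125 MB of memory per 1 TB of compressed data). */
void rebuild_mapping(const struct block *blocks, uint64_t n_blocks,
                     uint64_t *first_block_of_page /* indexed by page_no */)
{
    for (uint64_t i = 0; i < n_blocks; i++) {
        uint32_t p = blocks[i].page_no;
        if (p != CONTINUATION && p != PLACEHOLDER)
            first_block_of_page[p] = i;  /* in this simplified sketch, a
                                            later block with the same page
                                            number just overrides a stale
                                            earlier one */
    }
}

(This glosses over how stale runs of a relocated page would be
invalidated, the free list, and so on.)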
What do you think? Would it be easy to add something like this on top of
the current code (if there's any API, maybe, so that I could do it
myself, for example), and would it even make sense?

Best,
Slawomir

On 21 August 2023 at 07:08, Marko Mäkelä <[email protected]> wrote:

On Sun, Aug 20, 2023 at 6:12 PM pslawek83 via discuss
<[email protected]> wrote:
> Thanks everyone for the answers. Tested this PAGE_COMPRESSION a bit and
> did some reading. Not sure if I fully understand how that works... is
> this correct?
>
> 1. It's impossible without punch hole, because if we don't have it the
> server will just write partially used pages which take the full page
> size anyway. But the docs seem to describe its use with punch hole
> disabled as well. What's the reason to still have page compression
> without punch hole?

Can you please point to the misleading part of the documentation so
that it can be corrected?
> 2. After enabling compression each page needs to be trimmed using punch
> hole, so if we have a 1 TB database and 64k blocks, the database files
> will have over 15 million fragments (1000000000000/64000).

That is correct.

> So if we get like 50% compression and then we're able to put another
> 1 TB of data into these holes, we'll be having close to 31 million file
> fragments. I'm not sure whether under such conditions the FS will even
> be able to allocate space efficiently anymore, copy these files, etc.
> Or am I not getting something?
For what it is worth, around the same time when page_compressed was
being developed for MariaDB Server 10.1, something similar was being
worked on at Oracle, I think for MySQL 5.7. To my understanding, the
same hardware partner worked with both companies. The obvious answer to
https://smalldatum.blogspot.com/2015/10/wanted-file-system-on-which-innodb.html
would have been https://en.wikipedia.org/wiki/Fusion-io_NVMFS if it had
worked without kernel panics or existed in a mainline kernel. I don't
even know if it ever was open sourced.
> 3. I was testing compression of a ~1.5 GB table with nice results on
> the default 16 KB page size. One thing I noticed is that I was always
> getting exactly the same compressed size using `du`. However, the table
> read/copy times varied, from 40 s uncompressed to 80-400 seconds
> depending on the algorithm. Is it possible that this is because the
> page size is too low?

As far as I remember, most compression algorithms that are derivatives
of the LZ77 family have an input buffer size of 32 KiB. From that point
of view, it would seem to make sense to use innodb_page_size=32k or
innodb_page_size=64k instead of the default. Furthermore, I think that
it was a mistake to implement bzip2 support at all. That algorithm would
work with much larger buffer sizes.
> 4. MyRocks seems to be very slow compared to compressed InnoDB. I was
> also reading the docs related to it. For its use case, is it more
> suitable for storing large rows rather than many small rows?

To my understanding, the LSM trees that MyRocks is built on work best in
an insert-only workload. On top of the LSM trees, MyRocks implements
some caches and Bloom filters. If you have few updates or deletes and
most reads are accessing recently inserted data, it could work.
> 5. I remember reading that some InnoDB compression is going to be made
> obsolete. Is that row compression? Or are there no plans to obsolete
> any compression model?

I wanted to deprecate and remove the ROW_FORMAT=COMPRESSED that I
designed and implemented in the InnoDB Plugin for MySQL 5.1, based on
some high level ideas of Heikki Tuuri, between 2005 and 2007.
https://jira.mariadb.org/browse/MDEV-22367 was the first step towards
that. Based on community feedback, this won't be happening. Even after
some data structure cleanup that took place in MariaDB Server 10.6,
supporting the format incurs an overhead of some 32 to 40 bytes per
buffer page descriptor, which is not insignificant. Removing both
ROW_FORMAT=COMPRESSED and the adaptive hash index would shrink the block
descriptor to 80 bytes from its current 10.6 size of 160 bytes. The
write performance with ROW_FORMAT=COMPRESSED should have degraded when I
implemented the easier-to-parse log record format in MDEV-12353 (MariaDB
Server 10.5). There are open bugs about ROW_FORMAT=COMPRESSED, such as
https://jira.mariadb.org/browse/MDEV-31574. I would say that InnoDB
ROW_FORMAT=COMPRESSED seemed like a good idea back when it was
implemented, when HDD storage was the only practical option. Just like
the InnoDB change buffer, which I removed in MariaDB Server 11.0 after
having to deal with too many cases of corruption:
https://fosdem.org/2023/schedule/event/innodb_change_buffer/

With best regards,

Marko

--
Marko Mäkelä, Lead Developer InnoDB
MariaDB plc