Hello hackers!

This email is about a proposal to implement OLTP-oriented compression in 
PostgreSQL.

There are currently some alternative ways to achieve compression, such as 
using a file system that supports compression. However, that approach depends 
on a specific file system, is not suitable for all deployment environments, 
and increases the complexity of deployment and maintenance.

I hope this compression work can meet the following goals:
1. In most scenarios, the compressed size of a table can be lower than 50% of 
the original table.
2. It is mainly oriented to OLTP scenarios, so the performance impact on 
workloads with frequent reads and writes is relatively small.
3. It does not rely on special external software or hardware that is difficult 
to obtain.
4. It is friendly to application developers and database administrators.
5. The changes required to PostgreSQL itself are small.


I have noticed that there has been some discussion and work related to 
compression before, but it does not meet the above goals. For example:
1. Using sparse files [1]
   This is also how MySQL 5.7 implements transparent page compression. 
However, sparse files may generate a lot of fragmentation inside the file 
system, and the "compressed" data files will be restored to their original 
size after a physical backup and restore, unless the backup and recovery tools 
also support sparse files.
2. Using the table access method interface (TAM) (pg_cryogen, zedstore) [2] [3]
   Neither storage engine is geared towards OLTP scenarios. I would prefer to 
make relatively small modifications to the existing heap engine code (mainly 
by modifying md.c and fd.c).


The methods proposed by Postgres Pro Enterprise CFS [4] and Nikolay P [5] are 
close to my needs.
However, I would like to propose a slightly different implementation that does 
not require space reclamation. I hope to get suggestions on it.

# Premise assumptions
1. Most of the pages in a compressed table can be compressed to less than 50% 
of their original size.
   As long as an algorithm with a relatively high compression ratio (such as 
zlib or zstd) is used, this point should be easy to meet, unless the table 
stores already-compressed data such as pictures.
2. The compression ratios of most pages in the same table are relatively close.


# Page storage

Three files are used to store the compressed data of each segment of each main 
fork. The main fork segment file (for example, 123456.2) still exists, but its 
size is 0.

- Compressed data file (for example, 123456.2.cd)
   Used to store the compressed pages. The block size of this file is 
configurable at the table level, but it can only be 1/2, 1/4 or 1/8 of BLOCKSZ.
- Compress overflow address file (for example, 123456.2.coa)
   When a page cannot be compressed to less than the compressed block size, 
this file is used to store the addresses of its overflow blocks.
- Compress overflow data file (for example, 123456.2.cod)
   When a page cannot be compressed to less than the compressed block size, 
this file is used to store its overflow blocks.

The following is an example when the compressed block size is 4K, which is 1/2 
of BLOCKSZ.

## Scenario 1: The compressed size of the original page (including the header 
of the compressed page) is less than or equal to the compressed block size (4KB)

Compressed data files(123456.2.cd):
 0       1       2
 +=======+=======+=======+
 | data0 | data1 | data2 |  
 +=======+=======+=======+
->|   4K  |<-
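
To make the mapping concrete, here is a minimal sketch of how a heap block 
number could be translated to its slot offset in the compressed data file. 
The function name cd_file_offset is hypothetical; RELSEG_SIZE is PostgreSQL's 
blocks-per-segment constant, and cblksz is the table-level compressed block 
size.

#include <stdint.h>

#define RELSEG_SIZE 131072      /* heap blocks per 1GB segment (8KB pages) */

/*
 * Each heap block owns exactly one slot of cblksz bytes in the .cd file of
 * its segment, so the file is addressed like the main fork, only with
 * smaller blocks.  cblksz is BLOCKSZ/2, /4 or /8 (here: 4096).
 */
static inline uint64_t
cd_file_offset(uint32_t blkno, uint32_t cblksz)
{
    uint32_t block_in_segment = blkno % RELSEG_SIZE;

    return (uint64_t) block_in_segment * cblksz;
}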


## Scenario 2: The compressed size of the original page (including the header 
of the compressed page) is larger than the compressed block size (4KB)

If the compressed size of the original page (page 3 below) is greater than 4KB, 
it will not be compressed. The first 4KB of the original page is stored in the 
compressed data file, and the last 4KB is stored in the compress overflow data 
file.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 4K  |<-

Compress overflow address file(123456.2.coa):
The compress overflow address file stores, for each block, the block number of 
the compress overflow block assigned to it plus 1.
The compressed block size and the number of blocks allocated in the compress 
overflow data file are stored in the head of the compress overflow address 
file.

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       |   1   |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file:          |
      _______________________________|
     |
 0   |    1       2       3
 +===|=====+=========+==========+=========+
 | data3_2 |         |          |         |
 +=========+=========+==========+=========+
->| 2nd 4K  |<-
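
For illustration, the head and entries of the compress overflow address file 
could look roughly like the structs below. The struct and field names are 
hypothetical; only the semantics (an entry stores the overflow block number 
plus 1, so 0 means "not allocated") follow the description above.

#include <stdint.h>

/* Hypothetical head of the .coa file: the table-level compressed block size
 * and the number of blocks already allocated in the .cod file. */
typedef struct
{
    uint16_t cblksz;      /* compressed block size: BLOCKSZ/2, /4 or /8 */
    uint32_t allocated;   /* blocks currently allocated in the .cod file */
} CoaFileHead;

/* Hypothetical per-heap-block entry: the overflow blocks assigned to this
 * block, each stored as "block number + 1" so that 0 means unallocated.
 * With cblksz = BLOCKSZ/8 a page may need up to 7 overflow blocks; with
 * cblksz = BLOCKSZ/2 (this example) only the first slot is ever used. */
typedef struct
{
    uint32_t overflow[7];
} CoaFileEntry;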


If the compressed block size is 1/4 or 1/8 of BLOCKSZ, each block that cannot 
be compressed into a single compressed block may require multiple compress 
overflow blocks.
The following is an example when the compressed block size is 2K, which is 1/4 
of BLOCKSZ.

## Scenario 3: The compressed size of the original page (including the header 
of the compressed page) is larger than 2KB (the compressed block size) but 
less than 6KB (BLOCKSZ - compressed block size)

In this case, the data files store the compressed data, and at least 2KB of 
storage space is still saved.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 2K  |<-

Compress overflow address file(123456.2.coa):

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       | 1,2   |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file :         |
      _______________________________|
     |
 0   |     1         2          3
 +===|=====+=========+==========+=========+
 | data3_2 | data3_3 |          |         |
 +=========+=========+==========+=========+
 | 2nd 2K  | 3rd 2K  |          |         |
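
As a sanity check on the arithmetic: the number of overflow blocks needed is 
just the ceiling of the stored size over the compressed block size, minus the 
one slot in the compressed data file. A small helper (hypothetical name) could 
look like this; for a 5KB compressed page and cblksz = 2KB it returns 2, 
matching the picture above.

#include <stdint.h>

/*
 * Overflow blocks needed for a page whose stored data occupies store_size
 * bytes, given that one cblksz-sized slot already exists in the .cd file.
 * For Scenario 3 store_size is the compressed size (header included); for
 * Scenarios 2 and 4 it is the full BLOCKSZ of the original page.
 */
static inline uint32_t
overflow_blocks_needed(uint32_t store_size, uint32_t cblksz)
{
    uint32_t total_blocks = (store_size + cblksz - 1) / cblksz;  /* ceil */

    return (total_blocks > 1) ? total_blocks - 1 : 0;
}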


## Scenario 4: The compressed size of the original page (including the header 
of the compressed page) is larger than 6KB (BLOCKSZ - compressed block size)

In this case, the data files store the original data (8KB), the same as in 
Scenario 2.

Compressed data files(123456.2.cd):

 0       1       2       3
 +=======+=======+=======+=========+
 | data0 | data1 | data2 | data3_1 |
 +=======+=======+=======+=========+
                       ->| 1st 2K  |<-

Compress overflow address file(123456.2.coa):

         0       1       2       3
 +=======+=======+=======+=======+=======+
 | head  |       |       |       | 1,2,3 |
 +=======+=======+=======+=======+===|===+
                                     |
                                     |
Compress overflow data file :         |
      _______________________________|
     |
 0   |     1         2          3
 +===|=====+=========+==========+=========+
 | data3_2 | data3_3 | data3_4  |         |
 +=========+=========+==========+=========+
 | 2nd 2K  | 3rd 2K  | 4th 2K   |         |
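
Putting the four scenarios together, the write path boils down to a three-way 
decision. The sketch below only illustrates that decision; the enum and 
function names are hypothetical.

#include <stdint.h>

#define BLOCKSZ 8192

typedef enum
{
    STORE_COMPRESSED_SINGLE,    /* Scenario 1: one compressed block */
    STORE_COMPRESSED_OVERFLOW,  /* Scenario 3: compressed data + overflow */
    STORE_ORIGINAL              /* Scenarios 2 and 4: original 8KB page */
} PageStoreKind;

/*
 * compressed_size already includes the compressed page header.  Storing
 * compressed data is only worthwhile while it saves at least one compressed
 * block, i.e. while compressed_size <= BLOCKSZ - cblksz.
 */
static inline PageStoreKind
choose_store_kind(uint32_t compressed_size, uint32_t cblksz)
{
    if (compressed_size <= cblksz)
        return STORE_COMPRESSED_SINGLE;
    if (compressed_size <= BLOCKSZ - cblksz)
        return STORE_COMPRESSED_OVERFLOW;
    return STORE_ORIGINAL;
}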


# How to distinguish between compressed and uncompressed blocks in compressed 
data files?

The PostgreSQL heap page has a uniform header. At first, I considered adding 
compression-related flags to that header.
However, there is a problem: when the compressed size of a page changes so 
that it switches from the compressed format to the uncompressed format, or 
vice versa, the header of the original page has to be modified, which requires 
not only recalculating the checksum but also updating the buffer.

However, I noticed that the first 4 bytes of each page are the high part of 
pd_lsn.
Therefore, an oversized LSN value that cannot appear in a real environment can 
be used as a magic number marking a compressed page.
The envisaged compressed page header is defined as follows:

typedef struct
{
    uint32  magic;      /* compressed page magic number, must be 0xfffffffe */
    uint8   algorithm;  /* 1=pglz, 2=zstd ... */
    uint8   flag;       /* reserved */
    uint16  size;       /* size after compression */
} PageCompressHead;
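
At read time, the first 4 bytes of the stored block are enough to tell the two 
formats apart. A minimal sketch, assuming the magic value 0xfffffffe from the 
header above (the constant and function names are mine):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_COMPRESS_MAGIC 0xfffffffe

/*
 * The first 4 bytes of an ordinary page are the high 32 bits of pd_lsn,
 * which (per the assumption above) can never be 0xfffffffe in practice.
 * So this value marks a block that starts with PageCompressHead instead
 * of the normal PageHeaderData.
 */
static inline bool
page_is_compressed(const char *block)
{
    uint32_t magic;

    memcpy(&magic, block, sizeof(magic));
    return magic == PAGE_COMPRESS_MAGIC;
}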


# How to manage block space in compress overflow data files?

Once overflow block x in the compress overflow data file has been allocated to 
block a, it always belongs to block a, even if block a later compresses to a 
smaller size and overflow block x is no longer used.

This approach simplifies the space management of compress overflow blocks, but 
fragmentation may occur, especially when the compressed block size is 1/4 or 
1/8 of BLOCKSZ.
However, fragmentation only appears in scenarios where the compressed size of 
the same block changes greatly and frequently.

Consider the following situation: if only one record is inserted into a page 
before it is written to disk, the compression ratio will be very high, and 
only one compressed block is required.
As new rows are written later, the number of compressed blocks required grows 
to 2, 3, 4, ... These overflow blocks are not allocated at one time, so they 
are likely not sequential in the compress overflow data file, resulting in 
more fragmentation.

We can avoid this problem by setting a table-level number of pre-allocated 
compressed blocks.
When the number of compressed blocks required by a page after compression is 
less than this value, space is allocated according to the pre-allocated 
number, as sketched below.
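
A minimal sketch of that allocation rule, assuming a hypothetical table-level 
setting prealloc_chunks:

#include <stdint.h>

/*
 * Number of overflow blocks to allocate for a page, given how many it needs
 * right now.  Allocating up to prealloc_chunks blocks in one go keeps a
 * page's overflow blocks contiguous even if its compressed size grows later.
 */
static inline uint32_t
overflow_blocks_to_allocate(uint32_t needed, uint32_t prealloc_chunks)
{
    return (needed < prealloc_chunks) ? prealloc_chunks : needed;
}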

And no matter how severe the fragmentation is, the total space occupied by the 
compressed table cannot exceed that of the original uncompressed table.

# Impact on other parts of PostgreSQL?
1. pg_basebackup / pg_checksums needs to handle checksum verification 
according to the new compression format.
2. pg_rewind needs to handle the copying of data blocks according to the new 
compression format.

# Problems
This solution simplifies storage space management, but it also has the 
following defects:
1. The space saved by compression is limited by the compressed block size.
   For example, when the compressed block size is set to 4KB, at most 50% of 
the space can be saved.
   For insert-only tables, the compressed block size can be set to BLOCKSZ / 8 
to alleviate this problem; but for write-heavy scenarios, this easily 
generates fragmentation and increases the number of IOs.
2. When accessing a page that can be compressed into a single compressed 
block, only one IO is required; but when accessing a page that cannot, 
multiple IOs are required.
   Generally it is 3; since the address file is very small and should be 
almost entirely in memory, it is 2 if the address file is not counted.


I think the above issues are a necessary trade-off. Any suggestions are welcome.

# References
[1] https://www.postgresql.org/message-id/flat/op.ux8if71gcigqcu%40soyouz
[2] 
https://www.postgresql.org/message-id/CAONYFtNDNghfatnrhOJMOT=BXNbEiobHFABt2sx_cn2=5t=1...@mail.gmail.com
[3] 
https://www.postgresql.org/message-id/CALfoeiuF-m5jg51mJUPm5GN8u396o5sA2AF5N97vTRAEDYac7w%40mail.gmail.com
[4] https://postgrespro.com/docs/enterprise/9.6/cfs
[5] 
https://www.postgresql.org/message-id/flat/11996861554042351%40iva4-dd95b404a60b.qloud-c.yandex.net


Best Regards
Chen Hujaun
