Re: libpq compression (part 2)

2022-03-03 Thread Daniil Zakhlystov
Ok, thanks


> On 3 Mar 2022, at 02:33, Justin Pryzby  wrote:
> 
> If there's no objection, I'd like to move this to the next CF for 
> consideration
> in PG16.
> 
> On Mon, Jan 17, 2022 at 10:39:19PM -0600, Justin Pryzby wrote:
>> On Tue, Jan 18, 2022 at 02:06:32AM +0500, Daniil Zakhlystov wrote:
>>>> => Since March, errmsg doesn't need extra parenthesis around it (e3a87b4).
>> 
>>> I’ve resolved the stuck tests and added zlib support for CI Windows builds 
>>> to patch 0003-*.  Thanks
>>> for the suggestion, all tests seem to be OK now, except the macOS one.  It 
>>> won't schedule in Cirrus
>>> CI for some reason, but I guess it happens because of my GitHub account 
>>> limitation.
>> 
>> I don't know about your github account, but it works for cfbot, which is now
>> green.
>> 
>> Thanks for implementing zlib for windows.  Did you try this with default
>> compressions set to lz4 and zstd ?
>> 
>> The thread from 2019 is very long, and starts off with the guidance that
>> compression had been implemented at the wrong layer.  It looks like this 
>> hasn't
>> changed since then.  secure_read/write are passed as function pointers to the
>> ZPQ interface, which then calls back to them to read and flush its 
>> compression
>> buffers.  As I understand, the suggestion was to leave the socket reads and
>> writes alone.  And then conditionally de/compress buffers after reading /
>> before writing from the socket if compression was negotiated.
>> 
>> It's currently done like this
>> pq_recvbuf() => secure_read() - when compression is disabled 
>> pq_recvbuf() => ZPQ => secure_read() - when compression is enabled 
>> 
>> Dmitri sent a partial, POC patch which changes the de/compression to happen 
>> in
>> secure_read/write, which is changed to call ZPQ:  
>> https://www.postgresql.org/message-id/ca+q6zcuprssnars+fyobsd-f2stk1roa-4sahfofajowlzi...@mail.gmail.com
>> pq_recvbuf() => secure_read() => ZPQ
>> 
>> The same thing is true of the frontend: function pointers to
>> pqsecure_read/write are being passed to zpq_create, and then the ZPQ interface
>> is called instead of the original functions.  Those are the functions which read
>> using SSL, so they should also handle compression.
>> 
>> That's where SSL is handled, and it seems like the right place to handle
>> compression.  Have you evaluated that way of doing things?
>> 
>> Konstantin said putting ZPQ at that layer seems to 1) avoid code duplication
>> between client/server; and, 2) allow compression to happen before SSL, to
>> allow both (if the admin decides it's okay).  But I don't see why compression
>> can't happen before sending to SSL, or after reading from it?
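
For illustration only, here is a rough sketch (not the patch's code; all names
below are hypothetical) of that alternative layering, with decompression living
next to the socket/SSL read instead of above it:

#include <stddef.h>
#include <sys/types.h>

/* Hypothetical per-connection decompressor state. */
typedef struct ZpqDecoder
{
    char        raw_buf[8192];  /* compressed bytes read from the socket/SSL */
    size_t      raw_len;
    size_t      raw_pos;
    /* ... compression library stream state would live here ... */
} ZpqDecoder;

/* Stand-ins for the existing SSL/plain read and a streaming decompressor. */
extern ssize_t raw_secure_read(void *port, void *buf, size_t len);
extern ssize_t zpq_decompress(ZpqDecoder *zd, void *dst, size_t dst_len,
                              const void *src, size_t src_len, size_t *consumed);

static ssize_t
secure_read_compressed(void *port, ZpqDecoder *zd, void *dst, size_t dst_len)
{
    size_t      consumed = 0;
    ssize_t     n;

    /* Refill the raw buffer only when the decoder has nothing left. */
    if (zd->raw_pos == zd->raw_len)
    {
        n = raw_secure_read(port, zd->raw_buf, sizeof(zd->raw_buf));
        if (n <= 0)
            return n;           /* EOF or error: report it unchanged */
        zd->raw_len = (size_t) n;
        zd->raw_pos = 0;
    }

    /* Decompress buffered raw bytes directly into the caller's buffer. */
    n = zpq_decompress(zd, dst, dst_len,
                       zd->raw_buf + zd->raw_pos,
                       zd->raw_len - zd->raw_pos, &consumed);
    zd->raw_pos += consumed;
    return n;
}

A real implementation would also have to loop when a compressed frame is split
across socket reads, but pq_recvbuf() and the callers above it would stay
unchanged.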





Re: Commitfest 2021-11 Patch Triage - Part 2

2021-11-16 Thread Daniil Zakhlystov
Hi! It’s been a while since the original patch release. Let me provide a brief 
overview of the current patch status.

The initial approach was to use streaming compression to compress all outgoing 
and decompress all incoming bytes. However, after the long discussion in the 
thread, the initial approach has been changed.

The current implementation allows compressing only specific message types, 
using different compression algorithms for different message types, and 
configuring the allowed compression methods and levels on both the server and 
client sides via a GUC setting and the connection string, respectively.

Also, the current implementation combines (when possible) multiple protocol 
messages into a single CompressedData message for a better compression ratio.

> 
> On 16 Nov 2021, at 01:23, Robert Haas  wrote:
> 
> To me, this feels like an attempt to move the goalposts far enough to
> kill the project. Sure, in a perfect world, that would be nice. But,
> we don't do it anywhere else. If you try to store a JPEG into a bytea
> column, we'll try to compress it just like we would any other data,
> and it may not work out. If you then take a pg_basebackup of the
> database using -Z, there's no attempt made to avoid the CPU overhead of
> compressing those TOAST table pages that contain
> already-compressed data and not the others. And it's easy to
> understand why that's the case: when you insert data into the
> database, there's no way for the database to magically know whether
> that data has been previously compressed by some means, and if so, how
> effectively. And when you back up a database, the backup doesn't know
> which relfilenodes contain TOAST tables or which pages of those
> relfilenodes contain data that is already pre-compressed. In both cases,
> your options are either (1) shut off compression yourself or (2) hope
> that the compressor doesn't waste too much effort on it.
> 
> I think the same approach ought to be completely acceptable here. I
> don't even really understand how we could do anything else. printtup()
> just gets datums, and it has no idea whether or how they are toasted.
> It calls the type output functions which don't know that data is being
> prepared for transmission to the client as opposed to some other
> hypothetical way you could call that function, nor do they know what
> compression method the client wants. It does not seem at all
> straightforward to teach them that ... and even if they did, what
> then? It's not like every column value is sent as a separate packet;
> the whole row is a single protocol message, and some columns may be
> compressed and others uncompressed. Trying to guess what to do about
> that seems to boil down to a sheer guess. Unless you try to compress
> that mixture of compressed and uncompressed values - and it's
> moderately uncommon for every column of a table to even be
> toastable - you aren't going to know how well it will compress. You
> could easily waste more CPU cycles trying to guess than you would have
> spent just doing what the user asked for.
> 

Agree. From my POV, it is OK to use the protocol message type and length to 
decide whether it should be compressed or not. Also, this can be optimized 
later without the need to change the protocol.
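
For illustration only, a minimal sketch of such a per-message check (the type
letters are the wire-protocol ones for DataRow and CopyData; the size threshold
is an arbitrary placeholder, not a value from the patch):

#include <stdbool.h>
#include <stdint.h>

#define COMPRESS_MIN_LEN 60     /* placeholder threshold */

static bool
should_compress_message(char msg_type, uint32_t msg_len)
{
    switch (msg_type)
    {
        case 'D':               /* DataRow */
        case 'd':               /* CopyData */
            return msg_len >= COMPRESS_MIN_LEN;
        default:
            return false;       /* leave short/control messages alone */
    }
}

Because the check only looks at the type byte and length that are already known
when the message is assembled, tuning it later would not require any protocol
change.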

Regarding the LZ4 support patch, it still needs some minor polishing. 
Basically, it only adds LZ4 algorithm support and does not change anything 
fundamentally. So I would appreciate someone doing a review of the current 
patch version.

The original thread is quite long, which I guess makes it hard to catch up with 
the current patch status.
I can make a new one with a detailed summary if that would help.

Thanks,

Daniil Zakhlystov





Re: libpq compression

2021-10-07 Thread Daniil Zakhlystov
Hi, thanks for your fix! I am currently working on implementing the lz4 support 
for libpq compression. Looking forward to posting an update soon.

—
Daniil Zakhlystov

> On 2 Oct 2021, at 02:20, Daniel Gustafsson wrote:
> 
>> On 2 Sep 2021, at 00:29, Daniel Gustafsson wrote:
>> 
>>> On 29 Jul 2021, at 16:57, Daniil Zakhlystov wrote:
>>> 
>>> Forgot to attach the updated patch :)
>> 
>> This fails to build on Windows due to the use of strcasecmp:
>> 
>> +		if (strcasecmp(supported_algorithms[zpq->compressors[i].impl], "zstd") ==
>> 
>> Was that meant to be pg_strcasecmp?
> 
> To keep this thread from stalling, attached is a rebased patchset with the
> above mentioned fix to try and get this working on Windows.
> 
> --
> Daniel Gustafsson		https://vmware.com/
> 
> <0001-Add-zlib-and-zstd-streaming-compression.patch>
> <0002-Implement-libpq-compression.patch>



Re: libpq compression

2021-07-29 Thread Daniil Zakhlystov
" that can be used by a 
> wide range of different modules around the project, such a relaxed way to 
> treat input arguments IMHO can be an evil. Or at least add Assert(ptr) 
> assertions so they can be catched up in debug mode.
> 
> * For zs_create() function which must be called first to create context for 
> further operations, a compression method is passed as integer. This method is 
> used inside zs_create() as index inside an array of compress implementation 
> descriptors. There are several problems:
> 1) Method ID value is not checked for validity. By passing an invalid method 
> ID we can easily get out of array bounds.
> 2) Content of the array depends on configuration options e.g. HAVE_LIBZSTD, 
> HAVE_LIBZ etc. So index inside this array is not persistent. For some config 
> options combination method ID with value 0 may refer to ZSTD and for others 
> to zlib.
> 3) There's no defines/enums/etc in public z_stream.h header that would define 
> which value should be passed to create a specific compressor. Users have to 
> guess it or check the code.

Fixed almost all of these issues, except the malloc/free-related stuff (will 
fix this later).

On Fri, Mar 19, 2021 at 11:02 AM Justin Pryzby  wrote:
> Also, I'm not sure, but I think you may find that the zstd configure.ac should
> use pkgconfig.  This allowed the CIs to compile these patches.  Without
> pkg-config, the macos CI couldn't find (at least) LZ4.
> https://commitfest.postgresql.org/32/2813/
> https://commitfest.postgresql.org/32/3015/


Now --with-zstd uses pkg-config to link the ZSTD library and works correctly on 
macOS.

I would appreciate hearing your thoughts on the new version of the patch,

Daniil Zakhlystov





Re: libpq compression

2021-03-18 Thread Daniil Zakhlystov
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            tested, passed

Hi,

I've compared the different libpq compression approaches in the streaming 
physical replication scenario.

Test setup
Three hosts: the first is used for the pg_restore run, the second is the 
master, and the third is the standby replica.
In each test run, I've run pg_restore of the IMDB database 
(https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/2QYZBT)
and measured the received traffic on the standby replica.

Also, I've enlarged the ZPQ_BUFFER_SIZE buffer in all versions because the 
small default buffer size (8192 bytes) leads to more socket read/write system 
calls and poor compression in the chunked-reset scenario.

Scenarios:

chunked
use streaming compression, wrap compressed data into CompressedData messages 
and preserve the compression context between multiple CompressedData messages.
https://github.com/usernamedt/libpq_compression/tree/chunked-compression

chunked-reset
use streaming compression, wrap compressed data into CompressedData messages 
and reset the compression context on each CompressedData message.
https://github.com/usernamedt/libpq_compression/tree/chunked-reset

permanent
use streaming compression, send raw compressed stream without any wrapping
https://github.com/usernamedt/libpq_compression/tree/permanent-w-enlarged-buffer

Tested compression levels
ZSTD, level 1
ZSTD, level 5
ZSTD, level 9

Scenario        Replica rx, mean, MB
uncompressed    6683.6


ZSTD, level 1
Scenario        Replica rx, mean, MB
chunked-reset   2726
chunked         2694
permanent       2694.3

ZSTD, level 5
Scenario        Replica rx, mean, MB
chunked-reset   2234.3
chunked         2123
permanent       2115.3

ZSTD, level 9
Scenario        Replica rx, mean, MB
chunked-reset   2153.6
chunked         1943
permanent       1941.6

Full report with additional data and resource usage graphs is available here
https://docs.google.com/document/d/1a5bj0jhtFMWRKQqwu9ag1PgDF5fLo7Ayrw3Uh53VEbs

Based on these results, I suggest sticking with the chunked compression 
approach, which introduces more flexibility and adds almost no overhead 
compared to permanent compression.
Also, later we may introduce a setting to control whether we reset the 
compression context in each message, without breaking backward compatibility.

--
Daniil Zakhlystov

The new status of this patch is: Ready for Committer


Re: libpq compression

2021-02-25 Thread Daniil Zakhlystov
Hi, thanks for your review,

> On Feb 22, 2021, at 10:38 AM, Craig Ringer  
> wrote:
> 
> 
> On Thu, 11 Feb 2021, 21:09 Daniil Zakhlystov 
> wrote:
> 
> 3. Chunked compression allows compressing only well-compressible messages and 
> saves CPU cycles by not compressing the others
> 4. Chunked compression introduces some traffic overhead compared to the 
> permanent one (1.2810G vs 1.2761G TX data on pg_restore of the IMDB database 
> dump, according to results in my previous message)
> 5. From the protocol point of view, chunked compression seems a little bit 
> more flexible:
>  - we can inject some uncompressed messages at any time without the need to 
> decompress/compress the compressed data
>  - we can potentially switch the compression algorithm at any time (but I 
> think that this might be over-engineering)
> 
> Chunked compression also potentially makes it easier to handle non blocking 
> sockets, because you aren't worrying about yet another layer of buffering 
> within the compression stream. This is a real pain with SSL, for example. 
> 
> It simplifies protocol analysis.
> 
> It permits compression to be decided on the fly, heuristically, based on 
> message size and potential compressibility.
> 
> It could relatively easily be extended to compress a group of pending small 
> messages, e.g. by PQflush. That'd help mitigate the downsides with small 
> messages.
> 
> So while stream compression offers better compression ratios, I'm inclined to 
> suspect we'll want message level compression.


Actually, by chunked compression, I meant another variant of streaming 
compression, where all of the CompressedData messages share the same 
compression context. Frankly, I don't think that starting a new compression 
context for each compressed message makes sense because it will significantly 
hurt the compression ratio.

Also, in the current state, chunked compression requires buffering.
I'll look into it, but it seems that avoiding buffering will result in an 
increase in socket read/write system calls.

> On Feb 23, 2021, at 12:48 AM, Robert Haas  wrote:
> 
> 1. As I mentioned above, we need to fall back to sending the
> uncompressed message if compression fails to reduce the size, or if it
> doesn't reduce the size by enough to compensate for the header we have
> to add to the packet (I assume this is 5 bytes, perhaps 6 if you allow
> a byte to mention the compression type).
> 
> 2. Refining this further, if we notice that we are failing to compress
> messages regularly, maybe we should adaptively give up. The simplest
> idea would be something like: keep track of what percentage of the
> time compression succeeds in reducing the message size. If in the last
> 100 attempts we got a benefit fewer than 75 times, then conclude the
> data isn't very compressible and switch to only attempting to compress
> every twentieth packet or so. If the data changes and becomes more
> compressible again the statistics will eventually tilt back in favor
> of compressing every packet again; if not, we'll only be paying 5% of
> the overhead.
> 
> 3. There should be some minimum size before we attempt compression.
> pglz gives up right away if the input is less than 32 bytes; I don't
> know if that's the right limit, but presumably it'd be very difficult
> to save 5 or 6 bytes out of a message smaller than that, and maybe
> it's not worth trying even for slightly larger messages.
> 
> 4. It might be important to compress multiple packets at a time. I can
> even imagine having two different compressed protocol messages, one
> saying 'here is a compressed messages' and the other saying 'here are
> a bunch of compressed messages rolled up into one packet’.

I’ll look into 1) and 2). As for 3) and 4), this is already implemented.
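
As a rough illustration of point 2 above (not something from the patch), the
adaptive back-off could be tracked with a small counter like this:

#include <stdbool.h>

typedef struct CompressStats
{
    int         attempts;       /* compression attempts in the current window */
    int         wins;           /* attempts that actually reduced the size */
    int         skipped;        /* messages sent uncompressed while backing off */
    bool        backing_off;
} CompressStats;

static bool
should_try_compression(CompressStats *st)
{
    if (st->backing_off && ++st->skipped < 20)
        return false;           /* while backing off, probe only every 20th message */
    st->skipped = 0;
    return true;
}

static void
record_compression_result(CompressStats *st, bool reduced_size)
{
    st->attempts++;
    if (reduced_size)
        st->wins++;

    if (st->attempts == 100)    /* end of a 100-attempt window: re-evaluate */
    {
        st->backing_off = (st->wins < 75);
        st->attempts = st->wins = 0;
    }
}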


> On Feb 23, 2021, at 1:40 AM, Andres Freund  wrote:
> 
> On 2021-02-22 14:48:25 -0500, Robert Haas wrote:
>> So, if I read these results correctly, on the "pg_restore of IMDB
>> database" test, we get 88% of the RX bytes reduction and 99.8% of the
>> TX bytes reduction for 90% of the CPU cost. On the "pgbench" test,
>> which probably has much smaller packets, chunked compression gives us
>> no bandwidth reduction and in fact consumes slightly more network
>> bandwidth -- which seems like it has to be an implementation defect,
>> since we should always be able to fall back to sending the
>> uncompressed packet if the compressed one is larger, or will be after
>> adding the wrapper overhead. But with the current code, at least, we
>> pay about a 30% CPU tax, and there's no improvement. The permanent
>> compression imposes a whopping 90% CPU tax, but we save about 33% on
>> TX bytes and about 14% o

Re: libpq compression

2021-02-11 Thread Daniil Zakhlystov
Hi!

> On 09.02.2021 09:06, Konstantin Knizhnik wrote:
> 
> Sorry, but my interpretation of your results is completely different:
> permanent compression is faster than chunked compression (2m15 vs. 2m27) 
> and consumes less CPU (44 vs 48 sec).
> Size of RX data is slightly larger - 0.5Mb but TX size is smaller - 5Mb.
> So permanent compression is better from all points of view: it is 
> faster, consumes less CPU and reduces network traffic!
> 
> From my point of view your results just prove my original opinion that 
> possibility to control compression on the fly and use different 
> compression algorithm for TX/RX data
> just complicates implementation and given no significant advantages.

When I mentioned the lower CPU usage, I was referring to the pgbench test 
results in the attached Google doc, where chunked compression demonstrated 
lower CPU usage compared to permanent compression.

I made another (a little bit larger) pgbench test to demonstrate this:

Pgbench test parameters:

Data load
pgbench -i -s 100

Run configuration
pgbench --builtin tpcb-like -t 1500 --jobs=64 --client=600

Pgbench test results:

No compression
latency average = 247.793 ms
tps = 2421.380067 (including connections establishing)
tps = 2421.660942 (excluding connections establishing)

real    6m11.818s
user    1m0.620s
sys     2m41.087s
RX bytes diff, human: 703.9221M
TX bytes diff, human: 772.2580M

Chunked compression (compress only CopyData and DataRow messages)
latency average = 249.123 ms
tps = 2408.451724 (including connections establishing)
tps = 2408.719578 (excluding connections establishing)

real    6m13.819s
user    1m18.800s
sys     2m39.941s
RX bytes diff, human: 707.3872M
TX bytes diff, human: 772.1594M

Permanent compression
latency average = 250.312 ms
tps = 2397.005945 (including connections establishing)
tps = 2397.279338 (excluding connections establishing)

real    6m15.657s
user    1m54.281s
sys     2m37.187s
RX bytes diff, human: 610.6932M
TX bytes diff, human: 513.2225M


As you can see in the above results, user CPU time (1m18.800s vs 1m54.281s) is 
significantly lower with chunked compression because it doesn't try to compress 
all of the packets. 

Here is the summary from my POV, according to these and previous test results:

1. Permanent compression always brings the highest compression ratio
2. Permanent compression might not be worthwhile for workloads other than COPY 
data / replication / BLOB/JSON queries
3. Chunked compression allows compressing only well-compressible messages and 
saves CPU cycles by not compressing the others
4. Chunked compression introduces some traffic overhead compared to the 
permanent one (1.2810G vs 1.2761G TX data on pg_restore of the IMDB database 
dump, according to results in my previous message)
5. From the protocol point of view, chunked compression seems a little bit more 
flexible:
 - we can inject some uncompressed messages at any time without the need to 
decompress/compress the compressed data
 - we can potentially switch the compression algorithm at any time (but I think 
that this might be over-engineering)

Given the summary above, I think it’s time to make a decision on which path we 
should take and make the final list of goals that need to be reached in this 
patch to make it committable. 

Thanks,

Daniil Zakhlystov




Re: libpq compression

2021-02-08 Thread Daniil Zakhlystov
n choose the other compression criteria (for example, compress only 
messages with length more than X bytes),
I am not sure if the current compression criteria provides the best results.

Thanks,

Daniil Zakhlystov





Re: zstd compression for pg_dump

2021-01-04 Thread Daniil Zakhlystov
Hi!

> On Jan 4, 2021, at 11:04 AM, Andrey Borodin  wrote:
> 
> Daniil, is levels definition compatible with libpq compression patch?
> +typedef struct Compress {
> +	CompressionAlgorithm alg;
> +	int		level;
> +	/* Is a nondefault level set?  This is useful since different compression
> +	 * methods have different "default" levels.  For now we assume the levels
> +	 * are all integer, though.
> +	 */
> +	bool	level_set;
> +} Compress;

Similarly to this patch, it is also possible to define the compression level at 
the initialization stage in the libpq compression patch.

The difference is that in the libpq compression patch the default compression 
level is always 1, independently of the chosen compression algorithm.

> On Jan 4, 2021, at 11:04 AM, Andrey Borodin  wrote:
> 
> Libpq compression encountered some problems with memory consumption which 
> required some extra config efforts.


> On Jan 4, 2021, at 12:06 PM, Justin Pryzby  wrote:
> 
> RAM use is not significantly different from zlib, except that zstd --long adds
> more memory.

Regarding ZSTD memory usage:

Recently I've made a couple of tests of libpq compression with different 
ZLIB/ZSTD compression levels, which showed that compressing/decompressing ZSTD 
with high compression levels requires allocating more virtual (Committed_AS) 
memory, which may be exploited by malicious clients:

https://www.postgresql.org/message-id/62527092-16BD-479F-B503-FA527AF3B0C2%40yandex-team.ru

We can avoid high memory usage by limiting the max window size to 8MB. This 
should effectively disable the support of compression levels above 19:
https://www.postgresql.org/message-id/6A45DFAA-1682-4EF2-B835-C5F46615EC49%40yandex-team.ru

So maybe it is worthwhile to use similar restrictions in this patch.
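
For example, a sketch of what that restriction could look like on the
decompression side, using zstd's advanced parameter API (a window limit of 2^23
bytes corresponds to the 8MB cap mentioned above):

#include <stdio.h>
#include <zstd.h>

static ZSTD_DCtx *
create_limited_dctx(void)
{
    ZSTD_DCtx  *dctx = ZSTD_createDCtx();
    size_t      rc;

    if (dctx == NULL)
        return NULL;

    /* Refuse frames whose window exceeds 2^23 bytes (8MB). */
    rc = ZSTD_DCtx_setParameter(dctx, ZSTD_d_windowLogMax, 23);
    if (ZSTD_isError(rc))
    {
        fprintf(stderr, "zstd: %s\n", ZSTD_getErrorName(rc));
        ZSTD_freeDCtx(dctx);
        return NULL;
    }
    return dctx;
}

With this limit in place, ZSTD_decompressStream() rejects frames compressed
with a larger window (which is what the higher levels use) instead of
allocating memory for them, producing the "Frame requires too much memory for
decoding" error reported for over-limit frames.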

—
Daniil Zakhlystov



Re: libpq compression

2020-12-23 Thread Daniil Zakhlystov
no compression  44,36 GiB   1,05 GiB
ZSTD:1  45,03 GiB   1,06 GiB
ZSTD:5  46,06 GiB   1,09 GiB
ZSTD:9  46,00 GiB   1,08 GiB
ZSTD:13 47,46 GiB   1,12 GiB
ZSTD:17 50,23 GiB   1,18 GiB
ZSTD:19 50,21 GiB   1,18 GiB

As for ZSTD levels higher than 19, the decompressor returned the appropriate error 
(excerpt from PostgreSQL server log):
LOG:  failed to decompress data: Frame requires too much memory for decoding

Full benchmark report: 
https://docs.google.com/document/d/1LI8hPzMkzkdQLf7pTN-LXPjIJdjN33bEAqVJj0PLnHA
Pull request with max window size limit: 
https://github.com/postgrespro/libpq_compression/pull/5

This should fix the possible attack vectors related to high ZSTD compression 
levels.

—
Daniil Zakhlystov



Re: libpq compression

2020-12-22 Thread Daniil Zakhlystov
Hi!

I’ve fixed an issue with compression level parsing in this PR 
https://github.com/postgrespro/libpq_compression/pull/4

Also, I did a couple of pgbench runs to measure database resource usage with 
different compression levels.

Firstly, I measured the bidirectional compression scenario, i.e. the database 
had to do both compression and decompression:

Database setup:
pgbench "host=xxx dbname=xxx port=5432 user=xxx" -i -s 500

Test run:
pgbench "host=xxx dbname=xxx  port=5432 user=xxx 
compression=zstd:(1/3/5/7/9/11/13/15/17/19/20)" --builtin tpcb-like -t 50 
--jobs=64 --client=700

When using bidirectional compression, Postgres resource usage correlates with 
the selected compression level. For example, here is the PostgreSQL application 
memory usage:

No compression - 1.2 GiB

ZSTD
zstd:1 - 1.4 GiB
zstd:7 - 4.0 GiB
zstd:13 - 17.7 GiB
zstd:19 - 56.3 GiB  

zstd:20 - 109.8 GiB - did not succeed
zstd:21, zstd:22  > 140 GiB
Postgres process crashes (out of memory)

ZLIB
zlib:1 - 1.35 GiB
zlib:5 - 1.35 GiB
zlib:9 - 1.35 GiB

Full report with CPU/Memory/Network consumption graph is available here:
https://docs.google.com/document/d/1qakHcsabZhV70GfSEOjFxmlUDBe21p7DRoPrDPAjKNg

Then, I've disabled compression for the backend and decompression for the 
frontend, and measured the resource usage for the single-directional 
compression scenario (frontend compression only, backend decompression only):

ZSTD
For all ZSTD compression levels, database host resource usage was roughly the 
same, except the Committed Memory (Committed_AS):

no compression - 44.4 GiB
zstd:1 - 45.0 GiB
zstd:3 - 46.1 GiB
zstd:5 - 46.1 GiB
zstd:7 - 46.0 GiB
zstd:9 - 46.0 GiB
zstd:11 - 47.4 GiB
zstd:13 - 47.4 GiB
zstd:15 - 47.4 GiB
zstd:17 - 50.3 GiB
zstd:19 - 50.1 GiB  
zstd:20 - 66.8 GiB  
zstd:21 - 88.7 GiB  
zstd:22 - 123.9 GiB 

ZLIB
For all ZLIB compression levels, database host resource usage was roughly the 
same.

Full report with CPU/Memory/Network consumption graph is available here:
https://docs.google.com/document/d/1gI7c3_YvcL5-PzeK65P0pIY-4BI9KBDwlfPpGhYxrNg

To sum up, there is actually almost no difference when decompressing the 
different compression levels, except the Committed_AS size. 

—
Daniil Zakhlystov





Re: libpq compression

2020-12-14 Thread Daniil Zakhlystov
> On Dec 10, 2020, at 1:39 AM, Robert Haas  wrote:
> 
> I still think this is excessively baroque and basically useless.
> Nobody wants to allow compression levels 1, 3, and 5 but disallow 2
> and 4. At the very most, somebody might want to set a maximum or
> minimum level. But even that I think is pretty pointless. Check out
> the "Decompression Time" and "Decompression Speed" sections from this
> link:
> 
> https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/
> 
> This shows that decompression time and speed is basically independent
> of compression method for all three of these compressors; to the
> extent that there is a difference, higher compression levels are
> generally slightly faster to decompress. I don't really see the
> argument for letting either side be proscriptive here. Deciding with
> algorithms you're willing to accept is totally reasonable since
> different things may be supported, security concerns, etc. but
> deciding you're only willing to accept certain levels seems unuseful.
> It's also unenforceable, I think, since the receiving side has no way
> of knowing what the sender actually did.

I agree that decompression time and speed are basically the same for different 
compression ratios for most algorithms.
But it seems that this may not be true for memory usage.

Check out these links: http://mattmahoney.net/dc/text.html and 
https://community.centminmod.com/threads/round-4-compression-comparison-benchmarks-zstd-vs-brotli-vs-pigz-vs-bzip2-vs-xz-etc.18669/

According to these sources, zstd uses significantly more memory while 
decompressing the data which has been compressed with high compression ratios.

So I’ll test the different ZSTD compression ratios with the current version of 
the patch and post the results later this week.


> On Dec 10, 2020, at 1:39 AM, Robert Haas  wrote:
> 
> 
> Good points. I guess you need to arrange to "flush" at the compression
> layer as well as the libpq layer so that you don't end up with data
> stuck in the compression buffers.

I think that “flushing” the libpq and compression buffers before setting the 
new compression method will help to solve issues only at the compressing 
(sender) side
but won't help much on the decompressing (receiver) side.

In the current version of the patch, the decompressor acts as a proxy between 
secure_read and PqRecvBuffer / conn->inBuffer. It is unaware of the Postgres 
protocol and cannot do anything other than decompress the bytes received from 
the secure_read function and append them to the PqRecvBuffer.
So the problem is that we can't decouple the compressed bytes from the 
uncompressed ones (actually ZSTD detects the compressed block end, but some 
other algorithms don't).

We may introduce some hooks to control the decompressor behavior from the 
underlying levels after reading the SetCompressionMethod message from 
PqRecvBuffer, but I don't think that is the correct approach.

> On Dec 10, 2020, at 1:39 AM, Robert Haas  wrote:
> 
> Another idea is that you could have a new message type that says "hey,
> the payload of this is 1 or more compressed messages." It uses the
> most-recently set compression method. This would make switching
> compression methods easier since the SetCompressionMethod message
> itself could always be sent uncompressed and/or not take effect until
> the next compressed message. It also allows for a prudential decision
> not to bother compressing messages that are short anyway, which might
> be useful. On the downside it adds a little bit of overhead. Andres
> was telling me on a call that he liked this approach; I'm not sure if
> it's actually best, but have you considered this sort of approach?

This may help to solve the above issue. For example, we may introduce the 
CompressedData message:

CompressedData (F & B) 

Byte1(‘m’) // I am not so sure about the ‘m’ identifier :)
Identifies the message as compressed data. 

Int32 
Length of message contents in bytes, including self. 

Byten
Data that forms part of a compressed data stream.

Basically, it wraps some chunk of compressed data (like the CopyData message).

On the sender side, the compressor will wrap all outgoing message chunks into 
CompressedData messages.

On the receiver side, some intermediate component between the secure_read and 
the decompressor will do the following:
1. Read the next 5 bytes (type and length) from the buffer 
2.1 If the message type is other than CompressedData, forward it straight to 
the PqRecvBuffer /  conn->inBuffer.
2.2 If the message type is CompressedData, forward its contents to the current 
decompressor.
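
For illustration, a minimal sketch of that receiver-side dispatch (it assumes
the full header and message body are already available in the buffer, and uses
the hypothetical 'm' type byte from above):

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Stand-ins for appending to PqRecvBuffer / conn->inBuffer and for the decompressor. */
extern void append_to_recv_buffer(const char *data, size_t len);
extern void feed_decompressor(const char *data, size_t len);

static void
route_incoming_message(const char *buf)
{
    char        msg_type = buf[0];
    uint32_t    msg_len;

    memcpy(&msg_len, buf + 1, 4);
    msg_len = ntohl(msg_len);   /* length counts itself but not the type byte */

    if (msg_type == 'm')        /* CompressedData: body goes to the decompressor */
        feed_decompressor(buf + 5, msg_len - 4);
    else                        /* any other message passes through untouched */
        append_to_recv_buffer(buf, 1 + msg_len);
}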

What do you think of this approach?

—
Daniil Zakhlystov



Re: libpq compression

2020-12-08 Thread Daniil Zakhlystov
> only controls behavior on the compression side. So, the client can
> only send zlib data if the server can decompress it, but the server
> need not advertise which levels it can decompress, because it's all or
> nothing.

Depending on the chosen compression algorithm, the compression level may affect 
decompression speed and memory usage.
That's why I think it may be nice for the server to forbid compression levels 
that require high CPU / memory usage for decompression.

> On Dec 3, 2020, at 2:23 AM, Robert Haas  wrote:
> 
> On the other hand, one could take a whole different
> approach and imagine the server being in charge of both directions,
> like having a GUC that is set on the server. Clients advertise what
> they can support, and the server tells them to do whatever the GUC
> says they must. That sounds awfully heavy-handed, but it has the
> advantage of letting the server administrator set site policy.

I personally think that this approach is the most practical one. For example:

In the server’s postgresql.conf:

compress_algorithms = 'uncompressed' // means that the server forbids any 
server-to-client compression
decompress_algorithms = 'zstd:7,8;uncompressed' // means that the server can 
only decompress zstd with compression levels 7 and 8 or communicate with 
uncompressed messages

In the client connection string:

“… compression=zlib:1,3,5;zstd:6,7,8;uncompressed …” // means that the client 
is able to compress/decompress zlib, zstd, or communicate with uncompressed 
messages

For the sake of simplicity, the client’s “compression” parameter in the 
connection string is basically an analog of the server’s compress_algorithms 
and decompress_algorithms.
So the negotiation process for the above example would look like this:

1. Client sends startup packet with 
“algorithms=zlib:1,3,5;zstd:6,7,8;uncompressed;”
Since there is no compression mode specified, assume that the client wants 
permanent compression.
In future versions, the client can request switchable compression after the ';' 
at the end of the message

2. Server replies with two messages:
- CompressionAck message containing “algorithms=zstd:7,8;uncompressed;”
Where the algorithms section basically matches the “decompress_algorithms” 
server GUC parameter.
In future versions, the server can specify the chosen compression mode after 
the ‘;’ at the end of the message

- Following SetCompressionMethod message containing “alg_idx=1;level_idx=1” 
which
essentially means that the server chose zstd with compression level 7 for 
server-to-client compression. Every next message from the server is now 
compressed with zstd

3. Client replies with SetCompressionMethod message containing “alg_idx=0” 
which means that the client chose the uncompressed
client-to-server messaging. Actually, the client had no other options, because 
"uncompressed" was the only option left after the intersection of
compression algorithms from the connection string and algorithms received from 
the server in the CompressionAck message.
Every next message from the client is now being sent uncompressed.
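
A tiny sketch of the intersection step used on both sides above (illustrative
only; level handling is omitted for brevity):

#include <stddef.h>
#include <string.h>

static const char *
choose_algorithm(const char *own_algs[], size_t n_own,
                 const char *peer_algs[], size_t n_peer)
{
    /* Walk our own preference order and pick the first algorithm the peer
     * also listed; NULL means there is no common algorithm. */
    for (size_t i = 0; i < n_own; i++)
        for (size_t j = 0; j < n_peer; j++)
            if (strcmp(own_algs[i], peer_algs[j]) == 0)
                return own_algs[i];
    return NULL;
}

Returning NULL would correspond to the '-1' (no common algorithm) case
mentioned elsewhere in this thread.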

—
Daniil Zakhlystov



Re: libpq compression

2020-11-26 Thread Daniil Zakhlystov
r receiving the SetCompressionMessage from the backend, the frontend should 
also reply with SetCompressionMessage with one of the following:
 - Index of the chosen frontend compression algorithm followed by the index of 
the chosen compression level. In this case, the backend now should use the 
chosen decompressor for incoming messages, the frontend should also use the 
chosen compressor for outgoing messages.
 - '-1', if the frontend doesn’t support the compression using any of the 
algorithms sent by the backend. In this case, the frontend should terminate the 
connection after sending this message.

After that sequence of messages, the frontend and backend may continue the 
usual conversation. In the case of permanent compression mode, further use of 
SetCompressionMessage is prohibited on both the frontend and backend sides. 
Supported compression and decompression methods are configured using GUC 
parameters:

compress_algorithms = ‘...’ // default value is ‘uncompressed’
decompress_algorithms = ‘...’ // default value is ‘uncompressed’

Please, let me know if I was unclear somewhere in the protocol description so I 
can clarify the things that I might have missed. I would appreciate hearing 
your opinion on the proposed protocol. 

Thanks,

Daniil Zakhlystov




Re: libpq compression

2020-11-24 Thread Daniil Zakhlystov
I completely agree that backward-compatibility is important here.

I think that it is a good idea to clarify how the compression establishment 
works in the current version of the patch:

1. Frontend sends the startup packet, which may look like this:

_pq_.compression = 'zlib,zstd' (I omitted the variations with compression 
levels for clarity)

Then, on the backend, there are two possible cases:
2.1 If the backend is too old and doesn't know anything about the compression 
or if the compression is disabled on the backend, it just ignores the 
compression parameter
2.2 In the other case, the backend intersects the client compression methods 
with its own supported ones and responds with a compressionAck message which 
contains the index of the chosen compression method (or '-1' if it doesn't 
support any of the methods provided).

If the frontend receives the compressionAck message, there are also two cases:
3.1 If compressionAck contains '-1', do not initiate compression
3.2 In the other case, initialize the chosen compression method immediately.

My idea is that we can add new compression approaches in the future and 
initialize them differently on step 3.2.

For example, in the case of switchable compression:

1. Client sends a startup packet with _pq_.compression = 'switchable,zlib,zstd' 
- it means that client wants switchable compression or permanent zlib/zstd 
compression.

Again, two main cases on the backend:
2.1 Backend doesn't know about any compression or compression turned off => 
ignore the _pq_.compression

2.2.1 If the backend doesn't have switchable compression implemented, it won't 
have 'switchable' in its supported methods. So it will simply discard this 
method in the process of intersecting the client and backend compression 
methods and respond with some compressionAck message - choose permanent zlib, 
zstd, or nothing (-1).

2.2.2 If the backend supports switchable on the fly compression, it will have 
'switchable' in its supported methods so it may choose 'switchable' in its 
compressionAck response.

After that, on the frontend side:
3.1 If compressionAck contains '-1', do not initiate compression

3.2.1 If compressionAck has 'zstd' or 'zlib' as the chosen compression method, 
init permanent streaming compression immediately.

3.2.2 If compressionAck has 'switchable' as the chosen compression method, init 
the switchable compression. Initialization may involve sending some additional 
messages to the backend to negotiate the details like the supported switchable 
on the fly compression methods or any other details.

The same applies to compression with different algorithms in each direction. 
We can call it, for example, 'direction-specific' and initialize it differently 
in step 3.2. The key is that we don't even have to decide the exact 
initialization protocol for 'switchable' and 'direction-specific'. It may be 
added in the future.

Basically, this is what I meant in my previous message about the future 
expansion of the current design; I hope that I managed to clarify it.

Thanks,

Daniil Zakhlystov

> On Nov 24, 2020, at 11:35 PM, Robert Haas  wrote:
> 
> On Tue, Nov 24, 2020 at 12:35 PM Daniil Zakhlystov
>  wrote:
>> To sum up, I think that the current implementation already introduces good 
>> benefits. As I proposed in the Usability review, we may introduce the new 
>> approaches later as separate compression 'algorithms'.
> 
> I don't think the current patch is so close to being committable that
> we shouldn't be considering what we really want to have here. It's one
> thing to say, well, this patch is basically done, let's not start
> redesigning it now. But that's not the case here. For example, I don't
> see any committer accepting the comments in zpq_stream.c as adequate,
> or the documentation, either. Some comments that have been made
> previously, like Andres's remark about the non-standard message
> construction in pq_configure(), have not been addressed, and I do not
> think any committer is going to agree with the idea that the novel
> method chosen by the patch is superior here, not least but not only
> because it seems like it's endian-dependent. That function also uses
> goto, which anybody thinking of committing this will surely try to get
> rid of, and I'm pretty sure the sscanf() isn't good enough to reject
> trailing garbage, and the error message that follows is improperly
> capitalized. I'm sure there's other stuff, too: this is just based on
> a quick look.
> 
> Before we start worrying about any of that stuff in too much detail, I
> think it makes a lot of sense to step back and consider the design.
> Honestly, the work of changing the design might be smaller than the
> amount of cleanup the patch needs. But even if it's larger, it's
> probably not vastly larger. And in any case, I quite disagree with the
> idea that we should commit to 

Re: libpq compression

2020-11-24 Thread Daniil Zakhlystov
enarios.


Coding review
--

In protocol.sgml:
>It can be just boolean values enabling or disabling compression
> ("true"/"false", "on"/"off", "yes"/"no", "1"/"0"), "auto" or explicit 
> list of compression algorithms
> separated by comma with optional specification of compression level: 
> "zlib,zstd:5".

But in fe-protocol3.c:
>   if (pg_strcasecmp(value, "true") == 0 ||
>   pg_strcasecmp(value, "yes") == 0 ||
>   pg_strcasecmp(value, "on") == 0 ||
>   pg_strcasecmp(value, "any") == 0 ||
>   pg_strcasecmp(value, "1") == 0)
>   {
I believe there is some mismatch: the docs mention an "auto" parameter, but in 
the code "auto" is missing while "any" exists. Actually, I propose to remove 
both the "auto" and "any" parameters because they work the same way as 
"true/on/yes/1" but look like something else.


In fe-protocol3.c:
>#define pq_read_conn(conn) \
>   (conn->zstream  \
>? zpq_read(conn->zstream, conn->inBuffer + conn->inEnd,
> \
>   conn->inBufSize - conn->inEnd)  
> \
>: pqsecure_read(conn, conn->inBuffer + conn->inEnd,
> \
>conn->inBufSize - conn->inEnd))
I think there should be some comment regarding the read function selection 
logic. Same for the zpq_write calls. Also, pq_read_conn is defined as a macro, 
but there is no macro for pq_write_conn.
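
For what it's worth, a symmetric write-path macro might look roughly like the
sketch below; the outBuffer field and the zpq_write signature here are
assumptions that simply mirror pq_read_conn, not the actual patch code:

#define pq_write_conn(conn, len) \
	((conn)->zstream \
	 ? zpq_write((conn)->zstream, (conn)->outBuffer, (len)) \
	 : pqsecure_write((conn), (conn)->outBuffer, (len)))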

In configure.ac:
> if test "$with_zstd" = yes; then
>   AC_CHECK_LIB(zstd, ZSTD_decompressStream, [],
>[AC_MSG_ERROR([zstd library not found
> If you have zstd already installed, see config.log for details on the
> failure.  It is possible the compiler isn't looking in the proper directory.
> Use --without-zstd to disable zstd support.])])
> fi

> if test "$with_zstd" = yes; then
>   AC_CHECK_HEADER(zstd.h, [], [AC_MSG_ERROR([zstd header not found
> If you have zstd already installed, see config.log for details on the
> failure.  It is possible the compiler isn't looking in the proper directory.
> Use --without-zstd to disable zstd support.])])
> fi

Looks like the lines mentioning --without-zstd are incorrect.

In fe-connect.c:
> if (index == (char)-1)
> {
>   appendPQExpBuffer(&conn->errorMessage,
>   libpq_gettext(
>   "server is not supported requested compression algorithms %s\n"),
>   conn->compression);
>   goto error_return;
> }
Right now this error might be displayed in two cases:
- The backend supports compression, but it is somehow disabled/turned off
- The backend supports compression, but does not support the requested algorithms

I think that it is a good idea to differentiate these two cases. Maybe define 
the following behavior somewhere in docs:

“When connecting to an older backend, which does not support compression, or in 
the case when the backend support compression but for some reason wants to 
disable it, the backend will just ignore the _pq_.compression parameter and 
won’t send the compressionAck message to the frontend.”


To sum up, I think that the current implementation already introduces good 
benefits. As I proposed in the Usability review, we may introduce the new 
approaches later as separate compression 'algorithms'.

Thanks,
Daniil Zakhlystov

Re: libpq compression

2020-11-23 Thread Daniil Zakhlystov

Hi,

I’ve run a couple of pgbenchmarks using this patch with odyssey connection 
pooler, with client-to-pooler ZSTD compression turned on.

pgbench --builtin tpcb-like -t 75 --jobs=32 --client=1000

CPU utilization chart of the configuration above:
https://storage.yandexcloud.net/usernamedt/odyssey-compression.png

CPU overhead on average was about 10%.

pgbench -i -s 1500

CPU utilization chart of the configuration above:
https://storage.yandexcloud.net/usernamedt/odyssey-compression-i-s.png

As you can see, there was not any noticeable difference in CPU utilization with 
ZSTD compression enabled or disabled.

Regarding replication, I've made a couple of fixes for this patch, you can find 
them in this pull request 
https://github.com/postgrespro/libpq_compression/pull/3

With these fixes applied, I've run some tests using this patch with streaming 
physical replication on some large clusters. Here is the difference in network 
usage on the replica with ZSTD replication compression enabled compared to the 
replica without replication compression:

- on pgbench -i -s 1500 there was ~23x less network usage

- on pgbench -T 300 --jobs=32 --client=640 there was ~4.5x less network usage

- on pg_restore of the ~300 GB database there was ~5x less network usage

To sum up, I think that the current version of the patch (with per-connection 
compression) is OK from the protocol point of view except for the compression 
initialization part. As discussed, we can either do initialization before the 
startup packet or move the compression to _pq_ parameter to avoid issues on 
older backends.

Regarding switchable on the fly compression, although it introduces more 
flexibility, it seems that it will significantly increase the implementation 
complexity of both the frontend and backend. To support this approach in the 
future, maybe we should add something like the compression mode to protocol and 
name the current approach as “permanent” while reserving the “switchable” 
compression type for future implementation?

Thanks,

Daniil Zakhlystov

06.11.2020, 11:58, "Andrey Borodin" :
>> On 6 Nov 2020, at 00:22, Peter Eisentraut wrote:
>>
>>  On 2020-11-02 20:50, Andres Freund wrote:
>>>  On 2020-10-31 22:25:36 +0500, Andrey Borodin wrote:
>>>>  But the price of compression is 1 cpu for 500MB/s (zstd). With a
>>>>  20Gbps network adapters cost of recompressing all traffic is at most
>>>>  ~4 cores.
>>>  It's not quite that simple, because presumably each connection is going
>>>  to be handled by one core at a time in the pooler. So it's easy to slow
>>>  down peak throughput if you also have to deal with TLS etc.
>>
>>  Also, current deployments of connection poolers use rather small machine 
>> sizes. Telling users you need 4 more cores per instance now to decompress 
>> and recompress all the traffic doesn't seem very attractive. Also, it's not 
>> unheard of to have more than one layer of connection pooling. With that, 
>> this whole design sounds a bit like a heat-generation system. ;-)
>
> User should ensure good bandwidth between pooler and DB. At least they must 
> be within one availability zone. This makes compression between pooler and DB 
> unnecessary.
> Cross-datacenter traffic is many times more expensive.
>
> I agree that switching between compression levels (including turning it off) 
> seems like nice feature. But
> 1. Scope of its usefulness is an order of magnitude smaller than compression 
> of the whole connection.
> 2. Protocol for this feature is significantly more complicated.
> 3. Restarted compression is much less efficient and effective.
>
> Can we design a protocol so that this feature may be implemented in future, 
> currently focusing on getting things compressed? Are there any drawbacks in 
> this approach?
>
> Best regards, Andrey Borodin.





Re: libpq compression

2020-11-03 Thread Daniil Zakhlystov
Hi,

Looks like I found an issue: there can be a situation when the libpq frontend 
hangs after zpq_read(). This may happen when the socket is not read ready and 
we can't perform another read because we wait on pqWait() but we actually have 
some buffered rx data.

I think we should add a check if there is any buffered rx data before calling 
pqWait() to avoid such hangs.

--
Daniil Zakhlystov




Re: libpq compression

2020-11-02 Thread Daniil Zakhlystov
Hi,

Currently, zpq_stream contains a check only for the tx buffered data - 
zpq_buffered(). I think that there should be the same functionality for rx 
buffered data, for example, zpq_buffered_rx().

zpq_buffered_rx() returns a value greater than zero if there is any data that 
was fetched by rx_func() but hasn't been decompressed yet; in any other case 
zpq_buffered_rx() returns zero.

In this case, I think that we may also rename the existing zpq_buffered() to 
zpq_buffered_tx() for clarity.

--
Daniil Zakhlystov
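
For illustration, here is roughly how such a check could be used to avoid the
hang described in the previous message; the prototypes are simplified stand-ins
rather than the real libpq signatures:

/* Simplified stand-ins for the real libpq/zpq functions. */
extern int  zpq_buffered_rx(void *zstream);     /* >0 if rx data awaits decompression */
extern int  pqWait(int forRead, int forWrite, void *conn);
extern int  pqReadData(void *conn);

static int
wait_and_read_input(void *conn, void *zstream)
{
    /*
     * If the compression stream already holds raw bytes fetched from the
     * socket, another pqReadData() pass can decompress them even though the
     * socket will not become read-ready, so skip the blocking pqWait().
     */
    if (zstream != NULL && zpq_buffered_rx(zstream) > 0)
        return pqReadData(conn);

    if (pqWait(1, 0, conn) < 0)
        return -1;
    return pqReadData(conn);
}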




Re: libpq compression

2020-11-01 Thread Daniil Zakhlystov
It looks like there was a different version of pq_getbyte_if_available on my 
side, my bad.

I’ve reapplied the patch and compiler is happy now, thanks!

—
Daniil Zakhlystov

> On Nov 1, 2020, at 3:01 PM, Konstantin Knizhnik  
> wrote:
> 
> Hi
> 
> On 01.11.2020 12:37, Daniil Zakhlystov wrote:
>> Hi,
>> 
>> I have a couple of comments regarding the last patch, mostly these are minor 
>> issues.
>> 
>> In src/backend/libpq/pqcomm.c, starting from the line 1114:
>> 
>> int
>> pq_getbyte_if_available(unsigned char *c)
>> {
>>  int r;
>> 
>>  Assert(PqCommReadingMsg);
>> 
>>  if (PqRecvPointer < PqRecvLength || (0) > 0)   // not easy to 
>> understand optimization (maybe add a comment?)
>>  {
>>  *c = PqRecvBuffer[PqRecvPointer++];
>>  return 1;
>>  }
>>  return r; // returned value is not initialized
>> }
> 
> Sorry, I don't understand it.
> This is the code of pq_getbyte_if_available:
> 
> int
> pq_getbyte_if_available(unsigned char *c)
> {
> intr;
> 
> Assert(PqCommReadingMsg);
> 
> if (PqRecvPointer < PqRecvLength || (r = pq_recvbuf(true)) > 0)
> {
> *c = PqRecvBuffer[PqRecvPointer++];
> return 1;
> }
> return r;
> }
> 
> 
> So "return r" branch is executed when both conditions are false: 
> (PqRecvPointer < PqRecvLength)
> and ((r = pq_recvbuf(true)) > 0)
> Last condition cause assignment of "r" variable.
> 
> I wonder how you got this "returned value is not initialized" warning?
> Is it produced by some static analysis tool or compiler? 
> 
> In any case, I will initialize "r" variable to make compiler happy.
> 
> 
>> In src/interfaces/libpq/fe-connect.c, starting from the line 3255:
>> 
>> pqGetc(&algorithm, conn);
>> impl = zpq_get_algorithm_impl(algorithm);
>> {
>>  // I believe that if (impl < 0) condition is missing here, otherwise there 
>> is always an error
>>appendPQExpBuffer(&conn->errorMessage,
>>  libpq_gettext(
>> "server is not supported 
>> requested compression algorithm %c\n"), algorithm);
>>goto error_return;
>> }
>> 
> 
> Sorry, I have fixed this mistyping several days ago in the git repository
> g...@github.com:postgrespro/libpq_compression.git
> but didn't attach a new version of the patch because I plan to make more 
> changes as a result of Andres's review.
> 
> 
> 
>> In configure, starting from the line 1587:
>> 
>> --without-zlib  do not use Zlib
>> --with-zstd do not use zstd // is this correct?
>> 
> 
> Thank you for noting it: fixed.
> 
> 
> 



Re: libpq compression

2020-11-01 Thread Daniil Zakhlystov
Hi,

I have a couple of comments regarding the last patch, mostly these are minor 
issues.

In src/backend/libpq/pqcomm.c, starting from the line 1114:

int
pq_getbyte_if_available(unsigned char *c)
{
int r;

Assert(PqCommReadingMsg);

if (PqRecvPointer < PqRecvLength || (0) > 0)   // not easy to 
understand optimization (maybe add a comment?)
{
*c = PqRecvBuffer[PqRecvPointer++];
return 1;
}
return r; // returned value is not initialized
}

In src/interfaces/libpq/fe-connect.c, starting from the line 3255:

pqGetc(&algorithm, conn);
impl = zpq_get_algorithm_impl(algorithm);
{   
 // I believe that if (impl < 0) condition is missing here, otherwise there is 
always an error
   appendPQExpBuffer(&conn->errorMessage,
 libpq_gettext(
"server is not supported 
requested compression algorithm %c\n"), algorithm);
   goto error_return;
}

In configure, starting from the line 1587:

--without-zlib  do not use Zlib
--with-zstd do not use zstd // is this correct?

Thanks

--
Daniil Zakhlystov

> On Nov 1, 2020, at 12:08 AM, Konstantin Knizhnik  
> wrote:
> 
> Hi
> 
> On 31.10.2020 00:03, Andres Freund wrote:
>> Hi,
>> 
>> On 2020-10-29 16:45:58 +0300, Konstantin Knizhnik wrote:
>>>> - What does " and it is up to the client whether to continue work
>>>>without compression or report error" actually mean for a libpq 
>>>> parameter?
>>> It can not happen.
>>> The client request from server use of compressed protocol only if
>>> "compression=XXX" was specified in connection string.
>>> But XXX should be supported by client, otherwise this request will be
>>> rejected.
>>> So supported protocol string sent by client can never be empty.
>> I think it's pretty important for features like this to be able to fail
>> softly when the feature is not available on the other side. Otherwise a
>> lot of applications need to have unnecessary version dependencies coded
>> into them.
> Sorry, maybe I do not completely understand your suggestion.
> Right now the user just specifies that he wants to use compression.
> The libpq client sends the server a list of supported algorithms
> and the server chooses among them the best one it supports.
> It sends its choice to the client and then both use this algorithm.
> 
> Sorry that in the previous mail I used incorrect samples: the client is not 
> explicitly specifying a compression algorithm -
> it just requests compression. And the server chooses the most efficient 
> algorithm which is supported by both client and server.
> So the client should not know the names of the particular algorithms (i.e. 
> zlib, zstd) and
> the choice is based on the assumption that the server (or better say the 
> programmer) knows better than the user which algorithm is more efficient.
> That last assumption may be contested because the user knows better which 
> content will be sent and which algorithm is more efficient for this content. 
> But right now, when the choice is between zstd and zlib,
> the first one is always better: faster and provides better quality.
> 
>> 
>>>> - What is the point of the "streaming mode" reference?
>>> There are ways of performing compression:
>>> - block mode when each block is individually compressed (compressor stores
>>> dictionary in each compressed blocked)
>>> - stream mode
>>> Block mode allows to independently decompress each page. It is good for
>>> implementing page or field compression (as pglz is used to compress toast
>>> values).
>>> But it is not efficient for compressing client-server protocol commands.
>>> It seems to me to be important to explain that libpq is using stream mode
>>> and why there is no pglz compressor
>> To me that seems like unnecessary detail in the user oriented parts of
>> the docs at least.
>> 
> Ok, I will remove this phrase.
> 
>>>> Why are compression methods identified by one byte identifiers? That
>>>> seems unnecessarily small, given this is commonly a once-per-connection
>>>> action?
>>> It is mostly for simplicity of implementation: it is always simple to work
>>> with fixed size messages (or with array of chars rather than array of
>>> strings).
>>> And I do not think that it can somehow decrease flexibility: these one-letter
>>> algorithm codes are not visible to the user. And I do not think that we
>>> sometime will support more than 127 (or ev

Re: libpq compression

2020-10-29 Thread Daniil Zakhlystov
Hi,

> On Oct 29, 2020, at 12:27 AM, Andres Freund  wrote:
> 
> The protocol sounds to me like there's no way to enable/disable
> compression in an existing connection. To me it seems better to have an
> explicit, client initiated, request to use a specific method of
> compression (including none). That allows to enable compression for bulk
> work, and disable it in other cases (e.g. for security sensitive
> content, or for unlikely to compress well content).
> 
> I think that would also make cross-version handling easier, because a
> newer client driver can send the compression request and handle the
> error, without needing to reconnect or such.
> 
> Most importantly, I think such a design is basically a necessity to make
> connection poolers to work in a sensible way.

Can you please clarify your opinion about connection poolers? By "a sensible 
way", do you mean that in some cases they can just forward the compressed 
stream without parsing it?

>> +/*
>> + * Index of used compression algorithm in zpq_algorithms array.
>> + */
>> +static int zpq_algorithm_impl;
> 
> This is just odd API design imo. Why doesn't the dispatch work based on
> an argument for zpq_create() and the ZpqStream * for the rest?
> 
> What if there's two libpq connections in one process? To servers
> supporting different compression algorithms? This isn't going to fly.


I agree, I think that moving the zpq_algorithm_impl to the ZpqStream struct 
seems like an easy fix for this issue.

>> +/* initialize compression */
>> +if (zpq_set_algorithm(compression_algorithm))
>> +PqStream = zpq_create((zpq_tx_func)secure_write, 
>> (zpq_rx_func)secure_read, MyProcPort);
>> +}
>> +return 0;
>> +}
> 
> Why is zpq a wrapper around secure_write/read? I'm a bit worried this
> will reduce the other places we could use zpq.

Maybe we can just split the PqStream into PqCompressStream and 
PqDecompressStream? It looks like they can work independently.

—
Daniil Zakhlystov




Re: libpq compression

2020-10-28 Thread Daniil Zakhlystov
The following review has been posted through the commitfest application:
make installcheck-world:  not tested
Implements feature:       not tested
Spec compliant:           not tested
Documentation:            not tested

Hi, thanks for the patch!
 
I’ve made a quick review and found one issue.
 
If the backend sends a CompressionAck message followed by some already 
compressed message (for example, AuthenticationOk), then there is a chance that 
pqReadData() will read both messages into the read buffer at once. In this 
case, the CompressionAck message will be read normally, but the client will 
fail to recognize the next message (for example, AuthenticationOk) since it 
came in a compressed form but was incorrectly read as a regular message. So the 
client would not be able to recognize the second message and will crash.
 
Example of a successful launch (added some debug output):
usernamedt-osx: ~ usernamedt $ psql -d "host = x.x.x.x port = 6432 dbname = 
testdb user = testuser compression = 1"
NUM_READ: 6 (pqReadData read CompressionAck (6 bytes) and nothing more)
pqReadData RC: 1
NUM_READ: 346
pqReadData RC: 1
psql (14devel)
Type "help" for help.
 
testdb => // OK
 
Example of a failed launch:
usernamedt-osx: ~ usernamedt $ psql -d "host = x.x.x.x port = 6432 dbname = 
testdb user = testuser compression = 1"
NUM_READ: 24 (pqReadData read CompressionAck (6 bytes) and compressed 
AuthenticationOk (18 bytes) came after it)
pqReadData RC: 1
psql: error: could not connect to server: expected authentication request from 
server, but received x // FAIL
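
For illustration, one way to handle this on the client side (field and
function names here are hypothetical, not libpq's actual ones) is to hand any
bytes that were read together with CompressionAck straight to the newly created
decompressor instead of parsing them as plain messages:

#include <stddef.h>

typedef struct ConnSketch
{
    char       *inBuffer;
    size_t      inCursor;       /* first unread byte, just past CompressionAck */
    size_t      inEnd;          /* end of the data read so far */
    void       *zstream;        /* decompressor created on CompressionAck */
} ConnSketch;

extern void *zpq_create_decoder(void);
extern void  zpq_feed(void *zstream, const char *data, size_t len);

static void
start_compression(ConnSketch *conn)
{
    conn->zstream = zpq_create_decoder();

    /* Re-route bytes that arrived in the same read as CompressionAck. */
    if (conn->inEnd > conn->inCursor)
    {
        zpq_feed(conn->zstream, conn->inBuffer + conn->inCursor,
                 conn->inEnd - conn->inCursor);
        conn->inEnd = conn->inCursor;   /* they are no longer plain protocol data */
    }
}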

--
Daniil Zakhlystov