Re: libpq compression (part 3)

2024-05-21 Thread Jacob Burroughs
On Tue, May 21, 2024 at 1:24 PM Jacob Champion wrote:
>
> We absolutely have to document the risks and allow clients to be
> written safely. But I think server-side controls on risky behavior
> have proven to be generally more valuable, because the server
> administrator is often in a better spot to see the overall risks to
> the system. ("No, you will not use deprecated ciphersuites. No, you
> will not access this URL over plaintext. No, I will not compress this
> response containing customer credit card numbers, no matter how nicely
> you ask.") There are many more clients than servers, so it's less
> risky for the server to enforce safety than to hope that every client
> is safe.
>
> Does your database and access pattern regularly mingle secrets with
> public data? Would auditing correct client use of compression be a
> logistical nightmare? Do your app developers keep indicating in
> conversations that they don't understand the risks at all? Cool, just
> set `encrypted_compression = nope_nope_nope` on the server and sleep
> soundly at night. (Ideally we would default to that.)

Thinking about this more (and adding an encrypted_compression GUC or
whatever we call it), my inclination would be, on the server side, to
enable compression by default for unencrypted connections but disable
it by default for encrypted connections, with both defaults being
config parameters that can be changed as desired.
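
To make the shape of that concrete, a sketch of what the defaults
might look like (the GUC names and values here are purely illustrative
placeholders, not a settled design):

  # postgresql.conf (hypothetical)
  connection_compression = on              # unencrypted connections
  encrypted_connection_compression = off   # SSL/GSS connections; admin opts in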

> The concept of stream reset seems necessary but insufficient at the
> application level, which bleeds over into Jelte's compression_restart
> proposal. (At the protocol level, I think it may be sufficient?)
>
> If I write a query where one of the WHERE clauses is
> attacker-controlled and the other is a secret, I would really like to
> not compress that query on the client side. If I join a table of user
> IDs against a table of user-provided addresses and a table of
> application tokens for that user, compressing even a single row leaks
> information about those tokens -- at a _very_ granular level -- and I
> would really like the server not to do that.
>
> So if I'm building sand castles... I think maybe it'd be nice to mark
> tables (and/or individual columns?) as safe for compression under
> encryption, whether by row or in aggregate. And maybe libpq and psql
> should be able to turn outgoing compression on and off at will.
>
> And I understand those would balloon the scope of the feature. I'm
> worried I'm doing the security-person thing and sucking all the air
> out of the room. I know not everybody uses transport encryption; for
> those people, compress-it-all is probably a pretty winning strategy,
> and there's no need to reset the compression context ever. And the
> pg_dump-style, "give me everything" use case seems like it could maybe
> be okay, but I really don't know how to assess the risk there, at all.

I would imagine that a large fraction of postgres deployments are in
contexts (e.g. internal networks) where either no encryption is used,
or, even when encryption is used, the benefit of compression outweighs
the risk of someone being in a position to perform a BREACH-style
side-channel attack against DB traffic, so compress-it-all would be
quite useful in many cases.  Would some sort of per-table/column
marking be useful for some cases?  Probably, but that doesn't seem to
me like it needs to be in v1 of this feature, as long as the protocol
layer itself is designed such that parties can arbitrarily alternate
between transmitting compressed and uncompressed data.  Then, if we
build such a feature down the road, we just add logic around *when* we
compress, and the protocol layer doesn't change.

> Well... working out the security minutiae _after_ changing the
> protocol is not historically a winning strategy, I think. Better to do
> it as a vertical stack.

Thinking about it more, I agree that we probably should work out the
protocol level mechanism for resetting compression
context/enabling/disabling/reconfiguring compression as part of this
work.  I don't think that we need to have all the ways that the
application layer might choose to use such things done here, but we
should have all the necessary primitives worked out.

-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-21 Thread Jacob Burroughs
On Tue, May 21, 2024 at 1:39 PM Jacob Champion wrote:
>
> If the server doesn't reject compressed packets pre-authentication,
> then case 2 isn't mitigated. (I haven't proven how risky that case is
> yet, to be clear.) In other words: if the threat model is that a
> client can attack us, we shouldn't assume that it will attack us
> politely.

I think I thought I was writing about something else when I wrote that
:sigh:.  What I really should have written is a version of the part
below: we use streaming decompression, only decompress 8kB at a time,
and limit pre-auth messages to `PG_MAX_AUTH_TOKEN_LENGTH` (65535
bytes), which isn't enough data for the decompression work to cause
any more real-world pain than sending equally large invalid
uncompressed auth packets already could.
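
To illustrate what that guard means in practice, here is a minimal
sketch of bounded streaming decompression written in plain zlib terms.
This is not the patch's code (and the patch supports other algorithms
too), but the shape is the same: the decompressor produces at most a
small chunk at a time and gives up as soon as the output would exceed
the cap, so no length claim inside the compressed data is ever trusted.

#include <string.h>
#include <stdbool.h>
#include <zlib.h>

#define CHUNK_SIZE 8192            /* decompress 8kB at a time */
#define PRE_AUTH_LIMIT 65535       /* cf. PG_MAX_AUTH_TOKEN_LENGTH */

static bool
inflate_bounded(const unsigned char *in, size_t in_len,
                unsigned char *out, size_t out_cap, size_t *out_len)
{
    z_stream    zs;
    unsigned char chunk[CHUNK_SIZE];
    size_t      total = 0;

    memset(&zs, 0, sizeof(zs));
    if (inflateInit(&zs) != Z_OK)
        return false;
    zs.next_in = (Bytef *) in;
    zs.avail_in = (uInt) in_len;

    for (;;)
    {
        int         rc;
        size_t      produced;

        zs.next_out = chunk;
        zs.avail_out = CHUNK_SIZE;
        rc = inflate(&zs, Z_NO_FLUSH);
        if (rc != Z_OK && rc != Z_STREAM_END)
            break;                  /* corrupt or truncated stream */

        produced = CHUNK_SIZE - zs.avail_out;
        if (total + produced > out_cap || total + produced > PRE_AUTH_LIMIT)
            break;                  /* output would exceed the cap: reject */
        memcpy(out + total, chunk, produced);
        total += produced;

        if (rc == Z_STREAM_END)
        {
            inflateEnd(&zs);
            *out_len = total;
            return true;
        }
        if (zs.avail_in == 0 && produced == 0)
            break;                  /* no more input and no progress */
    }
    inflateEnd(&zs);
    return false;
}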

> > Because we are using streaming decompression, this is much less of an
> > issue than for things that decompress wholesale onto disk/into memory.
>
> (I agree in general, but since you're designing a protocol extension,
> IMO it's not enough that your implementation happens to mitigate
> risks. We more or less have to bake those mitigations into the
> specification of the extension, because things that aren't servers
> have to decompress now. Similar to RFC security considerations.)

We own both the canonical client and server, so those are both covered
here.  I would think it would be the responsibility of any other
system that maintains its own implementation of the postgres protocol
and chooses to support the compression protocol to perform its own
mitigations against potential compression security issues.  Should we
put the fixed message size limits (which have de facto been part of
the protocol since 2021, even if they weren't documented as such) into
the protocol documentation?  That would give implementers of the
protocol numbers that they could actually rely on when implementing
the appropriate safeguards, because they would have explicit
guarantees about the size of messages.  I think it would make more
sense to put the limits on the underlying messages rather than adding
an additional limit that only applies to compressed messages.
(I don't really see how one could implement other tooling that used
pg compression without using streaming decompression, as the protocol
never hands over a standalone blob of compressed data: all compressed
data is always part of a stream.  But even with streaming
decompression you still need some kind of limits, or you will just
chew up memory.)

-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-21 Thread Jacob Burroughs
> b. compression_restart=message: Restart the stream for every message.
> Recommended if the amount of correlation between rows of the same
> query is a security concern.
> c. compression_restart=manual: Don't restart the stream automatically,
> but only when the client user calls a specific function. Recommended
> only if the user can make trade-offs, or if no encryption is used
> anyway.

I rather like this idea, though I think maybe we should also (or
instead of per-query?) add a per-transaction option on the backend
side.  I'm curious what other people think of this.


-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-21 Thread Jacob Burroughs
On Mon, May 20, 2024 at 2:42 PM Jacob Champion wrote:
>
> I mean... you said it, not me. I'm trying not to rain on the parade
> too much, because compression is clearly very valuable. But it makes
> me really uncomfortable that we're reintroducing the compression
> oracle (especially over the authentication exchange, which is
> generally more secret than the rest of the traffic).

As currently implemented, the compression only applies to
CopyData/DataRow/Query messages, none of which should be involved in
authentication, unless I've really missed something in my
understanding.

> Right, I think it's reasonable to let a sufficiently
> determined/informed user lift the guardrails, but first we have to
> choose to put guardrails in place... and then we have to somehow
> sufficiently inform the users when it's okay to lift them.

My thought would be that compression should be opt-in on the client
side, with documentation around the potential security pitfalls.  (I
could be convinced it should be opt-in on the server side, but overall
I think opt-in on the client side generally protects against footguns
without excessively getting in the way, and if an attacker controls
the client, they can just get the information they want directly; they
don't need compression side channels to get that information.)

> But for SQL, where's the dividing line between attacker-chosen and
> attacker-sought? To me, it seems like only the user knows; the server
> has no clue. I think that puts us "lower" in Alyssa's model than HTTP
> is.
>
> As Andrey points out, there was prior work done that started to take
> this into account. I haven't reviewed it to see how good it is -- and
> I think there are probably many use cases in which queries and tables
> contain both private and attacker-controlled information -- but if we
> agree that they have to be separated, then the strategy can at least
> be improved upon.

Within SQL-level things, I don't think we can reasonably differentiate
between private and attacker-controlled information at the
libpq/server level.  We can reasonably differentiate between message
types that *definitely* are private and ones that could have
either/both kinds of data in them, but that's not nearly as useful.  I
think not compressing auth-related packets, plus giving clients a
mechanism to reset the compression stream (plus guidance on the
tradeoffs involved in turning on compression), is about as good as we
can get.  That said, I *think* the feature is reasonable to
review/commit without the reset functionality, as long as the
compressed data already has the mechanism built in (as it does) to
signal when a decompressor should restart its streaming.  The actual
signaling protocol mechanism/necessary libpq API can happen in
follow-on work.


-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-17 Thread Jacob Burroughs
On Fri, May 17, 2024 at 11:40 AM Robert Haas  wrote:
>
> To be clear, I am not arguing that it should be the receiver's choice.
> I'm arguing it should be the client's choice, which means the client
> decides what it sends and also tells the server what to send to it.
> I'm open to counter-arguments, but as I've thought about this more,
> I've come to the conclusion that letting the client control the
> behavior is the most likely to be useful and the most consistent with
> existing facilities. I think we're on the same page based on the rest
> of your email: I'm just clarifying.

This is what I am imagining too.

> I have some quibbles with the syntax but I agree with the concept.
> What I'd probably do is separate the server side thing into two GUCs,
> each with a list of algorithms, comma-separated, like we do for other
> lists in postgresql.conf. Maybe make the default 'all' meaning
> "everything this build of the server supports". On the client side,
> I'd allow both things to be specified using a single option, because
> wanting to do the same thing in both directions will be common, and
> you actually have to type in connection strings sometimes, so
> verbosity matters more.
>
> As far as the format of the value for that keyword, what do you think
> about either compression=DO_THIS_BOTH_WAYS or
> compression=DO_THIS_WHEN_SENDING/DO_THIS_WHEN_RECEIVING, with each "do
> this" being a specification of the same form already accepted for
> server-side compression e.g. gzip or gzip:level=9? If you don't like
> that, why do you think the proposal you made above is better, and why
> is that one now punctuated with slashes instead of semicolons?

I like this more than what I proposed, and will update the patches to
reflect this proposal. (I've gotten them locally back into a state of
applying cleanly and dealing with the changes needed to support direct
SSL connections, so refactoring the protocol layer shouldn't be too
hard now.)

On Fri, May 17, 2024 at 11:10 AM Jacob Champion wrote:
> We're talking about a transport-level option, though -- I thought the
> proposal enabled compression before authentication completed? Or has
> that changed?
On Fri, May 17, 2024 at 1:03 PM Jelte Fennema-Nio  wrote:
> I think it would make sense to only compress messages after
> authentication has completed. The gain of compressing authentication
> related packets seems pretty limited.

At the protocol level, compressed data is a message type that can be
used to wrap arbitrary data as soon as the startup packet is
processed.  However, as an implementation detail that clients should
not rely on, but that we can rely on in thinking about the
implications, the only message types that are compressed (except in
the 0005 CI patch for test running only) are PqMsg_CopyData,
PqMsg_DataRow, and PqMsg_Query, none of which are sent before
authentication.
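
A trivial sketch of that implementation detail (the helper name is
mine and purely illustrative; the PqMsg_* byte constants are the ones
from the backend's protocol.h mentioned above):

#include <stdbool.h>
#include "libpq/protocol.h"     /* PqMsg_CopyData, PqMsg_DataRow, PqMsg_Query */

static inline bool
message_type_compressible(char msgtype)
{
    /*
     * Only bulk data-carrying messages are candidates for compression, so
     * authentication traffic never reaches the compressor.
     */
    return msgtype == PqMsg_CopyData ||
           msgtype == PqMsg_DataRow ||
           msgtype == PqMsg_Query;
}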

-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-17 Thread Jacob Burroughs
On Thu, May 16, 2024 at 3:28 AM Robert Haas  wrote:
>
> Well, I mean, I don't really know what the right answer is here, but
> right now I can say pg_dump --compress=gzip to compress the dump with
> gzip, or pg_dump --compress=gzip:9 to compress with gzip level 9. Now,
> say that instead of compressing the output, I want to compress the
> data sent to me over the connection. So I figure I should be able to
> say pg_dump 'compress=gzip' or pg_dump 'compress=gzip:9'. I think you
> want to let me do the first of those but not the second. But, to turn
> your question on its head, what would be the reasoning behind such a
> restriction?

I think I was more thinking that trying to let both parties control
the parameter seemed like a recipe for confusion and sadness, and so
the choice that felt most natural to me was to let the sender control
it, but I'm definitely open to changing that the other way around.

> Note also the precedent of pg_basebackup. I can say pg_basebackup
> --compress=server-gzip:9 to ask the server to compress the backup with
> gzip at level 9. In that case, what I request from the server changes
> the actual output that I get, which is not the case here. Even so, I
> don't really understand what the justification would be for refusing
> to let the client ask for a specific compression level.
>
> And on the flip side, I also don't understand why the server would
> want to mandate a certain compression level. If compression is very
> expensive for a certain algorithm when the level is above some
> threshold X, we could have a GUC to limit the maximum level that the
> client can request. But, given that the gzip compression level
> defaults to 6 in every other context, why would the administrator of a
> particular server want to say, well, the default for my server is 3 or
> 9 or whatever?
>
> (This is of course all presuming you want to use gzip at all, which
> you probably don't, because gzip is crazy slow. Use lz4 or zstd! But
> it makes the point.)

New proposal, predicated on the assumption that if you enable
compression you are OK with the client picking whatever level it
wants.  At least with the currently enabled algorithms, I don't think
any of them are so insane that they would knock over a server or
anything, and in general postgres servers are usually connected to by
clients that the server admin has some channel to talk to (after all,
they somehow had to get access to log in to the server in the first
place) if they are doing something wasteful; and in any case, a client
can do far worse things than enable aggressive compression just by
writing bad queries.

On the server side, we use slash-separated sets of options:
connection_compression=DEFAULT_VALUE_FOR_BOTH_DIRECTIONS/client_to_server=OVERRIDE_FOR_THIS_DIRECTION/server_to_client=OVERRIDE_FOR_THIS_DIRECTION
with the values being semicolon-separated compression algorithms.
On the client side, you can specify compression with the same
structure of semicolon-separated algorithms, except that on the client
side you can also attach compression options, which the server will
use if provided and otherwise fall back to defaults.
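
To make that concrete, a hypothetical server setting and client
connection option under this proposal might look like the following
(algorithm names and levels chosen arbitrarily for illustration):

  # postgresql.conf
  connection_compression = 'zstd;lz4/client_to_server=lz4;gzip'

  # client connection string
  psql 'host=example.internal dbname=app compression=zstd:level=5;lz4'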

If we think we need to, we could let the server specify defaults for
server-side compression.  My overall thought though is that having an
excessive number of knobs increases the surface area for testing and
bugs while also increasing potential user confusion and that allowing
configuration on *both* sides doesn't seem sufficiently useful to be
worth adding that complexity.

-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-05-17 Thread Jacob Burroughs
On Fri, May 17, 2024 at 3:15 AM Robert Haas  wrote:
>
> OK, so you made it so that compressed data is fully self-identifying.
> Hence, there's no need to worry if something gets changed: the
> receiver, if properly implemented, can't help but notice. The only
> downside that I can see to this design is that you only have one byte
> to identify the compression algorithm, but that doesn't actually seem
> like a real problem at all, because I expect the number of supported
> compression algorithms to grow very slowly. I think it would take
> centuries, possibly millenia, before we started to get short of
> identifiers. So, cool.
>
> But, in your system, how does the client ask the server to switch to a
> different compression algorithm, or to restart the compression stream?

I was leaving the details of triggering that to this conversation, and
in that patch just designed the messages in a way that would allow the
reconfiguration/restarting to be added in a backwards-compatible way
in a future patch.  I would imagine that an explicit `ParameterSet`
call that sets `_pq_.connection_compression` (or whatever the
implementation details turn out to be) would also trigger a compressor
restart, and when restarted it would pick an algorithm/configuration
based on the new value of the parameter rather than the one used at
connection establishment.
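
For example, a future exchange might look something like this (the
parameter name and message flow are illustrative, not a settled
design):

  C: ParameterSet("_pq_.connection_compression", "zstd:level=3")
  S: (applies the new setting; both sides treat it as a restart point)
  S: next CompressedData message sets its algorithm byte, so the client
     restarts its decompressor with zstd before decoding the payload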



-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-05-16 Thread Jacob Burroughs
On Thu, May 16, 2024 at 6:57 AM Robert Haas  wrote:
>
> Ugh, it's so hard to communicate clearly about this stuff. I didn't
> really have any thought that we'd ever try to handle something as
> complicated as compression using ParameterSet. I tend to agree that
> for compression I'd like to see the startup packet contain more than
> _pq_.compression=1, but I'm not sure what would happen after that
> exactly. If the client asks for _pq_.compression=lz4 and the server
> tells the client that it doesn't understand _pq_.compression at all,
> then everybody's on the same page: no compression. But, if the server
> understands the option but isn't OK with the proposed value, what
> happens then? Does it send a NegotiateCompressionType message after
> the NegotiateProtocolVersion, for example? That seems like it could
> lead to the client having to be prepared for a lot of NegotiateX
> messages somewhere down the road.
>
> I think at some point in the past we had discussed having the client
> list all the algorithms it supported in the argument to
> _pq_.compression, and then the server would respond with the algorithm
> it wanted use, or maybe a list of algorithms that it could allow, and
> then we'd go from there. But I'm not entirely sure if that's the right
> idea, either.

As currently implemented [1], the client sends the server the list of
all compression algorithms it is willing to accept, and the server can
use one of them.  If the server knows what `_pq_.compression` means
but doesn't actually support any compression, it will both send the
client its empty list of supported algorithms and just never send any
compressed messages, and everyone involved will be (relatively) happy.
There is a libpq function that a client can use to check what
compression is in use if a client *really* doesn't want to continue
with the conversation without compression, but 99% of the time I can't
see why a client wouldn't prefer to continue using a connection with
whatever compression the server supports (or even none) without more
explicit negotiation.  (Unlike TLS, where automagically picking
between using and not using TLS has strange security implications and
effects, compression is a convenience feature for everyone involved.)

> Changing compression algorithms in mid-stream is tricky, too. If I
> tell the server "hey, turn on server-to-client compression!" then I
> need to be able to identify where exactly that happens. Any messages
> already sent by the server and not yet processed by me, or any
> messages sent after that but before the server handles my request, are
> going to be uncompressed. Then, at some point, I'll start getting
> compressed data. If the compressed data is framed inside some message
> type created for that purpose, like I get a CompressedMessage message
> and then I decompress to get the actual message, this is simpler to
> manage. But even then, it's tricky if the protocol shifts. If I tell
> the server, you know what, gzip was a bad choice, I want lz4, I'll
> need to know where the switch happens to be able to decompress
> properly.
>
> I don't know if we want to support changing compression algorithms in
> mid-stream. I don't think there's any reason we can't, but it might be
> a bunch of work for something that nobody really cares about. Not
> sure.

As the protocol layer is currently designed [1], it explicitly makes
it very easy to change/restart compression streams, specifically for
this use case (and in particular for the general connection pooler use
case).  Compressed data is already framed in a `CompressedData`
message, and that message has a header byte that corresponds to an
enum value for which algorithm is currently in use.  Any time the
compression stream is restarted by the sender, the first
`CompressedData` message will set that byte, and the receiver will
then restart its decompression stream with the chosen algorithm from
that point.  For `CompressedData` messages that continue using the
already-established stream, the byte is simply set to 0.  (This is
also how the "each side sends a list" form of negotiation is able to
work without additional round trips, as the `CompressedData` framing
itself communicates which compression algorithm has been selected.)
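
For reference, my mental model of the framing looks roughly like this
(only gzip=1 and 0=continue are stated earlier in the thread; the
other byte values here are illustrative, following the
pg_compress_algorithm enum ordering rather than anything normative):

  CompressedData
      Byte1   message type ('z' in the earlier drafts)
      Int32   length
      Byte1   algorithm: 0 = continue current stream,
                         1 = gzip, 2 = lz4, 3 = zstd (restart stream here)
      ByteN   compressed fragment of the underlying message stream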

[1] 
https://www.postgresql.org/message-id/CACzsqT5-7xfbz%2BSi35TBYHzerNX3XJVzAUH9AewQ%2BPp13fYBoQ%40mail.gmail.com

-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-15 Thread Jacob Burroughs
On Wed, May 15, 2024 at 11:31 AM Robert Haas  wrote:
> From my point of view, it's the client who knows what it wants to do
> with the connection. If the client plans to read a lot of data, it
> might want the server to compress that data, especially if it knows
> that it's on a slow link. If the client plans to send a lot of data --
> basically COPY, I'm not thinking this is going to matter much
> otherwise -- then it might want to compress that data before sending
> it, again, especially if it knows that it's on a slow link.
>
> But what does the server know, really? If some client connects and
> sends a SELECT query, the server can't guess whether that query is
> going to return 1 row or 100 million rows, so it has no idea of
> whether compression is likely to make sense or not. It is entitled to
> decide, as a matter of policy, that it's not willing to perform
> compression, either because of CPU consumption or security concerns or
> whatever, but it has no knowledge of what the purpose of this
> particular connection is.

I think I would agree with that.  That said, I don't think the client
should be in the business of specifying what configuration of the
compression algorithm the server should use.  The server administrator
(or really most of the time, the compression library developer's
defaults) gets to pick the compression/compute tradeoff for
compression that runs on the server (which I would imagine would be
the vast majority of it), and the client gets to pick those same
parameters for any compression that runs on the client machine
(probably indeed in practice only for large COPYs).  The *algorithm*
needs to actually be communicated/negotiated since different
client/server pairs may be built with support for different
compression libraries, but I think it is reasonable to say that the
side that actually has to do the computationally expensive part owns
the configuration of that part too.  Maybe I'm missing a good reason
that we want to allow clients to choose compression levels for the
server though?


-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-15 Thread Jacob Burroughs
On Wed, May 15, 2024 at 8:38 AM Robert Haas  wrote:
>
> I agree with that goal, but I'm somewhat confused by how your proposal
> achieves it. You had
> libpq_compression=lzma:client_to_server=off;gzip:server_to_client=off,
> so how do we parse that? Is that two completely separate
> specifications, one for lzma and one for gzip, and each of those has
> one option which is set to off? And then they are separated from each
> other by a semicolon? That actually does make sense, and I think it
> may do a better job allowing for compression options than my proposal,
> but it also seems a bit weird, because client_to_server and
> server_to_client are not really compression options at all. They're
> framing when this compression specification applies, rather than what
> it does when it applies. In a way it's a bit like the fact that you
> can prefix a pg_basebackup's --compress option with client- or server-
> to specify where the compression should happen. But we can't quite
> reuse that idea here, because in that case there's no question of
> doing it in both places, whereas here, you might want one thing for
> upstream and another thing for downstream.

Your interpretation is correct, but I don't disagree that it ends up
feeling confusing.

> I'm not a fan of three settings; I could go with two settings, one for
> each direction, and if you want both you have to set both. Or, another
> idea, what if we just separated the two directions with a slash,
> SEND/RECEIVE, and if there's no slash, then it applies to both
> directions. So you could say
> connection_compression='gzip:level=9/lzma' or whatever.
>
> But now I'm wondering whether these options should really be symmetric
> on the client and server sides? Isn't it for the server just to
> specify a list of acceptable algorithms, and the client to set the
> compression options? If both sides are trying to set the compression
> level, for example, who wins?

Compression options really only ever apply to the side doing the
compressing, and at least as I had imagined things, each party
(client/server) only uses its own level/other compression params.
That leaves me thinking that maybe we really want two independent
GUCs, one for "what algorithms are enabled/negotiable" and one for
"how should I configure my compressors".  Then we reduce the
dimensions we are trying to shove into one GUC, and each one ends up
with a very clear purpose:
connection_compression=(yes|no|alg1,alg2:server_to_client=alg1,alg2:client_to_server=alg3)
connection_compression_opts=gzip:level=2
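
If I'm reading my own proposal back correctly, a concrete
configuration might then look like (values purely illustrative):

  connection_compression = 'zstd,lz4:client_to_server=lz4'
  connection_compression_opts = 'zstd:level=5'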


-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Fwd: libpq compression (part 3)

2024-05-14 Thread Jacob Burroughs
On Tue, May 14, 2024 at 3:24 PM Robert Haas  wrote:
>
> IMHO, that's a HUGE improvement. But:
>
> * I would probably change is the name "libpq_compression", because
> even though we have src/backend/libpq, we typically use libpq to refer
> to the client library, not the server's implementation of the wire
> protocol. I think we could call it connection_encryption or
> wire_protocol_encryption or something like that, but I'm not a huge
> fan of libpq_compression.
>
I think connection_compression would seem like a good name to me.

> * I would use commas, not semicolons, to separate items in a list,
> i.e. lzma,gzip not lzma;gzip. I think that convention is nearly
> universal in PostgreSQL, but feel free to point out counterexamples if
> you were modelling this on something.
>
> * libpq_compression=lzma:client_to_server=off;gzip:server_to_client=off
> reads strangely to me. How about making it so that the syntax is like
> this:
>
> libpq_compression=DEFAULT_VALUE_FOR_BOTH_DIRECTIONS:client_to_server=OVERRIDE_FOR_THIS_DIRECTION:servert_to_client=OVERRIDE_FOR_THIS_DIRECTION
>
> With all components being optional. So this example could be written
> in any of these ways:
>
> libpq_compression=lzma;server_to_client=gzip
> libpq_compression=gzip;client_to_server=lzma
> libpq_compression=server_to_client=gzip;client_to_server=lzma
> libpq_compression=client_to_server=lzma;client_to_server=gzip
>
> And if I wrote libpq_compression=server_to_client=gzip that would mean
> send data to the client using gzip and in the other direction use
> whatever the default is.

The reason for both the semicolons and for not doing this is related
to using the same specification structure as here:
https://www.postgresql.org/docs/current/app-pgbasebackup.html
(specifically the --compress argument). Reusing that specification
requires that we use commas to separate the flags for a compression
method, and therefore left me with semicolons as the leftover
separator character. I think we could go with something like your
proposal, and in a lot of ways I like it, but there's still the
possibility of e.g.
`libpq_compression=client_to_server=zstd:level=10,long=true,gzip;client_to_server=gzip`
and I'm not quite sure how to make the separator characters work
coherently if we need to treat `zstd:level=10,long=true` as a unit.
Alternatively, we could have `connection_compression`,
`connection_compression_server_to_client`, and
`connection_compression_client_to_server` as three separate GUCs (and
on the client side `compression`, `compression_server_to_client`, and
`compression_client_to_server` as three separate connection
parameters), where we would treat `connection_compression` as a
default that could be overridden by an explicit
client_to_server/server_to_client.  That creates the slightly funky
case where if you specify all three then the base one ends up unused
because the two more specific ones are being used instead, but that
isn't necessarily terrible.  On the server side we *could* go with
just the server_to_client and client_to_server ones, but I think we
want it to be easy to use this feature in the simple case with a
single libpq parameter.

--
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-14 Thread Jacob Burroughs
On Tue, May 14, 2024 at 1:35 PM Robert Haas  wrote:
>
> Well, in my last response before the thread died, I complained that
> libpq_compression=gzip:compress=off was confusing, and I stand by
> that, because "compress" is used both in the name of the parameter and
> as an option within the value of that parameter. I think there's more
> than one acceptable way to resolve that problem, but I think leaving
> it like that is unacceptable.

What if we went with:
Server side:
* `libpq_compression=on` (I just want everything the server supports
available; probably the most common case)
* `libpq_compression=off` (I don't want any compression ever with this server)
* `libpq_compression=lzma;gzip` (I only want these algorithms for
whatever reason)
* `libpq_compression=lzma:client_to_server=off;gzip:server_to_client=off`
(I only want to send data with lzma and receive data with gzip)
Client side:
*`compression=on` (I just want compression; pick sane defaults
automatically for me; probably the most common case)
* `compression=off` (I don't want any compression)
* `compression=lzma;gzip` (I only want these algorithms for whatever reason)
* `compression=lzma:client_to_server=off;gzip:server_to_client=off` (I
only want to receive data with lzma and send data with gzip)

`client_to_server`/`server_to_client` is a bit verbose, but it's very
explicit in what it means, so you don't need to reason about who is
sending/receiving/etc in a given context, and a given config string
applied to the server or the client side has the same effect on the
connection.

> Even more broadly, I think there have been a couple of versions of
> this patch now where I read the documentation and couldn't understand
> how the feature was supposed to work, and I'm not really willing to
> spend time trying to review a complex patch for conformity with a
> design that I can't understand in the first place. I don't want to
> pretend like I'm the smartest person on this mailing list, and in fact
> I know that I'm definitely not, but I think I'm smart enough and
> experienced enough with PostgreSQL that if I look at the description
> of a parameter and say "I don't understand how the heck this is
> supposed to work", probably a lot of users are going to have the same
> reaction. That lack of understanding on my part my come either from
> the explanation of the parameter not being as good as it needs to be,
> or from the design itself not being as good as it needs to be, or from
> some combination of the two, but whichever is the case, IMHO you or
> somebody else has got to figure out how to fix it.

If the above proposal seems better to you I'll both rework the patch
and then also try to rewrite the relevant bits of documentation to
separate out "what knobs are there" and "how do I specify the flags to
turn the knobs", because I think those two being integrated is making
the parameter documentation less readable/followable.


-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-05-14 Thread Jacob Burroughs
On Tue, May 14, 2024 at 11:08 AM Robert Haas  wrote:
>
> According to https://commitfest.postgresql.org/48/4746/ this patch set
> needs review, but:
>
> 1. Considering that there have been no updates for 5 months, maybe
> it's actually dead?

I've withdrawn this patch from the commitfest.  I had been waiting for
some resolution on "Add new protocol message to change GUCs for usage
with future protocol-only GUCs" before I rebased/refactored this one,
because this would be introducing the first protocol extension so far,
and that discussion appeared to be working out some meaningful issues
on how GUCs and protocol parameters should interact.  If you think it
is worthwhile to proceed here though, I am very happy to do so. (I
would love to see this feature actually make it into postgres; it
would almost certainly be a big efficiency and cost savings win for
how my company deploys postgres internally :) )

> 2. I still think it needs to be more clear how the interface is
> supposed to work. I do not want to spend time reviewing a patch to see
> whether it works without understanding how it is intended to work --
> and I also think that reviewing the patch in detail before we've got
> the user interface right makes a whole lot of sense.

Regarding the interface, what I had originally gone for was the idea
that the naming of the options was from the perspective of the side
you were setting them on.  Therefore, when setting `libpq_compression`
as a server-level GUC, `compress` would control whether the server
would compress (send compressed data) with the given algorithm, and
`decompress` would control whether the server would decompress
(receive compressed data) with the given algorithm.  And likewise on
the client side, when setting `compression` as a connection config
option, `compress` would control whether the *client* would compress
(send compressed data) with the given algorithm, and `decompress`
would control whether the *client* would decompress (receive
compressed data) with the given algorithm.  So for a client to pick
what to send, it would choose from the intersection of its own
`compress=true` algorithms and the server's `decompress=true`
algorithms sent in the `ParameterStatus` message with
`libpq_compression`.  And likewise on the server side, it would choose
from the intersection of the server's `compress=true` algorithms and
the client's `decompress=true` algorithms sent in the
`_pq_.libpq_compression` startup option.  If you have a better
suggestion, I am very open to it though.



-- 
Jacob Burroughs | Staff Software Engineer
E: jburrou...@instructure.com




Re: libpq compression (part 3)

2024-01-12 Thread Jacob Burroughs
> I wonder if we could use "upstream" and "downstream" to be clearer? Or
> some other terminology?

What about `send` and `receive`?




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-01-02 Thread Jacob Burroughs
What if we created a new GUC flag `GUC_PROTOCOL_EXTENSION` (or a
better name), used it for any GUC options that should *only* be set
via a `ParameterSet` protocol message, and then prevented changing
those through SET/RESET/RESET ALL (but I don't see a reason to prevent
reading them through SHOW)?  (I would imagine most/all
`GUC_PROTOCOL_EXTENSION` GUCs would also set `GUC_NOT_IN_SAMPLE`,
`GUC_DISALLOW_IN_FILE`, and maybe `GUC_NO_SHOW_ALL`.)  Looking back at
the use cases in the original message of this thread, I would imagine
at least the "configurable GUC report" and "protocol compression" ones
would likely want to be flagged with `GUC_PROTOCOL_EXTENSION`, since
both would seem like things that require low-level client
involvement/support when they change.
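
As a sketch of the mechanics (the flag bit value, enum, and helper
here are hypothetical placeholders, not from any patch), the gate
could be as simple as checking where a value change came from:

#include <stdbool.h>

#define GUC_PROTOCOL_EXTENSION 0x400000 /* hypothetical flag bit */

typedef enum SetSource
{
    SOURCE_SQL,                 /* SET / RESET / RESET ALL */
    SOURCE_PARAMETER_SET        /* ParameterSet protocol message */
} SetSource;

static bool
protocol_param_set_allowed(int guc_flags, SetSource source)
{
    /* protocol-extension GUCs may only change via ParameterSet */
    if (guc_flags & GUC_PROTOCOL_EXTENSION)
        return source == SOURCE_PARAMETER_SET;
    return true;
}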

I was already drafting this email before the last one in the thread
came through, but I think this proposal addresses a few things from
it:
> - Since we currently don't have any protocol parameters. How do I test
> it? Should I add a debug protocol parameter specifically for this
> purpose? Or should my tests  just always error at the moment?
`protocol_managed_params` would serve as the example/test parameter
in this patch series.
> - If I create a debug protocol parameter, what designs should it
> inherit from GUCs? e.g. parsing and input validation sounds useful.
> And where should it be stored e.g. protocol_params.c?
I'm thinking using flag(s) on GUCs would get useful mechanics without
needing to implement an entirely separate param system.  It appears
there are existing flags that cover almost everything we would want
except for actually marking a GUC as a protocol extension.
> - How do you get the value of a protocol parameter status? Do we
> expect the client to keep track of what it has sent? Do we always send
> a ParameterStatus message whenever it is changed? Or should we add a
> ParamaterGet protocol message too?
I would think it would be reasonable for a client to track its own
ParameterSets if it has a reason to care about them, since presumably
it needs to do something differently after setting them anyways?
Though I could see an argument for sending `ParameterStatus` if we
want that to serve as an "ack" so a client could wait until it had
active confirmation from a server that a new parameter value was
applied when necessary.

> To clarify, the main thing that I want to do is take the value from
> ParameterStatus and somehow, without having to escape it, set that
> value for that GUC for a different session. As explained above, I now
> think that this newly proposed protocol message is a bad fit for this.
> But I still think that that is not a weird thing to want.

Now, for the possibly nutty part of my proposal, what if we added a
GUC string list named `protocol_managed_params` that had the
`GUC_PROTOCOL_EXTENSION` flag set, which would be a list of GUC names
to treat as if `GUC_PROTOCOL_EXTENSION` is set on them within the
context of the session.   If a client wanted to use `ParameterSet` to
manage a non-`GUC_PROTOCOL_EXTENSION` managed parameter, it would
first issue a `ParameterSet` to add the parameter to the
`protocol_managed_params` list, and then issue a `ParameterSet` to
actually set the parameter.  If for some reason (e.g. some pgbouncer
use cases) the client then wanted the parameter to be settable
"normally" it would issue a third `ParameterSet` to remove the
parameter from `protcocol_managed_params`.  Regarding transactional
semantics, I *think* it would be reasonable to specify that changes
made through `ParameterSet` are not transactional and that
`protocol_managed_params` itself cannot be changed while a
transaction is active.  That way within a given transaction a given
parameter has *only* transactional semantics or has *only*
nontransactional semantics, which I think avoids the potential
consistency problems?  I think this would both address the pgbouncer
use case where it wants to be able to reflect a ParameterStatus
message back to the server while being agnostic to its contents while
also addressing the "SET ROLE"/"SET SESSION AUTHORIZATION" issue: a
pooler would just add `session_authorization` to the
`protocol_managed_params` list and then it would have full control
over the user by not passing along `ParameterSet` messages setting the
user from the client. (I think it would generally be reasonable to
still allow setting the role within the restricted
session_authorization role, but that would be a pooler decision not a
PG one.)
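
An illustrative pooler-driven sequence (parameter and message names as
proposed above; the exact syntax is hypothetical):

  C: ParameterSet("protocol_managed_params", "session_authorization")
  C: ParameterSet("session_authorization", "app_user")
  -- from this point, SET SESSION AUTHORIZATION issued as SQL by the
  -- end client is rejected, so the pooler retains control of the user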




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2023-12-30 Thread Jacob Burroughs
> How about instead of talking about protocol-only GUCs (which are
> indeed currently a theoretical concept), we start considering this
> patch as a way to modify parameters for protocol extensions. Protocol
> extension parameters (starting with _pq_.) can currently only be
> configured using the StartupMessage and afterwards there is no way to
> modify them.
>
> I believe [1], [2] and [3] from my initial email could all use such
> protocol extension parameters, if those parameters could be changed
> after the initial startup.

What if we allowed the client to send `ParameterStatus` messages to
the server to reconfigure protocol extensions that were requested at
startup?  Such processing would be independent of the transaction
lifecycle since protocol-level options aren't related to transactions.
Any errors in the set value would be handled with an `ErrorResponse`
(including if an extension was not reconfigurable after connection
startup), and success with a new `ReadyForQuery` message. The actual
effect of an extension change must be delayed until after the
ReadyForQuery has been transmitted, though I don't know if that is
feasible or if we would need to instead implicitly assume changes were
successful and just close the connection on error.  We wouldn't need
to bump the protocol version since a client wouldn't (shouldn't) send
these messages unless it successfully negotiated the relevant protocol
extension(s) at startup.




Re: libpq compression (part 3)

2023-12-20 Thread Jacob Burroughs
> I'm having difficulty understanding the details of this handshaking
> algorithm from this description. It seems good that the handshake
> proceeds in each direction somewhat separately from the other, but I
> don't quite understand how the whole thing fits together. If the
> client tells the server that 'none,gzip' is supported, and I elect to
> start using gzip, how does the client know that I picked gzip rather
> than none? Are the compressed packets self-identifying?

I agree I could have spelled this out more clearly.  I forgot to
mention that I added a byte to the CompressedMessage message type that
specifies the chosen algorithm.  So if the server receives
'none,gzip', it can either keep sending uncompressed regular messages,
or it can compress them in CompressedMessage packets which now look
like "z{len}{format}{data}" (format just being a member of the
pg_compress_algorithm enum, so `1` in the case of gzip).  Overall the
intention is that both the client and the server can just start
sending CompressedMessages once they receive the list of ones other
party supports without any negotiation or agreement needed and without
an extra message type to first specify the compression algorithm. (One
byte per message seemed to me like a reasonable overhead for the
simplicity, but it wouldn't be hard to bring back SetCompressionMethod
if we prefer.)

> It's also slightly odd to me that the same parameter seems to specify
> both what we want to send, and what we're able to receive. I'm not
> really sure we should have separate parameters for those things, but I
> don't quite understand how this works without it. The "none" thing
> seems like a bit of a hack. It lets you say "I'd like to receive
> compressed data but send uncompressed data" ... but what about the
> reverse? How do you say "don't bother compressing what you receive
> from the server, but please lz4 everything you send to the server"? Or
> how do you say "use zstd from server to client, but lz4 from client to
> server"? It seems like you can't really say that kind of thing.

When I came up with the protocol I was imagining that basically both
server admins and clients might want a decent bit more control over
the compression they do rather than the decompression they do, since
compression is generally much more computationally expensive than
decompression.  Now that you point it out though, I don't think that
actually makes that much sense.

> What if we had, on the server side, a GUC saying what compression to
> accept and a GUC saying what compression to be willing to do? And then
> let the client request whatever it wants for each direction.

Here's two proposals:
Option 1:
GUCs:
libpq_compression (default "off")
libpq_decompression (default "auto", which is defined to be equal to
libpq_compression)
Connection parameters:
compression (default "off")
decompression (default "auto", which is defined to be equal to compression)

I think we would only send the decompression fields over the wire to
the other side, to be used by the other side to pick the first
mutually supported compression algorithm.  We would send the
`_pq_.libpq_decompression` protocol extension even if only compression
was enabled and not decompression, so that the server knows to enable
compression processing for the connection (I think this would be the
only place we would still use `none`, and not as part of a list in
this case).  I think we would also want to add libpq functions that
let a client check the last-used compression algorithm in each
direction for monitoring/inspection purposes (actually that's probably
a good idea regardless, so a client application that cares doesn't
need to implement the intersection and the assumption around choosing
the first algorithm in common).  Also, I'm open to better names than
"auto"; I just want to avoid unnecessary verbosity for the common case
of "I just want to enable bidirectional compression with whatever
algorithms are available, with default parameters".
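
A concrete (hypothetical) example of Option 1, for the case where a
server is willing to compress but not to spend cycles decompressing:

  # server
  libpq_compression = 'zstd;lz4'
  libpq_decompression = 'none'

  # client connection parameters
  compression=off decompression=zstd;lz4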

Option 2:
This one is even cleaner in the common case but a bit worse in the
uncommon case: just use one parameter and have
compression/decompression enabling be part of the compression detail
(e.g. "libpq_compression='gzip:no_decompress;lz4:level=2,no_compress;zstd'"
or something like that, in which case the "none,gzip" case would
become "'libpq_compression=gzip:no_compress'").  See
https://www.postgresql.org/docs/current/app-pgbasebackup.html ,
specifically the `--compress` flag, for how specifying compression
algorithms and details works.

I'm actually not sure which of the two I prefer; opinions are welcome :)

Thanks,
Jacob




Re: libpq compression (part 3)

2023-12-19 Thread Jacob Burroughs
> I believe this patch series is ready for detailed review/testing, with one 
> caveat: as can be seen here https://cirrus-ci.com/build/6732518292979712 , 
> the build is passing on all platforms and all tests except for the primary 
> SSL test on Windows.

One correction: I apparently missed a kerberos timeout failure on
freebsd with compression enabled (being color blind, the checkmark and
still-running colors look awfully similar to me, and I misread what I
saw).  I haven't yet successfully reproduced that one, so I may or may
not need some pointers to sort it out, but I think whatever it is, the
fix will be small enough that the patch overall is still reviewable.