Re: Should rolpassword be toastable?

2024-10-03 Thread Jacob Champion
On Thu, Oct 3, 2024 at 3:25 PM Tom Lane  wrote:
>
> Nathan Bossart  writes:
> > I don't mind proceeding with the patch if there is strong support for it.
> > I wavered only because it's hard to be confident that we are choosing the
> > right limit.
>
> I'm not that fussed about it; surely 256 is more than anyone is using?
> If not, we'll get push-back and then we can have a discussion about the
> correct limit that's informed by more than guesswork.

+1.

Next up is probably SCRAM-SHA-512, which should still have smaller
entries than that -- 222 bytes, I think, with 128-bit salts and a
5-digit iteration count?
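
For anyone who wants to double-check that 222-byte figure, here's the
back-of-the-envelope arithmetic I used. It assumes the stored secret keeps
the same layout as today's SCRAM-SHA-256 verifiers
(mechanism$iterations:salt$StoredKey:ServerKey, with base64-encoded binary
fields) -- that layout is an assumption on my part, not something anyone has
specified yet:

#include <stdio.h>

/* length of padded base64 for n raw bytes */
static int b64len(int n) { return (n + 2) / 3 * 4; }

int main(void)
{
    int mech = sizeof("SCRAM-SHA-512") - 1;  /* 13 */
    int iter = 5;                            /* 5-digit iteration count */
    int salt = b64len(16);                   /* 128-bit salt -> 24 */
    int key  = b64len(64);                   /* SHA-512 digest -> 88 */

    /* mech $ iter : salt $ StoredKey : ServerKey */
    printf("%d\n", mech + 1 + iter + 1 + salt + 1 + key + 1 + key);  /* 222 */
    return 0;
}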

--Jacob




Re: Retire support for OpenSSL 1.1.1 due to raised API requirements

2024-10-03 Thread Jacob Champion
On Thu, Oct 3, 2024 at 3:17 AM Daniel Gustafsson  wrote:
> In the TLSv1.3 cipher suite thread it was brought up that this bump in minimum
> version would bump the minimum version of libressl to 3.4, which corresponds
> to the OpenBSD 3.4 release (from October 2021).

OpenBSD 7.0, that is.

> We don't explicitly mention which libressl version we support, if we raise it
> as proposed here then perhaps it's a good time to state that in the docs.

+1

--Jacob




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-10-02 Thread Jacob Champion
On Wed, Oct 2, 2024 at 11:33 AM Daniel Gustafsson  wrote:
> > If I migrate a server to a different machine that doesn't support my
> > groups, I don't know that this would give me enough information to fix
> > the configuration.
>
> Fair point, how about something along the lines of:
>
> +   errmsg("ECDH: failed to set curve names specified in ssl_groups: %s",
> +   SSLerrmessageExt(ERR_get_error(),
> +_("No valid groups found"))),

Yeah, I think that's enough of a pointer. And then maybe "Failed to
set group names specified in ssl_groups: %s" to get rid of the
lingering ECC references?

> > One nice side effect of the new ssl_groups implementation is that we
> > now support common group aliases. For example, "P-256", "prime256v1",
> > and "secp256r1" can all be specified now, whereas before ony
> > "prime256v1" worked because of how we looked up curves. Is that worth
> > a note in the docs?
>
> Maybe. We have this currently in the manual:
>
> "The full list of available curves can be shown with the command
> openssl ecparam -list_curves.  Not all of them are
> usable with TLS though."
>
> Perhaps we can extend that with a short note on aliases?  Got any suggested
> wordings for that if so?

Hm, well, I went down a rabbit hole this afternoon -- OpenSSL has an
open feature request [1] that might eventually document this the right
way. In the meantime, maybe something like...

An incomplete list of available groups can be shown with the
command openssl ecparam -list_curves. Not all of them are usable with
TLS though, and many supported group names and aliases are omitted.

In PostgreSQL versions before 18.0 this setting was named
ssl_ecdh_curve. It only accepted a single value and did not recognize
group aliases at all.

Thanks,
--Jacob

[1] https://github.com/openssl/openssl/issues/17953




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-10-02 Thread Jacob Champion
On Wed, Sep 25, 2024 at 6:39 AM Daniel Gustafsson  wrote:
> I can't recall specific bounds for supporting LibreSSL even being discussed,
> the support is also not documented as an official thing.  Requiring TLS 1.3
> APIs for supporting a library in 2025 (when 18 ships) doesn't seem entirely
> unreasonable so maybe 3.4 is a good cutoff.  The fact that LibreSSL trailed
> behind OpenSSL in adding these APIs shouldn't limit our functionality.

Okay. At minimum I think we'll lose conchuela, plover, and morepork
from the master builds until they are updated. schnauzer is new enough
to keep going.

> Thinking on it a bit I went (to some
> degree inspired by what we did in curl) with ssl_tls13_ciphers which makes the
> name very similar to the tls12 GUC but with the clear distinction of being
> protocol specific.  It also makes the GUC name more readable to place the
> protocol before "ciphers" I think.

Looks fine to me.

> I ended
> up adding a version of SSLerrmessage which takes a replacement string for
> ecode 0 (which admittedly is hardcoded version knowledge as well..).  This can
> be used for scenarios when it's known that OpenSSL sometimes reports an error
> and sometimes not (I'm sure there are quite a few more).

I like this new API! And yeah, I think it'll get more use elsewhere.

My only nitpick for this particular error message is that there's no
longer any breadcrumb back to the setting that's broken:

FATAL:  ECDH: failed to set curve names: No valid groups found
HINT:  Ensure that each group name is spelled correctly and
supported by the installed version of OpenSSL

If I migrate a server to a different machine that doesn't support my
groups, I don't know that this would give me enough information to fix
the configuration.

--

One nice side effect of the new ssl_groups implementation is that we
now support common group aliases. For example, "P-256", "prime256v1",
and "secp256r1" can all be specified now, whereas before ony
"prime256v1" worked because of how we looked up curves. Is that worth
a note in the docs? Even if not, it might be good to keep in mind for
support threads.

Thanks,
--Jacob




Re: pg_parse_json() should not leak token copies on failure

2024-10-01 Thread Jacob Champion
(Tom, thank you for the fixup in 75240f65!)

Off-list, Andrew suggested an alternative approach to the one I'd been
taking: let the client choose whether it wants to take token ownership
to begin with. v3 adds a new flag (and associated API) which will
allow libpq to refuse ownership of those tokens. The lexer is then
free to clean everything up during parse failures.

Usually I'm not a fan of "do the right thing" flags, but in this case,
libpq really is the outlier. And it's nice that existing clients
(potentially including extensions) don't have to worry about an API
change.

At the moment, we have a test matrix consisting of "standard frontend"
and "shlib frontend" tests for the incremental parser. I'm planning
for the v4 patch to extend that with a "owned/not owned" dimension;
any objections?

Thanks,
--Jacob


v3-0001-jsonapi-add-lexer-option-to-keep-token-ownership.patch
Description: Binary data


Re: Row pattern recognition

2024-10-01 Thread Jacob Champion
On Sun, Sep 29, 2024 at 5:08 PM Tatsuo Ishii  wrote:
> > I think implementing MEASURES is challenging. Especially we need to
> > find how our parser accepts "colname OVER
> > window_definition". Currently PostgreSQL's parser only accepts "func()
> > OVER window_definition" Even it is technically possible, I think the
> > v1 patch size will become much larger than now due to this.

[resending, to the whole list this time]

Yeah. In any case, I'm not the right person to bolt MEASURES onto the
existing grammar... my misadventures in the PATTERN parser have
highlighted how little I know about Bison. :D

> Please disregard my proposal. Even if we make such a function, it
> would always return NULL for unmatched rows or skipped rows, and I
> think the function does not solve your problem.
>
> However, I wonder if supporting MEASURES solves the problem either
> because any columns defined by MEASURES will return NULL except the
> first row in a reduced frame. Can you please show an example how to
> identify runs of matched rows using MEASURES?

I think you're probably right; my suggestion can't distinguish between
skipped (but previously matched) rows and entirely-unmatched rows. The
test case I'd been working with returned an empty match as a fallback,
so it wouldn't have had that problem in practice. I was hoping that
one of the existing whole-partition window functions would allow me to
cobble something together based on the COUNT(*) measure, but after
searching for a while I haven't been able to come up with a solution.

Maybe it's just too niche for the window-function version of this --
after all, it only makes sense when using both INITIAL and AFTER MATCH
SKIP PAST LAST ROW. A more general solution could identify the
row_number of the first and last rows of the window frame, perhaps?
But a frame isn't guaranteed to be contiguous, so maybe that doesn't
make sense either. Ugh.

> I wonder how Oracle solves the problem (an infinite set of possible
> matches) without using "--max-rows" or something like that because in
> my understanding Oracle supports the regular expressions and PERMUTE.

I chose a confusing way to describe it, sorry. The parenthesized
language for a pattern can be an infinite set, because A+ could match
"( A )" or "( A A )" or "( A A A )" and so on forever. But that
doesn't apply to our regex engine in practice; our tables have a
finite number of rows, and I *think* the PL for a finite number of
rows is also finite, due to the complicated rules on where empty
matches are allowed to appear in the language. (In any case, my tool
doesn't guard against infinite recursion...)

> >> My implementation is really messy -- it leaks memory like a sieve, and
> >> I cannibalized the parser from ECPG, which just ended up as an
> >> exercise in teaching myself flex/bison. But if there's interest in
> >> having this kind of tool in the tree, I can work on making it
> >> reviewable. Either way, I should be able to use it to double-check
> >> more complicated test cases.
>
> I definitely am interested in the tool!

Okay, good to know! I will need to clean it up considerably, and
figure out whether I've duplicated more code than I should have.

> >> A while back [2], you were wondering whether our Bison implementation
> >> would be able to parse the PATTERN grammar directly. I think this tool
> >> proves that the answer is "yes", but PERMUTE in particular causes a
> >> shift/reduce conflict. To fix it, I applied the same precedence
> >> workaround that we use for CUBE and ROLLUP.
>
> That's a good news!

+1

Thanks,
--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-09-30 Thread Jacob Champion
On Mon, Sep 30, 2024 at 6:38 AM Antonin Houska  wrote:
>
> Jacob Champion  wrote:
>
> > On Fri, Sep 27, 2024 at 10:58 AM Antonin Houska  wrote:
> > That's why people are so pedantic about saying that OAuth is an
> > authorization framework and not an authentication framework.
>
> This statement alone sounds as if you missed *authentication*, but you seem to
> admit above that the /userinfo endpoint provides it ("tells you who the end
> user is"). I agree that it does. My understanding is that this endpoint, as
> well as the concept of "claims" and "scopes", is introduced by OpenID, which
> is an *authentication* framework, although it's built on top of OAuth.

OpenID is an authentication framework, but it's generally focused on a
type of client known as a Relying Party. In the architecture of this
patchset, the Relying Party would be libpq, which has the option of
retrieving authentication claims from the provider. Unfortunately for
us, libpq has no use for those claims. It's not trying to authenticate
the user for its own purposes.

The Postgres server, on the other hand, is not a Relying Party. (It's
an OAuth resource server, in this architecture.) It's not performing
any of the OIDC flows, it's not talking to the end user and the
provider at the same time, and it is very restricted in its ability to
influence the client exchange via the SASL mechanism.

> Regarding *authorization*, I agree that the bearer token may not contain
> enough information to determine whether the owner of the token is allowed to
> access the database. However, I consider database a special kind of
> "application", which can handle authorization on its own. In this case, the
> authorization can be controlled by (not) assigning the user the LOGIN
> attribute, as well as by (not) granting it privileges on particular database
> objects. In short, I think that *authentication* is all we need.

Authorizing the *end user's* access to the database using scopes is
optional. Authorizing the *bearer's* ability to connect on behalf of
the end user, however, is mandatory. Hopefully the below clarifies.

(I agree that most people probably want to use authentication, so that
the database can then make decisions based on HBA settings. OIDC is a
fine way to do that.)

> Are you sure you can legitimately acquire the bearer token containing my email
> address?

Yes. In general that's how OpenID-based "Sign in with "
works. All those third-party services are running around with tokens
that identify you, but unless they've asked for more abilities and
you've granted them the associated scopes, identifying you is all they
can do.

> I think the email address returned by the /userinfo endpoint is one
> of the standard claims [1]. Thus by returning the particular value of "email"
> from the endpoint the identity provider asserts that the token owner does have
> this address.

We agree that /userinfo gives authentication claims for the end user.
It's just insufficient for our use case.

For example, there are enterprise applications out there that will ask
for read access to your Google Calendar. If you're willing to grant
that, then you probably won't mind if those applications also know
your email address, but you probably do mind if they're suddenly able
to access your production databases just because you gave them your
email.

Put another way: if you log into Postgres using OAuth, and your
provider doesn't show you a big message saying "this application is
about to access *your* prod database using *your* identity; do you
want to allow that?", then your DBA has deployed a really dangerous
configuration. That's a critical protection feature you get from your
OAuth provider. Otherwise, what's stopping somebody else from setting
up their own malicious service to farm access tokens? All they'd have
to do is ask for your email.

> Another question, assuming the token verification is resolved somehow:
> wouldn't it be sufficient for the initial implementation if the client could
> pass the bearer token to libpq in the connection string?

It was discussed wayyy upthread:


https://postgr.es/m/CAAWbhmhmBe9v3aCffz5j8Sg4HMWWkB5FvTDCSZ_Vh8E1fX91Gw%40mail.gmail.com

Basically, at that point the entire implementation becomes an exercise
for the reader. I want to avoid that if possible. I'm not adamantly
opposed to it, but I think the client-side hook implementation is
going to be better for the use cases that have been discussed so far.

> Also, if libpq accepted the bearer token via the connection string, it would
> be possible to implement the authorization as a separate front-end application
> (e.g. pg_oauth_login) rather than adding more complexity to libpq itself.

The appl

Re: Row pattern recognition

2024-09-27 Thread Jacob Champion
On Wed, Sep 18, 2024 at 10:00 PM Tatsuo Ishii  wrote:
>
> Attached are the v22 patches. Just rebased.

Thanks!

With some bigger partitions, I hit an `ERROR:  wrong pos: 1024`. A
test that reproduces it is attached.

While playing with the feature, I've been trying to identify runs of
matched rows by eye. But it's pretty difficult -- the best I can do is
manually count rows using a `COUNT(*) OVER ...`. So I'd like to
suggest that MEASURES be part of the eventual v1 feature, if there's
no other way to determine whether a row was skipped by a previous
match. (That was less obvious to me before the fix in v17.)

--

I've been working on an implementation [1] of SQL/RPR's "parenthesized
language" and preferment order. (These are defined in SQL/Foundation
2023, section 9.41.) The tool gives you a way to figure out, for a
given pattern, what matches are supposed to be attempted and in what
order:

$ ./src/test/modules/rpr/rpr_prefer "a b? a"
( ( a ( b ) ) a )
( ( a ( ) ) a )

Many simple patterns result in an infinite set of possible matches. So
if you use an unbounded quantifier, you have to also use --max-rows
to limit the size of the hypothetical window frame:

$ ./src/test/modules/rpr/rpr_prefer --max-rows 2 "^ PERMUTE(a*, b+)? $"
( ( ^ ( ( ( ( ( ( a ) ( b ) ) ) - ) ) ) ) $ )
( ( ^ ( ( ( ( ( ( ) ( b b ) ) ) - ) ) ) ) $ )
( ( ^ ( ( ( ( ( ( ) ( b ) ) ) - ) ) ) ) $ )
( ( ^ ( ( ( - ( ( ( b b ) ( ) ) ) ) ) ) ) $ )
( ( ^ ( ( ( - ( ( ( b ) ( a ) ) ) ) ) ) ) $ )
( ( ^ ( ( ( - ( ( ( b ) ( ) ) ) ) ) ) ) $ )
( ( ^ ( ) ) $ )

I've found this useful to check my personal understanding of the spec
and the match behavior, but it could also potentially be used to
generate test cases, or to help users debug their own patterns. For
example, a pattern that has a bunch of duplicate sequences in its PL
is probably not very well optimized:

$ ./src/test/modules/rpr/rpr_prefer --max-rows 4 "a+ a+"
( ( a a a ) ( a ) )
( ( a a ) ( a a ) )
( ( a a ) ( a ) )
( ( a ) ( a a a ) )
( ( a ) ( a a ) )
( ( a ) ( a ) )

And patterns with catastrophic backtracking behavior tend to show a
"sawtooth" pattern in the output, with a huge number of potential
matches being generated relative to the number of rows in the frame.

My implementation is really messy -- it leaks memory like a sieve, and
I cannibalized the parser from ECPG, which just ended up as an
exercise in teaching myself flex/bison. But if there's interest in
having this kind of tool in the tree, I can work on making it
reviewable. Either way, I should be able to use it to double-check
more complicated test cases.

A while back [2], you were wondering whether our Bison implementation
would be able to parse the PATTERN grammar directly. I think this tool
proves that the answer is "yes", but PERMUTE in particular causes a
shift/reduce conflict. To fix it, I applied the same precedence
workaround that we use for CUBE and ROLLUP.

Thanks again!
--Jacob

[1] https://github.com/jchampio/postgres/tree/dev/rpr
[2] https://www.postgresql.org/message-id/20230721.151648.412762379013769790.t-ishii%40sranhm.sra.co.jp
commit 819c1abba21c9b59cceedd55962c3fdcb982aad5
Author: Jacob Champion 
Date:   Thu Sep 26 14:12:38 2024 -0700

RPR: test larger window partitions

The second test case fails with

ERROR:  wrong pos: 1024

diff --git a/src/test/regress/expected/rpr.out b/src/test/regress/expected/rpr.out
index 0789e09435..dcfd67f143 100644
--- a/src/test/regress/expected/rpr.out
+++ b/src/test/regress/expected/rpr.out
@@ -834,3 +834,40 @@ ERROR:  SEEK is not supported
 LINE 8:  SEEK
  ^
 HINT:  Use INITIAL.
+-- Smoke test for larger partitions.
+WITH s AS (
+ SELECT v, count(*) OVER w AS c
+ FROM (SELECT generate_series(1, 5000) v)
+ WINDOW w AS (
+  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+  AFTER MATCH SKIP PAST LAST ROW
+  INITIAL
+  PATTERN ( r+ )
+  DEFINE r AS TRUE
+ )
+)
+-- Should be exactly one long match across all rows.
+SELECT * FROM s WHERE c > 0;
+ v |  c   
+---+--
+ 1 | 5000
+(1 row)
+
+WITH s AS (
+ SELECT v, count(*) OVER w AS c
+ FROM (SELECT generate_series(1, 5000) v)
+ WINDOW w AS (
+  ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
+  AFTER MATCH SKIP PAST LAST ROW
+  INITIAL
+  PATTERN ( r )
+  DEFINE r AS TRUE
+ )
+)
+-- Every row should be its own match.
+SELECT count(*) FROM s WHERE c > 0;
+ count 
+---
+  5000
+(1 row)
+
diff --git a/src/test/regress/sql/rpr.sql b/src/test/regress/sql/rpr.sql
index 302e2b86a5..6ff61e6d60 100644
--- a/src/test/regress/sql/rpr.sql
+++ b/src/test/regress/sql/rpr.sql
@@ -403,3 +403,32 @@ SELECT company, tdate, price, first_value(price) OVER w, last_value(price) OVER
   UP AS price > PREV(price),
   DOWN AS price < PREV(price)
 );
+
+-- Smoke test for larger partitions.
+WITH s AS (
+ SELECT v, cou

Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-09-27 Thread Jacob Champion
On Fri, Sep 27, 2024 at 10:58 AM Antonin Houska  wrote:
> Have you considered sending the token for validation to the server, like this
>
> curl -X GET "https://www.googleapis.com/oauth2/v3/userinfo" -H "Authorization: Bearer $TOKEN"

In short, no, but I'm glad you asked. I think it's going to be a
common request, and I need to get better at explaining why it's not
safe, so we can document it clearly. Or else someone can point out
that I'm misunderstanding, which honestly would make all this much
easier and less complicated. I would love to be able to do it that
way.

We cannot, for the same reason libpq must send the server an access
token instead of an ID token. The /userinfo endpoint tells you who the
end user is, but it doesn't tell you whether the Bearer is actually
allowed to access the database. That difference is critical: it's
entirely possible for an end user to be authorized to access the
database, *and yet* the Bearer token may not actually carry that
authorization on their behalf. (In fact, the user may have actively
refused to give the Bearer that permission.) That's why people are so
pedantic about saying that OAuth is an authorization framework and not
an authentication framework.

To illustrate, think about all the third-party web services out there
that ask you to Sign In with Google. They ask Google for permission to
access your personal ID, and Google asks you if you're okay with that,
and you either allow or deny it. Now imagine that I ran one of those
services, and I decided to become evil. I could take my legitimately
acquired Bearer token -- which should only give me permission to query
your Google ID -- and send it to a Postgres database you're authorized
to access.

The server is supposed to introspect it, say, "hey, this token doesn't
give the bearer access to the database at all," and shut everything
down. For extra credit, the server could notice that the client ID
tied to the access token isn't even one that it recognizes! But if all
the server does is ask Google, "what's the email address associated
with this token's end user?", then it's about to make some very bad
decisions. The email address it gets back doesn't belong to Jacob the
Evil Bearer; it belongs to you.

Now, the token introspection endpoint I mentioned upthread should give
us the required information (scopes, etc.). But Google doesn't
implement that one. In fact they don't seem to have implemented custom
scopes at all in the years since I started work on this feature, which
makes me think that people are probably not going to be able to safely
log into Postgres using Google tokens. Hopefully there's some feature
buried somewhere that I haven't seen.

Let me know if that makes sense. (And again: I'd love to be proven
wrong. It would improve the reach of the feature considerably if I
am.)

Thanks,
--Jacob




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-09-18 Thread Jacob Champion
On Mon, Sep 9, 2024 at 5:00 AM Daniel Gustafsson  wrote:
> Good catch.  OpenSSL 3.2 changed the error message to be a lot more helpful,
> before that there is no error added to the queue at all for this processing
> (hence the "no SSL error reported").  The attached adds a hint as well as a
> proper error message for OpenSSL versions prior to 3.2.

Thanks!

> The attached version also has a new 0001 which bumps the minimum required
> OpenSSL version to 1.1.1 (from 1.1.0) since this patchset requires API's only
> present in 1.1.1 and onwards.  To keep it from being hidden here I will raise
> a separate thread about it.

As implemented, my build matrix is no longer able to compile against
LibreSSL 3.3 and below (OpenBSD 6.x). Has the lower bound on LibreSSL
for PG18 been discussed yet?

> +#ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'  # allowed TLSv1.2 ciphers
> +#ssl_cipher_suites = ''# allowed TLSv1.3 cipher suites, blank for default

After marinating on this a bit... I think the naming may result in
some "who's on first" miscommunications in forums and on the list. "I
set the SSL ciphers to , but it says there are no valid
ciphers available!" Should we put TLS 1.3 into the new GUC name
somehow?

> +   {"ssl_groups", PGC_SIGHUP, CONN_AUTH_SSL,
> +   gettext_noop("Sets the curve(s) to use for ECDH."),

The ECDH reference should probably be updated/removed. Maybe something
like "Sets the group(s) to use for Diffie-Hellman key exchange." ?

> +#if (OPENSSL_VERSION_NUMBER <= 0x30200000L)
> +   /*
> +* OpenSSL 3.3.0 introduced proper error messages for group
> +* parsing errors, earlier versions returns "no SSL error
> +* reported" which is far from helpful. For older versions, we
> +* manually set a better error message. Injecting the error
> +* into the OpenSSL error queue need APIs from OpenSSL 3.0.
> +*/
> +   errmsg("ECDH: failed to set curve names: No valid groups in 
> '%s'",
> +  SSLECDHCurve),

nit: can we do this only when ERR_get_error() returns zero? It looks
like LibreSSL is stuck at OPENSSL_VERSION_NUMBER == 0x20000000L, so if
they introduce a nice error message at some point it'll still get
ignored.
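
To sketch what I mean (purely illustrative, reusing the GUC and helper names
from the quoted patch rather than proposing final wording):

unsigned long ecode = ERR_get_error();

ereport(FATAL,
        (errcode(ERRCODE_CONFIG_FILE_ERROR),
         ecode == 0
         ? errmsg("ECDH: failed to set curve names: No valid groups in '%s'",
                  SSLECDHCurve)
         : errmsg("ECDH: failed to set curve names: %s",
                  SSLerrmessage(ecode))));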

> +   &SSLCipherLists,

nit: a singular "SSLCipherList" would be clearer, IMO.

--

Looking at the commit messages:

>Support configuring multiple ECDH curves
>
>The ssl_ecdh_curve only GUC accepts a single value, but the TLS

"GUC" and "only" are transposed here.

>Support configuring TLSv1.3 cipher suites
>
>The ssl_ciphers GUC can only set cipher suites for TLSv1.2, and lower,
>connections. For TLSv1.3 connections a different OpenSSL must be used.

"a different OpenSSL API", maybe?

Thanks,
--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-09-16 Thread Jacob Champion
On Wed, Sep 11, 2024 at 3:54 PM Jacob Champion  wrote:
> Yeah, and I still owe you all an updated roadmap.

Okay, here goes. New reviewers: start here!

== What is This? ==

OAuth 2.0 is a way for a trusted third party (a "provider") to tell a
server whether a client on the other end of the line is allowed to do
something. This patchset adds OAuth support to libpq with libcurl,
provides a server-side API so that extension modules can add support
for specific OAuth providers, and extends our SASL support to carry
the OAuth access tokens over the OAUTHBEARER mechanism.

Most OAuth clients use a web browser to perform the third-party
handshake. (These are your "Okta logins", "sign in with XXX", etc.)
But there are plenty of people who use psql without a local browser,
and invoking a browser safely across all supported platforms is
actually surprisingly fraught. So this patchset implements something
called device authorization, where the client will display a link and
a code, and then you can log in on whatever device is convenient for
you. Once you've told your provider that you trust libpq to connect to
Postgres on your behalf, it'll give libpq an access token, and libpq
will forward that on to the server.

== How This Fits, or: The Sales Pitch ==

The most popular third-party auth methods we have today are probably
the Kerberos family (AD/GSS/SSPI) and LDAP. If you're not already in
an MS ecosystem, it's unlikely that you're using the former. And users
of the latter are, in my experience, more-or-less resigned to its use,
in spite of LDAP's architectural security problems and the fact that
you have to run weird synchronization scripts to tell Postgres what
certain users are allowed to do.

OAuth provides a decently mature and widely-deployed third option. You
don't have to be running the infrastructure yourself, as long as you
have a provider you trust. If you are running your own infrastructure
(or if your provider is configurable), the tokens being passed around
can carry org-specific user privileges, so that Postgres can figure
out who's allowed to do what without out-of-band synchronization
scripts. And those access tokens are a straight upgrade over
passwords: even if they're somehow stolen, they are time-limited, they
are optionally revocable, and they can be scoped to specific actions.

== Extension Points ==

This patchset provides several points of customization:

Server-side validation is farmed out entirely to an extension, which
we do not provide. (Each OAuth provider is free to come up with its
own proprietary method of verifying its access tokens, and so far the
big players have absolutely not standardized.) Depending on the
provider, the extension may need to contact an external server to see
what the token has been authorized to do, or it may be able to do that
offline using signing keys and an agreed-upon token format.

The client driver using libpq may replace the device authorization
prompt (which by default is done on standard error), for example to
move it into an existing GUI, display a scannable QR code instead of a
link, and so on.

The driver may also replace the entire OAuth flow. For example, a
client that already interacts with browsers may be able to use one of
the more standard web-based methods to get an access token. And
clients attached to a service rather than an end user could use a more
straightforward server-to-server flow, with pre-established
credentials.

== Architecture ==

The client needs to speak HTTP, which is implemented entirely with
libcurl. Originally, I used another OAuth library for rapid
prototyping, but the quality just wasn't there and I ported the
implementation. An internal abstraction layer remains in the libpq
code, so if a better client library comes along, switching to it
shouldn't be too painful.

The client-side hooks all go through a single extension point, so that
we don't continually add entry points in the API for each new piece of
authentication data that a driver may be able to provide. If we wanted
to, we could potentially move the existing SSL passphrase hook into
that, or even handle password retries within libpq itself, but I don't
see any burning reason to do that now.

I wanted to make sure that OAuth could be dropped into existing
deployments without driver changes. (Drivers will probably *want* to
look at the extension hooks for better UX, but they shouldn't
necessarily *have* to.) That has driven several parts of the design.

Drivers using the async APIs should continue to work without blocking,
even during the long HTTP handshakes. So the new client code is
structured as a typical event-driven state machine (similar to
PQconnectPoll). The protocol machine hands off control to the OAuth
machine during authentication, without really needing to know how it
works, because the OAuth machine replaces the PQsocket with a
general-purpos

Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

2024-09-11 Thread Jacob Champion
On Wed, Sep 11, 2024 at 12:08 PM Lars Kanis  wrote:
> How did you verify the issue on the server side - with YugabyteDB or
> with a modified Postgres server? I'd like to verify the GSSAPI part and
> I'm familiar with the Postgres server only.

Neither, unfortunately -- I have a protocol testbed that I use for
this kind of stuff. I'm happy to share once I get it cleaned up, but
it's unlikely to help you in this case since I haven't implemented
gssenc support. Patching the Postgres server itself seems like a good
way to go.

> > And are there any other sites that
> > need to make the same guarantee before returning?
>
> Which other sites do you mean?

I'm mostly worried that other parts of libpq might assume that a
single call to pqReadData will drain the buffers. If not, great! --
but I haven't had time to check all the call sites.

> > I need to switch away from this for a bit. Would you mind adding this
> > to the next Commitfest as a placeholder?
>
> No problem; registered: https://commitfest.postgresql.org/50/5251/

Thank you!

--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-09-11 Thread Jacob Champion
(Thanks for the commit, Peter!)

On Wed, Sep 11, 2024 at 6:44 AM Daniel Gustafsson  wrote:
>
> > On 11 Sep 2024, at 09:37, Peter Eisentraut  wrote:
>
> > Is there any sense in dealing with the libpq and backend patches separately 
> > in sequence, or is this split just for ease of handling?
>
> I think it's just make reviewing a bit easier.  At this point I think they can
> be merged together, it's mostly out of historic reasons IIUC since the
> patchset earlier on supported more than one library.

I can definitely do that (and yeah, it was to make the review slightly
less daunting). The server side could potentially be committed
independently, if you want to parallelize a bit, but it'd have to be
torn back out if the libpq stuff didn't land in time.

> > (I suppose the 0004 "review comments" patch should be folded into the 
> > respective other patches?)

Yes. I'm using that patch as a holding area while I write tests for
the hunks, and then moving them backwards.

> I added a warning to autoconf in case --with-oauth is used without
> --with-python since this combination will error out in running the tests.
> Might be superfluous but I had an embarrassingly long headscratcher myself as
> to why the tests kept failing =)

Whoops, sorry. I guess we should just skip them if Python isn't there?

> CURL_IGNORE_DEPRECATION(x;) broke pgindent, it needs to keep the semicolon on
> the outside like CURL_IGNORE_DEPRECATION(x);.  This doesn't really work well
> with how the macro is defined, not sure how we should handle that best (the
> attached makes the style as per how pgindent wants it with the semicolon
> returned).

Ugh... maybe a case for a pre_indent rule in pgindent?

> The oauth_validator test module need to load Makefile.global before exporting
> the symbols from there.

Hm. Why was that passing the CI, though...?

> There is a first stab at documenting the validator module API, more to come
> (it doesn't compile right now).
>
> It contains a pgindent and pgperltidy run to keep things as close to in final
> sync as we can to catch things like the curl deprecation macro mentioned above
> early.

Thanks!

> > What could be the next steps to keep this moving along, other than stare at 
> > the remaining patches until we're content with them? ;-)
>
> I'm in the "stare at things" stage now to try and get this into the tree =)

Yeah, and I still owe you all an updated roadmap.

While I fix up the tests, I've also been picking away at the JSON
encoding problem that was mentioned in [1]; the recent SASLprep fix
was fallout from that, since I'm planning to pull in pieces of its
UTF-8 validation. I will eventually want to fuzz the heck out of this.

> To further pick away at this huge patch I propose to merge the SASL message
> length hunk which can be extracted separately.  The attached .txt (to keep the
> CFBot from poking at it) contains a diff which can be committed ahead of the
> rest of this patch to make it a tad smaller and to keep the history of that
> change a bit clearer.

LGTM!

--

Peter asked me if there were plans to provide a "standard" validator
module, say as part of contrib. The tricky thing is that Bearer
validation is issuer-specific, and many providers give you an opaque
token that you're not supposed to introspect at all.

We could use token introspection (RFC 7662) for online verification,
but last I looked at it, no one had actually implemented those
endpoints. For offline verification, I think the best we could do
would be to provide a generic JWT Profile (RFC 9068) validator, but
again I don't know if anyone is actually providing those token formats
in practice. I'm inclined to push that out into the future.

Thanks,
--Jacob

[1] https://www.postgresql.org/message-id/ZjxQnOD1OoCkEeMN%40paquier.xyz




Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-11 Thread Jacob Champion
On Mon, Sep 9, 2024 at 10:30 PM Michael Paquier  wrote:
> No.  My question was about splitting pgstat_bestart() and
> pgstat_bestart_pre_auth() in a cleaner way, because authenticated
> connections finish by calling both, meaning that we do twice the same
> setup for backend entries depending on the authentication path taken.
> That seems like a waste.

I can try to separate them out. I'm a little wary of messing with the
CRITICAL_SECTION guarantees, though. I thought the idea was that you
filled in the entire struct to prevent tearing. (If I've misunderstood
that, please let me know :D)

> Perhaps just use a new
> "Authentication" class, as in "The server is waiting for an
> authentication operation to complete"?

Sounds good.

> Couldn't it be better to have a one-one mapping
> instead, adding twelve entries in wait_event_names.txt?

(I have no strong opinions on this myself, but while the debate is
ongoing, I'll work on a version of the patch with more detailed wait
events. It's easy to collapse them again if that gets the most votes.)

> I am not really on board with the test based on injection points
> proposed, though.  It checks that the "authenticating" flag is set in
> pg_stat_activity, but it does nothing else.  That seems limited.  Or
> are you planning for more?

I can test for specific contents of the entry, if you'd like. My
primary goal was to test that an entry shows up if that part of the
code hangs. I think a regression would otherwise go completely
unnoticed.

Thanks!
--Jacob




Re: libpq: Process buffered SSL read bytes to support records >8kB on async API

2024-09-10 Thread Jacob Champion
On Sun, Sep 8, 2024 at 1:08 PM Lars Kanis  wrote:
> I'm the maintainer of ruby-pg the ruby interface to the PostgreSQL
> database. This binding uses the asynchronous API of libpq by default to
> facilitate the ruby IO wait and scheduling mechanisms.
>
> This works well with the vanilla postgresql server, but it leads to
> starvation with other types of servers using the postgresql wire
> protocol 3. This is because the current functioning of the libpq async
> interface depends on a maximum size of SSL records of 8kB.

Thanks for the report! I wanted evidence that this wasn't a
ruby-pg-specific problem, so I set up a test case with
Python/psycopg2.

I was able to reproduce a hang when all of the following were true:
- psycopg2's async mode was enabled
- the client performs a PQconsumeInput/PQisBusy loop, waiting on
socket read events when the connection is busy (I used
psycopg2.extras.wait_select() for this)
- the server splits a large message over many large TLS records
- the server packs the final ReadyForQuery message into the same
record as the split message's final fragment
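
For reference, the (simplified) shape of the client loop that stalls is
roughly the following; wait_for_read_event() is a stand-in for whatever
readiness primitive the driver uses (select/poll/epoll), not a real libpq
call:

PQsendQuery(conn, "SELECT ...");

for (;;)
{
    if (!PQconsumeInput(conn))   /* pull in whatever the socket has */
        break;                   /* connection error */
    if (!PQisBusy(conn))
        break;                   /* a complete result is available */

    /*
     * If the remaining bytes are sitting in OpenSSL's record buffer rather
     * than in the kernel's socket buffer, this wait never wakes up.
     */
    wait_for_read_event(PQsocket(conn));
}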

Gory details of the packet sizes, if it's helpful:
- max TLS record size is 12k, because it made the math easier
- server sends DataRow of 32006 bytes, followed by DataRow of 806
bytes, followed by CommandComplete/ReadyForQuery
- so there are three TLS records on the wire containing
  1) DataRow 1 fragment 1 (12k bytes)
  2) DataRow 1 fragment 2 (12k bytes)
  3) DataRow 1 fragment 3 (7430 bytes) + DataRow 2 (806 bytes)
   + CommandComplete + ReadyForQuery

> To fix this issue the attached patch calls pqReadData() repeatedly in
> PQconsumeInput() until there is no buffered SSL data left to be read.
> Another solution could be to process buffered SSL read bytes in
> PQisBusy() instead of PQconsumeInput() .

I agree that PQconsumeInput() needs to ensure that the transport
buffers are all drained. But I'm not sure this is a complete solution;
doesn't GSS have the same problem? And are there any other sites that
need to make the same guarantee before returning?

I need to switch away from this for a bit. Would you mind adding this
to the next Commitfest as a placeholder?

https://commitfest.postgresql.org/50/

Thanks,
--Jacob




Re: [PATCH] Fix small overread during SASLprep

2024-09-10 Thread Jacob Champion
On Tue, Sep 10, 2024 at 4:39 AM Daniel Gustafsson  wrote:
> Pushed, thanks!

Thank you!

--Jacob




Re: [PATCH] Fix small overread during SASLprep

2024-09-09 Thread Jacob Champion
On Mon, Sep 9, 2024 at 11:30 AM Daniel Gustafsson  wrote:
> Just to make sure I understand, this is for guarding against overreads in
> validation of strings containing torn MB characters?

Right. Our SASLprep code doesn't require/enforce UTF8-encoded inputs.

> Assuming I didn't
> misunderstand you this patch seems correct to me.

Thanks for the review!

--Jacob




[PATCH] Fix small overread during SASLprep

2024-09-09 Thread Jacob Champion
Hi all,

pg_utf8_string_len() doesn't check the remaining string length before
calling pg_utf8_is_legal(), so there's a possibility of jumping a
couple of bytes past the end of the string. (The overread stops there,
because the function won't validate a sequence containing a null
byte.)

Here's a quick patch to fix it. I didn't see any other uses of
pg_utf8_is_legal() with missing length checks.
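
For anyone reading along without opening the attachment, the idea is roughly
the following (a sketch of the approach, not necessarily line-for-line what
the patch does): track the remaining length and bail out before
pg_utf8_is_legal() can look past the terminating NUL.

static int
pg_utf8_string_len(const char *source)
{
    const unsigned char *p = (const unsigned char *) source;
    size_t      remaining = strlen(source);
    int         len = 0;

    while (*p)
    {
        int     mblen = pg_utf8_mblen(p);

        /* don't let a torn sequence at the end of the string overread */
        if ((size_t) mblen > remaining || !pg_utf8_is_legal(p, mblen))
            return -1;

        p += mblen;
        remaining -= mblen;
        len++;
    }

    return len;
}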

Thanks,
--Jacob


pg_utf8_string_len-honor-null-terminators.patch
Description: Binary data


Re: Make query cancellation keys longer

2024-09-05 Thread Jacob Champion
On Thu, Sep 5, 2024 at 9:36 AM Jelte Fennema-Nio  wrote:
> Totally agreed that it would be good to update psql to use the new
> much more secure libpq function introduced in PG17[1]. This is not a
> trivial change though because it requires refactoring the way we
> handle signals (which is why I didn't do it as part of introducing
> these new APIs).

Yeah, I figured it wasn't a quick fix.

> I had hoped that the work in [2] would either do that
> or at least make it a lot easier, but that thread seems to have
> stalled. So +1 for doing this, but I think it's a totally separate
> change and so should be discussed on a separate thread.

As long as the new thread doesn't also stall out/get forgotten, I'm happy.

> It sounds to me like we should at least use OpenSSL's CRYPTO_memcmp if
> we linked against it and the OS doesn't provide a timingsafe_bcmp.
> Would that remove your concerns?

If we go that direction, I'd still like to know which platforms we
expect to have a suboptimal port, if for no other reason than
documenting that those users should try to get OpenSSL into their
builds. (I agree that they probably will already, if they care.)

And if I'm being really picky, I'm not sure we should call our port
"timingsafe_bcmp" (vs. pg_secure_bcmp or something) if we know it's
not actually timing-safe for some. But I won't die on that hill.

Thanks,
--Jacob




Re: Make query cancellation keys longer

2024-09-05 Thread Jacob Champion
On Thu, Sep 5, 2024 at 9:21 AM Tom Lane  wrote:
> Wasn't this already addressed in v17, by
>
> Author: Alvaro Herrera 
> 2024-03-12 [61461a300] libpq: Add encrypted and non-blocking query cancellation
>
> ?  Perhaps we need to run around and make sure none of our standard
> clients use the old API anymore, but the libpq infrastructure is
> there already.

Right. From a quick grep, it looks like we have seven binaries using
the signal-based cancel handler.

(For programs that only send a cancel request right before they break
the connection, it's probably not worth a huge amount of effort to
change it right away, but for psql in particular I think the status
quo is a little weird.)
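
For reference, the blocking flavor of the PG17 API looks roughly like this
(error handling trimmed; psql's signal-handler integration is the part that
actually takes work):

PGcancelConn *cancelConn = PQcancelCreate(conn);

if (!PQcancelBlocking(cancelConn))
    fprintf(stderr, "cancel failed: %s\n", PQcancelErrorMessage(cancelConn));
PQcancelFinish(cancelConn);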

Thanks,
--Jacob




Re: Make query cancellation keys longer

2024-09-05 Thread Jacob Champion
On Thu, Aug 15, 2024 at 10:14 AM Heikki Linnakangas  wrote:
> I'm back to working on the main patch here, to make cancellation keys
> longer. New rebased version attached, with all the FIXMEs and TODOs from
> the earlier version fixed. There was a lot of bitrot, too.

I have a couple of questions/comments separate from the protocol changes:

Has there been any work/discussion around not sending the cancel key
in plaintext from psql? It's not a prerequisite or anything (the
longer length is a clear improvement either way), but it seems odd
that this longer "secret" is still just going to be exposed on the
wire when you press Ctrl+C.

> The first patch now introduces timingsafe_bcmp(), a function borrowed
> from OpenBSD to perform a constant-time comparison. There's a configure
> check to use the function from the OS if it's available, and includes a
> copy of OpenBSD's implementation otherwise. Similar functions exist with
> different names in OpenSSL (CRYPTO_memcmp) and NetBSD
> (consttime_memequal), but it's a pretty simple function so I don't think
> we need to work too hard to pick up those other native implementations.

One advantage to using other implementations is that _they're_ on the
hook for keeping constant-time guarantees, which is getting trickier
due to weird architectural optimizations [1]. CRYPTO_memcmp() has
almost the same implementation as 0001 here, except they made the
decision to mark the pointers volatile, and they also provide
hand-crafted assembly versions. This patch has OpenBSD's version, but
they've also turned on data-independent timing by default across their
ARM64 processors [2]. And Intel may require the same tweak, but it
doesn't look like userspace has access to that setting yet, and the
kernel thread [3] appears to have just withered...
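
For readers who haven't seen it, the OpenBSD implementation being borrowed is
tiny -- roughly this, quoted from memory: it accumulates differences instead
of returning early, so the runtime doesn't depend on where the inputs first
differ.

int
timingsafe_bcmp(const void *b1, const void *b2, size_t n)
{
    const unsigned char *p1 = b1, *p2 = b2;
    int ret = 0;

    for (; n > 0; n--)
        ret |= *p1++ ^ *p2++;
    return (ret != 0);
}

The appeal of the OR-accumulate loop is that there's no data-dependent
branch; the concern above is that compilers and CPUs can still reintroduce
data-dependent timing underneath it.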

For the cancel key implementation in particular, I agree with you that
it's probably not a serious problem. But if other security code starts
using timingsafe_bcmp() then it might be something to be concerned
about. Are there any platform/architecture combos that don't provide a
native timingsafe_bcmp() *and* need a DIT bit for safety?

Thanks,
--Jacob

[1] https://github.com/golang/go/issues/66450
[2] https://github.com/openbsd/src/commit/cf1440f11c
[3] https://lore.kernel.org/lkml/851920c5-31c9-ddd9-3e2d-57d379aa0...@intel.com/




Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-09-03 Thread Jacob Champion
On Sun, Sep 1, 2024 at 5:10 PM Michael Paquier  wrote:
> On Fri, Aug 30, 2024 at 04:10:32PM -0400, Andrew Dunstan wrote:
> > Patch 0001 looks sane to me.
> So does 0002 to me.

Thanks both!

> I'm not much a fan of the addition of
> pgstat_bestart_pre_auth(), which is just a shortcut to set a different
> state in the backend entry to tell that it is authenticating.  Is
> authenticating the term for this state of the process startups,
> actually?  Could it be more transparent to use a "startup" or
> "starting"" state instead

Yeah, I think I should rename that. Especially if we adopt new wait
states to make it obvious where we're stuck.

"startup", "starting", "initializing", "connecting"...?

> that gets also used by pgstat_bestart() in
> the case of the patch where !pre_auth?

To clarify, do you want me to just add the new boolean directly to
pgstat_bestart()'s parameter list?

> The addition of the new wait event states in 0004 is a good idea,
> indeed,

Thanks! Any thoughts on the two open questions for it?:
1) Should we add a new wait event class rather than reusing IPC?
2) Is the level at which I've inserted calls to
pgstat_report_wait_start()/_end() sane and maintainable?
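
(For context, question 2 is about wrapping the external calls in auth.c like
the sketch below; WAIT_EVENT_LDAP_BIND is a made-up name standing in for
whatever we end up defining, and the surrounding variables are from the
existing LDAP code path.)

pgstat_report_wait_start(WAIT_EVENT_LDAP_BIND);
r = ldap_simple_bind_s(ldap, fulluser, passwd);
pgstat_report_wait_end();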

> and these can be seen in pg_stat_activity once we get out of
> PGSTAT_END_WRITE_ACTIVITY() (err.. Right?).

It doesn't look like pgstat_report_wait_start() uses that machinery.

--Jacob




Re: Proposal for implementing OCSP Stapling in PostgreSQL

2024-09-03 Thread Jacob Champion
On Mon, Sep 2, 2024 at 5:55 AM Daniel Gustafsson  wrote:
> I guess they prefer that orgs transition back to just using CRL's.

From a practical perspective, I don't think anyone but browsers can do
that right now. Best I can tell, there's no CRLite client other than
Firefox, and Google's CRLSets look like a manual emergency system
rather than a general-purpose tool.

I don't think we could do it manually even if we wanted to (and we
don't want to, IMHO, for a whole host of reasons). As one specific
example, take the certificate for postgresql.org. There's no CRL
distribution point listed, and an LE blog post [1] from a couple years
back implies that they have no plans to make those available to us:

Although we will be producing CRLs which cover all certificates that we
issue, we will not be including those URLs in the CRL Distribution Point
extension of our certificates. [...] Our new CRL URLs will be disclosed
only in CCADB, so that the Apple and Mozilla root programs can consume
them without exposing them to potentially large download traffic from
the rest of the internet at large.

Frankly, it looks like they have no plan for non-browser clients. It's
feeling like one of those "Web" vs. "Internet" splits.

--Jacob

[1] https://letsencrypt.org/2022/09/07/new-life-for-crls.html




Re: PG_TEST_EXTRA and meson

2024-08-30 Thread Jacob Champion
On Wed, Aug 28, 2024 at 8:21 AM Nazir Bilal Yavuz  wrote:
> I do not exactly remember the reason but I think I copied the same
> behavior as before, PG_TEST_EXTRA variable was checked in the
> src/test/Makefile so I exported it there.

Okay, give v3 a try then. This exports directly from Makefile.global.
Since that gets pulled into a bunch of places, the scope is a lot
wider than it used to be; I've disabled it for PGXS so it doesn't end
up poisoning other extensions.

--Jacob


v3-make_test_extra.diff
Description: Binary data


Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-08-29 Thread Jacob Champion
On Sun, Jun 30, 2024 at 10:48 AM Noah Misch  wrote:
> > Would anyone like me to be more aggressive, and create a pgstat entry
> > as soon as we have the opening transaction? Or... as soon as a
> > connection is made?
>
> All else being equal, I'd like backends to have one before taking any lmgr
> lock or snapshot.

v3-0003 pushes the pgstat creation as far back as I felt comfortable,
right after the PGPROC registration by InitProcessPhase2(). That
function does lock the ProcArray, but if it gets held forever due to
some bug, you won't be able to use pg_stat_activity to debug it
anyway. And with this ordering, pg_stat_get_activity() will be able to
retrieve the proc entry by PID without a race.

This approach ends up registering an early entry for more cases than
the original patchset. For example, autovacuum and other background
workers will now briefly get their own "authenticating" state, which
seems like it could potentially confuse people. Should I rename the
state, or am I overthinking it?

> You could release the xmin before calling PAM or LDAP.  If you've copied all
> relevant catalog content to local memory, that's fine to do.

I played with the xmin problem a little bit, but I've shelved it for
now. There's probably a way to do that safely; I just don't understand
enough about the invariants to do it. For example, there's a comment
later on that says

 * We established a catalog snapshot while reading pg_authid and/or
 * pg_database;

and I'm a little nervous about invalidating the snapshot halfway
through that process. Even if PAM and LDAP don't rely on pg_authid or
other shared catalogs today, shouldn't they be allowed to in the
future, without being coupled to InitPostgres implementation order?
And I don't think we can move the pg_database checks before
authentication.

As for the other patches, I'll ping Andrew about 0001, and 0004
remains in its original WIP state. Anyone excited about that wait
event idea?

Thanks!
--Jacob


v3-0002-Test-Cluster-let-background_psql-work-asynchronou.patch
Description: Binary data


v3-0001-BackgroundPsql-handle-empty-query-results.patch
Description: Binary data


v3-0003-pgstat-report-in-earlier-with-STATE_AUTHENTICATIN.patch
Description: Binary data


v3-0004-WIP-report-external-auth-calls-as-wait-events.patch
Description: Binary data


Re: PG_TEST_EXTRA and meson

2024-08-28 Thread Jacob Champion
On Wed, Aug 28, 2024 at 6:41 AM Ashutosh Bapat  wrote:
> Nazir, since you authored c3382a3c3cc, can you please provide input
> that Jacob needs?

Specifically, why the PG_TEST_EXTRA variable was being exported at the
src/test level only. If there's no longer a use case being targeted,
we can always change it in this patch, but I didn't want to do that
without understanding why it was like that to begin with.

--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-08-26 Thread Jacob Champion
On Mon, Aug 26, 2024 at 1:18 AM Peter Eisentraut  wrote:
> Or, if people find that too scary, something like
>
> #ifdef JSONAPI_USE_PQEXPBUFFER
>
> #define jsonapi_StringInfo PQExpBuffer
> [...]
>
> That way, it's at least more easy to follow the source code because
> you see a mostly-familiar API.

I was having trouble reasoning about the palloc-that-isn't-palloc code
during the first few drafts, so I will try a round with the jsonapi_
prefix.

> Also, we should make this PQExpBuffer-using mode only used by libpq,
> not by frontend programs.  So libpq takes its own copy of jsonapi.c
> and compiles it using JSONAPI_USE_PQEXPBUFFER.  That will make the
> libpq build descriptions a bit more complicated, but everyone who is
> not libpq doesn't need to change.

Sounds reasonable. It complicates the test coverage situation a little
bit, but I think my current patch was maybe insufficient there anyway,
since the coverage for the backend flavor silently dropped...

> Or maybe there is a clever way to avoid even that: Create a
> fixed JsonLexContext like
>
>  static const JsonLexContext failed_oom;
>
> and on OOM you return that one from makeJsonLexContext*().  And then
> in pg_parse_json(), when you get handed that context, you return
> JSON_OUT_OF_MEMORY immediately.

I like this idea.

Thanks!
--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-23 Thread Jacob Champion
On Fri, Aug 23, 2024 at 7:42 AM Jelte Fennema-Nio  wrote:
>
> On Fri, 23 Aug 2024 at 16:02, Robert Haas  wrote:
> > Personally, I'm 100% convinced at this point that we're arguing about
> > the wrong problem. Before, I didn't know for sure whether anyone would
> > be mad if we redefined PQprotocolVersion(), but now I know that there
> > is at least one person who will be, and that's Jacob.
>
> I could be interpreting Jacob his response incorrectly, but my
> understanding is that the type of protocol changes we would actually
> do in this version bump, would determine if he's either mad or happy
> that we redefined PQprotocolVersion.

Yes, but my conclusion is pretty much the same: let's talk about the
protocol changes. If we get to the end and revert the new API because
it's no longer adding anything -- e.g. because we've decided that
minor versions no longer have any compatibility guarantees at all --
so be it.

> > If there's one
> > among regular -hackers posters, there are probably more. Since Jelte
> > doesn't seem to want to produce the patch to add
> > PQminorProtocolVersion(), I suggest that somebody else does that --
> > Jacob, do you want to? -- and we commit that and move on.
>
> Let's call it PQfullProtocolVersion and make it return 30002. I'm fine
> with updating the patch. But I'll be unavailable for the next ~3
> weeks.

Agreed on the name. I've attached a reconfigured version of v15-0003,
with an extension that should hopefully not throw off the cfbot, and a
commit message that should hopefully not misrepresent the discussion
so far?

Thanks,
--Jacob
From d4e4ccd2d08e2a6159521fa31d929b796c56383d Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Fri, 23 Aug 2024 11:48:55 -0700
Subject: [PATCH] libpq: surface the full protocol version to clients

PQprotocolVersion() does not include the minor version of the protocol.
In preparation for proposals that bump that number for the first time,
add PQfullProtocolVersion() to provide it to clients that may care,
using the (major * 10000 + minor) construction already in use by
PQserverVersion().

The core of this patch and its documentation is based on work by Jelte
Fennema-Nio, but that doesn't mean he endorses how I've used it here.
There is still significant disagreement on the semantics of major vs
minor versions when it comes to the wire protocol.
---
 doc/src/sgml/libpq.sgml   | 38 ---
 src/include/libpq/pqcomm.h|  1 +
 src/interfaces/libpq/fe-connect.c | 10 
 src/interfaces/libpq/libpq-fe.h   |  5 
 4 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/libpq.sgml b/doc/src/sgml/libpq.sgml
index f916fce414..eb8c232869 100644
--- a/doc/src/sgml/libpq.sgml
+++ b/doc/src/sgml/libpq.sgml
@@ -2678,22 +2678,44 @@ const char *PQparameterStatus(const PGconn *conn, const char *paramName);
  
 
 
+
+ 
PQfullProtocolVersionPQfullProtocolVersion
+
+ 
+  
+   Interrogates the frontend/backend protocol being used.
+
+int PQfullProtocolVersion(const PGconn *conn);
+
+   Applications might wish to use this function to determine whether certain
+   features are supported. The result is formed by multiplying the server's
+   major version number by 10000 and adding the minor version number. For
+   example, version 3.2 would be returned as 30002, and version 4.0 would
+   be returned as 40000. Zero is returned if the connection is bad. The 3.0
+   protocol is supported by PostgreSQL server
+   versions 7.4 and above.
+  
+  
+   The protocol version will not change after connection startup is
+   complete, but it could theoretically change during a connection reset.
+  
+ 
+
+
 
  
PQprotocolVersionPQprotocolVersion
 
  
   
-   Interrogates the frontend/backend protocol being used.
+   Interrogates the frontend/backend protocol major version.
 
 int PQprotocolVersion(const PGconn *conn);
 
-   Applications might wish to use this function to determine whether certain
-   features are supported.  Currently, the possible values are 3
-   (3.0 protocol), or zero (connection bad).  The protocol version will
-   not change after connection startup is complete, but it could
-   theoretically change during a connection reset.  The 3.0 protocol is
-   supported by PostgreSQL server versions 7.4
-   and above.
+   Unlike PQfullProtocolVersion, this returns only
+   the major protocol version in use, but it is supported by a wider range
+   of libpq releases back to version 7.4. Currently, the possible values are
+   3 (3.0 protocol), or zero (connection bad). Prior to release version
+   14.0, libpq could additionally return 2 (2.0 protocol).
   
  
 
diff --git a/

Re: PG_TEST_EXTRA and meson

2024-08-23 Thread Jacob Champion
On Fri, Aug 23, 2024 at 5:59 AM Ashutosh Bapat
 wrote:
> If I run
> export PG_TEST_EXTRA=xid_wraparound; ./configure --prefix=$BuildDir
> --enable-tap-tests && make -j4 && make -j4 install; unset
> PG_TEST_EXTRA
> followed by
> make -C $XID_MODULE_DIR check where
> XID_MODULE_DIR=src/test/modules/xid_wraparound - it skips the tests.

Right.

> I thought this was working before.

No, that goes back to my note in [1] -- as of c3382a3c3cc, the
variable was exported at only the src/test level, and I wanted to get
input on that so we could decide on the next approach if needed.

> Anyway, now I have written a script to test all the scenarios. You may
> want to test your patch using the script. It just needs PGDir to set
> to root directory of code.

Thanks! I see failures in 110, 120, and 130, as expected. Note that
constructions like

PG_TEST_EXTRA="" cd $XID_MODULE_DIR && make check && cd $PGDir

will not override the environment variable for the make invocation,
just for the `cd`. Also, rather than

export PG_TEST_EXTRA; ./configure ...; unset PG_TEST_EXTRA

it's probably easier to just pass PG_TEST_EXTRA= as a command
line option to configure.
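
For example, with the configure-time variable in place (adjust the
value to whichever suites your script cares about):

    ./configure --prefix=$BuildDir --enable-tap-tests PG_TEST_EXTRA=xid_wraparound

and then there's nothing to export or unset in the surrounding shell.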

> If there's some other way to setting PG_TEST_EXTRA when running
> configure, I think it needs to be documented.

./configure --help shows the new variable, with the same wording as
Meson. Or do you mean that it's significant enough to deserve a spot
in installation.sgml?

--Jacob

[1] 
https://www.postgresql.org/message-id/CAOYmi%2B%3D8HVgxANzFT_BZrAeDPxAgA5_kbHy-4VowdbGr0chHvQ%40mail.gmail.com




Re: Feature-test macros for new-in-v17 libpq features

2024-08-22 Thread Jacob Champion
On Thu, Aug 22, 2024 at 10:16 AM Tom Lane  wrote:
>
> In connection with that last point, I wonder if we should include
> commentary about when things came in.  I'd originally thought of
> just inserting the above names in alphabetical order, but now I
> wonder if the patch ought to look more like
>
>   */
> +/* Features added in PostgreSQL v14: */
>  /* Indicates presence of PQenterPipelineMode and friends */
>  #define LIBPQ_HAS_PIPELINING 1
>  /* Indicates presence of PQsetTraceFlags; also new PQtrace output format */
>  #define LIBPQ_HAS_TRACE_FLAGS 1
> +/* Features added in PostgreSQL v15: */
>  /* Indicates that PQsslAttribute(NULL, "library") is useful */
>  #define LIBPQ_HAS_SSL_LIBRARY_DETECTION 1
> +/* Features added in PostgreSQL v17: */
> + ... as above ...

+1, I like the new headers and keeping the version order.

Thanks!
--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-21 Thread Jacob Champion
On Tue, Aug 20, 2024 at 3:17 PM Jelte Fennema-Nio  wrote:
>
> On Tue, 20 Aug 2024 at 23:48, Jacob Champion
>  wrote:
> > That guarantee (if adopted) would also make it possible for my use
> > case to proceed correctly, since a libpq client can still speak 3.0
> > packets on the socket safely.
>
> Not necessarily (at least not how I defined it). If a protocol
> parameter has been configured (possibly done by default by libpq),
> then that might not be the case anymore. So, you'd also need to
> compare the current values of the protocol parameters to their
> expected value.

With your definition, I agree. But I was about to sneakily suggest
(muahaha) that if you want to go that route, maybe protocol extensions
need to provide their own forward compatibility statements. Whether
via the same mechanism, or with something like criticality.

> > But in that case, PQprotocolVersion
> > should keep returning 3, because there's an explicit reason to care
> > about the major version by itself.
>
> I agree that there's a reason to care about the major version then,
> but it might still be better to fail hard for people that care about
> protocol details.

Maybe? In the span of a couple of days we've gone from "minor versions
are actually major versions and we will break all intermediaries all
the time" to "maybe not, actually". It's difficult for me to reason
through things that quickly. And on some level, that's fine and
expected, if we're still at the debate-and-design stage.

But personally I'd hoped that most of the conversation around
something this disruptive would be about what's going to break and
what's not, with the goal of making the broken set as small as
possible in exchange for specific benefits. Instead it seems like use
cases are having to justify themselves to avoid being broken, which is
not really the stance I want to see from a protocol maintainer.
Especially not if your stated goal is to bump versions whenever we
want (which, just for the record, I do not agree with).

Put another way: we've seen that our protocol-version joint has rusted
[1, 2]. I agree that needs to be fixed. But I also believe that we
shouldn't try to smash the joint open with a hammer, and that belief
seems philosophically at odds with the approach being taken upthread.
So if I'm the only one who feels this way, please someone let me know
so I can bow out instead of throwing up roadblocks... I don't want to
be a distraction from incremental protocol improvements.

--Jacob

[1] https://en.wikipedia.org/wiki/Protocol_ossification
[2] https://www.imperialviolet.org/2016/05/16/agility.html




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-20 Thread Jacob Champion
On Tue, Aug 20, 2024 at 12:55 PM Jelte Fennema-Nio  wrote:
> Another way of describing this guarantee: If a client connects using
> 3.8 and configures no protocol parameters, the client needs to handle
> anything 3.8 specific that the handshake requires (such as longer
> cancel token). But then after that handshake it can choose to send
> only 3.0 packets and expect to receive only 3.0 packets back.

That guarantee (if adopted) would also make it possible for my use
case to proceed correctly, since a libpq client can still speak 3.0
packets on the socket safely. But in that case, PQprotocolVersion
should keep returning 3, because there's an explicit reason to care
about the major version by itself.

--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-20 Thread Jacob Champion
On Tue, Aug 20, 2024 at 10:42 AM David G. Johnston
 wrote:
> Semantic versioning guidelines are not something we are following, especially 
> here.

I understand; the protocol is ours, and we'll do whatever we do in the
end. I'm claiming that we can choose to provide semantics, and if we
do, those semantics will help people who are not here on the list to
defend their use cases.

--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-20 Thread Jacob Champion
On Tue, Aug 20, 2024 at 8:24 AM Heikki Linnakangas  wrote:
> > Put another way: for a middlebox on the connection (which may be
> > passively observing, but also maybe actively adding new messages to
> > the stream), what is guaranteed to remain the same in the protocol
> > across a minor version bump? Hopefully the answer isn't "nothing"?
>
> I don't think we can give any future guarantees like that. If you have a
> middlebox on the connection, it needs to fully understand all the
> protocol versions it supports.

(GMail has catastrophically unthreaded this conversation for me, so
apologies if I start responding out of order)

Many protocols provide the list of assumptions that intermediates are
allowed to make within a single group of compatible versions, even as
the protocol gets extended. If we choose to provide those, then our
"major version" gains really useful semantics. See also the brief
"criticality" tangent upthread.

> That seems a bit tangential to the PQprotocolVersion() function though.
> A middlebox like that would probably not use libpq.

It's applicable to the use case I was talking about with Jelte. A
libpq client dropping down to the socket level is relying on
(implicit, currently undocumented/undecided, possibly incorrect!)
intermediary guarantees that the protocol provides for a major
version. I'm hoping we can provide some, since we haven't broken
anything yet. If we decide we can't, then so be it -- things will
break either way -- but it's still strange to me that we'd be okay
with literally zero forward compatibility and still call that a "minor
version".

--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-20 Thread Jacob Champion
On Tue, Aug 20, 2024 at 7:26 AM Jelte Fennema-Nio  wrote:
> In practical terms I think that means for a minor version bump the
> format of the StartupMessage cannot be changed. Changing anything else
> is fair game for a minor protocol version bump.

I may be in a tiny minority here, but when I combine that statement
with your opinion from way upthread that

> IMHO, we
> should get to a state where protocol minor version bumps are so
> low-risk that we can do them whenever we add message types

then I don't see this effort ending up in a healthy place or with a
happy ecosystem. Pick any IETF-managed protocol, add on the statement
"we get to change anything we want in a minor version, and we reserve
the right to do it every single year", and imagine the chaos for
anyone who doesn't have power over both servers and clients.

To me it seems that what you're proposing is indistinguishable from
what most other protocols would consider a major version bump; it's
just that you (reasonably) want existing clients to be able to
negotiate multiple major versions in one round trip.

--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-20 Thread Jacob Champion
On Mon, Aug 19, 2024 at 1:54 PM Jelte Fennema-Nio  wrote:
> My point is that the code that breaks, actually wants to be broken in this 
> case.

I'll turn this around then and assume for a moment that this is true:
no matter what the use cases are, they all want to be broken for
correctness. If this version change is allowed to break both the
endpoints and any intermediaries on the connection, why have we chosen
30001 as the new reported version as opposed to, say, 4?

Put another way: for a middlebox on the connection (which may be
passively observing, but also maybe actively adding new messages to
the stream), what is guaranteed to remain the same in the protocol
across a minor version bump? Hopefully the answer isn't "nothing"?

--Jacob




Re: Opinion poll: Sending an automated email to a thread when it gets added to the commitfest

2024-08-16 Thread Jacob Champion
On Fri, Aug 16, 2024 at 10:23 AM Maciek Sakrejda  wrote:
> > 1. A very short blurb like: "This thread was added to the commitfest
> > with ID 1234"
> > 2. A link to the commitfest entry
> > 3. A link to the cfbot CI runs
> > 4. A link to the diff on GitHub
> > 5. Any other suggestions?
>
> I would find (2) very useful.

Me too.

--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-16 Thread Jacob Champion
On Fri, Aug 16, 2024 at 12:05 AM Jelte Fennema-Nio  wrote:
>
> On Fri, 16 Aug 2024 at 00:39, Jacob Champion
>  wrote:
> >
> > On Thu, Aug 15, 2024 at 3:04 PM Heikki Linnakangas  wrote:
> > > Perhaps we should even change it to return
> > > 30000 for protocol version 3.0, and just leave a note in the docs like
> > > "in older versions of libpq, this returned 3 for protocol version 3.0".
> >
> > I think that would absolutely break current code. It's not uncommon
> > (IME) for hand-built clients wrapping libpq to make sure they're not
> > talking v2 before turning on some feature, and they're allowed to do
> > that with a PQprotocolVersion() == 3 check. A GitHub code search
> > brings up examples.
>
> Can you give a link for that code search and/or an example where
> someone used it like that in a real setting?

Keeping in mind that I'm responding to the change from 3 to 30000:

https://github.com/search?q=PQprotocolVersion&type=code

https://github.com/psycopg/psycopg2/blob/658afe4cd90d3e167d7c98d22824a8d6ec895b1c/tests/test_async.py#L89

Bindings re-export this symbol in ways that basically just expand the
web of things to talk about. And there's hazards like


https://github.com/infusion/PHP/blob/7ebefb6426bb4b4820a30cca5c3a10bfd757b6ea/ext/pgsql/pgsql.c#L864

where the old client is fine, but new clients could be tricked into
writing similar checks as `>= 30000` -- which is wrong because older
libpqs use 3, haha, surprise, have fun with that!

> The only example I could
> find where someone used it at all was psycopg having a unittest for
> their python wrapper around this API, and they indeed used == 3.

I've written code that uses exact equality as well, because I cared
about the wire protocol version. Even if I hadn't, isn't the first
public example enough, since a GitHub search is going to be an
undercount? What is your acceptable level of breakage?

People who are testing against this have some reason to care about the
underlying protocol compatibility. I don't need to understand (or even
agree with!) why they care; we've exported an API that allows them to
do something they find useful.

> The advantage is not introducing yet another API when we already have
> one with a great name

Sorry to move the goalposts, but when I say "value" I'm specifically
talking about value for clients/community, not value for patch writers
(or even maintainers). A change here doesn't add any value for
existing clients when compared to a new API, since they've already
written the version check code and are presumably happy with it. New
clients that have reason to care about the minor version, whatever
that happens to mean for a protocol, can use new code.

I'm not opposed to compatibility breaks when a client can see some
value in what we've done -- but the dev being broken should at least
be able to say something like "oh yeah, it was clearly broken before
and I'm glad it works now" or "wow, what a security hole, I'm glad
they patched it". That's not true here.

libpq is close to the base of a gigantic software stack and ecosystem.
We have an API, we have an SONAME, we have ways to introduce better
APIs while not breaking past clients. (And we can collect the list of
cruft to discard in a future SONAME bump, so we don't have to carry it
forever... but is the cost of this particularly large?)

> that no-one is currently using.

This is demonstrably false. You just cited the example you found in
psycopg, and all those bindings on GitHub have clients of their own,
not all of which are going to be easily searchable.

> The current API
> is in practice just a very convoluted way of writing 3.

There are versions of libpq still in support that can return 2, and
clients above us have to write code against the whole spectrum.

> Also doing an
> == 3 check is obviously problematic, if people use this function they
> should be using > 3 to be compatible with future versions.

Depends on why they're checking. I regularly write test clients that
drop down beneath the libpq layer. I don't want to be silently
upgraded.

I think I remember some production Go-based libpq clients several
years ago that did similar things, dropping down to the packet level
to do some sort of magic, but I can't remember exactly what now.
That's terrifying in the abstract, but it's not my decision or code to
maintain. The community is allowed to do things like that; we publish
a protocol definition in addition to an API that exposes the socket.

> So if we
> ever introduce protocol version 4, then these (afaict theoretical)
> users would break anyway.

Yes -- not theoretical, since I am one of them! -- that's the point.
Since we've already demonstrated that protocol details can leak up
above the API for the 2->3 change, a dev with reason to be paranoid
(such as myself) can write a canary for the 3->4 change. "Protocol 4.0
not yet supported" can be a feature.

--Jacob




Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-08-15 Thread Jacob Champion
On Thu, Aug 15, 2024 at 3:04 PM Heikki Linnakangas  wrote:
> Perhaps we should even change it to return
> 30000 for protocol version 3.0, and just leave a note in the docs like
> "in older versions of libpq, this returned 3 for protocol version 3.0".

I think that would absolutely break current code. It's not uncommon
(IME) for hand-built clients wrapping libpq to make sure they're not
talking v2 before turning on some feature, and they're allowed to do
that with a PQprotocolVersion() == 3 check. A GitHub code search
brings up examples.

As for 30001: I don't see the value in modifying an exported API in
this way. Especially since we can add a new entry point that will be
guaranteed not to break anyone, as Robert suggested. I think it's a
POLA violation at minimum; my understanding was that up until this
point, the value was incremented during major (incompatible) version
bumps. And I think other users will have had the same understanding.

--Jacob




Re: Proposal for implementing OCSP Stapling in PostgreSQL

2024-08-14 Thread Jacob Champion
On Wed, Aug 7, 2024 at 12:20 AM Daniel Gustafsson  wrote:
>
> While I have only skimmed the patch so far and need more review before I can
> comment on it, I do have a question on the expected use of OCSP support in
> postgres.  With OCSP becoming optional [0], and big providers like Let's
> Encrypt deprecating OCSP [1], is this mainly targeting organizations running
> their own CA with in-house OCSP?

That announcement took me by surprise (and, it looks like, a number of
other people [1, 2]). I get that OCSP is expensive and painful for
Let's Encrypt, based on previous outages and blog posts, but I also
figured that Must-Staple was basically the best you could do without
being a browser. It already seemed pretty clear that we shouldn't
build a client-side OCSP check. Throwing server-side stapling under
the bus with it was unexpected.

Some of the LE quotes on the matter are giving me cart-before-horse vibes:

> But it is clear to me OCSP is an ineffective technical dead-end, and we are 
> all better served by moving on to figure out what else we can do.
>
> We may keep OCSP running for some time for certificates that have the 
> must-staple extension, to help smooth the transition, but at this time we 
> don’t have a plan for how to actually deprecate OCSP: just an intent, 
> publicized to ensure we can all begin to plan for a future without it.

It's pretty frustrating to hear about a "transition" when there is
nothing to transition to.

I honestly wonder if they're going to end up walking some of this
back. The messaging reminds me of "that one project" that every
company seems to have, where it's expensive and buggy as heck, all the
maintainers want to see it deleted, and they unilaterally declare over
clients' objections that they will, only to find at the last second
that the cure is worse than the disease and then finally resign
themselves to supporting it. Tears are shed, bridges burned.

Anyways, I look forward to seeing how broken my crystal ball is this
time. The timing is awful for this patchset in particular.

--Jacob

[1] 
https://community.letsencrypt.org/t/sunsetting-of-ocsp-in-favor-of-older-technology/222589
[2] https://community.letsencrypt.org/t/what-will-happen-to-must-staple/222397




Re: PG_TEST_EXTRA and meson

2024-08-14 Thread Jacob Champion
On Tue, Aug 13, 2024 at 9:07 PM Ashutosh Bapat
 wrote:
> > I'm not entirely sure what you mean? src/test should work fine,
> > anything lower than that (say src/test/ssl) does not.
>
> I could run them from src/test/modules/xid_wraparound/. That's desirable.

On my machine, storing xid_wraparound into PG_TEST_EXTRA at configure
time and running `make check` from the modules/xid_wraparound
directory causes them to be skipped. Setting the environment variable
will continue to work at that directory level, though; my patch
shouldn't change that.

> What is working now should continue to work even after this change.
> PG_TEST_EXTRA="xyz" make check works right now.

Fair point, see attached.

Thanks,
--Jacob


v2-make_test_extra.diff
Description: Binary data


Re: PG_TEST_EXTRA and meson

2024-08-13 Thread Jacob Champion
On Fri, Aug 9, 2024 at 2:26 AM Ashutosh Bapat
 wrote:
> Here are my observations with the patch applied
> 1. If I run configure without setting PG_TEST_EXTRA, it won't run the
> tests that require PG_TEST_EXTRA to be set. This is expected.
> 2. But it wont' run tests even if PG_TEST_EXTRA is set while running
> make check.- that's unexpected

(see below)

> 3. If I run configure with PG_TEST_EXTRA set and run 'make check' in
> the test directory, it runs those tests. That's expected from the
> final patch but that doesn't seem to be what you described above.

I'm not entirely sure what you mean? src/test should work fine,
anything lower than that (say src/test/ssl) does not.

> 3. After 3, if I run `PG_TEST_EXTRA="something" make check`, it still
> runs those tests. So it looks like PG_TEST_EXTRA doesn't get
> overridden. If PG_TEST_EXTRA is set to something other than what was
> configured, it doesn't take effect when running the corresponding
> tests. E.g. PG_TEST_EXTRA is set to xid_wraparound at configure time,
> but `PG_TEST_EXTRA=wal_consistency_check make check ` is run, the
> tests won't use wal_consistency_check=all. - that's not expected.

I think you're running into the GNU Make override order [1]. For
instance when I want to override PG_TEST_EXTRA, I write

make check PG_TEST_EXTRA=whatever

If you want the environment variable to work by default instead, you can do

PG_TEST_EXTRA=whatever make check -e

If you don't want devs to have to worry about the difference, I think
we can change the assignment operator to `?=` in Makefile.global.in.
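
Something like this, untested, and assuming the configure-time value
gets substituted in as @PG_TEST_EXTRA@:

    # let the environment (or make command line) override the
    # default recorded at configure time
    PG_TEST_EXTRA ?= @PG_TEST_EXTRA@

so the baked-in value only applies when nothing else has set it.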

Thanks,
--Jacob

[1] https://www.gnu.org/software/make/manual/html_node/Environment.html




Re: BlastRADIUS mitigation

2024-08-13 Thread Jacob Champion
On Wed, Aug 7, 2024 at 5:55 AM Heikki Linnakangas  wrote:
> On 06/08/2024 03:58, Thomas Munro wrote:
> > On Tue, Aug 6, 2024 at 2:41 AM Heikki Linnakangas  wrote:
> >> What if the message contains multiple attribute of the same type? If
> >> there's a duplicate Message-Authenticator, we should surely reject the
> >> packet. I don't know if duplicate attributes are legal in general.
> >
> > Duplicate attributes are legal in general per RFC 2865, which has a
> > table of attributes and their possible quantity; unfortunately this
> > one is an extension from RFC 2869, and I didn't find where it pins
> > that down.

There's a similar table near the end of 2869:

https://datatracker.ietf.org/doc/html/rfc2869#section-5.19

> > I suppose if we wanted to be extra fastidious, we could also test with
> > a gallery of malformed packets crafted by a Perl script, but that
> > feels like overkill.  On the other hand it would be bad if you could
> > crash a server by lobbing UDP packets at it because of some dumb
> > thinko.
>
> This would also be a easy target for fuzz testing. I'm not too worried
> though, the packet format is pretty simple. Still, bugs happen. (Not a
> requirement for this patch in any case)



I've been working on fuzzing JSON, and I spent some time adapting that
to test this RADIUS code. No pressure to use any of it (the
refactoring to pull out the response validation is cowboy-quality at
best), but it might help anyone who wants to pursue it in the future?

This fuzzer hasn't been able to find anything in the response parser.
(But the modifications I made make that claim a little weaker, since
I'm not testing what's actually shipping.) The attached response
corpus could be used to seed a malformed-packet gallery like Thomas
mentioned; it's built with the assumption that the request packet was
all zeroes and the shared secret is `secret`.

I was impressed with how quickly LLVM was able to find the packet
shape, including valid signatures. The only thing I had to eventually
add manually was a valid Message-Authenticator case; I assume the
interdependency between the authenticator and the packet checksum was
a little too much for libfuzzer to figure out on its own.



--Jacob
From f04087abbe3381ee3a546750a217b980143441d9 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Thu, 8 Aug 2024 11:12:05 -0700
Subject: [PATCH] WIP: try to fuzz RADIUS implementation

- extract validate_radius_response()
- add LLVM fuzzer implementation
- suppress logging by default, let --log_min_messages override it
- support const response buffer (and improve fuzzability) by computing
  Message-Authenticator piecewise
---
 meson-clang.ini|   4 +
 src/backend/libpq/auth.c   | 336 +
 src/backend/meson.build|  18 ++
 src/common/meson.build |  16 ++
 src/include/libpq/auth.h   |  21 ++
 src/test/modules/fuzzers/fuzz_radius.c |  74 ++
 src/test/modules/fuzzers/meson.build   |  24 ++
 src/test/modules/meson.build   |   1 +
 8 files changed, 330 insertions(+), 164 deletions(-)
 create mode 100644 meson-clang.ini
 create mode 100644 src/test/modules/fuzzers/fuzz_radius.c
 create mode 100644 src/test/modules/fuzzers/meson.build

diff --git a/meson-clang.ini b/meson-clang.ini
new file mode 100644
index 00..560cffe8e7
--- /dev/null
+++ b/meson-clang.ini
@@ -0,0 +1,4 @@
+[binaries]
+c = 'clang-15'
+cpp = 'clang++-15'
+llvm-config = 'llvm-config-15'
diff --git a/src/backend/libpq/auth.c b/src/backend/libpq/auth.c
index 40f0cabf3a..d97fd6e3f2 100644
--- a/src/backend/libpq/auth.c
+++ b/src/backend/libpq/auth.c
@@ -2769,13 +2769,9 @@ CheckCertAuth(Port *port)
  * RADIUS authentication is described in RFC2865 (and several others).
  */
 
-#define RADIUS_VECTOR_LENGTH 16
 #define RADIUS_HEADER_LENGTH 20
 #define RADIUS_MAX_PASSWORD_LENGTH 128
 
-/* Maximum size of a RADIUS packet we will create or accept */
-#define RADIUS_BUFFER_SIZE 1024
-
 typedef struct
 {
uint8   attribute;
@@ -2783,16 +2779,6 @@ typedef struct
uint8   data[FLEXIBLE_ARRAY_MEMBER];
 } radius_attribute;
 
-typedef struct
-{
-   uint8   code;
-   uint8   id;
-   uint16  length;
-   uint8   vector[RADIUS_VECTOR_LENGTH];
-   /* this is a bit longer than strictly necessary: */
-   charpad[RADIUS_BUFFER_SIZE - RADIUS_VECTOR_LENGTH];
-} radius_packet;
-
 /* RADIUS packet types */
 #define RADIUS_ACCESS_REQUEST  1
 #define RADIUS_ACCESS_ACCEPT   2
@@ -2993,11 +2979,9 @@ PerformRadiusTransaction(const char *server, const char 
*secret, const char *por
uint8   message_authenticator_key[RADIUS_VECTOR_LENGTH];
uint8   message_authenticator[

Re: can we mark upper/lower/textlike functions leakproof?

2024-08-02 Thread Jacob Champion
On Fri, Aug 2, 2024 at 10:40 AM Robert Haas  wrote:
> My belief is that nearly everything in unsafe. We ship with very
> little marked leakproof right now, and that might be too conservative,
> but probably not by much.

Then to me, that seems like the best-case scenario for a "maybe"
classification. I'm still not sold on the idea of automatically
treating all "maybes" as leakproof (personally I'd prefer that users
surgically opt in), but if the pool is small...

Thanks,
--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-08-02 Thread Jacob Champion
On Fri, Aug 2, 2024 at 10:13 AM Peter Eisentraut  wrote:
> You shouldn't use pfree() interchangeably with free(), even if that is
> not enforced because it's the same thing underneath.  First, it just
> makes sense to keep the alloc and free pairs matched up.  And second, on
> Windows there is some additional restriction (vague knowledge) that the
> allocate and free functions must be in the same library, so mixing them
> freely might not even work.

Ah, I forgot about the CRT problems on Windows. So my statement of
"the linker might not garbage collect" is pretty much irrelevant.

But it sounds like we agree that we shouldn't be using fe_memutils at
all in shlib builds. (If you can't use palloc -- it calls exit -- then
you can't use pfree either.) Is 0002 still worth pursuing, once I've
correctly wordsmithed the commit? Or did I misunderstand your point?

Thanks!
--Jacob




Re: can we mark upper/lower/textlike functions leakproof?

2024-08-02 Thread Jacob Champion
On Fri, Aug 2, 2024 at 9:22 AM Robert Haas  wrote:
> I'll be honest: I don't like it, either. I don't even like
> proleakproof=true/false/maybe; I asked about that to understand if
> that was what Jacob was proposing, not because I actually think we
> should do it. The problem is that there's likely to be a fairly wide
> range contained inside of "maybe", with cases like "upper" at the
> safer end of the spectrum. That's too fuzzy to use as a basis for any
> sort of real security, IMHO; we won't be able to find two hackers who
> agree on how anything should be marked.

I guess I wasn't trying to propose that the grey area be used as the
basis for security, but that we establish a lower bound for the grey.
Make things strictly better than today, and cut down on the fear that
someone's going to accidentally mark something that we all agree
shouldn't be. And then shrink the grey area over time as we debate.

(Now, if there aren't that many cases where we can all agree on
"unsafe", then the proposal loses pretty much all value, because we'll
never shrink the uncertainty.)

> I think part of our problem here is that we have very few examples of
> how to actually analyze a function for leakproof-ness, or how to
> exploit one that is erroneously so marked. The conversations then tend
> to degenerate into some people saying things are scary and some people
> saying the scariness is overrated and then the whole thing just
> becomes untethered from reality. Maybe we need to create some really
> robust documentation in this area so that we can move toward a common
> conceptual framework, instead of everybody just having a lot of
> opinions.

+1

--Jacob




Re: can we mark upper/lower/textlike functions leakproof?

2024-08-02 Thread Jacob Champion
On Fri, Aug 2, 2024 at 9:13 AM Joe Conway  wrote:
>
> On 8/2/24 11:07, Tom Lane wrote:
> > Joe Conway  writes:
> >> 
> >> Hmmm, and then have "leakproof_mode" = strict/lax/off where 'strict' is
> >> current behavior, 'lax' allows the 'maybe's to get pushed down, and
> >> 'off' ignores the leakproof attribute entirely and pushes down anything
> >> that merits being pushed?
> >> 
> >
> > So in other words, we might as well just remove RLS.
>
> Perhaps deciding where to draw the line for 'maybe' is too hard, but I
> don't understand how you can say that in a general sense.

I'm more sympathetic to the "maybe" case, but for the "off" proposal I
find myself agreeing with Tom. If you want "off", turn off RLS.

> And even 'off'
> has utility for cases where (1) no logins are allowed except for the app
> (which is quite common in production environments) and no database
> errors are propagated to the end client (again pretty standard best
> practice);

I'm extremely skeptical of this statement, but I've been out of the
full-stack world for a bit. How does a modern best-practice
application hide the fact that it ran into an error and couldn't
complete a query?

> or (2) non-production environments, e.g. for testing or
> debugging; or

Changing the execution behavior between dev and prod seems like an
anti-goal. When would turning this off help you debug something?

> (3) use cases that utilize RLS as a notationally
> convenient filtering mechanism and are not bothered by some leakage in
> the worst case.

Catering to this use case doesn't seem like it should be a priority.
If a security feature is useful for you in a non-security setting,
awesome, but we shouldn't poke holes in it for that situation, nor
should it be surprising if the security gets stronger and becomes
harder to use for non-security.

--Jacob




Re: can we mark upper/lower/textlike functions leakproof?

2024-08-02 Thread Jacob Champion
On Thu, Aug 1, 2024 at 6:03 PM Robert Haas  wrote:
>
> On Thu, Aug 1, 2024 at 4:45 PM Jacob Champion
>  wrote:
> > Would it provide enough value for effort to explicitly mark leaky
> > procedures as such? Maybe that could shrink the grey area enough to be
> > protective?
>
> You mean like proleakproof = true/false/maybe?

Yeah, exactly.

--Jacob




Re: can we mark upper/lower/textlike functions leakproof?

2024-08-01 Thread Jacob Champion
On Thu, Aug 1, 2024 at 7:26 AM Tom Lane  wrote:
> Are you proposing that we invent two levels of leakproofness
> with different guarantees?  That seems like a mess, not least
> because there are going to be varying opinions about where we
> should set the bar for the lower level.

It kind of reminds me of the kernel's "paranoia level" tunables, which
seem to proliferate in weird ways [1] and not make for a particularly
great UX.

--Jacob

[1] 
https://askubuntu.com/questions/1400874/what-does-perf-paranoia-level-four-do




Re: can we mark upper/lower/textlike functions leakproof?

2024-08-01 Thread Jacob Champion
On Wed, Jul 31, 2024 at 1:26 PM Robert Haas  wrote:
> However, the risk is that an end-user is going to be much less able to
> evaluate what is and isn't safe than we are. I think some people are
> going to be like -- well the core project doesn't mark enough stuff
> leakproof, so I'll just go add markings to a bunch of stuff myself.
> And they probably won't stop at stuff like UPPER which is almost
> leakproof. They might add it to stuff such as LIKE which results in
> immediately giving away the farm. By not giving people any guidance,
> we invite them to make up their own rules.

+1.

Would it provide enough value for effort to explicitly mark leaky
procedures as such? Maybe that could shrink the grey area enough to be
protective?

--Jacob




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion
On Mon, Jul 29, 2024 at 1:51 PM Daniel Gustafsson  wrote:
> Together with a colleage we found the Azure provider use "verification_url"
> rather than xxx_uri.

Yeah, I think that's originally a Google-ism. (As far as I can tell
they helped author the spec for this and then didn't follow it. :/ ) I
didn't recall Azure having used it back when I was testing against it,
though, so that's good to know.

> Another discrepancy is that it uses a string for the
> interval (ie: "interval":"5").

Oh, that's a new one. I don't remember needing to hack around that
either; maybe iddawc handled it silently?

> One can of course argue that Azure is wrong and
> should feel bad, but I fear that virtually all (major) providers will have
> differences like this, so we will have to deal with it in an extensible 
> fashion
> (compile time, not runtime configurable).

Such is life... verification_url we will just have to deal with by
default, I think, since Google does/did it too. Not sure about
interval -- but do we want to make our distribution maintainers deal
with a compile-time setting for libpq, just to support various OAuth
flavors? To me it seems like we should just hold our noses and support
known (large) departures in the core.

> I was toying with making the name json_field name member an array, to allow
> variations.  That won't help with the fieldtype differences though, so another
> train of thought was to have some form of REQUIRED_XOR where fields can tied
> together.  What do you think about something along these lines?

If I designed it right, just adding alternative spellings directly to
the fields list should work. (The "required" check is by struct
member, not name, so both spellings can point to the same
destination.) The alternative typing on the other hand might require
something like a new sentinel "type" that will accept both... I hadn't
expected that.

> Another thing, shouldn't we really parse and interpret *all* REQUIRED fields
> even if we don't use them to ensure that the JSON is wellformed?  If the JSON
> we get is malformed in any way it seems like the safe/conservative option to
> error out.

Good, I was hoping to have a conversation about that. I am fine with
either option in principle. In practice I expect to add code to use
`expires_in` (so that we can pass it to custom OAuth hook
implementations) and `scope` (to check if the server has changed it on
us).

That leaves the provider... Forcing the provider itself to implement
unused stuff in order to interoperate seems like it could backfire on
us, especially since IETF standardized an alternate .well-known URI
[1] that changes some of these REQUIRED things into OPTIONAL. (One way
for us to interpret this: those fields may be required for OpenID, but
your OAuth provider might not be an OpenID provider, and our code
doesn't require OpenID.) I think we should probably tread lightly in
that particular case. Thoughts on that?

Thanks!
--Jacob

[1] https://www.rfc-editor.org/rfc/rfc8414.html




Re: [PoC] Federated Authn/z with OAUTHBEARER

2024-07-29 Thread Jacob Champion
On Mon, Jul 29, 2024 at 5:02 AM Peter Eisentraut  wrote:
> We should take the check for exit() calls from libpq and expand it to
> cover the other libraries as well.  Maybe there are other problems like
> this?

Seems reasonable, yeah.

> But under what circumstances does "the linker doesn't strip out" happen?
>   If this happens accidentally, then we should have seen some buildfarm
> failures or something?

On my machine, for example, I see differences with optimization
levels. Say you inadvertently call pfree() in a _shlib build, as I did
multiple times upthread. By itself, that shouldn't actually be a
problem (it eventually redirects to free()), so it should be legal to
call pfree(), and with -O2 the build succeeds. But with -Og, the
exit() check trips, and when I disassemble I see that pg_malloc() et
al. have infected the shared object. After all, we did tell the linker
to put that object file in, and we don't ask it to garbage-collect
sections.

> Also, one could look further and notice that restricted_token.c and
> sprompt.c both a) are not needed by libpq and b) can trigger exit()
> calls.  Then it's not clear why those are not affected.

I think it's easier for the linker to omit whole object files rather
than partial ones. If libpq doesn't use any of those APIs there's not
really a reason to trip over it.

(Maybe the _shlib variants should just contain the minimum objects
required to compile.)

> I'm reminded of thread [0].  I think there is quite a bit of confusion
> about the pqexpbuffer vs. stringinfo APIs, and they are probably used
> incorrectly quite a bit.  There are now also programs that use both of
> them!  This patch now introduces another layer on top of them.  I fear,
> at the end, nobody is going to understand any of this anymore.

"anymore"? :)

In all seriousness -- I agree that this isn't sustainable. At the
moment the worst pain (the new layer) is isolated to jsonapi.c, which
seems like an okay place to try something new, since there aren't that
many clients. But to be honest I'm not excited about deciding the Best
Way Forward based on a sample size of JSON.

> Also,
> changing all the programs to link in libpq for pqexpbuffer seems like
> the opposite direction from what was suggested in [0].

(I don't really want to keep that new libpq dependency. We'd just have
to decide where PQExpBuffer is going to go if we're not okay with it.)

> I think we need to do some deeper thinking here about how we want the
> memory management on the client side to work.  Maybe we could just use
> one API but have some flags or callbacks to control the out-of-memory
> behavior.

Any src/common code that needs to handle both in-band and out-of-band
failure modes will still have to decide whether it's going to 1)
duplicate code paths or 2) just act as if in-band failures can always
happen. I think that's probably essential complexity; an ideal API
might make it nicer to deal with but it can't abstract it away.

Thanks!
--Jacob




Re: Serverside SNI support in libpq

2024-07-25 Thread Jacob Champion
On Fri, May 24, 2024 at 12:55 PM Cary Huang  wrote:
> pg_hosts should also have sslpassword_command just like in the 
> postgresql.conf in
> case the sslkey for a particular host is encrypted with a different password.

Good point. There is also the HBA-related handling of client
certificate settings (such as pg_ident)...

I really dislike that these things are governed by various different
files, but I also feel like I'm opening up a huge can of worms by
requesting nestable configurations.

> +   if (ssl_snimode != SSL_SNIMODE_OFF)
> +   SSL_CTX_set_tlsext_servername_callback(context, 
> sni_servername_cb);
>
> If libpq client does not provide SNI, this callback will not be called, so 
> there
> is not a chance to check for a hostname match from pg_hosts, swap the TLS 
> CONTEXT,
> or possibly reject the connection even in strict mode.

I'm mistrustful of my own test setup (see previous email to the
thread), but I don't seem to be able to reproduce this. With sslsni=0
set, strict mode correctly shuts down the connection for me. Can you
share your setup?

(The behavior you describe might be a useful setting in practice, to
let DBAs roll out strict protection for new clients gracefully without
immediately blocking older ones.)

Thanks,
--Jacob




Re: Serverside SNI support in libpq

2024-07-25 Thread Jacob Champion
On Fri, May 10, 2024 at 7:23 AM Daniel Gustafsson  wrote:
> The way multiple certificates are handled is that libpq creates one SSL_CTX 
> for
> each at startup, and switch to the appropriate one when the connection is
> inspected.

I fell in a rabbit hole while testing this patch, so this review isn't
complete, but I don't want to delay it any more. I see a few
possibly-related problems with the handling of SSL_context.

The first is that reloading the server configuration doesn't reset the
contexts list, so the server starts behaving in really strange ways
the longer you test. That's an easy enough fix, but things got weirder
when I did. Part of that weirdness is that SSL_context gets set to the
last initialized context, so fallback doesn't always behave in a
deterministic fashion. But we do have to set it to something, to
create the SSL object itself...

I tried patching all that, but I continue to see nondeterministic
behavior, including the wrong certificate chain occasionally being
served, and the servername callback being called twice for each
connection (?!).

Since I can't reproduce the weirdest bits under a debugger yet, I
don't really know what's happening. Maybe my patches are buggy. Or
maybe we're running into some chicken-and-egg madness? The order of
operations looks like this:

1. Create a list of contexts, selecting one as an arbitrary default
2. Create an SSL object from our default context
3. During the servername_callback, reparent that SSL object (which has
an active connection underway) to the actual context we want to use
4. Complete the connection

It's step 3 that I'm squinting at. I wondered how, exactly, that
worked in practice, and based on this issue the answer might be "not
well":

https://github.com/openssl/openssl/issues/6109

Matt Caswell appears to be convinced that SSL_set_SSL_CTX() is
fundamentally broken. So it might just be FUD, but I'm wondering if we
should instead be using the SSL_ flavors of the API to reassign the
certificate chain on the SSL pointer directly, inside the callback,
instead of trying to set them indirectly via the SSL_CTX_ API.
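
Very roughly, I'm picturing something like this in the callback
(completely untested; host_for_name(), the pg_host fields, and
SSL_SNIMODE_STRICT are stand-ins for whatever the patch's pg_hosts
lookup and GUC end up looking like):

    static int
    sni_servername_cb(SSL *ssl, int *alert, void *arg)
    {
        const char *name = SSL_get_servername(ssl, TLSEXT_NAMETYPE_host_name);
        pg_host    *host = name ? host_for_name(name) : NULL;

        if (host == NULL)
            return (ssl_snimode == SSL_SNIMODE_STRICT) ?
                SSL_TLSEXT_ERR_ALERT_FATAL : SSL_TLSEXT_ERR_NOACK;

        /* swap credentials on the SSL object itself; no SSL_set_SSL_CTX() */
        if (SSL_use_certificate_chain_file(ssl, host->ssl_cert_file) != 1 ||
            SSL_use_PrivateKey_file(ssl, host->ssl_key_file, SSL_FILETYPE_PEM) != 1 ||
            SSL_check_private_key(ssl) != 1)
            return SSL_TLSEXT_ERR_ALERT_FATAL;

        return SSL_TLSEXT_ERR_OK;
    }

That would sidestep the SSL_set_SSL_CTX() question entirely, at the
cost of re-reading the key and certificate files per connection unless
we cache the parsed objects somewhere ourselves.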

Have you seen any weird behavior like this on your end? I'm starting
to doubt my test setup... On the plus side, I now have a handful of
debugging patches for a future commitfest.

Thanks,
--Jacob




Re: PG_TEST_EXTRA and meson

2024-07-23 Thread Jacob Champion
On Tue, Jul 23, 2024 at 3:32 AM Nazir Bilal Yavuz  wrote:
> Upthread Jacob said he could work on a patch about introducing the
> PG_TEST_EXTRA configure option to make builds. Would you still be
> interested in working on this? If not, I would gladly work on it.

Sure! Attached is a minimalist approach using AC_ARG_VAR.
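
(For anyone skimming: the idiom is basically a one-liner in
configure.ac -- the attached diff is authoritative, this is just the
general shape of AC_ARG_VAR:

    AC_ARG_VAR(PG_TEST_EXTRA,
               [enable selected extra test suites, e.g. "kerberos ssl"])

which documents the variable in ./configure --help and records its
value for config.status reruns.)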

It works for top-level `make check-world`, or `make check -C
src/test`. If you run `make check` directly from a test subdirectory,
the variable doesn't get picked up, because it's only exported from
the src/test level as of your patch c3382a3c3cc -- but if that turns
out to be a problem, we can plumb it all the way down or expand the
scope of the export.

Thanks,
--Jacob


make_test_extra.diff
Description: Binary data


Re: Document use of ldapurl with LDAP simple bind

2024-07-23 Thread Jacob Champion
On Tue, Jul 23, 2024 at 1:37 AM Peter Eisentraut  wrote:
> Committed.

Thanks!

> (I suppose this could be considered a bug fix, but I don't feel an
> urgency to go backpatching this.  Let me know if there are different
> opinions.)

Certainly no urgency. The docs part of the patch also could be
backported alone, but I don't feel strongly either way.

--Jacob




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-07-22 Thread Jacob Champion
On Wed, Jul 3, 2024 at 9:20 AM Daniel Gustafsson  wrote:
> It's essentially just polish and adding comments with the functional
> changes that a) it parses the entire list of curves so all errors can be
> reported instead of giving up at the first error; b) leaving the cipher suite
> GUC blank will set the suites to the OpenSSL default vale.

Is there an advantage to setting it to a compile-time default, as
opposed to just leaving it alone and not setting it at all? With the
current patch, if you dropped in a more advanced OpenSSL 3.x that
changed up the defaults, you wouldn't see any benefit.

Thanks,
--Jacob




Re: Add support to TLS 1.3 cipher suites and curves lists

2024-07-22 Thread Jacob Champion
On Fri, Jul 12, 2024 at 1:03 PM Daniel Gustafsson  wrote:
> The original author added the string parsing in order to provide a good error
> message in case of an error in the list, and since that seemed like a nice 
> idea
> I kept in my review revision.  With what you said above I agree it's not worth
> the extra complexity it brings so the attached revision removes it.

Misspelling a group now leads to the following error message for OpenSSL 3.0:

FATAL:  ECDH: failed to set curve names: no SSL error reported

Maybe a HINT would be nice here?:

HINT: Check that each group name is both spelled correctly and
supported by the installed version of OpenSSL.

or something.

> I don't have strong opinions on
> renaming ssl_ecdh_curve to reflect that it can take a list of multiple values,
> there is merit to having descriptive names but it would also be an invasive
> change for adding suffix 's'.

Can we just add an entry to map_old_guc_names to handle it? Something
like (untested)

 static const char *const map_old_guc_names[] = {
 "sort_mem", "work_mem",
 "vacuum_mem", "maintenance_work_mem",
+"ssl_ecdh_curve", "ssl_groups",
 NULL
 };

Re: Andres' concern about the ECDH part of the name, we could probably
keep the "dh" part, but I'd be wary of that changing underneath us
too. IANA changed the registry name to "TLS Supported Groups".

Thanks,
--Jacob




Re: PG_TEST_EXTRA and meson

2024-07-18 Thread Jacob Champion
On Wed, Jul 17, 2024 at 11:11 AM Tom Lane  wrote:
> Ah.  I have no particular objection to that, but I wonder whether
> we should make the autoconf/makefile infrastructure do it too.

I don't need it personally, having moved almost entirely to Meson. But
if the asymmetry is a sticking point, I can work on a patch.

Thanks!
--Jacob




Re: Proposal for implementing OCSP Stapling in PostgreSQL

2024-07-18 Thread Jacob Champion
On Wed, Jul 17, 2024 at 3:42 PM David Zhang  wrote:
> Totally agree. Either Implementing OCSP requests over HTTP, then parsing
> the response and then saving the results to a file, or using an OpenSSL
> client with a cron job to periodically update the file should work.
> Using a cron job would likely have less impact on PostgreSQL.

Yeah, my preference would be to farm this job out to OpenSSL entirely.
Implementing OCSP-over-HTTP ourselves seems unlikely to buy us much,
for all the work it would give us.
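
For example, I'd expect the documented "prefetch" step to boil down to
a cron job along these lines -- file names and the responder URL are
placeholders, and I'm assuming ssl_ocsp_file wants the raw DER
response that -respout writes:

    # refresh the stapled response well before its nextUpdate passes
    openssl ocsp -issuer server_ca.crt -cert server.crt \
        -url http://ocsp.example.com \
        -respout /path/to/ssl_ocsp_file

probably with -CAfile/-VAfile options added so the job verifies what
it fetched, plus whatever reload the server needs if it doesn't
re-read the file on its own. How often to run it depends entirely on
the CA's responder policy.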

> Totally agree, then we should limit OCSP stapling check for the
> leaf/PostgreSQL server certificate only in v1.

Sounds good. Maybe a future version can implement a check of the full
chain, but I imagine we'll have to follow someone else's lead.

> > A question from ignorance: how does the client decide that the
> > signature on the OCSP response is actually valid for the specific
> > chain in use?
>
> If I understand correctly, the certificate used by the OCSP responder to
> sign the OCSP response must be valid for the specific chain in use, or
> the admins allow to load a new chain to validate the certificate used to
> sign the OCSP response. I think it would be better to make this part to
> be more open.

Okay. That part needs more design work IMO, and solid testing.

> > If it's okay with you, I'd like to volunteer to refactor out the
> > duplicated recipes in sslfiles.mk. I have some particular ideas in
> > mind and don't want to make you play fetch-a-rock. (No pressure to use
> > what I come up with, if you don't like it.)
> That would be great, thanks a lot in advance!

No problem! I've attached two patches that can be optionally applied
on top of yours:
- 0001 simplifies the response generation into a single recipe.
- 0002 is a bigger change that uses `openssl ca` to generate index
files, as opposed to constructing them manually ourselves.

(The makefile will get even smaller without multi-stapling support,
but I didn't want to combine that with the refactor.)

For 0002, I'm wiping the new CA index for each recipe and rebuilding
it from scratch, then copying it into place (this relies on the
.NOTPARALLEL setting up top for correctness). I think there's probably
an even simpler approach, which would be to generate a single index
that can produce all of our desired responses. I can give that a try
once multi-stapling support is pulled out.

> Thanks a lot for reviewing and providing all your feedback!

You're very welcome, thanks for working on this feature!

--Jacob
From 1f4f030233a895586aa7e114c92f4aa213250752 Mon Sep 17 00:00:00 2001
From: Jacob Champion 
Date: Mon, 15 Jul 2024 06:40:07 -0700
Subject: [PATCH 2/2] WIP: move to OpenSSL-constructed index files

---
 src/test/ssl/conf/ocsp.config | 19 +
 src/test/ssl/sslfiles.mk  | 75 ++-
 2 files changed, 48 insertions(+), 46 deletions(-)
 create mode 100644 src/test/ssl/conf/ocsp.config

diff --git a/src/test/ssl/conf/ocsp.config b/src/test/ssl/conf/ocsp.config
new file mode 100644
index 00..831c2f681c
--- /dev/null
+++ b/src/test/ssl/conf/ocsp.config
@@ -0,0 +1,19 @@
+[ ca ]
+default_ca = ocsp
+
+# A shell of a CA, mostly duplicating the server CA, which is used only during
+# the OCSP index generation recipes.
+[ ocsp ]
+dir = ./ssl/
+
+# The database (or "index") is the main thing we want.
+database = ./ssl/ocsp-certindex
+
+# Everything else should all be unused, so we specify whatever's most
+# convenient. In particular there's no need to have a unique cert/key pair for
+# this.
+certificate = ./ssl/server_ca.crt
+private_key = ./ssl/server_ca.key
+serial = ./ssl/ocsp_ca.srl
+default_md = sha256
+policy = policy_match
diff --git a/src/test/ssl/sslfiles.mk b/src/test/ssl/sslfiles.mk
index 9ba88f0be9..bfefa6a33e 100644
--- a/src/test/ssl/sslfiles.mk
+++ b/src/test/ssl/sslfiles.mk
@@ -196,6 +196,7 @@ CLIENT_CERTS := $(CLIENTS:%=ssl/%.crt)
 root_ca_state_files := ssl/root_ca-certindex ssl/root_ca-certindex.attr 
ssl/root_ca.srl
 server_ca_state_files := ssl/server_ca-certindex ssl/server_ca-certindex.attr 
ssl/server_ca.srl
 client_ca_state_files := ssl/client_ca-certindex ssl/client_ca-certindex.attr 
ssl/client_ca.srl
+ocsp_ca_state_files := ssl/ocsp-certindex ssl/ocsp-certindex.attr 
ssl/ocsp_ca.srl
 
 # These are the workhorse recipes. `openssl ca` can't be safely run from
 # parallel processes, so we must mark the entire Makefile .NOTPARALLEL.
@@ -224,6 +225,7 @@ ssl/%.csr: ssl/%.key conf/%.config
 #
 
 .INTERMEDIATE: $(root_ca_state_files) $(server_ca_state_files) 
$(client_ca_state_files)
+.INTERMEDIATE: $(ocsp_ca_state_files)
 
 # OpenSSL requires a directory to put all generated certificates in. We don't
 # use this for anything, but we need a location.
@@ -245,68 +247,49 @@ ssl/%.srl:
 # OCSP
 #
 .INTERMEDIATE: $(OC

Re: PG_TEST_EXTRA and meson

2024-07-17 Thread Jacob Champion
On Wed, Jul 17, 2024 at 8:01 AM Tom Lane  wrote:
> Jacob Champion  writes:
> > Personally I use the config-time PG_TEST_EXTRA extensively. I'd be sad
> > to see it go, especially if developers are no longer forced to use it.
>
> The existing and documented expectation is that PG_TEST_EXTRA is an
> environment variable, ie it's a runtime option not a configure option.
> Making it be the latter seems like a significant loss of flexibility
> to me.

I think/hope we're saying the same thing -- developers should not be
forced to lock PG_TEST_EXTRA into their configurations; that's
inflexible and unhelpful.

What I'm saying in addition to that is, I really like that I can
currently put a default PG_TEST_EXTRA into my meson config so that I
don't have to keep track of it, and I do that all the time. So I'm in
favor of the "option 3" approach.

Thanks,
--Jacob




Re: PG_TEST_EXTRA and meson

2024-07-17 Thread Jacob Champion
On Wed, Jul 17, 2024 at 3:34 AM Nazir Bilal Yavuz  wrote:
> Sorry, the previous reply was wrong; I misunderstood what you said.
> Yes, that is how the patch was coded and I agree that getting rid of
> config time PG_TEST_EXTRA could be a better alternative.

Personally I use the config-time PG_TEST_EXTRA extensively. I'd be sad
to see it go, especially if developers are no longer forced to use it.
(In practice, I don't change that setting much after initial
configure, because I use separate worktrees/build directories for
different patchsets. And the reconfiguration is fast when I do need to
modify it.)

Thanks,
--Jacob




Re: PG_TEST_EXTRA and meson

2024-07-16 Thread Jacob Champion
On Tue, Jul 16, 2024 at 2:12 PM Nazir Bilal Yavuz  wrote:
>
> 2- If PG_TEST_EXTRA is set from the setup command, use it from the
> setup command and discard the environment variable. If PG_TEST_EXTRA
> is not set from the setup command, then use it from the environment.

Is there a way for the environment to override the Meson setting
rather than vice-versa? My vote would be to have both available, but
with the implementation in patch 2 I'd still have to reconfigure if I
wanted to change my test setup.

Thanks!
--Jacob




Re: Proposal for implementing OCSP Stapling in PostgreSQL

2024-07-15 Thread Jacob Champion
On Tue, Mar 5, 2024 at 4:12 PM David Zhang  wrote:
> This is the third version patch for "Certificate status check using OCSP
> Stapling" with ssl regression test cases added.

Hi David,

Thanks again for working on this! So far I've taken a look at the
design and tests. I've only skimmed the callback implementations; I
plan to review those in more detail once the architecture is nailed
down.

= Design =

It looks like this design relies on the DBA to manually prefetch OCSP
responses for their cert chain, and cache them in the local
ssl_ocsp_file. This is similar to Nginx's ssl_stapling_file directive
[1]. I think this may make sense for a v1 (much less code!), but it's
going to take a lot of good documentation on what, exactly, an admin
has to do to make sure their response file is fresh. IIUC it will
depend on the CA's policy and how they operate their responder.

We should probably be prepared for complaints that we don't run the
client ourselves. A v2 could maybe include an extension for running
the OCSP client in a background worker? Or maybe that's overkill, and
an existing job scheduler extension could do it, but having a custom
worker that could periodically check the response file and provide
warnings when it gets close to expiring might be helpful for admins.

I am worried about the multi-stapling that is being done in the tests,
for example the server-ca-ocsp-good response file. I think those tests
are cheating a bit. Yes, for this particular case, we can issue a
single signed response for both the intermediate and the leaf -- but
that's only because we issued both of them. What if your intermediate
and your root use different responders [2]? What if your issuer's
responder doesn't even support multiple certs per request [3]?

(Note that OpenSSL still doesn't support multi-stapling [4], even in
TLS 1.3 where we were supposed to get it "for free".)

I think it would be good for the sslocspstapling directive to 1) maybe
have a shorter name (cue bikeshed!) and 2) support more than a binary
on/off. For example, the current implementation could use
"disable/require" options, and then a v2 could add "prefer" which
simply sends the status request and honors must-staple extensions on
the certificate. (That should be safe to default to, I think, and it
would let DBAs control the stapling rollout more easily.)

A question from ignorance: how does the client decide that the
signature on the OCSP response is actually valid for the specific
chain in use?

= Tests =

I think the tests should record the expected_stderr messages for
failures (see the Code section below for why).

If it turns out that multi-stapling is safe, then IMO the tests should
explicitly test both TLSv1.2 and v1.3, since those protocols differ in
how they handle per-certificate status.

If it's okay with you, I'd like to volunteer to refactor out the
duplicated recipes in sslfiles.mk. I have some particular ideas in
mind and don't want to make you play fetch-a-rock. (No pressure to use
what I come up with, if you don't like it.)

Because a CRL is a valid fallback for OCSP in practice, I think there
should be some tests for their interaction. (For v1, maybe that's as
simple as "you're not allowed to use both yet", but it should be
explicit.)

= Code =

(this is a very shallow review)

+#define OCSP_CERT_STATUS_OK    1
+#define OCSP_CERT_STATUS_NOK   (-1)

Returning -1 from the callback indicates an internal error, so we're
currently sending the wrong alerts for OCSP failures ("internal error"
rather than "bad certificate status response") and getting unhelpful
error messages from libpq. For cases where the handshake proceeds
correctly but we don't accept the OCSP response status, I think we
should be returning zero.

Also, we should not stomp on the OCSP_ namespace (I thought these
macros were part of the official OpenSSL API at first).

+   /* set up OCSP stapling callback */
+   SSL_CTX_set_tlsext_status_cb(SSL_context, ocsp_stapling_cb);

It seems like this should only be done if ssl_ocsp_file is set?

Thanks again!
--Jacob

[1] https://nginx.org/en/docs/http/ngx_http_ssl_module.html#ssl_stapling_file
[2] as appears to be the case for postgresql.org
[3] https://community.letsencrypt.org/t/bulk-ocsp-requests/168156/2
[4] https://github.com/openssl/openssl/pull/20945




Re: Can't find bugs to work on

2024-07-12 Thread Jacob Champion
On Fri, Jul 12, 2024 at 7:59 AM Mohab Yaser
 wrote:
>
> So if anyone faced the same problem please let me know.

I agree that it's rough at the moment. I don't pretend to have any
solutions, but you might check the Bug Fixes section of the current
Commitfest, to help review:

https://commitfest.postgresql.org/48/

or our TODO list:

https://wiki.postgresql.org/wiki/Todo

or the Live Issues section of our v17 release items:

https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items#Live_issues

(but that last one is unlikely to be beginner-friendly). None of these
are ideal for your use case, to be honest, but maybe someone else here
has a better idea.

HTH,
--Jacob




Re: Document use of ldapurl with LDAP simple bind

2024-07-08 Thread Jacob Champion
On Fri, Jun 28, 2024 at 12:11 AM Peter Eisentraut  wrote:
> This appears to imply that specifying ldapurl is only applicable for
> search+bind.  Maybe that whole message should be simplified to something
> like
>
> "configuration mixes arguments for simple bind and search+bind"
>
> (The old wording also ignores that the error might arise via "ldapsuffix".)

I kept the imperative "cannot" and tried to match the terminology with
our documentation at [1]:

cannot mix options for simple bind and search+bind modes

WDYT?

--Jacob

[1] https://www.postgresql.org/docs/17/auth-ldap.html


v2-0001-docs-explain-how-to-use-ldapurl-with-simple-bind.patch
Description: Binary data


v2-0002-ldap-test-ldapurl-with-simple-bind.patch
Description: Binary data


v2-0003-hba-improve-error-when-mixing-LDAP-bind-modes.patch
Description: Binary data


Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible

2024-07-08 Thread Jacob Champion
On Sun, Jun 30, 2024 at 10:48 AM Noah Misch  wrote:
> That looks like a reasonable user experience.  Is any field newly-nullable?

Technically I think the answer is no, since backends such as walwriter
already have null database and user fields. It's new for a client
backend to have nulls there, though.

> That said, it
> may be more fruitful to arrange for authentication timeout to cut through PAM
> etc.

That seems mostly out of our hands -- the misbehaving modules are free
to ignore our signals (and do). Is there another way to force the
issue?

> Hanging connection slots hurt even if they lack an xmin.

Oh, would releasing the xmin not really move the needle, then?

> I assume it
> takes an immediate shutdown to fix them?

That's my understanding, yeah.

> > Would anyone like me to be more aggressive, and create a pgstat entry
> > as soon as we have the opening transaction? Or... as soon as a
> > connection is made?
>
> All else being equal, I'd like backends to have one before taking any lmgr
> lock or snapshot.

I can look at this for the next patchset version.

> > I haven't decided how to test these patches. Seems like a potential
> > use case for injection points, but I think I'd need to preload an
> > injection library rather than using the existing extension. Does that
> > seem like an okay way to go?
>
> Yes.

I misunderstood how injection points worked. No preload module needed,
so v2 adds a waitpoint and a test along with a couple of needed tweaks
to BackgroundPsql. I think 0001 should probably be applied
independently.

Thanks,
--Jacob


v2-0001-BackgroundPsql-handle-empty-query-results.patch
Description: Binary data


v2-0002-Test-Cluster-let-background_psql-work-asynchronou.patch
Description: Binary data


v2-0004-WIP-report-external-auth-calls-as-wait-events.patch
Description: Binary data


v2-0003-pgstat-report-in-earlier-with-STATE_AUTHENTICATIN.patch
Description: Binary data


Re: Direct SSL connection and ALPN loose ends

2024-06-25 Thread Jacob Champion
On Tue, Jun 25, 2024 at 7:20 AM Dave Cramer  wrote:
>
> On Tue, 25 Jun 2024 at 09:37, Vladimir Sitnikov  
> wrote:
>>
>> "SSL". Technically, the proper term is TLS, and even the document refers to 
>> "IANA TLS ALPN Protocol IDs" (TLS, not SSL).
>> I would not die on that hill, however, going for tlsnegotiation would look 
>> better than sslnegotiation.
>
> +1 again, unusual to use SSL when this really is TLS.

This was sort of litigated last ye-(checks notes) oh no, three years ago:


https://www.postgresql.org/message-id/flat/CE12DD5C-4BB3-4166-BC9A-39779568734C%40yesql.se

I'm on your side when it comes to the use of the TLS acronym, personally,
but I think introducing a brand new option that interfaces with
sslmode and sslrootcert and etc. while not being named like them would
be outright unhelpful. And the idea of switching everything to use TLS
in docs seemed to be met with a solid "meh" on the other thread.

--Jacob




Re: Direct SSL connection and ALPN loose ends

2024-06-25 Thread Jacob Champion
On Thu, Jun 20, 2024 at 4:32 PM Jacob Champion
 wrote:
> Thanks,
> --Jacob

Hey Heikki,

[sending this to the list in case it's not just me]

I cannot for the life of me get GMail to deliver your latest message,
even though I see it on postgresql.org. It's not in spam; it's just
gone. I wonder if it's possibly the Perl server script causing
virus-scanner issues?

--Jacob




Re: Direct SSL connection and ALPN loose ends

2024-06-20 Thread Jacob Champion
On Thu, Jun 20, 2024 at 4:13 PM Heikki Linnakangas  wrote:
> > By "negotiation" I mean the server's response to the startup packet.
> > I.e. "supported"/"not supported"/"error".
>
> Ok, I'm still a little confused, probably a terminology issue. The
> server doesn't respond with "supported" or "not supported" to the
> startup packet, that happens earlier. I think you mean the SSLRequest /
> GSSRequest packet, which is sent *before* the startup packet?

Yes, sorry. (I'm used to referring to those as startup packets too, ha.)

> Hmm, right, GSS encryption was introduced in v12, and older versions
> respond with an error to a GSSRequest.
>
> We probably could make the same assumption for GSS as we did for TLS in
> a49fbaaf, i.e. that an error means that something's wrong with the
> server, rather than that it's just very old and doesn't support GSS. But
> the case for that is a lot weaker case than with TLS. There are still
> pre-v12 servers out there in the wild.

Right. Since we default to gssencmode=prefer, if you have Kerberos
creds in your environment, I think this could potentially break
existing software that connects to v11 servers once you upgrade libpq.

Thanks,
--Jacob




Re: Direct SSL connection and ALPN loose ends

2024-06-20 Thread Jacob Champion
On Mon, Jun 17, 2024 at 9:23 AM Jacob Champion
 wrote:
> > I think the behavior with v2 and v3 errors should be the same. And I
> > think an immediate failure is appropriate on any v2/v3 error during
> > negotiation, assuming we don't use those errors for things like "TLS not
> > supported", which would warrant a fallback.
>
> For GSS encryption, it was my vague understanding that older servers
> respond with an error rather than the "not supported" indication. For
> TLS, though, the decision in a49fbaaf (immediate failure) seemed
> reasonable.

Would an open item for this be appropriate?

Thanks,
--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-18 Thread Jacob Champion
On Fri, Jun 14, 2024 at 8:49 AM Tom Lane  wrote:
> I think that's an oversimplified analysis.  Sure, the languages are
> both Turing-complete, but for our purposes here they are both simply
> glue languages around some set of testing facilities.  Some of those
> facilities will be provided by the base languages (plus whatever
> extension modules we choose to require) and some by code we write.
> The overall experience of writing tests will be determined by the
> testing facilities far more than by the language used for glue.

+1. As an example, the more extensive (and high-quality) a language's
standard library, the more tests you'll be able to write. Convincing
committers to adopt a new third-party dependency is hard (for good
reason); the strength of the standard library should be considered as
a point of technical comparison.

> That being the case, I do agree with your point that Python
> equivalents to most of PostgreSQL::Test will need to be built up PDQ.
> Maybe they can be better than the originals, in features or ease of
> use, but "not there at all" is not better.

There's a wide gulf between "not there at all" and "reimplement it all
as a prerequisite for v1" as Robert proposed. I'm arguing for a middle
ground.

> But what I'd really like to see is some comparison of the
> language-provided testing facilities that we're proposing
> to depend on.  Why is pytest (or whatever) better than Test::More?

People are focusing a lot on failure reporting, and I agree with them,
but I did try to include more than just that in my OP.

I'll requote what I personally think is the #1 killer feature of
pytest, which prove and Test::More appear to be unable to accomplish
on their own: configurable isolation of tests from each other via
fixtures [1].

Problem 1 (rerun failing tests): One architectural roadblock to this
in our Test::More suite is that tests depend on setup that's done by
previous tests. pytest allows you to declare each test's setup
requirements via pytest fixtures, letting the test runner build up the
world exactly as it needs to be for a single isolated test. These
fixtures may be given a "scope" so that multiple tests may share the
same setup for performance or other reasons.
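
To make that concrete, here's a minimal sketch of what a scoped fixture
could look like (hypothetical names, not code from my OAuth suite; it
assumes initdb and pg_ctl are on PATH and that the default port is free):

    import subprocess
    import pytest

    @pytest.fixture(scope="session")
    def pg_cluster(tmp_path_factory):
        """Init one throwaway cluster and share it across the whole session."""
        datadir = tmp_path_factory.mktemp("data")
        subprocess.run(["initdb", "-D", str(datadir)], check=True)
        subprocess.run(["pg_ctl", "-D", str(datadir), "-w", "start"], check=True)
        yield datadir
        subprocess.run(["pg_ctl", "-D", str(datadir), "-m", "immediate", "stop"],
                       check=True)

    def test_config_exists(pg_cluster):
        # The runner builds (or reuses) exactly the setup this test declares.
        assert (pg_cluster / "postgresql.conf").exists()

A single test can then be rerun in isolation with something like
pytest test_cluster.py::test_config_exists, and pytest rebuilds only the
fixtures that test actually declares.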

When I'm doing red-to-green feature development (e.g. OAuth) or
green-to-green refactoring (e.g. replacement of libiddawc with libcurl
in OAuth), quick cycle time and reduction of noise is extremely
important. I want to be able to rerun just the single red test I care
about before moving on.

(Tests may additionally be organized with custom attributes. My OAuth
suite contains tests that must run slowly due to mandatory timeouts;
I've marked them as slow, and they can be easily skipped from the test
runner.)

2. The ability to break into a test case with the built-in debugger
[2] is also fantastic for quick red-green work. Much better than
print() statements.

(Along similar lines, even the ability to use Python's built-in REPL
increases velocity. Python users understand that they can execute
`python3` and be dropped into a sandbox to try out syntax or some
unfamiliar library. Searching for how to do this in Perl results in a
handful of custom-built scripts; people here may know which to use as
a Perl monk, sure, but the point is to make it easy for newcomers to
write tests.)

> I also wonder about integration of python-based testing with what
> we already have.  A significant part of what you called the meson
> work had to do with persuading pg_regress, isolationtester, etc
> to output test results in the common format established by TAP.
> Am I right in guessing that pytest will have nothing to do with that?

Andres covered this pretty well. I will note that I had problems with
pytest-tap itself [3], and I'm unclear whether that represents a bug
in pytest-tap or a bug in pytest.

Thanks,
--Jacob

[1] https://docs.pytest.org/en/stable/how-to/fixtures.html
[2] https://docs.pytest.org/en/stable/how-to/failures.html
[3] https://github.com/python-tap/pytest-tap/issues/30




Re: RFC: adding pytest as a supported test framework

2024-06-18 Thread Jacob Champion
(slowly catching up from the weekend email backlog)

On Fri, Jun 14, 2024 at 5:10 AM Robert Haas  wrote:
> I mean, both Perl and Python are Turing-complete.

Tom responded to this better than I could have, but I don't think this
is a helpful statement. In fact I opened the unconference session with
it and immediately waved it away as not-the-point.

> I just don't believe in the idea that we're going to write one
> category of tests in one language and another category in another
> language.

You and I will probably not agree, then, because IMO we already do
that. SQL behavior is tested in SQL via pg_regress characterization
tests. End-to-end tests are written in Perl. Lower-level tests are
often written in C (and, unfortunately, then driven in Perl instead of
C; see above mail to Noah).

I'm fundamentally a polyglot tester by philosophy, so I don't see
careful use of multiple languages as an inherent problem to be solved.
They increase complexity (considerably so!) but generally provide
sufficient value to offset the cost.

> As soon as we open the door to Python tests, people are
> going to start writing the TAP tests that they would have written in
> Perl in Python instead.

There's a wide spectrum of opinions between yours (which I will
cheekily paraphrase as "people will love testing in Python so much
they'll be willing to reinvent all of the wheels" -- which in the
short term would increase maintenance cost but in the long term sounds
like a very good problem to have), and people who seem to think that
new test suite infrastructure would sit unused because no one wants to
write tests anyway (to pull a strawman out of some hallway
conversations at PGConf.dev). I think the truth is probably somewhere
in the middle?

My prior mail was an attempt to bridge the gap between today and the
medium term, by introducing a series of compromises and incremental
steps in response to specific fears. We can jump forward to the end
state and try to talk about it, but I don't control the end state and
I don't have a crystal ball.

> So as I see
> it, the only reasonable plan here if we want to introduce testing in
> Python (or C#, or Ruby, or Go, or JavaScript, or Lua, or LOLCODE) is
> to try to achieve a reasonable degree of parity between that language
> and Perl. Because then we can at least review the new infrastructure
> all at once, instead of incrementally spread across many patches
> written, reviewed, and committed by many different people.

I don't at all believe that a test API which is ported en masse as a
horizontal layer, without motivating vertical slices of test
functionality, is going to be fit for purpose.

And "written, reviewed, and committed by many different people" is a
feature for me, not a bug. One of the goals of the thread is to
encourage more community test writing than we currently have.
Otherwise, I could have kept silent (I am very happy with my personal
suite and have been comfortably maintaining it for a while). I am
trying to build community momentum around a pain point that is
currently rusted in place.

> Consider the meson build system project. To get that committed, Andres
> had to make it do pretty much everything MSVC could do and mostly
> everything that configure could do

I think some lessons can be pulled from that, but fundamentally that's
a port of the build infrastructure done by a person with a commit bit.
There are some pretty considerable differences. (And even then, it
wasn't "done" with the first volley of patches, right?)

Thanks,
--Jacob




Re: ecdh support causes unnecessary roundtrips

2024-06-17 Thread Jacob Champion
On Mon, Jun 17, 2024 at 10:01 AM Andres Freund  wrote:
> On 2024-06-17 12:00:30 +0200, Daniel Gustafsson wrote:
> > To set the specified curve in ssl_ecdh_curve we have to don't we?
>
> Sure, but it's not obvious to me why we actually want to override openssl's
> defaults here. There's not even a parameter to opt out of forcing a specific
> choice on the server side.

I had exactly the same question in the context of the other thread, and found


https://www.openssl.org/blog/blog/2022/10/21/tls-groups-configuration/index.html

My initial takeaway was that our default is more restrictive than it
should be, but the OpenSSL default is more permissive than what they
recommend in practice, due to denial of service concerns:

> A general recommendation is to limit the groups to those that meet the
> required security level and that all the potential TLS clients support.

--Jacob




Re: Direct SSL connection and ALPN loose ends

2024-06-17 Thread Jacob Champion
On Mon, Jun 17, 2024 at 8:24 AM Heikki Linnakangas  wrote:
> By "negotiation", which part of the protocol are we talking about
> exactly? In the middle of the TLS handshake? After sending the startup
> packet?

By "negotiation" I mean the server's response to the startup packet.
I.e. "supported"/"not supported"/"error".

> I think the behavior with v2 and v3 errors should be the same. And I
> think an immediate failure is appropriate on any v2/v3 error during
> negotiation, assuming we don't use those errors for things like "TLS not
> supported", which would warrant a fallback.

For GSS encryption, it was my vague understanding that older servers
respond with an error rather than the "not supported" indication. For
TLS, though, the decision in a49fbaaf (immediate failure) seemed
reasonable.

Thanks,
--Jacob




Re: Direct SSL connection and ALPN loose ends

2024-06-17 Thread Jacob Champion
On Mon, Apr 29, 2024 at 8:24 AM Heikki Linnakangas  wrote:
> I was mostly worried about the refactoring of the
> retry logic in libpq (and about the pre-existing logic too to be honest,
> it was complicated before these changes already).

Some changes in the v17 negotiation fallback order caught my eye:

1. For sslmode=prefer, a modern v3 error during negotiation now
results in a fallback to plaintext. For v16 this resulted in an
immediate failure. (v2 errors retain the v16 behavior.)
2. For gssencmode=prefer, a legacy v2 error during negotiation now
results in an immediate failure. In v16 it allowed fallback to SSL or
plaintext depending on sslmode.

Are both these changes intentional/desirable? Change #1 seems to
partially undo the decision made in a49fbaaf:

> Don't assume that "E" response to NEGOTIATE_SSL_CODE means pre-7.0 server.
>
> These days, such a response is far more likely to signify a server-side
> problem, such as fork failure. [...]
>
> Hence, it seems best to just eliminate the assumption that backing off
> to non-SSL/2.0 protocol is the way to recover from an "E" response, and
> instead treat the server error the same as we would in non-SSL cases.

Thanks,
--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-13 Thread Jacob Champion
On Thu, Jun 13, 2024 at 1:27 PM Robert Haas  wrote:
>
> On Thu, Jun 13, 2024 at 4:06 PM Jacob Champion
>  wrote:
> > There was a four-step plan sketch at the end of that email, titled "A
> > Plan". That was not intended to be "the final detailed plan", because
> > I was soliciting feedback on the exact pieces people wanted to try to
> > implement first, and I am not the God Emperor of Pytest. But it was
> > definitely A Plan.
>
> Well, OK, now I feel a bit dumb. I guess I missed that or forgot about it.

No worries. It's a really long thread. :D

But also: do you have opinions on what to fill in as steps 2
(something we have no ability to test today) and 3 (something we do
test today, but hate)?

My vote for step 2 is "client and server separation", perhaps by
testing libpq fallback against a server that claims support for
different build-time options. I don't want to have a vote in step 3,
because part of that step is proving that this framework can provide
value for a part of the project I don't really know much about.

> > I think this is way too much expectation for a v1 patch. If you were
> > committing this by yourself, would you agree to develop the entirety
> > of PostgreSQL::Test in a single commit, without the benefit of the
> > buildfarm checking you as you went, and other people trying to write
> > tests with it?
>
> Eh... I'm confused. PostgreSQL::Test::Cluster is more than half of the
> code in that directory, and without it you wouldn't be able to write
> most of the TAP tests that we have today.

Well, in my defense, you said "PostgreSQL::Test::whatever", which I
assumed meant all of it, including Kerberos.pm and SSL::Server and
AdjustUpgrade and... That seemed like way too much to me (and still
does!), but if that's not what you were arguing then never mind.

Yes, Cluster.pm seems like a pretty natural thing to ask for. I
imagine it's one of the first things we're going to need. And yet...

> You would really want to
> call this project done without having an equivalent?

...I have this really weird sneaking suspicion that, if a replacement
of the end-to-end Perl acceptance tests can be made an explicit
anti-goal in the short term, we might not necessarily need an
"equivalent" for v1. I realize that seems bizarre, because of course
we need a way to start the server if we want to test the server. But
frankly, starting a server is Pretty Easy (tm), and Cluster.pm has to
do a lot more than that because IMO it's designed for a variety of
acceptance-oriented tasks. 3000+ lines!

If there's widespread interest (as opposed to being just my own
personal fever dream) in testing Postgres components as individual
pieces rather than setting up the world, then I wonder if the
functionality from Cluster.pm couldn't be pared down a lot. Maybe you
don't need a centralized ->psql() or a ->command_ok() helper, because
you're usually not trying to test psql and other utilities during your
server-only tests.

Maybe you can just stand up a standby without a primary and drive it
via mock replication. Do you need quite as many "poll and wait for
some asynchronous result" type things when you're not waiting for a
result to cascade through a multinode system? Does something like (for
example) ->pg_recvlogical_upto() really have to be implemented in our
"core" fixtures or can it be done more easily by whoever needs that in
the future? Maybe You Ain't Gonna Need It.

If (he said, atop his naive idealistic soapbox) we can find a way to
put off writing utilities until we write the tests that need them,
without procrastinating, and without putting all of the negative
externalities of that approach on the committers with low-quality
copy-paste proliferation, and I'd like a pony while I'm at it, then I
think the result might end up being pretty coherent and maintainable.
Then not having "at least as much in-tree support for writing tests as
we have today" for the very first commit would be a feature and not a
bug.

Now, maybe if the collective ability to do that existed, we would have
done it already with Perl, but I do actually wonder whether that's
true or not.

Or, maybe, the very first suggestion for Step 3 will be something that
needs absolutely everything in Cluster.pm. So be it; I can live
without a pony.

> You would really want to
> call this project done without having an equivalent?

(A cop-out but not-really-cop-out alternative answer to this question
is that this project is not going to be "done" any more than Postgres
will ever be "done", and that's part of what I'm arguing should be
considered natural and okay. I understand that it is easier for me to
take that stance when I am not on the hook for

Re: RFC: adding pytest as a supported test framework

2024-06-13 Thread Jacob Champion
On Thu, Jun 13, 2024 at 1:04 PM Robert Haas  wrote:
> One caveat here,
> perhaps, is that the focus of the work you've done up until now and
> the things that I and other community members may want as a condition
> of merging stuff may be somewhat distinct. You will have naturally
> been focused on your goals rather than other people's goals, or so I
> assume.

Right. That's a risk I knew I was taking when I wrote it, so it's not
going to offend me when I need to rewrite things.

> I would be a bit wary of splitting it up over
> too many different threads. It may well make sense to split it up, but
> it will probably be easier to review if the core work to enable this
> is one patch set on one thread where someone can read just that one
> thread and understand the situation, rather than many threads where
> you have to read them all.

I'll try to avoid too many threads. But right now there is indeed just
one thread (OAUTHBEARER) and it's way too much:

- the introduction of pytest
- a Construct-based manipulation of the wire protocol, including
Wireshark-style network traces on failure
- pytest fixtures which spin up libpq and the server in isolation from
each other, relying on the Construct implementation to complete the
seam
- OAuth, which was one of the motivating use cases (but not the only
one) for all of the previous items

I really don't want to derail this thread with those. There are other
people here with their own hopes and dreams (see: unconference notes),
and I want to give them a platform too.

> > That doesn't mean I want to
> > do this without a plan; it just means that the plan can involve saying
> > "this is not working and we can undo it" which makes the uncertainty
> > easier to take.
>
> As a community, we're really bad at this. [...]

I will carry the response to this to the next email.

Thanks,
--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-13 Thread Jacob Champion
On Thu, Jun 13, 2024 at 11:11 AM Robert Haas  wrote:
> I feel like I already agreed to this in a previous email and you're
> continuing to argue with me as if I were disagreeing.

I also think that maybe arguments are starting to sail past each
other, and the temperature is starting to climb. (And Jelte may be
arguing to all readers of the thread, rather than just a single
individual. It's hard to tell with the list format.) And now I see
that there's another email that came in while I was writing this, but
I think I'm going to have to send this as-is because I can't write
emails that fast.

> I also agree with this. I'm just not super optimistic about how much
> of that will actually happen. And I'd like to hear you acknowledge
> that concern and think about whether it can be addressed in some way,
> instead of just repeating that we should do it anyway. Because I agree
> we probably should do it anyway, but that doesn't mean I wouldn't like
> to see the downsides mitigated as much as we can.

Okay, +1.

> In particular, if
> the proposal is exactly "let's add the smallest possible patch that
> enables people to write tests in Python and then add a few new tests
> in Python while leaving almost everything else in Perl, with no
> migration plan and no clear vision of how the Python support ever gets
> any better than the minimum stub that is proposed for initial commit,"
> then I don't know that I can vote for that plan.

(that's not the proposal and I know/think you know that but having my
original email twisted into that is making me feel a bit crispy)

I do not want to migrate, and I have stated so multiple times, which
is why I have not proposed a migration plan. Other committers have
already expressed resistance to the idea that we would rewrite the
Perl stuff. I think they're right. I think we should not. I think we
should accept the cost of both Perl and something else for the
near-to-medium future, as long as the "something else" gives us value
offsetting the additional cost.

> Honestly, that sounds
> like very little work for the person proposing that minimal patch and
> a whole lot of work for the rest of the community later on, and the
> evidence is not in favor of volunteers showing up to take care of that
> work.

Okay, cool. Here: as the person who is 100% signing himself up to do
that, time for me to jump in.

I have an entire 6000-something-line suite of protocol tests that has
been linked like four times above. It does something fundamentally
different from the end-to-end Perl suite; it is not a port. It is far
from perfect and I do not want to pressure people to adopt it as-is,
which is why I have not. In this thread, I am offering it solely as
evidence that I have follow-up intent.

But I might get hit by a bus. Or, as far as anyone except me knows, I
might lose interest after things get hard, which would be sad. Which
is why my very first proposal was to add an entry point that can be
reverted. The suite is not going to infect the codebase any more than
the Perl does. A revert will involve pulling the Meson test entry
code, and deleting all pytest subdirectories (of which there is only
one, IIRC, in my OAuth suite).

> The plan should be more front-loaded than that: enough initial
> development should get done by the people making the proposal that if
> the work stops after, we don't have another big mess on our hands.

For me personally, the problem is the opposite. I have done _so much_
initial development by myself that there's no way it could ever be
reviewed and accepted. But I had to do that to get meaningful
development done in my style of work, which is focused on security and
testability and verifiable implementation.

I am trying to carve off pieces of that and say "hey, does this look
nice to anyone else?" That will take time, probably over multiple
different threads. In the meantime, I don't want to be a serialization
point for other people who are excited about trying new testing
methods, because very few people are currently doing the exact kind of
work I am doing. They may want to do other things, as evidenced by the
thread contents. At least one committer would have to sign up to be a
serialization point, unfortunately, but I think that's going to be
true regardless of plan, if we want multiple non-committer members of
the community to be involved instead of just one torch-bearer.

Because of how many moving parts and competing interests and personal
disagreements there are, I am firmly in the camp of "try something
that many people think *might* work better, that can be undone if it
sucks, and iterate on it." I want to build community momentum, because
I think that's a pretty effective way to change the cultural norms
that you're saying you're frustrated with. That doesn't mean I want to
do this without a plan; it just means that the plan can involve saying
"this is not working and we can undo it" which makes the uncertainty
easier to take.

--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-13 Thread Jacob Champion
On Thu, Jun 13, 2024 at 12:20 PM Robert Haas  wrote:
> I interpreted Jacob's original email as articulating a goal ("For the
> v18 cycle, I would like to try to get pytest [1] in as a supported
> test driver, in addition to the current offerings") rather than a
> plan.

That's the first part of it.

> There's no patch set yet and, as I understand it, no detailed
> plan for a patch set: that email seemed to focus on the question of
> desirability, rather than on outlining a plan of work, which I assume
> is still to come.

There was a four-step plan sketch at the end of that email, titled "A
Plan". That was not intended to be "the final detailed plan", because
I was soliciting feedback on the exact pieces people wanted to try to
implement first, and I am not the God Emperor of Pytest. But it was
definitely A Plan.

> Some things I'd like to see when a patch set does
> show up are:
>
> - good documentation for people who have no previous experience with
> Python and/or pytest e.g. here's how to set up your environment on
> Linux, Windows, macOS, *BSD so you can run the tests, here's how to
> run the tests, here's how it's different from the Perl framework we
> have now

+1

> - no external dependencies on PostgreSQL connectors. psql or libpq
> foreign function interface. the latter would be a cool increment of
> progress over the status quo.

If this is a -1 for psycopg, then I will cast my vote for ctypes/CFFI
and against psql.

> - at least as much in-tree support for writing tests as we have today
> with PostgreSQL::Test::whatever, but not necessarily a 1:1 reinvention
> of the stuff we have now, and documentation of those facilities that
> is as good or, ideally, better than what we have today.

I think this is way too much expectation for a v1 patch. If you were
committing this by yourself, would you agree to develop the entirety
of PostgreSQL::Test in a single commit, without the benefit of the
buildfarm checking you as you went, and other people trying to write
tests with it?

> - high overall code quality and level of maturity, not just something
> someone threw together for parity with the Perl system.

+1

> - enough tests written for or converted to the new system to give
> reviewers confidence that it's truly usable and fit for purpose.

This is that "know everything up front" tax that I think is not
reasonable for a test framework. If the thing you're trying to avoid
is the foot-in-the-door phenomenon, I would agree with you for a
Postgres feature. But these are tests; we don't ship them, we have
different rules for backporting them, they are developed in a very
different way.

> The important thing to me here (as it so often is) is to think like a
> maintainer. Imagine that immediately after the patches for this
> feature are committed, the developers who did the work all disappear
> from the community and are never heard from again. How much pain does
> that end us causing? The answer doesn't need to be zero; that is
> unrealistic. But it also shouldn't be "well, if that happens we're
> going to have to rip the feature out"

Can you elaborate on why that's not an okay outcome?

> or "well, a bunch of committers
> who didn't want to write tests in Python in the first place are now
> going to have to do a lot of work in Python to stabilize the work
> already committed."

Is it that? If the problem is that, we should address that. Because if
that is truly the fear, I cannot assuage that fear without showing you
something, and I cannot show you something you do not want to see, if
you don't want to write tests in Python in the first place.

--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-12 Thread Jacob Champion
On Wed, Jun 12, 2024 at 10:30 AM Daniel Gustafsson  wrote:
> I might be missing something obvious, but if we use a third-party libpq driver
> in the testsuite doesn't that imply that a patch adding net new functionality
> to libpq also need to add it to the driver in order to write the tests?

I use the third-party driver to perform the "basics" at a high level
-- connections, queries during cluster setup, things that don't
involve ABI changes. For new ABI I use ctypes, or as other people have
mentioned CFFI would work.
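
For illustration, here's roughly what the ctypes side looks like -- a
minimal sketch, not the actual suite code. It assumes libpq can be found
by the loader; a genuinely new entry point would get its restype/argtypes
declared the same way:

    import ctypes
    import ctypes.util

    libpq = ctypes.CDLL(ctypes.util.find_library("pq") or "libpq.so.5")

    libpq.PQlibVersion.restype = ctypes.c_int
    libpq.PQconnectdb.restype = ctypes.c_void_p
    libpq.PQconnectdb.argtypes = [ctypes.c_char_p]
    libpq.PQstatus.restype = ctypes.c_int
    libpq.PQstatus.argtypes = [ctypes.c_void_p]
    libpq.PQfinish.argtypes = [ctypes.c_void_p]

    def test_unreachable_host_reports_bad_connection():
        assert libpq.PQlibVersion() > 0
        conn = libpq.PQconnectdb(b"host=127.0.0.1 port=1 connect_timeout=1")
        try:
            CONNECTION_BAD = 1  # ConnStatusType value from libpq-fe.h
            assert libpq.PQstatus(conn) == CONNECTION_BAD
        finally:
            libpq.PQfinish(conn)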

--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-12 Thread Jacob Champion
On Wed, Jun 12, 2024 at 8:50 AM Andres Freund  wrote:
> I think I might have formulated my paragraph above badly - I didn't mean that
> we should move away from perl tests tomorrow, but that we need a path forward
> that allows folks to write tests without perl.

Okay, agreed.

> > - Tests aren't cheap, but in my experience, the maintenance-cost math
> > for tests is a lot different than the math for implementations.
>
> At the moment they tend to be *more* expensive often, due to spurious
> failures. That's mostly not perl's fault, don't get me wrong, but us not
> having better infrastructure for testing complicated behaviour and/or testing
> things on a more narrow basis.

Well, okay, but I'm not sure how to respond to this in the frame of
this discussion. Bad tests will continue to exist. I am trying to add
a tool that, in my view, has made it easier for me to test complicated
behavior than what we currently have. I can't prove that it will solve
other issues too.

> > Well, scopes are pretty front and center when you start building
> > pytest fixtures, and the complicated longer setups will hopefully
> > converge correctly early on and be reused everywhere else. I imagine
> > no one wants to build cluster setup from scratch.
>
> One (the?) prime source of state in our tap tests is the
> database. Realistically we can't just tear that one down and reset it between
> tests without causing the test times to explode. So we'll have to live with
> some persistent state.

Yes? If I've given the impression that I disagree, sorry; I agree.

> > On a slight tangent, is this not a problem today?
>
> It is, but that doesn't mean making it even bigger is unproblematic :)

Given all that's been said, I don't understand why you think the
problem would get bigger. We would cache expensive state that we need,
including the cluster, and pytest lets us do that, and my test suite
does that. I've never written a suite that spun up a separate cluster
for every single test and then immediately threw it away.

(If you want to _enable_ that behavior, to test in extreme isolation,
then pytest lets you do that too. But it's not something we should do
by default.)

> > We do a lot more acceptance testing than internal testing, which came
> > up as a major complaint from me and others during the unconference.
> > One of the reasons people avoid writing internal tests in Perl is
> > because it's very painful to find a rhythm with Test::More.
>
> What definition of internal tests are you using here?

There's a spectrum from unit-testing unexported functions all the way
to end-to-end acceptance, and personally I believe that anything
finer-grained than end-to-end acceptance is unnecessarily painful. My
OAuth suite sits somewhere in the middle, where it mocks the protocol
layer and can test the client and server as independent pieces. Super
useful for OAuth, which is asymmetrical.
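
To make "mock the protocol layer" less abstract, here's a trimmed-down
sketch of the style (not code from my suite): the test plays the part of
the server socket and inspects the first packet that a stock psql sends.
It assumes psql is on PATH, an SSL-enabled client build, and no ambient
Kerberos credentials (which would lead with a GSSENCRequest instead):

    import socket
    import struct
    import subprocess
    import threading

    def test_client_leads_with_sslrequest():
        listener = socket.create_server(("127.0.0.1", 0))
        port = listener.getsockname()[1]
        captured = {}

        def accept_one():
            conn, _ = listener.accept()
            captured["first"] = conn.recv(8)  # SSLRequest is exactly 8 bytes
            conn.sendall(b"N")                # refuse SSL; client falls back or bails
            conn.close()

        server = threading.Thread(target=accept_one)
        server.start()
        subprocess.run(["psql", "-h", "127.0.0.1", "-p", str(port), "-c", "select 1"],
                       capture_output=True, timeout=15)
        server.join(timeout=15)
        listener.close()

        length, code = struct.unpack("!II", captured["first"])
        assert length == 8
        assert code == 80877103               # the SSLRequest magic number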

I'd like to additionally see better support for unit tests of backend
internals, but I don't know those seams as well as all of you do and I
should not be driving that. I don't think Python will necessarily help
you with it. But it sure helped me break apart the client and the
server while enjoying the testing process, and other people want to do
that too, so that's what I'm pushing for.

> I think a lot of our tests are complicated, fragile and slow because we almost
> exclusively do end-to-end tests, because with a few exceptions we don't have a
> way to exercise code in a more granular way.

Yep.

> That's probably not going to fly. It introduces painful circular dependencies
> between building postgres (for libpq), building psycopg (requiring libpq) and
> testing postgres (requiring psycopg).

I am trying very hard not to drag that, which I understand is
controversial and is in no way a linchpin of my proposal, into the
discussion of whether or not we should try supporting pytest.

I get it; I understand that the circular dependency is weird; there
are alternatives if it's unacceptable; none of that has anything to do
with Python+pytest.

> One thing worth thinking about is that such dependencies have to work on a
> relatively large number of platforms / architectures. A lot of projects
> don't...

Agreed.

Thanks,
--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-12 Thread Jacob Champion
On Wed, Jun 12, 2024 at 4:48 AM Alexander Korotkov  wrote:
> Generally, testgres was initially designed as Python analogue of what
> we have in src/test/perl/PostgreSQL/Test.  In particular its
> testgres.PostgresNode is analogue of PostgreSQL::Test::Cluster.  It
> comes under PostgreSQL License.  So, I wonder if we could revise it
> and fetch most part of it into our source tree.

Okay. If there's wide interest in a port of PostgreSQL::Test::Cluster,
that might be something to take a look at. (Since I'm focused on
testing things that the current Perl suite can't do at all, I would
probably not be the first to volunteer.)

--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-12 Thread Jacob Champion
On Wed, Jun 12, 2024 at 4:40 AM Jelte Fennema-Nio  wrote:
> I think C#, Java, Go, Rust, Kotlin, and Swift would be acceptable
> choices for me (and possibly some more). They allow some type of
> introspection, they have a garbage collector, and their general
> tooling is quite good.
>
> But I think a dynamically typed scripting language is much more
> fitting for writing tests like this. I love static typing for
> production code, but imho it really doesn't have much benefit for
> tests.

+1. I write mostly protocol mocks and glue code in my authn testing,
to try to set up the system into some initial state and then break it.
Of the languages mentioned here, I've only used C#, Java, and Go. If I
had to reimplement my tests, I'd probably reach for Go out of all of
those, but the glue would still be more painful than it probably needs
to be.

> As scripting languages go, the ones that are still fairly heavily in
> use are Javascript, Python, Ruby, and PHP. I think all of those could
> probably work, but my personal order of preference would be Python,
> Ruby, Javascript, PHP.

- Python is the easiest language I've personally used to glue things
together, bar none.
- I like Ruby as a language but have no experience using it for
testing. (RSpec did come up during the unconference session and
subsequent hallway conversations.)
- Javascript is a completely different mental model from what we're
used to, IMO. I think we're likely to spend a lot of time fighting the
engine unless everyone is very clear on how it works.
- I don't see a use case for PHP here.

> TO CLARIFY: This thread is not a proposal to replace Perl with Python.
> It's a proposal to allow people to also write tests in Python.

+1. It doesn't need to replace anything. It just needs to help us do
more things than we're currently doing.

--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-12 Thread Jacob Champion
On Tue, Jun 11, 2024 at 4:48 PM Noah Misch  wrote:
> If we're going to test in a non-Perl language, I'd pick C over Python.

We already test in C, though? If the complaint is that those tests are
driven by Perl, I agree that something like libcheck or GTest or
whatever people are using nowadays would be nicer. But that can be
added at any point; the activation energy for a new C-based test
runner seems pretty small. IMO, there's no reason to pick it _over_
another language, when we already support C tests and agree that
developers need to be fluent in C.

> We'd need a library like today's Perl
> PostgreSQL::Test to make C-language tests nice, but the same would apply to
> any new language.

I think the effort required in rebuilding end-to-end tests in C is
going to be a lot different than in pretty much any other modern
high-level language, so I don't agree that "the same would apply".

For the five problem statements I put forward, I think C moves the
needle for zero of them. It neither solves the problems we have nor
gives us stronger tools to solve them ourselves. And for my personally
motivating use case of OAuth, where I need to manipulate HTTP and JSON
and TLS and so on and so forth, implementing all of that in C would be
an absolute nightmare. Given that choice, I would rather use Perl --
and that's saying something, because I like C a lot more than I like
Perl -- because it's the difference between being given a rusty but
still functional table saw, and being given a box of Legos to build a
"better" table saw, when all I want to do is cut a 2x4 in half and
move on with my work.

I will use the rusty saw if I have to. But I want to get a better saw
-- that somebody else with experience in making saws has constructed,
and other people are pretty happy with -- as opposed to building my
own.

> I also want the initial scope to be the new language coexisting with the
> existing Perl tests.

Strongly agreed.

Thanks,
--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-11 Thread Jacob Champion
On Mon, Jun 10, 2024 at 12:26 PM Alexander Korotkov
 wrote:
> Thank you for working on this.
> Do you think you could re-use something from testgres[1] package?

Possibly? I think we're all coming at this with our own bags of tricks
and will need to carve off pieces to port, contribute, or reimplement.
Does testgres have something in particular you'd like to see the
Postgres tests support?

Thanks,
--Jacob




Re: RFC: adding pytest as a supported test framework

2024-06-11 Thread Jacob Champion
On Mon, Jun 10, 2024 at 1:04 PM Andres Freund  wrote:
> Just for context for the rest the email: I think we desperately need to move
> off perl for tests. The infrastructure around our testing is basically
> unmaintained and just about nobody that started doing dev stuff in the last 10
> years learned perl.

Okay. Personally, I'm going to try to stay out of discussions around
subtracting Perl and focus on adding Python, for a bunch of different
reasons:

- Tests aren't cheap, but in my experience, the maintenance-cost math
for tests is a lot different than the math for implementations.
- I don't personally care for Perl, but having tests in any form is
usually better than not having them.
- Trying to convince people to get rid of X while adding Y is a good
way to make sure Y never happens.

> On 2024-06-10 11:46:00 -0700, Jacob Champion wrote:
> > 4. It'd be great to split apart client-side tests from server-side
> > tests. Driving Postgres via psql all the time is fine for acceptance
> > testing, but it becomes a big problem when you need to test how
> > clients talk to servers with incompatible feature sets, or how a peer
> > behaves when talking to something buggy.
>
> That seems orthogonal to using pytest vs something else?

Yes, I think that's fair. It's going to be hard not to talk about
"things that pytest+Python don't give us directly but are much easier
to build" in all of this (and I tried to call that out in the next
section, maybe belatedly). I think I'm going to have to convince both
a group of people who want to ask "why pytest in particular?" and a
group of people who ask "why isn't what we have good enough?"

> > == Why pytest? ==
> >
> > From the small and biased sample at the unconference session, it looks
> > like a number of people have independently settled on pytest in their
> > own projects. In my opinion, pytest occupies a nice space where it
> > solves some of the above problems for us, and it gives us plenty of
> > tools to solve the other problems without too much pain.
>
> We might be able to alleviate that by simply abstracting it away, but I found
> pytest's testrunner pretty painful. Oodles of options that are not very well
> documented and that often don't work because they are very specific to some
> situations, without that being explained.

Hm. There are a bunch of them, but I've never needed to go through the
oodles of options. Anything in particular that caused problems?

> > Problem 1 (rerun failing tests): One architectural roadblock to this
> > in our Test::More suite is that tests depend on setup that's done by
> > previous tests. pytest allows you to declare each test's setup
> > requirements via pytest fixtures, letting the test runner build up the
> > world exactly as it needs to be for a single isolated test. These
> > fixtures may be given a "scope" so that multiple tests may share the
> > same setup for performance or other reasons.
>
> OTOH, that's quite likely to increase overall test times very
> significantly. Yes, sometimes that can be avoided with careful use of various
> features, but often that's hard, and IME is rarely done rigorously.

Well, scopes are pretty front and center when you start building
pytest fixtures, and the complicated longer setups will hopefully
converge correctly early on and be reused everywhere else. I imagine
no one wants to build cluster setup from scratch.

On a slight tangent, is this not a problem today? I mean... part of my
personal long-term goal is in increasing test hygiene, which is going
to take some shifts in practice. As long as review keeps the quality
of the tests fairly high, I see the inevitable "our tests take too
long" problem as a good one. That's true no matter what framework we
use, unless the framework is so bad that no one uses it and the
runtime is trivial. If we're worried that people will immediately
start exploding the runtime and no one will notice during review,
maybe we can have some infrastructure flag how much a patch increased
it?

> > Problem 2 (seeing what failed): pytest does this via assertion
> > introspection and very detailed failure reporting. If you haven't seen
> > this before, take a look at the pytest homepage [1]; there's an
> > example of a full log.
>
> That's not really different than what the perl tap test stuff allows. We
> indeed are bad at utilizing it, but I'm not sure that switching languages will
> change that.

Jelte already touched on this, but I wanted to hammer on the point: If
no one, not even the developers who chose and like Perl, is using
Test::More in a way that's maintainable, I would prefer to 

RFC: adding pytest as a supported test framework

2024-06-10 Thread Jacob Champion
Hi all,

For the v18 cycle, I would like to try to get pytest [1] in as a
supported test driver, in addition to the current offerings.

(I'm tempted to end the email there.)

We had an unconference session at PGConf.dev [2] around this topic.
There seemed to be a number of nodding heads and some growing
momentum. So here's a thread to try to build wider consensus. If you
have a competing or complementary test proposal in mind, heads up!

== Problem Statement(s) ==

1. We'd like to rerun a failing test by itself.

2. It'd be helpful to see _what_ failed without poring over logs.

These two got the most nodding heads of the points I presented. (#1
received tongue-in-cheek applause.) I think most modern test
frameworks are going to give you these things, but unfortunately we
don't have them.

Additionally,

3. Many would like to use modern developer tooling during testing
(language servers! autocomplete! debuggers! type checking!) and we
can't right now.

4. It'd be great to split apart client-side tests from server-side
tests. Driving Postgres via psql all the time is fine for acceptance
testing, but it becomes a big problem when you need to test how
clients talk to servers with incompatible feature sets, or how a peer
behaves when talking to something buggy.

5. Personally, I want to implement security features test-first (for
high code coverage and other benefits), and our Perl-based tests are
usually too cumbersome for that.

== Why pytest? ==

From the small and biased sample at the unconference session, it looks
like a number of people have independently settled on pytest in their
own projects. In my opinion, pytest occupies a nice space where it
solves some of the above problems for us, and it gives us plenty of
tools to solve the other problems without too much pain.

Problem 1 (rerun failing tests): One architectural roadblock to this
in our Test::More suite is that tests depend on setup that's done by
previous tests. pytest allows you to declare each test's setup
requirements via pytest fixtures, letting the test runner build up the
world exactly as it needs to be for a single isolated test. These
fixtures may be given a "scope" so that multiple tests may share the
same setup for performance or other reasons.

Problem 2 (seeing what failed): pytest does this via assertion
introspection and very detailed failure reporting. If you haven't seen
this before, take a look at the pytest homepage [1]; there's an
example of a full log.
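
As a tiny illustration (not from the suite): a bare assert is all you
write, and on failure pytest prints the values of the sub-expressions --
here, both dicts plus the key that differs -- instead of just "got false":

    def test_parsed_settings():
        parsed = {"ssl": "on", "port": 5433}
        assert parsed == {"ssl": "on", "port": 5432}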

Problem 3 (modern tooling): We get this from Python's very active
developer base.

Problems 4 (splitting client and server tests) and 5 (making it easier
to write tests first) aren't really Python- or pytest-specific, but I
have done both quite happily in my OAuth work [3], and I've since
adapted that suite multiple times to develop and test other proposals
on this list, like LDAP/SCRAM, client encryption, direct SSL, and
compression.

Python's standard library has lots of power by itself, with very good
documentation. And virtualenvs and better package tooling have made it
much easier, IMO, to avoid the XKCD dependency tangle [4] of the
2010s. When it comes to third-party packages, which I think we're
probably going to want in moderation, we would still need to discuss
supply chain safety. Python is not as mature here as, say, Go.

== A Plan ==

Even if everyone were on board immediately, there's a lot of work to
do. I'd like to add pytest in a more probationary status, so we can
iron out the inevitable wrinkles. My proposal would be:

1. Commit bare-bones support in our Meson setup for running pytest, so
everyone can kick the tires independently.
2. Add a test for something that we can't currently exercise.
3. Port a test from a place where the maintenance is terrible, to see
if we can improve it.

If we hate it by that point, no harm done; tear it back out. Otherwise
we keep rolling forward.

Thoughts? Suggestions?

Thanks,
--Jacob

[1] https://docs.pytest.org/
[2] 
https://wiki.postgresql.org/wiki/PGConf.dev_2024_Developer_Unconference#New_testing_frameworks
[3] https://github.com/jchampio/pg-pytest-suite
[4] https://xkcd.com/1987/




Re: Re: Add support to TLS 1.3 cipher suites and curves lists

2024-06-07 Thread Jacob Champion
On Fri, Jun 7, 2024 at 3:02 AM Erica Zhang  wrote:
>
> For some security consideration, we prefer to use TLS1.3 cipher suites in our 
> product with some customization values instead of default value 
> "HIGH:MEDIUM:+3DES:!aNULL". Moreover we prefer to set a group of ecdh keys 
> instead of a single value.

+1 for the curve list feature, at least. No opinions on the 1.3
ciphersuites half, yet.

I've added this patch to my planned review for the v18 cycle. Some
initial notes:

- Could you separate the two features into two patches? That would
make it easier for reviewers. (They can still share the same thread
and CF entry.)
- The "curve" APIs have been renamed "group" in newer OpenSSLs for a
while now, and we should probably use those if possible.
- I think parsing apart the groups list to check NIDs manually could
lead to false negatives. From a docs skim, 3.0 allows providers to add
their own group names, and 3.3 now supports question marks in the
string to allow graceful fallbacks.
- I originally thought it'd be better to just stop calling
SSL_set_tmp_ecdh() entirely by default, so we could use OpenSSL's
builtin list of groups. But that may have denial-of-service concerns
[1]?
- We should maybe look into SSL_CTX_config(), if we haven't discussed
that already on the list, but that's probably a bigger tangent and
doesn't need to be part of this patch.
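
To illustrate the group-API point above, a minimal sketch -- not the
patch itself, and with illustrative variable names and error handling
borrowed from the existing ECDH setup code:

    /*
     * Configure a list of groups via the newer "group" API (OpenSSL
     * 1.1.1+) instead of a single curve via SSL_CTX_set_tmp_ecdh().
     * With OpenSSL 3.3, a leading '?' can mark an entry as optional so
     * one unrecognized name doesn't sink the whole list.
     */
    if (SSL_CTX_set1_groups_list(context, "X25519:prime256v1:secp384r1") != 1)
        ereport(isServerStart ? FATAL : LOG,
                (errcode(ERRCODE_CONFIG_FILE_ERROR),
                 errmsg("could not set the list of TLS groups")));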

Thanks,
--Jacob

[1] 
https://www.openssl.org/blog/blog/2022/10/21/tls-groups-configuration/index.html




Re: pg_parse_json() should not leak token copies on failure

2024-06-04 Thread Jacob Champion
On Mon, Jun 3, 2024 at 9:32 PM Kyotaro Horiguchi
 wrote:
>
> Hi,

Thanks for the review!

> I understand that the part enclosed in parentheses refers to the "if
> (ptr) pfree(ptr)" structure. I believe we rely on the behavior of
> free(NULL), which safely does nothing. (I couldn't find the related
> discussion due to a timeout error on the ML search page.)

For the frontend, right. For the backend, pfree(NULL) causes a crash.
I think [1] is a related discussion on the list, maybe the one you
were looking for?
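
(For anyone following along, the guard in question looks roughly like
this -- it matters because the same source is compiled for both
frontend and backend:)

    /* sketch of the conditional free under discussion */
    if (scalar_val != NULL)
        pfree(scalar_val);  /* pfree(NULL) crashes in the backend,
                             * whereas frontend free(NULL) is a no-op */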

> Although I don't fully understand the entire parser code, it seems
> that the ownership transfer only occurs in JSON_TOKEN_STRING cases.
> That is, the memory pointed to by scalar_val would become dangling in
> JSON_TOKEN_NUMBER cases.

It should happen for both cases, I think. I see that the
JSON_SEM_SCALAR_CALL case passes scalar_val into the callback
regardless of whether we're parsing a string or a number, and
parse_scalar() does the same thing over in the recursive
implementation. Have I missed a code path?

> Even if this is not the case, the ownership
> transition appears quite challenging to follow.

I agree!

> It might be safer or clearer to pstrdup the token in jsonb_in_scalar()
> and avoid NULLifying scalar_val after calling callbacks, or to let
> jsonb_in_scalar() NULLify the pointer.

Maybe. jsonb_in_scalar isn't the only scalar callback implementation,
though. It might be a fairly cruel change for dependent extensions,
since there will be no compiler complaint.

If a compatibility break is acceptable, maybe we should make the API
zero-copy for the recursive case, and just pass the length of the
parsed token? Either way, we'd have to change the object field
callbacks, too, but maybe an API change makes even more sense there,
given how convoluted it is right now.
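
For concreteness, a rough sketch of what I mean (the first typedef is
roughly today's shape in jsonapi.h; the second is a hypothetical
zero-copy variant, named differently here just so the two can sit side
by side):

    /* today: the lexer hands the callback its own allocated copy */
    typedef JsonParseErrorType (*json_scalar_action) (void *state, char *token,
                                                      JsonTokenType tokentype);

    /* hypothetical zero-copy variant: point into the lexer's buffer and
     * pass the token length; callbacks copy only if they keep the value */
    typedef JsonParseErrorType (*json_scalar_action_zc) (void *state,
                                                         const char *token,
                                                         int token_len,
                                                         JsonTokenType tokentype);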

Thanks,
--Jacob

[1] 
https://www.postgresql.org/message-id/flat/cf26e970-8e92-59f1-247a-aa265235075b%40enterprisedb.com




Document use of ldapurl with LDAP simple bind

2024-05-24 Thread Jacob Champion
Hi all,

Our documentation implies that the ldapurl setting in pg_hba is used
for search+bind mode only. It was pointed out to me recently that this
is not true, and if you're dealing with simple bind on a non-standard
scheme or port, then ldapurl makes the HBA easier to read:

... ldap ldapurl="ldaps://ldap.example.net:49151" ldapprefix="cn="
ldapsuffix=", dc=example, dc=net"

0001 tries to document this helpful behavior a little better, and 0002
pins it with a test. WDYT?

Thanks,
--Jacob


0001-docs-explain-how-to-use-ldapurl-with-simple-bind.patch
Description: Binary data


0002-ldap-test-ldapurl-with-simple-bind.patch
Description: Binary data


Re: pg_parse_json() should not leak token copies on failure

2024-05-24 Thread Jacob Champion
On Tue, Apr 30, 2024 at 2:29 PM Jacob Champion
 wrote:
> Attached is a draft patch to illustrate what I mean, but it's
> incomplete: it only solves the problem for scalar values.

(Attached is a v2 of that patch; in solving a frontend leak I should
probably not introduce a backend segfault.)

--Jacob


v2-0001-WIP-fix-leak-of-scalar-value-on-lex-failure.patch
Description: Binary data


Re: Add new protocol message to change GUCs for usage with future protocol-only GUCs

2024-05-23 Thread Jacob Champion
On Thu, May 23, 2024 at 11:12 AM Tom Lane  wrote:
> If a reader doesn't recognize a particular
> chunk code, it can still tell whether the chunk is "critical" or not,
> and thereby decide if it must give up or can proceed while ignoring
> that chunk.)

Would it be good to expand on that idea of criticality? IIRC one of
Jelte's complaints earlier was that middleware has to know all the
extension types anyway, to be able to figure out whether it has to do
something about them or not. HTTP has the concept of hop-by-hop vs
end-to-end headers for related reasons.
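
(Purely hypothetical, but those two axes could even live in the same
place -- a couple of flag bits alongside the chunk code, in the spirit
of PNG's ancillary/critical convention:)

    /* hypothetical chunk framing, for discussion only */
    typedef struct
    {
        uint32  length;     /* payload length */
        uint16  code;       /* registered chunk type */
        uint16  flags;      /* bit 0: critical -- abort if unrecognized
                             * bit 1: hop-by-hop -- poolers may consume
                             *        or rewrite without forwarding */
    } ProtocolChunkHeader;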

--Jacob




Re: libpq compression (part 3)

2024-05-21 Thread Jacob Champion
On Tue, May 21, 2024 at 12:08 PM Jacob Burroughs
 wrote:
> I think I thought I was writing about something else when I wrote that
> :sigh:.  I think what I really should have written was a version of
> the part below, which is that we use streaming decompression, only
> decompress 8kb at a time, and for pre-auth messages limit them to
> `PG_MAX_AUTH_TOKEN_LENGTH` (65535 bytes), which isn't really enough
> data to actually cause any real-world pain by needing to decompress vs
> the equivalent pain of sending invalid uncompressed auth packets.

Okay. So it sounds like your position is similar to Robert's from
earlier: prefer allowing unauthenticated compressed packets for
simplicity, as long as we think it's safe for the server. (Personally
I still think a client that compresses its password packets is doing
it wrong, and we could help them out by refusing that.)

> We own both the canonical client and server, so those are both covered
> here.  I would think it would be the responsibility of any other
> system that maintains its own implementation of the postgres protocol
> and chooses to support the compression protocol to perform its own
> mitigations against potential compression security issues.

Sure, but if our official documentation is "here's an extremely
security-sensitive feature, figure it out!" then we've done a
disservice to the community.

> Should we
> put the fixed message size limits (that have de facto been part of the
> protocol since 2021, even if they weren't documented as such) into the
> protocol documentation?

Possibly? I don't know if the other PG-compatible implementations use
the same limits. It might be better to say "limits must exist".

> ( I don't really see how one could implement other tooling that used
> pg compression without using streaming compression, as the protocol
> never hands over a standalone blob of compressed data: all compressed
> data is always part of a stream, but even with streaming decompression
> you still need some kind of limits or you will just chew up memory.)

Well, that's a good point; I wasn't thinking about the streaming APIs
themselves. If the easiest way to implement decompression requires the
use of an API that shouts "hey, give me guardrails!", then that helps
quite a bit. I really need to look into the attack surface of the
three algorithms.
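
(For example -- nothing to do with the patch, just to show the shape of
such an API -- Python's zlib module makes the output cap an explicit
argument to streaming decompression:)

    import zlib

    MAX_MESSAGE = 64 * 1024   # illustrative cap, not a proposal

    d = zlib.decompressobj()
    # compressed_chunk stands in for a chunk of bytes off the wire
    out = d.decompress(compressed_chunk, MAX_MESSAGE)
    if d.unconsumed_tail:
        # more than MAX_MESSAGE bytes would be produced: treat as a bomb
        raise ValueError("decompressed message exceeds limit")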

--Jacob




Re: libpq compression (part 3)

2024-05-21 Thread Jacob Champion
On Tue, May 21, 2024 at 9:14 AM Jacob Burroughs
 wrote:
> On Tue, May 21, 2024 at 10:43 AM Jelte Fennema-Nio  wrote:
> > To help get everyone on the same page I wanted to list all the
> > security concerns in one place:
> >
> > 1. Triggering excessive CPU usage before authentication, by asking for
> > very high compression levels
> > 2. Triggering excessive memory/CPU usage before authentication, by
> > sending a client sending a zipbomb
> > 3. Triggering excessive CPU after authentication, by asking for a very
> > high compression level
> > 4. Triggering excessive memory/CPU after authentication due to
> > zipbombs (i.e. small amount of data extracting to lots of data)
> > 5. CRIME style leakage of information about encrypted data
> >
> > 1 & 2 can easily be solved by not allowing any authentication packets
> > to be compressed. This also has benefits for 5.
>
> This is already addressed by only compressing certain message types.
> If we think it is important that the server reject compressed packets
> of other types I can add that, but it seemed reasonable to just make
> the client never send such packets compressed.

If the server doesn't reject compressed packets pre-authentication,
then case 2 isn't mitigated. (I haven't proven how risky that case is
yet, to be clear.) In other words: if the threat model is that a
client can attack us, we shouldn't assume that it will attack us
politely.

> > 4 would require some safety limits on the amount of data that a
> > (small) compressed message can be decompressed to, and stop
> > decompression of that message once that limit is hit. What that limit
> > should be seems hard to choose though. A few ideas:
> > a. The size of the message reported by the uncompressed header. This
> > would mean that at most the 4GB will be uncompressed, since maximum
> > message length is 4GB (limited by 32bit message length field)
> > b. Allow servers to specify maximum client decompressed message length
> > lower than this 4GB, e.g. messages of more than 100MB of uncompressed
> > size should not be allowed.
>
> Because we are using streaming decompression, this is much less of an
> issue than for things that decompress wholesale onto disk/into memory.

(I agree in general, but since you're designing a protocol extension,
IMO it's not enough that your implementation happens to mitigate
risks. We more or less have to bake those mitigations into the
specification of the extension, because things that aren't servers
have to decompress now. Similar to RFC security considerations.)

--Jacob




Re: libpq compression (part 3)

2024-05-21 Thread Jacob Champion
On Tue, May 21, 2024 at 8:23 AM Jacob Burroughs
 wrote:
> As currently implemented, the compression only applies to
> CopyData/DataRow/Query messages, none of which should be involved in
> authentication, unless I've really missed something in my
> understanding.

Right, but Robert has argued that we should compress it all, and I'm
responding to that proposal.

Sorry for introducing threads within threads. But I think it's
valuable to pin down both 1) the desired behavior, and 2) how the
current proposal behaves, as two separate things. I'll try to do a
better job of communicating which I'm talking about.

> > Right, I think it's reasonable to let a sufficiently
> > determined/informed user lift the guardrails, but first we have to
> > choose to put guardrails in place... and then we have to somehow
> > sufficiently inform the users when it's okay to lift them.
>
> My thought would be that compression should be opt-in on the client
> side, with documentation around the potential security pitfalls. (I
> could be convinced it should be opt-in on the server side, but overall
> I think opt-in on the client side generally protects against footguns
> without excessively getting in the way

We absolutely have to document the risks and allow clients to be
written safely. But I think server-side controls on risky behavior
have proven to be generally more valuable, because the server
administrator is often in a better spot to see the overall risks to
the system. ("No, you will not use deprecated ciphersuites. No, you
will not access this URL over plaintext. No, I will not compress this
response containing customer credit card numbers, no matter how nicely
you ask.") There are many more clients than servers, so it's less
risky for the server to enforce safety than to hope that every client
is safe.

Does your database and access pattern regularly mingle secrets with
public data? Would auditing correct client use of compression be a
logistical nightmare? Do your app developers keep indicating in
conversations that they don't understand the risks at all? Cool, just
set `encrypted_compression = nope_nope_nope` on the server and sleep
soundly at night. (Ideally we would default to that.)

> and if an attacker controls the
> client, they can just get the information they want directly-they
> don't need compression sidechannels to get that information.)

Sure, but I don't think that's relevant to the threats being discussed.

> Within SQL-level things, I don't think we can reasonably differentiate
> between private and attacker-controlled information at the
> libpq/server level.

And by the IETF line of argument -- or at least the argument I quoted
above -- that implies that we really have no business introducing
compression when confidentiality is requested. A stronger approach
would require us to prove, or the user to indicate, safety before
compressing.

Take a look at the security notes for QPACK [1] -- keeping in mind
that they know _more_ about what's going on at the protocol level than
we do, due to the header design. And they still say things like "an
encoder might choose not to index values with low entropy" and "these
criteria ... will evolve over time as new attacks are discovered." A
huge amount is left as an exercise for the reader. This stuff is
really hard.

> We can reasonably differentiate between message
> types that *definitely* are private and ones that could have
> either/both data in them, but that's not nearly as useful.  I think
> not compressing auth-related packets plus giving a mechanism to reset
> the compression stream for clients (plus guidance on the tradeoffs
> involved in turning on compression) is about as good as we can get.

The concept of stream reset seems necessary but insufficient at the
application level, which bleeds over into Jelte's compression_restart
proposal. (At the protocol level, I think it may be sufficient?)

If I write a query where one of the WHERE clauses is
attacker-controlled and the other is a secret, I would really like to
not compress that query on the client side. If I join a table of user
IDs against a table of user-provided addresses and a table of
application tokens for that user, compressing even a single row leaks
information about those tokens -- at a _very_ granular level -- and I
would really like the server not to do that.

So if I'm building sand castles... I think maybe it'd be nice to mark
tables (and/or individual columns?) as safe for compression under
encryption, whether by row or in aggregate. And maybe libpq and psql
should be able to turn outgoing compression on and off at will.

And I understand those would balloon the scope of the feature. I'm
worried I'm doing the security-person thing and sucking all the air
out of the room. I know not everybody uses transport encryption; for
those people, compress-it-all is probably a pretty winning strategy,
and there's no need to reset the compression context ever. And the
pg_dump-style, "give me everythi

Re: libpq compression (part 3)

2024-05-20 Thread Jacob Champion
On Mon, May 20, 2024 at 11:37 AM Robert Haas  wrote:
> But if that's a practical
> attack, preventing compression prior to the authentication exchange
> probably isn't good enough

I mean... you said it, not me. I'm trying not to rain on the parade
too much, because compression is clearly very valuable. But it makes
me really uncomfortable that we're reintroducing the compression
oracle (especially over the authentication exchange, which is
generally more secret than the rest of the traffic).

> But, does this mean that we should just refuse to offer compression as
> a feature? This kind of attack isn't a threat in every environment,
> and in some environments, compression could be pretty useful. For
> instance, you might need to pull down a lot of data from the database
> over a slow connection. Perhaps you're the only user of the database,
> and you wrote all of the queries yourself in a locked vault, accepting
> no untrusted inputs. In that case, these kinds of attacks aren't
> possible, or at least I don't see how, but you might want both
> compression and encryption.

Right, I think it's reasonable to let a sufficiently
determined/informed user lift the guardrails, but first we have to
choose to put guardrails in place... and then we have to somehow
sufficiently inform the users when it's okay to lift them.

> I guess I don't understand why TLS removed
> support for encryption entirely instead of disclaiming its use in some
> appropriate way.

One of the IETF conversations was at [1] (there were dissenters on the
list, as you might expect). My favorite summary is this one from
Alyssa Rowan:

> Compression is usually best performed as "high" as possible; transport layer 
> is blind to what's being compressed, which (as we now know) was definitely 
> too low and was in retrospect a mistake.
>
> Any application layer protocol needs to know - if compression is supported - 
> to separate compression contexts for attacker-chosen plaintext and 
> attacker-sought unknown secrets. (As others have stated, HTTPbis covers this.)

But for SQL, where's the dividing line between attacker-chosen and
attacker-sought? To me, it seems like only the user knows; the server
has no clue. I think that puts us "lower" in Alyssa's model than HTTP
is.

As Andrey points out, there was prior work done that started to take
this into account. I haven't reviewed it to see how good it is -- and
I think there are probably many use cases in which queries and tables
contain both private and attacker-controlled information -- but if we
agree that they have to be separated, then the strategy can at least
be improved upon.

--Jacob

[1] https://mailarchive.ietf.org/arch/msg/tls/xhMLf8j4pq8W_ZGXUUU1G_m6r1c/




Re: libpq compression (part 3)

2024-05-20 Thread Jacob Champion
On Mon, May 20, 2024 at 11:05 AM Andrey M. Borodin  wrote:
> > So, the data would be compressed first, with framing around that, and
> > then transport encryption would happen afterwards. I don't see how
> > that would leak your password, but I have a feeling that might be a
> > sign that I'm about to learn some unpleasant truths.
>
> Compression defeats encryption. That's why it's not in TLS anymore.
> The thing is, compression codecs exploit self-correlation in the data. And if 
> you mix secret data with user-controlled data, the user might guess how 
> correlated they are.

I'm slow on the draw, but I hacked up a sample client to generate
traffic against the compression-enabled server, to try to illustrate.

If my client sends an LDAP password of "hello", followed by the query
`SELECT 'world'`, as part of the same gzip stream, I get two encrypted
packets on the wire: lengths 42 and 49 bytes. If the client instead
sends the query `SELECT 'hello'`, I get lengths 42 and 46. We lost
three bytes, and there's only been one packet on the stream before the
query; if the observer controlled the query, it's pretty obvious that
the self-similarity has to have come from the PasswordMessage. Rinse
and repeat.
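
(If anyone wants to reproduce the effect without standing up a server,
Python's zlib shows the same thing on a single shared stream; the exact
byte counts will differ from the numbers above:)

    import zlib

    def stream_size(*messages):
        c = zlib.compressobj()
        return len(b"".join(c.compress(m) for m in messages) + c.flush())

    print(stream_size(b"hello", b"SELECT 'world'"))  # secret unrelated to query
    print(stream_size(b"hello", b"SELECT 'hello'"))  # query echoes the secret:
                                                     # typically a few bytes less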

That doesn't cover the case where the password itself is low-entropy,
either. "hellohellohellohello" at least has length, but once you
compress it that collapses. So an attacker can passively monitor for
shorter password packets and know which user to target first.

--Jacob




Re: libpq compression (part 3)

2024-05-20 Thread Jacob Champion
On Mon, May 20, 2024 at 10:01 AM Robert Haas  wrote:
> I really hope that you can't poke big enough holes to kill the feature
> entirely, though. Because that sounds sad.

Even if there are holes, I don't think the situation's going to be bad
enough to tank everything; otherwise no one would be able to use
decompression on the Internet. :D And I expect the authors of the
newer compression methods to have thought about these things [1].

I hesitate to ask as part of the same email, but what were the plans
for compression in combination with transport encryption? (Especially
if you plan to compress the authentication exchange, since mixing your
LDAP password into the compression context seems like it might be a
bad idea if you don't want to leak it.)

--Jacob

[1] https://datatracker.ietf.org/doc/html/rfc8878#name-security-considerations




Re: libpq compression (part 3)

2024-05-20 Thread Jacob Champion
On Mon, May 20, 2024 at 8:29 AM Robert Haas  wrote:
> It does occur to me that if some compression algorithm has a buffer
> overrun bug, restricting its use until after authentication might
> reduce the score of the resulting CVE, because now you have to be able
> to authenticate to make an exploit work. Perhaps that's an argument
> for imposing a restriction here, but it doesn't seem to be the
> argument that you're making.

It wasn't my argument; Jacob B said above:

> in general postgres servers are usually connected to by
> clients that the server admin has some channel to talk to (after all
> they somehow had to get access to log in to the server in the first
> place) if they are doing something wasteful, given that a client can
> do a lot worse things than enable aggressive compression by writing
> bad queries.

...and my response was that, no, the proposal doesn't seem to be
requiring that authentication take place before compression is done.
(As evidenced by your email. :D) If the claim is that there are no
security problems with letting unauthenticated clients force
decompression, then I can try to poke holes in that; or if the claim
is that we don't need to worry about that at all because we'll wait
until after authentication, then I can poke holes in that too. My
request is just that we choose one.

> But if the client says in the startup message that it would like to
> send and receive compressed data and the server is happy with that
> request, I don't see why we need to postpone implementing that request
> until after the authentication exchange is completed. I think that
> will make the code more complicated and I don't see a security
> benefit.

I haven't implemented compression bombs before to know lots of
details, but I think the general idea is to take up resources that are
vastly disproportionate to the effort expended by the client. The
systemic risk is then more or less multiplied by the number of
intermediaries that need to do the decompression. Maybe all three of
our algorithms are hardened against malicious compression techniques;
that'd be great. But if we've never had a situation where a completely
untrusted peer can hand a blob to the server and say "here, decompress
this for me", maybe we'd better check?

--Jacob



