Circling back here, anything else needed from my end on this patch? On Wed, Feb 21, 2024, 17:04 David Benjamin <david...@google.com> wrote:
> Thanks for the very thorough comments! I've attached a new version of the > patch. > > On Thu, Feb 15, 2024 at 4:17 PM Andres Freund <and...@anarazel.de> wrote: > >> Hi, >> >> On 2024-02-11 13:19:00 -0500, David Benjamin wrote: >> > I've attached a patch for the master branch to fix up the custom BIOs >> used >> > by PostgreSQL, in light of the issues with the OpenSSL update recently. >> > While c82207a548db47623a2bfa2447babdaa630302b9 (switching from >> BIO_get_data >> > to BIO_get_app_data) resolved the immediate conflict, I don't think it >> > addressed the root cause, which is that PostgreSQL was mixing up two BIO >> > implementations, each of which expected to be driving the BIO. >> >> Yea, that's certainly not nice - and I think we've been somewhat lucky it >> hasn't caused more issues. There's some nasty possibilities, e.g. >> sock_ctrl() >> partially enabling ktls without our initialization having called >> ktls_enable(). Right now that just means ktls is unusable, but it's not >> hard >> to imagine accidentally ending up sending unencrypted data. >> > > Yeah. Even if, say, the ktls bits work, given you all care enough about > I/O to have wanted to wrap the BIO, I assume you'd want to pick up those > features on your own terms, e.g. by implementing the BIO_CTRLs yourself. > > >> I've in the past looked into not using a custom BIO [1], but I have my >> doubts >> about that being a good idea. I think medium term we want to be able to do >> network IO asynchronously, which seems quite hard to do when using >> openssl's >> socket BIO. > > > > > Once we've done that, we're free to use BIO_set_data. While >> BIO_set_app_data >> > works fine, I've reverted back to BIO_set_data because it's more >> commonly used. >> > app_data depends on OpenSSL's "ex_data" mechanism, which is a tad >> heavier under >> > the hood. >> >> At first I was a bit wary of that, because it requires us to bring back >> the >> fallback implementation. But you're right, it's noticeably heavier than >> BIO_get_data(), and we do call BIO_get_app_data() fairly frequently. >> > > TBH, I doubt it makes any real difference perf-wise. But I think > BIO_get_data is a bit more common in this context. > > >> > That leaves ctrl. ctrl is a bunch of operations (it's ioctl). The only >> > operation that matters is BIO_CTRL_FLUSH, which is implemented as a >> no-op. All >> > other operations are unused. It's once again good that they're unused >> because >> > otherwise OpenSSL might mess with postgres's socket, break the >> assumptions >> > around interrupt handling, etc. >> >> How did you determine that only FLUSH is required? I didn't even really >> find >> documentation about what the intended semantics actually are. >> > > The unhelpful answer is that my day job is working on BoringSSL, so I've > spent a lot of time with this mess. :-) But, yeah, it's not well-documented > at all. OpenSSL ends up calling BIO_flush at the end of each batch of > writes in libssl. TBH, I suspect that was less intentional and more an > emergent property of them internally layering a buffer BIO at one point in > the process, but it's long been part of the (undocumented) API contract. > Conversely, I don't think OpenSSL can possibly make libssl *require* a > new BIO_CTRL because they'd break every custom BIO anyone has ever used > with the library. > > >> E.g. should we implement BIO_CTRL_EOF? Sure, it wasn't really supported so >> far, because we never set it, but is that right? > > > Ah hmm, BIO_CTRL_EOF is... a bit of a mess. OpenSSL kind of messed things > up. So, up until recently, I would have said that BIO_CTRL_EOF was not part > of the interface here. OpenSSL 1.0.x did not implement it for sockets, and > the BIO_read API *already* had a way to signal EOF: did you return zero > or -1? > > Then OpenSSL 1.1.x introduced size_t-based BIO_read_ex APIs. However, in > the process, they *forgot that EOF and error are different things* and > made it impossible to detect EOFs if you use BIO_read_ex! They never > noticed this, because they didn't actually convert their own code to their > new API. See this discussion, which alas ended with OpenSSL deciding to > ignore the problem and not even document their current interface. > https://github.com/openssl/openssl/issues/8208 > > Though they never responded, they seem to have tacitly settled using the > out-of-band BIO_eof function (which is BIO_CTRL_EOF) as the way to signal > EOF for BIO_read_ex. This is kind of fiddly, but is at least a well-defined > option. But the problem is no one's BIO_METHODs, including their own, are > read_ex-based. They all implement the old read callback. But someone might > call BIO_read_ex on a read-based BIO_METHOD. IMO, BIO_read_ex should be > responsible for translating between the two EOF conventions. For example, > it could automatically set a flag when the read callback returns 0 and then > make BIO_ctrl check the flag and automatically implement BIO_CTRL_EOF for > BIOs that don't do it themselves. Alas, OpenSSL did not do this and instead > they just made their built-in BIOs implement BIO_CTRL_EOF, even though this > new expectation is a compatibility break. That leaves BIO_read, the one > everyone uses, in a pretty ambiguous state that they seem uninterested in > addressing. > https://github.com/openssl/openssl/pull/10882 > https://github.com/openssl/openssl/pull/10907 > > Honestly, I suspect nothing in postgres actually cares about this, given > you all didn't notice things breaking around here in the early days of > 1.1.x. Still, that is a good point. I've updated the patch to implement > BIO_CTRL_EOF for completeness' sake. > > (For BoringSSL, we did not go down this route because, unlike OpenSSL > apparently, we do not think breaking backwards compatibility in the EOF > convention is OK. I do want to add the size_t APIs eventually, but I'm > thinking we'll do the automatic BIO_CTRL_EOF thing described above, to > avoid breaking compatibility with existing BIO_METHODs.) > > Beyond BIO_CTRL_EOF, I assume what you're really asking here is "how do we > know OpenSSL won't change the interface again?". And, honestly, we don't. > They promise API and ABI stability, which in theory means they shouldn't. > But their track record with custom BIOs is not great. That said, if they do > break it, I think it will unambiguously be an API break on their part and > hopefully it'll be possible to get them to fix the issue. The EOF issue > snuck in because it mostly only impacted people who tried to migrate to > their new APIs. It broke existing APIs too, but the one project that > noticed (CPython) misdiagnosed the issue and worked around it. > (Incorrectly, but no one noticed. See > https://github.com/python/cpython/pull/95495.) So, I raised the issue, > they'd long since shipped it and evidently felt no burning need to fix > their regression or unclear APIs. :-( > > But the alternative, picking up the real socket BIO's ctrl function, is > even more unsound, so I think this is the better place to be. > > >> What about >> BIO_CTRL_GET_CLOSE/BIO_CTRL_SET_CLOSE? >> > > The close flag is pretty silly. All BIOs do slightly different things with > it, which means that libssl cannot possibly call it generically. So if your > BIO doesn't have any need for it, you don't need to bother. It's usually > used to indicate whether the BIO owns the underlying fd/socket/BUF_MEM/etc, > > >> Another issue is that 0 doesn't actually seem like the universal error >> return >> - e.g. BIO_C_GET_FD seems to return -1, because 0 is a valid fd. >> > > Yeah, that is... also a mess. All of OpenSSL's ctrl functions return zero > for unrecognized commands. It seems they just don't have a convention for > distinguishing unimplemented commands from zero returns. As you note, this > is a problem for BIO_C_GET_FD. OpenSSL's other non-fd BIOs have the same > problem though: > https://github.com/openssl/openssl/blob/master/crypto/bio/bss_file.c#L340 > > Realistically, type-specific ioctls like that don't matter much because > you're really only going to call them if you already know you have an > applicable BIO. Hopefully, between that, and removing BIO_TYPE_DESCRIPTOR > (below), it's mostly okay? Also happy to add a `ret = -1` implementation > of BIO_C_GET_FD if you'd prefer. But, in the general case, I suspect we > just have to live with the zero thing, and hopefully OpenSSL will stop > introducing ctrls where zero is an unsafe return value. > > Perhaps BIO_ctrl in OpenSSL should have some logic like... > ``` > if (!(type & BIO_TYPE_DESCRIPTOR) && cmd == BIO_C_GET_FD) { > return -1; > } > ``` > Although I suspect there are people who implement BIO_C_GET_FD without > setting BIO_TYPE_DESCRIPTOR because none of this was ever documented. > > >> As of your patch the bio doesn't actually have an FD anymore, should it >> still >> set BIO_TYPE_DESCRIPTOR? >> > > Ah good call. Done. I don't think much code really cares, but it does mean > that, if you all call SSL_get_fd (why would you?), it won't get confused. > :-) > > > +static long >> > +my_sock_ctrl(BIO *h, int cmd, long num, void *ptr) >> > +{ >> > + long res = 0; >> > + >> > + switch (cmd) >> > + { >> > + case BIO_CTRL_FLUSH: >> > + /* libssl expects all BIOs to support BIO_flush. >> */ >> > + res = 1; >> > + break; >> > + } >> > + >> > + return res; >> > +} >> >> I'd move the res = 0 into a default: block. That way the compiler can >> warn if >> some case doesn't set it in all branches. >> > > Done. > > >> > static BIO_METHOD * >> > my_BIO_s_socket(void) >> > { >> >> Wonder if we should rename this. It's pretty odd that we still call it's >> not >> really related to s_socket anymore, and doesn't even really implement the >> same >> interface (e.g. get_fd doesn't work anymore). Similarly, my_SSL_set_fd() >> doesn't actually call set_fd() anymore, which sure seems odd. >> > > Done. I wasn't sure what to name them, or even what the naming convention > was, but I opted to name them after the Port and PGconn objects they wrap. > Happy to switch to another name if you have a better idea though. > > David >