Hi all, Was wondering if anyone has a reaction to my email below about statement preparation, was it too long? :)
(and sorry for top-posting) On Tue, Feb 6, 2018 at 9:27 PM, Shay Rojansky <r...@roji.org> wrote: > Hi all. > > Various versions of having PostgreSQL caching and/or autopreparing > statement plans have been discussed (https://www.postgresql.org/ > message-id/op.t9ggb3wacigqcu%40apollo13.peufeu.com, https:/ > /www.postgresql.org/message-id/8e76d8fc-8b8c-14bd-d4d1- > e9cf193a74f5%40postgrespro.ru), without clear conclusions or even an > agreement on what might be worthwhile to implement. I wanted to bring this > up again from a PostgreSQL driver maintainer's perspective (I'm the owner > of Npgsql, the open source .NET driver), apologies in advance if I'm > repeating things or I've missed crucial information. Below I'll describe > three relevant issues and what I've done to deal with them. > > When the same statement is rerun, preparing it has a very significant > performance boost. However, in short-lived connection scenarios it's > frequently not possible to benefit from this - think of a typical webapp > which allocates a connection from a pool, run a query and then return the > connection. To make sure prepared statements are used, Npgsql's connection > pool doesn't send DISCARD ALL when a connection is returned (to avoid > wiping out the connections), and maintains an internal table mapping SQL > (and parameter types) to a PostgreSQL statement name. The next time the > application attempts to prepare the same SQL, the prepared statement is > found in the table and no preparation needs to occur. This means that > prepared statements persist across pooled connection open/close, and are > never discarded unless the user uses a specific API. While this works, the > disadvantages are that: > 1. This kind of mechanism needs to be implemented again and again, in each > driver: > 2. It relies on Npgsql's internal pooling, which can track persistent > prepared statements on physical connections. If an external pool is used > (e.g. pgpool), this isn't really possible. > 1. It complicates resetting the session state (instead of DISCARD ALL, a > combination of all other reset commands except DEALLOCATE ALL needs be > sent). This is minor. > > The second issue is that many applications don't work directly against the > database API (ADO.NET in .NET, JDBC in Java). If any sort of O/RM or > additional layer is used, there's a good chance that that layer doesn't > prepare in any way, and indeed hide your access to the database API's > preparation method. Two examples from the .NET world is dapper (a very > popular micro-O/RM) and Entity Framework. In order to provide the best > possible performance in these scenarios, Npgsql has an opt-in feature > whereby it tracks how many times a given statement was executed, and once > it passes a certain threshold automatically prepares it. An LRU cache is > then used to determine which prepared statements to discard, to avoid > explosion. In effect, statement auto-preparation is implemented in the > driver. I know that the JDBC driver also implements such a mechanism (it > was actually the inspiration for the Npgsql feature). The issues with this > are: > > 1. As above, this has to be implemented by every driver (and is quite > complex to do well) > 2. There's a possible missed opportunity in having a single plan on the > server, as each connection has its own (the "global plan" option). Many > apps out there send the same statements across many connections so this > seems relevant - but I don't know if the gains outweigh the contention > impact in PostgreSQL. > > Finally, since quite a few (most?) other databases include autopreparation > (SQL Server, Oracle...), users porting their applications - which don't > explicitly prepare - experience a big performance drop. It can rightly be > said that porting an application across databases isn't a trivial task and > that adjustments need to be made, but from experience I can say that > PostgreSQL is losing quite a few users to this. > > The above issues could be helped by having PostgreSQL cache on its side > (especially the second issue, which is the most important). Ideally, any > adopted solution would be transparent and not require any modification to > applications. It would also not impact explicitly-prepared statements in > any way. > > Note that I'm not arguing for any specific implementation on the > PostgreSQL side (e.g. global or not), but just describing a need and hoping > to restart a conversation that will lead somewhere. > > (and thanks for reading this overly long message!) > > Shay >