Re: Outer cost higher than the inner cost
Stanisław Skonieczny writes:
> What bothers me is that the inner plan cost (7430940) is higher than the
> outer plan cost (1131699).

I think it is estimating (based on knowledge of the ranges of join keys in the two relations) that that input subplan won't need to be run to completion. See initial_cost_mergejoin in costsize.c.

regards, tom lane
Re: Query performance issue
Dheeraj Sonawane writes:
> While executing the join query on the postgres database we have observed
> sometimes randomly below query is being fired which is affecting our response
> time.
> Query randomly fired in the background:-
> SELECT p.proname,p.oid FROM pg_catalog.pg_proc p, pg_catalog.pg_namespace n
> WHERE p.pronamespace=n.oid AND n.nspname='pg_catalog' AND ( proname =
> 'lo_open' or proname = 'lo_close' or proname = 'lo_creat' or proname =
> 'lo_unlink' or proname = 'lo_lseek' or proname = 'lo_lseek64' or proname =
> 'lo_tell' or proname = 'lo_tell64' or proname = 'loread' or proname =
> 'lowrite' or proname = 'lo_truncate' or proname = 'lo_truncate64')

That looks very similar to libpq's preparatory lookup before executing large object accesses (cf lo_initialize in fe-lobj.c). The details aren't identical, so it's not from libpq itself, but I'd guess this is some other client library's version of the same thing.

> Query intended to be executed:-
> SELECT a.* FROM tablename1 a INNER JOIN users u ON u.id = a.user_id INNER
> JOIN tablename2 c ON u.client_id = c.id WHERE u.external_id = ? AND c.name =
> ? AND (c.namespace = ? OR (c.namespace IS NULL AND ? IS NULL))

It is *really* hard to believe that that lookup query would make any noticeable difference in response time for some other session, unless you are running the server on seriously underpowered hardware.

It could be that you've misinterpreted your data, and what is actually happening is that that other session has completed its lookup query and is now doing fast-path large object reads and writes using the results. Fast-path requests might not show up as queries in your monitoring, but if the large object I/O is sufficiently fast and voluminous, maybe that'd account for visible performance impact.

> 2. Is there any way we can suppress this query?

Stop using large objects? But the alternatives won't be better in terms of performance impact. Really, if this is a problem for you, you need a beefier server. Or split the work across more than one server.

regards, tom lane
Re: performance of sql and plpgsql functions
Julius Tuskenis writes:
> EXPLAIN (ANALYZE, BUFFERS, VERBOSE)
> SELECT
>     COALESCE(sum(mok_nepadengta), 0)
> FROM
>     public.b_pardavimai
>     JOIN public.b_mokejimai ON (mok_pardavimas = pard_id)
> WHERE
>     (pard_tipas = ANY('{1, 2, 6, 7}'))
>     AND (mok_saskaita = 7141968)

I believe that the SQL-language function executor always uses generic plans for parameterized queries (which is bad, but nobody's gotten round to improving it). So the above is a poor way of investigating what will happen, because it corresponds to a custom plan for the value 7141968. You should try something like

    PREPARE p(integer) AS SELECT COALESCE ... ... AND (mok_saskaita = $1);
    SET plan_cache_mode TO force_generic_plan;
    EXPLAIN ANALYZE EXECUTE p(7141968);

What I suspect is that the statistics for mok_saskaita are highly skewed, and so with a generic plan the planner will not risk using a plan that depends on the parameter value being infrequent, as the one you're showing does.

regards, tom lane
Re: Extremely slow to establish connection when user has a high number of roles
Michal Charemza writes:
> Tom Lane writes:
>> I'm fairly sure that if that exists it's always noticed first,
>> bypassing the need for any role membership tests. So please
>> confirm whether your production database has revoked PUBLIC
>> connect privilege.
> I realised that in fact we hadn't revoked this. So it sounds like whatever
> the issue, it's not about checking if the user has the CONNECT privilege?

Yeah. I double-checked the code (see aclmask()), and it will detect holding a privilege via PUBLIC before it performs any role membership searches. So whatever is happening, it's not that lookup.

>> It could be that the problem is not associated with the
>> database's connect privilege, but with role membership lookups
>> triggered by pg_hba.conf entries. Do you have any entries there
>> that require testing membership (i.e. the role column is not
>> "all")?
> Running `select * from pg_hba_file_rules` it looks like the user column is
> always {all} or {rdsadmin}

You'll need to look closer and figure out which of the HBA rules is being used for the slow connection attempts. If it's {rdsadmin} then that would definitely involve a role membership search. If it's {all} then we're back to square one.

A different line of thought could be that the slow connections are slow because they are waiting on a lock that some other process has got and is in no hurry to release. It would be worth trying to capture the contents of the pg_locks view (and I guess also pg_stat_activity) while one of these sessions is stuck, if you can reproduce it often enough to make that feasible.

regards, tom lane
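The capture Tom suggests can be scripted ahead of time; a minimal sketch, to be run from a separate session while a connection attempt is stuck (the column choice is just one reasonable selection, not the only one):

```sql
-- Snapshot pg_stat_activity joined to pg_locks in one go,
-- so granted and ungranted locks line up with session state.
SELECT now() AS captured_at,
       a.pid, a.usename, a.state,
       a.wait_event_type, a.wait_event,
       l.locktype, l.mode, l.granted,
       left(a.query, 80) AS query
FROM pg_stat_activity a
LEFT JOIN pg_locks l ON l.pid = a.pid
ORDER BY a.pid, l.granted;
```

Saving the output of a few such snapshots taken seconds apart makes it much easier to see whether the stuck session is parked on an ungranted lock.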
Re: Extremely slow to establish connection when user has a high number of roles
Michal Charemza writes:
> Tom Lane writes:
>> It's not very clear what you mean by "sometimes". Is the slowness
>> reproducible for a particular user and role configuration, or does
>> it seem to come and go by itself?
> Ah it's more come and go by itself - as in one connection takes 30 seconds,
> then the next say 0.06s. It's happened for every user we've tried. Even
> more anecdotally, I would say it happens more when the database is busy in
> terms of tables being dropped/created and permissions changing.

OK, that pretty much eliminates the idea that it's a new manifestation of the catcache-inefficiency problem. Vijaykumar may well have the right idea, that it's a form of catalog bloat. Do you do bulk permissions updates that might affect thousands of role memberships at once?

> Also: realise we did have one user that was directly a member of
> several thousand roles, but indirectly several million. It would sometimes
> take 10 minutes for that user to connect. We've since changed that to one
> role, and that user connects fine now.

Interesting --- but even for that user, it was sometimes fast to connect?

> I am wondering - what happens on connection? What catalogue tables does
> PostgreSQL check and how? What's allowed to happen concurrently and what
> isn't? If I knew, maybe I could come up with a reproduction script that
> does reproduce the issue?

Well, it's going to be looking to see that the user has CONNECT privileges on the target database. If that db doesn't have public connect privileges, but only grants CONNECT to certain roles, then we'll have to test whether the connecting user is a member of those roles --- which involves looking into pg_auth_members and possibly even doing recursive searches there. For the sort of setup you're describing, with thousands of role grants (pg_auth_members entries), it's not hard to imagine that search being rather expensive. What remains to be explained is how come it's only expensive sometimes.

The catalog-bloat idea comes from thinking about how Postgres handles row updates. There will be multiple physical copies (row versions) of any recently-updated row, and this is much more expensive to scan than a static situation with only one live row version: first because we have to look at more than one copy; second because testing whether a copy is the live version is noticeably more expensive if it's recent than once it's older than the xmin horizon; and third because if we are the first process to scan it since it became dead-to-everybody, then it's our responsibility to mark it as dead-to-everybody, so we have to incur additional I/O to do that.

A plausible idea for particular connection attempts being slow is that they came in just as a whole lot of pg_auth_members entries became dead-to-everybody, and hence they were unlucky enough to get saddled with a whole lot of that hint-bit-updating work. (This also nicely explains why the next attempt isn't slow: the work's been done.) But this is only plausible if you regularly do actions that cause a lot of pg_auth_members entries to be updated at the same time. So we still don't have good insight into that, and your test script isn't shedding any light.

A couple of other thoughts:

* I don't think your test script would show a connection-slowness problem even if there was one to be shown, because you forgot to revoke the PUBLIC connect privilege on the postgres database. I'm fairly sure that if that exists it's always noticed first, bypassing the need for any role membership tests. So please confirm whether your production database has revoked PUBLIC connect privilege.

* It could be that the problem is not associated with the database's connect privilege, but with role membership lookups triggered by pg_hba.conf entries. Do you have any entries there that require testing membership (i.e. the role column is not "all")?

regards, tom lane
Re: Extremely slow to establish connection when user has a high number of roles
Michal Charemza writes:
> The issue is that we're hitting a strange performance problem on
> connection. Sometimes it can take ~25 to 40 seconds just to connect,
> although it's often way quicker. There seems to be no middle ground - never
> have I seen a connection take between 0.5 and 25 seconds for example. We
> suspect it's related to the number of roles the connecting user has
> (including via other roles), because if we remove all roles but one from
> the connecting user (the one that grants connection permissions),
> connecting is always virtually instantaneous.

It's not very clear what you mean by "sometimes". Is the slowness reproducible for a particular user and role configuration, or does it seem to come and go by itself? As Tomas said, a self-contained reproduction script would be very helpful for looking into this.

> The closest issue that I can find that's similar is
> https://www.postgresql.org/message-id/flat/CAGvXd3OSMbJQwOSc-Tq-Ro1CAz%3DvggErdSG7pv2s6vmmTOLJSg%40mail.gmail.com,
> which reports that GRANT role is slow with a high number of roles - but in
> our case, it's connecting that's the problem, before (as far as we can
> tell) even one query is run.

That specific problem is (we think) new in v16, but the root cause is an inefficient lookup mechanism that has been there a long time. Maybe you have found a usage pattern that exposes its weakness in older branches. If so, we could consider back-patching 14e991db8 further than v16 ... but I don't plan to take any risk there without concrete evidence that it'd improve things.

regards, tom lane
Re: Simple JOIN on heavy table not using expected index
kimaidou writes:
> It seems PostgreSQL does not use the index parcelles_dep_idx on "dep" (text
> field), even if the corresponding number of lines for this WHERE clause is
> a small subset of the entire data:
> approx 6M against 80M in total

6M out of 80M rows is not a "small subset". Typically I'd expect the planner to use an index-based scan for up to 1 or 2 percent of the table. Beyond that, you're going to be touching most pages of the table anyway. You can try reducing random_page_cost to favor indexscans, but you might not find that the query gets any faster.

regards, tom lane
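A sketch of the experiment Tom describes. The table name is inferred from the index name in the thread and the cost value is just an illustration for SSD-class storage, so adapt both:

```sql
-- Baseline plan with the default random_page_cost (4.0):
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM parcelles WHERE dep = '01';   -- names assumed

-- Tell the planner random page reads are nearly as cheap as
-- sequential ones, then compare the chosen plan and runtime:
SET random_page_cost = 1.1;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM parcelles WHERE dep = '01';

RESET random_page_cost;
```

If the indexscan plan is not actually faster at the lower setting, the planner's original choice was right for this selectivity.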
Re: I don't understand that EXPLAIN PLAN timings
David Rowley writes:
> I do wonder now if it was a bad idea to make Memoize build the hash
> table on plan startup rather than delaying that until we fetch the
> first tuple. I see Hash Join only builds its table during executor
> run.

Ouch! If it really does that, yes it's a bad idea.

regards, tom lane
Re: Why is a sort required for this query? (IS NULL predicate on leading key column)
Jerry Brenner writes:
> I'm wondering why a sort is required for this query, as the index should be
> providing the required ordering to satisfy the ORDER BY clause. Does it
> have to do with the IS NULL predicate on the leading key column in the
> index?

IS NULL is not seen as an equality condition, no. It's pretty much a hack that makes it an indexable condition at all, and we don't really do any advanced optimization with it.

regards, tom lane
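The distinction can be seen with a self-contained sketch (hypothetical table; the plan shapes are what one would typically expect, not guaranteed on every version):

```sql
CREATE TABLE t (a int, b int);
CREATE INDEX t_a_b_idx ON t (a, b);

-- Equality on the leading column: rows come out of the index
-- already ordered by b, so no Sort node should be needed.
EXPLAIN SELECT * FROM t WHERE a = 1 ORDER BY b;

-- IS NULL is indexable but not treated as equality, so the planner
-- does not conclude the output is sorted by b, and adds a Sort.
EXPLAIN SELECT * FROM t WHERE a IS NULL ORDER BY b;
```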
Re: Selection not "pushed down into" CTE
Clemens Eisserer writes:
> running postgresql 15.5 I was recently surprised postgresql didn't
> perform an optimization which I thought would be easy to apply.

It is not.

> running the following query results in a full sort (caused by lead
> over order by) as the ts > '2024-01-04' selection doesn't seem to be
> applied to the CTE but only later:
> with cte as (select ts, lead(ts, 1) over (order by ts) as ts2 from smartmeter)
> select ts, ts2 from cte where ts > '2024-01-04' and extract(epoch
> from ts2) - extract(epoch from ts) > 9;

The ts restriction is not pushed down because of the rules in allpaths.c:

 * 4. If the subquery has any window functions, we must not push down quals
 * that reference any output columns that are not listed in all the subquery's
 * window PARTITION BY clauses. We can push down quals that use only
 * partitioning columns because they should succeed or fail identically for
 * every row of any one window partition, and totally excluding some
 * partitions will not change a window function's results for remaining
 * partitions. (Again, this also requires nonvolatile quals, but
 * subquery_is_pushdown_safe handles that.)

To conclude that it'd be safe with this particular window function requires deep knowledge of that function's semantics, which the planner has not got.

regards, tom lane

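In this particular case the human can supply the semantic knowledge the planner lacks: lead(ts) over (order by ts) only looks at later rows, and every later row also satisfies ts > '2024-01-04', so moving the restriction inside the CTE by hand does not change the surviving rows' results. A sketch, assuming that reasoning holds for your query:

```sql
-- Manual pushdown: safe here only because the predicate is on the
-- window ORDER BY column and lead() looks strictly forward, so the
-- rows it peeks at all satisfy the same predicate.
WITH cte AS (
  SELECT ts, lead(ts, 1) OVER (ORDER BY ts) AS ts2
  FROM smartmeter
  WHERE ts > '2024-01-04'
)
SELECT ts, ts2
FROM cte
WHERE extract(epoch FROM ts2) - extract(epoch FROM ts) > 9;
```

With an index on ts this lets the restriction be applied before the window sort rather than after it.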
Re: Questions about "Output" in EXPLAIN ANALYZE VERBOSE
Jeff Janes writes:
> On Tue, Jan 2, 2024 at 1:29 PM Jerry Brenner wrote:
>> - It looks like "Output" includes more than just the columns with
>> predicates and/or being accessed or returned in other nodes.
> Not in my hands. For SELECTs it just lists the columns that are needed.

It depends. The planner may choose to tell a non-top-level scan node to return all columns, in hopes of saving a tuple projection step at runtime. That's heuristic and depends on a number of factors, so you shouldn't count on it happening or not happening.

regards, tom lane
Re: Question about semantics of $ variables in json explain plans in 13
Jerry Brenner writes:
> Is there any documentation on the semantics of $ variables in json explain
> plans for both InitPlans and SubPlans in 13?

I don't think there's anything much in the user-facing docs, which is somewhat unfortunate because it's confusing: the notation is overloaded. $N could be a parameter supplied from outside the query (as in your $1, $2 and $3 in the source text), but it could also be a parameter supplied from an outer query level to a subplan, or it could be the result value of an InitPlan. The numbering of outside-the-query parameters is disjoint from that of the other kind.

> - It looks like $0 represents the value from the outer query block when
>   the correlated subquery is evaluated
> - It looks like $1 represents the result of the subquery evaluation

Yeah, I think you're right here. $0 evidently corresponds to qRoot.contactID from the outer plan, and the plan label itself shows that $1 carries the sub-select's value back out. This $1 is unrelated to the $1 you wrote in the query text. (It looks like this is a custom plan in which "67" was explicitly substituted for your $3. Presumably $1 and $2 were replaced as well; we don't do half-custom plans.)

regards, tom lane
Re: 2 json explain plans for the same query/plan - why does one have constants while the other has parameter markers?
Jerry Brenner writes:
> We are currently on 13. We are capturing the explain plans for query
> executions taking 1 second or longer and storing the json files. We are
> most of the way through implementing a home grown solution to generate a
> consistent hash value for a query plan, so we can find queries with
> multiple plans. I've attached 2 query plans that we've captured that
> differ in a seemingly strange way. (All executions are from the same exact
> code path.) One of the plans has parameter markers in the predicates in
> the values for "Recheck Cond" and "Index Cond", while the other does not.
> Any insight into why we are seeing parameter markers in the body of the
> query plan?

The one with parameter markers is a "generic" plan for a parameterized query. When you get a plan without parameter markers for the same input query, that's a "custom" plan in which concrete values of the parameters have been substituted, possibly allowing const-simplification and more accurate rowcount estimates. The backend will generally try custom plans a few times and then try a generic plan to see if that's meaningfully slower -- if not, replanning each time is deemed to be wasteful.

regards, tom lane
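The two plan flavors can be forced explicitly for a side-by-side comparison; a sketch with a hypothetical prepared statement (plan_cache_mode is available in the v13 discussed here):

```sql
PREPARE q(int) AS SELECT * FROM some_table WHERE id = $1;  -- names assumed

SET plan_cache_mode TO force_custom_plan;
EXPLAIN EXECUTE q(42);   -- conditions show the substituted constant, e.g. (id = 42)

SET plan_cache_mode TO force_generic_plan;
EXPLAIN EXECUTE q(42);   -- conditions show parameter markers, e.g. (id = $1)

RESET plan_cache_mode;
```

This also means any plan-hashing scheme has to decide whether the generic and custom flavors of the same query count as "the same plan".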
Re: Does Postgres have consistent identifiers (plan hash value) for explain plans?
Michael Paquier writes:
> On Mon, Dec 04, 2023 at 09:57:24AM -0500, Tom Lane wrote:
>> Jerry Brenner writes:
>>> Both Oracle and SQL Server have
>>> consistent hash values for query plans and that makes it easy to identify
>>> when there are multiple plans for the same query. Does that concept exist
>>> in later releases of Postgres (and is the value stored in the json explain
>>> plan)?
>> No, there's no support currently for obtaining a hash value that's
>> associated with a plan rather than an input query tree.
> PlannerGlobal includes a no_query_jumble that gets inherited by all
> its lower-level nodes, so adding support for hashes compiled from
> these node structures would not be that complicated. My point is that
> the basic infrastructure is in place in the tree to be able to do
> that, and it should not be a problem to even publish the compiled
> hashes in EXPLAIN outputs, behind an option of course.

Well, yeah, we could fairly easily activate that infrastructure for plans, but we haven't. More to the point, it's not clear to me that that would satisfy the OP's request for "consistent" hash values. The hashes would vary depending on object OID values, system version, possibly endianness, etc.

I'm also wondering exactly what the OP thinks qualifies as different plans. Remembering the amount of fooling-around that's gone on with querytree hashes to satisfy various people's ill-defined desires for pg_stat_statements aggregation behavior, I'm not really eager to buy into the same definitional morass at the plan level.

regards, tom lane
Re: Does Postgres have consistent identifiers (plan hash value) for explain plans?
Jerry Brenner writes:
> Both Oracle and SQL Server have
> consistent hash values for query plans and that makes it easy to identify
> when there are multiple plans for the same query. Does that concept exist
> in later releases of Postgres (and is the value stored in the json explain
> plan)?

No, there's no support currently for obtaining a hash value that's associated with a plan rather than an input query tree.

regards, tom lane
Re: Performance degradation with CTEs, switching from PG 11 to PG 15
Jean-Christophe Boggio writes:
> I did upgrade :-) But we have many users for which we don't decide on
> when they do upgrade so we have to keep compatibility with most versions
> of PG and in that particular case (non-existence of the materialized
> keyword for PG 11 and before) it is a real problem.

PG 11 is out of support as of earlier this month, so your users really need to be prioritizing getting onto more modern versions.

regards, tom lane
Re: Postgres Locking
"Dirschel, Steve" writes:
> Above I can see PID 3740 is blocking PID 3707. The PK on table
> wln_mart.ee_fact is ee_fact_id. I assume PID 3740 has updated a row
> (but not committed it yet) that PID 3707 is also trying to update.

Hmm. We can see that 3707 is waiting for 3740 to commit, because it's trying to take ShareLock on 3740's transactionid:

> transactionid |       |       |   |   ||
> 251189986     |       |       |   |   | 54/196626 | 3707 |
> ShareLock     | f | f | 2023-10-31 14:40:21.837507-05

251189986 is indeed 3740's, because it has ExclusiveLock on that:

> transactionid |       |       |   |   ||
> 251189986     |       |       |   |   | 60/259887 | 3740 |
> ExclusiveLock | t | f |

There are many reasons why one xact might be waiting on another to commit, not only that they tried to update the same tuple. However, in this case I suspect that that is the problem, because we can also see that 3707 has an exclusive tuple-level lock:

> tuple | 91999 | 93050 | 0 | 1 ||
>       |       |       |   |   | 54/196626 | 3707 | ExclusiveLock
>       | t | f |

That kind of lock would only be held while queueing to modify a tuple. (Basically, it establishes that 3707 is next in line, in case some other transaction comes along and also wants to modify the same tuple.) It should be released as soon as the tuple update is made, so 3707 is definitely stuck waiting to modify a tuple, and AFAICS it must be stuck because of 3740's uncommitted earlier update.

> But I am being told those 2 sessions should not be trying to process the
> same PK rows.

Perhaps "should not" is wrong. Or it could be some indirect update (caused by a foreign key with CASCADE, or the like). You have here the relation OID (try "SELECT 93050::regclass" to decode it) and the tuple ID, so it should work to do

    SELECT * FROM that_table WHERE ctid = '(0,1)';

to see the previous state of the problematic tuple. Might help to decipher the problem.

regards, tom lane
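The manual pg_locks analysis above can be shortcut with pg_blocking_pids() (available since 9.6); a sketch, with the OID and ctid taken from this thread:

```sql
-- Who blocks whom, without decoding transactionid locks by hand:
SELECT pid, pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Decode the relation OID from the tuple lock, then inspect the
-- contested tuple (substitute the table name the first query returns):
SELECT 93050::regclass;
SELECT * FROM wln_mart.ee_fact WHERE ctid = '(0,1)';
```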
Re: Postgres 15 SELECT query doesn't use index under RLS
Alexander Okulovich writes:
> I've attempted to reproduce this on my PC in Docker from the stage
> database dump, but no luck. The first query execution on Postgres 15
> behaves like on the real stage, but subsequent ones use the index.

Can you force it in either direction with "set enable_seqscan = off" (resp. "set enable_indexscan = off")? If so, how do the estimated costs compare for the two plan shapes?

> Also, they execute much faster. Looks like the hardware and(or) the data
> structure on disk matters.

Maybe your prod installation has a bloated index, and that's driving up the estimated cost enough to steer the planner away from it.

regards, tom lane
Re: Postgres 15 SELECT query doesn't use index under RLS
Alexander Okulovich writes:
> We used the product with the default settings. The requests are simple.
> We didn't change the hardware (actually, we use even more performant
> hardware because of that issue) and DDL. I've checked the request on old
> and new databases. Requests that rely on this index execute more than 10
> times longer. The planner indeed used an Index Scan before, but now it
> doesn't. So, from my perspective, the only reason we experience that is a
> database logic change.

[ shrug... ] Maybe, but it's still not clear whether it's a bug, an intentional change, or just a cost estimate that was on the hairy edge before and your luck ran out. If you could provide a self-contained test case that performs 10x worse under v15 than v12, we'd surely take a look at it. But with the information you've given so far, little is possible beyond speculation.

regards, tom lane
Re: Postgres 15 SELECT query doesn't use index under RLS
Alexander Okulovich writes:
> Recently, we upgraded the AWS RDS instance from Postgres 12.14 to 15.4
> and noticed extremely high disk consumption on the following query
> execution:
> select (exists (select 1 as "one" from "public"."indexed_commit" where
> "public"."indexed_commit"."repo_id" in (964992,964994,964999, ...);
> For some reason, the query planner starts using Seq Scan instead of the
> index on the "repo_id" column when requesting under a user limited with
> RLS. On prod, it happens when there are more than 316 IDs in the IN part
> of the query, on stage - 3. If we execute the request as superuser,
> the planner always uses the "repo_id" index.

The superuser bypasses the RLS policy. When that's enforced, the query can no longer use an index-only scan (because it needs to fetch tenant_id too). Moreover, it may be that only a small fraction of the rows fetched via the index will satisfy the RLS condition. So the estimated cost of an indexscan query could be high enough to persuade the planner that a seqscan is a better idea.

> Luckily, we can easily reproduce this on our stage database (which is
> smaller). If we add a multicolumn "repo_id, tenant_id" index, the
> planner uses it (Index Only Scan) with any IN params count under RLS.

Yeah, that would be the obvious way to ameliorate both problems.

If in fact you were getting decent performance from an indexscan plan before, the only explanation I can think of is that the repo_ids you are querying for are correlated with the tenant_id, so that the RLS filter doesn't eliminate very many rows from the index result. The planner wouldn't realize that by default, but if you create extended statistics on repo_id and tenant_id then it might do better. Still, you probably want the extra index.

> Could you please clarify if this is a Postgres bug or not?

You haven't shown any evidence suggesting that.

> Should we include the "tenant_id" column in all our indexes to make them
> work under RLS?

Adding tenant_id is going to bloat your indexes quite a bit, so I wouldn't do that except in cases where you've demonstrated it's important.

regards, tom lane
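A sketch of the two remedies discussed in this thread (table and column names are from the thread; the statistics kinds and index DDL are assumptions to adapt):

```sql
-- Let the planner see the repo_id / tenant_id correlation:
CREATE STATISTICS indexed_commit_repo_tenant_stx (dependencies, ndistinct)
  ON repo_id, tenant_id FROM indexed_commit;
ANALYZE indexed_commit;

-- And/or restore index-only scans under RLS for this query shape
-- (CONCURRENTLY avoids blocking writes during the build):
CREATE INDEX CONCURRENTLY indexed_commit_repo_tenant_idx
  ON indexed_commit (repo_id, tenant_id);
```

Per Tom's last point, the multicolumn index is worth adding only where this query shape is demonstrably hot, not across every index in the schema.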
Re: Unexpected termination looping over table.
Matt Gibbins writes:
> Have encountered an intriguing issue processing a table with a large
> number of rows. The process is to loop over the table, processing each
> row by executing the Apache AGE cypher function over each row
> individually. The looping is necessary because of encountering a limit on
> the number of rows that the cypher function would process.
> The process terminates unexpectedly with the following message. Notably,
> it runs for quite some time before termination:
> SQL Error [42703]: ERROR: could not find rte for
> a01a724103fbb3d059b8387bf043dbc8
> Where: PL/pgSQL function
> analysis.create_trips(text,text,text,text,text,text,integer,text,integer)
> line 5 at EXECUTE

There is no occurrence of "could not find rte" anywhere in the core Postgres source code. I surmise that you're using some extension that isn't happy.

regards, tom lane
Re: pgsql 10.23 , different systems, same table , same plan, different Buffers: shared hit
Achilleas Mantzios writes:
> Thank you, I see that both systems use en_US.UTF-8 as lc_collate and
> lc_ctype,

Doesn't necessarily mean they interpret that the same way, though :-(

> the below seems ok
> FreeBSD :
> postgres@[local]/dynacom=# select * from (values
> ('a'),('Z'),('_'),('.'),('0')) as qry order by column1::text;
>  column1
> ---------
>  _
>  .
>  0
>  a
>  Z
> (5 rows)

Sadly, this proves very little about Linux's behavior. glibc's idea of en_US involves some very complicated multi-pass sort rules. AFAICT from the FreeBSD sort(1) man page, FreeBSD defines en_US as "same as C except case-insensitive", whereas I'm pretty sure that underscores and other punctuation are nearly ignored in glibc's interpretation; they'll only be taken into account if the alphanumeric parts of the strings sort equal.

regards, tom lane
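The same five-value test can be run under several collations on a single system to see how differently they order punctuation (collation names vary by platform and ICU availability; the spellings below are typical glibc ones):

```sql
-- Byte order: punctuation sorts before digits and letters.
SELECT column1 FROM (VALUES ('a'),('Z'),('_'),('.'),('0')) AS v
ORDER BY column1 COLLATE "C";

-- glibc en_US: punctuation is nearly ignored, considered only
-- when the alphanumeric parts compare equal.
SELECT column1 FROM (VALUES ('a'),('Z'),('_'),('.'),('0')) AS v
ORDER BY column1 COLLATE "en_US.utf8";
```

Running this on both machines with their respective default collations would confirm or refute the interpretation difference directly.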
Re: pgsql 10.23 , different systems, same table , same plan, different Buffers: shared hit
Achilleas Mantzios - cloud writes:
> *FreeBSD*
>
>    -> Index Only Scan using mail_vessel_addressbook_address_regex_idx
> on mail_vessel_addressbook (cost=0.42..2912.06 rows=620 width=32)
> (actual time=96.704..96.705 rows=1 loops=1)
>          Filter: ('f...@bar.com'::text ~* address_regex)
>          Rows Removed by Filter: 14738
>          Heap Fetches: 0
>          Buffers: shared hit=71
>
> *Linux*
>
>    -> Index Only Scan using mail_vessel_addressbook_address_regex_idx
> on mail_vessel_addressbook (cost=0.42..2913.04 rows=620 width=32)
> (actual time=1768.724..1768.725 rows=1 loops=1)
>          Filter: ('f...@bar.com'::text ~* address_regex)
>          Rows Removed by Filter: 97781
>          Heap Fetches: 0
>          Buffers: shared hit=530
>
> The file in FreeBSD came by pg_dump from the linux system, I am puzzled
> why this huge difference in Buffers: shared hit.

The "rows removed" value is also quite a bit different, so it's not just a matter of buffer touches --- there's evidently some real difference in how much of the index is being scanned. I speculate that you are using different collations on the two systems, and FreeBSD's collation happens to place the first matching row earlier in the index.

regards, tom lane
Re: Planning time is time-consuming
David Rowley writes:
> I'm not sure if you're asking for help here because you need planning
> to be faster than it currently is, or if it's because you believe that
> planning should always be faster than execution. If you think the
> latter, then you're mistaken.

Yeah. I don't see anything particularly troubling here. Taking circa three-quarters of a millisecond (on typical current hardware) to plan a four-way join on large tables is not unreasonable. In most cases one could expect the execution of such a query to take a good bit longer than that. I think the OP is putting too much emphasis on an edge case where execution finishes quickly because there are in fact zero rows matching the uuid restriction.

BTW, in addition to the duplicative indexes, I wonder why the uuid columns being joined on aren't all of "uuid" type. While I doubt fixing that would move the needle greatly, it still seems sloppy.

regards, tom lane
Re: Slow query, possibly not using index
Les writes:
>>> If I try to select a single unused block this way:
>>> explain analyze select id from media.block b where nrefs =0 limit 1
>>> then it runs for more than 10 minutes (I'm not sure how long, I cancelled
>>> the query after 10 minutes).
>> You might think that even so, it shouldn't take that long ... but
>> indexes on UUID columns are a well known performance antipattern.
> I'm aware of the problems with random UUID values. I was using this
> function to create ulids from the beginning:

Oh, well that would have been useful information to provide at the outset. Now that we know the index order is correlated with creation time, I wonder if it is also correlated with nrefs, in such a way that scanning in index order is disastrous because all the desired rows are at the end of the index.

Also, you deny deleting any rows, but that wasn't the point. Do you ever update nrefs from zero to nonzero? That'd also leave dead entries behind in this index. If that is a routine event that is correlated with creation time, it gets easier to believe that your index could have lots of dead entries at the front. We'd still have to figure out why autovacuum is failing to clean out those entries in a timely fashion, but that seems like a plausible way for the performance problem to exist.

> I can try to do VACUUM on this table, but I'm limited on resources. I think
> it will take days to do this. Maybe I can try to dump the whole database
> and restore it on another machine.

Pretty hard to believe that dump-and-restore would be faster than VACUUM.

> (Is there a way to check the number of dead rows?)

I think contrib/pgstattuple might help.

regards, tom lane
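The dead-row check Tom points at, sketched (the extension must be installed by a sufficiently privileged role, and the index name is an assumption):

```sql
CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Dead-tuple counts for the table (reads the whole table):
SELECT dead_tuple_count, dead_tuple_percent, free_percent
FROM pgstattuple('media.block');

-- Leaf density / fragmentation for a btree index:
SELECT avg_leaf_density, leaf_fragmentation
FROM pgstatindex('media.block_nrefs_idx');   -- hypothetical index name
```

A very low avg_leaf_density or a large dead_tuple_count would support the bloated-index theory without waiting days for a VACUUM to finish.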
Re: Slow query, possibly not using index
Les writes:
> If I try to select a single unused block this way:
> explain analyze select id from media.block b where nrefs =0 limit 1
> then it runs for more than 10 minutes (I'm not sure how long, I cancelled
> the query after 10 minutes).

Are you sure it isn't blocked on a lock?

Another theory is that the index contains many thousands of references to now-dead rows, and the query is vainly searching for a live entry. Given that EXPLAIN thinks there are only about 2300 live entries, and yet you say the index is 400MB, this seems pretty plausible. Have you disabled autovacuum, or something like that? (REINDEX could help here, at least till the index gets bloated again.)

You might think that even so, it shouldn't take that long ... but indexes on UUID columns are a well known performance antipattern. The index entry order is likely to have precisely zip to do with the table's physical order, resulting in exceedingly-random access to the table, which'll be horribly expensive when the table is so much bigger than RAM. Can you replace the UUID column with a simple serial (identity) column?

> I believe it is not actually using the index, because reading a single
> (random?) entry from an index should not run for >10 minutes.

You should believe what EXPLAIN tells you about the plan shape. (Its rowcount estimates are only estimates, though.)

regards, tom lane
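Two quick follow-ups to the advice above, sketched with assumed names:

```sql
-- From another session: is the slow query actually waiting on a lock
-- rather than doing work?
SELECT pid, state, wait_event_type, wait_event, left(query, 60) AS query
FROM pg_stat_activity
WHERE state = 'active' AND wait_event_type = 'Lock';

-- If the index is bloated, rebuild it without blocking writers
-- (REINDEX ... CONCURRENTLY requires PG 12 or later):
REINDEX INDEX CONCURRENTLY media.block_nrefs_idx;  -- hypothetical index name
```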
Re: slow delete
Les writes: > It seems that two foreign key constraints use 10.395 seconds out of the > total 11.24 seconds. But I don't see why it takes that much? Probably because you don't have an index on the referencing column. You can get away with that, if you don't care about the speed of deletes from the PK table ... regards, tom lane
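The fix that follows from that diagnosis looks like this (table and column names are invented for illustration; PostgreSQL indexes primary-key and unique columns automatically, but not the referencing column of a foreign key):

```sql
-- Without this index, every DELETE on the parent table forces a
-- sequential scan of the child table to check the FK constraint.
-- CONCURRENTLY avoids blocking writes while the index builds.
CREATE INDEX CONCURRENTLY child_parent_id_idx
    ON child_table (parent_id);
```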
Re: Plan weirdness. A sort produces more rows than the node beneath it
Dane Foster writes: >> If the sort is the inner input to a merge join, this could reflect >> mark-and-restore rescanning of the sort's output. Are there a >> whole lot of duplicate keys on the merge's other side? > Yes. The course_id column's values repeat a LOT on the merge side. Hmm. The planner should avoid using a merge join if it knows that to be true. Maybe analyze'ing that table would prompt it to use some other join method? regards, tom lane
Re: Plan weirdness. A sort produces more rows than the node beneath it
Dane Foster writes: > I'm trying to understand a bit of weirdness in a plan output. There is a > sort node above a sequential scan node where the scan node produces 26,026 > rows yet the sort node above it produces 42,995,408. How is it possible to > sort more data than you received? If the sort is the inner input to a merge join, this could reflect mark-and-restore rescanning of the sort's output. Are there a whole lot of duplicate keys on the merge's other side? regards, tom lane
Re: Postgresql equal join on function with columns not use index
"James Pang (chaolpan)" writes: > Looks like it's the function "regexp_replace" volatile and restrict=false > make the difference, we have our application role with default > search_path=oracle,$user,public,pg_catalog. > =#select > oid,proname,pronamespace::regnamespace,prosecdef,proisstrict,provolatile from > pg_proc where proname='regexp_replace' order by oid; > oid |proname | pronamespace | prosecdef | proisstrict | provolatile > -------+----------------+--------------+-----------+-------------+------------- > 2284 | regexp_replace | pg_catalog | f | t | i > 2285 | regexp_replace | pg_catalog | f | t | i > 17095 | regexp_replace | oracle | f | f | v > 17096 | regexp_replace | oracle | f | f | v > 17097 | regexp_replace | oracle | f | f | v > 17098 | regexp_replace | oracle | f | f | v Why in the world are the oracle ones marked volatile? That's what's preventing them from being used in index quals. regards, tom lane
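Two workarounds consistent with that diagnosis (untested sketches; the argument signature of the oracle-schema function is a guess):

```sql
-- Option 1: schema-qualify the call so the immutable pg_catalog version
-- is resolved, instead of the volatile oracle-schema one that the
-- search_path (oracle,$user,public,pg_catalog) finds first.
SELECT ...
FROM tablea JOIN tableb
  ON tablea.column1 = pg_catalog.regexp_replace(tableb.column1, 'pat', 'rep');

-- Option 2: correct the volatility marking, but only if the oracle-schema
-- wrapper really is deterministic for fixed inputs and has no side effects.
ALTER FUNCTION oracle.regexp_replace(text, text, text) IMMUTABLE;
```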
Re: Postgresql equal join on function with columns not use index
"James Pang (chaolpan)" writes: > We migrated from Oracle to Postgresql 14.8, one SQL has a regression in > Postgres: it runs in 5800 milliseconds in Postgresql v14.8, but the same SQL got > done in several hundred milliseconds in Oracle database. > With multiple table JOINs, if the join condition is > tablea.column1=tableb.column1, the optimizer will use the index to filter data in > nest loops, but if tablea.column1=regexp_replace(tableb.column1), the > optimizer will not be able to use the index on tablea.column1, so it does a > table scan and nestloop to produce a lot of rows, then uses > tablea.column1=regexp_replace(tableb.column1) as a filter. As a > workaround we create a view then use tablea.column1=view.column1, which works. > Is it expected? Details as below. It's impossible to comment on this usefully with such a fragmentary description of the problem. Please send a complete, self-contained test case if you want anybody to look at it carefully. https://wiki.postgresql.org/wiki/Slow_Query_Questions regards, tom lane
Re: Weird behavior of INSERT QUERY
Satalabaha Postgres writes: > The select query in the insert returns about 499 rows. However, this insert > statement when executed from application user i.e. schema1_u takes close to > 8 minutes. When the same insert statement gets executed as postgres user > it takes less than 280 ms. Both the executions use the same execution plan > with only difference that when schema1_u executes the SQL, we observe > "Trigger for constraint fk_con_tablea: time=426499.314 calls=499" taking > more time. So you need to find out what's happening in the trigger. Perhaps auto_explain with auto_explain.log_nested_statements enabled would give some insight. I suspect there might be a permissions problem causing schema1_u to not be allowed to "see" the statistics for table_b, resulting in a bad plan choice for the FK enforcement query; but that's just a guess. regards, tom lane
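A minimal auto_explain setup for that kind of investigation might look like this (session-level; the parameter names are real, the values are illustrative):

```sql
LOAD 'auto_explain';
SET auto_explain.log_min_duration = 0;        -- log every statement's plan
SET auto_explain.log_nested_statements = on;  -- include FK-trigger queries
SET auto_explain.log_analyze = on;            -- actual times, not just estimates

-- Now run the slow INSERT as schema1_u and check the server log for the
-- plan of the FK enforcement query against table_b.
```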
Re: PostgreSQL performance on ARM i.MX6
"Druckenmueller, Marc" writes: > I am investigating possible throughput with PostgreSQL 14.4 on an ARM i.MX6 > Quad CPU (NXP sabre board). > Testing with a simple python script (running on the same CPU), I get ~1000 > request/s. That does seem pretty awful for modern hardware, but it's hard to tease apart the various potential causes. How beefy is that CPU really? Maybe the overhead is all down to client/server network round trips? Maybe psycopg is doing something unnecessarily inefficient? For comparison, on my development workstation I get [ create the procedure manually in db test ] $ cat bench.sql call dummy_call(1,2,3,array[1,2,3]::float8[]); $ pgbench -f bench.sql -n -T 10 test pgbench (16beta1) transaction type: bench.sql scaling factor: 1 query mode: simple number of clients: 1 number of threads: 1 maximum number of tries: 1 duration: 10 s number of transactions actually processed: 353891 number of failed transactions: 0 (0.000%) latency average = 0.028 ms initial connection time = 7.686 ms tps = 35416.189844 (without initial connection time) and it'd be more if I weren't using an assertions-enabled debug build. It would be interesting to see what you get from exactly that test case on your ARM board. BTW, one thing I see that's definitely an avoidable inefficiency in your test is that you're forcing the array parameter to real[] (i.e. float4) when the procedure takes double precision[] (i.e. float8). That forces an extra run-time conversion. Swapping between float4 and float8 in my pgbench test doesn't move the needle a lot, but it's noticeable. Another thing to think about is that psycopg might be defaulting to a TCP rather than Unix-socket connection, and that might add overhead depending on what kernel you're using. Although, rather than try to micro-optimize that, you probably ought to be thinking of how to remove network round trips altogether. I can get upwards of 300K calls/second if I push the loop to the server side: test=# \timing Timing is on. 
test=# do $$ declare x int := 1; a float8[] := array[1,2,3]; begin for i in 1..1000000 loop call dummy_call (x,x,x,a); end loop; end $$; DO Time: 3256.023 ms (00:03.256) test=# select 1000000/3.256023; ?column? -------------------- 307123.137643683721 (1 row) Again, it would be interesting to compare exactly that test case on your ARM board. regards, tom lane
Re: Performance issues in query with multiple joins
Παρασκευη Πασσαρη writes: > We are facing a performance issue with the following query. Executing this > query takes about 20 seconds. Raising join_collapse_limit (to more than the number of joins in the query) might help. But I think really if performance is a problem you should think about ditching the star schema design. regards, tom lane
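Concretely, the suggested experiment looks like this (the default for both settings is 8; the value 16 is illustrative, just make it larger than the number of joins):

```sql
-- Let the planner consider reordering all the joins in the query,
-- at the cost of extra planning time.
SET join_collapse_limit = 16;
SET from_collapse_limit = 16;  -- the analogous limit for subquery flattening

EXPLAIN ANALYZE SELECT ...;    -- re-run the problem query in this session
```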
Re: High QPS, random index writes and vacuum
peter plachta writes: > The company I work for has a large (50+ instances, 2-4 TB each) Postgres > install. One of the key problems we are facing in vanilla Postgres is > vacuum behavior on high QPS (20K writes/s), random index access on UUIDs. Indexing on a UUID column is an antipattern, because you're pretty much guaranteed the worst-case random access patterns for both lookups and insert/delete/maintenance cases. Can you switch to timestamps or the like? There are proposals out there for more database-friendly ways of generating UUIDs than the traditional ones, but nobody's gotten around to implementing that in Postgres AFAIK. regards, tom lane
Re: Query unable to utilize index without typecast to fixed length character
ahi writes: > CREATE TABLE public.marketplace_sale ( > log_index integer NOT NULL, > created_at timestamp with time zone DEFAULT now() NOT NULL, > updated_at timestamp with time zone DEFAULT now() NOT NULL, > block_timestamp timestamp with time zone NOT NULL, > block bigint NOT NULL, > contract_address character(42) NOT NULL, > buyer_address character(42) NOT NULL, > seller_address character(42) NOT NULL, > transaction_hash character(66) NOT NULL, > quantity numeric NOT NULL, > token_id numeric NOT NULL, ... Type character(N) is a hangover from the days of punched cards. Don't use it. It has weird semantics concerning trailing spaces, which are almost never the behavior you actually want, and cause interoperability issues with type text. (Text is Postgres' native string type, meaning that unlabeled string constants will tend to get resolved to that.) regards, tom lane
Re: Why are commits consuming most of the database time?
Tim Slechta writes: > The test data below is from a non-virtualized (client system and database > server) Postgres 14 environment, with no replication, no high availability, > and with no load balancing. This environment has older and slower disk > drives, and the test is driven by a single client process. > In this case 24% of the round trips (client to database and back) are for > commit processing. However, commit processing is consuming 89% of the > total database time. (All times are measured from within the client.) You didn't say how big the transactions are, but if they're not writing a lot of data apiece, this result seems totally non-surprising. The commits have to push WAL log data down to disk before they can promise that the transaction's results are durable, while the statements within the transactions probably are not waiting for any disk writes at all. If you don't need strict ACID compliance, you could turn off synchronous_commit so that commits don't wait for WAL flush. (This doesn't risk the consistency of your database, but it does mean that a crash might lose the last few transactions that clients were told got committed.) If you do need strict ACID compliance, get a better disk subsystem. Or, perhaps, just a better OS ... Windows is generally not thought of as the best-performing platform for Postgres. regards, tom lane
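The relaxed-durability option can be applied per-session or cluster-wide, so only the latency-sensitive work needs to run with it:

```sql
-- Commits return before WAL reaches disk; a crash can lose the last few
-- transactions that were reported committed, but cannot corrupt the database.
SET synchronous_commit = off;

-- Or cluster-wide (takes effect on reload, no restart needed):
ALTER SYSTEM SET synchronous_commit = off;
SELECT pg_reload_conf();
```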
Re: Connection forcibly closed remote server error.
aditya desai writes: > We are getting this error when transferring data using COPY command or > running workflow for huge data. We are using Password Authentication(LDAP) > "Connection forcibly closed remote server" Network timeout perhaps? If so, setting more-aggressive TCP keepalive parameters might help. regards, tom lane
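The server-side knobs for that are the tcp_keepalives settings (the values below are illustrative; zero means "use the operating system's default"):

```sql
ALTER SYSTEM SET tcp_keepalives_idle = 60;      -- seconds of idle before first probe
ALTER SYSTEM SET tcp_keepalives_interval = 10;  -- seconds between probes
ALTER SYSTEM SET tcp_keepalives_count = 5;      -- lost probes before dropping
SELECT pg_reload_conf();
```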
Re: Domain check taking place unnecessarily?
Mark Hills writes: > On Wed, 8 Feb 2023, Laurenz Albe wrote: >> It takes 30 seconds to scan the table and determine that all existing >> rows satisfy the constraint. > But there's no existing data (note this is adding column, not constraint) > Existing rows are guaranteed to satisfy the domain check, because the > domain check is guaranteed to be immutable (per [1] in my original mail) immutable != "will accept null". There could be some more optimizations here, perhaps, but there aren't. regards, tom lane
Re: ALTER STATEMENT getting blocked
aditya desai writes: > We have a Postgres 11.16 DB which is continuously connected to informatica > and data gets read from it continuously. > When we have to ALTER TABLE.. ADD COLUMN.. it gets blocked by the SELECTs > on the table mentioned by process above. > Is there any way to ALTER the table concurrently without getting blocked? > Any parameter or option? Can someone give a specific command? ALTER TABLE requires exclusive lock to do that, so it will queue up behind any existing table locks --- but then new lock requests will queue up behind its request. So this'd only happen if your existing reading transactions don't terminate. Very long-running transactions are unfriendly to other transactions for lots of reasons including this one; see if you can fix your application to avoid that. Or manually cancel the blocking transaction(s) after the ALTER begins waiting. regards, tom lane
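A common mitigation, not spelled out in the thread but consistent with the advice above, is to bound how long the ALTER will sit in the lock queue and retry it, so it never holds up new readers for long (table and column names are invented):

```sql
BEGIN;
SET LOCAL lock_timeout = '5s';  -- give up rather than queue indefinitely
ALTER TABLE mytable ADD COLUMN newcol integer;
COMMIT;
-- If this fails with "canceling statement due to lock timeout",
-- retry later; readers were only ever delayed by at most 5 seconds.
```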
Re: change the default value of enable_bitmapscan to off
"hehaoc...@hotmail.com" writes: > The default is enable_bitmapscan on. However, TPC-H.query17 get slower > running on my NVMe SSD (WD SN850) after switching on the parameter: latency > drop from 9secs to 16secs. During a B-tree Index Scan, bitmapscan > optimization converts random I/O into sequential. However, many users use > SSDs rather than HDDs. But they may not know the trick. Is there a > possibility that can change the default value to off? Use ALTER SYSTEM SET, or edit postgresql.conf: https://www.postgresql.org/docs/current/config-setting.html Note that changing planner parameters on the basis of a single query getting slower is a classic beginner error. You need to think about the totality of the installation's workload. regards, tom lane
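Concretely (ALTER SYSTEM requires superuser; pg_reload_conf makes the change take effect without a restart):

```sql
-- Cluster-wide, per the heading above -- but note the warning that
-- this affects every query, not just the one that got slower.
ALTER SYSTEM SET enable_bitmapscan = off;
SELECT pg_reload_conf();

-- Safer for a single problem query: disable it only in that session.
SET enable_bitmapscan = off;
```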
Re: When you really want to force a certain join type?
Gunther Schadow writes: > Also, why even use the RECURSIVE keyword, DB2 didn't need it, and the > query analyzer should immediately see the recursion, so no need to have > that keyword. Our reading of the SQL spec is that it's required. The scope of visibility of CTE names is different depending on whether you write RECURSIVE or not, so it's not a question of "the query analyzer should see it": the analyzer is required NOT to see it. DB2 generally has a reputation for agreeing with the spec, so I'm surprised to hear that they're not doing this per spec. regards, tom lane
Re: Fwd: temp_file_limit?
Frits Jalvingh writes: > Just to be clear: my real question is: why is temp_file_limit not working > at the specified size? I've not looked at that code lately, but I strongly suspect that it's implemented in such a way that it's a per-process limit, not a per-session limit. So each parallel worker could use up that much space. It's also possible that you've found an actual bug, but without a reproducer case nobody's going to take that possibility too seriously. We're unlikely to accept "the limit should work across multiple processes" as a valid bug though. That would require a vastly more complicated implementation. regards, tom lane
Re: JSON down performacen when id:1
"Render Comunicacion S.L." writes: > The issue: > When we search our locator with section_id: 1 (or any number < 4), PostgreSQL > takes around 4000, 5000, 8000ms or more. > When we search our locator with section_id: 4 (or any other bigger number), > PostgreSQL takes around 100 ms. ( ~ expected time) Your index is providing pretty awful performance: > -> Bitmap Heap Scan on matrix (cost=92.21..199.36 rows=27 > width=1144) (actual time=415.708..8325.296 rows=11 loops=1) > Recheck Cond: ((datos #> '{relations}'::text[]) @> > '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb) > Rows Removed by Index Recheck: 272037 > Heap Blocks: exact=34164 lossy=33104 > -> Bitmap Index Scan on matrix_relations_idx > (cost=0.00..92.20 rows=27 width=0) (actual time=61.462..61.462 rows=155031 > loops=1) > Index Cond: ((datos #> '{relations}'::text[]) @> > '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb) I read that as 155K hits delivered by the index, of which only 11 were real matches. To make matters worse, with so many hits the bitmap was allowed to become "lossy" (ie track some hits at page-level not tuple-level) to conserve memory, so that the executor actually had to check even more than 155K rows. You need a better index. It might be that switching to a jsonb_path_ops index would be enough to fix it, or you might need to build an expression index matched specifically to this type of query. See https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING Also, if any of the terminology there doesn't make sense, read https://www.postgresql.org/docs/current/indexes.html regards, tom lane
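A sketch of the suggested index change, using the expression that appears in the posted plan (the new index name is invented):

```sql
-- jsonb_path_ops GIN indexes support only the @> operator, but they
-- hash whole paths, so containment queries like this one produce far
-- fewer false index hits than the default jsonb_ops opclass.
CREATE INDEX matrix_relations_pathops_idx
    ON matrix USING gin ((datos #> '{relations}') jsonb_path_ops);

-- The query shape stays the same and can use the new index:
SELECT ... FROM matrix
WHERE (datos #> '{relations}') @>
      '[{"section_id": "1", "section_tipo": "numisdata3"}]'::jsonb;
```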
Re: Increased iowait and blk_read_time with higher shared_buffers
Jordan Hurwich writes: > I'm familiar with the article you linked to, and part of my surprise is > that with these 32GB RAM machines we're seeing better performance at 12.5% > (4GB) than the commonly recommended 25% (8GB) of system memory for > shared_buffers. Your notes about disk read stats from Postgres potentially > actually representing blocks read from the OS cache make sense, I just > imagined that Postgres would be better at managing the memory when it was > dedicated to it via shared_buffers than the OS (obviously with some point > of diminishing returns); and I'm still hoping there's some Postgres > configuration change we can make that enables better performance through > improved utilization of shared_buffers at the commonly recommended 25% of > system memory. Keep in mind that 25% was never some kind of golden number. It is a rough rule of thumb that was invented for far smaller machines than what you're talking about here. regards, tom lane
Re: Odd Choice of seq scan
Ronuk Raval writes: > We've been working around the problem by rewriting queries to use UNION > instead. Yeah, that. The real issue here is that the seqscan and indexscan plans both suck, because they will both run that sub-select for every row in the table. The index-only plan might fetch fewer blocks along the way, because it only has to read the index not the table proper ... but that's only true if the table's pages are mostly marked all-visible. (My bet about the plan instability is that the planner might choose differently depending on how much of the table it believes is all-visible.) That only helps a bit, though. What you really want to have happen, assuming there are not too many interesting orderid values, is to do a point indexscan for each one of them. Currently the planner won't think of that by itself when faced with OR'd conditions in WHERE. You have to help it along with UNION or some similar locution. regards, tom lane
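The OR-to-UNION rewrite alluded to above looks roughly like this (table and column names are invented for illustration):

```sql
-- Original shape: OR'd conditions, one of them a sub-select, so the
-- planner runs the sub-select's result as a filter over every row.
-- SELECT * FROM t WHERE col = (SELECT x FROM other WHERE id = 1) OR col = 42;

-- Rewritten: each arm can become a point index scan on col, and
-- UNION removes any duplicates between the two result sets.
SELECT * FROM t WHERE col = (SELECT x FROM other WHERE id = 1)
UNION
SELECT * FROM t WHERE col = 42;
```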
Re: Geometric types row estimation
Igor ALBUQUERQUE SILVA writes: > Thanks a lot for the explanation, I thought the built-in types were more > standard, so I didn't mention that I was having the same thing using > postgis. Hm --- you'd have to take that up with the PostGIS people. But they at least would be likely to have motivation to improve things. regards, tom lane
Re: Geometric types row estimation
Igor ALBUQUERQUE SILVA writes: > I'm having a problem regarding the point type/gist indexes. Here's a > minimal reproduction of it: > ... > What I was expecting is the first query to estimate 4 rows and the second > to estimate 1, like what I get If I try the same thing using integers. Unfortunately, the selectivity estimation functions for PG's geometric types are mostly just stubs. The estimation function for point <@ box in particular is contsel [1]: /* * contsel -- How likely is a box to contain (be contained by) a given box? * * This is a tighter constraint than "overlap", so produce a smaller * estimate than areasel does. */ Datum contsel(PG_FUNCTION_ARGS) { PG_RETURN_FLOAT8(0.001); } It's been like that (excepting notational changes) since Berkeley days, because nobody has bothered to make it better. In general, PG's built-in geometric types have never gotten much beyond their origins as an academic proof-of-concept. I think people who are doing serious work that requires such operations mostly use PostGIS, and I'd suggest looking into that. Or, if you feel like doing a lot of work to make these estimators better, have at it. regards, tom lane [1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=src/backend/utils/adt/geo_selfuncs.c;hb=HEAD
Re: Catching up with performance & PostgreSQL 15
Andres Freund writes: > On November 30, 2022 3:47:32 AM PST, Andrew Dunstan > wrote: >> I think Alvaro's point is that it would have been better to work out >> these wrinkles before turning on JIT by default. Based on anecdotal >> reports from the field I'm inclined to agree. > The problem is that back when it was introduced these problems didn't exist > to a significant degree. JIT was developed when partitioning was very > minimal- and the problems we're seeing are almost exclusively with queries > with many partitions. The problems really only started much more recently. It > also wasn't enabled in the first release.. Well, wherever you want to pin the blame, it seems clear that we have a problem now. And I don't think flipping back to off-by-default is the answer -- surely there is some population of users who will not be happy with that. We really need to prioritize fixing the cost-estimation problems, and/or tweaking the default thresholds. regards, tom lane
Re: Catching up with performance & PostgreSQL 15
Alvaro Herrera writes: > IMO it was a mistake to turn JIT on in the default config, so that's one > thing you'll likely want to change. I wouldn't necessarily go quite that far, but I do think that the default cost thresholds for invoking it are enormously too low, or else there are serious bugs in the cost-estimation algorithms for deciding when to use it. A nearby example[1] of a sub-1-sec partitioned query that took 30sec after JIT was enabled makes me wonder if we're accounting correctly for per-partition JIT costs. regards, tom lane [1] https://www.postgresql.org/message-id/B6025887-D73F-4B5B-9925-4DA4B675F7E5%40elevated-dev.com
Re: why choosing an hash index instead of the btree version even if the cost is lower?
Tomas Vondra writes: > On 11/18/22 13:15, Luca Ferrari wrote: >> I've a table with a numeric column (integer), and I've created two >> indexes on such column, one btree and one hash. The hash results much >> larger as the btree, but what puzzles me is that executing an equality >> simple query, the system chooses the hash index (that has a final cost >> of 8984.08 while the btree index would have a final cost a little >> lower (8901.94). >> >> The only difference I can spot in the EXPLAIN plans is that the btree >> index has an initial cost, but I don't think this is the reason, since >> it should be the final cost what matters, right? > My guess is this is due to STD_FUZZ_FACTOR, see [1] and [2]. > That is, when comparing costs, we require the cost to be at least 1%, > because we have a cheapest path, and we're checking if it's worth > building another one (which is not free - we have to allocate stuff > etc.). And if the difference is tiny, it's not worth it. Even more to the point: if the total costs are fuzzily the same, then the next point of comparison will be the startup costs, which is where the hash index wins. I'm not sure if it's quite fair to give hash a zero startup cost; but it doesn't have to descend a search tree, so it is fair that its startup cost is less than btree's. regards, tom lane
Re: When can joins be avoided?
Stefan Fehrenbach writes: > I'm interested in the rules that Postgres uses to avoid joins. Are > these documented somewhere? If I had to look at the source code, where > would I start? src/backend/optimizer/plan/analyzejoins.c > They don't seem to match my intuition about which joins > could be avoided. I believe only left joins to single tables can be elided ATM. It's too hard to prove uniqueness of the join key in more- complicated cases. regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes: > Oh, one comment there is actually obsolete now AFAIK. Unless there is > some reason to think personality(ADDR_NO_RANDOMIZE) might not work in > some case where sysctl -w kernel.randomize_va_space=0 will, I think we > can just remove that. AFAICS, f3e78069db7 silently does nothing on platforms lacking ADDR_NO_RANDOMIZE and PROC_ASLR_FORCE_DISABLE. Are you asserting there are no such platforms? (I'm happy to lose the comment if it's really useless now, but I think we have little evidence of that.) regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes: > On Tue, Aug 23, 2022 at 4:57 AM Tom Lane wrote: > +service the requests, with those clients receiving unhelpful > +connection failure errors such as Resource temporarily > +unavailable. > LGTM but I guess I would add "... or Connection refused"? Is that the spelling that appears on FreeBSD? Happy to add it. regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Junwang Zhao writes: > Just curious, *backlog* defines the maximum pending connections, > why do we need to double the MaxConnections as the queue size? The postmaster allows up to twice MaxConnections child processes to exist, per the comment in canAcceptConnections: * We allow more connections here than we can have backends because some * might still be authenticating; they might fail auth, or some existing * backend might exit before the auth cycle is completed. The exact * MaxBackends limit is enforced when a new backend tries to join the * shared-inval backend array. You can argue that 2X might not be the right multiplier, and you can argue that the optimal listen queue length might be more or less than the limit on number of child processes, but that's how we've historically done it. I'm not especially interested in changing that without somebody making a well-reasoned case for some other number. regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
OK, here's some proposed patches. 0001 adds a para about how to raise the listen queue length. 0002 isn't quite related, but while writing 0001 I noticed a nearby use of /proc/sys/... which I thought should be converted to sysctl. IMO /proc/sys pretty much sucks, at least for documentation purposes, for multiple reasons: * It's unlike the way you do things on other platforms. * "man sysctl" will lead you to useful documentation about how to use that command. There's no obvious way to find documentation about /proc/sys. * It's not at all sudo-friendly. Compare sudo sh -c 'echo 0 >/proc/sys/kernel/randomize_va_space' sudo sysctl -w kernel.randomize_va_space=0 The former is a lot longer and it's far from obvious why you have to do it that way. * You have to think in sysctl terms anyway if you want to make the setting persist across reboots, which you almost always do. * Everywhere else in runtime.sgml, we use sysctl not /proc/sys. 0003 removes PG_SOMAXCONN. While doing that I noticed that this computation hadn't been touched throughout all the various changes fooling with exactly what gets counted in MaxBackends. I think the most appropriate definition for the listen queue length is now MaxConnections * 2, not MaxBackends * 2, because the other processes counted in MaxBackends don't correspond to incoming connections. I propose 0003 for HEAD only, but the docs changes could be back-patched. regards, tom lane diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 963b18ed85..1192faa6ae 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -1298,6 +1298,22 @@ default:\ linkend="guc-max-files-per-process"/> configuration parameter to limit the consumption of open files. + + +Another kernel limit that may be of concern when supporting large +numbers of client connections is the maximum socket connection queue +length. 
If more than that many connection requests arrive within a +very short period, some may get rejected before the postmaster can +service the requests, with those clients receiving unhelpful +connection failure errors such as Resource temporarily +unavailable. The default queue length limit is 128 on many +platforms. To raise it, adjust the appropriate kernel parameter +via sysctl, then restart the postmaster. +The parameter is variously named net.core.somaxconn +on Linux, kern.ipc.soacceptqueue on newer FreeBSD, +and kern.ipc.somaxconn on macOS and other BSD +variants. + diff --git a/doc/src/sgml/runtime.sgml b/doc/src/sgml/runtime.sgml index 963b18ed85..1192faa6ae 100644 --- a/doc/src/sgml/runtime.sgml +++ b/doc/src/sgml/runtime.sgml @@ -1258,11 +1258,12 @@ default:\ - On Linux - /proc/sys/fs/file-max determines the - maximum number of open files that the kernel will support. It can - be changed by writing a different number into the file or by - adding an assignment in /etc/sysctl.conf. + On Linux the kernel parameter + fs.file-max determines the maximum number of open + files that the kernel will support. It can be changed with + sysctl -w fs.file-max=N. + To make the setting persist across reboots, add an assignment + in /etc/sysctl.conf. 
The maximum limit of files per process is fixed at the time the kernel is compiled; see /usr/src/linux/Documentation/proc.txt for diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 8a038d1b2a..1664fcee2a 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -4891,7 +4891,7 @@ SubPostmasterMain(int argc, char *argv[]) * If testing EXEC_BACKEND on Linux, you should run this as root before * starting the postmaster: * - * echo 0 >/proc/sys/kernel/randomize_va_space + * sysctl -w kernel.randomize_va_space=0 * * This prevents using randomized stack and code addresses that cause the * child process's memory map to be different from the parent's, making it diff --git a/src/backend/libpq/pqcomm.c b/src/backend/libpq/pqcomm.c index 8ff3be611d..7112e9751b 100644 --- a/src/backend/libpq/pqcomm.c +++ b/src/backend/libpq/pqcomm.c @@ -537,13 +537,11 @@ StreamServerPort(int family, const char *hostName, unsigned short portNumber, } /* - * Select appropriate accept-queue length limit. PG_SOMAXCONN is only - * intended to provide a clamp on the request on platforms where an - * overly large request provokes a kernel error (are there any?). + * Select appropriate accept-queue length limit. It seems reasonable + * to use a value similar to the maximum number of child processes + * that the postmaster will permit. */ - maxconn = MaxBackends * 2; - if (maxconn > PG_SOMAXCONN) - maxconn = PG_SOMAXCONN; + maxconn = MaxConnections * 2; err = lis
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes: > Cool. BTW small correction to something I said about FreeBSD: it'd be > better to document the new name kern.ipc.soacceptqueue (see listen(2) > HISTORY) even though the old name still works and matches OpenBSD and > macOS. Thanks. Sounds like we get to document at least three different sysctl names for this setting :-( regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes: > On Mon, Aug 22, 2022 at 12:20 PM Tom Lane wrote: >> Hmm. It'll be awhile till the 128 default disappears entirely >> though, especially if assorted BSDen use that too. Probably >> worth the trouble to document. > I could try to write a doc patch if you aren't already on it. I haven't done anything about it yet, but could do so tomorrow or so. (BTW, I just finished discovering that NetBSD has the same 128 limit. It looks like they intended to make that settable via sysctl, because it's a variable not a constant; but they haven't actually wired up the variable to sysctl yet.) > Oh, right. Looks like that was just paranoia in commit 153f4006763, > back when you got away from using the (very conservative) SOMAXCONN > macro. Looks like that was 5 on ancient systems going back to the > original sockets stuff, and later 128 was a popular number. Yeah I'd > say +1 for removing our cap. I'm pretty sure every system will > internally cap whatever value we pass in if it doesn't like it, as > POSIX explicitly says it can freely do with this "hint". Yeah. I hadn't thought to check the POSIX text, but their listen(2) page is pretty clear that implementations should *silently* reduce the value to what they can handle, not fail. Also, SUSv2 says the same thing in different words, so the requirement's been that way for a very long time. I think we could drop this ancient bit of paranoia. > ... Hmm, maybe we could add a hint to the error, > though? libpq doesn't really have a notion of hints --- perhaps we ought to fix that sometime. But this doesn't seem like a very exciting place to start, given the paucity of prior complaints. (And anyway people using other client libraries wouldn't be helped.) I think some documentation in the "Managing Kernel Resources" section should be plenty for this. regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes: > Yeah retrying doesn't seem that nice. +1 for a bit of documentation, > which I guess belongs in the server tuning part where we talk about > sysctls, perhaps with a link somewhere near max_connections? More > recent Linux kernels bumped it to 4096 by default so I doubt it'll > come up much in the future, though. Hmm. It'll be awhile till the 128 default disappears entirely though, especially if assorted BSDen use that too. Probably worth the trouble to document. > Note that we also call listen() > with a backlog value capped to our own PG_SOMAXCONN which is 1000. I > doubt many people benchmark with higher numbers of connections but > it'd be nicer if it worked when you do... Actually it's 10000. Still, I wonder if we couldn't just remove that limit now that we've desupported a bunch of stone-age kernels. It's hard to believe any modern kernel can't defend itself against silly listen-queue requests. regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Thomas Munro writes: > If it's something like that, maybe increasing > /proc/sys/net/core/somaxconn would help? I think older kernels only > had 128 here. Bingo! I see $ cat /proc/sys/net/core/somaxconn 128 by default, which is right about where the problem starts. After $ sudo sh -c 'echo 1000 >/proc/sys/net/core/somaxconn' *and restarting the PG server*, I can do a lot more threads without a problem. Evidently, the server's socket's listen queue length is fixed at creation and adjusting the kernel limit won't immediately change it. So what we've got is that EAGAIN from connect() on a Unix socket can mean "listen queue overflow" and the kernel won't treat that as a nonblock-waitable condition. Still seems like a kernel bug perhaps, or at least a misfeature. Not sure what I think at this point about making libpq retry after EAGAIN. It would make sense for this particular undocumented use of EAGAIN, but I'm worried about others, especially the documented reason. On the whole I'm inclined to leave the code alone; but is there sufficient reason to add something about adjusting somaxconn to our documentation? regards, tom lane
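For reference, the adjustment tested above can be made persistent; a sketch of the config change (assumes Linux with sysctl, and note the PG server must be restarted since the listen queue length is fixed when the socket is created):

```
# Raise the accept-queue limit now:
sysctl -w net.core.somaxconn=1024

# And make it survive reboots:
echo 'net.core.somaxconn = 1024' >> /etc/sysctl.d/postgres.conf
```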
Re: pgbench: could not connect to server: Resource temporarily unavailable
Andrew Dunstan writes: > On 2022-08-21 Su 17:15, Tom Lane wrote: >> On the whole this is smelling more like a Linux kernel bug than >> anything else. > *nod* Conceivably we could work around this in libpq: on EAGAIN, just retry the failed connect(), or maybe better to close the socket and take it from the top with the same target server address. On the one hand, reporting EAGAIN certainly sounds like an invitation to do just that. On the other hand, if the failure is persistent then libpq is locked up in a tight loop --- and "Insufficient entries in the routing cache" doesn't seem like a condition that would clear immediately. It's also pretty unclear why the kernel would want to return EAGAIN instead of letting the nonblock connection path do the waiting, which is why I'm suspecting a bug rather than designed behavior. I think I'm disinclined to install such a workaround unless we get confirmation from some kernel hacker that it's operating as designed and application-level retry is intended. regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Andrew Dunstan writes: > On 2022-08-20 Sa 23:20, Tom Lane wrote: >> Kevin McKibbin writes: >>> What's limiting my DB from allowing more connections? > The first question in my mind from the above is where this postgres > instance is actually listening. Is it really /var/run/postgresql? Its > postmaster.pid will tell you. I have often seen client programs pick up > a system libpq which is compiled with a different default socket directory. I wouldn't think that'd explain a symptom of some connections succeeding and others not within the same pgbench run. I tried to duplicate this behavior locally (on RHEL8) and got something interesting. After increasing the server's max_connections to 1000, I can do $ pgbench -S -c 200 -j 100 -t 100 bench and it goes through fine. But: $ pgbench -S -c 200 -j 200 -t 100 bench pgbench (16devel) starting vacuum...end. pgbench: error: connection to server on socket "/tmp/.s.PGSQL.5440" failed: Resource temporarily unavailable Is the server running locally and accepting connections on that socket? pgbench: error: could not create connection for client 154 So whatever is triggering this has nothing to do with the server, but with how many threads are created inside pgbench. I notice also that sometimes it works, making it seem like possibly a race condition. Either that or there's some limitation on how fast threads within a process can open sockets. Also, I determined that libpq's connect() call is failing synchronously (we get EAGAIN directly from the connect() call, not later). I wondered if libpq should accept EAGAIN as a synonym for EINPROGRESS, but no: that just makes it fail on the next touch of the socket. The only documented reason for connect(2) to fail with EAGAIN is EAGAIN Insufficient entries in the routing cache. which seems pretty unlikely to be the issue here, since all these connections are being made to the same local address. On the whole this is smelling more like a Linux kernel bug than anything else. 
regards, tom lane
Re: pgbench: could not connect to server: Resource temporarily unavailable
Kevin McKibbin writes: > What's limiting my DB from allowing more connections? > This is a sample of the output I'm getting, which repeats the error 52 > times (one for each failed connection) > -bash-4.2$ pgbench -c 200 -j 200 -t 100 benchy > ... > connection to database "benchy" failed: > could not connect to server: Resource temporarily unavailable > Is the server running locally and accepting > connections on Unix domain socket > "/var/run/postgresql/.s.PGSQL.5432"? This is apparently a client-side failure not a server-side failure (you could confirm that by seeing whether any corresponding failure shows up in the postmaster log). That means that the kernel wouldn't honor pgbench's attempt to open a connection, which implies you haven't provisioned enough networking resources to support the number of connections you want. Since you haven't mentioned what platform this is on, it's impossible to say more than that --- but it doesn't look like Postgres configuration settings are at issue at all. regards, tom lane
Re: PostgresSQL 9.5.21 very slow to connect and perform basic queries
bruno da silva writes: > *Question: *Besides the gc issue that you mentioned, having a large ( 700MB > or 1GB ) pgss_query_texts.stat could cause slowness in pg_stat_statement > processing > than leading to slower query responses with a 32bit PG? I'm thinking in > reducing pg_stat_statements.max from 10k to 3k Whether or not we've fully identified the problem, I think cutting pg_stat_statements.max is a good idea. Especially as long as you're stuck on an unsupported PG version. regards, tom lane
Re: PostgresSQL 9.5.21 very slow to connect and perform basic queries
I wrote: > bruno da silva writes: >> Do you have a lot of especially long statements being tracked >> in the pg_stat_statements view?* well, the view was showing the query >> column null.* >> * but looking on pgss_query_texts.stat there are very large sql >> statements, of around ~ 400kb, multiple thousands. * I see one possible piece of the puzzle here: since you're using a 32-bit build, overflowing size_t is a reachable hazard. Specifically, in this test to see if we need to garbage-collect the query text file: if (extent < pgss->mean_query_len * pgss_max * 2) return false; You said earlier that pg_stat_statements.max = 10000, so a mean_query_len exceeding about 2^32 / 10000 / 2 = 214748.3648 would be enough to overflow size_t and break this comparison. Now, a mean SQL query length in excess of 200kB sounds mighty improbable, but it's really the mean length of the query texts in the view. If your "normal" queries fall into just a few patterns they might be represented by a relatively small number of view entries. And if the "big" queries are sufficiently not alike, they might each get their own view entry, which could potentially drive the mean high enough to cause trouble. It'd be interesting to track what "SELECT avg(length(query)) FROM pg_stat_statements" gives. However, even if we grant that mean_query_len is that big, overflow here would make garbage collection of the query text file more likely not less so. What I'm speculating is that overflow is occurring and causing all processes to decide they need to run gc_qtexts() every time they insert a new query entry, even though the query texts file isn't actually bloated. That could possibly explain your performance issues: a garbage collection pass over a multi-gig file will take awhile, and what's worse is that it's done under an exclusive lock, meaning that all the backends stack up waiting their turn to perform a useless GC pass.
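The wraparound hazard described here is easy to model with plain integer arithmetic. A sketch (mine, not the pg_stat_statements source), treating 32-bit size_t as arithmetic modulo 2**32:

```python
# Model of the gc test on a 32-bit build:
#   if (extent < pgss->mean_query_len * pgss_max * 2) return false;
# where the right-hand product silently wraps at 2**32.
SIZE_T_WRAP = 2**32
pgss_max = 10000

# Mean query length at which the product first wraps:
threshold = SIZE_T_WRAP // (pgss_max * 2)
print(threshold)                      # 214748

# Just past the threshold, the product wraps to a tiny number, so
# "extent < product" is false for almost any extent and every new
# entry decides it must garbage-collect:
mean_query_len = threshold + 1
product = (mean_query_len * pgss_max * 2) % SIZE_T_WRAP
print(product)                        # 12704
extent = 2_200_000_000                # the reported 2.2GB text file
print(extent < product)               # False -> gc_qtexts() runs
```

So a mean text length just over ~210kB flips the comparison from "almost never collect" to "collect on every insert," matching the speculation above.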
What this doesn't explain is why the condition doesn't clear once you observe one of those "out of memory" complaints, because that should lead to truncating the texts file. Maybe it does get truncated, but then the cycle repeats after awhile? If you have a steady stream of incoming new 400kB queries, you could build back up to 2.2GB of text after five thousand or so of those. I'm also curious whether this installation is in the habit of doing pg_stat_statements_reset() a lot. It looks like that fails to reset mean_query_len, which might be intentional but perhaps it could play into getting a silly result here later on. regards, tom lane
Re: PostgresSQL 9.5.21 very slow to connect and perform basic queries
bruno da silva writes: > Do you have a lot of especially long statements being tracked > in the pg_stat_statements view?* well, the view was showing the query > column null.* > * but looking on pgss_query_texts.stat there are very large sql > statements, of around ~ 400kb, multiple thousands. * Hm. We try to recover from such failures by (a) resetting all the view's query text fields to null and (b) truncating the file --- well, unlinking it and creating it as empty. It seems like (a) happened and (b) didn't. It's pretty hard to explain that from the code though. Are you quite sure this is a 9.5.21 version of the pg_stat_statements extension? Is it possible that the pg_stat_tmp directory has been made non-writable? regards, tom lane > Are there any other signs of distress > in the postmaster log, like complaints about being unable to write > pgss_query_texts.stat? *no, just complaints for reading it. * > Thanks > On Tue, Aug 2, 2022 at 11:59 AM Tom Lane wrote: >> bruno da silva writes: > After more investigation, we found that pgss_query_texts.stat of a size >> of > 2.2GB. and this deployment has a 32bit pg. >> >> Hm ... we've heard one previous report of pg_stat_statements' query text >> file getting unreasonably large, but it's not clear how that can come >> to be. Do you have a lot of especially long statements being tracked >> in the pg_stat_statements view? Are there any other signs of distress >> in the postmaster log, like complaints about being unable to write >> pgss_query_texts.stat? >> >> regards, tom lane >> > -- > Bruno da Silva
Re: PostgresSQL 9.5.21 very slow to connect and perform basic queries
bruno da silva writes: > After more investigation, we found that pgss_query_texts.stat of a size of > 2.2GB. and this deployment has a 32bit pg. Hm ... we've heard one previous report of pg_stat_statements' query text file getting unreasonably large, but it's not clear how that can come to be. Do you have a lot of especially long statements being tracked in the pg_stat_statements view? Are there any other signs of distress in the postmaster log, like complaints about being unable to write pgss_query_texts.stat? regards, tom lane
Re: alter table xxx set unlogged take long time
Joe Conway writes: > Then (completely untested) I *think* you could create the "partition" > initially as a free standing unlogged table, load it, index it, switch > to logged, and then attach it to the partitioned table. I'm still of the opinion that this plan to load the data unlogged and switch to logged later is a loser. Sooner or later you have got to write the data to WAL, and this approach doesn't eliminate that cost. What it does do is create one whole extra cycle of writing the data to disk and reading it back. I don't think it's an oversight that no such thing is suggested in our standard tips for bulk-loading data: https://www.postgresql.org/docs/current/populate.html What perhaps *is* an oversight is that we don't suggest use of COPY FREEZE there. AFAIK that doesn't reduce the initial data loading cost directly, but it would save overhead later. regards, tom lane
Re: alter table xxx set unlogged take long time
"James Pang (chaolpan)" writes: > How to make it fast ? These are our steps about copy large data from Oracle > to Postgres > 1. Create table in Postgres 2. Extract data from Oracle to CSV 3. Alter > table set xxx unlogged, 4. Run copy command into Postgres db 5. Alter > table set xxx logged 6. Create index … The easy answer is to skip steps 3 and 5. regards, tom lane
Re: partition pruning only works for select but update
Justin Pryzby writes: > On Fri, Jul 01, 2022 at 08:30:40AM +0000, James Pang (chaolpan) wrote: >> We have other application depend on V13, possible to backport code changes >> to V13 as >> https://git.postgresql.org/gitweb/?p=postgresql.git;a=patch;h=86dc90056dfdbd9d1b891718d2e5614e3e432f35 > In any case, I'm sure the feature won't be backpatched to v13 - it's an > improvement but not a bugfix. Even more to the point, it was an extremely major change and would take a huge amount of QA effort to ensure that dropping it into v13 wouldn't cause fresh problems. The PG community has exactly no interest in making such effort. Besides which, what do you imagine "depends on v13" actually means? If you have an app that works on v13 but not v14, maybe it's because it depends on the old behavior in some way. Spend your effort on updating your app, instead. regards, tom lane
Re: partition pruning only works for select but update
"James Pang (chaolpan)" writes: > But when > Explain update table set .. where partitionkey between to_timestamp() and > to_timestamp(); > It still show all of partitions with update ... In releases before v14, partition pruning is far stupider for UPDATE (and DELETE) than it is for SELECT. regards, tom lane
Re: Adding non-selective key to jsonb query @> reduces performance?
Marcin Krupowicz writes: > However this one, is slow: > Q2 > select count(*) from tbl where row @> '{"SELECTIVE_COL": > "SearchValue", "DATE": "20220606", "NON_SELECTIVE_COL": "Abc"}'::jsonb > It takes 17ms > Note that the only difference is adding one more - not very unique - > key. If in Q2 I replaced NON_SELECTIVE_COL with another selective > column, it's becoming fast again. This doesn't surprise me a whole lot based on what I know of GIN. It's going to store sets of TIDs associated with each key or value mentioned in the data, and then a query will have to AND the sets of TIDs for keys/values mentioned in the query. That will take longer when some of those sets are big. It might be worth experimenting with an index built using the non-default jsonb_path_ops opclass [1]. I'm not sure if that'd be faster for this scenario, but it seems worth trying. regards, tom lane [1] https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
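The suggested experiment might look like the following sketch (table and column names taken from the quoted query; whether it helps this workload is exactly what would need testing):

```sql
-- jsonb_path_ops indexes only hashes of complete paths, which tends to
-- produce smaller GIN indexes and can speed up @> containment queries,
-- at the cost of not supporting key-existence operators (?, ?|, ?&).
CREATE INDEX tbl_row_path_idx ON tbl USING gin ("row" jsonb_path_ops);
```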
Re: Query is taking too long i intermittent
Justin Pryzby writes: > On Mon, Jun 06, 2022 at 03:28:43PM +0530, Mayank Kandari wrote: >> SELECT event_id FROM event WHERE (event_sec > time.seconds) OR >> ((event_sec=time.seconds) AND (event_usec>=time.useconds) ORDER BY >> event_sec, event_usec LIMIT 1 > I think it'd be better if the column was a float storing the fractional number > of seconds. Currently, it may be hard for the planner to estimate rowcounts > if > the conditions are not independent. I don't know if it's related to this > problem, though. Also, even if you can't change the data representation, there's a more idiomatic way to do that in SQL: use a row comparison. SELECT ... WHERE row(event_sec, event_usec) >= row(time.seconds, time.useconds) ... I doubt this is notably more execution-efficient, but if you're getting a bad rowcount estimate it should help with that. It's easier to read too. regards, tom lane
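SQL row comparison is lexicographic, the same rule Python applies to tuples, which makes the semantics easy to illustrate outside the database (a sketch of mine, not from the thread):

```python
# row(event_sec, event_usec) >= row(seconds, useconds) compares the
# first fields, falling through to the second only on a tie --
# exactly Python's tuple comparison.
def row_ge(event, threshold):
    """Mirror of SQL: row(event_sec, event_usec) >= row(sec, usec)."""
    return event >= threshold

print(row_ge((100, 5), (100, 5)))   # equal sec and usec    -> True
print(row_ge((100, 4), (100, 5)))   # same sec, earlier usec -> False
print(row_ge((101, 0), (100, 5)))   # later sec wins outright -> True
```

The single row comparison also gives the planner one composite condition to estimate instead of an OR of two correlated ones.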
Re: postgres backend process hang on " D " state
"James Pang (chaolpan)" writes: > 1. extensions > shared_preload_libraries = > 'orafce,pgaudit,pg_cron,pg_stat_statements,set_user' Can you still reproduce this if you remove all of those? regards, tom lane
Re: rows selectivity overestimate for @> operator for arrays
Alexey Ermakov writes: > so if we store mostly one element in array and they're almost all > distinct then in tables with more then stats_target/0.0063 (~1.58M for > maximum stats target 10000) rows we'll get 0.005 constant for selectivity. Yeah. There's a comment in array_selfuncs.c about * TODO: this estimate probably could be improved by using the distinct * elements count histogram. For example, excepting the special case of * "column @> '{}'", we can multiply the calculated selectivity by the * fraction of nonempty arrays in the column. but I'm not sure whether that's relevant here. One thought is that if there is a pg_statistic row but it contains no MCE list, we could assume that the column elements are all distinct and see what sort of estimate that leads us to. regards, tom lane
Re: Why is there a Sort after an Index Only Scan?
=?UTF-8?Q?Andr=C3=A9_H=C3=A4nsel?= writes: > Limit (cost=0.43..296.63 rows=50 width=4) (actual time=1052.692..1052.737 > rows=50 loops=1) > -> Index Scan using orders_test_pkey on orders_test (cost=0.43..71149.43 > rows=12010 width=4) (actual time=1052.690..1052.728 rows=50 loops=1) > Filter: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= > '2022-05-01'::date)) > Rows Removed by Filter: 1998734 > Is it overestimating the cost of the sorting? No, but it's guessing it will hit 50 rows that satisfy the filter before it's gone very far in this index. If the shipping date and pkey are correlated in the wrong direction, that could be a very overoptimistic guess. I don't think we have adequate stats yet to detect this sort of problem. regards, tom lane
Re: LISTEN NOTIFY sometimes huge delay
"Peter Eser HEUFT [Germany]" writes: > I have a table with time series data and on this table a trigger for > notifies: > containers_notify AFTER INSERT ON containers FOR EACH ROW EXECUTE > PROCEDURE containers_notify('containers_notify_collector') > and the function does: > PERFORM pg_notify(CAST(TG_ARGV[0] AS text), row_to_json(NEW)::text); > This works as a charm but occasionally (I think with more load on the > system) the notifications are received much time (up to hours!) after > the INSERTs. > Nevertheless no notifications become lost, they are only very late! The > delay grows, seems as a queue grows, but the java process tries to fetch > the notifications fairly fast, Hm. We've not previously had reports of late notifications. One idea that comes to mind is that the server won't deliver notifications as long as the client has an open transaction, so is it possible your listening process sometimes forgets to close its transaction? > Versions: > PostgreSQL 10.12 on x86_64-pc-linux-gnu, compiled by > x86_64-pc-linux-gnu-gcc (Gentoo 6.4.0-r1 p1.3) 6.4.0, 64-bit > JDBC 42.2.23 That's pretty old. We've made a number of changes to the LISTEN/NOTIFY code since then; although in reading the commit log entries about them, nothing is said about long-delayed notifications. regards, tom lane
Re: Unworkable plan above certain row count
=?iso-8859-1?Q?Andr=E9_H=E4nsel?= writes: > Now if you change > INSERT INTO a(a_id) SELECT generate_series(1, 200000); > to > INSERT INTO a(a_id) SELECT generate_series(1, 2000000); > i.e. add a zero, the plan becomes [ not a hashed subplan ] Yeah, it won't hash the subplan if the estimated size of the hash table exceeds work_mem. In this case, boosting work_mem would be a mighty good idea. regards, tom lane
Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.
Emil Iggland writes: > The query that is giving us issues is the following, channel 752433 has > NO values, 752431 has values. > (Channel 752433 only has valueseries 752434) > select * from datavalue > where dataview in ( select id from valueseries where channel = > %channel_idx%) > ORDER BY VALUETIMESTAMP DESC > FETCH FIRST ROW only; > Running explain analyze shows strange numbers, 52'000 rows are being > returned but there are no rows there. > For channel 752433 > -> Index Scan Backward using > datavalue_2022_03_valuetimestamp_dataview_idx on datavalue_2022_03 > datavalue_6 (cost=0.42..7166.19 rows=119673 width=226) (actual > time=0.008..32.831 rows=119601 loops=1) You've got the wrong column order (for this query anyway) in that index. It'd work a lot better if dataview were the first column; or at least, it wouldn't tempt the planner to try this unstably- performing plan. It's trying to use the index ordering to satisfy the ORDER BY, which works great as long as it finds a dataview match in some reasonably recent index entry. Otherwise, it's going to crawl the whole index to discover that there's no match. regards, tom lane
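The fix suggested above might be sketched like this (index and table names taken from the quoted plan; adapt to the real partitioned parent as needed):

```sql
-- With dataview leading, the backward scan can seek straight to the
-- matching channel's rows and still return them newest-first, instead
-- of crawling the whole valuetimestamp ordering hunting for a match.
CREATE INDEX ON datavalue_2022_03 (dataview, valuetimestamp DESC);
```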
Re: How to find the final transformed query in postgresql
Goti writes: > I would like to know if postgres performs any transformations when it does > the parsing? This might be helpful reading: https://www.postgresql.org/docs/current/overview.html > If yes, is there a way we can get the final transformed query? See debug_print_parse and friends [1]. Depending on what you mean by "final transformed query", you might instead want debug_print_rewritten, or maybe you want the plan, in which case EXPLAIN is a much friendlier way to look at it than debug_print_plan. regards, tom lane [1] https://www.postgresql.org/docs/current/runtime-config-logging.html#RUNTIME-CONFIG-LOGGING-WHAT
Re: Query Planner not taking advantage of HASH PARTITION
Benjamin Tingle writes: > Interesting. Why is it impossible to prune hash partitions? Maybe prune > isn’t the best word, more so use to advantage. At the very least, it should > be possible to utilize a parallel insert against a table partitioned by > hash. (Partition query rows, then distribute these rows to parallel workers) Your plan-shape complaint had nothing to do with insertions; it had to do with joining the partitioned table to another table. That join can't be optimized. regards, tom lane
Re: Query Planner not taking advantage of HASH PARTITION
Benjamin Tingle writes: > I've recently started taking advantage of the PARTITION BY HASH feature for > my database system. It's a really great fit since my tables can get quite > large (900M+ rows for some) and splitting them up into manageable chunks > should let me upload to them without having to update an enormous index > every time. What's more, since each partition has a write lock independent > of the parent table, it should theoretically be possible to perform a > parallelized insert operation, provided the data to be added is partitioned > beforehand. > What has been disappointing is that the query planner doesn't seem to > recognize this potential. That's because there isn't any. The hash partitioning rule has basically nothing to do with any plausible WHERE condition. If you're hoping to see partition pruning happen, you need to be using list or range partitions, with operators compatible with your likely WHERE conditions. (I'm of the opinion that the hash partitioning option is more in the category of a dangerous nuisance than a useful feature. There are some around here who will argue otherwise, but they're wrong for exactly the reason that it's impossible to prune hash partitions.) regards, tom lane
Re: Slow Planning Times
"David G. Johnston" writes: > On Wed, Apr 6, 2022 at 5:27 PM Saurabh Sehgal wrote: >> I have the following query: >> I don't think it is super complex. But when I run explain analyze on this >> I get the following: >> Planning Time: 578.068 ms >> Execution Time: 0.113 ms > The fundamental issue here is that you have basically 12 conditions across > 5 tables that need to be evaluated to determine which one of the 1,680 > possible join orders is the most efficient. A 5-way join doesn't seem particularly outrageous. But I'm wondering if these are all plain tables or if some of them are actually complex views. Another possibility is that the statistics target has been cranked to the moon and the planner is spending all its time sifting through huge statistics arrays. It'd be interesting to see the actual schemas for the tables, as well as EXPLAIN's output for this query. I'm wondering exactly which PG version this is, too. regards, tom lane
Re: XA transactions much slower on 14.2 than on 13.5
Mladen Gogala writes: > My boss asked me to upgrade one of the development databases from 13.5 > --> 14.2. One thing that we've noticed right away is that XA > transactions (2-phase commit) are much slower on 14.2 than on 13.5. Were > there any significant changes to the XA protocol in the version 14? Did > anybody else encountered this problem? There were a bunch of changes around the 2PC code to support logical replication of 2PC transactions, but I don't think they should have made for any particular performance difference in non-replicated servers. Can you put together a self-contained test case that demonstrates what you're seeing? regards, tom lane
Re: Any way to speed up INSERT INTO
aditya desai writes: > One of the service layer app is inserting Millions of records in a table > but one row at a time. Although COPY is the fastest way to import a file in > a table. Application has a requirement of processing a row and inserting it > into a table. Is there any way this INSERT can be tuned by increasing > parameters? It is taking almost 10 hours for just 2.2 million rows in a > table. Table does not have any indexes or triggers. Using a prepared statement for the INSERT would help a little bit. What would help more, if you don't expect any insertion failures, is to group multiple inserts per transaction (ie put BEGIN ... COMMIT around each batch of 100 or 1000 or so insertions). There's not going to be any magic bullet that lets you get away without changing the app, though. It's quite possible that network round trip costs are a big chunk of your problem, in which case physically grouping multiple rows into each INSERT command (... or COPY ...) is the only way to fix it. But I'd start with trying to reduce the transaction commit overhead. regards, tom lane
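The batching advice can be sketched as follows. The helper is mine and the driver calls in the comment are stand-ins for the app's real database layer, not a specific API:

```python
# Group rows so each transaction commits once per batch instead of
# once per row -- the commit overhead is what dominates at one
# row per transaction.
def batches(rows, size):
    """Yield successive lists of at most `size` rows."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# With a typical driver, usage would look roughly like:
#   for batch in batches(rows, 1000):
#       with connection:                 # one BEGIN/COMMIT per batch
#           for row in batch:
#               cursor.execute(insert_sql, row)

print([len(b) for b in batches(range(2500), 1000)])   # [1000, 1000, 500]
```

Combining this with a prepared statement, or folding each batch into one multi-row INSERT, additionally cuts the per-statement and network round-trip costs mentioned above.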
Re: OOM killer while pg_restore
=?UTF-8?Q?Marc_Recht=c3=a9?= writes: > Le 03/03/2022 à 16:31, Tom Lane a écrit : >> Does memory consumption hold steady if you drop the FK constraints? > Actually the number of rows is 232735712. > Accordingly the RAM consumption would be x12 x3 = 7.8 GiB. > This is close to the 8,1g I reported earlier (actually it was closer to > 7.8 GB, due to GiB vs. GB confusion). > So there is no memory leak. > It took 16 hours on my box to reach that RAM consumption, and then the > COPY failed when checking the first FK (as the referenced table was empty). I'm guessing it was swapping like mad :-( We've long recommended dropping FK constraints during bulk data loads, and then re-establishing them later. That's a lot cheaper than retail validity checks, even without the memory-consumption angle. Ideally that sort of behavior would be automated, but nobody's gotten that done yet. (pg_restore does do it like that during a full restore, but not for a data-only restore, so I guess you were doing the latter.) regards, tom lane
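The arithmetic behind that figure is easy to check (a sketch using the numbers quoted in the thread: ~12 bytes of queued trigger event per FK per row, 3 FK constraints, 232,735,712 rows):

```python
# Queued-trigger-event memory estimate for the FK checks:
rows = 232_735_712
bytes_per_fk_per_row = 12
fk_count = 3

bytes_total = rows * bytes_per_fk_per_row * fk_count
print(bytes_total)             # 8378485632
print(bytes_total / 2**30)     # ~7.8 GiB, matching the observed RSS
```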
Re: OOM killer while pg_restore
=?UTF-8?Q?Marc_Recht=c3=a9?= writes: > We have a pg_restore which fails due to RAM over-consumption of the > corresponding PG backend, which ends-up with OOM killer. > The table has one PK, one index, and 3 FK constraints, active while > restoring. > The dump contains over 200M rows for that table and is in custom format, > which corresponds to 37 GB of total relation size in the original DB. The FKs would result in queueing row trigger events, which would occupy some memory. But those should only need ~12 bytes per FK per row, which works out to less than 10GB for this number of rows, so it may be that you've hit something else that we would consider a leak. Does memory consumption hold steady if you drop the FK constraints? If not, as others have noted, we'd need more info to investigate this. The leak is probably independent of the specific data in the table, so maybe you could make a small self-contained example using a script to generate dummy data. regards, tom lane
Re: Advice needed: query performance deteriorates by 2000% within 1 minute
Peter Adlersburg writes: > Limit (cost=0.00..804.97 rows=10 width=22) (actual > time=23970.845..25588.432 rows=1 loops=1) >-> Seq Scan on "order" (cost=0.00..3863.86 rows=48 width=22) (actual > time=23970.843..25588.429 rows=1 loops=1) > Filter: (jsonb_to_tsvector('english'::regconfig, content, > '["all"]'::jsonb) @@ '''1.20709841'''::tsquery) > Rows Removed by Filter: 9652 > Planning Time: 0.430 ms > Execution Time: 25588.448 ms I think the expense here comes from re-executing jsonb_to_tsvector a lot of times. By default that's estimated as 100 times more expensive than a simple function (such as addition), but these results make it seem like that's an understatement. You might try something like alter function jsonb_to_tsvector(regconfig, jsonb, jsonb) cost 1000; to further discourage the planner from picking this plan shape. Possibly the cost estimate for ts_match_tq (the function underlying this variant of @@) needs to be bumped up as well. (Bear in mind that pg_dump will not propagate such hacks on system-defined objects, so you'll need a note to reapply any such changes after dump/reload or pg_upgrade.) regards, tom lane
Re: slow "select count(*) from information_schema.tables;" in some cases
Lars Aksel Opsahl writes: >> SELECT relname, pg_size_pretty(pg_relation_size(C.oid)) FROM pg_class C LEFT >> JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname = 'pg_catalog' >> ORDER BY 2 DESC LIMIT 20; can you show the output of this query "ORDER BY 2" is giving you a textual sort of the sizes, which is entirely unhelpful. Try SELECT relname, pg_size_pretty(pg_relation_size(C.oid)) FROM pg_class C LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace) WHERE nspname = 'pg_catalog' ORDER BY pg_relation_size(C.oid) DESC LIMIT 20; regards, tom lane
Re: 9.6 write time
Marc Millas writes: > checkpoint completion target is 0.9 checkpoint_timeout is the more interesting number here. regards, tom lane
Re: 9.6 write time
Marc Millas writes: > I am looking at a postgres 9.6 on rh7 > I see that checkpoint_write_time is huge which looks quite strange as the > average amount of data written is not that big. checkpoint_write_time is not the amount of time spent doing I/O; it's the elapsed wall-clock time in the write phase. If the I/O is being throttled because of an un-aggressive checkpoint completion target, it could be a lot more than the actual I/O time. What have you got your checkpoint parameters set to? regards, tom lane
Re: WAL files keep piling up
Zbigniew Kostrzewa writes: > Thanks for responding. On current master it looks like so: > ls pg_xlog/archive_status/ | grep ready | wc -l > 0 > ls pg_xlog/archive_status/ | grep done | wc -l > 501 Hmm, if you've got archiving turned off, I wonder why you have any .done files at all. Perhaps they are leftover from a time when you did have archiving on, and for some reason they are confusing the non-archive-mode cleanup logic. Anyway, you could certainly manually remove the .done files and the corresponding WAL segment files, and then see what happens. BTW, I'm sure you realize that 9.6.15 is not exactly current. regards, tom lane
Re: Query is slower with a large proportion of NULLs in several columns
Lars Bergeson writes:
> I'm running PostgreSQL under AWS Aurora, and I didn't set it up or install
> it, so I'm not sure about the OS version.

Oh! Aurora is not Postgres. My admittedly-not-well-informed understanding is that they stuck a Postgres front end on their existing storage engine, so it's not surprising if storage-level behaviors are quite different from stock Postgres.

regards, tom lane
Re: Query is slower with a large proportion of NULLs in several columns
"David G. Johnston" writes:
> On Monday, December 20, 2021, Tom Lane wrote:
>> It would help if somebody had labeled the units of I/O Time
>> ... but I'm guessing those are microsec vs. the millisec
>> of the other times, because otherwise it's completely wrong.
> Related to my preceding observation, from the explain (buffers) docs:
> "…and the time spent reading and writing data file blocks (in milliseconds)
> if track_io_timing
> <https://www.postgresql.org/docs/current/runtime-config-statistics.html#GUC-TRACK-IO-TIMING>
> is enabled."

Hmm ... the code sure looks like it's supposed to be millisec:

    appendStringInfoString(es->str, "I/O Timings:");
    if (!INSTR_TIME_IS_ZERO(usage->blk_read_time))
        appendStringInfo(es->str, " read=%0.3f",
                         INSTR_TIME_GET_MILLISEC(usage->blk_read_time));
    if (!INSTR_TIME_IS_ZERO(usage->blk_write_time))
        appendStringInfo(es->str, " write=%0.3f",
                         INSTR_TIME_GET_MILLISEC(usage->blk_write_time));

And when I try some cases here, I get I/O timing numbers that are consistent with the overall time reported by EXPLAIN, for example

    Seq Scan on foo (cost=0.00..843334.10 rows=1110 width=508) (actual time=0.015..1897.492 rows=1100 loops=1)
      Buffers: shared hit=15874 read=717460
      I/O Timings: read=1184.638
    Planning:
      Buffers: shared hit=5 read=2
      I/O Timings: read=0.025
    Planning Time: 0.229 ms
    Execution Time: 2151.529 ms

So now we have a real mystery about what is happening on Lars' system. Those numbers can't be right.

regards, tom lane
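A minimal way to reproduce a reading like the one above on any instance (the table name is a stand-in; I/O Timings lines only appear when track_io_timing is on and some blocks actually come from disk):

```sql
SET track_io_timing = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM foo;  -- "foo" is illustrative
```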
Re: Query is slower with a large proportion of NULLs in several columns
Justin Pryzby writes:
> On Mon, Dec 20, 2021 at 08:11:42PM -0800, Lars Bergeson wrote:
>> Still taking 10X more I/O to read the smaller table. Very odd.
> If I'm not wrong, it's even worse than that?
> It takes 20 or 30sec to run the query - but it says the associated I/O times
> are ~500sec or ~6000sec?

It would help if somebody had labeled the units of I/O Time ... but I'm guessing those are microsec vs. the millisec of the other times, because otherwise it's completely wrong.

regards, tom lane
Re: Query is slower with a large proportion of NULLs in several columns
Lars Bergeson writes:
> What is it about null values in the table that slows down the full table
> scan?

If a row has any nulls, then it contains a "nulls bitmap" [1] that says which columns are null, and that bitmap has to be consulted while walking through the row contents. So the most obvious theory here is that that adds overhead that's significant in your case. But there are some holes in that theory, mainly that the I/O timings you are showing don't seem very consistent:

no nulls:
> I/O Timings: read=1884365.335
> Execution Time: 11135.368 ms

with nulls:
> I/O Timings: read=17141420.771
> Execution Time: 25407.318 ms

Regardless of CPU time required, it should not take 10X less I/O time to read a physically larger table. So there's something fairly bogus going on there. One thing you might try is disabling parallelism (set max_parallel_workers_per_gather = 0) to see if that's confusing the numbers somehow.

regards, tom lane

[1] https://www.postgresql.org/docs/current/storage-page-layout.html#STORAGE-TUPLE-LAYOUT
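The suggested experiment, plus a small standalone illustration of the nulls-bitmap effect on row width (exact sizes vary with column types and alignment, so no particular numbers are claimed):

```sql
-- Re-run the problem query with parallelism disabled:
SET max_parallel_workers_per_gather = 0;

-- Rows containing a NULL carry a null bitmap but omit the null field's data;
-- pg_column_size() on whole-row values shows the net per-row effect:
SELECT pg_column_size(ROW(1, 2, 3))    AS all_non_null,
       pg_column_size(ROW(1, NULL, 3)) AS with_null;
```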
Re: pg_trgm word_similarity query does not use index for input strings longer than 8 characters
Laurenz Albe writes:
> On Tue, 2021-11-30 at 22:38 +0100, pgsql-performa...@jhacker.de wrote:
>> INSERT INTO song (artist, title)
>> SELECT 'artist','title'
>> FROM generate_series(1,1);
>>
>> \set query '12345678'
>>
>> -- This query is slow
>> EXPLAIN ANALYZE
>> SELECT song.artist, song.title
>> FROM song
>> WHERE (song.artist %> :'query' OR song.title %> :'query')
>> ;
> The table is quite small; with a bigger table, the test would be more
> meaningful.

Yeah, this test case seems very unrealistic, both as to table size and as to the lack of variability of the table entries. I think the latter is causing the indexscans to take less time than they otherwise might, because none of the extracted trigrams find any matches.

> Since you have SSDs, you should tune "random_page_cost = 1.1".

Right. Poking at gincostestimate a bit, I see that for this operator the indexscan cost estimate is basically driven by the number of trigrams extracted from the query string (nine in this test case) and the index size; those lead to a predicted number of index page fetches that's then scaled by random_page_cost. That's coming out to make it look more expensive than the seqscan. It's actually not more expensive, but that's partially because page fetch costs are really zero in this test case (everything will stay in shared buffers the whole time), and partially because the unrealistic data pattern is leading to not having to look at as much of the index as gincostestimate expected.

In general, it appears correct that longer query strings lead to a higher index cost estimate, because they produce more trigrams so there's more work for the index match to do. (At some level, a longer query means more work in the seqscan case too; but our cost models are inadequate to predict that.)

regards, tom lane
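The trigram count can be checked directly with show_trgm() from pg_trgm, which pads the string with blanks before extracting trigrams; this is where the "nine in this test case" figure comes from:

```sql
-- Requires the pg_trgm extension:
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT show_trgm('12345678');                         -- the trigrams the index must probe
SELECT array_length(show_trgm('12345678'), 1) AS n_trigrams;
```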
Re: LwLockRelease performance
Ashkil Dighin writes:
> The performance bottleneck in the LWLockRelease() method is that it goes
> through an array one by one to see which lock was released, with O(N).
> As soon as the lock is found it performs an array move to remove the lock.

Typically, such locks are released in LIFO order. Do you have any actual evidence of a performance problem here?

regards, tom lane
Re: Out of memory error
aditya desai writes:
> In a trigger function I am creating a temp table. When an update on a
> table is executed for say 10k rows, I get the below error.
> ERROR: out of shared memory
> HINT: You might need to increase max_locks_per_transaction
> CONTEXT: SQL statement "create temp table changedinfo(colName
> varchar(100), oldValue varchar(4000), newValue varchar(4000)"

[ raised eyebrow ... ] If you are concerned about performance, I'd start by not creating a temp table per row of the outer update. That's costing probably 100x to 1000x as much as the row update itself.

regards, tom lane
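A sketch of the kind of restructuring meant here, assuming a trigger function along the lines of the one in the error context (the function name and body are hypothetical): creating the temp table at most once per session, rather than once per fired row, keeps the lock count bounded.

```sql
-- Hypothetical reworked trigger function: only create the temp table
-- if it does not already exist in this session.
CREATE OR REPLACE FUNCTION log_changes() RETURNS trigger AS $$
BEGIN
    CREATE TEMP TABLE IF NOT EXISTS changedinfo(
        colname  varchar(100),
        oldvalue varchar(4000),
        newvalue varchar(4000)
    ) ON COMMIT DELETE ROWS;
    -- populate changedinfo from OLD/NEW here
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```

A statement-level trigger with transition tables (REFERENCING OLD TABLE / NEW TABLE), available since v10, would avoid the per-row temp table entirely.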
Re: Need help identifying a periodic performance issue.
Thomas Munro writes:
> I'm wondering if the thing that changed between 9.6 and 13 might be
> the heuristics for when auto vacuum's background ANALYZE is triggered,
> creating the unlucky timing required to get your system to this state
> occasionally.
> For a while now I have been wondering how we could teach the
> planner/stats system about "volatile" tables (as DB2 calls them), that
> is, ones that are frequently empty, which often come up in job queue
> workloads. I've seen problems like this with user queries (I used to
> work on big job queue systems across different relational database
> vendors, which is why I finished up writing the SKIP LOCKED patch for
> 9.5), but this is the first time I've contemplated FK check queries
> being negatively affected by this kind of stats problem. I don't have
> a good concrete idea, though (various dumb ideas: don't let auto
> analyze run on an empty table if it's marked VOLATILE, or ignore
> apparently empty stats on tables marked VOLATILE (and use what?),
> ...).

Hmm. If this complaint were about v14 rather than v13, I'd be wondering whether 3d351d916 was what made things worse. But in v13, if the table does go to empty (zero length) and ANALYZE happens to see that state, we should end up back at the planner's "minimum ten pages" heuristic, which likely would be enough to prevent choice of a seqscan. OTOH, if the analyzed state is "empty but has a couple of pages", it looks like that could provoke a seqscan.

This is all guesswork though, since we don't know quite what's happening on Robert's system. It might be worth setting "log_autovacuum_min_duration = 0" (either globally, or as a reloption on the relevant tables), and seeing if there seems to be any correlation between autovacuum/autoanalyze activity and the occurrences of poor plan choices.

regards, tom lane
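The logging suggestion in concrete form (the table name is a placeholder for whichever queue tables are under suspicion):

```sql
-- Globally (takes effect after a config reload):
ALTER SYSTEM SET log_autovacuum_min_duration = 0;
SELECT pg_reload_conf();

-- Or narrowly, as a reloption on the relevant tables:
ALTER TABLE job_queue SET (log_autovacuum_min_duration = 0);
```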