Re: When you really want to force a certain join type?
On 12/28/2022 10:48 AM, Justin Pryzby wrote:
> Maybe the new parameter in v15 would help.
> https://www.postgresql.org/docs/15/runtime-config-query.html#GUC-RECURSIVE-WORKTABLE-FACTOR
> recursive_worktable_factor (floating point)
> Sets the planner's estimate of the average size of the working table of a recursive query, as a multiple of the estimated size of the initial non-recursive term of the query. This helps the planner choose the most appropriate method for joining the working table to the query's other tables. The default value is 10.0. A smaller value such as 1.0 can be helpful when the recursion has low "fan-out" from one step to the next, as for example in shortest-path queries. Graph analytics queries may benefit from larger-than-default values.

Thanks, that's something I will try after I have upgraded.

But speaking of other uses for recursive queries: I have quite a bit of experience turning graph-related traversal, search, optimization, and classification queries into SQL -- in short, computing the transitive closure. Usually I have stayed away from the recursive WITH query and instead set up a start table and then performed the iterative step myself. There are two ways to go about it.

Say you have a graph of simple nodes and arcs, and you want to find all paths through the graph. You can set up start nodes and then extend each path at the end by joining the recursive table to the simple arc table. This is what WITH RECURSIVE supports. Such a query iterates linearly, as many times as the length of the longest path:

with recursive
  arcs as (
    select source, target, 1 as distance, ... from ...
  ),
  paths as (
    select * from arcs
    union all
    select a.source, b.target, a.distance + 1 as distance, ...
      from paths a inner join arcs b on (b.source = a.target)
  )
select * from paths;

But another way is to join paths with paths.
It would be this, which I think I have seen PostgreSQL unable to deal with:

with recursive
  arcs as (
    select source, target, 1 as distance, ... from ...
  ),
  paths as (
    select * from arcs
    union all
    select a.source, b.target, a.distance + b.distance as distance, ...
      from paths a inner join paths b on (b.source = a.target)
  )
select * from paths;

So, instead of the recursive union joining back to the fixed table, it joins the recursive table to the recursive table itself. The benefit is that these queries converge much more quickly: instead of going 10 iterations to find a path of length 10, you go 1 iteration to find all paths of length 2 (if distance 1 is the base table of all arcs), then the next finds paths of up to 4, then up to 8, 16, 32, ... I usually do that as follows:

create table paths as
  select source, target, 1 as distance, ... from arcs;

prepare rep as
  insert into paths(source, target, distance, ...)
  select a.source, b.target, a.distance + b.distance as distance, ...
    from paths a inner join paths b on (b.source = a.target)
  except
  select * from paths;

execute rep;
execute rep;
...

or, instead of the except, in order to keep only minimal distances:

  where not exists (select 1 from paths x
                     where x.source = a.source
                       and x.target = a.target
                       and x.distance < a.distance)

I have even done a group by in the iterative step, replacing the paths relation at every iteration (e.g., with only minimal-distance paths). Since this converges so rapidly, I often prefer that approach over a recursive union query.

I think IBM DB2 allowed joining the recursive table with itself. Is this something you want to support at some time? Also, why even require the RECURSIVE keyword? DB2 didn't need it, and the query analyzer can immediately see the recursion, so there is no need for that keyword.

regards,
-Gunther
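The contrast between the two strategies can be tried end-to-end outside PostgreSQL; here is a minimal sketch in Python on SQLite (whose recursive CTEs and EXCEPT behave the same way for this purpose; there is no PREPARE, so the doubling INSERT is simply re-executed until it adds nothing). The 9-node chain graph is invented for the demo.

```python
# Contrast the two closure strategies on a chain 1 -> 2 -> ... -> 9:
# the linear recursive CTE needs 8 extension steps to reach distance 8,
# while the paths-join-paths doubling loop converges in 3 productive rounds.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("create table arcs (source int, target int)")
con.executemany("insert into arcs values (?, ?)",
                [(i, i + 1) for i in range(1, 9)])

# Linear variant: recursive table joined back to the fixed arcs table.
linear = con.execute("""
    with recursive paths(source, target, distance) as (
        select source, target, 1 from arcs
        union all
        select a.source, b.target, a.distance + 1
          from paths a join arcs b on b.source = a.target
    )
    select count(*), max(distance) from paths
""").fetchone()

# Doubling variant: paths joined with paths, EXCEPT away known rows.
con.execute("create table paths as select source, target, 1 as distance from arcs")
rounds = 0
while True:
    before = con.total_changes
    con.execute("""
        insert into paths (source, target, distance)
        select a.source, b.target, a.distance + b.distance
          from paths a join paths b on b.source = a.target
        except
        select source, target, distance from paths
    """)
    rounds += 1
    if con.total_changes == before:   # nothing new inserted: converged
        break

doubling = con.execute("select count(*), max(distance) from paths").fetchone()
print(linear, doubling, rounds)   # both find all 36 paths, max distance 8
```

The doubling loop runs 3 productive rounds (reaching distances 2, 4, then 8) plus one final empty round, versus 8 linear iterations for the CTE.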
When you really want to force a certain join type?
60) -> Materialize (cost=120554.35..120966.44 rows=82419 width=228)
      -> Sort (cost=120554.35..120760.39 rows=82419 width=228)
           Sort Key: f.token, p.pathid, p.end
           -> Nested Loop (cost=0.43..104808.55 rows=82419 width=228)
                -> WorkTable Scan on path p (cost=0.00..68.60 rows=3430 width=212)
                -> Index Scan using fsa_old_state_token_idx on fsa f (cost=0.43..30.30 rows=24 width=28)

Once this merge join kicks in, the query essentially stalls. Each of the limited components runs in seconds, and I can run them iteratively so that my initial set of tokens never grows past 100,000; that way I can complete everything in about linear time, each iteration taking time roughly proportional to the number of tokens. But with the merge join it doesn't complete within several times that amount of time. I doubt that I can find any trick to give the planner better data from which it could figure out that the merge join is a bad proposition. I wish I could just force it. I probably had this discussion here some years ago. I think that while the PostgreSQL optimizer is pretty good, there are situations such as this where its predictions do not work. Note: for my immediate relief I have forced it by simply setting enable_mergejoin = off. This works fine, except that it converts both joins into nested loops; the upper merge join was not a problem, and most often a nested loop is a bad choice for bulk data. It's only for this recursive query that it sometimes makes sense.

regards,
-Gunther
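One way to keep the enable_mergejoin override from leaking into other queries is to scope it to a single transaction with SET LOCAL, which is standard PostgreSQL behavior; a sketch (the problematic recursive query itself is elided):

```sql
BEGIN;
SET LOCAL enable_mergejoin = off;  -- reverts automatically at COMMIT/ROLLBACK
-- the recursive query that mis-plans goes here, e.g.:
-- WITH RECURSIVE path AS (...) SELECT ... FROM path ...;
COMMIT;
```

This avoids disturbing the planner for the rest of the session while still forcing the nested loop where it matters.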
Faster more low-level methods of having hot standby / secondary read-only servers?
ealistic numbers). And my 3 stand-bys rotate doing that, so 2 of the 3 are always up while 1 of them might be briefly down. This could be improved further with some OS and SAN support, all the way down to RAID mirrors. But practically, I could mount the same file system (e.g., BSD UFS) on the master with -o rw and on the stand-by with -o ro. Even without UFS having any "cluster" support, I can do that. I will have a few inconsistencies, but PostgreSQL is very tolerant of small inconsistencies and can fix them anyway. (I would not dare try anything like this with an Oracle system.)

Another tool I could think of using is BSD UFS snapshot support. I could make a snapshot that is consistent from the file system perspective. PostgreSQL writers could interface with that: issue a sync, trigger a UFS snapshot, then a file system sync. Now any standby that reads its data files from this snapshot would have not only file-system-level consistency but even database-level consistency.

So with this replication using a file system mounted from multiple PgSQL servers, replication should work well while consuming a minimal amount of server resources, and without leading to too much actual disk IO contention (seeks). Even the disk contention could possibly be resolved by letting a RAID mirror sync to the same snapshot point and then splitting it off, directing the read activity of the stand-by server (querying like crazy) only to that mirror, while batching changes to the RAID master so that they can be applied with very low overhead.

Anyone thinking about these things?

regards,
-Gunther
PostgreSQL hot standby reading WAL from multi-attached volume?
Hi, I wonder whether we can set up a hot standby in such a way that we don't need any log streaming or shipping, and instead every hot standby just mounts, in read-only mode, the same disk to which the master writes its WAL files. Even without a clustered file system (e.g., with UFS on FreeBSD), one can have the master mount in read-write mode while all the hot standbys mount the volume read-only. Given that WAL logs are written out at a certain rate, one can at regular intervals issue

mount -u /pg_wal

and it should refresh the metadata, I assume. Re-reading about hot standby, it strikes me that this method is essentially the "log shipping" method, only without any actual "shipping" involved: the new log files simply appear on the disk. I suppose there is a question of how we know when a new WAL file has finished appearing. And as I read it, the log-shipping method may not be suitable for hot standby use? Is this something that has been written about already?

regards,
-Gunther

freebsd-performa...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-performance
Re: Postgres performance comparing GCP and AWS
Hi Maurici, in my experience the key factor for speed in big queries is sequential scan throughput. There is a huge variance in how systems are tuned: in some cases I cannot read more than 10 MB/s, in others I can expect 20-40 MB/s. But when things are tuned well and the parallel workers kick in, I see the throughput spike to 100-200 MB/s. You may have to enable the parallel workers in your postgresql.conf. So this is what you want to check first.

While the query runs, have both iostat and top running, with top -j or -c or -a or whatever it is on that particular OS, to see the detailed info about the process; perhaps even -H to see threads. Then you should see good flow with high read speed and a reasonable CPU load. If you get low read speed and low CPU, that is a sign of IO blockage somewhere. If you get high CPU and low IO, that's a planning mistake (the nested-loop trap); you don't have that here, apparently. But I have seen index scans with much worse IO throughput than sequential table scans.

Also, on AWS you need to be sure you have enough IOPS provisioned on your EBS volume (I use gp3 now, where you can have up to 10k IOPS), and also check the bus throughput of the EC2 instance. Needless to say, you don't want a t* instance, where you have only limited burst CPU capacity.

regards,
-Gunther

On 2/23/2021 1:12 PM, Maurici Meneghetti wrote:

Hi everyone, I have 2 postgres instances created from the same dump (backup), one on a GCP VM and the other on AWS RDS. The first instance takes 18 minutes and the second one takes less than 20 s to run this simple query:

SELECT "Id", "DateTime", "SignalRegisterId", "Raw"
FROM "SignalRecordsBlobs"
WHERE "SignalSettingId" = 103
  AND "DateTime" BETWEEN '2019-11-28T14:00:12.54020' AND '2020-07-23T21:12:32.24900';

I've run this query a few times to make sure both should be reading data from cache. I expect my postgres on GCP to be at least similar to the one managed by AWS RDS, so that I can work on improvements in parallel and compare.
DETAILS:

Query explain for Postgres on GCP VM:

Bitmap Heap Scan on SignalRecordsBlobs SignalRecordsBlobs (cost=18.80..2480.65 rows=799 width=70) (actual time=216.766..776.032 rows=5122 loops=1)
  Filter: (("DateTime" >= '2019-11-28 14:00:12.5402'::timestamp without time zone) AND ("DateTime" <= '2020-07-23 21:12:32.249'::timestamp without time zone))
  Heap Blocks: exact=5223
  Buffers: shared hit=423 read=4821
  -> Bitmap Index Scan on IDX_SignalRecordsBlobs_SignalSettingId (cost=0.00..18.61 rows=824 width=0) (actual time=109.000..109.001 rows=5228 loops=1)
       Index Cond: ("SignalSettingId" = 103)
       Buffers: shared hit=3 read=18
Planning time: 456.315 ms
Execution time: 776.976 ms

Query explain for Postgres on AWS RDS:

Bitmap Heap Scan on SignalRecordsBlobs SignalRecordsBlobs (cost=190.02..13204.28 rows=6213 width=69) (actual time=2.215..14.505 rows=5122 loops=1)
  Filter: (("DateTime" >= '2019-11-28 14:00:12.5402'::timestamp without time zone) AND ("DateTime" <= '2020-07-23 21:12:32.249'::timestamp without time zone))
  Heap Blocks: exact=5209
  Buffers: shared hit=3290 read=1948
  -> Bitmap Index Scan on IDX_SignalRecordsBlobs_SignalSettingId (cost=0.00..188.46 rows=6405 width=0) (actual time=1.159..1.159 rows=5228 loops=1)
       Index Cond: ("SignalSettingId" = 103)
       Buffers: shared hit=3 read=26
Planning time: 0.407 ms
Execution time: 14.87 ms

PostgreSQL version number running:
• VM on GCP: PostgreSQL 11.10 (Debian 11.10-0+deb10u1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 8.3.0-6) 8.3.0, 64-bit
• Managed by RDS on AWS: PostgreSQL 11.10 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-11), 64-bit

How PostgreSQL was installed:
• VM on GCP: Already installed when the VM running Debian was created via the Google Console.
• Managed by RDS on AWS: RDS managed the installation.
Changes made to the settings in the postgresql.conf file. Here are some postgres parameters that might be useful:

Instance on VM on GCP (2 vCPUs, 2 GB memory, 800 GB disk):
• effective_cache_size: 1496MB
• maintenance_work_mem: 255462kB (close to 249MB)
• max_wal_size: 1GB
• min_wal_size: 512MB
• shared_buffers: 510920kB (close to 499MB)
• max_locks_per_transaction: 1000
• wal_buffers: 15320kB (close to 15MB)
• work_mem: 2554kB
• effective_io_concurrency: 200
• dynamic_shared_memory_type: posix

On this instance we installed a postgres extension called timescaledb to gain performance on other tables; some of these parameters were set using recommendations from that extension.

Instance managed by RDS (2 vCPUs, 2 GiB RAM, 250GB disk, 750 IOPS):
• effective_cache_size: 1887792kB (close to 184
Conundrum with scaling out of bottleneck with hot standby, PgPool-II, etc.
aster and slave, or I can spend the same money to buy a single system with twice the CPU cores and a twice-as-wide IO path and disks. Why would I do anything other than just beef up that master db server?

regards,
-Gunther
Is there a known bug with SKIP LOCKED and "tuple to be locked was already moved to another partition due to concurrent update"?
Hi all, a long time ago I devised, with your help, a task queuing system which uses SELECT ... FOR UPDATE SKIP LOCKED for many parallel workers to find tasks in the queue. It uses a partitioned table where the hot part of the queue is short, so the query for a job is quick, and the SKIP LOCKED locking makes sure that one job is only ever assigned to one worker. This works pretty well for me, except that when we run many workers we see a lot of these failures occurring:

"tuple to be locked was already moved to another partition due to concurrent update"

This does not exactly look like a bug, because the message says "to be locked", so at least it's not allowing two workers to lock the same tuple. But it seems that the SKIP LOCKED mode should not turn this into an error; it should treat the tuple as already locked. Why would it want to lock the tuple (representing the job) if another worker has already finished its UPDATE of the job marking it as "done" (which is what makes the tuple move to the "completed" partition)? Either the SELECT for jobs to do returned a wrong tuple, which was already updated, or there is some lapse in the locking. Either way, it seems a waste of time to throw all these errors when the tuple should not even have been selected for update and locking. Does anybody know anything about this issue? Of course you'll want to see the DDL and SQL queries, etc., but you can't really try it out unless you do some massively parallel magic, so I figured I'd just ask.

regards,
-Gunther
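For reference, the usual shape of such a worker's claim query looks like this; the table and column names here are hypothetical, not the poster's actual schema:

```sql
-- Claim one pending job; rows locked by concurrent workers are skipped,
-- so two workers never grab the same tuple.
UPDATE jobs
   SET state = 'running'
 WHERE id = (SELECT id
               FROM jobs
              WHERE state = 'pending'
              ORDER BY id
              LIMIT 1
                FOR UPDATE SKIP LOCKED)
RETURNING id;
```

The inner SELECT locks the row it returns, and the outer UPDATE is what moves the tuple between partitions when the state column is the partition key.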
Re: FPGA optimization ...
Hi Thomas, you said:

> For the record, this is not exactly a new thing. Netezza (a PostgreSQL fork started in 1999, later acquired by IBM) used FPGAs. Now there's swarm64 [1], another PostgreSQL fork, also using FPGAs with newer PostgreSQL releases.

Yes, I found the swarm thing on Google, and heard about Netezza years ago from an Indian consulting contractor that had worked on it (their price point was way out of the range that made sense for the academic place where I worked then). But there is good news, better than you thought when you wrote:

> Those are proprietary forks, though. The main reason why the community itself is not working on this directly (at least not on pgsql-hackers) is exactly that it requires specialized hardware, which the devs probably don't have, making development impossible, and the regular customers are not asking for it either (one of the reasons being the limited availability of such hardware, especially for customers running in the cloud who cannot even deploy custom appliances). I don't think this will change, unless access to systems with FPGAs becomes much easier (e.g. if AWS introduces such an instance type).

It already has changed! Amazon F1 instances. And Xilinx has already packaged a demo: https://aws.amazon.com/marketplace/pp/B07BVSZL51. This demo appears very limited though (only for TPC-H queries 6 and 12 or so). Even the hardware to hold in your hand is now much cheaper. I know a guy who's marketing a board with 40 GB/s throughput. I don't have a price, but I can't imagine the board plus a 1 TB disk to be much outside of US$ 2k. I could sponsor that if someone wants to have a serious shot at it. Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax exempt? I can donate and possibly find some people at Purdue University who might take this on. Interest?

> I don't think there's any such non-profit, managing/funding development. At least I'm not aware of it.
> There are various non-profits around the world, but those are organizing events and local communities. I'd say the best way to do something like this is to either talk to one of the companies participating in PostgreSQL development (pgsql-hackers is probably a good starting point), or - if you absolutely need to go through a non-profit - approach a university (which does not mean people from pgsql-hackers can't be involved, of course). I've been involved in a couple of such research projects in Europe; not sure what exactly the situation/rules are in the US.

Yes, that might work with a university directly. Although I will contact the PostgreSQL foundation in the US also.

regards,
-Gunther
FPGA optimization ...
The time has come. FPGA optimization is in the palm of our hands (literally: a 2 TB, 40 GB/s IO PostgreSQL server fits into less than a shoe box), and on Amazon AWS F1 instances. Some demos are beginning to exist: https://github.com/Xilinx/data-analytics. But a lot more could be done. How about linear sort performance at O(N)? https://hackaday.com/2016/01/20/a-linear-time-sorting-algorithm-for-fpgas/. And how about https://people.csail.mit.edu/wjun/papers/fccm2017.pdf, where the following four sorting accelerators are used:

* Tuple Sorter: sorts an N-tuple using a sorting network.
* Page Sorter: sorts an 8 KB (a flash page) chunk of sorted N-tuples in on-chip memory.
* Super-Page Sorter: sorts 16 8K-32MB sorted chunks in DRAM.
* Storage-to-Storage Sorter: sorts 16 512 MB or larger sorted chunks in flash.

Order-of-magnitude speed improvements? Better than Hadoop clusters on a single chip? 40 GB/s I/O throughput, massive full table scans, blazing fast sort-merge joins? Here it is. Anybody working more on that? Should be an ideal project for a student or a group of students. Is there a PostgreSQL foundation I could donate to, 501(c)(3) tax exempt? I can donate and possibly find some people at Purdue University who might take this on. Interest?

regards,
-Gunther
Re: Out of Memory errors are frustrating as heck!
Thanks Tom. Yes, I'd say it's using a lot of memory, but I wouldn't call it a "leak", as it doesn't grow during the 30 min or so that this query runs. It explodes to 4 GB and then stays flat until done. And this time the query is super complicated, with many joins and tables involved; the query plan has 100 lines. Not easy to share for reproduction, and I have my issue under control by adding some swap just in case. The swap space was never actually used.

thanks,
-Gunther

On 8/23/2019 10:20, Tom Lane wrote:
> Gunther writes:
>> Hi all, I am connecting to a discussion back from April this year. My data has grown and now I am running into new out of memory situations.
> It doesn't look like this has much of anything to do with the hash-table discussion. The big hog is an ExprContext:
> ExprContext: 1107296256 total in 142 blocks; 6328 free (101 chunks); 1107289928 used
> So there's something leaking in there, but this isn't enough info to guess what.
> regards, tom lane
Re: Out of Memory errors are frustrating as heck!
OK, I went back through that old thread, and I noticed an early opinion by a certain Peter who said that I should provision some swap space. Since I had plenty of disk and no other option, I tried that. And it did some magic. Here, this is a steady state now:

top - 14:07:32 up 103 days, 9:57, 5 users, load average: 1.33, 1.05, 0.54
Tasks: 329 total, 2 running, 117 sleeping, 0 stopped, 0 zombie
%Cpu(s): 31.0 us, 11.4 sy, 0.0 ni, 35.3 id, 22.3 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 7910376 total, 120524 free, 2174940 used, 5614912 buff/cache
KiB Swap: 16777212 total, 16777212 free, 0 used. 3239724 avail Mem

  PID USER     PR NI    VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
 5068 postgres 20  0 4352496 4.0g 2.0g R 76.4 52.6 3:01.39 postgres: postgres integrator [local] INSERT
  435 root     20  0       0    0    0 S  4.0  0.0 10:52.38 [kswapd0]

and the nice thing is, the backend server process appears to be bounded at 4 GB, so there isn't really a "memory leak". Also, the swap space isn't actually being used. This may have to do with these vm sysctl settings (overcommit, etc.):

* vm.overcommit_memory = 2 -- values are:
  o 0 -- estimate free memory
  o 1 -- always assume there is enough memory
  o 2 -- no over-commit; allocate only within the following two parameters
* vm.overcommit_kbytes = 0 -- how many kB above swap can be over-committed; EITHER this OR
* vm.overcommit_ratio = 50 -- percent of main memory that can be committed over swap
  o with 0 swap, that percent can be committed
  o i.e., of 8 GB, 4 GB are reserved for buffer cache
  o not a good idea probably
  o at least we should allow 75% committed, i.e., 6 GB of 8 GB, leaving
    + 2 GB of buffer cache
    + 2 GB of shared buffers
    + 4 GB of all other memory

I have vm.overcommit_memory = 2, _kbytes = 0, _ratio = 50. So with _ratio = 50 I can commit 50% of memory, i.e., 4 GB, and this is exactly what the server process wants. So with little impact on the available buffer cache I am in a fairly good position now.
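The arithmetic in those notes can be sanity-checked against the Linux commit-limit rule, CommitLimit = swap + RAM × overcommit_ratio / 100, which is the documented behavior when vm.overcommit_memory = 2 and vm.overcommit_kbytes = 0 (a sketch, ignoring hugepage reservations):

```python
# Sanity check of the overcommit arithmetic: with overcommit_memory = 2
# and overcommit_kbytes = 0, Linux caps committed memory at
# swap + RAM * overcommit_ratio / 100.
def commit_limit_kb(ram_kb, swap_kb, ratio_pct):
    return swap_kb + ram_kb * ratio_pct // 100

GB = 1024 * 1024  # kB per GB
ram = 8 * GB

# No swap, ratio 50: only 4 GB may be committed -- the original tight spot.
print(commit_limit_kb(ram, 0, 50) // GB)        # -> 4
# No swap, ratio 75: 6 GB committable, as suggested in the notes.
print(commit_limit_kb(ram, 0, 75) // GB)        # -> 6
# With 16 GB swap (2 x RAM), the limit rises to 20 GB without any paging.
print(commit_limit_kb(ram, 2 * ram, 50) // GB)  # -> 20
```

This is why adding swap (or raising the ratio) relieves the allocation failures even when the swap itself never gets used: only the accounting limit changes.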
The swap (that in my case I set at 2 x main memory = 16G) serves as a buffer to smooth out this peak usage without ever actually paging. I suppose even without swap I could have set vm.overcommit_ratio = 75, and I notice now that I already commented this much (the above bullet points are my own notes.) Anyway, for now, I am good. Thank you very much. regards, -Gunther On 8/23/2019 9:17, Gunther wrote: Hi all, I am connecting to a discussion back from April this year. My data has grown and now I am running into new out of memory situations. Meanwhile the world turned from 11.2 to 11.5 which I just installed only to find the same out of memory error. Have any of the things discussed and proposed, especially this last one by Tomas Vondra, been applied to the 11 releases? Should I try these older patches from April? regards, -Gunther For what it is worth, this is what I am getting: TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 8388608 total in 11 blocks; 3094872 free (4 chunks); 5293736 used JoinRelHashTable: 16384 total in 2 blocks; 5576 free (1 chunks); 10808 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 12720 free (8 chunks); 20048 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used 
PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used: ExecutorState: 202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 10607168 used HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 used HashBatchContext: 13079304 total in 336 blocks; 7936 free (0 chunks); 13071368 used TupleSort main: 49208 total in 3 blocks; 8552 free (7 chunks); 40656 used Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free
Re: Out of Memory errors are frustrating as heck!
Hi all, I am connecting to a discussion back from April this year. My data has grown and now I am running into new out of memory situations. Meanwhile the world turned from 11.2 to 11.5 which I just installed only to find the same out of memory error. Have any of the things discussed and proposed, especially this last one by Tomas Vondra, been applied to the 11 releases? Should I try these older patches from April? regards, -Gunther For what it is worth, this is what I am getting: TopMemoryContext: 67424 total in 5 blocks; 7184 free (7 chunks); 60240 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 7720 free (1 chunks); 472 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 8388608 total in 11 blocks; 3094872 free (4 chunks); 5293736 used JoinRelHashTable: 16384 total in 2 blocks; 5576 free (1 chunks); 10808 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 12720 free (8 chunks); 20048 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalContext: 1024 total in 1 blocks; 624 free (0 chunks); 400 used: ExecutorState: 202528536 total in 19 blocks; 433464 free (12 chunks); 202095072 used HashTableContext: 8192 total in 1 blocks; 7656 free (0 chunks); 536 used HashBatchContext: 10615104 total in 261 blocks; 7936 free (0 chunks); 10607168 used HashTableContext: 8192 total in 1 blocks; 7688 free (1 chunks); 504 used HashBatchContext: 
13079304 total in 336 blocks; 7936 free (0 chunks); 13071368 used TupleSort main: 49208 total in 3 blocks; 8552 free (7 chunks); 40656 used Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 
total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Subplan HashTable Temp Context: 1024 total in 1 blocks; 768 free (0 chunks); 256 used Subplan HashTable Context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
Re: Out of Memory errors are frustrating as heck!
..1310249.63 rows=13 width=233) (actual time=35925.031..40176.447 rows=236 loops=1) Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual time=1.609..7687.986 rows=14676871 loops=1) -> Hash (cost=829388.19..829388.19 rows=1 width=136) (actual time=30106.123..30106.123 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 63kB -> Gather (cost=381928.46..829388.19 rows=1 width=136) (actual time=24786.510..30105.983 rows=236 loops=1) ... -> Hash (cost=2823846.37..2823846.37 rows=34619 width=930) (actual time=252946.367..252946.367 rows=113478127 loops=1) Buckets: 32768 (originally 32768) Batches: 65536 (originally 4) Memory Usage: 1204250kB -> Gather Merge (cost=2807073.90..2823846.37 rows=34619 width=930) (actual time=83891.069..153380.040 rows=113478127 loops=1) Workers Planned: 2 Workers Launched: 2 -> Merge Left Join (cost=2806073.87..2818850.46 rows=14425 width=930) (actual time=83861.921..108022.671 rows=37826042 loops=3) Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid):: text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text)) -> Sort (cost=1295969.26..1296005.32 rows=14425 width=882) (actual time=44814.114..45535.398 rows=231207 loops=3) Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.docum... 
Workers Planned: 2
Workers Launched: 2
-> Merge Left Join (cost=2806073.87..2818850.46 rows=14425 width=930) (actual time=83861.921..108022.671 rows=37826042 loops=3)
     Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
     -> Sort (cost=1295969.26..1296005.32 rows=14425 width=882) (actual time=44814.114..45535.398 rows=231207 loops=3)
...
Planning Time: 2.953 ms
Execution Time: 7004340.091 ms
(70 rows)

There isn't really any big news here. But what matters is that it works.

thanks & regards,
-Gunther Schadow
Re: Out of Memory errors are frustrating as heck!
uckets: 65536 Batches: 1 Memory Usage: 4680kB -> Parallel Seq Scan on entity_id agencyid (cost=0.00..1434.07 rows=20107 width=89) (actual time=0.376..46.875 rows=11393 loops=3) -> Hash (cost=1310249.63..1310249.63 rows=13 width=233) (actual time=39172.682..39172.682 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 70kB -> Hash Right Join (cost=829388.20..1310249.63 rows=13 width=233) (actual time=35084.850..39172.545 rows=236 loops=1) Hash Cond: ((an.actinternalid)::text = (q.actinternalid)::text) -> Seq Scan on act_id an (cost=0.00..425941.04 rows=14645404 width=134) (actual time=0.908..7583.123 rows=14676871 loops=1) -> Hash (cost=829388.19..829388.19 rows=1 width=136) (actual time=29347.614..29347.614 rows=236 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 63kB -> Gather (cost=381928.46..829388.19 rows=1 width=136) (actual time=23902.428..29347.481 rows=236 loops=1) Workers Planned: 2 Workers Launched: 2 -> Parallel Hash Join (cost=380928.46..828388.09 rows=1 width=136) (actual time=23915.790..29336.452 rows=79 loops=3) Hash Cond: ((q.actinternalid)::text = (r.sourceinternalid)::text) -> Parallel Seq Scan on documentinformation q (cost=0.00..447271.93 rows=50050 width=99) (actual time=10055.238..15484.478 rows=87921 loops=3) Filter: (((classcode)::text = 'CNTRCT'::text) AND ((moodcode)::text = 'EVN'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 1540625 -> Parallel Hash (cost=380928.44..380928.44 rows=1 width=74) (actual time=13825.726..13825.726 rows=79 loops=3) Buckets: 1024 Batches: 1 Memory Usage: 112kB -> Parallel Seq Scan on actrelationship r (cost=0.00..380928.44 rows=1 width=74) (actual time=5289.948..13825.576 rows=79 loops=3) Filter: ((typecode)::text = 'SUBJ'::text) Rows Removed by Filter: 3433326 -> Hash (cost=2908913.87..2908913.87 rows=34619 width=930) (actual time=244145.322..244145.322 rows=113478127 loops=1) Buckets: 8192 (originally 8192) Batches: 65536 (originally 16) Memory Usage: 
1204250kB
-> Gather Merge (cost=2892141.40..2908913.87 rows=34619 width=930) (actual time=75215.333..145622.427 rows=113478127 loops=1)
Workers Planned: 2 Workers Launched: 2
-> Merge Left Join (cost=2891141.37..2903917.96 rows=14425 width=930) (actual time=75132.988..99411.448 rows=37826042 loops=3)
Merge Cond: (((documentinformationsubject.documentinternalid)::text = (documentinformationsubject_1.documentinternalid)::text) AND ((documentinformationsubject.documentid)::text = (documentinformationsubject_1.documentid)::text) AND ((documentinformationsubject.actinternalid)::text = (documentinformationsubject_1.actinternalid)::text))
-> Sort (cost=1301590.26..1301626.32 rows=14425 width=882) (actual time=39801.337..40975.780 rows=231207 loops=3)
Sort Key: documentinformationsubject.documentinternalid, documentinformationsubject.documentid, documentinformationsubject.actinternalid ... documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
Sort Method: external merge Disk: 169768kB Worker 0: Sort Method: external merge Disk: 169768kB Worker 1: Sort Method: external merge Disk: 169768kB
-> Seq Scan on documentinformationsubject documentinformationsubject_1 (cost=0.00..1329868.64 rows=1010585 width=159) (actual time=23401.537..31758.042 rows=1031106 loops=3)
Filter: ((participationtypecode)::text = 'PRD'::text) Rows Removed by Filter: 7415579
Planning Time: 40.559 ms
Execution Time: 6896581.566 ms
(70 rows)

For the first time this query has succeeded now. Memory was bounded. The execution time of nearly two hours is crazy, but things sometimes take that long. The important thing was not to get an out of memory error. Thank you. Anything else you want to try, I can do it. regards, -Gunther
Re: Out of Memory errors are frustrating as heck!
(hjstate=...) at nodeHashjoin.c:1127
1127 {
I don't understand why all of them are at the same location. Doesn't make any sense to me. But I'll leave it at that right now. -Gunther
Re: Out of Memory errors are frustrating as heck!
After applying Tomas' patch 0002 as corrected, over 0001, same thing:

integrator=# set enable_nestloop to off;
SET
integrator=# explain analyze select * from reports.v_BusinessOperation;
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 32 to 64
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 64 to 128
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 128 to 256
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 256 to 512
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 512 to 1024
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 1024 to 2048
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 2048 to 4096
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 4096 to 8192
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 8192 to 16384
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 16384 to 32768
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 32768 to 65536
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 65536 to 131072
WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 131072 to 262144
ERROR: out of memory
DETAIL: Failed on request of size 8272 in memory context "HashBatchFiles".
And from the log: 2019-04-21 23:29:33.497 UTC [8890] LOG: database system was shut down at 2019-04-21 23:29:33 UTC 2019-04-21 23:29:33.507 UTC [] LOG: database system is ready to accept connections Hashjoin 0x1732b88: initial nbatch = 16, nbuckets = 8192 2019-04-21 23:31:54.447 UTC [8896] WARNING: ExecHashIncreaseNumBatches: increasing number of batches from 16 to 32 2019-04-21 23:31:54.447 UTC [8896] LOG: ExecHashIncreaseNumBatches === context stats start === 2019-04-21 23:31:54.447 UTC [8896] STATEMENT: explain analyze select * from reports.v_BusinessOperation; TopMemoryContext: 120544 total in 7 blocks; 10016 free (6 chunks); 110528 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 6680 free (0 chunks); 1512 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 32768 total in 3 blocks; 13488 free (10 chunks); 19280 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used PortalContext: 1482752 total in 184 blocks; 11216 free (8 chunks); 1471536 used: ExecutorState: 647368 total in 10 blocks; 197536 free (13 chunks); 449832 used TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used TupleSort main: 4219912 total in 23 blocks; 246792 free (39 chunks); 3973120 used 
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashBatchFiles: 258032 total in 31 blocks; 6208 free (0 chunks); 251824 used HashTableContext: 8192 total in 1 blocks; 7624 free (5 chunks); 568 used HashBatchContext: 4279352 total in 130 blocks; 7936 free (0 chunks); 4271416 used TupleSort main: 41016 total in 3 blocks; 6504 free (6 chunks); 34512 used Caller tuples: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ... ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx ... index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400
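An aside on the batch-doubling warnings above: each hash-join batch beyond the first needs its own temporary BufFile per side, and each BufFile carries a block-sized buffer (hence the failing 8272-byte requests), so doubling the batch count roughly doubles that bookkeeping memory instead of shrinking the footprint. A back-of-the-envelope sketch of that scaling (my own illustration, not PostgreSQL source; the `buffile_overhead` function and the two-sides assumption are mine, only the BLCKSZ constant mirrors PostgreSQL's default):

```python
# Rough model of hash-join batch-file buffer overhead as nbatch doubles.
BLCKSZ = 8192  # PostgreSQL's default block size, also the BufFile buffer size


def buffile_overhead(nbatch, sides=2):
    """Lower bound on buffer memory for a hash join with nbatch batches:
    one temp file per side (inner and outer) per batch, each holding a
    BLCKSZ-sized in-memory buffer."""
    return sides * nbatch * BLCKSZ


for nbatch in (16, 1024, 65536, 262144):
    mib = buffile_overhead(nbatch) / 2**20
    print(f"{nbatch:>7} batches -> ~{mib:.1f} MiB of batch-file buffers")
```

At the 262144 batches reached above, the buffers alone come to about 4 GiB, which fits the allocation failing in the "HashBatchFiles" context rather than in the hash table itself.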
Re: Out of Memory errors are frustrating as heck!
s; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 288672 free (1 chunks); 812656 used index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index index info: 2048 
total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx ... index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6560 free (1 chunks); 1632 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used Grand total: 2424300520 bytes in 271910 blocks; 7332360 free (17596 chunks); 2416968160 used 2019-04-21 19:50:21.338 UTC [6974] ERROR: out of memory 2019-04-21 19:50:21.338 UTC [6974] DETAIL: Failed on request of size 32800 in memory context "HashBatchContext". 2019-04-21 19:50:21.338 UTC [6974] STATEMENT: explain analyze select * from reports.v_BusinessOperation; Next I'll apply Tomas' corrected 0002 patch on top of this and see ... -Gunther
Re: Out of Memory errors are frustrating as heck!
On 4/20/2019 21:14, Tomas Vondra wrote: Maybe. But before wasting any more time on the memory leak investigation, I suggest you first try the patch moving the BufFile allocations to a separate context. That'll either confirm or disprove the theory. OK, fair enough. So, first patch 0001-* applied, recompiled, and:

2019-04-21 04:08:04.364 UTC [11304] LOG: server process (PID 11313) was terminated by signal 11: Segmentation fault
2019-04-21 04:08:04.364 UTC [11304] DETAIL: Failed process was running: explain analyze select * from reports.v_BusinessOperation;
2019-04-21 04:08:04.364 UTC [11304] LOG: terminating any other active server processes
2019-04-21 04:08:04.368 UTC [11319] FATAL: the database system is in recovery mode
2019-04-21 04:08:04.368 UTC [11315] WARNING: terminating connection because of crash of another server process

SIGSEGV ... and with the core dump that I have I can tell you where:

Core was generated by `postgres: postgres integrator'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x009f300c in palloc (size=8272) at mcxt.c:936
936 context->isReset = false;
(gdb) bt
#0 0x009f300c in palloc (size=8272) at mcxt.c:936
#1 0x0082b068 in makeBufFileCommon (nfiles=1) at buffile.c:116
#2 0x0082b0f8 in makeBufFile (firstfile=73) at buffile.c:138
#3 0x0082b283 in BufFileCreateTemp (interXact=false) at buffile.c:201
#4 0x006bdc15 in ExecHashJoinSaveTuple (tuple=0x1c5a468, hashvalue=3834163156, fileptr=0x18a3730) at nodeHashjoin.c:1227
#5 0x006b9568 in ExecHashTableInsert (hashtable=0x188fb88, slot=0x1877a18, hashvalue=3834163156) at nodeHash.c:1701
#6 0x006b6c39 in MultiExecPrivateHash (node=0x1862168) at nodeHash.c:186
#7 0x006b6aec in MultiExecHash (node=0x1862168) at nodeHash.c:114
#8 0x006a19cc in MultiExecProcNode (node=0x1862168) at execProcnode.c:501
#9 0x006bc5d2 in ExecHashJoinImpl (pstate=0x17b90e0, parallel=false) at nodeHashjoin.c:290
...
(gdb) info frame
Stack level 0, frame at 0x7fffd5d4dc80:
rip = 0x9f300c in palloc (mcxt.c:936); saved rip = 0x82b068
called by frame at 0x7fffd5d4dcb0
source language c.
Arglist at 0x7fffd5d4dc70, args: size=8272
Locals at 0x7fffd5d4dc70, Previous frame's sp is 0x7fffd5d4dc80
Saved registers: rbx at 0x7fffd5d4dc60, rbp at 0x7fffd5d4dc70, r12 at 0x7fffd5d4dc68, rip at 0x7fffd5d4dc78

and I have confirmed that this is while working the main T_HashJoin with jointype JOIN_RIGHT. So now I assumed that perhaps you want both of these patches applied. I applied the second one too, retried, and boom: same thing, same place. It turns out the MemoryContext is NULL:

(gdb) p context
$1 = (MemoryContext) 0x0

All patches applied cleanly (with the -p1 option). I see no .rej files, but also no .orig files; not sure why that version of patch didn't create them. But I paid attention and know that there was no error. -Gunther
Re: Out of Memory errors are frustrating as heck!
1 chunks); 1368 used: pg_toast_2619_index index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx ... index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_attribute_relid_attnum_index index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_class_oid_index WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used MdSmgr: 8192 total in 1 blocks; 6176 free (1 chunks); 2016 used LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used Grand total: 2538218304 bytes in 77339 blocks; 3075256 free (3372 chunks); 2535143048 used 2019-04-21 05:27:07.731 UTC [968] ERROR: out of memory 2019-04-21 05:27:07.731 UTC [968] DETAIL: Failed on request of size 32800 in memory context "HashBatchContext". 2019-04-21 05:27:07.731 UTC [968] STATEMENT: explain analyze select * from reports.v_BusinessOperation; so we're ending up with the same problem. No cigar. But lots of admiration and gratitude for all your attempts to pinpoint this. Also, again, if anyone (of the trusted people) wants access to hack directly, I can provide. regards, -Gunther
Re: Out of Memory errors are frustrating as heck!
d. But to me, Valgrind is a huge toolbox just to look after one thing. I wonder if we could not make a much simpler memory leak debugger tool, one that is fast, yet doesn't produce so much output that it overwhelms the log destination file system (and wastes too much time). There are already debug macros there which, if enabled, create an absolutely crazy amount of undecipherable log content, because ALL backend processes spit out this blabber. So I already stopped that by adding a variable that must be set to 1 (using the debugger on exactly one process for exactly the time desired):

int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
                (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

But now I am thinking that this should be the hook for a limited cache where we can cancel out AllocSetAlloc calls against the AllocSetFree calls that follow relatively rapidly, which is apparently the majority of the log chatter created. The memory debugger would allocate a single fixed memory chunk, like 8 or 16 kB, as a cache per memory context actually being traced. In it we would record each allocation in the shortest possible way, with everything compressed: instead of full pointers, a small chunk index referencing the memory; and a very brief backtrace stored in compressed form, as the instruction pointer ($rip) of the caller followed by variable-length deltas to the $rip of each caller further up. These could even be stored in an index of the 100 most common caller chains to compress the data some more, while minimizing the search cost.
Now each allocated chunk would be kept in this table, and when it fills up, the oldest allocated chunk is evicted and written to the log file. When a chunk is freed before being evicted from the cache, it gets silently removed. When a chunk is freed that is no longer in the cache, the free event is recorded in the log. That way only those chunks get written to the log that have endured beyond the capacity of the cache; hopefully those are the chunks most likely involved in the memory leak. Once the log has been created, it can be loaded into a PostgreSQL table and analyzed to find the chunks that never get freed, and from the compressed backtraces, where they were allocated. BTW, my explain analyze is still running. That Sort/Unique step is taking forever on this data. OK, now I will try the various patches that people sent. -Gunther
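The cache-and-evict scheme described above can be sketched in a few lines. This is a Python illustration of the idea only; the names (`AllocTracker`, `on_alloc`, `on_free`) are hypothetical, and a real implementation would live in C inside the allocator and store compressed chunk indexes and backtraces rather than strings:

```python
from collections import OrderedDict


class AllocTracker:
    """Sketch of the proposed leak debugger: keep a small FIFO cache of
    live allocations; allocations that survive long enough to be evicted
    are logged as leak suspects, while a free() for a chunk still in the
    cache cancels it silently (the short-lived majority)."""

    def __init__(self, capacity=1024, log=print):
        self.cache = OrderedDict()  # chunk address -> (size, backtrace)
        self.capacity = capacity
        self.log = log

    def on_alloc(self, addr, size, backtrace):
        self.cache[addr] = (size, backtrace)
        if len(self.cache) > self.capacity:
            # oldest surviving chunk: a leak suspect, write it to the log
            old_addr, (old_size, old_bt) = self.cache.popitem(last=False)
            self.log(f"suspect: {old_addr:#x} size={old_size} at {old_bt}")

    def on_free(self, addr):
        if addr in self.cache:
            del self.cache[addr]  # rapid alloc/free pair: cancelled silently
        else:
            self.log(f"late free: {addr:#x}")  # already evicted and logged


# Tiny demo: short-lived chunks cancel out; the survivor gets logged.
events = []
t = AllocTracker(capacity=2, log=events.append)
t.on_alloc(0x10, 8272, "BufFileCreateTemp<-ExecHashJoinSaveTuple")
t.on_alloc(0x20, 394, "ExecHashJoinGetSavedTuple")
t.on_free(0x20)                      # freed quickly: no log entry
t.on_alloc(0x30, 8, "makeBufFileCommon")
t.on_alloc(0x40, 4, "makeBufFile")   # pushes 0x10 out of the cache
print(events)
```

The demo mirrors the situation at hand: the 8272-byte BufFile-related chunk outlives its neighbors and is the one that ends up in the log.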
Re: Out of Memory errors are frustrating as heck!
ND ((i % 15) < 15) AND ((i % 16) < 16) AND ((i % 17) < 17) AND ((i % 18) < 18) AND ((i % 19) < 19) AND ((i % 20) < 20) AND ((i % 21) < 21))
-> Function Scan on generate_series k (cost=0.00..10.00 rows=1000 width=4) (actual rows=99 loops=1)
-> Hash (cost=0.02..0.02 rows=1 width=4) (actual rows=99 loops=1)
Buckets: 4096 (originally 1024) Batches: 512 (originally 1) Memory Usage: 101kB
-> CTE Scan on v (cost=0.00..0.02 rows=1 width=4) (actual rows=99 loops=1)

Yes, I thought that with CTEs and functions one might be able to generate a test case, but I am still not seeing how you can trick the planner into this peculiar jointype JOIN_RIGHT, and whether that is requisite for triggering the problem. Finally, I have tried to make a pstate pretty printer in explain.c:

void
DumpPlanState(PlanState *pstate)
{
    ExplainState *es = NewExplainState();

    ExplainNode(pstate, NIL, NULL, NULL, es);
    puts(es->str->data);
    pfree(es->str->data);
}

but that didn't work, because unfortunately that ExplainNode function is destructive. It would be so nice to refactor this explain code so that there is a completely conservative function that simply dumps the present pstate with all the information about its estimated and actual situation: how many iterations it has already accomplished, how many it estimates still to do, whether its original estimate panned out, etc. This would be tremendously useful for runtime debugging of queries. I think the utility of this can hardly be overstated. I mean, even for end-user applications of some data warehouse, you could probe a long-running query every 5 seconds as to where its execution is. I can hardly think of a lower-hanging, more useful feature. I am sure that if PostgreSQL had originally been written in Java, this feature would naturally exist already. regards and Happy Easter, -Gunther
Re: Out of Memory errors are frustrating as heck!
pg_class reltuples and relpages:

foo=# analyze tmp_q;
ANALYZE
foo=# analyze tmp_r;
ANALYZE
foo=# select relname, relpages, reltuples from pg_class where relname in ('tmp_q', 'tmp_r');
 relname | relpages |  reltuples
---------+----------+-------------
 tmp_r   |  5505039 | 1.13467e+08
 tmp_q   |        7 |         236
(2 rows)

foo=# update pg_class set (relpages, reltuples) = (5505039, 1.13467e+08) where relname = 'tmp_q';
UPDATE 1
foo=# update pg_class set (relpages, reltuples) = (7, 236) where relname = 'tmp_r';
UPDATE 1

but that didn't help. Somehow the planner outsmarts every such trick, so I can't get it to follow my right outer join plan where the big table is hashed. I am sure y'all know some way to force it. regards, -Gunther
Re: Out of Memory errors are frustrating as heck!
e)::text = 'DEF'::text) AND ((code_codesystem)::text = '2.16.840.1.113883.3.26.1.1'::text)) Rows Removed by Filter: 2584355
-> Materialize (cost=1548104.12..1553157.04 rows=1010585 width=111) (actual time=49869.984..54191.701 rows=38060250 loops=3)
-> Sort (cost=1548104.12..1550630.58 rows=1010585 width=111) (actual time=49869.980..50832.205 rows=1031106 loops=3)
Sort Key: documentinformationsubject_1.documentinternalid, documentinformationsubject_1.documentid, documentinformationsubject_1.actinternalid
Sort Method: external merge Disk: 122192kB Worker 0: Sort Method: external merge Disk: 122192kB Worker 1: Sort Method: external merge Disk: 122192kB
-> Seq Scan on documentinformationsubject documentinformationsubject_1 (cost=0.00..1329868.64 rows=1010585 width=111) (actual time=20366.166..47751.267 rows=1031106 loops=3)
Filter: ((participationtypecode)::text = 'PRD'::text) Rows Removed by Filter: 7415579
Planning Time: 2.523 ms
Execution Time: 464825.391 ms
(66 rows)

By the way, let me ask: do you have pretty-print functions I can call with, e.g., node in ExecProcNode, or pstate in ExecHashJoin? Because if there were, we could determine where exactly in the current plan we are. And can I call the plan printer for the entire plan we are currently executing? Might it even give us preliminary counts of where in the process it is? (I ask the latter not only because it would be really useful for our present debugging, but also because it would be an awesome tool for monitoring long-running queries! Something I am sure tons of people would just love to have!) I also read the other responses. I agree that having a swap space available just in case is better than these annoying out of memory errors. And yes, I can add that memory profiler thing, if you think it would actually work. I've done it with Java heap dumps, even upgrading the VM to a 32 GB VM just to crunch the heap dump.
But can you tell me just a little more as to how I need to configure this thing to get the data you want without blowing up the memory and disk during this huge query? regards, -Gunther
Re: Best Filesystem for PostgreSQL
On 4/17/2019 18:03, Imre Samu wrote: test: PostgreSQL v10.3 + Linux 5.0 File-System Benchmarks: Btrfs vs. EXT4 vs. F2FS vs. XFS https://www.phoronix.com/scan.php?page=article&item=linux-50-filesystems&num=3 So looks like XFS won. I like XFS for its ease of use especially when growing. Any ideas on how ZFS might do? ZFS is of course so much more flexible. -Gunther
Re: Pg10 : Client Configuration for Parallelism ?
On 4/17/2019 4:33, Thomas Kellerer wrote: A CTE would prevent parallelism. You mean like always? His SELECT count(1) FROM BigTable would be parallel if run alone but as WITH Data AS (SELECT count(1) FROM BigTable) SELECT * FROM Data nothing would be parallel any more? How about: SELECT * FROM (SELECT count(1) FROM BigTable) x Parallel or not? -Gunther
Re: Pg10 : Client Configuration for Parallelism ?
By the way, on 4/17/2019 7:26, laurent.decha...@orange.com wrote: I can see whether there is parallelism with pg_top or barely top on the server.

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
38584 postgres  20   0 8863828  8.153g 8.151g R 100.0  3.2  1:23.01 postgres
   10 root      20   0       0       0      0 S   0.3  0.0 88:07.26 rcu_sched

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM    TIME+ COMMAND
46687 postgres  20   0 8864620  0.978g 0.977g S  38.5  0.4  0:01.16 postgres
46689 postgres  20   0 8864348  996.4m 995.1m R  38.5  0.4  0:01.16 postgres
46690 postgres  20   0 8864348  987.2m 985.8m S  38.5  0.4  0:01.16 postgres
46691 postgres  20   0 8864348  998436 997084 R  38.5  0.4  0:01.16 postgres
...
46682 postgres  20   0  157996    2596   1548 R   0.7  0.0  0:00.05 top

If you just use top with the -c option, you will see each postgres process identify itself as to its role, e.g.

postgres: parallel worker for PID 46687
postgres: SELECT ...
postgres: wal writer

Extremely useful, this. -Gunther
Re: Out of Memory errors are frustrating as heck!
(8 chunks); 1471536 used: ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used

exact same as before:

ExecutorState: 1369337168 total in 163397 blocks; 248840 free (36 chunks); 1369088328 used

but now we get an increase to:

ExecutorState: 1369345496 total in 163398 blocks; 248840 free (36 chunks); 1369096656 used

after I did this:

(gdb) cont
Continuing.
Breakpoint 2, AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
718 {
(gdb) bt 4
#0 AllocSetAlloc (context=0x2a1d190, size=8) at aset.c:718
#1 0x0084e8ad in palloc0 (size=size@entry=8) at mcxt.c:969
#2 0x00702b63 in makeBufFileCommon (nfiles=nfiles@entry=1) at buffile.c:119
#3 0x00702e4c in makeBufFile (firstfile=163423) at buffile.c:138
(More stack frames follow...)
(gdb) call MemoryContextStats(TopPortalContext)

So now we have it confirmed, don't we? No! We have not! We stop at the entrance of the allocate method, so when I interrupted, there was no call yet; at the next stop the increase was from the previous call. Continuing ... this now is from a stop at makeBufFileCommon:

ExecutorState: 1369345496 total in 163398 blocks; 248816 free (36 chunks); 1369096680 used

And again, now stopped before:

ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 1369096704 used
ExecutorState: 1369345496 total in 163398 blocks; 248792 free (36 chunks); 1369096704 used

I don't see growth between individual invocations. Anyway, these are the two ways to get there:

(gdb) bt 4
#0 AllocSetAlloc (context=0x2a1d190, size=4) at aset.c:718
#1 0x0084e7dd in palloc (size=size@entry=4) at mcxt.c:938
#2 0x00702e59 in makeBufFile (firstfile=163423) at buffile.c:140
#3 BufFileCreateTemp (interXact=interXact@entry=false) at buffile.c:201
(More stack frames follow...)
(gdb) cont
Continuing.
Breakpoint 3, AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718 718 { (gdb) bt 3 #0 AllocSetAlloc (context=0x2a1d190, size=394) at aset.c:718 #1 0x0084e7dd in palloc (size=394) at mcxt.c:938 #2 0x006101bc in ExecHashJoinGetSavedTuple (file=file@entry=0x4b4a198, hashvalue=hashvalue@entry=0x7ffcbf92fe5c, tupleSlot=0x2ae0ab8, hjstate=0x2a1d920) at nodeHashjoin.c:1277 (More stack frames follow...) But now it increased ExecutorState: 1369353824 total in 163399 blocks; 248792 free (36 chunks); 1369105032 used It increases every 3 times I stop at the breakpoint. -Gunther
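For what it's worth, the two ExecutorState totals quoted above pin down the leak rate: the difference between successive dumps is one extra block of 8328 bytes, which is in the right ballpark for the 8272-byte BufFile request seen in the earlier OOM message plus some allocator overhead. This is just my arithmetic on the numbers in the thread, not a measurement from the code:

```python
# ExecutorState totals from the two MemoryContextStats dumps above
before = 1369345496  # 163398 blocks
after = 1369353824   # 163399 blocks

growth = after - before
print(growth)  # 8328 bytes for one additional block
```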
Re: Out of Memory errors are frustrating as heck!
b in ExecHashJoinSaveTuple (tuple=0x2ba1018, hashvalue=, fileptr=0x6305b00) at nodeHashjoin.c:1220 #6 0x0060d766 in ExecHashTableInsert (hashtable=hashtable@entry=0x2b50ad8, slot=, hashvalue=) at nodeHash.c:1663 #7 0x00610c8f in ExecHashJoinNewBatch (hjstate=0x29a6be0) at nodeHashjoin.c:1051 (More stack frames follow...) and on (gdb) info frame Stack level 0, frame at 0x7ffcbf92fd90: rip = 0x849030 in AllocSetAlloc (aset.c:718); saved rip = 0x84e8ad called by frame at 0x7ffcbf92fdb0 source language c. Arglist at 0x7ffcbf92fd80, args: context=0x29a6450, size=8 Locals at 0x7ffcbf92fd80, Previous frame's sp is 0x7ffcbf92fd90 Saved registers: rip at 0x7ffcbf92fd88 (gdb) b AllocSetAlloc if (int)strcmp(context->name, "ExecutorState") == 0 && *(int *)$rsp != 0x84e7dd && 0x84e8ad != *(int *)$rsp Note: breakpoint 6 also set at pc 0x849030. Breakpoint 7 at 0x849030: file aset.c, line 718. (gdb) delete 6 Now if I continue I don't seem to be stopping any more. Does this help now? -Gunther
Re: Out of Memory errors are frustrating as heck!
On 4/15/2019 21:49, Gunther wrote: I'm going to try without that DISTINCT step, or perhaps by dismantling this query until it works without this excessive memory growth. It also failed. Out of memory. The resident memory size of the backend was 1.5 GB before it crashed. TopMemoryContext: 4335600 total in 8 blocks; 41208 free (16 chunks); 4294392 used HandleParallelMessages: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used TableSpace cache: 8192 total in 1 blocks; 2096 free (0 chunks); 6096 used Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 416 free (0 chunks); 7776 used TopTransactionContext: 8192 total in 1 blocks; 5416 free (2 chunks); 2776 used RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used MessageContext: 524288 total in 7 blocks; 186848 free (7 chunks); 337440 used Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used smgr relation table: 32768 total in 3 blocks; 16832 free (8 chunks); 15936 used TransactionAbortContext: 32768 total in 1 blocks; 32512 free (0 chunks); 256 used Portal hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used TopPortalContext: 8192 total in 1 blocks; 7664 free (0 chunks); 528 used PortalHoldContext: 24632 total in 2 blocks; 7392 free (0 chunks); 17240 used PortalContext: 1105920 total in 138 blocks; 10368 free (8 chunks); 1095552 used: ExecutorState: 2238648944 total in 266772 blocks; 3726944 free (16276 chunks); 2234922000 used HashTableContext: 16384 total in 2 blocks; 4032 free (5 chunks); 12352 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7320 free (0 chunks); 872 
used HashBatchContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8192 total in 1 blocks; 7752 free (0 chunks); 440 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used HashTableContext: 8192 total in 1 blocks; 7624 free (0 chunks); 568 used HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used TupleSort main: 286912 total in 8 blocks; 246792 free (39 chunks); 40120 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used HashTableContext: 8454256 total in 6 blocks; 64848 free (32 chunks); 8389408 used HashBatchContext: 66935744 total in 2037 blocks; 7936 free (0 chunks); 66927808 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used 
ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used ExprContext: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used Relcache by OID: 16384 total in 2 blocks; 3512 free (2 chunks); 12872 used CacheMemoryContext: 1101328 total in 14 blocks; 383480 free (0 chunks); 717848 used index info: 2048 total in 2 blocks; 680 free (1 chunks); 1368 used: pg_toast_2619_index index info: 2048 total in 2 blocks; 968 free (1 chunks); 1080 used: entity_id_fkidx index info: 2048 total in 2 blocks; 696 free (1 c
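Memory context dumps like the one quoted above follow a fixed per-line format ("Name: N total in B blocks; F free (C chunks); U used"), so rather than eyeballing thousands of lines, a small script can rank the contexts by bytes used. This is an editor's sketch, not part of the original thread; the sample string contains three lines copied from the dump above.

```python
import re

# Matches dump lines like:
#   ExecutorState: 2238648944 total in 266772 blocks; 3726944 free (16276 chunks); 2234922000 used
CONTEXT_RE = re.compile(
    r"(?P<name>[\w .]+): (?P<total>\d+) total in \d+ blocks?; "
    r"(?P<free>\d+) free \(\d+ chunks?\); (?P<used>\d+) used"
)

def top_contexts(dump_text, n=3):
    """Return the n memory contexts using the most bytes, largest first."""
    found = []
    for m in CONTEXT_RE.finditer(dump_text):
        found.append((m.group("name").strip(), int(m.group("used"))))
    return sorted(found, key=lambda t: -t[1])[:n]

dump = """
TopMemoryContext: 4335600 total in 8 blocks; 41208 free (16 chunks); 4294392 used
ExecutorState: 2238648944 total in 266772 blocks; 3726944 free (16276 chunks); 2234922000 used
HashBatchContext: 66935744 total in 2037 blocks; 7936 free (0 chunks); 66927808 used
"""
for name, used in top_contexts(dump):
    print(name, used)  # ExecutorState first, with ~2.2 GB used
```

On the full dump this immediately singles out ExecutorState as the runaway context, which is exactly what the thread converges on.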
Re: Out of Memory errors are frustrating as heck!
-join specific code that is doing it. How about it's in the DISTINCT? I noticed while peeking up the call chain that it was already in the UNIQUE sort thing also. I guess it's streaming the results from the hash join right into the unique sort step.

What I've done before is compile with the comments removed from src/backend/utils/mmgr/aset.c:

/* #define HAVE_ALLOCINFO */

I have just done that, and it creates an insane amount of output from all the processes; I'm afraid there will be no way to keep that stuff separated. If there were a way of turning that on and off for one process only, then we could probably get more info... Everything is also extremely slow that way. Like in half an hour the memory didn't even reach 100 MB.

and then look for allocation sizes which are getting allocated but not freed, and then you can go back to gdb to look for allocations of those specific sizes.

I guess I should look for both, address and size, to match it better.

This generates a massive amount of output, and it bypasses the logging configuration and goes directly to stderr--so it might not end up where you expect.

Yes, massive, like I said. Impossible to use. File system fills up rapidly. I made it so that it can be turned on and off with the debugger:

int _alloc_info = 0;
#ifdef HAVE_ALLOCINFO
#define AllocFreeInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocFree: %s: %p, %zu\n", \
            (_cxt)->header.name, (_chunk), (_chunk)->size)
#define AllocAllocInfo(_cxt, _chunk) \
    if (_alloc_info) \
        fprintf(stderr, "AllocAlloc: %s: %p, %zu\n", \
            (_cxt)->header.name, (_chunk), (_chunk)->size)
#else
#define AllocFreeInfo(_cxt, _chunk)
#define AllocAllocInfo(_cxt, _chunk)
#endif

so with this I do:

(gdb) b AllocSetAlloc
(gdb) cont
(gdb) set _alloc_info=1
(gdb) disable
(gdb) cont

then I wait, ... until it crashes again ... no, it's too much. It fills up my filesystem in no time with the logs. It produced 3 GB in just a minute of run time.
And also, I doubt we can find anything specifically by allocation size. It's just going to be 512 or whatever. Isn't there some other way? I'm going to try without that DISTINCT step, or perhaps by dismantling this query until it works without this excessive memory growth. -Gunther
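Since the AllocAlloc/AllocFree lines printed by the patched macros have a fixed shape ("AllocAlloc: &lt;context&gt;: &lt;ptr&gt;, &lt;size&gt;"), one way to make gigabytes of trace usable is to aggregate it offline instead of reading it: count allocations minus frees per (context, size) pair and keep only the positive remainders. An editor's sketch; the sample trace lines below are invented for illustration, only the line format comes from the macros shown above.

```python
from collections import Counter

def net_allocations(lines):
    """Count allocations minus frees per (context, size) from
    AllocAlloc/AllocFree trace lines as printed by the patched aset.c."""
    net = Counter()
    for line in lines:
        try:
            tag, ctx, rest = line.split(": ", 2)
            _ptr, size = rest.split(", ")
        except ValueError:
            continue  # not a trace line; skip
        key = (ctx, int(size))
        if tag == "AllocAlloc":
            net[key] += 1
        elif tag == "AllocFree":
            net[key] -= 1
    # whatever was allocated more often than freed is a leak candidate
    return {k: v for k, v in net.items() if v > 0}

trace = [
    "AllocAlloc: ExecutorState: 0x11f2a30, 8272",
    "AllocAlloc: ExecutorState: 0x11f4f80, 8272",
    "AllocFree: ExecutorState: 0x11f2a30, 8272",
    "AllocAlloc: ExprContext: 0x11f6000, 512",
    "AllocFree: ExprContext: 0x11f6000, 512",
]
print(net_allocations(trace))  # {('ExecutorState', 8272): 1}
```

Run over the real stderr capture, this answers the "which sizes leak" question without anyone having to keep the 3 GB of raw output.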
Re: Out of Memory errors are frustrating as heck!
ProcNode (node=0x11687d0) at ../../../src/include/executor/executor.h:247
#11 ExecUnique (pstate=0x11685e0) at nodeUnique.c:73
#12 0x005fde68 in ExecProcNodeInstr (node=0x11685e0) at execProcnode.c:461
#13 0x005f75ba in ExecProcNode (node=0x11685e0) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=, dest=0xcc60e0 , direction=, numberTuples=0, sendTuples=, operation=CMD_SELECT, use_parallel_mode=, planstate=0x11685e0, estate=0x1168340) at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x119b6d8, direction=, count=0, execute_once=) at execMain.c:364
#16 0x0059c6f8 in ExplainOnePlan (plannedstmt=plannedstmt@entry=0x1199a68, into=into@entry=0x0, es=es@entry=0x1141d48, queryString=, params=0x0, queryEnv=queryEnv@entry=0x0, planduration=0x7fff2e4ca990) at explain.c:535
#17 0x0059c9ef in ExplainOneQuery (query=, cursorOptions=, into=0x0, es=0x1141d48, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=0x0, queryEnv=0x0) at explain.c:371
#18 0x0059ce37 in ExplainQuery (pstate=pstate@entry=0xf74608, stmt=stmt@entry=0x11ef240, queryString=queryString@entry=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., params=params@entry=0x0, queryEnv=queryEnv@entry=0x0, dest=dest@entry=0xf74578) at explain.c:254
#19 0x0072ca5d in standard_ProcessUtility (pstmt=0x11ef390, queryString=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"..., context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0xf74578, completionTag=0x7fff2e4cab20 "") at utility.c:675
#20 0x0072a052 in PortalRunUtility (portal=0xfb06b0, pstmt=0x11ef390, isTopLevel=, setHoldSnapshot=, dest=, completionTag=0x7fff2e4cab20 "") at pquery.c:1178
#21 0x0072add2 in FillPortalStore (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true) at pquery.c:1038
#22 0x0072b855 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0xf4c570, altdest=altdest@entry=0xf4c570, completionTag=0x7fff2e4cad30 "") at pquery.c:768
#23 0x007276e8 in exec_simple_query (query_string=0xf4af30 "explain analyze\nSELECT DISTINCT\n documentInternalId, is_current,\ndocumentId,\ndocumentTypeCode,\nsubjectRoleInternalId,\nsubjectEntityInternalId,\nsubjectEntityId,\nsubjectEntityIdRoot,\nsubjectEntit"...) at postgres.c:1145
#24 0x00729534 in PostgresMain (argc=, argv=argv@entry=0xf76ce8, dbname=, username=) at postgres.c:4182
#25 0x006be215 in BackendRun (port=0xf6efe0) at postmaster.c:4361
#26 BackendStartup (port=0xf6efe0) at postmaster.c:4033
#27 ServerLoop () at postmaster.c:1706
#28 0x006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#29 0x004822dc in main (argc=3, argv=0xf45320) at main.c:228

Good, now I leave this all sitting like that for you to ask me what else you might want to see. We are now close to the edge of the cliff.

-Gunther
Re: Out of Memory errors are frustrating as heck!
On 4/14/2019 23:24, Tom Lane wrote:

ExecutorState: 2234123384 total in 266261 blocks; 3782328 free (17244 chunks); 2230341056 used

Oooh, that looks like a memory leak right enough. The ExecutorState should not get that big for any reasonable query.

2.2 GB is massive, yes.

Your error and stack trace show a failure in HashBatchContext, which is probably the last of these four:

HashBatchContext: 57432 total in 3 blocks; 16072 free (6 chunks); 41360 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
HashBatchContext: 90288 total in 4 blocks; 16072 free (6 chunks); 74216 used
HashBatchContext: 100711712 total in 3065 blocks; 7936 free (0 chunks); 100703776 used

Perhaps that's more than it should be, but it's silly to obsess over 100M when there's a 2.2G problem elsewhere. Yes. I think it's likely that it was just coincidence that the failure happened right there. Unfortunately, that leaves us with no info about where the actual leak is coming from.

Strange, though, that the vmstat tracking never showed the cache-allocated memory going much below 6 GB. Even if this 2.2 GB memory leak is there, and even if I had 2 GB of shared_buffers, I would still have enough for the OS to give me. Is there any doubt that this might be a problem with Linux? Because if you want, I can whip out a FreeBSD machine, compile pgsql, and attach the same disk, and try it there. I am longing to have a reason to move back to FreeBSD anyway. But I have tons of stuff to do, so if you do not have reason to suspect Linux of doing wrong here, I prefer skipping that futile attempt.

The memory map shows that there were three sorts and four hashes going on, so I'm not sure I believe that this corresponds to the query plan you showed us before.

Like I said, the first explain was not using the same constraints (no NL). Now what I sent last should all be consistent: memory dump and explain plan and gdb backtrace.
Any chance of extracting a self-contained test case that reproduces this?

With 18 million rows involved in the base tables, hardly. But I am ready to try some other things with the debugger that you want me to try. If we have a memory leak issue, we might just as well try to plug it! I could even give one of you access to the system that runs this.

thanks, -Gunther
Re: Out of Memory errors are frustrating as heck!
ter varying(555) | agencyname| text | productitemcode | character varying(512) | productinternalid | character varying(255) |

Could you rerun the query with \set VERBOSITY verbose to show the file/line that's failing?

Here goes:

integrator=# \set VERBOSITY verbose
integrator=# SET ENABLE_NESTLOOP TO OFF;
SET
integrator=# INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;
ERROR: 53200: out of memory
DETAIL: Failed on request of size 32800 in memory context "HashBatchContext".
LOCATION: MemoryContextAlloc, mcxt.c:798

You'll notice that I set ENABLE_NESTLOOP to off; that is because the planner goes off thinking the NL plan is marginally more efficient, but in fact it will take 5 hours to get to the same out of memory crash, while the no-NL plan gets there in half an hour. That verbose setting didn't help much, I guess.

If you wanted to show a stack trace, you could attach gdb to the PID from SELECT pg_backend_pid(), "b"reak on errdetail, run the query, and then "bt" when it fails.

gdb -p 27930
GNU gdb (GDB) Red Hat Enterprise Linux 8.0.1-30.amzn2.0.3
...
Attaching to process 27930
Reading symbols from /usr/local/pgsql/bin/postgres...done.
...
(gdb) b errdetail
Breakpoint 1 at 0x82b210: file elog.c, line 872.
(gdb) cont
Continuing.
Breakpoint 1, errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
872 {
(gdb) bt
#0 errdetail (fmt=fmt@entry=0x9d9958 "Failed on request of size %zu in memory context \"%s\".") at elog.c:872
#1 0x0084e320 in MemoryContextAlloc (context=0x600, size=size@entry=32800) at mcxt.c:794
#2 0x0060ce7a in dense_alloc (size=384, size@entry=381, hashtable=, hashtable=) at nodeHash.c:2696
#3 0x0060d788 in ExecHashTableInsert (hashtable=hashtable@entry=0x10ead08, slot=, hashvalue=194758122) at nodeHash.c:1614
#4 0x00610c6f in ExecHashJoinNewBatch (hjstate=0x10806b0) at nodeHashjoin.c:1051
#5 ExecHashJoinImpl (parallel=false, pstate=0x10806b0) at nodeHashjoin.c:539
#6 ExecHashJoin (pstate=0x10806b0) at nodeHashjoin.c:565
#7 0x0061ce4e in ExecProcNode (node=0x10806b0) at ../../../src/include/executor/executor.h:247
#8 ExecSort (pstate=0x1080490) at nodeSort.c:107
#9 0x0061d2c4 in ExecProcNode (node=0x1080490) at ../../../src/include/executor/executor.h:247
#10 ExecUnique (pstate=0x107ff60) at nodeUnique.c:73
#11 0x00619732 in ExecProcNode (node=0x107ff60) at ../../../src/include/executor/executor.h:247
#12 ExecModifyTable (pstate=0x107fd20) at nodeModifyTable.c:2025
#13 0x005f75ba in ExecProcNode (node=0x107fd20) at ../../../src/include/executor/executor.h:247
#14 ExecutePlan (execute_once=, dest=0x7f0442721998, direction=, numberTuples=0, sendTuples=, operation=CMD_INSERT, use_parallel_mode=, planstate=0x107fd20, estate=0x107f830) at execMain.c:1723
#15 standard_ExecutorRun (queryDesc=0x1086880, direction=, count=0, execute_once=) at execMain.c:364
#16 0x0072a972 in ProcessQuery (plan=, sourceText=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;", params=0x0, queryEnv=0x0, dest=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:161
#17 0x0072abb0 in PortalRunMulti (portal=portal@entry=0xfb06b0, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=dest@entry=0x7f0442721998, altdest=altdest@entry=0x7f0442721998, completionTag=completionTag@entry=0x7fff2e4cad30 "") at pquery.c:1286
#18 0x0072b661 in PortalRun (portal=portal@entry=0xfb06b0, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true, dest=dest@entry=0x7f0442721998, altdest=altdest@entry=0x7f0442721998, completionTag=0x7fff2e4cad30 "") at pquery.c:799
#19 0x007276e8 in exec_simple_query (query_string=0xf4a710 "INSERT INTO reports.BusinessOperation SELECT * FROM reports.v_BusinessOperation;") at postgres.c:1145
#20 0x00729534 in PostgresMain (argc=, argv=argv@entry=0xf76ce8, dbname=, username=) at postgres.c:4182
#21 0x006be215 in BackendRun (port=0xf6dfe0) at postmaster.c:4361
#22 BackendStartup (port=0xf6dfe0) at postmaster.c:4033
#23 ServerLoop () at postmaster.c:1706
#24 0x006bf122 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0xf45320) at postmaster.c:1379
#25 0x004822dc in main (argc=3, argv=0xf45320) at main.c:228

That's it. Thank you all very much for your interest in this case.

-Gunther
Re: Out of Memory errors are frustrating as heck!
ndex index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_database_oid_index
index info: 2048 total in 2 blocks; 952 free (1 chunks); 1096 used: pg_authid_rolname_index
WAL record construction: 49768 total in 2 blocks; 6368 free (0 chunks); 43400 used
PrivateRefCount: 8192 total in 1 blocks; 2624 free (0 chunks); 5568 used
MdSmgr: 8192 total in 1 blocks; 7256 free (1 chunks); 936 used
LOCALLOCK hash: 16384 total in 2 blocks; 4600 free (2 chunks); 11784 used
Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
ErrorContext: 8192 total in 1 blocks; 7936 free (4 chunks); 256 used
Grand total: 1437014672 bytes in 168424 blocks; 11879744 free (3423 chunks); 1425134928 used
2019-04-14 16:38:26.355 UTC [11061] ERROR: out of memory
2019-04-14 16:38:26.355 UTC [11061] DETAIL: Failed on request of size 8272 in memory context "ExecutorState".

I am delighted that Tom Lane is hinting that my query plan doesn't look so crazy, and it isn't. And delighted there may be a known bug involved. I wonder if this is a data issue. Maybe a few rows have excessively long text fields? But even checking for that is rather difficult, because there are many tables and columns involved. It would really be nice if the error said exactly which plan step that ExecutorState referred to, so one could narrow it down.

regards, -Gunther

On 4/14/2019 17:19, Tom Lane wrote: Gunther writes: For weeks now, I am banging my head at an "out of memory" situation. There is only one query I am running on an 8 GB system; whatever I try, I get knocked out on this out of memory. It is extremely impenetrable to understand and fix this error. I guess I could add a swap file, and then I would have to take the penalty of swapping. But how can I actually address an out of memory condition if the system doesn't tell me where it is happening? You might want to see the query, but it is a huge plan, and I can't really break this down. It shouldn't matter though.
But just so you can get a glimpse here is the plan: Is that the whole plan? With just three sorts and two materializes, it really shouldn't use more than more-or-less 5X work_mem. What do you have work_mem set to, anyway? Is this a 64-bit build of PG? Also, are the estimated rowcounts shown here anywhere close to what you expect in reality? If there are any AFTER INSERT triggers on the insertion target table, you'd also be accumulating per-row trigger queue entries ... but if there's only circa 35K rows to be inserted, it's hard to credit that eating more than a couple hundred KB, either. Might this be a bug? It's conceivable that you've hit some memory-leakage bug, but if so you haven't provided nearly enough info for anyone else to reproduce it. You haven't even shown us the actual error message :-( https://wiki.postgresql.org/wiki/Guide_to_reporting_problems regards, tom lane
Out of Memory errors are frustrating as heck!
0 26873160 430997600 0 0 181 188 0 2 98 0 0 2019-04-14 16:38:48
0 00 27069200 430011600 0 105 137 263 0 0 100 0 0 2019-04-14 16:38:50
0 00 27066720 430023200 0 0 142 204 0 0 100 0 0 2019-04-14 16:38:52
0 00 28151160 419192800 0 0 116 242 0 0 100 0 0 2019-04-14 16:38:54
0 00 28153640 419200800 0 0 116 239 0 0 100 0 0 2019-04-14 16:38:56
0 00 28151160 419216400 0 0 159 236 0 0 100 0 0 2019-04-14 16:38:58

ending after the out of memory crash, which occurred exactly at the marked point, 16:38:26.355 UTC.

We can't really see anything too worrisome. There is always lots of memory used by cache, which could have been mobilized. The only possible explanation I can think of is that at the moment of the crash the memory utilization suddenly skyrocketed in less than a second, so that the 2-second vmstat interval wouldn't show it??? Nah.

I have already much reduced work_mem, which has helped in some other cases before. Now I am going to reduce shared_buffers, but that seems counter-intuitive, because we are sitting on all that cache memory unused!

Might this be a bug? It feels like a bug. It feels like those out of memory issues should be handled more gracefully (garbage collection attempt?), and that somehow there should be more information so the person can do something about it.

Any ideas?

-Gunther
Block / Page Size Optimization
Hi all, I am sure this should be a FAQ, but I can't see a definitive answer, only chatter on various lists and forums.

The default page size of PostgreSQL is 8192 bytes, while the default IO block size in Linux is 4096 bytes. I can set up an XFS file system with an 8192-byte block size, but then it does not mount on Linux, because the VM page size is the limit, 4096 again. There seems to be no way to change that in (most, common) Linux variants. In FreeBSD there appears to be a way to change that. But then, there is a hardware limit also, as far as the VM memory page allocation is concerned. Apparently on most i386/amd64 architectures the VM page sizes are 4k, 2M, and 1G. The latter, I believe, are called "hugepages", and I only ever see that discussed in the PostgreSQL manuals for Linux, not for FreeBSD.

People have asked: does it matter? And then there is all that chatter about "why don't you run a benchmark and report back to us" -- "OK, will do" -- and then it's crickets. But why is this such a secret?

On Amazon AWS there is the following very simple situation: IO is capped in IO operations per second (IOPS). Let's say, on a smallish volume, I get 300 IOPS (once my burst balance is used up). Now my simple theoretical reasoning is this: one IO call transfers 1 block of 4k size. That means, with a cap of 300 IOPS, I get to send 1.17 MB per second. That would be the absolute limit. BUT, if I could double the transfer size to 8k, I should be able to move 2.34 MB per second. Shouldn't I? That might well depend on whether AWS' virtual device paths would support these 8k block sizes.

But something tells me that my reasoning here is totally off, because I get better IO throughput than that. Even at 3000 IOPS I would only get 11 MB per second, and I am sure I am getting more like 50-100 MB/s, no? So my simplistic logic is false. What really is the theoretical issue with the file system block size?
Where does -- in theory -- the benefit come from of using an XFS block size of 8 kB, or even increasing the PostgreSQL page size to 16 kB and then the XFS block size also to 16 kB? I remember having seen standard UFS block sizes of 16 kB. But then why is Linux so tough about refusing to mount an 8 kB XFS because its VM page size is only 4 kB? Doesn't this all have one straight explanation?

If you have a link that I can just read, I appreciate you sharing that. I think this should be on some Wiki or FAQ somewhere. If I get a quick and dirty explanation with some pointers, I can try to write it out into a more complete answer that might be added into some documentation or FAQ somewhere.

thanks & regards, -Gunther
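The IOPS arithmetic in the question above can be written out directly: multiply the IOPS cap by the bytes moved per IO. A quick sanity-check sketch (editor's addition, using the figures from the post):

```python
def max_throughput_mb_s(iops, block_bytes):
    """Upper bound on throughput if every IO moves exactly one block."""
    return iops * block_bytes / (1024 * 1024)

print(round(max_throughput_mb_s(300, 4096), 2))   # 1.17 MB/s at 300 IOPS, 4k blocks
print(round(max_throughput_mb_s(300, 8192), 2))   # 2.34 MB/s with 8k blocks
print(round(max_throughput_mb_s(3000, 4096), 2))  # 11.72 MB/s at 3000 IOPS
```

The observed 50-100 MB/s exceeds this bound, which suggests the assumption "one IO call transfers one 4k block" is what's wrong: the kernel and device can coalesce adjacent blocks into much larger transfers, so a single IO operation often moves far more than one file-system block.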
Re: EXPLAIN PLAN for DELETE CASCADE or DELETE not using pkey index despite EXPLAINing that it would?
On 3/21/2019 17:16, Tom Lane wrote: Gunther writes: I have 250 rows to delete, but they are a target to a bunch of child tables with foreign key on delete cascade. EXPLAIN DELETE FROM Foo WHERE id = (SELECT fooId FROM Garbage); shows me that it uses the nested loop by Foo_pkey index to find the 250 items from Garbage to be deleted. But once that starts, I see a HUGE amount of read activity from the tablespace Foo_main that contains the Foo table, and only the Foo table: not the Foo_pkey, not any other index, not any other child table; not even the toast table for Foo is contained in that tablespace (I have the toast table diverted with symlinks to another volume).

I'm betting you neglected to index the referencing column for one or more of the foreign keys involved. You can get away with that as long as you're not concerned with the speed of DELETE ... regards, tom lane

I had the same suspicion. But first, my schema is generated automatically, and all foreign keys have indexes. What is even more stunning is that the table where this massive read activity happens is the Foo heap table. I verified that by using strace: all the massive amounts of reads are on the files of the main Foo table. And this doesn't make sense, since any foreign key targets its primary key. The foreign keys of the child tables are also indexed, and there is no IO on the volumes that hold these child tables, nor is there IO on the volume that holds the Foo_pkey.
EXPLAIN PLAN for DELETE CASCADE or DELETE not using pkey index despite EXPLAINing that it would?
Hi, I have 250 rows to delete, but they are a target to a bunch of child tables with foreign key on delete cascade. EXPLAIN DELETE FROM Foo WHERE id = (SELECT fooId FROM Garbage); shows me that it uses the nested loop by Foo_pkey index to find the 250 items from Garbage to be deleted.

But once that starts, I see a HUGE amount of read activity from the tablespace Foo_main that contains the Foo table, and only the Foo table: not the Foo_pkey, not any other index, not any other child table; not even the toast table for Foo is contained in that tablespace (I have the toast table diverted with symlinks to another volume).

I see the read activity with iostat, reading heavily at 130 MB/s for a long time until my burst balance is used up, then continuing to churn at 32 MB/s. I also see the read activity with iotop, which tells me that it is the postgres backend running the DELETE query that is doing this, not some autovacuum nor anything else. It looks to me that in actuality it is doing a sequential scan for each of the 250 rows, despite EXPLAINing to me that it was going to use that index. It would really be good to know what it is churning so heavily.

I have seen some ways of using dtrace or things like that to set up some measurement points. But I haven't seen how this is done to inspect the effective execution plan and where in that plan it is, i.e., which iteration. It would be nice if there were some way of doing a "deep explain plan", or even better, having an idea of the execution plan which the executor is actually following, and a way to report on the current status of work according to this plan. How else do I figure out what causes this heavy read activity on the main Foo table?

This is something I might even want to contribute. For many years I have been annoyed by this: waiting for a long-running statement without any idea where it is and how much there is still to go.
If I have a plan structure and an executor that follows the plan structure, there must be a way to dump it out. The pg_stat_activity table might contain a current statement id, and then a superuser might ask EXPLAIN STATEMENT :statementId. Or just a pg_plantrace command which would dump the current plan with an indication of the completion % of each step. Delete cascades and triggers should also be viewable from this; they should be traced. I am sure that somewhere inside there is some data structure representing this activity, and all it would take is to dump it?

regards, -Gunther
Re: Poor man's partitioned index .... not being used?
Thanks David Rowley and Andrew Gierth.

On 3/20/2019 23:46, Andrew Gierth wrote: If you did WHERE mod(id,2) = mod(8934,2) AND id = 8934 then the index would likely be used - because the prover can then treat mod(id,2) as an atom (call it X), constant-fold mod(8934,2) to 0 because mod() is immutable, and then observe that (X = 0) proves that (X = 0).

foo=# EXPLAIN SELECT * FROM Test WHERE mod(id,2) = mod(8934,2) AND id = 8934;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)

Yes indeed! It's being used that way! Interesting. Only that we can't use it if id is a variable? Hmm ...

foo=# PREPARE testplan(int) AS
foo-#   SELECT * FROM Test WHERE mod(id,2) = mod($1,2) AND id = $1;
PREPARE
foo=# EXPLAIN EXECUTE testplan(8934);
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)

That's quite alright actually. Now the question is, could we use this in a nested loop query plan?
That's where I think it can't work:

foo=# CREATE TABLE Test2 AS SELECT * FROM Test WHERE random() < 0.01 ORDER BY id DESC;
SELECT 3730
integrator=# ANALYZE Test2;
ANALYZE
foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = mod(a.id,2) AND b.id = a.id) LIMIT 10;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Limit  (cost=110.25..135.67 rows=10 width=8)
   ->  Hash Right Join  (cost=110.25..9591.02 rows=3730 width=8)
         Hash Cond: ((mod(b.id, 2) = mod(a.id, 2)) AND (b.id = a.id))
         ->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 width=4)
         ->  Hash  (cost=54.30..54.30 rows=3730 width=4)
               ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
(6 rows)

foo=# SET enable_hashjoin TO off;
SET
foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = mod(a.id,2) AND b.id = a.id) LIMIT 10;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Limit  (cost=47214.73..47227.86 rows=10 width=8)
   ->  Merge Right Join  (cost=47214.73..52113.16 rows=3730 width=8)
         Merge Cond: (((mod(b.id, 2)) = (mod(a.id, 2))) AND (b.id = a.id))
         ->  Sort  (cost=46939.15..47909.47 rows=388129 width=4)
               Sort Key: (mod(b.id, 2)), b.id
               ->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 width=4)
         ->  Sort  (cost=275.58..284.91 rows=3730 width=4)
               Sort Key: (mod(a.id, 2)), a.id
               ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
(9 rows)

foo=# SET enable_mergejoin TO off;
SET
foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = mod(a.id,2) AND b.id = a.id) LIMIT 10;
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Limit  (cost=0.00..102516.78 rows=10 width=8)
   ->  Nested Loop Left Join  (cost=0.00..38238760.24 rows=3730 width=8)
         Join Filter: ((b.id = a.id) AND (mod(b.id, 2) = mod(a.id, 2)))
         ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
         ->  Materialize  (cost=0.00..9056.93 rows=388129 width=4)
               ->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 width=4)
(6 rows)

It looks like it doesn't want to evaluate the mod(a.id, 2) before it moves to the index query for the nested loop.
Notably the partitioned table approach should do that, but it has a different expression for the partition. No mod function but MODULUS and REMAINDER. I wonder if there was a way of marking such expressions as safe in the query, like suggesting a certain evaluation order, i.e., SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON(mod(b.id,2) = EVAL(mod(a.id,2)) AND b.id = a.id) LIMIT 10; It's OK though. It just goes to show that in a case like this, it is best to just go with the partitioned table anyway. regards, -Gunther
Poor man's partitioned index .... not being used?
Hi all, look at this short story please:

foo=# CREATE TABLE Test(id int NOT NULL PRIMARY KEY);
CREATE TABLE
foo=# INSERT INTO test SELECT row_number() OVER() FROM pg_class a CROSS JOIN pg_class b;
INSERT 0 388129
foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
                                QUERY PLAN
---------------------------------------------------------------------------
 Index Only Scan using test_pkey on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)

foo=# ALTER TABLE Test DROP CONSTRAINT Test_pkey;
ALTER TABLE
foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
                       QUERY PLAN
-------------------------------------------------------
 Seq Scan on test  (cost=0.00..6569.61 rows=1 width=4)
   Filter: (id = 8934)
(2 rows)

foo=# SELECT max(id)/2 FROM Test;
 ?column?
----------
   194064
(1 row)

foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE id < 194064;
CREATE INDEX
foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE id >= 194064;
CREATE INDEX
foo=# ANALYZE Test;
ANALYZE
foo=# EXPLAIN SELECT * FROM Test WHERE id = 8934;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
   Index Cond: (id = 8934)
(2 rows)

foo=# DROP INDEX Test_pk0;
DROP INDEX
foo=# DROP INDEX Test_pk1;
DROP INDEX
foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0;
CREATE INDEX
foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE mod(id,2) = 1;
CREATE INDEX
foo=# ANALYZE Test;
ANALYZE
foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
                       QUERY PLAN
-------------------------------------------------------
 Seq Scan on test  (cost=0.00..6569.61 rows=1 width=4)
   Filter: (id = 8934)
(2 rows)

Why is that index never used?

PS: there is a performance question behind this, big table, heavily used index, the hope was that with this simple scheme of partitioning just the index one might distribute the load better. I know, if the load really is so big, why not partition the entire table. But just for hecks, why not this way?

regards, -Gunther
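As an editor's illustration (not from the thread): the reason the mod-based partial index is skipped is that using it for WHERE id = 8934 would require the planner's predicate prover to substitute the constant into the index predicate and evaluate it, i.e. to derive mod(8934,2) = 0 on its own. The prover does not apply functions like that; it only matches expressions that already appear verbatim in the query, which is why spelling out mod(id,2) = mod(8934,2) makes the index usable. The substitution step the prover would need can be sketched like this (the function names here are hypothetical, purely for illustration):

```python
def pred_even(id_):
    """The Test_pk0 index predicate: mod(id, 2) = 0."""
    return id_ % 2 == 0

def provable_by_substitution(constant, index_pred):
    """For a constant-equality query (id = constant), the partial index is
    valid exactly when its predicate holds for that constant -- a check
    that requires evaluating the predicate function, which the planner's
    prover deliberately does not do for arbitrary expressions."""
    return index_pred(constant)

print(provable_by_substitution(8934, pred_even))  # True: the index *would* cover this row
print(provable_by_substitution(8935, pred_even))  # False: that row lives in Test_pk1
```

The range-based variant (id < 194064) works precisely because proving id = 8934 implies id < 194064 needs only operator-level reasoning on btree semantics, no function evaluation.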
Re: Distributing data over "spindles" even on AWS EBS, (followup to the work queue saga)
On 3/14/2019 11:11, Jeremy Schneider wrote: On 3/14/19 07:53, Gunther wrote: 2. build a low level "spreading" scheme which is to take the partial files 4653828 and 4653828.1, .2, _fsm, etc. and move each to another device and then symlink it back to that directory (I come back to this!) ...

To 2. I find that it would be a nice feature of PostgreSQL if we could just use symlinks and a symlink rule, for example, when PostgreSQL finds that 4653828 is in fact a symlink to /otherdisk/PG/16284/4653828, then it would * by default also create 4653828.1 as a symlink and place the actual data file on /otherdisk/PG/16284/4653828.1

How about if we could just specify multiple tablespaces for an object, and then PostgreSQL would round-robin new segments across the presently configured tablespaces? This seems like a simple and elegant solution to me.

Very good idea! I agree. It would also be very important to revive the existing patch someone had contributed to allow toast tables to be assigned to different tablespaces.

4. maybe I can configure in AWS EBS to reserve more IOPS -- but why would I pay for more IOPS if my cost is by volume size? I can just make another volume? or does AWS play a similar trick on us with IOPS being limited on some "credit" system???

Not credits, but if you're using gp2 volumes then pay close attention to how burst balance works. A single large volume is the same price as two striped volumes at half size -- but the striped volumes will have double the burst speed and take twice as long to refill the burst balance.

Yes, I learned that too. It seems a very interesting "bug" of the Amazon GP2 IOPS allocation scheme. They say it's like 3 IOPS per GiB, so if I have 100 GiB I get 300 IOPS. But it also says minimum 100. So that means if I have 10 volumes of 10 GiB each, I get 1000 IOPS minimum between them all. But if I have it all on one 100 GiB volume, I only get 300 IOPS. I wonder if Amazon is aware of this. I hope they are and think that's just fine.
Because I like it. It is also a clear sign to me that I want to use page sizes > 4k for the file system. I have tried on Amazon Linux to use 8k block sizes for the XFS volume, but I cannot mount those, since Linux says it can currently only deal with 4k blocks. This is another reason I am considering switching the database server(s) to FreeBSD. OTOH, who knows, maybe this 4k limit comes from the AWS EBS infrastructure itself. After all, if I am scraping the 300 or 1000 IOPS limit already and I can suddenly double my block size per IO, I double my IO throughput.

regards, -Gunther
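The allocation arithmetic described above can be sketched in SQL (the 3 IOPS/GiB rate and the 100 IOPS floor are the figures quoted in this thread; actual gp2 rules may differ):

```sql
-- Baseline IOPS per the thread's description of gp2: 3 IOPS per GiB,
-- with a per-volume minimum of 100 IOPS.
SELECT n_volumes,
       gib_each,
       n_volumes * greatest(100, 3 * gib_each) AS total_baseline_iops
FROM (VALUES (1, 100),    -- one 100 GiB volume
             (10, 10))    -- ten 10 GiB volumes, same total size
     AS v(n_volumes, gib_each);
-- one 100 GiB volume yields 300 IOPS; ten 10 GiB volumes yield 1000 IOPS
```

Same total capacity, more than three times the baseline IOPS, which is exactly the "bug" described above.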
Re: Facing issue in using special characters
This is not an issue for "hackers" nor "performance"; in fact, even for "general" it isn't really an issue. "Special characters" is actually nonsense. When people complain about "special characters" they haven't thought things through. If you are unwilling to think things through and go step by step to make sure you know what you are doing, then you will not get it, and really nobody can help you. In my professional experience, people who complain about "special characters" need to be cut loose or be given a chance (if they are established employees who carry some weight). If a contractor complains about "special characters" they need to be fired.

Understand charsets: character set, code point, and encoding. Then understand how encoding, string literals, and "escape sequences" in string literals work. Know that Unicode today is the one standard, and there is no more need for code-table switching. There is nothing special about a Hebrew alef or a Greek lower-case alpha or a Latin A. Nor about a hyphen, an en-dash, or an em-dash. All these characters are in Unicode. Yes, there are some Japanese who complain that their traditional Chinese character forms are put together with the simplified reform Chinese forms, but that's a font issue, not a character-code issue. 7-bit ASCII is the first page of Unicode, even in the UTF-8 encoding. ISO Latin 1 (or the Windows-1252 variant of it) has the same code points as Unicode pages 0 and 1, but it is not compatible with UTF-8 encoding because of the way UTF-8 uses the 8th bit.

But none of this is likely your problem. Your problem is about string literals in SQL, for example, and about the configuration of your database (I always use initdb with --locale C and --encoding UTF-8). Use UTF-8 in the database. Then all your issues are about string literals in SQL and in Java and JSON and XML or whatever you are using. You have to do the right thing.
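To make the string-literal point concrete, here is a short sketch of the ways the same non-ASCII character can be written in PostgreSQL SQL, assuming a UTF-8 client and server encoding as recommended above:

```sql
-- Three equivalent ways to write 'café' in PostgreSQL,
-- assuming client_encoding and server_encoding are both UTF8:
SELECT 'café'        AS direct_utf8,    -- raw UTF-8 bytes in the literal
       E'caf\u00e9'  AS escape_string,  -- E'' escape-string syntax
       U&'caf\00E9'  AS unicode_escape; -- SQL-standard Unicode literal
-- all three columns hold the same value: café
```

There is nothing "special" about é here; only the literal syntax differs.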
If you produce any representation, whether that is XML or JSON or SQL or URL query parameters, or a CSV file, or anything at all, you need to escape your string values properly. This question, with no detail, didn't deserve such a thorough answer, but it's my soap box. I do not accept people complaining about "special characters". My own people get that same sermon from me when they make that mistake.

-Gunther

On 3/15/2019 1:19, M Tarkeshwar Rao wrote:
> Hi all, we are facing an issue in using special characters. We are trying to insert records into a remote PostgreSQL server and our application is not able to perform this because of errors. It seems that the issue is because of the special characters used in one of the fields of a row. Regards, Tarkeshwar
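The escaping point argued above can be illustrated in PostgreSQL itself; this is a sketch of server-side escaping (the table name `t` is hypothetical), and parameterized queries from the client remain the better option:

```sql
-- Generating SQL text safely: format('%L') (or quote_literal()) escapes
-- embedded quotes and other troublesome characters correctly, so there
-- are no "special characters", only unescaped ones.
SELECT format('INSERT INTO t(name) VALUES (%L)', $$O'Brien$$) AS stmt;
-- produces: INSERT INTO t(name) VALUES ('O''Brien')
```

The same discipline applies to JSON, XML, CSV, and URLs: each format has its own escaping rules, and all of them must be applied at the boundary.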
Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.
Wow, yes, partition instead of index, that is interesting. Thanks Corey and Justin.

The index isn't required at all if all my pending jobs are in a partition of only pending jobs. In that case the plan can just be a sequential scan. And Jeff Janes, sorry, I failed to show the LIMIT 1 clause on my dequeue query. That was an omission. My query is actually somewhat more complex; I just boiled it down to the essentials but forgot the LIMIT 1 clause:

SELECT seqNo, action FROM Queue
 WHERE pending AND ... other criteria ...
 LIMIT 1 FOR UPDATE SKIP LOCKED;

And sorry I didn't capture the stats from VACUUM VERBOSE; they would be confusing anyway because there are other things involved.

Anyway, I think the partitioned table is the right and brilliant solution, because an index really isn't required. The actual pending partition will always remain quite small, and, being a queue, it doesn't even matter how big it might grow, as long as new rows are inserted at the end of the data file and not in the middle, and there is some way to quickly skip over the dead rows at the beginning that have already been processed and moved away.

Good thing is, I don't have to worry about a maintenance window. I have the leisure to simply tear down my design now and make a better one. What's 2 million transactions if I can re-process them at a rate of 80/s? 7 hours max. I am still in development, so no need to worry about migration/transition acrobatics.

So if I take Corey's steps and envision the final result, not worrying about the transition steps, then I understand this:

1. Create the Queue table partitioned on the pending column. This creates the partition with the pending jobs (on which I set the fillfactor kind of low, maybe 50) and the default partition with all the rest. Of course that also allows people with a constant transaction volume to partition on jobId or completionTime and move chunks out to cold archive storage, but that's beside the current point.

2.
Add all needed indexes on the partitioned table, except the main partial index that I used before and that required all that reindexing maintenance. Actually I won't need any other indexes at all; why invite a similar problem again?

That's really simple. One question I have though: I imagine our pending-partition heap file to now be essentially sequentially organized as a queue. New jobs are appended at the end, old jobs are at the beginning. As pending jobs become completed (pending = false), these initial rows are marked as dead. So, while the number of live rows will remain small in that pending partition, sequential scans will have to skip over the dead rows at the beginning. Does PostgreSQL structure its files such that skipping over dead rows is fast? Or do the dead rows have to be read and discarded during a table scan?

Of course vacuum eliminates dead rows, but unless I do VACUUM FULL it will not re-pack the live rows, and VACUUM FULL requires an exclusive table lock. So, what is the benefit of vacuuming that pending partition? What I don't want is insertion of new jobs going into open slots at the beginning of the file. I want them to be appended (in Oracle there is an INSERT /*+APPEND*/ hint for that). How does that work in PostgreSQL?

Ultimately that partition will amass too many dead rows; then what do I do? I don't think the OS has a way to truncate files physically from the head, does it? I guess it could set the file pointer from the first block to a later block, but I don't know of an ioctl/fcntl command for that. On some OSes there is a way of making blocks sparse again; is that how PostgreSQL might do it? Just knock out blocks as sparse from the front of the file? If not, the next thing I can think of is to partition the table further by time, maybe alternating even and odd days, such that on any given day one of the two pending partitions is quiet. Is that how it's done?

regards, -Gunther
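Step 1 above might look like the following sketch (table and column names are from the thread; the partition names, bounds, and storage options are my assumptions):

```sql
-- Partition the queue on the pending flag so the hot partition stays small.
CREATE TABLE Queue (
    jobId   bigint,
    action  text,
    pending boolean NOT NULL,
    result  text
) PARTITION BY LIST (pending);

-- Small, hot partition; a low fillfactor leaves room on each page.
CREATE TABLE Queue_pending PARTITION OF Queue
    FOR VALUES IN (true)
    WITH (fillfactor = 50);

-- Completed jobs land in the default partition.
CREATE TABLE Queue_done PARTITION OF Queue DEFAULT;
```

Since PostgreSQL 11, an UPDATE that flips pending from true to false moves the row out of the hot partition automatically, so the dequeue/complete pattern keeps working unchanged.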
Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.
Thank you all for responding so far. David Rowley and Justin Pryzby suggested things about autovacuum, but I don't think autovacuum has any helpful role here. I am explicitly doing a vacuum on that table, and it doesn't help at all. Almost not at all. I want to believe that VACUUM FREEZE Queue; will push the database CPU% down again once it is climbing, and I can do this maybe 3 to 4 times, but ultimately I will always have to rebuild the index. But also, none of these vacuum operations takes very long at all. It is just not efficacious.

Rebuilding the index by building a new index, removing the old one, renaming, and vacuuming again is prone to getting stuck. I tried to do it in a transaction, but it says the CREATE INDEX can't be done in a transaction; I need CREATE INDEX CONCURRENTLY, and I can't even do that in a procedure. If I do it manually, issuing first CREATE INDEX CONCURRENTLY new and then DROP INDEX CONCURRENTLY old, it might work once, but usually it just gets stuck with two indexes, although I noticed that it would actually bring CPU back down and improve transaction throughput. I also noticed that after I quit from the DROP INDEX CONCURRENTLY, that index is shown as INVALID:

\d Queue
...
Indexes:
    "queue_idx_pending"  UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending2" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending3" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending4" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending5" UNIQUE, btree (jobId, action) WHERE pending INVALID
    "queue_idx_pending6" UNIQUE, btree (jobId, action) WHERE pending
...
and so I keep doing that same routine by hand: every time the CPU% creeps above 50% I do

CREATE UNIQUE INDEX CONCURRENTLY Queue_idx_pending6 ON Queue(jobId, action) WHERE currentlyOpen;
DROP INDEX CONCURRENTLY Queue_idx_pending5;

at which point it hangs; I interrupt the DROP command, which leaves the old index behind as INVALID, and then

VACUUM FREEZE ANALYZE Queue;

At this point the db's CPU% drops below 20%, after the new index has been built. Unfortunately this is a totally hands-on approach; I have to do this every 5 minutes or so, and the interval between these interventions seems to be shrinking. It also becomes less effective over time, even though the CPU seemingly recovers. So this isn't sustainable (worse because my Internet connection constantly drops).

What I am most puzzled by is that no matter how long I wait, the DROP INDEX CONCURRENTLY never completes. Why is that? Also, the REINDEX command always fails with a deadlock because there is a row lock and a complete table lock involved. I consider this ultimately a bug, or at the very least there is room for improvement. And I am on version 11.1.

regards, -Gunther
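For what it's worth, PostgreSQL 12 later added a REINDEX variant that performs this create-new/drop-old dance internally without the long exclusive lock; it is not available on 11.1, so this is only an upgrade-path note:

```sql
-- PostgreSQL 12+: rebuilds the index without blocking writes,
-- replacing the manual CREATE/DROP INDEX CONCURRENTLY routine above.
REINDEX INDEX CONCURRENTLY queue_idx_pending;
```

Like the manual routine, an interrupted run can leave an INVALID index behind, which then has to be dropped.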
Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.
On 2/23/2019 16:13, Peter Geoghegan wrote:
> On Sat, Feb 23, 2019 at 1:06 PM Gunther wrote:
>> I thought to keep my index tight, I would define it like this:
>> CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;
>> so that only pending jobs are in that index. When a job is done, follow-up work is often inserted into the Queue as pending, thus adding to that index.
>
> How many distinct jobIds are there in play, roughly? Would you say that there are many fewer distinct jobs than distinct entries in the index/table? Is the number of jobs fixed at a fairly low number that doesn't really grow as the workload scales up?

Jobs start on another, external queue; there were about 200,000 of them waiting when I started the run. When the SQL queue is empty, the workers pick one job from the external queue and add it to the SQL queue. When that happens, immediately 2 more jobs are created on that queue; let's call them phases 1a and 1b. When phase 1a has been worked off, another follow-up job is created; let's call it phase 2. When phase 2 has been worked off, a final phase 3 job is created. When that is worked off, nothing new is created, and the next item is pulled from the external queue and added to the SQL queue.

So this means each of the 200,000 items adds (up to) 4 jobs onto the queue during its processing. But since these 200,000 items are on an external queue, the SQL queue itself is not stuffed full at all. It only grows slowly, and in the main index, where we have only the pending jobs, there are probably only about 20 entries at any given point in time. When I said 7 jobs per second, that meant 7/s simultaneously for all these 3+1 phases, i.e., 28 jobs per second. And at that rate it takes a little less than 30 min for the index to deteriorate; i.e., once about 50,000 queue entries have been processed through that index, it has deteriorated to become nearly unusable until it is rebuilt.

thanks, -Gunther
Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.
Hi, I am using an SQL queue for distributing work to massively parallel workers. Workers come in servers with 12 parallel threads. One of those worker sets handles 7 transactions per second. If I add a second one, for 24 parallel workers, it scales to 14/s. With a third, for 36 parallel workers, I reach 21/s. If I try a fourth set, 48 workers, I end up in trouble. But that isn't even so much my problem; rather it is the fact that in a short time the performance deteriorates, and it looks like that is because the queue index deteriorates and needs a REINDEX.

The queue table is essentially this:

CREATE TABLE Queue (
  jobId bigint,
  action text,
  pending boolean,
  result text
);

the dequeue operation is essentially this:

BEGIN;
SELECT jobId, action FROM Queue WHERE pending FOR UPDATE SKIP LOCKED;

which is a wonderful concept with the SKIP LOCKED. Then I perform the action and finally:

UPDATE Queue SET pending = false, result = ? WHERE jobId = ?;
COMMIT;

I thought to keep my index tight, I would define it like this:

CREATE UNIQUE INDEX Queue_idx_pending ON Queue(jobId) WHERE pending;

so that only pending jobs are in that index. When a job is done, follow-up work is often inserted into the Queue as pending, thus adding to that index.

Below is the performance chart. The blue line at the bottom is the db server. You can see the orange line is the first worker server with 12 threads. It settled into a steady state of 7/s, ran with 90% CPU for some 30 min, and then the database CPU% started climbing. I tried to rebuild the indexes on the queue and got stuck: no jobs were processing, yet the exclusive lock could not be obtained either. So I shut down the worker server. With the database quiet I could resolve the messed-up indexes and restart. Soon I added a second worker server (green line) starting around 19:15. Once settled in, they were pulling 14/s together.
But you can see that in just 15 min the db server CPU% climbed again to over 40% and the performance of the workers dropped, their load falling to 30%. Now at around 19:30 I stopped them all, REINDEXed the queue table, and then started 3 worker servers simultaneously. They settled in at 21/s, but in just 10 min the deterioration happened again. Again I stopped them all, REINDEXed, and now started 4 worker servers (48 threads). This time 5 min was not enough to see them ever settle into a decent 28/s transaction rate, but I guess they might have reached that for a minute or two, only for the index to deteriorate again. I did another stop, now started only 2 servers, and again the index soon deteriorated.

Clearly that index is deteriorating quickly, in about 10,000 transactions. BTW: when I said 7/s, in reality it is about 4 times as many transactions, because of the follow-up jobs that also get added to this queue. So 10 k transactions may be 30 or 40 k transactions before the index deteriorates.

Do you have any suggestion for how I can avoid that index deterioration problem smoothly? I figured I might just pause all workers briefly to schedule the REINDEX Queue command, but the problem with this is that while the transaction volume is large, some jobs may take minutes to process, and in that case we would need to wait minutes to quiet the database, with 47 workers sitting as idle capacity waiting for the 48th to finish so that the index can be rebuilt!

Of course I tried to resolve the issue with vacuumdb --analyze (just in case the autovacuum doesn't act in time) and that doesn't help. vacuumdb --full --analyze would probably help but can't work because it requires an exclusive table lock. I tried to just create a new index the same way:

CREATE UNIQUE INDEX Queue_idx2_pending ON Queue(jobId) WHERE pending;
DROP INDEX Queue_idx_pending;
ANALYZE Queue;

but with that I got completely stuck with two indexes, unable to remove either of them because of those locking issues.
And REINDEX will give me a deadlock error right out. I am looking for a way to manage that index so that it does not deteriorate. Maybe if I did not define it with ... WHERE pending; then it would only grow, but never shrink. Maybe that helps somehow? I doubt it, though. Adding to an index also causes deterioration, and most of the rows would be irrelevant because they would be past work. It would be nicer if there were another smooth way.

regards, -Gunther
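One alternative sometimes used for this pattern is to avoid generating a new index entry on every completion by enabling HOT updates. This is a hedged sketch, not a fix proposed in the thread, and note the caveat: a column used in a partial index's WHERE predicate counts as an indexed column, so updating pending disqualifies the update from being HOT as long as the partial index above exists.

```sql
-- Leave free space on each heap page so the pending=false UPDATE can
-- be a HOT update (same page, no new index entry). Only effective if
-- no updated column appears in any index -- including in the
-- WHERE pending predicate of the partial index above.
ALTER TABLE Queue SET (fillfactor = 50);

-- fillfactor applies to newly written pages; a one-time rewrite makes
-- it effective for existing data (takes an exclusive lock):
VACUUM FULL Queue;
```

It is a trade-off: dropping the partial index to enable HOT means the dequeue query must find pending rows some other way, e.g. the partitioning scheme discussed elsewhere in this thread.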
Re: neither CPU nor IO bound, but throttled performance
Thank you Magnus. 68% steal. Indeed, you probably hit the target. Yes, that explains the discrepancy. I need to watch and understand that CPU-credits issue.

regards, -Gunther

On 2/21/2019 4:08, Magnus Hagander wrote:
> On Thu, Feb 21, 2019 at 12:34 AM Gunther <r...@gusw.net> wrote:
>> Hi, I have an Amazon Linux based PostgreSQL 11 server here on a t2.medium EC2 instance. It is serving 24 worker processes that read jobs from a queue (thanks to SELECT ... FOR UPDATE SKIP LOCKED!) and do jobs, some of which read and write business data to the database, others only read, and some don't hit the business data at all, only the queue. Everything flows quite nicely. Except, I don't understand why I can't max out the CPU or the IO; instead, IO is almost negligible yet the CPU is at 30%, hardly hitting 50%. Here I give you a view of top:
>>
>> top - 23:17:09 up 45 days, 2:07, 4 users, load average: 20.32, 18.92, 13.80
>> Tasks: 338 total, 24 running, 111 sleeping, 0 stopped, 0 zombie
>> %Cpu(s): 28.7 us, 2.5 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.0 si, 68.7 st
>
> If I read that right, it's about 70% "steal". The description for this is "the percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor". So basically, the CPU is spent dealing with other people's VMs on the same hardware. Welcome to the multi-tenant cloud. In particular, I believe T series instances get a limited number of CPU "credits" per hour. My guess is you've hit this limit and are thus being throttled. T series instances are not intended for persistent workloads. Either way, this is probably a question better asked on the Amazon EC2 forums rather than PostgreSQL, as you'll find more people there who know the EC2 interactions.
>
> --
> Magnus Hagander
> Me: https://www.hagander.net/
> Work: https://www.redpill-linpro.com/
neither CPU nor IO bound, but throttled performance
[iostat -x output elided: a long list of xvd* volumes, almost all completely idle (all-zero rows); only two volumes show any write activity at all, roughly 77 and 31 kB/s written with %util below 0.4%]

previously I had hit 100 %util here; that was when I didn't have the tables spread out over so many tablespaces. Now I have them spread out like in the olden days, where you spread your tables over many "spindles", and I did this here so I could see which tables or indexes would be bottlenecks.

So how can it be that queries take quite long without the process running at higher CPU%? Or is there something wrong with the total CPU% estimated by both top and iostat? From top it looks like my 24 worker processes use 8% each, most of them in R(unning) state, so that would be 192%, which divided over the 2 CPUs of the t2.medium instance is really 96%. So I am CPU bound after all?

regards, -Gunther
Re: pg_dump 3 times as slow after 8.4 -> 9.5 upgrade
I confess I don't do dumps or any backups much, other than file system snapshots. But when I do, I don't like how long it takes. I confess my database is big; I have about 200 GB. But still, dumping it should not take 48 hours (and running) while the system is 75% idle and reads are at 4.5 MB/s, when the same system sustains over 100 MB/s during processing of table-scan and hash-join queries. Something is wrong with the dump thing. And no, it's not SSL or whatever; I am doing it on a local system with local connections. Version 9.5 something.

regards, -Gunther

On 11/23/2017 4:26, Henrik Cednert (Filmlance) wrote:
> On 22 Nov 2017, at 22:07, Patrick KUI-LI <pku...@hipay.com> wrote:
>> Hello, I had this behavior when the upgraded pg 9.5 was in ssl mode by default. So I deactivated ssl mode in postgresql.conf. That's all. Regards, Patrick
>
> Hello. And you just uncommented the 'ssl = off' line in the config for this? Is this default behaviour different from 8.4? Is there a 'show running config' for pgsql? I tried that in the test VM and it didn't really give me a significant difference.
>
> COMPRESSION LEVEL: 6, SSL ON
> real    82m33.744s
> user    60m55.069s
> sys     3m3.375s
>
> COMPRESSION LEVEL: 6, SSL OFF
> real    76m31.083s
> user    61m23.282s
> sys     1m23.341s