Re: Query on partitioned table needs memory n_partitions * work_mem
On Thu, 11 Jul 2024, Tom Lane wrote:
> Dimitrios Apostolou writes:
>> The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n).
>
> So don't do that. Adding partitions is not cost-free.

I understand that; they also add an administrative cost that I'd rather avoid. But I ended up adding all these partitions because of performance issues on a multi-billion-row table. There is probably a message from me about that on this list from a couple of years ago.

At the moment I have a workaround. I'm thankful that everyone is willing to provide workarounds for all the potential issues/bugs I have presented, but unfortunately workarounds are not fixes; one will hit the same wall again at some point.

My current concern is **reporting my findings responsibly**. I want to provide as much data as needed to pinpoint the issue, so that the developers know exactly what's going on. Having the right data is half the fix.

A way to track the issue would be nice. I might revisit it and even try to submit a patch. I wonder how the postgres development community tracks all these issues; I've even started forgetting the ones I have found, and I'm sure I have previously reported (on this list) a couple of should-be-easy issues that would be ideal for beginners.

Regards,
Dimitris
Re: Query on partitioned table needs memory n_partitions * work_mem
Thank you for the feedback. So I've managed to reduce the query to a rather simple one:

SELECT workitem_n, test_executable_n,
       bool_or(test_resulttype_n IN (2,3))
  FROM test_runs_raw
 GROUP BY workitem_n, test_executable_n
 LIMIT 10;

The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). All the columns are various integer types. There is an index on workitem_n.

On Thu, 11 Jul 2024, David Rowley wrote:
> On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou wrote:
>> I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow insanely very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to allocate at least 1000 * work_mem.
>>
>>   ->  Append
>>         ->  HashAggregate
>>               ->  Seq Scan
>>         ->  ... 1000 more hashagg+seqscans
>>
>> Is this allocation pattern (work_mem * n_partitions) expected under any scenario? I can't find it documented. AFAIU the backend should allocate up to (depth_of_execution_plan * work_mem), putting aside hash_mem_multiplier and the parallel workers.
>
> Not depth of execution plan. It relates to the number of nodes in the plan which allocate work_mem or work_mem * hash_mem_multiplier. There is some documentation in [1]:
>
> "Note that a complex query might perform several sort and hash operations at the same time"

The latest query is not complex at all, and I don't see it doing thousands of operations at the same time. By "number of nodes", would you add up all the HashAggregate nodes under an Append node?
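For anyone following along, the per-node budget David describes can be checked directly; this is a sketch against the thread's table (work_mem and hash_mem_multiplier are standard GUCs, and each work_mem-consuming node in the plan gets its own budget):

```sql
-- Show the per-node budget and the multiplier that applies to
-- hash-based nodes such as HashAggregate.
SHOW work_mem;
SHOW hash_mem_multiplier;

-- With enable_partitionwise_aggregate=on, EXPLAIN shows one HashAggregate
-- per partition under the Append; each one may use up to
-- work_mem * hash_mem_multiplier.
EXPLAIN (COSTS OFF)
SELECT workitem_n, test_executable_n,
       bool_or(test_resulttype_n IN (2,3))
  FROM test_runs_raw
 GROUP BY workitem_n, test_executable_n
 LIMIT 10;
```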
Here is part of the EXPLAIN ANALYZE output:

 Limit  (cost=0.01..28.00 rows=10 width=7) (actual time=43120.466..43292.246 rows=10 loops=1)
   Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, (bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[]))))
   Buffers: shared hit=96 read=883975
   I/O Timings: shared read=16284.731
   ->  Append  (cost=0.01..3416299633.71 rows=1220556171 width=7) (actual time=42968.794..43139.855 rows=10 loops=1)
         Buffers: shared hit=96 read=883975
         I/O Timings: shared read=16284.731
         ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=10.662..10.663 rows=0 loops=1)
               Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[])))
               Group Key: test_runs_raw.workitem_n, test_runs_raw.test_executable_n
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.test_runs_raw__part_max20k test_runs_raw  (cost=0.00..0.00 rows=1 width=8) (actual time=9.960..9.961 rows=0 loops=1)
                     Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, test_runs_raw.test_resulttype_n
         ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=1.913..1.914 rows=0 loops=1)
               Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, bool_or((test_runs_raw_1.test_resulttype_n = ANY ('{2,3}'::integer[])))
               Group Key: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.test_runs_raw__part_max40k test_runs_raw_1  (cost=0.00..0.00 rows=1 width=8) (actual time=1.031..1.031 rows=0 loops=1)
                     Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, test_runs_raw_1.test_resulttype_n
 [ 1000s of similar HashAggregate nodes ...
]

 Settings: temp_buffers = '32MB', work_mem = '32MB', effective_io_concurrency = '300', max_parallel_workers_per_gather = '0', enable_hashjoin = 'off', enable_partitionwise_join = 'on', enable_partitionwise_aggregate = 'on', random_page_cost = '1.1', effective_cache_size = '6GB', from_collapse_limit = '24', join_collapse_limit = '24'
 Planning:
   Buffers: shared hit=377
 Planning Time: 1503.800 ms
 Execution Time: 56515.185 ms
(5382 rows)

Memory usage on each HashAggregate is logged as 24kB (many HashAggregates are missing that info though). I guess the EXPLAIN output is missing some important part of the allocations here, since I'm seeing MBs of allocations per node. I can't help but see this as a bug. I see many issues:

* Postgres is not reading from the partitions in parallel, but one after the other. It shouldn't need all this memory simultaneously.
* The memory is unnecessarily allocated early on, before any partitions are actually aggregated. I know this because I/O is slow on this device and the table sizes are huge; it's simply not possible that postgres went through all partitions and blew up the memory. That would take hours, but the OOM happens seconds after I start the query.
* The memory is not only allocated by the planner, but it's actually accessed. Libc's malloc() has no problem allocating gigabytes more than wha
Query on partitioned table needs memory n_partitions * work_mem
Hello list,

I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow insanely very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to allocate at least 1000 * work_mem. If I reduce work_mem, I can control the outcome and avoid the crash, but this is suboptimal. I have parallel plans disabled (max_parallel_workers_per_gather = 0).

To add a bit more info on the execution plan, I believe the relevant part is the 1000 HashAggregate nodes under Append:

  ->  Append
        ->  HashAggregate
              ->  Seq Scan
        ->  ... 1000 more hashagg+seqscans

Is this allocation pattern (work_mem * n_partitions) expected under any scenario? I can't find it documented. AFAIU the backend should allocate up to (depth_of_execution_plan * work_mem), putting aside hash_mem_multiplier and the parallel workers.

NOTE: after having written the above message, it occurred to me that I have enable_partitionwise_aggregate=on. Turning it off fixes the issue and makes the query faster too! Expected behaviour or bug?

Thank you in advance,
Dimitris

P.S. In the meantime I'm trying to reduce the query and the table schema, in order to submit a precise bug report with repro instructions.
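For anyone hitting the same wall, the mitigations that came out of this thread can be sketched at session level like this (the worst case is roughly n_partitions * work_mem, so 1000 partitions at work_mem = '32MB' can approach 32GB):

```sql
-- Sketch of the session-level mitigations discussed in this thread;
-- table and column names are the ones from this report.

-- Option 1: stop planning one HashAggregate per partition.
SET enable_partitionwise_aggregate = off;

-- Option 2: shrink the per-node budget, since the worst case is roughly
-- n_partitions * work_mem (times hash_mem_multiplier for hash nodes).
SET work_mem = '4MB';

SELECT workitem_n, test_executable_n,
       bool_or(test_resulttype_n IN (2,3))
  FROM test_runs_raw
 GROUP BY workitem_n, test_executable_n
 LIMIT 10;
```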
Re: pg_stat_io clarifications: background worker, writes and reads
Hi Kashif, just to clarify my question: are you saying that the I/O from parallel workers is recorded in the "background worker" entry of pg_stat_io?

Thanks,
Dimitris

On Wed, 15 May 2024, Kashif Zeeshan wrote:
> Hi, parallel workers are used for the parallel execution of queries; you can find help at the link below.
> https://www.postgresql.org/docs/current/how-parallel-query-works.html
>
> It's controlled by the following parameters:
>
>   max_worker_processes = 6
>   max_parallel_workers_per_gather = 3
>   max_parallel_workers = 6
>
> The limit of concurrent parallel workers for the whole cluster is max_parallel_workers, which must be ≤ max_worker_processes. The limit of parallel workers per query is max_parallel_workers_per_gather.
>
> Thanks
> Kashif Zeeshan
> Bitnine Global
>
> On Wed, May 15, 2024 at 5:59 PM Dimitrios Apostolou wrote:
>> So what is this particular "background worker" I'm seeing, given that I have no replication or extensions? Searching the logs I found entries like the following:
>>
>> LOG:  background worker "parallel worker" (PID 93384) exited with exit code 1
>>
>> This got logged when I killed a simple SELECT query that took too long doing parallel seqscans. Could it be that the entry in pg_stat_io named "background worker" also includes the parallel workers from a SELECT query?
>>
>> Thanks,
>> Dimitris
Re: pg_stat_io clarifications: background worker, writes and reads
So what is this particular "background worker" I'm seeing, given that I have no replication or extensions? Searching the logs I found entries like the following:

LOG:  background worker "parallel worker" (PID 93384) exited with exit code 1

This got logged when I killed a simple SELECT query that took too long doing parallel seqscans. Could it be that the entry in pg_stat_io named "background worker" also includes the parallel workers from a SELECT query?

Thanks,
Dimitris

On Wed, 15 May 2024, Muhammad Imtiaz wrote:
> Hi,
>
> In PostgreSQL, the pg_stat_io view provides detailed statistics on I/O operations. Background processes perform maintenance tasks and other background operations essential to the functioning of the PostgreSQL database. They include processes such as:
>
> 1. Autovacuum Workers
> 2. WAL Writer
> 3. Background Writer
> 4. Logical Replication Workers
> 5. Custom Background Workers
>
> In the pg_stat_io view, statistics related to I/O operations performed by these background workers are recorded.
>
> Regards,
> Imtiaz

On Wed, 15 May 2024, 01:26 Dimitrios Apostolou wrote:
> Hello list, what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out, knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with a registered background worker. Additionally, how can it be evictions > writes? I would expect every eviction to cause a write. Finally, about "hits": I understand they are reads found in shared_buffers, so they never registered in the "reads" counter. So is "reads" in pg_stat_io equivalent to misses, i.e. the opposite of "hits": the read attempts not found in shared_buffers that needed to be fetched from disk (or the OS buffer cache)?
 backend_type      | object   | context | reads   | read_time   | writes | write_time | writebacks | writeback_time | extends | extend_time | op_bytes | hits      | evictions | reuses | fsyncs | fsync_time | stats_reset
-------------------+----------+---------+---------+-------------+--------+------------+------------+----------------+---------+-------------+----------+-----------+-----------+--------+--------+------------+-------------------------------
 background worker | relation | normal  | 5139575 | 2196288.011 | 63277  | 1766.94    | 0          | 0              | 0       | 0           | 8192     | 876913705 | 5139653   |        | 0      | 0          | 2024-04-08 08:50:02.971192+00

Thank you in advance,
Dimitris
pg_stat_io clarifications: background worker, writes and reads
Hello list,

what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out, knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with a registered background worker.

Additionally, how can it be evictions > writes? I would expect every eviction to cause a write.

Finally, about "hits": I understand they are reads found in shared_buffers, so they never registered in the "reads" counter. So is "reads" in pg_stat_io equivalent to misses, i.e. the opposite of "hits": the read attempts not found in shared_buffers that needed to be fetched from disk (or the OS buffer cache)?

 backend_type      | object   | context | reads   | read_time   | writes | write_time | writebacks | writeback_time | extends | extend_time | op_bytes | hits      | evictions | reuses | fsyncs | fsync_time | stats_reset
-------------------+----------+---------+---------+-------------+--------+------------+------------+----------------+---------+-------------+----------+-----------+-----------+--------+--------+------------+-------------------------------
 background worker | relation | normal  | 5139575 | 2196288.011 | 63277  | 1766.94    | 0          | 0              | 0       | 0           | 8192     | 876913705 | 5139653   |        | 0      | 0          | 2024-04-08 08:50:02.971192+00

Thank you in advance,
Dimitris
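The row above can be pulled out of the view directly; a minimal sketch using the standard pg_stat_io columns (the view exists since PostgreSQL 16):

```sql
-- Inspect only the "background worker" entry discussed above.
SELECT backend_type, object, context,
       reads, hits, evictions, reuses, writes
  FROM pg_stat_io
 WHERE backend_type = 'background worker'
   AND object = 'relation';
```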
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
I have forgotten to mention that I have enable_partitionwise_aggregate=on in the global settings since the beginning. According to the docs:

> Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows grouping or aggregation on partitioned tables to be performed separately for each partition.

Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every partition, and then it would be up to the independent plans to decide whether each partition follows a parallel or a serial plan. Not sure if this plan was checked but rejected because of cost.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote:
> If you were to put the n_distinct_inherited estimate back to 200 and disable sort, you should see the costs are higher for the index plan. If that's not the case then there might be a bug. It seems more likely that due to the n_distinct estimate being so low that the planner thought that a large enough fraction of the rows needed to be read and that made the non-index plan appear cheaper. I'd be interested in seeing what the costs are for the index plan. I think the following will give you that (untested):
>
> alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=200);
> analyze test_runs_raw;

I had to stop this step because it was taking too long going through all partitions again. But it seems it had the desired effect.

> set enable_sort=0;
> explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

It chooses the non-parallel index plan:

 Limit  (cost=365.17..1135517462.36 rows=10 width=4)
   ->  Unique  (cost=365.17..22710342308.83 rows=200 width=4)
         ->  Append  (cost=365.17..22546660655.46 rows=65472661350 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
 [... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, which even contains a sort! But it seems to me that the extra upper-level HashAggregate step raises the cost by an order of magnitude, from 800M to 10G, in comparison to doing (Unique->Sort) - see the plan in the next paragraph.
 Limit  (cost=10857220388.76..10857220389.01 rows=100 width=4)
   ->  Sort  (cost=10857220388.76..10857220389.26 rows=200 width=4)
         Sort Key: test_runs_raw.workitem_n DESC
         ->  HashAggregate  (cost=857220379.12..857220381.12 rows=200 width=4)
               Group Key: test_runs_raw.workitem_n
               ->  Gather  (cost=857220295.12..857220377.12 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=857219295.12..857219297.12 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..816295259.21 rows=16369614363 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
 [...
only seqscans follow]

If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 10:

SET SESSION enable_sort TO TRUE;
EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

 Limit  (cost=857166256.39..857166256.59 rows=10 width=4)
   ->  Unique  (cost=857166256.39..857166260.39 rows=200 width=4)
         ->  Sort  (cost=857166256.39..857166258.39 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=857166135.82..857166217.82 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=857165135.82..857165137.82 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..816243567.24 rows=16368627432 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
 [... only seqscans follow]

So in order of higher to lower cost, we have the following alternatives:

1. non-parallel index scan (800M)
2. parallel seqscan with sort (1.3G)
3. parallel seqscan
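For anyone repeating this experiment, the n_distinct_inherited override from David's suggestion can be removed again; a sketch using the standard ALTER TABLE ... RESET syntax:

```sql
-- Undo the experimental override so the planner goes back to the
-- ANALYZE-computed ndistinct estimate, then rebuild the statistics.
ALTER TABLE test_runs_raw ALTER COLUMN workitem_n RESET (n_distinct_inherited);
ANALYZE test_runs_raw;

-- And restore the sort toggle to its default.
SET enable_sort TO DEFAULT;
```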
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, Dimitrios Apostolou wrote:
> It took long but it finished:
>
> ANALYZE
> Time: 19177398.025 ms (05:19:37.398)

I see now that default_statistics_target is globally set to 500, so this is probably the reason it took so long. I guess with the default of 100 it would take approximately one hour. This is much better to have in a cron job. :-)

Dimitris
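If the global default_statistics_target of 500 is what makes ANALYZE so slow, one option (a sketch; the per-column override is standard PostgreSQL) is to lower the target only for the column that matters here, without touching the global setting:

```sql
-- Lower the statistics target for this one column, leaving the global
-- default_statistics_target = 500 untouched, then re-ANALYZE.
ALTER TABLE test_runs_raw ALTER COLUMN workitem_n SET STATISTICS 100;
ANALYZE test_runs_raw;
```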
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Tom Lane wrote:
> Dimitrios Apostolou writes:
>> Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a much higher cost:
>>
>>  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>>    ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>>          ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of just 200 distinct rows. That means the planner expects to have to read about 5% (10/200) of the tables to get the result, and that's making fast-start plans look bad. Possibly an explicit ANALYZE on the partitioned table would help.

It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)

And it made a difference indeed; the serial plan is chosen now:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

 Limit  (cost=364.29..1835512.29 rows=10 width=4)
   ->  Unique  (cost=364.29..22701882164.56 rows=123706 width=4)
         ->  Append  (cost=364.29..22538472401.60 rows=65363905182 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
 [...]
               ->  Index Only Scan Backward using test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k test_runs_raw_623  (cost=0.57..12329614.53 rows=121368496 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k test_runs_raw_622  (cost=0.57..5180832.16 rows=184927264 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k test_runs_raw_621  (cost=0.57..4544964.21 rows=82018824 width=4)
 [...]

Overall I think there are two issues that postgres could handle better here:

1. Avoid the need for a manual ANALYZE on the partitioned table.
2. Create a different parallel plan, one that can exit early, when the LIMIT is proportionally low. I feel the partitions could be parallel-scanned in order, so that the whole thing stops when one partition has been read.

Thank you!
Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote:
> That assumes the Append won't ever use > 1 worker per subnode, but that's not the case for your plan as the subnodes are "Parallel". That means all the workers could be working on the same subnode, which could result in one group being split between 2 or more workers.

Didn't think of that, makes sense!

> Parallel Append can also run in a way that the Append child nodes will only get 1 worker each.

How can I tell which case it is, from the EXPLAIN output (for example the output at [1])?

[1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote:
> On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou wrote:
>> On Sat, 11 May 2024, David Rowley wrote:
>>> It will. It's just that Sorting requires fetching everything from its subnode.
>>
>> Isn't it plain wrong to have a sort step in the plan then? The different partitions contain different value ranges with no overlap, and the last query I posted doesn't even contain an ORDER BY clause, just a DISTINCT clause on an indexed column.
>
> The query does contain an ORDER BY, so if the index is not chosen to provide pre-sorted input, then something has to put the results in the correct order before the LIMIT is applied.

The last query I tried was:

SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

See my message at [1].

[1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Will re-check things and report back later today with the further debugging info you asked for.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Mon, 13 May 2024, Dimitrios Apostolou wrote:
> On Sat, 11 May 2024, David Rowley wrote:
>> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote:
>>> Indeed that's an awful estimate; the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuumed or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs.
>>
>> It's not the partitions getting analyzed you need to worry about for an ndistinct estimate on the partitioned table. It's auto-analyze or ANALYZE on the partitioned table itself that you should care about. If you look at [1], it says "Tuples changed in partitions and inheritance children do not trigger analyze on the parent table."
>
> Thanks

Do I read that correctly, that I have to set up cron jobs to manually ANALYZE partitioned tables?

Dimitris
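Since autovacuum does not analyze the partitioned parent itself, a periodic job is the usual answer; the statement is standard, and the scheduling mechanism (cron, pg_cron, etc.) is up to the operator:

```sql
-- Run periodically: analyzing the partitioned parent computes the
-- inherited statistics (including ndistinct) that autovacuum does not
-- maintain for it. The leaf partitions are still auto-analyzed as usual.
ANALYZE test_runs_raw;
```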
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Sat, 11 May 2024, David Rowley wrote:
> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote:
>> Indeed that's an awful estimate; the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuumed or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs.
>
> It's not the partitions getting analyzed you need to worry about for an ndistinct estimate on the partitioned table. It's auto-analyze or ANALYZE on the partitioned table itself that you should care about. If you look at [1], it says "Tuples changed in partitions and inheritance children do not trigger analyze on the parent table."

Thanks

>> In any case, even after the planner decides to execute the terrible plan with the parallel seqscans, why doesn't it finish right when it finds 10 distinct values?
>
> It will. It's just that Sorting requires fetching everything from its subnode.

Isn't it plain wrong to have a sort step in the plan then? The different partitions contain different value ranges with no overlap, and the last query I posted doesn't even contain an ORDER BY clause, just a DISTINCT clause on an indexed column. Even with bad estimates, even with a seq scan instead of an index scan, the plan should be such that it concludes all parallel work as soon as it finds the 10 distinct values. And this is actually achieved if I disable parallel plans. Could it be a bug in the parallel plan generation?

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Sat, 11 May 2024, David Rowley wrote:
> On Sat, 11 May 2024 at 13:33, Tom Lane wrote:
>> I do kind of wonder why it's producing both a hashagg and a Unique step --- seems like it should do one or the other.
>
> It still needs to make the duplicate groups from parallel workers unique.

Range partitioning of the table guarantees that, since the ranges are not overlapping.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Tom Lane wrote:
> Dimitrios Apostolou writes:
>> Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a much higher cost:
>>
>>  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>>    ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>>          ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of just 200 distinct rows. That means the planner expects to have to read about 5% (10/200) of the tables to get the result, and that's making fast-start plans look bad.

Indeed that's an awful estimate; the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuumed or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs.

In any case, even after the planner decides to execute the terrible plan with the parallel seqscans, why doesn't it finish right when it finds 10 distinct values?

> Possibly an explicit ANALYZE on the partitioned table would help.

Thanks, I'll save the ANALYZE as the last step; I feel it's a good opportunity to figure out more details about how postgres works. Plus I expect ANALYZE to last a couple of days, so I should first find quiet time for that. :-)

Dimitris
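The "were the partitions ever (auto)vacuumed/analyzed" question above can be answered with a query like the following sketch, using the standard pg_stat_user_tables timestamp columns; note the partitioned parent has its own row there too:

```sql
-- Check when the parent and its partitions were last (auto)vacuumed
-- and (auto)analyzed; NULLs mean "never since stats reset".
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname LIKE 'test_runs_raw%'
 ORDER BY relname;
```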
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Dimitrios Apostolou wrote:
> On Fri, 10 May 2024, Dimitrios Apostolou wrote:
>> Update: even the simplest SELECT DISTINCT query shows similar behaviour:

Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a much higher cost:

 Limit  (cost=363.84..1134528847.47 rows=10 width=4)
   ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
         ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
               ->  Index Only Scan using test_runs_raw__part_max20k_pkey on test_runs_raw__part_max20k test_runs_raw_1  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan using test_runs_raw__part_max40k_pkey on test_runs_raw__part_max40k test_runs_raw_2  (cost=0.12..2.34 rows=1 width=4)
 [...]
               ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on test_runs_raw__part_max1780k test_runs_raw_89  (cost=0.57..53587294.65 rows=106088160 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on test_runs_raw__part_max1800k test_runs_raw_90  (cost=0.57..98943539.74 rows=96214080 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on test_runs_raw__part_max1820k test_runs_raw_91  (cost=0.57..97495653.34 rows=193248960 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on test_runs_raw__part_max1840k test_runs_raw_92  (cost=0.57..110205205.07 rows=218440928 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on test_runs_raw__part_max1860k test_runs_raw_93  (cost=0.57..50164056.28 rows=99431760 width=4)
 [...]

The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Dimitrios Apostolou wrote:
> I noticed that the following query is very very slow (too long to wait for it to finish):
>
> SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

 Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
   ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
         ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)
               Sort Key: test_runs_raw.workitem_n
               ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=724517732.37..724517734.37 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..704131546.90 rows=8154474186 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..2717185.06 rows=32060806 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2610814.95 rows=30806095 width=4)

It also takes ages to return, so I have to interrupt it. I believe it should exit early, as soon as it finds 10 distinct values (which should be rather easy even with parallel seqscans, given the pattern followed when inserting the data).

Thanks,
Dimitris
SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions. Around half of the partitions are full and the rest are empty, created in advance ready to receive future incoming data. Postgres is 16.2. Here are the relevant parts of the schema:

\d test_runs_raw
                        Partitioned table "public.test_runs_raw"
      Column       |   Type   | Collation | Nullable |             Default
-------------------+----------+-----------+----------+----------------------------------
 run_n             | bigint   |           | not null | generated by default as identity
 test_case_n       | smallint |           | not null |
 workitem_n        | integer  |           | not null |
 test_resulttype_n | smallint |           |          |
Partition key: RANGE (workitem_n)
Indexes:
    "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in the ranges (0,20k), (20k,40k) and so on (k = kilo), up to 20M.

PROBLEM

I noticed that the following query is very very slow (too long to wait for it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that in 998 out of the 1000 table scans it involves, the planner does not use the index. Instead it chooses a sequential scan.
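For readers who want to reproduce a similar layout, the partitioning scheme described above would have been created with DDL along these lines; this is a hypothetical reconstruction from the partition names appearing in the plans (test_runs_raw__part_maxNNNk, 20k-wide ranges), not the author's actual script:

```sql
-- Hypothetical reconstruction of the first two range partitions.
CREATE TABLE test_runs_raw__part_max20k PARTITION OF test_runs_raw
    FOR VALUES FROM (0) TO (20000);
CREATE TABLE test_runs_raw__part_max40k PARTITION OF test_runs_raw
    FOR VALUES FROM (20000) TO (40000);
-- ... and so on, one partition per 20k range of workitem_n.
```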
Here is the output from EXPLAIN:

 Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
   ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
         ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=853890488.22..853890490.22 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..813118117.30 rows=16308948365 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037793.12 rows=64121612 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918875.90 rows=61612190 width=4)
                                 [ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs instantaneously:

 Limit  (cost=363.84..367.30 rows=10 width=4)
   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k test_runs_raw_997  (cost=0.12..2.34 rows=1 width=4)
         [ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the partitions are scanned sequentially, *in proper order*, so that plan execution stops after reading the first 10 rows of the first non-empty partition. Why can't the same be done with DISTINCT?

Please note that the workitem_n values are well spread in the range (0,13M), and the table has been gradually filled within one year, so I'm assuming the vacuum worker has run long enough to build sane statistics (not sure how to verify that).

REMARKS

1. I tried reproducing the problem on an artificial table with few partitions and few values, but I couldn't. Both queries execute fast, and the planner always chooses a non-parallel index-only scan.

2. While testing changes to various settings, I noticed that setting max_parallel_workers_per_gather to 0 (from the original value of 4) fixes the issue! On the original huge table, disabling parallelism actually makes the query infinitely faster, and it returns within 1s!

Is this a bug in the planner?

Thank you,
Dimitris
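A common client-side workaround for slow SELECT DISTINCT on a btree-indexed column is a "loose index scan" (a.k.a. skip scan): fetch the next distinct value with an indexed min()/max() probe, which the planner turns into a cheap index lookup, and repeat. Below is a minimal sketch of the idea in Python over an in-memory SQLite table (table and column names mirror the thread; against PostgreSQL one would issue the same max() queries, or use a recursive CTE server-side):

```python
import sqlite3

# Hypothetical miniature stand-in for test_runs_raw: many rows,
# few distinct workitem_n values, with an index on workitem_n.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_runs_raw (workitem_n INTEGER NOT NULL)")
conn.execute("CREATE INDEX test_runs_raw_idx ON test_runs_raw (workitem_n)")
conn.executemany("INSERT INTO test_runs_raw VALUES (?)",
                 [(n,) for n in range(50) for _ in range(100)])

def distinct_desc(conn, limit):
    """Emulate a loose index scan: each iteration fetches the next
    distinct value with an indexed max() lookup instead of scanning
    and aggregating every row."""
    results = []
    bound = None
    while len(results) < limit:
        if bound is None:
            row = conn.execute(
                "SELECT max(workitem_n) FROM test_runs_raw").fetchone()
        else:
            row = conn.execute(
                "SELECT max(workitem_n) FROM test_runs_raw"
                " WHERE workitem_n < ?", (bound,)).fetchone()
        if row[0] is None:          # no more distinct values
            break
        results.append(row[0])
        bound = row[0]
    return results

print(distinct_desc(conn, 10))  # → [49, 48, 47, 46, 45, 44, 43, 42, 41, 40]
```

The loop issues `limit` index probes instead of one full aggregate over billions of rows, which is exactly the early exit the quoted plan fails to perform.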
Re: Orphan files filling root partition after crash
Thanks for the feedback Laurenz,

On Wed, 28 Feb 2024, Laurenz Albe wrote:

On Wed, 2024-02-28 at 12:16 +0100, Dimitrios Apostolou wrote:
I ended up doing some risky actions to remediate the problem: Find the filenames that have no identically named "oid" in pg_class, and delete (move to backup) the biggest ones while the database is stopped. Fortunately the database started up fine after that!

Lucky you. It should have been "relfilenode" rather than "oid", and some catalog tables don't have their files listed in the catalog, because they are needed *before* the database can access tables.

I actually double-checked that the filenames don't appear anywhere in SELECT * FROM pg_class, and that the files were multi-GB in size including all the 1GB pieces. But luck was definitely a part of the equation. I didn't know that the files might be accessed before tables (at db startup?), or that "relfilenode" would be more appropriate. Why is that, and where can I read more? I see that many (but not all) rows in pg_class have oid = relfilenode, but for many rows relfilenode = 0, which is meaningless as a filename.

So what is the moral of the story? How to guard against this?

Monitor disk usage ...

It happened *fast*. And it was quite a big surprise coming from "just" a disk-full situation. A couple of suggestions; wouldn't it make sense:

+ for the index to be written by default to the table's tablespace?
+ for postgres to refuse to write non-WAL files, if it's on the same device as the WAL and less than max_wal_size bytes are free?

The root of the problem is that you created the index in the default tablespace. You should have ALTER TABLE x ADD UNIQUE ... USING INDEX TABLESPACE bigtblspc;

Thank you, I was reading the docs but didn't realize this syntax is valid. I thought it was only for CREATE/ALTER INDEX.

Needless to say, I would have hoped the database cleaned up after itself even after an uncontrolled crash, or that it provided some tools for the job.
(I tried VACUUM FULL on the table, but the orphaned files did not go away). That is not so simple... Also, it would slow down crash recovery. But I agree that it would be nice to have a tool that reports or cleans up orphaned files. Yours, Laurenz Albe
Re: Orphan files filling root partition after crash
Hi Sergey, On Wed, 28 Feb 2024, Sergey Fukanchik wrote: Hi Dimitrios, Do you have wal archiving enabled? $PGDATA/pg_wal/ is a bit different from tablespaces (including "default" one). It stores transaction journal. I don't think I have WAL archiving enabled. Here are the relevant WAL options in my config: wal_compression = zstd max_wal_size = 8GB min_wal_size = 1GB track_wal_io_timing = on wal_writer_flush_after = 8MB The issue happens because the WAL directory is by default in the same filesystem with the default tablespace (root partition for Ubuntu). So when the default tablespace filled up because of my ALTER TABLE operation, there was no space for WAL either. Thanks, Dimitris
Orphan files filling root partition after crash
Hello list, yesterday I was doing: ALTER TABLE x ADD UNIQUE ... , ADD PRIMARY KEY ...; The table is almost a billion rows long but lies in its own TABLESPACE that has plenty of space. But apparently the ALTER TABLE command is writing a lot to the default tablespace (not the temp_tablespace, that is already moved to a different partition). That quickly filled up the 50GB free space in my root partition: 20:18:04.222 UTC [94144] PANIC: could not write to file "pg_wal/xlogtemp.94144": No space left on device [...] 20:19:11.578 UTC [94140] LOG: WAL writer process (PID 94144) was terminated by signal 6: Aborted 20:19:11.578 UTC [94140] LOG: terminating any other active server processes After postgresql crashed and restarted, the disk space in the root partition was still not freed! I believe this is because of "orphaned files" as discussed in mailing list thread [1]. [1] https://www.postgresql.org/message-id/CAN-RpxDBA7HbTsJPq4t4VznmRFJkssP2SNEMuG%3DoNJ%2B%3DsxLQew%40mail.gmail.com I ended up doing some risky actions to remediate the problem: Find the filenames that have no identically named "oid" in pg_class, and delete (move to backup) the biggest ones while the database is stopped. Fortunately the database started up fine after that! So what is the moral of the story? How to guard against this? Why did the database write so much to the default tablespace, even when both the table and the temp tablespace are elsewhere? Also should one always keep the default tablespace away from the wal partition? (I guess it would have helped to avoid the crash, but the ALTER TABLE command would have still run out of space, so I'm not sure if the orphan files would have been avoided)? Needless to say, I would have hoped the database cleaned-up after itself even after an uncontrolled crash, or that it provided some tools for the job. (I tried VACUUM FULL on the table, but the orphaned files did not go away). My postgresql version is 16.2 installed on Ubuntu. Thank you, Dimitris
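For the record, the comparison I did can be sketched as below. This is an illustration, not a safe tool: as pointed out later in the thread, the right column to compare against is pg_class.relfilenode (not oid), some catalog relations' files are not listed in the catalog at all, and relation files come with 1GB segment suffixes and extra forks. So treat any hits only as *candidates* for manual inspection, never for automatic deletion. The function name and directory layout are hypothetical:

```python
import os
import re
import tempfile

def orphan_candidates(datadir, known_relfilenodes):
    """Report base-directory files whose numeric prefix matches no known
    relfilenode. Recognizes 1GB segment suffixes (e.g. '16384.1') and
    fork suffixes ('_fsm', '_vm', '_init'). Candidates only -- catalog
    files may legitimately be absent from the known set."""
    pat = re.compile(r"^(\d+)(?:\.\d+|_fsm|_vm|_init)?$")
    orphans = []
    for name in os.listdir(datadir):
        m = pat.match(name)
        if m and int(m.group(1)) not in known_relfilenodes:
            orphans.append(name)
    return sorted(orphans)

# Demo on a throwaway directory with fake relation files.
with tempfile.TemporaryDirectory() as d:
    for f in ["16384", "16384.1", "16385_fsm", "99999", "99999.1"]:
        open(os.path.join(d, f), "w").close()
    print(orphan_candidates(d, {16384, 16385}))  # → ['99999', '99999.1']
```

The `known_relfilenodes` set would come from something like `SELECT relfilenode FROM pg_class WHERE relfilenode <> 0`, queried while the server is still up.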
Re: Is "DISTINCT" making a difference in: SELECT [DISTINCT] ... EXCEPT
On Wed, 15 Nov 2023, Erik Wienhold wrote: On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote: SELECT [DISTINCT] ... EXCEPT ... In this query I get the same results regardless of including DISTINCT or not. But I get different query plans, I get an extra HashAggregate node in the case of SELECT DISTINCT. Any idea why? As Tom Lane recently wrote[1] EXCEPT is not optimized and will operate on the subqueries which are planned independently. [1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us Heh, as he wrote to me even. :-) I just wanted to make sure that this is indeed a missing optimisation of the planner, and that the queries are effectively the same. Thank you for clarifying. As mentioned, the docs don't make it clear if the SELECT DISTINCT part is implied or not, only the EXCEPT DISTINCT part is clearly on by default. Dimitris
Is "DISTINCT" making a difference in: SELECT [DISTINCT] ... EXCEPT
Hello list,

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or not. But I get different query plans: there is an extra HashAggregate node in the case of SELECT DISTINCT. Any idea why? Reading the docs, I understand that postgres does EXCEPT DISTINCT by default, but I don't see the same stated for the 1st clause, SELECT [DISTINCT].

Steps to reproduce:

CREATE TABLE t1(i INTEGER NOT NULL);
CREATE TABLE t2(i INTEGER NOT NULL);
INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
INSERT INTO t2 VALUES (3);

SELECT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

EXPLAIN SELECT * FROM t1 EXCEPT SELECT * FROM t2;
                                QUERY PLAN
---------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..160.25 rows=200 width=8)
   ->  Append  (cost=0.00..147.50 rows=5100 width=8)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

EXPLAIN SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
                                QUERY PLAN
---------------------------------------------------------------------------
 HashSetOp Except  (cost=41.88..127.50 rows=200 width=8)
   ->  Append  (cost=41.88..120.62 rows=2750 width=8)
         ->  Subquery Scan on "*SELECT* 1"  (cost=41.88..45.88 rows=200 width=8)
               ->  HashAggregate  (cost=41.88..43.88 rows=200 width=4)
                     Group Key: t1.i
                     ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(8 rows)

Regards,
Dimitris
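The equal results are no accident: EXCEPT (without ALL) is defined by the SQL standard as a set operation returning distinct rows, so the inner DISTINCT can change the plan but never the result. A quick sanity check on the same data, run here through Python's built-in sqlite3 since the semantics are standard SQL rather than Postgres-specific:

```python
import sqlite3

# Same toy tables as in the reproduction steps above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(i INTEGER NOT NULL);
    CREATE TABLE t2(i INTEGER NOT NULL);
    INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
    INSERT INTO t2 VALUES (3);
""")

plain = conn.execute("SELECT * FROM t1 EXCEPT SELECT * FROM t2").fetchall()
dedup = conn.execute("SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2").fetchall()

# Both return each surviving value exactly once: EXCEPT deduplicates anyway.
print(sorted(plain), sorted(dedup))  # → [(1,), (2,)] [(1,), (2,)]
```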
Re: Inefficient query plan for SELECT ... EXCEPT ...
Thank you all for the answers, they covered me well.

Is this worth a bug report? I can file one if the issue is not known.

No. It's just a missing optimisation. We know about it.

It's good I shot an email first then. FWIW my usual way in other projects would be to check the bug tracker, and just "follow" the relevant issue if it's minor like a missing optimisation. I didn't find a way to search for "known issues" in the PostgreSQL project.

Dimitris
Inefficient query plan for SELECT ... EXCEPT ...
Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query, where the left side is a very short table (even zero-length sometimes, though rarely it can be as long as 200K rows), and the right side is a table with 10M UNIQUE NOT NULL rows:

\d test_datatags
                    Table "public.test_datatags"
     Column     |  Type   | Collation | Nullable |             Default
----------------+---------+-----------+----------+----------------------------------
 test_datatag_n | integer |           | not null | generated by default as identity
 test_datatag   | text    |           | not null |
Indexes:
    "test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
    "test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)

What follows is a simplified and pathological case, which takes 1.7s even though the left side is empty:

BEGIN;
CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
ANALYZE VERBOSE tmp_table;
INFO:  analyzing "pg_temp_9.tmp_table"
INFO:  "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows

EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)
    SELECT DISTINCT d1 FROM tmp_table WHERE d1 IS NOT NULL
    EXCEPT
    SELECT test_datatag FROM test_datatags;

 HashSetOp Except  (cost=0.00..299854.08 rows=1 width=36) (actual time=1726.470..1726.472 rows=0 loops=1)
   Output: "*SELECT* 1".d1, (0)
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Append  (cost=0.00..278054.53 rows=8719821 width=36) (actual time=3.754..1287.901 rows=8702840 loops=1)
         Buffers: shared hit=1259 read=58800
         I/O Timings: shared/local read=77.713
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=0 loops=1)
               Output: "*SELECT* 1".d1, 0
               ->  HashAggregate  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=1)
                     Output: tmp_table.d1
                     Group Key: tmp_table.d1
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on pg_temp.tmp_table  (cost=0.00..0.00 rows=1 width=32) (actual time=0.001..0.001 rows=0 loops=1)
                           Output: tmp_table.d1
                           Filter: (tmp_table.d1 IS NOT NULL)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..234455.40 rows=8719820 width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
               Output: "*SELECT* 2".test_datatag, 1
               Buffers: shared hit=1259 read=58800
               I/O Timings: shared/local read=77.713
               ->  Seq Scan on public.test_datatags  (cost=0.00..147257.20 rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
                     Output: test_datatags.test_datatag
                     Buffers: shared hit=1259 read=58800
                     I/O Timings: shared/local read=77.713
 Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1', maintenance_io_concurrency = '0', max_parallel_workers = '12', max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.055 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms, Emission 3.542 ms, Total 4.037 ms
 Execution Time: 1726.835 ms
(33 rows)

I'm wondering why the planner doesn't see that the left table is very small and follow a different path. From an abstract computer science POV, I would:

1. Sort the left table (the right one is already indexed).
2. "Merge" the two tables, by walking them in order in parallel and excluding the matches.
3. Stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known. Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN, which performs much better and I believe is equivalent. I find it less readable than the query in question though. Plus, I have a bunch of SELECT-EXCEPT queries (with smaller right-side tables) in my application, and I would hate to change them all to the uglier equivalent. Under what conditions would the above query plan perform well?

Thanks in advance,
Dimitris
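For reference, the LEFT OUTER JOIN rewrite mentioned above can be sketched as follows. It is equivalent to EXCEPT here because the right-hand column is UNIQUE NOT NULL; with nullable or duplicated right-hand values, NOT EXISTS is the safer anti-join form. A toy version of the two tables, run through Python's sqlite3 (the exact production query isn't shown in the thread, so this is an illustration):

```python
import sqlite3

# Miniature stand-ins for the two tables from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_datatags(test_datatag TEXT NOT NULL UNIQUE);
    CREATE TABLE tmp_table(d1 TEXT);
    INSERT INTO test_datatags VALUES ('a'),('b'),('c');
    INSERT INTO tmp_table VALUES ('b'),('x'),(NULL),('b');
""")

except_q = """
    SELECT DISTINCT d1 FROM tmp_table WHERE d1 IS NOT NULL
    EXCEPT
    SELECT test_datatag FROM test_datatags
"""
# Anti-join rewrite: keep left rows that found no partner on the right.
join_q = """
    SELECT DISTINCT t.d1
    FROM tmp_table t
    LEFT OUTER JOIN test_datatags r ON r.test_datatag = t.d1
    WHERE t.d1 IS NOT NULL AND r.test_datatag IS NULL
"""

print(conn.execute(except_q).fetchall())  # → [('x',)]
print(conn.execute(join_q).fetchall())    # → [('x',)]
```

The anti-join form lets the planner probe the unique index on the right side once per left row, which is why it stays fast when the left side is tiny.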
Re: BRIN index maintenance on table without primary key
On Fri, 27 Oct 2023, Dimitrios Apostolou wrote: So the question is: how to maintain the physical order of the tuples? Answering to myself, there doesn't seem to be any way to run pg_repack on a table without a UNIQUE key. To run CLUSTER, the only way I see is to: 1. Create a btree index on the same column that has the BRIN index 2. CLUSTER 3. Drop the index This should take very long on my huge table, and keeps the table exclusively locked. The disk space needed would also be huge, if the table isn't partitioned. I wonder why CLUSTER can't run based on a BRIN-indexed column. Is it theoretically impossible, or is it just not implemented yet? My understanding so far is that CLUSTER only rewrites the table *according to the index order* and does not touch the index itself. For a BRIN index though it would need to rewrite the table *ignoring the index* and then rewrite the index too, in order to keep the ranges fully optimized. So the logic is very different, and maybe closer to what VACUUM FULL does. Thanks, Dimitris
BRIN index maintenance on table without primary key
Hello list. Key characteristics of my case: + HUGE table with 20G rows, ca 2TB + May be partitioned (have both versions on two test clusters ATM) + Plenty of inserts, no updates + No primary key - we had one IDENTITY bigint column until recently, but it proved useless and inefficient (index too big for mem) so we deleted it. + All queries are filtering on a not-unique not-null integer column. + On this column we have a BRIN INDEX since insert order is /mostly/ incremental. So the question is: how to maintain the physical order of the tuples? Even though the insertions populate the index column /mostly/ incrementally, there are outliers and it's certain that over time the BRIN index will deteriorate. I'm monitoring the "correlation" and I want to run a command to "defrag" it when it drops below 0.9. + Can't run CLUSTER: ERROR: cannot cluster on index "tst_brin_idx" because access method does not support clustering + Can't run pg_repack, from [1]: > Target table must have a PRIMARY KEY, or at least a UNIQUE total index on a NOT NULL column [1] https://reorg.github.io/pg_repack/ Any ideas? Thank you in advance, Dimitris
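The "correlation" being monitored is presumably pg_stats.correlation for the indexed column (refreshed by ANALYZE). As a self-contained illustration of what that number measures, here is the plain Pearson correlation between each value and its physical row position, in Python: 1.0 means perfectly ordered rows (tight BRIN ranges), values near 0 mean the ranges are badly interleaved. The function is an assumption about the statistic's definition, not Postgres code:

```python
from math import sqrt

def correlation(values):
    """Pearson correlation between a column's values and their physical
    row order -- the quantity pg_stats.correlation estimates."""
    n = len(values)
    mx = (n - 1) / 2                 # mean of positions 0..n-1
    my = sum(values) / n
    cov = sum((x - mx) * (y - my) for x, y in enumerate(values))
    vx = sum((x - mx) ** 2 for x in range(n))
    vy = sum((y - my) ** 2 for y in values)
    return cov / sqrt(vx * vy)

print(correlation(list(range(1000))))         # perfectly ordered: 1.0
print(correlation([5, 1, 9, 2, 8, 3, 7, 4]))  # interleaved: near 0
```

A monitoring job could poll `SELECT correlation FROM pg_stats WHERE tablename = '...' AND attname = '...'` and alert when it drops below the 0.9 threshold mentioned above.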
Re: Partitionwise JOIN scanning all partitions, even unneeded ones
Thank you for the clear explanation, and I hope the missing optimisation gets implemented sooner rather than later. Maybe the query planner should consider the missing optimisation and ban *hash* partitionwise joins. Indeed I verified that disabling hash join fixed the situation, with both queries taking almost the same time.

Great! In the meantime, I'm considering disabling hash joins globally, as I've had issues with them before (excessive I/O causing much slower execution than a merge join; see [1] for your answer to my question then :). Do you think that would save me from other problems I'm not aware of, given the huge size of the table I'm querying?

[1] https://www.postgresql.org/message-id/caaphdvppvydonkeqlybsbjwq8kq8m7ywdka44rtea2mnao3...@mail.gmail.com

Regards,
Dimitris
Re: Moving data from huge table slow, min() query on indexed column taking 38s
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + Then I broke the command above in many small chunks WITH rows AS ( DELETE FROM tbl_legacy AS t WHERE (partition_key_column < $1) RETURNING t.* ) INSERT INTO tbl SELECT * FROM rows; COMMIT; I increase the parameter $1 and keep going in a loop. At first this goes OK, after one day though I notice that it has slowed down significantly. My investigation shows that the simple query SELECT min(partition_key_column) from tbl_legacy; takes 38s, despite having an index on the column! A VACUUM fixes that, so I guess the index has a ton of dead tuples. I guess autovacuum does not do its job because the table is constantly busy. Unfortunately VACUUM takes long (several hours) on this huge table, so I can't add in the loop after the DELETE command. Is there a better way to avoid the bloat in the loop above? Why can't the DELETE command update the index by pointing the minimum element beyond the dead tuples? Any comments on this one? It annoys me that a simple loop deteriorated so much and kept filling the table with bloat. What is that VACUUM does that DELETE can't do, to keep the index fresh?
Re: Moving data from huge table slow, min() query on indexed column taking 38s
Thanks! I have completed the transfer by taking down the database and working exclusively on it, but I still wonder how one could do it in production without exclusive locks. The loop with small DELETE...INSERT transactions I posted on the parent post bloats the table fast. The link you posted contains very useful info, I was not expecting that the constraints could blow up the memory like that. Comments from me: Disabling and then re-enabling the foreign key constraints is easily done with ALTER TABLE. Unfortunately it doesn't seem to be the same case for indices. One has to create the table without indices, and then create the indices separately. With such a process there is a risk of ending up with non-identical table schemas. By the way, with COPY one must use an intermediate file, right? There is no way to COPY from table to table directly? Thanks, Dimitris On Thu, 6 Jul 2023, Lorusso Domenico wrote: Hello, this is a standard problem during bulk copy. here some suggestions; for example disable indexes. The main issue is related to index, lock escalation and log writing. In other dbms you should set log off on the table, but postgresql does not seem to have this feature. Anyway, using an explicit lock table exclusive should prevent lock escalation. So: disable indexes in target table lock exclusive both table insert data truncate old table If this doesn't work you can consider using the copy command. Il giorno gio 6 lug 2023 alle ore 18:12 Dimitrios Apostolou ha scritto: On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: > + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like WITH rows AS ( DELETE FROM tbl_legacy RETURNING * ) INSERT INTO tbl SELECT * FROM rows; > I noticed that the postgres process was growing without bounds, up to > the point that it consumed almost all of swap space and I had to kill > the command. Any idea why that happens? 
Also note that my settings for work_mem, temp_buffers, shared_buffers etc are all well below the RAM size and postgres has never shown unbound growth again. Postgres version is 15.2. Dimitris -- Domenico L. per stupire mezz'ora basta un libro di storia, io cercai di imparare la Treccani a memoria... [F.d.A.]
Re: Moving data from huge table slow, min() query on indexed column taking 38s
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like WITH rows AS ( DELETE FROM tbl_legacy RETURNING * ) INSERT INTO tbl SELECT * FROM rows; I noticed that the postgres process was growing without bounds, up to the point that it consumed almost all of swap space and I had to kill the command. Any idea why that happens? Also note that my settings for work_mem, temp_buffers, shared_buffers etc are all well below the RAM size and postgres has never shown unbound growth again. Postgres version is 15.2. Dimitris
Moving data from huge table slow, min() query on indexed column taking 38s
I have a huge (10G rows) table, test_runs (referred to as "tbl_legacy" below), that I want to PARTITION BY RANGE on an indexed integer column. I have created the new partitioned table "tbl" and 1000 partitions in it, so that it can take my existing data and have the partitions ready for the next year (so most of the new partitions will be empty for now).

On the topic of bulk-transferring the data from tbl_legacy to tbl:

+ First I chose the method to DELETE...INSERT everything. I noticed that the postgres process was growing without bounds, up to the point that it consumed almost all of swap space and I had to kill the command. Any idea why that happens?

+ Then I broke the command above into many small chunks:

  WITH rows AS (
      DELETE FROM tbl_legacy AS t
          WHERE (partition_key_column < $1)
          RETURNING t.*
  )  INSERT INTO tbl SELECT * FROM rows;
  COMMIT;

  I increase the parameter $1 and keep going in a loop. At first this goes OK, but after one day I notice that it has slowed down significantly. My investigation shows that the simple query

  SELECT min(partition_key_column) FROM tbl_legacy;

  takes 38s, despite there being an index on the column! A VACUUM fixes that, so I guess the index has a ton of dead tuples. I guess autovacuum does not do its job because the table is constantly busy. Unfortunately VACUUM takes long (several hours) on this huge table, so I can't add it to the loop after the DELETE command. Is there a better way to avoid the bloat in the loop above? Why can't the DELETE command update the index by pointing the minimum element beyond the dead tuples?

+ Finally I resorted to just copying the data:

  INSERT INTO tbl SELECT * FROM tbl_legacy;

  This took half a day or so but worked without blowing up the memory usage like the 1st query. Why is that?

By the way, I don't consider the last solution ideal, since the data is duplicated between the two tables and it's harder to implement workarounds to do it without taking downtime. But it worked for me this time.
Any other ideas for partitioning an existing huge table? Thanks in advance, Dimitris
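The chunked DELETE...INSERT loop from the second bullet, sketched as a client-side program. Python over SQLite serves as a stand-in here (SQLite lacks the `WITH ... DELETE ... RETURNING` CTE, so the sketch copies then deletes the same key range inside one transaction per chunk); table names and the chunk size are illustrative. The thread's bloat problem is exactly what this loop leaves behind: each deleted chunk becomes dead tuples that only vacuum reclaims:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_legacy(partition_key_column INTEGER NOT NULL, payload TEXT);
    CREATE INDEX tbl_legacy_key ON tbl_legacy(partition_key_column);
    CREATE TABLE tbl(partition_key_column INTEGER NOT NULL, payload TEXT);
""")
conn.executemany("INSERT INTO tbl_legacy VALUES (?, ?)",
                 [(k, "row%d" % k) for k in range(1000)])
conn.commit()

chunk, upper = 100, 0
while conn.execute("SELECT count(*) FROM tbl_legacy").fetchone()[0]:
    upper += chunk
    # One transaction per chunk: copy the key range, then delete it.
    # (In Postgres the DELETE...RETURNING CTE does both steps atomically.)
    conn.execute("INSERT INTO tbl SELECT * FROM tbl_legacy"
                 " WHERE partition_key_column < ?", (upper,))
    conn.execute("DELETE FROM tbl_legacy WHERE partition_key_column < ?",
                 (upper,))
    conn.commit()

print(conn.execute("SELECT count(*) FROM tbl").fetchone()[0])  # → 1000
```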
VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)
Thanks everyone for the help. I have now run simple VACUUM but it didn't change anything, the simple SELECT is still slow. My understanding by reading the docs is that it should reclaim all unused space, just not return it to the OS. Which is fine by me. Any idea why it failed to reclaim the space in my case? I'm now running VACUUM FULL as everyone suggested. I just tried plain VACUUM as I was curious if it would work and because it doesn't lock the table with an operation that takes hours for my setup. Thanks, Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
On Tue, 31 Jan 2023, David G. Johnston wrote:

It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead.

Yes, I too wish postgres was more clever for a simple unordered query. I should also mention that this table has two btree indices, one primary-key index and one on another column. Would it be feasible to "descend" an index in order to immediately find some live data?

A limit without an order is not a query to optimize for, of course... A good reason is that the web interface I'm using to browse my tables (and I guess many others too) does a simple limited+unordered SELECT when you click on a table. Timing out was what made me look deeper. I couldn't accept that, no matter how slow my system is, it has trouble returning a few rows. It is also the first query somebody will type in psql to interactively get a clue about the contents of a table.

Regards,
Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
Thanks for the insight on the internals. Regarding your questions: On Tue, 31 Jan 2023, Tom Lane wrote: Do you get the same 10 rows when you repeat the command? Yes. Just tested with both cold and hot caches. The first 10 rows are exactly the same, either they return slowly or immediately. If turning synchronize_seqscans off changes the behavior, that'd be a good clue that this is the right theory. Turning it off makes the query slow no matter how many times I re-run it. The system is doing lots of read I/O with both hot and cold caches. Here is the EXPLAIN output from the hot cache run (that previously had only 14 hits and no reads): Limit (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared hit=64 read=2334462 I/O Timings: shared/local read=567846.559 -> Seq Scan on public.test_runs_raw (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=620510.800..620510.804 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared hit=64 read=2334462 I/O Timings: shared/local read=567846.559 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB' Planning Time: 0.099 ms Execution Time: 620510.855 ms After reading the docs, I'm surprised this setting affects my case given that I have no parallelism in my setup. As for a real fix, it might be time for a VACUUM FULL or CLUSTER on that table. Regarding CLUSTER, would it help with a seqscan on a bloated table? Furthermore, given that the table is expected to grow every day by a few million rows, do you suggest running CLUSTER every night? Will postgres remember that the rows up to N are clustered, even after appending more rows? Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
On Tue, 31 Jan 2023, David G. Johnston wrote:

On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou wrote:

   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared read=2334526
         I/O Timings: shared/local read=691137.029

The system has to return 10 live rows to you. If it needs to search through that many buffers to find 10 live rows you most likely have a large bloating problem going on. Seems like it is time to vacuum full.

I looked up how to measure bloat, so I ran the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;
 current_database | schemaname |   tablename   | tbloat | wastedbytes |            iname             | ibloat | wastedibytes
------------------+------------+---------------+--------+-------------+------------------------------+--------+--------------
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey           |    1.0 |            0
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n |    0.3 |            0
(2 rows)

Is this bloat even affecting queries that do not use the index? It seems I have to add VACUUM FULL to nightly maintenance. I had run some schema restructuring (several ADD COLUMN followed by UPDATE ... SET on all rows) some days ago, and I was not aware this degraded the table.

Thanks for the useful info!
Dimitris
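To see why bloat hurts even a query that never touches an index: a sequential scan must keep reading pages until it has accumulated enough live tuples to satisfy the LIMIT, so a long all-dead prefix (such as one left behind by the mass UPDATEs mentioned above) translates directly into extra I/O. A toy model of that mechanism (page layout and counts are made up, not measured from the thread's table):

```python
def pages_read_for_limit(live_per_page, rows_needed):
    """live_per_page: live-tuple count of each page, in physical order.
    Return how many pages a sequential scan reads before it has seen
    rows_needed live tuples (i.e. before LIMIT is satisfied)."""
    seen = 0
    for pages_read, live in enumerate(live_per_page, start=1):
        seen += live
        if seen >= rows_needed:
            return pages_read
    return len(live_per_page)

# Healthy table: the very first page already holds 10 live rows.
print(pages_read_for_limit([100] * 1000, 10))             # → 1
# Bloated table: 900 leading pages contain only dead tuples.
print(pages_read_for_limit([0] * 900 + [100] * 100, 10))  # → 901
```

Scaled up to the 2.3M pages read in the quoted plan, this is the same effect; VACUUM (FULL) shrinks the dead prefix and restores the one-page case.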
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
I managed to double-paste different subset of my settings. FWIW all of them are true for all the experiments. Sorry for the confusion. On Tue, 31 Jan 2023, Dimitrios Apostolou wrote: Other postgres settings: shared_buffers = 1024MB effective_io_concurrency = 0 wal_compression = zstd wal_recycle = on min_wal_size = 1GB max_wal_size = 10GB checkpoint_timeout = 15min [...] Relevant settings: synchronous_commit = off wal_recycle = on min_wal_size = 1GB max_wal_size = 10GB track_io_timing = on track_wal_io_timing = on wal_compression = zstd
SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
Hello list, I run a very simple SELECT on a huge table (350M rows, 10 columns of type integer or timestamp). The server is an old Mac Mini with 4GB RAM and a slow HDD. All caches are cold either via reboot, or by restarting postgres and using the `purge` command to flush the OS cache. The question is why this simple query is taking so long to complete. Most of the time I see constant 30MB/s read I/O. The memory usage is low and the system is not swapping or in other ways unhealthy. Postgres is not serving any other requests. Is postgres reading all the table sequentially, just to return 10 rows in no particular order? I'm trying to understand what PostgreSQL is doing and the data access patterns involved. All feedback is appreciated. # SELECT version(); PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit # EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS) SELECT * FROM test_runs_raw LIMIT 10; Limit (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared read=2334526 I/O Timings: shared/local read=691137.029 -> Seq Scan on public.test_runs_raw (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared read=2334526 I/O Timings: shared/local read=691137.029 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB' Planning: Buffers: shared hit=69 read=18 I/O Timings: shared/local read=265.550 Planning Time: 271.719 ms Execution Time: 746057.424 ms (14 rows) Repeating the query returns instantaneously, and EXPLAIN shows shared hit=14 and no reads. 
Changing the LIMIT to 100 takes another 3min to respond, with constant read I/O while it runs:

# EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) SELECT * FROM test_runs_raw LIMIT 100;
 Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 rows=100 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=14 read=772427
   I/O Timings: shared/local read=171484.072
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=0.029..184692.664 rows=100 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=14 read=772427
         I/O Timings: shared/local read=171484.072
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.083 ms
 Execution Time: 184692.748 ms
(11 rows)

Rerunning this is instantaneous again.

Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min

Bonus question: some time ago the server crashed and rebooted, and after a long WAL recovery and some hours of idle time, I ran the same query. It took even longer to respond, and besides reading there was also a lot of writing involved. What kind of writes was postgres doing?
# EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) SELECT * FROM test_runs_raw LIMIT 10;
 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=1284114.810..1284114.819 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
   I/O Timings: shared/local read=104.264 write=107258.150
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
         I/O Timings: shared/local read=104.264 write=107258.150
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.545 ms
 Execution Time: 1284114.972 ms
(11 rows)

Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd

Thanks in advance,
Dimitris
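One plausible explanation for the writes (an educated guess, not confirmed against this server): the first scan after crash recovery sets tuple hint bits, which dirties nearly every page it touches, and those dirty pages then get written back. The plan's own counters show the scale a "read-only" query can write at:

```python
# Scale of the writes reported by the post-crash plan above:
# "Buffers: ... dirtied=1809974 written=1762484", at 8 kB per page.
block_size = 8192            # default PostgreSQL page size (an assumption)
dirtied = 1_809_974          # pages dirtied, from the plan
written = 1_762_484          # pages written back, from the plan

dirtied_gib = dirtied * block_size / 1024**3
written_gib = written * block_size / 1024**3
print(f"dirtied: {dirtied_gib:.1f} GiB")   # ~13.8 GiB
print(f"written: {written_gib:.1f} GiB")   # ~13.4 GiB
```

That is almost every page of the scan being rewritten once, which matches the hint-bit hypothesis better than, say, checkpoint activity.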
Re: Why is a hash join preferred when it does not fit in work_mem
Hello again, I am back with new experiments.

First of all, I have a concrete set of steps that replicates the slowness of the hash join I described. If you have a system with spinning disks lying around, I would appreciate it if you could verify the scenario. Can you also replicate it on different kinds of systems?

CREATE TABLE descriptions (
    description_id serial PRIMARY KEY,
    description    text);
INSERT INTO descriptions (description_id, description)
    SELECT s, repeat(encode(sha512(s::text::bytea), 'hex'), 4)
    FROM generate_series(0,1200300) AS s;
CREATE TABLE descriptions_in_books (
    description_id integer REFERENCES descriptions(description_id),
    book_id        integer);
INSERT INTO descriptions_in_books (description_id, book_id)
    SELECT s % 1200300, s
    FROM generate_series(0,5200300) AS s;

SET work_mem TO '1MB';
SET hash_mem_multiplier = 1.0;
SET track_io_timing TO on;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;
SET enable_hashjoin TO off;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;

The first JOIN query uses a hash join with Batches: 1024 and takes 622s! For the longest part, I can see the disk writing most of the time at around 1-2 MB/s, so I have to assume it's not writing sequentially. The second, identical JOIN uses a merge join that completes in 14s. The I/O happens at a much higher rate (maybe 10x), so I'm assuming it's mostly sequential.

Another observation is that the hash join deteriorates as the length of the TEXT column grows. In fact, if I fill it with only 32-char-long strings, the hash join is split into only 128 batches, and it completes almost as fast as the merge join. Could it be that the cost estimation underestimates the I/O pattern related to splitting into batches?
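The batch count tracking the TEXT length is consistent with how batching is sized: the inner side must be split until each batch's share fits in work_mem, and the batch count is rounded up to a power of two. A rough sketch with the numbers from this reproduction (row widths taken from the EXPLAIN output; the 24-byte per-tuple overhead figure is an assumption, and this is an approximation rather than the planner's exact formula):

```python
import math

def min_batches(rows, row_width, work_mem_bytes):
    """Smallest power-of-two batch count such that one batch of the
    inner side fits in work_mem (approximation, not PostgreSQL's code)."""
    needed = max(1, math.ceil(rows * row_width / work_mem_bytes))
    return 2 ** math.ceil(math.log2(needed))

work_mem = 1 * 1024 * 1024   # 1MB, as in the reproduction

# width=520 is what EXPLAIN reports for the wide sha512-based column:
print(min_batches(1_200_301, 520, work_mem))      # 1024, matching "Batches: 1024"

# with 32-char strings the row is ~40 bytes of data; adding an assumed
# ~24 bytes of per-tuple overhead lands on the observed batch count:
print(min_batches(1_200_301, 40 + 24, work_mem))  # 128, matching the observation
```

So the wide rows force 8x as many batches, and with them 8x as many partition files being appended to in round-robin fashion, which fits the non-sequential 1-2 MB/s write pattern seen on the disk.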
Here are the measurements:

 Hash Join  (cost=192450.84..401456.02 rows=5200486 width=524) (actual time=344516.004..621725.562 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Inner Unique: true
   Hash Cond: (descriptions_in_books.description_id = descriptions.description_id)
   Buffers: shared hit=15586 read=93161, temp read=97829 written=97829
   I/O Timings: shared/local read=1402.597, temp read=229252.170 write=371508.313
   ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75015.86 rows=5200486 width=8) (actual time=0.068..1819.629 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=32 read=22979
         I/O Timings: shared/local read=249.910
   ->  Hash  (cost=97739.04..97739.04 rows=1200304 width=520) (actual time=343268.470..343268.471 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buckets: 2048  Batches: 1024  Memory Usage: 686kB
         Buffers: shared hit=15554 read=70182, temp written=78538
         I/O Timings: shared/local read=1152.687, temp write=338883.205
         ->  Seq Scan on public.descriptions  (cost=0.00..97739.04 rows=1200304 width=520) (actual time=0.028..2278.791 rows=1200301 loops=1)
               Output: descriptions.description_id, descriptions.description
               Buffers: shared hit=15554 read=70182
               I/O Timings: shared/local read=1152.687
 Settings: hash_mem_multiplier = '1', work_mem = '1MB'
 Planning Time: 0.303 ms
 Execution Time: 622495.279 ms
(22 rows)

SET enable_hashjoin TO off;

 Merge Join  (cost=868411.87..1079330.96 rows=5200301 width=524) (actual time=6091.932..13304.924 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Merge Cond: (descriptions.description_id = descriptions_in_books.description_id)
   Buffers: shared hit=67 read=111962 written=1, temp read=45806 written=46189
   I/O Timings: shared/local read=1007.043 write=28.575, temp read=344.937 write=794.483
   ->  Index Scan using descriptions_pkey on public.descriptions  (cost=0.43..116919.99 rows=1200304 width=520) (actual time=0.028..1596.387 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buffers: shared hit=3 read=89015 written=1
         I/O Timings: shared/local read=834.732 write=28.575
   ->  Materialize  (cost=868408.84..894410.35 rows=5200301 width=8) (actual time=6091.892..9171.796 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=64 read=22947, temp read=45806 written=46189
         I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
         ->  Sort  (cost=868408.84..881409.60 rows=5200301 width=8) (actual time=6091.885..7392.828 rows=5200301 loops=1)
               Output: descriptions_in_books.book_id, descriptions_in_books.description_id
               Sort Key:
Re: Why is a hash join preferred when it does not fit in work_mem
On Sat, 14 Jan 2023, Tom Lane wrote:

Dimitrios Apostolou writes: Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here is how I understand things according to posts I've read, and classical algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one hand you have a hash table lookup (amortized O(1)) and on the other hand, if the table has M rows that do not fit in memory, you have sequential reads from the disk (given low fragmentation of the table or index files): for every line you read from the disk, you look up the key in the hash table. If the hash table does not fit in RAM then the cost becomes prohibitive. Every lookup is a random access possibly hitting the disk. The total cost should be random_page_cost * M.

That would be true of a simple hash join, but Postgres uses batched hash joins: we split up the hash key space into subsets, where hopefully each subset includes few enough inner-side rows to fit into work_mem. While this can go wrong given pathological distribution of the inner-side keys, it does mean that the join can perform well even when the inner side is much larger than work_mem. So it's not the case that the planner will simply disregard hash joins beyond work_mem. It will apply a cost penalty for the predicted batching overhead;

Thanks for this, I found a page [1] that describes the hash join and now I understand a bit more.

[1] https://www.interdb.jp/pg/pgsql03.html

I'm not sure whether the key distribution is pathological in my case. The join condition is:

Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)

and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and PRIMARY KEY. The TABLE workitem_ids has 1.7M rows, and the other table has 3.7M rows. Neither of them fits in work_mem. In my (simplified and pathological) case of work_mem == 1MB, the hash join does 512 batches (Buckets: 4,096 Batches: 512 Memory Usage: 759kB).
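Tom's description of batched hash joins can be sketched as follows (a toy in-memory illustration of the Grace-hash-join idea he describes, not PostgreSQL's implementation): both inputs are partitioned by a hash of the join key, and each inner partition, now small enough to fit in memory, is joined against the matching outer partition.

```python
def batched_hash_join(inner, outer, key, n_batches):
    """Toy Grace hash join: partition both sides by hash(key) % n_batches,
    then hash-join matching partitions one batch at a time. PostgreSQL
    spills the partitions to temp files on disk; here they are just lists."""
    inner_parts = [[] for _ in range(n_batches)]
    outer_parts = [[] for _ in range(n_batches)]
    for row in inner:
        inner_parts[hash(row[key]) % n_batches].append(row)
    for row in outer:
        outer_parts[hash(row[key]) % n_batches].append(row)

    results = []
    for ip, op in zip(inner_parts, outer_parts):
        # build a hash table on one (hopefully work_mem-sized) inner batch
        table = {}
        for row in ip:
            table.setdefault(row[key], []).append(row)
        # probe with the matching outer batch; rows in other partitions
        # cannot match, which is what keeps each pass small
        for row in op:
            for match in table.get(row[key], []):
                results.append({**match, **row})
    return results

# hypothetical sample data, loosely shaped like the tables in this thread
inner = [{"id": i, "name": f"item{i}"} for i in range(5)]
outer = [{"id": i % 5, "val": i} for i in range(10)]
print(len(batched_hash_join(inner, outer, "id", 4)))  # 10 joined rows
```

Note that each outer row is only ever compared within its own partition, so the join stays correct no matter how the batches are split, as long as both sides use the same hash function.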
I'm not sure which hash join strategy is followed, but based on that document it should be the "hybrid hash join with skew". I don't quite follow the I/O requirements of this algorithm yet. :-)

but that can still come out cheaper than merge join, because the sorting needed for merge is generally also far from cheap.

I was under the impression that an on-disk merge sort is a relatively cheap (logN) operation with regard to random I/O. So I would expect an increased random_page_cost to benefit the Merge Join algorithm. And since my setup involves spinning disks, it does make sense to increase it.

What is probably really happening is that random_page_cost affects the estimated cost of performing the sort using an index scan instead of a bespoke sort step. AFAIR, cost_sort doesn't consider random_page_cost at all, and neither does cost_hashjoin.

On the last EXPLAIN I posted for the forced merge join, I see that it uses an index scan on the "small" table. That makes sense, since the join happens on the primary key of the table. On the large table it does not use an index scan, because an index doesn't exist for that column. It sorts the 3.7M rows of the table (and FWIW that table only has two integer columns), if I understood correctly what you meant by "performing the sort using an index scan".

The problem I see is that the estimated cost of the sort operation is 609,372.91..618,630.40, already way above the cost of the whole hash join (121,222.68..257,633.01). The actual timings, however, are very different: the Sort takes 4,602.569..5,414.072 ms, while the whole hash join takes 145,641.295..349,682.387 ms. Am I missing some configuration knobs that would give the planner a better picture?

Thanks, Dimitris
Re: Why is a hash join preferred when it does not fit in work_mem
On Fri, 13 Jan 2023, David Rowley wrote:

I'd expect reducing random_page_cost to make the Merge Join cheaper, as that's where the Index Scan is. I'm not quite sure where you think the random I/O is coming from in a batched hash join.

Thanks for the feedback, indeed you are right! Decreasing random_page_cost to values way below the default makes the planner prefer the merge join!

This seems strange to me. Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here is how I understand things according to posts I've read, and classical algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one hand you have a hash table lookup (amortized O(1)) and on the other hand, if the table has M rows that do not fit in memory, you have sequential reads from the disk (given low fragmentation of the table or index files): for every line you read from the disk, you look up the key in the hash table. If the hash table does not fit in RAM then the cost becomes prohibitive. Every lookup is a random access possibly hitting the disk. The total cost should be random_page_cost * M.

+ The Merge Join involves mostly sequential accesses if the disk files are not fragmented. It reads sequentially and in parallel from both tables, merging the results where the key matches. It requires on-disk sorting (because the tables don't fit in work_mem), but even this operation requires little disk seeking. A merge-sort algorithm might have a random access cost of logN * random_page_cost.

So I would expect an increased random_page_cost to benefit the Merge Join algorithm. And since my setup involves spinning disks, it does make sense to increase it.

It would be interesting to see the same plans with SET track_io_timing = on; set. It's possible that there's less *actual* I/O going on with the Merge Join plan vs the Hash Join plan.
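The "mostly sequential" claim for the sort feeding a merge join can be illustrated with a minimal external merge sort (a sketch of the classical algorithm, not PostgreSQL's tuplesort): sorted runs are written sequentially to temp files, and the final k-way merge then consumes each run front to back, i.e. as k sequential read streams.

```python
import heapq
import tempfile

def external_sort(values, run_size):
    """Sort with bounded memory: write sorted runs of at most run_size
    items to temp files (sequential writes), then k-way merge them
    (one sequential read stream per run)."""
    run_files = []
    for i in range(0, len(values), run_size):
        run = sorted(values[i:i + run_size])   # one "work_mem"-sized run
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in run)    # sequential write of the run
        f.seek(0)
        run_files.append(f)
    # k-way merge: each run file is read strictly front to back
    return list(heapq.merge(*(map(int, f) for f in run_files)))

print(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3))  # [1, 2, 3, 5, 7, 8, 9]
```

Seeking only happens when the merge switches between run files, which is a small number of streams, hence the expectation that an on-disk sort penalizes spinning disks far less than per-lookup random access would.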
Since we do buffered I/O, without track_io_timing we don't know whether the buffer reads resulted in actual disk reads or reads from the kernel buffers.

The database has been VACUUM ANALYZEd first and is otherwise idle. Every query has been run twice, and I paste here only the 2nd run.

Slow Hash Join:

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=145641.295..349682.387 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=12121 read=50381, temp read=56309 written=56309
   I/O Timings: shared/local read=745.925, temp read=162199.307 write=172758.699
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.114..1401.896 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=65 read=16394
         I/O Timings: shared/local read=183.959
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=145344.555..145344.557 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=12056 read=33987, temp written=43092
         I/O Timings: shared/local read=561.966, temp write=142221.740
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.033..1493.652 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=12056 read=33987
               I/O Timings: shared/local read=561.966
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.693 ms
 Execution Time: 350290.496 ms
(24 rows)

Fast Merge Join:

# SET enable_hashjoin TO off;
SET
# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=4602.623..9700.435 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5310 read=66086, temp read=32621 written=32894
   I/O Timings: shared/local read=566.121, temp read=228.063 write=526.739
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.034..1080.800 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
         I/O Timings: shared/local read=448.952
   ->  Materialize
Why is a hash join preferred when it does not fit in work_mem
Hello list,

I have a very simple NATURAL JOIN that does not fit in work_mem. Why does the query planner prefer a hash join that needs 361s, when with a sort operation and a merge join it takes only 13s?

The server is an old Mac Mini with a hard disk drive and only 4GB RAM. Postgres version info: PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

The low work_mem and the disabled memoization are set on purpose, in order to simplify a complex query while reproducing the same problem I experienced there. This is the simplest query I could reduce it to where the optimizer still does not go for the faster merge join.

From my point of view the merge join is clearly faster, because the hash table does not fit in memory and I expect a hash join to do a lot of random I/O. But the query planner does not see that, and increasing random_page_cost does not help either. In fact the opposite happens: the cost gap between the merge join and the hash join grows as I increase the random page cost!
# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
                                 QUERY PLAN
------------------------------------------------------------------------------
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=184498.464..360606.257 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=15068 read=47434, temp read=56309 written=56309
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.040..1376.084 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared read=16459
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=184361.874..184361.875 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=15068 read=30975, temp written=43092
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=15068 read=30975
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=2 read=6
 Planning Time: 0.568 ms
 Execution Time: 361106.876 ms
(20 rows)

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
                                 QUERY PLAN
------------------------------------------------------------------------------
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=5062.513..10866.313 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5343 read=66053, temp read=32621 written=32894
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
   ->  Materialize  (cost=609372.91..627887.88 rows=3702994 width=8) (actual time=5062.389..7392.640 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=33 read=16426, temp read=32621 written=32894
         ->  Sort  (cost=609372.91..618630.40 rows=3702994 width=8) (actual time=5062.378..6068.703 rows=3702994 loops=1)
               Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
               Sort Key: tasks_mm_workitems.workitem_n
               Sort Method: external merge  Disk: 65256kB
               Buffers: shared hit=33 read=16426, temp read=32621 written=32894
               ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.045..1177.202 rows=3702994 loops=1)
                     Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
                     Buffers: shared hit=33 read=16426
 Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.677 ms
 Execution Time: 13364.545 ms
(23 rows)

Thank you in advance,