Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-19 Thread Dimitrios Apostolou
Ok I reproduced the OOM, with only 200 partitions as opposed to 2K that I mentioned before. The keys to reproduce it I believe are: 1. Write millions of rows to *all* partitions 2. Run ANALYSE so that the planner knows about the sizes Here are the correct steps now. Let me know if you manage
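A minimal sketch of a reproduction along those lines (table name, partition bounds and row counts are my own illustrative assumptions, not the poster's exact script):

    CREATE TABLE t (part_key int, payload int) PARTITION BY RANGE (part_key);
    -- create ~200 partitions of width 10000, e.g. generated in a loop or with \gexec:
    CREATE TABLE t_0 PARTITION OF t FOR VALUES FROM (0) TO (10000);
    -- ... repeat for the remaining ranges up to 2000000 ...
    INSERT INTO t                                   -- rows land in *all* partitions
    SELECT s.i % 2000000, s.i FROM generate_series(1, 20000000) AS s(i);
    ANALYZE t;                                      -- so the planner knows the partition sizes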

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-19 Thread Dimitrios Apostolou
On Thu, 18 Jul 2024, David Rowley wrote: On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou wrote: * The memory is unnecessarily allocated early on, before any partitions are actually aggregated. I know this because I/O is slow on this device and the table sizes are huge, it's simply

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou
On Thu, 11 Jul 2024, Tom Lane wrote: Dimitrios Apostolou writes: The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). So don't do that. Adding partitions is not cost-free. I understand that, they also add an administrative cost that I'd rather avoid. But I ended up adding

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou
on RANGE(workitem_n). All the columns are various integer types. There is an index on workitem_n. On Thu, 11 Jul 2024, David Rowley wrote: On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou wrote: I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query

Query on partitioned table needs memory n_partitions * work_mem

2024-07-10 Thread Dimitrios Apostolou
Hello list, I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form: SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow insanely very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to
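For context, a hedged sketch of the query shape under discussion; the table and column names are taken from later messages in these threads, and the aggregate itself is illustrative:

    SET work_mem = '64MB';
    -- with ~1000 partitions the backend reportedly grows towards n_partitions * work_mem
    SELECT workitem_n, count(*)
    FROM test_runs_raw
    GROUP BY workitem_n
    LIMIT 10;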

Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-21 Thread Dimitrios Apostolou
le cluster is max_parallel_workers, which must be ≤ max_worker_processes. The limit of parallel workers per query is max_parallel_workers_per_gather. Thanks Kashif Zeeshan Bitnine Global On Wed, May 15, 2024 at 5:59 PM Dimitrios Apostolou wrote: So what is this particular "background wor
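For quick reference, the settings named above can be inspected like this (values depend on your cluster):

    SHOW max_worker_processes;             -- cluster-wide ceiling on background workers
    SHOW max_parallel_workers;             -- must be <= max_worker_processes
    SHOW max_parallel_workers_per_gather;  -- per-query (per Gather node) limit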

Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-15 Thread Dimitrios Apostolou
n Workers 5. Custom Background Workers In the pg_stat_io view, statistics related to I/O operations performed by these background workers are recorded. Regards, Imtiaz On Wed, 15 May 2024, 01:26 Dimitrios Apostolou, wrote: Hello list, what is the "background worker"

pg_stat_io clarifications: background worker, writes and reads

2024-05-14 Thread Dimitrios Apostolou
Hello list, what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with registered background worker. Additionally, how can it
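A sketch of how to look at the rows in question, using the PostgreSQL 16 pg_stat_io columns:

    SELECT backend_type, object, context, reads, writes
    FROM pg_stat_io
    WHERE backend_type = 'background worker';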

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
I have forgotten to mention that I have enable_partitionwise_aggregate=on in the global settings since the beginning. According to the docs: Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows grouping or aggregation on partitioned tables to be
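A minimal sketch of toggling that setting per session to compare plans (the query itself is illustrative, reusing names from this thread):

    SET enable_partitionwise_aggregate = on;   -- shipped default is off
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT workitem_n, count(*) FROM test_runs_raw GROUP BY workitem_n;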

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
On Tue, 14 May 2024, David Rowley wrote: If you were to put the n_distinct_inherited estimate back to 200 and disable sort, you should see the costs are higher for the index plan. If that's not the case then there might be a bug. It seems more likely that due to the n_distinct estimate being
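A sketch of the experiment being suggested, assuming the table and column names from this thread:

    ALTER TABLE test_runs_raw ALTER COLUMN workitem_n SET (n_distinct_inherited = 200);
    ANALYZE test_runs_raw;
    SET enable_sort = off;   -- then re-run EXPLAIN on the DISTINCT query and compare costs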

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
On Tue, 14 May 2024, Dimitrios Apostolou wrote: It took long but it finished: ANALYZE Time: 19177398.025 ms (05:19:37.398) I see now that default_statistics_target is globally set to 500, so this is probably the reason it took so long. I guess with the default of 100, it would take
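For reference, a sketch of checking and lowering the statistics target before re-running ANALYZE (table name from this thread):

    SHOW default_statistics_target;       -- 500 here; the shipped default is 100
    SET default_statistics_target = 100;  -- affects ANALYZE runs in this session
    ANALYZE test_runs_raw;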

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Tom Lane wrote: Dimitrios Apostolou writes: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a cost much higher: Limit (cost=363.84..1134528847.47

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Tue, 14 May 2024, David Rowley wrote: That assumes the Append won't ever use > 1 worker per subnode, but that's not the case for your plan as the subnodes are "Parallel". That means all the workers could be working on the same subnode which could result in one group being split between 2

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Tue, 14 May 2024, David Rowley wrote: On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou wrote: On Sat, 11 May 2024, David Rowley wrote: It will. It's just that Sorting requires fetching everything from its subnode. Isn't it plain wrong to have a sort step in the plan than

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Mon, 13 May 2024, Dimitrios Apostolou wrote: On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote: Indeed that's an awful estimate, the table has more than 1M of unique values in that column. Looking into pg_stat_user_tables, I can't see

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote: Indeed that's an awful estimate, the table has more than 1M of unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuum'd or analyzed at all

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:33, Tom Lane wrote: I do kind of wonder why it's producing both a hashagg and a Unique step --- seems like it should do one or the other. It still needs to make the duplicate groups from parallel workers unique. Range

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Tom Lane wrote: Dimitrios Apostolou writes: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a cost much higher: Limit (cost=363.84..1134528847.47

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Dimitrios Apostolou wrote: On Fri, 10 May 2024, Dimitrios Apostolou wrote: Update: even the simplest SELECT DISTINCT query shows similar behaviour: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Dimitrios Apostolou wrote: I noticed that the following query is very very slow (too long to wait for it to finish): SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10; Update: even the simplest SELECT DISTINCT query shows similar behaviour

SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
Hello list, INTRO I have a huge (multi-billion rows) table partitioned into 1000 partitions. Around half of the partitions are full and the rest are empty, created in advance ready to receive future incoming data. Postgres is 16.2. Here are the relevant parts of the schema: \d test_runs_raw

Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou
Thanks for the feedback Laurenz. On Wed, 28 Feb 2024, Laurenz Albe wrote: On Wed, 2024-02-28 at 12:16 +0100, Dimitrios Apostolou wrote: I ended up doing some risky actions to remediate the problem: Find the filenames that have no identically named "oid" in pg_class, and de
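A hedged sketch of that kind of check, not the poster's exact commands; it compares on-disk filenames against pg_relation_filenode() and needs careful manual review before deleting anything (relation forks like _fsm/_vm and extra segments like 12345.1 are deliberately excluded by the regex):

    SELECT f.filename
    FROM pg_ls_dir('base/' || (SELECT oid FROM pg_database
                               WHERE datname = current_database())::text) AS f(filename)
    WHERE f.filename ~ '^[0-9]+$'                 -- main fork, first segment only
      AND f.filename::oid NOT IN (
            SELECT pg_relation_filenode(oid)
            FROM pg_class
            WHERE pg_relation_filenode(oid) IS NOT NULL);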

Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou
Hi Sergey, On Wed, 28 Feb 2024, Sergey Fukanchik wrote: Hi Dimitrios, Do you have wal archiving enabled? $PGDATA/pg_wal/ is a bit different from tablespaces (including "default" one). It stores transaction journal. I don't think I have WAL archiving enabled. Here are the relevant WAL options
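A quick way to check the WAL-related configuration being discussed:

    SHOW archive_mode;       -- 'off' means WAL archiving is disabled
    SHOW archive_command;
    SHOW wal_level;
    SELECT pg_size_pretty(sum(size)) FROM pg_ls_waldir();  -- current size of pg_wal/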

Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou
Hello list, yesterday I was doing: ALTER TABLE x ADD UNIQUE ... , ADD PRIMARY KEY ...; The table is almost a billion rows long but lies in its own TABLESPACE that has plenty of space. But apparently the ALTER TABLE command is writing a lot to the default tablespace (not the temp_tablespace,

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou
On Wed, 15 Nov 2023, Erik Wienhold wrote: On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote: SELECT [DISTINCT] ... EXCEPT ... In this query I get the same results regardless of including DISTINCT or not. But I get different query plans, I get an extra HashAggregate node in the case

Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou
Hello list, SELECT [DISTINCT] ... EXCEPT ... In this query I get the same results regardless of including DISTINCT or not. But I get different query plans, I get an extra HashAggregate node in the case of SELECT DISTINCT. Any idea why? Reading the docs, I understand that postgres does by

Re: Inefficient query plan for SELECT ... EXCEPT ...

2023-11-01 Thread Dimitrios Apostolou
Thank you all for the answers, they covered me well. Is this worth a bug report? I can file one if the issue is not known. No. It's just a missing optimisation. We know about it. It's good I shot an email first then. FWIW my usual way in other projects would be to check the bugtracker, and

Inefficient query plan for SELECT ... EXCEPT ...

2023-10-31 Thread Dimitrios Apostolou
Hello list, I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query, where the left side is a very short table (even zero-length sometimes, but also rarely can be as long as 200K rows), and the right side is a table with 10M UNIQUE NOT NULL rows: \d test_datatags

Re: BRIN index maintenance on table without primary key

2023-10-30 Thread Dimitrios Apostolou
On Fri, 27 Oct 2023, Dimitrios Apostolou wrote: So the question is: how to maintain the physical order of the tuples? Answering my own question: there doesn't seem to be any way to run pg_repack on a table without a UNIQUE key. To run CLUSTER, the only way I see is to: 1. Create a btree index
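A sketch of the workaround outlined above; step 1 is quoted from the message, the remaining steps are my assumption of how the CLUSTER-based approach would continue (index, table and column names are illustrative):

    CREATE INDEX tmp_order_idx ON huge_tbl (insertion_order_col);  -- 1. temporary btree in the desired order
    CLUSTER huge_tbl USING tmp_order_idx;                          -- 2. rewrites the table (ACCESS EXCLUSIVE lock)
    DROP INDEX tmp_order_idx;                                      -- 3. drop the helper index
    -- the BRIN index should then benefit from the restored physical ordering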

BRIN index maintenance on table without primary key

2023-10-27 Thread Dimitrios Apostolou
Hello list. Key characteristics of my case: + HUGE table with 20G rows, ca 2TB + May be partitioned (have both versions on two test clusters ATM) + Plenty of inserts, no updates + No primary key - we had one IDENTITY bigint column until recently, but it proved useless and inefficient (index

Re: Partitionwise JOIN scanning all partitions, even unneeded ones

2023-08-30 Thread Dimitrios Apostolou
Thank you for the clear explanation, and I hope the missing optimisation gets implemented sooner rather than later. Maybe the query planner should consider the missing optimisation and ban *hash* partitionwise joins. Indeed I verified that disabling hash join fixed the situation, with both
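A sketch of the session-level workaround mentioned; the join itself is illustrative, not the poster's query:

    SET enable_hashjoin = off;
    EXPLAIN
    SELECT * FROM fact_a a JOIN fact_b b USING (part_key) WHERE b.part_key = 42;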

Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-21 Thread Dimitrios Apostolou
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + Then I broke the command above in many small chunks WITH rows AS ( DELETE FROM tbl_legacy AS t WHERE (partition_key_column < $1) RETURNING t.* ) INSERT INTO tbl SELECT * FROM r
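A completed sketch of the batched form quoted above; the identifiers follow the quoted fragment, and $1 stands for the chunk boundary bound in the poster's loop:

    WITH rows AS (
        DELETE FROM tbl_legacy AS t
        WHERE t.partition_key_column < $1   -- one chunk per iteration
        RETURNING t.*
    )
    INSERT INTO tbl SELECT * FROM rows;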

Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-10 Thread Dimitrios Apostolou
Dimitrios Apostolou wrote: On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: > + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like WITH rows AS ( DELETE FROM tbl_legacy RETURNING * ) INSERT INTO

Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like WITH rows AS ( DELETE FROM tbl_legacy RETURNING * ) INSERT INTO tbl SELECT * FROM rows; I noticed that the postgres process

Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou
I have a huge (10G rows) table "tbl_legacy" named test_runs that I want to PARTITION BY RANGE on an indexed integer column. I have created the new partitioned table "tbl" and 1000 partitions in it, so that it can take my existing data and have the partitions ready for the next year (so most of
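A hedged sketch of that setup; the partition key and bounds are assumptions based on the related threads in this archive:

    CREATE TABLE tbl (LIKE tbl_legacy INCLUDING DEFAULTS) PARTITION BY RANGE (workitem_n);
    CREATE TABLE tbl_p0 PARTITION OF tbl FOR VALUES FROM (0) TO (100000);
    -- ... roughly 1000 such partitions, most created empty for future data ...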

VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread Dimitrios Apostolou
Thanks everyone for the help. I have now run simple VACUUM but it didn't change anything, the simple SELECT is still slow. My understanding by reading the docs is that it should reclaim all unused space, just not return it to the OS. Which is fine by me. Any idea why it failed to reclaim the
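For reference, a sketch of the two commands being compared (the table name is taken from the related thread):

    VACUUM (VERBOSE) test_runs_raw;        -- marks dead space reusable inside the files, does not shrink them
    VACUUM (FULL, VERBOSE) test_runs_raw;  -- rewrites the table and returns space to the OS,
                                           -- but holds an ACCESS EXCLUSIVE lock for the duration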

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-02-01 Thread Dimitrios Apostolou
On Tue, 31 Jan 2023, David G. Johnston wrote: It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead. Yes, I too wish postgres was more clever for a simple unordered

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
Thanks for the insight on the internals. Regarding your questions: On Tue, 31 Jan 2023, Tom Lane wrote: Do you get the same 10 rows when you repeat the command? Yes. Just tested with both cold and hot caches. The first 10 rows are exactly the same, either they return slowly or immediately.

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
On Tue, 31 Jan 2023, David G. Johnston wrote: On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou wrote: -> Seq Scan on public.test_runs_raw (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1) Output: ru

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
I managed to double-paste a different subset of my settings. FWIW all of them are true for all the experiments. Sorry for the confusion. On Tue, 31 Jan 2023, Dimitrios Apostolou wrote: Other postgres settings: shared_buffers = 1024MB effective_io_concurrency = 0 wal_compression = zstd

SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
Hello list, I run a very simple SELECT on a huge table (350M rows, 10 columns of type integer or timestamp). The server is an old Mac Mini with 4GB RAM and a slow HDD. All caches are cold either via reboot, or by restarting postgres and using the `purge` command to flush the OS cache. The

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-20 Thread Dimitrios Apostolou
Hello again, I am back with new experiments. First of all, I have a concrete set of steps that replicate the slowness of the hash join that I described. If you have a system with spinning disks lying around, I would appreciate if you can verify the scenario. Can you also replicate it in

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-16 Thread Dimitrios Apostolou
On Sat, 14 Jan 2023, Tom Lane wrote: Dimitrios Apostolou writes: Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here is how I understand things according to posts I've read, and classical algorithms: + The Hash Join is fastest when one side fits in work_mem

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-14 Thread Dimitrios Apostolou
On Fri, 13 Jan 2023, David Rowley wrote: I'd expect reducing random_page_cost to make the Merge Join cheaper as that's where the Index Scan is. I'm not quite sure where you think the random I/O is coming from in a batched hash join. Thanks for the feedback, indeed you are right! Decreasing

Why is a hash join preferred when it does not fit in work_mem

2023-01-12 Thread Dimitrios Apostolou
Hello list, I have a very simple NATURAL JOIN that does not fit in the work_mem. Why does the query planner prefer a hash join that needs 361s, while with a sort operation and a merge join it takes only 13s? The server is an old Mac Mini with hard disk drive and only 4GB RAM. Postgres version
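A hedged sketch of how the two plans could be compared on such a machine; table names and the work_mem value are illustrative:

    SET work_mem = '64MB';
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM a NATURAL JOIN b;  -- hash join chosen by default
    SET enable_hashjoin = off;
    EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM a NATURAL JOIN b;  -- sort + merge join alternative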