Re: array_agg() does not stop aggregating according to HAVING clause

2024-08-17 Thread Dimitrios Apostolou
On Sat, 17 Aug 2024, Tom Lane wrote: Well, yes: the two aggregates (array_agg and count) are computed concurrently in a single Aggregate plan node scanning the output of the JOIN. There's no way to apply the HAVING filter until after the aggregation is finished. I think this approach is basica

array_agg() does not stop aggregating according to HAVING clause

2024-08-17 Thread Dimitrios Apostolou
Hello list, I have a query that goes through *billions* of rows and for the columns that have an infrequent "datatag" (HAVING count(test_datatag_n)<10) it selects all the IDs of the entries (array_agg(run_n)). Here is the full query: INSERT INTO infrequent_datatags_in_this_chunk SELECT datata

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-19 Thread Dimitrios Apostolou
Ok I reproduced the OOM, with only 200 partitions as opposed to 2K that I mentioned before. The keys to reproduce it I believe are: 1. Write millions of rows to *all* partitions 2. Run ANALYSE so that the planner knows about the sizes Here are the correct steps now. Let me know if you manage to

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-19 Thread Dimitrios Apostolou
On Thu, 18 Jul 2024, David Rowley wrote: On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou wrote: * The memory is unnecessarily allocated early on, before any partitions are actually aggregated. I know this because I/O is slow on this device and the table sizes are huge, it's s

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou
On Thu, 11 Jul 2024, Tom Lane wrote: Dimitrios Apostolou writes: The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). So don't do that. Adding partitions is not cost-free. I understand that, they also add an administrative cost that I'd rather avoid. But I ended

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou
ns on RANGE(workitem_n). All the columns are various integer types. There is an index on workitem_n. On Thu, 11 Jul 2024, David Rowley wrote: On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou wrote: I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated que

Query on partitioned table needs memory n_partitions * work_mem

2024-07-10 Thread Dimitrios Apostolou
Hello list, I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form: SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow insanely very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to alloca

Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-21 Thread Dimitrios Apostolou
cluster is max_parallel_workers, which must be ≤ max_worker_processes. The limit of parallel workers per query is max_parallel_workers_per_gather. Thanks Kashif Zeeshan Bitnine Global On Wed, May 15, 2024 at 5:59 PM Dimitrios Apostolou wrote: So what is this particular "background wor

Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-15 Thread Dimitrios Apostolou
cation Workers 5. Custom Background Workers In the pg_stat_io view, statistics related to I/O operations performed by these background workers are recorded.  Regards, Imtiaz On Wed, 15 May 2024, 01:26 Dimitrios Apostolou, wrote: Hello list, what is the "background worker

pg_stat_io clarifications: background worker, writes and reads

2024-05-14 Thread Dimitrios Apostolou
Hello list, what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with registered background worker. Additionally, how can it be

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
I have forgotten to mention that I have enable_partitionwise_aggregate=on in the global settings since the beginning. According to the docs: Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows grouping or aggregation on partitioned tables to be perf

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
On Tue, 14 May 2024, David Rowley wrote: If you were to put the n_distinct_inherited estimate back to 200 and disable sort, you should see the costs are higher for the index plan. If that's not the case then there might be a bug. It seems more likely that due to the n_distinct estimate being so

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
On Tue, 14 May 2024, Dimitrios Apostolou wrote: It took long but if finished: ANALYZE Time: 19177398.025 ms (05:19:37.398) I see now that default_statistics_target is globally set to 500, so this is probably the reason it took so long. I guess with the default of 100, it would take

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Tom Lane wrote: Dimitrios Apostolou writes: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a cost much higher: Limit (cost=363.84..1134528847.47

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Tue, 14 May 2024, David Rowley wrote: That assumes the Append won't ever use > 1 worker per subnode, but that's not the case for your plan as the subnodes are "Parallel". That means all the workers could be working on the same subnode which could result in one group being split between 2 o

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Tue, 14 May 2024, David Rowley wrote: On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou wrote: On Sat, 11 May 2024, David Rowley wrote: It will. It's just that Sorting requires fetching everything from its subnode. Isn't it plain wrong to have a sort step in the plan

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Mon, 13 May 2024, Dimitrios Apostolou wrote: On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote: Indeed that's an awful estimate, the table has more than 1M of unique values in that column. Looking into pg_stat_user_tables, I can'

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote: Indeed that's an awful estimate, the table has more than 1M of unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuum'd or

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou
On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:33, Tom Lane wrote: I do kind of wonder why it's producing both a hashagg and a Unique step --- seems like it should do one or the other. It still needs to make the duplicate groups from parallel workers unique. Range parti

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Tom Lane wrote: Dimitrios Apostolou writes: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a cost much higher: Limit (cost=363.84..1134528847.47

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Dimitrios Apostolou wrote: On Fri, 10 May 2024, Dimitrios Apostolou wrote: Update: even the simplest SELECT DISTINCT query shows similar behaviour: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
On Fri, 10 May 2024, Dimitrios Apostolou wrote: I noticed that the following query is very very slow (too long to wait for it to finish): SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10; Update: even the simplest SELECT DISTINCT query shows similar behaviour

SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou
Hello list, INTRO I have a huge (multi-billion rows) table partitioned into 1000 partitions. Around half of the partitions are full and the rest are empty, created in advance ready to receive future incoming data. Postgres is 16.2. Here are the relevant parts of the schema: \d test_runs_raw

Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou
Thanks for the feedback Laurenz, On Wed, 28 Feb 2024, Laurenz Albe wrote: On Wed, 2024-02-28 at 12:16 +0100, Dimitrios Apostolou wrote: I ended up doing some risky actions to remediate the problem: Find the filenames that have no identically named "oid" in pg_class, and delet

Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou
Hi Sergey, On Wed, 28 Feb 2024, Sergey Fukanchik wrote: Hi Dimitrios, Do you have wal archiving enabled? $PGDATA/pg_wal/ is a bit different from tablespaces (including "default" one). It stores transaction journal. I don't think I have WAL archiving enabled. Here are the relevant WAL options

Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou
Hello list, yesterday I was doing: ALTER TABLE x ADD UNIQUE ... , ADD PRIMARY KEY ...; The table is almost a billion rows long but lies in its own TABLESPACE that has plenty of space. But apparently the ALTER TABLE command is writing a lot to the default tablespace (not the temp_tablespace, th

Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou
On Wed, 15 Nov 2023, Erik Wienhold wrote: On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote: SELECT [DISTINCT] ... EXCEPT ... In this query I get the same results regardless of including DISTINCT or not. But I get different query plans, I get an extra HashAggregate node in the case of

Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou
Hello list, SELECT [DISTINCT] ... EXCEPT ... In this query I get the same results regardless of including DISTINCT or not. But I get different query plans, I get an extra HashAggregate node in the case of SELECT DISTINCT. Any idea why? Reading the docs, I understand that postgres does by defaul

Re: Inefficient query plan for SELECT ... EXCEPT ...

2023-11-01 Thread Dimitrios Apostolou
Thank you all for the answers, they covered me well. Is this worth a bug report? I can file one if the issue is not known. No. It's just a missing optimisation. We know about it. It's good I shot an email first then. FWIW my usual way in other projects would be to check the bugtracker, and

Inefficient query plan for SELECT ... EXCEPT ...

2023-10-31 Thread Dimitrios Apostolou
Hello list, I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query, where the left side is a very short table (even zero-length sometimes, but also also rarely can be as long as 200K rows), and the right side is a table with 10M UNIQUE NOT NULL rows: \d test_datatags

Re: BRIN index maintenance on table without primary key

2023-10-30 Thread Dimitrios Apostolou
On Fri, 27 Oct 2023, Dimitrios Apostolou wrote: So the question is: how to maintain the physical order of the tuples? Answering to myself, there doesn't seem to be any way to run pg_repack on a table without a UNIQUE key. To run CLUSTER, the only way I see is to: 1. Create a btree ind

BRIN index maintenance on table without primary key

2023-10-27 Thread Dimitrios Apostolou
Hello list. Key characteristics of my case: + HUGE table with 20G rows, ca 2TB + May be partitioned (have both versions on two test clusters ATM) + Plenty of inserts, no updates + No primary key - we had one IDENTITY bigint column until recently, but it proved useless and inefficient (index too

Re: Partitionwise JOIN scanning all partitions, even unneeded ones

2023-08-30 Thread Dimitrios Apostolou
Thank you for the clear explanation, and I hope the missing optimisation gets implemented sooner rather than later. Maybe the query planner should consider the missing optimisation and ban *hash* partitionwise joins. Indeed I verified that disabling hash join fixed the situation, with both querie

Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-21 Thread Dimitrios Apostolou
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + Then I broke the command above in many small chunks WITH rows AS ( DELETE FROM tbl_legacy AS t WHERE (partition_key_column < $1) RETURNING t.* ) INSERT INTO tbl SELECT * FROM r

Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-10 Thread Dimitrios Apostolou
ore 18:12 Dimitrios Apostolou ha scritto: On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: > + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like    WITH rows AS ( DELETE FROM tbl_legacy RETURNING * )      IN

Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like WITH rows AS ( DELETE FROM tbl_legacy RETURNING * ) INSERT INTO tbl SELECT * FROM rows; I noticed that the postgres process was

Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou
I have a huge (10G rows) table "tbl_legacy" named test_runs that I want to PARTITION BY RANGE on an indexed integer column. I have created the new partitioned table "tbl" and 1000 partitions in it, so that it can take my existing data and have the partitions ready for the next year (so most of th

VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread Dimitrios Apostolou
Thanks everyone for the help. I have now run simple VACUUM but it didn't change anything, the simple SELECT is still slow. My understanding by reading the docs is that it should reclaim all unused space, just not return it to the OS. Which is fine by me. Any idea why it failed to reclaim the spa

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-02-01 Thread Dimitrios Apostolou
On Tue, 31 Jan 2023, David G. Johnston wrote: It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead.  Yes, I too wish postgres was more clever for a simple unordered qu

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
Thanks for the insight on the internals. Regarding your questions: On Tue, 31 Jan 2023, Tom Lane wrote: Do you get the same 10 rows when you repeat the command? Yes. Just tested with both cold and hot caches. The first 10 rows are exactly the same, either they return slowly or immediately.

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
On Tue, 31 Jan 2023, David G. Johnston wrote: On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou wrote:       ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)           Output: ru

Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
I managed to double-paste different subset of my settings. FWIW all of them are true for all the experiments. Sorry for the confusion. On Tue, 31 Jan 2023, Dimitrios Apostolou wrote: Other postgres settings: shared_buffers = 1024MB effective_io_concurrency = 0 wal_compression = zstd

SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou
Hello list, I run a very simple SELECT on a huge table (350M rows, 10 columns of type integer or timestamp). The server is an old Mac Mini with 4GB RAM and a slow HDD. All caches are cold either via reboot, or by restarting postgres and using the `purge` command to flush the OS cache. The ques

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-20 Thread Dimitrios Apostolou
Hello again, I am back with new experiments. First of all, I have a concrete set of steps that replicate the slowness of the hash join that I described. If you have a system with spinning disks lying around, I would appreciate if you can verify the scenario. Can you also replicate it in different

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-16 Thread Dimitrios Apostolou
On Sat, 14 Jan 2023, Tom Lane wrote: Dimitrios Apostolou writes: Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here is how I understand things according to posts I've read, and classical algorithms: + The Hash Join is fastest when one side fits in wo

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-14 Thread Dimitrios Apostolou
On Fri, 13 Jan 2023, David Rowley wrote: I'd expect reducing random_page_cost to make the Mege Join cheaper as that's where the Index Scan is. I'm not quite sure where you think the random I/O is coming from in a batched hash join. Thanks for the feedback, indeed you are right! Decreasing ra

Why is a hash join preferred when it does not fit in work_mem

2023-01-12 Thread Dimitrios Apostolou
Hello list, I have a very simple NATURAL JOIN that does not fit in the work_mem. Why does the query planner prefer a hash join that needs 361s, while with a sort operation and a merge join it takes only 13s? The server is an old Mac Mini with hard disk drive and only 4GB RAM. Postgres version i