Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou

On Thu, 11 Jul 2024, Tom Lane wrote:


Dimitrios Apostolou  writes:

The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n).


So don't do that.  Adding partitions is not cost-free.



I understand that; partitions also add an administrative cost that I'd rather
avoid. But I ended up adding all these partitions because of performance
issues on a multi-billion-row table. There is probably a message from me
about it on this list from a couple of years ago.

At the moment I have a workaround. I'm thankful that everyone is willing
to provide workarounds to all the potential issues/bugs I have presented, but
unfortunately workarounds are not fixes; one will hit the same wall again
at some point.

My current concern is **reporting my findings responsibly**. I want to
provide as much data as needed to pinpoint the issue, so that the developers
know exactly what's going on. Having the right data is half the fix.

A way to track the issue would be nice. I might revisit it and even try to
submit a patch. I wonder how the postgres development community keeps track
of all these issues; I've even started forgetting the ones I have
found, and I'm sure I have previously reported (on this list) a couple of
should-be-easy issues that would be ideal for beginners.


Regards,
Dimitris





Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou

Thank you for the feedback.

So I've managed to reduce the query to a rather simple one:

SELECT
workitem_n, test_executable_n,
bool_or(test_resulttype_n IN (2,3))
FROM
test_runs_raw
GROUP BY
workitem_n, test_executable_n
LIMIT 10;


The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). All the
columns are various integer types. There is an index on workitem_n.
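
For context, a schema of roughly this shape reproduces the setup (only a
sketch; the exact column types, partition bounds and names are assumptions,
not the real DDL):

-- Hypothetical minimal schema, for illustration only.
CREATE TABLE test_runs_raw (
    workitem_n        integer  NOT NULL,
    test_executable_n smallint NOT NULL,
    test_resulttype_n smallint
) PARTITION BY RANGE (workitem_n);

-- 1000 partitions, each covering a 20k-wide slice of workitem_n.
CREATE TABLE test_runs_raw__part_max20k PARTITION OF test_runs_raw
    FOR VALUES FROM (0) TO (20000);
CREATE TABLE test_runs_raw__part_max40k PARTITION OF test_runs_raw
    FOR VALUES FROM (20000) TO (40000);
-- ... and so on for the remaining partitions ...

CREATE INDEX ON test_runs_raw (workitem_n);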



On Thu, 11 Jul 2024, David Rowley wrote:


On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou  wrote:

I have a table with 1000 partitions on PostgreSQL 16.
I notice that a fairly complicated query of the form:

SELECT ... GROUP BY ... LIMIT ...

causes the postgres backend process to grow insanely fast, and the
kernel OOM killer to kill it rather soon.
It seems it tries to allocate at least 1000 * work_mem.



->  Append
   ->  HashAggregate
 ->  Seq Scan
   -> ... 1000 more hashagg+seqscans


Is this allocation pattern (work_mem * n_partitions) expected under any
scenario? I can't find it documented.  AFAIU the backend should allocate
up to (depth_of_execution_plan * work_mem) (putting aside the
hash_mem_multiplier and the parallel workers).


Not depth of execution plan. It relates to the number of nodes in the
plan which allocate work_mem or work_mem * hash_mem_multiplier.

There is some documentation in [1]:

"Note that a complex query might perform several sort and hash
operations at the same time"


The latest query is not complex at all, and I don't see it doing 1000s of
operations at the same time. By "number of nodes", do you mean I should add
up all the HashAggregate nodes under the Append node? Here is part of the
EXPLAIN ANALYZE output:

 Limit  (cost=0.01..28.00 rows=10 width=7) (actual time=43120.466..43292.246 rows=10 loops=1)
   Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, (bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[]))))
   Buffers: shared hit=96 read=883975
   I/O Timings: shared read=16284.731
   ->  Append  (cost=0.01..3416299633.71 rows=1220556171 width=7) (actual time=42968.794..43139.855 rows=10 loops=1)
         Buffers: shared hit=96 read=883975
         I/O Timings: shared read=16284.731
         ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=10.662..10.663 rows=0 loops=1)
               Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[])))
               Group Key: test_runs_raw.workitem_n, test_runs_raw.test_executable_n
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.test_runs_raw__part_max20k test_runs_raw  (cost=0.00..0.00 rows=1 width=8) (actual time=9.960..9.961 rows=0 loops=1)
                     Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, test_runs_raw.test_resulttype_n
         ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=1.913..1.914 rows=0 loops=1)
               Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, bool_or((test_runs_raw_1.test_resulttype_n = ANY ('{2,3}'::integer[])))
               Group Key: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.test_runs_raw__part_max40k test_runs_raw_1  (cost=0.00..0.00 rows=1 width=8) (actual time=1.031..1.031 rows=0 loops=1)
                     Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, test_runs_raw_1.test_resulttype_n
[  1000s of similar HashAggregate nodes ... ]
 Settings: temp_buffers = '32MB', work_mem = '32MB', effective_io_concurrency = '300', max_parallel_workers_per_gather = '0', enable_hashjoin = 'off', enable_partitionwise_join = 'on', enable_partitionwise_aggregate = 'on', random_page_cost = '1.1', effective_cache_size = '6GB', from_collapse_limit = '24', join_collapse_limit = '24'
 Planning:
   Buffers: shared hit=377
 Planning Time: 1503.800 ms
 Execution Time: 56515.185 ms
(5382 rows)

Memory usage on each HashAggregate is logged as 24kB (many HashAggregates
are missing that info though). I guess the EXPLAIN output is missing some
important part of the allocations here, since I'm seeing MBs of allocations
per node.

I can't help but see this as a bug. I see many issues:

* postgres is not reading from partitions in parallel, but one after the
  other. It shouldn't need all this memory simultaneously.

* The memory is unnecessarily allocated early on, before any partitions
  are actually aggregated. I know this because I/O is slow on this device
  and the table sizes are huge, it's simply not possible that postgres
  went through all partitions and blew up the memory. That would take
  hours, but the OOM happens seconds after I start the query.

* The memory is not only allocated by the planner, but it's actually
  accessed. Libc's malloc() has no problem allocating gigabytes more than
  wha

Query on partitioned table needs memory n_partitions * work_mem

2024-07-10 Thread Dimitrios Apostolou

Hello list,

I have a table with 1000 partitions on PostgreSQL 16.
I notice that a fairly complicated query of the form:

SELECT ... GROUP BY ... LIMIT ...

causes the postgres backend process to grow insanely fast, and the
kernel OOM killer to kill it rather soon.
It seems it tries to allocate at least 1000 * work_mem.

If I reduce the amount of work_mem, I can control the outcome and avoid
the crash, but this is suboptimal.
I have parallel plans disabled (max_parallel_workers_per_gather=0).

To add a bit more info on the execution plan, I believe the relevant part
is the 1000 HashAggregate nodes under Append:

->  Append
  ->  HashAggregate
->  Seq Scan
  -> ... 1000 more hashagg+seqscans


Is this allocation pattern (work_mem * n_partitions) expected under any
scenario? I can't find it documented.  AFAIU the backend should allocate
up to (depth_of_execution_plan * work_mem) (putting aside the
hash_mem_multiplier and the parallel workers).

NOTE: after having written the above message, it occurred to me that I have
enable_partitionwise_aggregate=on. Turning it off fixes the issue and
makes the query faster too! Expected behaviour or bug?
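
For anyone hitting the same thing, the workaround is a single setting, which
can also be applied per session only (sketch):

SET enable_partitionwise_aggregate = off;
-- then re-run the SELECT ... GROUP BY ... LIMIT query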

Thank you in advance,
Dimitris

P.S. In the meantime I'm trying to reduce the query and the table schema,
 in order to submit a precise bug report with repro instructions.





Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-21 Thread Dimitrios Apostolou

Hi Kashif, just to clarify my question, are you saying that the I/O from
parallel workers is recorded into the "background worker" entry of
pg_stat_io?

Thanks,
Dimitris


On Wed, 15 May 2024, Kashif Zeeshan wrote:


Hi
Parallel workers are used for parallel execution of queries; you can find more
details at the link below.

https://www.postgresql.org/docs/current/how-parallel-query-works.html

It's controlled by the following parameters.

max_worker_processes = 6
max_parallel_workers_per_gather = 3
max_parallel_workers = 6
The limit of concurrent parallel workers for the whole cluster is 
max_parallel_workers, which must be ≤ max_worker_processes. The limit of 
parallel workers per query is max_parallel_workers_per_gather.

Thanks
Kashif Zeeshan
Bitnine Global

On Wed, May 15, 2024 at 5:59 PM Dimitrios Apostolou  wrote:
  So what is this particular "background worker" I'm seeing, given that I
  have no replication or extensions?

  Searching the logs I found entries like the following:

  LOG:  background worker "parallel worker" (PID 93384) exited with exit 
code 1

  This got logged when I killed a simple SELECT query that took too long
  doing parallel seqscans. Could it be that the entry in pg_stat_io named
  "background worker" also includes the parallel workers from a SELECT
  query?

  Thanks,
  Dimitris


Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-15 Thread Dimitrios Apostolou

So what is this particular "background worker" I'm seeing, given that I
have no replication or extensions?

Searching the logs I found entries like the following:

LOG:  background worker "parallel worker" (PID 93384) exited with exit code 1

This got logged when I killed a simple SELECT query that took too long
doing parallel seqscans. Could it be that the entry in pg_stat_io named
"background worker" also includes the parallel workers from a SELECT
query?

Thanks,
Dimitris

On Wed, 15 May 2024, Muhammad Imtiaz wrote:


Hi,

In PostgreSQL, the pg_stat_io view provides detailed statistics on I/O
operations. Background processes perform maintenance tasks and other background
operations essential to the functioning of the PostgreSQL database.
They include processes such as:

1. Autovacuum Workers
2. WAL Writer
3. Background Writer
4. Logical Replication Workers
5. Custom Background Workers

In the pg_stat_io view, statistics related to I/O operations performed by these 
background workers are recorded. 

Regards,
Imtiaz


On Wed, 15 May 2024, 01:26 Dimitrios Apostolou,  wrote:
  Hello list,

  what is the "background worker" in the pg_stat_io statistics view? I'm
  reading the documentation but can't figure this one out knowing that it is
  not autovacuum or bgwriter. And I'm not aware of any extension I might
  have with registered background worker.

  Additionally, how can it be evictions > writes? I would expect every
  eviction to cause a write.

  Finally about "hits", I understand they are reads found in shared_buffers,
  so they never registered into the "reads" counter. So is "reads" in
  pg_stat_io the equivalent to misses, i.e. the opposite of "hits", the read
  attempts not found in the shared_buffers, that needed to be fetched from
  the disk (or OS buffercache)?

  backend_type   | background worker
  object         | relation
  context        | normal
  reads          | 5139575
  read_time      | 2196288.011
  writes         | 63277
  write_time     | 1766.94
  writebacks     | 0
  writeback_time | 0
  extends        | 0
  extend_time    | 0
  op_bytes       | 8192
  hits           | 876913705
  evictions      | 5139653
  reuses         |
  fsyncs         | 0
  fsync_time     | 0
  stats_reset    | 2024-04-08 08:50:02.971192+00


  Thank you in advance,
  Dimitris







pg_stat_io clarifications: background worker, writes and reads

2024-05-14 Thread Dimitrios Apostolou

Hello list,

what is the "background worker" in the pg_stat_io statistics view? I'm
reading the documentation but can't figure this one out knowing that it is
not autovacuum or bgwriter. And I'm not aware of any extension I might
have with registered background worker.

Additionally, how can it be evictions > writes? I would expect every
eviction to cause a write.

Finally, about "hits": I understand they are reads found in shared_buffers,
so they are never registered in the "reads" counter. So is "reads" in
pg_stat_io the equivalent of misses, i.e. the opposite of "hits": the read
attempts not found in shared_buffers that needed to be fetched from
the disk (or the OS buffer cache)?

 backend_type   | background worker
 object         | relation
 context        | normal
 reads          | 5139575
 read_time      | 2196288.011
 writes         | 63277
 write_time     | 1766.94
 writebacks     | 0
 writeback_time | 0
 extends        | 0
 extend_time    | 0
 op_bytes       | 8192
 hits           | 876913705
 evictions      | 5139653
 reuses         |
 fsyncs         | 0
 fsync_time     | 0
 stats_reset    | 2024-04-08 08:50:02.971192+00
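
(For reference, a query along these lines reproduces the row above; this is a
sketch using the column names of the PostgreSQL 16 pg_stat_io view:)

SELECT backend_type, object, context,
       reads, read_time, writes, write_time,
       hits, evictions, reuses
FROM pg_stat_io
WHERE backend_type = 'background worker'
  AND object = 'relation'
  AND context = 'normal';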


Thank you in advance,
Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou

I have forgotten to mention that I have enable_partitionwise_aggregate=on
in the global settings since the beginning. According to the docs:


Enables or disables the query planner's use of partitionwise grouping or
aggregation, which allows grouping or aggregation on partitioned tables
to be performed separately for each partition.


Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every
partition, and then it would be up to independent plans to decide whether
each partition follows a parallel or a serial plan.

Not sure if this plan was checked but rejected because of cost.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou

On Tue, 14 May 2024, David Rowley wrote:


If you were to put the n_distinct_inherited estimate back to 200 and
disable sort, you should see the costs are higher for the index plan.
If that's not the case then there might be a bug.  It seems more
likely that due to the n_distinct estimate being so low that the
planner thought that a large enough fraction of the rows needed to be
read and that made the non-index plan appear cheaper.

I'd be interested in seeing what the costs are for the index plan. I
think the following will give you that (untested):

alter table test_runs_raw alter column workitem_n set
(n_distinct_inherited=200);
analyze test_runs_raw;


I had to stop this step because it was taking too long going through all
partitions again. But it seems it had the desired effect.


set enable_sort=0;
explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC 
LIMIT 10;


It chooses the non-parallel index plan:

 Limit  (cost=365.17..1135517462.36 rows=10 width=4)
   ->  Unique  (cost=365.17..22710342308.83 rows=200 width=4)
 ->  Append  (cost=365.17..22546660655.46 rows=65472661350 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k 
test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k 
test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
[... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, which even contains a sort!
But it seems to me that the extra upper-level HashAggregate step raises
the cost by an order of magnitude, from 800M to 10G, in comparison to
doing (Unique->Sort) - see the plan in the next paragraph.

 Limit  (cost=10857220388.76..10857220389.01 rows=100 width=4)
   ->  Sort  (cost=10857220388.76..10857220389.26 rows=200 width=4)
 Sort Key: test_runs_raw.workitem_n DESC
 ->  HashAggregate  (cost=857220379.12..857220381.12 rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Gather  (cost=857220295.12..857220377.12 rows=800 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=857219295.12..857219297.12 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..816295259.21 
rows=16369614363 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 
rows=64122388 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 
rows=61611136 width=4)
[... only seqscans follow]



If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 
10:

SET SESSION enable_sort TO TRUE;
EXPLAIN  SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC 
LIMIT 10;

 Limit  (cost=857166256.39..857166256.59 rows=10 width=4)
   ->  Unique  (cost=857166256.39..857166260.39 rows=200 width=4)
 ->  Sort  (cost=857166256.39..857166258.39 rows=800 width=4)
   Sort Key: test_runs_raw.workitem_n DESC
   ->  Gather  (cost=857166135.82..857166217.82 rows=800 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=857165135.82..857165137.82 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..816243567.24 
rows=16368627432 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 
rows=64122388 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 
rows=61611136 width=4)
[... only seqscans follow]

So in order of higher to lower cost, we have the following alternatives:

1. non-parallel index scan  (800M)
2. parallel seqscan with sort  (1.3G)
3. parallel seqscan 

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou

On Tue, 14 May 2024, Dimitrios Apostolou wrote:


It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)


I see now that default_statistics_target is globally set to 500, so this
is probably the reason it took so long. I guess with the default of 100,
it would take approximately one hour. This is much better to have in a
cron job. :-)
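
If even that is too slow to schedule, maybe lowering the statistics target
just for the column the planner needs, and analyzing only that column, would
be enough (a sketch, untested here):

ALTER TABLE test_runs_raw ALTER COLUMN workitem_n SET STATISTICS 100;
ANALYZE test_runs_raw (workitem_n);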

Dimitris






Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou




On Fri, 10 May 2024, Tom Lane wrote:


Dimitrios Apostolou  writes:

Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:



  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
  ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
...



The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


I'd say the blame lies with that (probably-default) estimate of
just 200 distinct rows.  That means the planner expects to have
to read about 5% (10/200) of the tables to get the result, and
that's making fast-start plans look bad.

Possibly an explicit ANALYZE on the partitioned table would help.


It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)

And it made a difference indeed, the serial plan is chosen now:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC 
LIMIT 10;
 Limit  (cost=364.29..1835512.29 rows=10 width=4)
   ->  Unique  (cost=364.29..22701882164.56 rows=123706 width=4)
 ->  Append  (cost=364.29..22538472401.60 rows=65363905182 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k 
test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k 
test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k 
test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
[...]
   ->  Index Only Scan Backward using 
test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k 
test_runs_raw_623  (cost=0.57..12329614.53 rows=121368496 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k 
test_runs_raw_622  (cost=0.57..5180832.16 rows=184927264 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k 
test_runs_raw_621  (cost=0.57..4544964.21 rows=82018824 width=4)
[...]

Overall I think there are two issues that postgres could handle better
here:

1. Avoid the need for a manual ANALYZE on partitioned tables

2. Create a different parallel plan, one that can exit early, when the
   LIMIT is proportionally low. I feel the partitions could be
   parallel-scanned in-order, so that the whole thing stops when one
   partition has been read.

Thank you!
Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou




On Tue, 14 May 2024, David Rowley wrote:


That assumes the Append won't ever use > 1 worker per subnode, but
that's not the case for your plan as the subnodes are "Parallel".
That means all the workers could be working on the same subnode which
could result in one group being split between 2 or more workers.


Didn't think of that, makes sense!


Parallel Append can also run in a way that the Append child nodes will
only get 1 worker each.


How can I tell which case it is from the EXPLAIN output (for example
the output at [1])?

[1] 
https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Tue, 14 May 2024, David Rowley wrote:


On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou  wrote:


On Sat, 11 May 2024, David Rowley wrote:

It will. It's just that Sorting requires fetching everything from its subnode.


Isn't it plain wrong to have a sort step in the plan then? The different
partitions contain different value ranges with no overlap, and the last
query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
clause on an indexed column.


The query does contain an ORDER BY, so if the index is not chosen to
provide pre-sorted input, then something has to put the results in the
correct order before the LIMIT is applied.


The last query I tried was:

SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

See my message at

[1] 
https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Will re-check things and report back later today with the further debugging
info you asked for.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Mon, 13 May 2024, Dimitrios Apostolou wrote:


On Sat, 11 May 2024, David Rowley wrote:


 On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou  wrote:

 Indeed that's an awful estimate, the table has more than 1M of unique
 values in that column. Looking into pg_stat_user_tables, I can't see the
 partitions having been vacuum'd or analyzed at all. I think they should
 have been auto-analyzed, since they get a ton of INSERTs
 (no deletes/updates though) and I have the default autovacuum settings.
 Could it be that autovacuum starts, but never
 finishes? I can't find something in the logs.


 It's not the partitions getting analyzed you need to worry about for
 an ndistinct estimate on the partitioned table. It's auto-analyze or
 ANALYZE on the partitioned table itself that you should care about.

 If you look at [1], it says "Tuples changed in partitions and
 inheritance children do not trigger analyze on the parent table."


Thanks


Do I read that correctly, that I have to set up cron jobs to manually
analyze partitioned tables?
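
If so, I suppose such a job only needs to analyze the partitioned parents
themselves, since the leaf partitions are still auto-analyzed; something
along these lines (a sketch):

ANALYZE test_runs_raw;   -- run periodically, e.g. from cron via psql or vacuumdb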


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Sat, 11 May 2024, David Rowley wrote:


On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou  wrote:

Indeed that's an awful estimate, the table has more than 1M of unique
values in that column. Looking into pg_stat_user_tables, I can't see the
partitions having been vacuum'd or analyzed at all. I think they should
have been auto-analyzed, since they get a ton of INSERTs
(no deletes/updates though) and I have the default autovacuum settings.
Could it be that autovacuum starts, but never
finishes? I can't find something in the logs.


It's not the partitions getting analyzed you need to worry about for
an ndistinct estimate on the partitioned table. It's auto-analyze or
ANALYZE on the partitioned table itself that you should care about.

If you look at [1], it says "Tuples changed in partitions and
inheritance children do not trigger analyze on the parent table."


Thanks




In any case, even after the planner decides to execute the terrible plan
with the parallel seqscans, why doesn't it finish right when it finds 10
distinct values?


It will. It's just that Sorting requires fetching everything from its subnode.


Isn't it plain wrong to have a sort step in the plan then? The different
partitions contain different value ranges with no overlap, and the last
query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
clause on an indexed column.

Even with bad estimates, even with seq scan instead of index scan, the
plan should be such that it concludes all parallel work as soon as it
finds the 10 distinct values. And this is actually achieved if I disable
parallel plans. Could it be a bug in the parallel plan generation?


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Sat, 11 May 2024, David Rowley wrote:


On Sat, 11 May 2024 at 13:33, Tom Lane  wrote:

I do kind of wonder why it's producing both a hashagg and a Unique
step --- seems like it should do one or the other.


It still needs to make the duplicate groups from parallel workers unique.


Range partitioning of the table guarantees that, since the ranges are not
overlapping.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

On Fri, 10 May 2024, Tom Lane wrote:


Dimitrios Apostolou  writes:

Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:



  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
  ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
...



The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


I'd say the blame lies with that (probably-default) estimate of
just 200 distinct rows.  That means the planner expects to have
to read about 5% (10/200) of the tables to get the result, and
that's making fast-start plans look bad.


Indeed that's an awful estimate; the table has more than 1M unique
values in that column. Looking into pg_stat_user_tables, I can't see the
partitions having been vacuumed or analyzed at all. I think they should
have been auto-analyzed, since they get a ton of INSERTs
(no deletes/updates though) and I have the default autovacuum settings.
Could it be that autovacuum starts, but never
finishes? I can't find anything in the logs.

In any case, even after the planner decides to execute the terrible plan
with the parallel seqscans, why doesn't it finish right when it finds 10
distinct values?



Possibly an explicit ANALYZE on the partitioned table would help.


Thanks, I'll save the ANALYZE as the last step; I feel it's a good
opportunity to figure out more details about how postgres works. Plus I
expect ANALYZE to last a couple of days, so I should first find quiet time
for that. :-)

Dimitris




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou




On Fri, 10 May 2024, Dimitrios Apostolou wrote:


On Fri, 10 May 2024, Dimitrios Apostolou wrote:

Update: even the simplest SELECT DISTINCT query shows similar behaviour:


Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:

 Limit  (cost=363.84..1134528847.47 rows=10 width=4)
   ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
 ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
   ->  Index Only Scan using test_runs_raw__part_max20k_pkey on 
test_runs_raw__part_max20k test_runs_raw_1  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan using test_runs_raw__part_max40k_pkey on 
test_runs_raw__part_max40k test_runs_raw_2  (cost=0.12..2.34 rows=1 width=4)
[...]
   ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on 
test_runs_raw__part_max1780k test_runs_raw_89  (cost=0.57..53587294.65 
rows=106088160 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on 
test_runs_raw__part_max1800k test_runs_raw_90  (cost=0.57..98943539.74 
rows=96214080 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on 
test_runs_raw__part_max1820k test_runs_raw_91  (cost=0.57..97495653.34 
rows=193248960 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on 
test_runs_raw__part_max1840k test_runs_raw_92  (cost=0.57..110205205.07 
rows=218440928 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on 
test_runs_raw__part_max1860k test_runs_raw_93  (cost=0.57..50164056.28 
rows=99431760 width=4)
[...]


The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).

This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

On Fri, 10 May 2024, Dimitrios Apostolou wrote:


I noticed that the following query is very very slow (too long to wait for it
to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;


Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN
SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

 Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
   ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
 ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)  
 Sort Key: test_runs_raw.workitem_n
   ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=724517732.37..724517734.37 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..704131546.90 
rows=8154474186 width=4)
 ->  Parallel Index Only Scan using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
 ->  Parallel Index Only Scan using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..2717185.06 
rows=32060806 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2610814.95 
rows=30806095 width=4)


It also takes ages to return, so I have to interrupt it.

I believe it should exit early, as soon as it finds 10 distinct values
(which should be rather easy even with parallel seqscans, given the
pattern followed when inserting the data).


Thanks,
Dimitris




SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions.
Around half of the partitions are full and the rest are empty, created in
advance ready to receive future incoming data. Postgres is 16.2.
Here are the relevant parts of the schema:


\d test_runs_raw

                          Partitioned table "public.test_runs_raw"
       Column       |   Type   | Collation | Nullable |              Default
--------------------+----------+-----------+----------+-----------------------------------
 run_n              | bigint   |           | not null | generated by default as identity
 test_case_n        | smallint |           | not null |
 workitem_n         | integer  |           | not null |
 test_resulttype_n  | smallint |           |          |
Partition key: RANGE (workitem_n)
Indexes:
    "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in ranges
(0,20k), (20k,40k) and so on (k = kilo) up to 2k.


PROBLEM

I noticed that the following query is very very slow (too long to wait for
it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that for 998 out of the 1000 table scans it involves,
the planner does not use the index; instead it chooses a sequential scan.
Here is the output from EXPLAIN:

 Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
   ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
 ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)   
Sort Key: test_runs_raw.workitem_n DESC
   ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=853890488.22..853890490.22 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..813118117.30 
rows=16308948365 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037793.12 
rows=64121612 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918875.90 
rows=61612190 width=4)
[ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs
instantaneously:

 Limit  (cost=363.84..367.30 rows=10 width=4)
   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey 
on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 
width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey 
on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 
width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey 
on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 
width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey 
on test_runs_raw__part_max19940k test_runs_raw_997  (cost=0.12..2.34 rows=1 
width=4)
[ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the
partitions are scanned sequentially, *in proper order*,
so that the plan execution stops after reading the first
10 rows in the first non-empty partition.

Why can't the same be done with DISTINCT?
Please note that the workitem_n values are well spread within the range
(0,13M) and the table has been gradually filled over one year, so I'm
assuming the vacuum worker has had long enough to build sane statistics
(not sure how to verify that).
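
One way to verify that, I suppose, is to query the statistics views directly
(a sketch; pg_stats should only show a row for the parent after the parent
itself has been analyzed):

SELECT last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'test_runs_raw';

SELECT n_distinct
FROM pg_stats
WHERE tablename = 'test_runs_raw' AND attname = 'workitem_n';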


REMARKS

1. I tried reproducing the problem on an artificial table with few
   partitions and few values, but I couldn't. Both queries execute fast,
   and the planner is always choosing a non-parallel index-only scan.

2. Among testing changes to various settings, I just noticed that setting
   max_parallel_workers_per_gather to 0 (from the original value of 4)
   fixes the issue! On the original huge table, disabling parallelism
   actually makes the query infinitely faster and it returns within 1s! Is
   this a bug in the planner?


Thank you,
Dimitris





Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou

Thanks for the feedback Laurenz,

On Wed, 28 Feb 2024, Laurenz Albe wrote:


On Wed, 2024-02-28 at 12:16 +0100, Dimitrios Apostolou wrote:


I ended up doing some risky actions to remediate the problem: Find the
filenames that have no identically named "oid" in pg_class, and delete
(move to backup) the biggest ones while the database is stopped.
Fortunately the database started up fine after that!


Lucky you.  It should have been "relfilenode" rather than "oid",
and some catalog tables don't have their files listed in the catalog,
because they are needed *before* the database can access tables.


I actually double checked that the filenames don't appear anywhere in
SELECT * FROM pg_class
and that the files were multi-GB in size including all the
1GB pieces. But luck was definitely a part of the equation; I didn't know
that the files might be accessed before tables (at db startup?) or that
"relfilenode" would be more appropriate. Why is that, and where can I read
more? I see that many (but not all) rows in pg_class have oid=relfilenode,
but for many rows relfilenode=0, which is meaningless as a filename.
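
For reference, a safer lookup than matching filenames against pg_class.oid
would probably build on pg_relation_filenode(), which also resolves the
relfilenode=0 (mapped relation) case; a sketch, to be run in each database:

SELECT pg_relation_filenode(oid::regclass) AS filenode, relname
FROM pg_class
WHERE pg_relation_filenode(oid::regclass) IS NOT NULL
ORDER BY filenode;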




So what is the moral of the story? How to guard against this?


Monitor disk usage ...


It happened *fast*. And it was quite a big surprise coming
from "just" a disk-full situation.

A couple of suggestions; wouldn't it make sense:

+ for the index to be written by default to the table's tablespace?

+ for postgres to refuse to write non-wal files, if it's on
  the same device as the WAL and less than max_wal_size bytes are free?



The root of the problem is that you created the index in the default
tablespace.  You should have

  ALTER TABLE x ADD UNIQUE ... USING INDEX TABLESPACE bigtblspc;



Thank you, I was reading the docs but didn't realize this
syntax is valid. I thought it was only for CREATE/ALTER INDEX.



Needless to say, I would have hoped the database cleaned-up after itself
even after an uncontrolled crash, or that it provided some tools for the
job. (I tried VACUUM FULL on the table, but the orphaned files did not go
away).


That is not so simple... Also, it would slow down crash recovery.

But I agree that it would be nice to have a tool that reports or
cleans up orphaned files.

Yours,
Laurenz Albe






Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou

Hi Sergey,

On Wed, 28 Feb 2024, Sergey Fukanchik wrote:


Hi Dimitrios,
Do you have wal archiving enabled?
$PGDATA/pg_wal/ is a bit different from tablespaces (including
"default" one). It stores transaction journal.


I don't think I have WAL archiving enabled. Here are the relevant WAL
options in my config:

wal_compression = zstd
max_wal_size = 8GB
min_wal_size = 1GB
track_wal_io_timing = on
wal_writer_flush_after = 8MB

The issue happens because the WAL directory is by default on the same
filesystem as the default tablespace (the root partition on Ubuntu). So
when the default tablespace filled up because of my ALTER TABLE operation,
there was no space left for WAL either.


Thanks,
Dimitris





Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou

Hello list,

yesterday I was doing:

ALTER TABLE x ADD UNIQUE ... , ADD PRIMARY KEY ...;

The table is almost a billion rows long but lies in its own TABLESPACE,
which has plenty of space.  But apparently the ALTER TABLE command is
writing a lot to the default tablespace (not the temp_tablespace, which is
already moved to a different partition).

That quickly filled up the 50GB free space in my root partition:


20:18:04.222 UTC [94144] PANIC:  could not write to file 
"pg_wal/xlogtemp.94144": No space left on device
[...]
20:19:11.578 UTC [94140] LOG:  WAL writer process (PID 94144) was terminated by 
signal 6: Aborted
20:19:11.578 UTC [94140] LOG:  terminating any other active server processes


After postgresql crashed and restarted, the disk space in the root
partition was still not freed! I believe this is because of "orphaned
files" as discussed in mailing list thread [1].

[1] 
https://www.postgresql.org/message-id/CAN-RpxDBA7HbTsJPq4t4VznmRFJkssP2SNEMuG%3DoNJ%2B%3DsxLQew%40mail.gmail.com

I ended up doing some risky actions to remediate the problem: Find the
filenames that have no identically named "oid" in pg_class, and delete
(move to backup) the biggest ones while the database is stopped.
Fortunately the database started up fine after that!

So what is the moral of the story? How to guard against this?

Why did the database write so much to the default tablespace, even when
both the table and the temp tablespace are elsewhere?  Also should one
always keep the default tablespace away from the wal partition? (I guess
it would have helped to avoid the crash, but the ALTER TABLE command would
have still run out of space, so I'm not sure if the orphan files would
have been avoided)?

Needless to say, I would have hoped the database cleaned-up after itself
even after an uncontrolled crash, or that it provided some tools for the
job. (I tried VACUUM FULL on the table, but the orphaned files did not go
away).

My postgresql version is 16.2 installed on Ubuntu.

Thank you,
Dimitris





Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou

On Wed, 15 Nov 2023, Erik Wienhold wrote:


On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote:

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or
not. But I get different query plans, I get an extra HashAggregate node
in the case of SELECT DISTINCT. Any idea why?


As Tom Lane recently wrote[1] EXCEPT is not optimized and will operate
on the subqueries which are planned independently.

[1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us


Heh, as he wrote to me even. :-) I just wanted to make sure that this is
indeed a missing optimisation of the planner, and that the queries are
effectively the same. Thank you for clarifying.

As mentioned, the docs don't make it clear if the SELECT DISTINCT part is
implied or not; only the EXCEPT DISTINCT part is clearly on by default.


Dimitris





Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou

Hello list,

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or
not. But I get different query plans: there is an extra HashAggregate node
in the case of SELECT DISTINCT. Any idea why?

Reading the docs, I understand that postgres does by default
EXCEPT DISTINCT, but I don't see the same for the 1st clause,
SELECT [DISTINCT].


Steps to reproduce:

CREATE TABLE t1(i INTEGER NOT NULL);
CREATE TABLE t2(i INTEGER NOT NULL);

INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
INSERT INTO t2 VALUES (3);

SELECT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)


EXPLAIN SELECT * FROM t1 EXCEPT SELECT * FROM t2;
   QUERY PLAN
-
 HashSetOp Except  (cost=0.00..160.25 rows=200 width=8)
   ->  Append  (cost=0.00..147.50 rows=5100 width=8)
 ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

EXPLAIN SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
   QUERY PLAN
-
 HashSetOp Except  (cost=41.88..127.50 rows=200 width=8)
   ->  Append  (cost=41.88..120.62 rows=2750 width=8)
 ->  Subquery Scan on "*SELECT* 1"  (cost=41.88..45.88 rows=200 width=8)
   ->  HashAggregate  (cost=41.88..43.88 rows=200 width=4)
 Group Key: t1.i
 ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(8 rows)



Regards,
Dimitris





Re: Inefficient query plan for SELECT ... EXCEPT ...

2023-11-01 Thread Dimitrios Apostolou

Thank you all for the answers, they covered me well.



Is this worth a bug report? I can file one if the issue is not known.


No. It's just a missing optimisation. We know about it.


It's good I shot an email first then. FWIW my usual way in other projects
would be to check the bug tracker and just "follow" the relevant issue if
it's minor, like a missing optimisation. I didn't find a way to search for
"known issues" in the PostgreSQL project.


Dimitris




Inefficient query plan for SELECT ... EXCEPT ...

2023-10-31 Thread Dimitrios Apostolou

Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query,
where the left side is a very short table (sometimes even zero-length, though
rarely it can be as long as 200K rows), and the right side is a
table with 10M UNIQUE NOT NULL rows:


\d test_datatags

Table "public.test_datatags"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+--
 test_datatag_n | integer |   | not null | generated by default as 
identity
 test_datatag   | text|   | not null |
Indexes:
"test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
"test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)


What follows is a simplified and pathological case that takes 1.7s even
though the left side is empty:



BEGIN;
CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
ANALYZE VERBOSE tmp_table ;

INFO:  analyzing "pg_temp_9.tmp_table"
INFO:  "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead 
rows; 0 rows in sample, 0 estimated total rows

EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)

SELECT DISTINCT  d1
FROM tmp_table
WHERE d1 IS NOT NULL
EXCEPT
SELECT test_datatag FROM test_datatags;

QUERY PLAN
--
 HashSetOp Except  (cost=0.00..299854.08 rows=1 width=36) (actual 
time=1726.470..1726.472 rows=0 loops=1)
   Output: "*SELECT* 1".d1, (0)
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Append  (cost=0.00..278054.53 rows=8719821 width=36) (actual 
time=3.754..1287.901 rows=8702840 loops=1)
 Buffers: shared hit=1259 read=58800
 I/O Timings: shared/local read=77.713
 ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=36) 
(actual time=0.003..0.003 rows=0 loops=1)
   Output: "*SELECT* 1".d1, 0
   ->  HashAggregate  (cost=0.00..0.01 rows=1 width=32) (actual 
time=0.002..0.003 rows=0 loops=1)
 Output: tmp_table.d1
 Group Key: tmp_table.d1
 Batches: 1  Memory Usage: 24kB
 ->  Seq Scan on pg_temp.tmp_table  (cost=0.00..0.00 rows=1 
width=32) (actual time=0.001..0.001 rows=0 loops=1)
   Output: tmp_table.d1
   Filter: (tmp_table.d1 IS NOT NULL)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..234455.40 rows=8719820 
width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
   Output: "*SELECT* 2".test_datatag, 1
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Seq Scan on public.test_datatags  (cost=0.00..147257.20 
rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
 Output: test_datatags.test_datatag
 Buffers: shared hit=1259 read=58800
 I/O Timings: shared/local read=77.713
 Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 
'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1',
 maintenance_io_concurrency = '0', max_parallel_workers = '12', 
max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.055 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true,
Deforming true
   Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms,
Emission 3.542 ms, Total 4.037 ms
 Execution Time: 1726.835 ms
(33 rows)


I'm wondering why the planner doesn't see that the left table is very small and 
follow a different path.
From an abstract computer science POV, I would

1. sort the left table  (the right one is already indexed)
2. "merge" the two tables, by walking them in-order in parallel and excluding 
the matches
3. stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known.
Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN, which
performs much better and which I believe is equivalent. I find it less
readable than the query in question though. Plus, I have a bunch of
SELECT-EXCEPT queries (with smaller right-side tables) in my application
that I would hate to change to the uglier equivalent. Under what conditions
would the above query plan perform well?
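
The LEFT OUTER JOIN replacement looks roughly like this (a sketch, not the
exact query):

SELECT DISTINCT t.d1
FROM tmp_table AS t
LEFT OUTER JOIN test_datatags AS d ON d.test_datatag = t.d1
WHERE t.d1 IS NOT NULL
  AND d.test_datatag IS NULL;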

Thanks in advance,
Dimitris





Re: BRIN index maintenance on table without primary key

2023-10-30 Thread Dimitrios Apostolou

On Fri, 27 Oct 2023, Dimitrios Apostolou wrote:


So the question is: how to maintain the physical order of the tuples?


Answering to myself, there doesn't seem to be any way to run pg_repack on
a table without a UNIQUE key.

To run CLUSTER, the only way I see is to:

1. Create a btree index on the same column that has the BRIN index
2. CLUSTER
3. Drop the index
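
In SQL, the three steps above would be something like this (a sketch; the
index and column names are placeholders):

CREATE INDEX tmp_cluster_idx ON my_table (my_column);
CLUSTER my_table USING tmp_cluster_idx;
DROP INDEX tmp_cluster_idx;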

This would take very long on my huge table, and it keeps the table
exclusively locked. The disk space needed would also be huge if the table
isn't partitioned.

I wonder why CLUSTER can't run based on a BRIN-indexed column. Is it
theoretically impossible, or is it just not implemented yet?

My understanding so far is that CLUSTER only rewrites the table *according
to the index order* and does not touch the index itself. For a BRIN index
though it would need to rewrite the table *ignoring the index* and then
rewrite the index too, in order to keep the ranges fully optimized. So the
logic is very different, and maybe closer to what VACUUM FULL does.


Thanks,
Dimitris




BRIN index maintenance on table without primary key

2023-10-27 Thread Dimitrios Apostolou

Hello list.

Key characteristics of my case:
+ HUGE table with 20G rows, ca 2TB
+ May be partitioned (have both versions on two test clusters ATM)
+ Plenty of inserts, no updates
+ No primary key - we had one IDENTITY bigint column until recently, but
  it proved useless and inefficient (index too big for mem) so we deleted it.
+ All queries are filtering on a not-unique not-null integer column.
  + On this column we have a BRIN INDEX since insert order is
/mostly/ incremental.

So the question is: how to maintain the physical order of the tuples?

Even though the insertions populate the index column /mostly/
incrementally, there are outliers and it's certain that over time the
BRIN index will deteriorate.

I'm monitoring the "correlation" and I want to run a command to "defrag"
it when it drops below 0.9.
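
(For the monitoring part I read the correlation straight out of pg_stats,
roughly like this, with placeholder names:)

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename = 'my_table' AND attname = 'my_column';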

+ Can't run CLUSTER:
ERROR:  cannot cluster on index "tst_brin_idx" because access method does 
not support clustering

+ Can't run pg_repack, from [1]:
> Target table must have a PRIMARY KEY, or at least a UNIQUE total index on 
a NOT NULL column

[1] https://reorg.github.io/pg_repack/


Any ideas?

Thank you in advance,
Dimitris





Re: Partitionwise JOIN scanning all partitions, even unneeded ones

2023-08-30 Thread Dimitrios Apostolou

Thank you for the clear explanation, and I hope the missing optimisation
gets implemented sooner rather than later. Maybe the query planner should
consider the missing optimisation and ban *hash* partitionwise joins.

Indeed I verified that disabling hash join fixed the situation, with both
queries taking almost the same time. Great!

In the meantime, I'm considering disabling hash joins globally, as I've
had issues with them before (excessive I/O causing much slower execution
than a merge join; see [1] for your answer to my question back then :).
Do you think that would save me from other problems I'm not aware of,
given the huge size of the table I'm querying?

[1] 
https://www.postgresql.org/message-id/caaphdvppvydonkeqlybsbjwq8kq8m7ywdka44rtea2mnao3...@mail.gmail.com
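
For reference, the global change I'm considering is nothing more than
(a sketch):

ALTER SYSTEM SET enable_hashjoin = off;
SELECT pg_reload_conf();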


Regards,
Dimitris




Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-21 Thread Dimitrios Apostolou

On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:

+ Then I broke the command above in many small chunks

  WITH rows AS (
DELETE FROM tbl_legacy AS t
WHERE (partition_key_column < $1)
RETURNING t.*
   )  INSERT INTO tbl SELECT * FROM rows;
  COMMIT;

  I increase the parameter $1 and keep going in a loop.  At first this
  goes OK, after one day though I notice that it has slowed down
  significantly. My investigation shows that the simple query

   SELECT min(partition_key_column) from tbl_legacy;

  takes 38s, despite having an index on the column! A VACUUM fixes that,
  so I guess the index has a ton of dead tuples. I guess autovacuum does
  not do its job because the table is constantly busy.

  Unfortunately VACUUM takes long (several hours) on this huge table, so I
  can't add in the loop after the DELETE command.

  Is there a better way to avoid the bloat in the loop above?  Why can't
  the DELETE command update the index by pointing the minimum element
  beyond the dead tuples?


Any comments on this one? It annoys me that a simple loop deteriorated so
much and kept filling the table with bloat. What is it that VACUUM does,
which DELETE can't do, to keep the index fresh?
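
A quick way to watch whether autovacuum keeps up with the dead tuples on the
source table would be something like this (a sketch):

SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'tbl_legacy';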




Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-10 Thread Dimitrios Apostolou

Thanks! I have completed the transfer by taking down the database and
working exclusively on it, but I still wonder how one could do it in
production without exclusive locks. The loop with small DELETE...INSERT
transactions I posted on the parent post bloats the table fast.

The link you posted contains very useful info; I was not expecting that
the constraints could blow up the memory like that. Comments from me:

Disabling and then re-enabling the foreign key constraints is easily done
with ALTER TABLE.

Unfortunately it doesn't seem to be the same case for indices. One has to
create the table without indices, and then create the indices separately.
With such a process there is a risk of ending up with non-identical
table schemas.

By the way, with COPY one must use an intermediate file, right? There is
no way to COPY from table to table directly?


Thanks,
Dimitris

On Thu, 6 Jul 2023, Lorusso Domenico wrote:


Hello,
this is a standard problem during bulk copy.

Here are some suggestions; for example, disable indexes.

The main issue is related to indexes, lock escalation and log writing.
In other DBMSs you can turn logging off for a table, but PostgreSQL does not
seem to have this feature.

Anyway, taking an explicit exclusive lock on the tables should prevent lock escalation.

So: disable the indexes on the target table,
lock both tables exclusively,
insert the data,
truncate the old table.

If this doesn't work you can consider using the COPY command.







Il giorno gio 6 lug 2023 alle ore 18:12 Dimitrios Apostolou  ha 
scritto:
  On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:

  > + First I chose the method to DELETE...INSERT everything.

  Just to clarify, the query looked more like

     WITH rows AS ( DELETE FROM tbl_legacy RETURNING * )
       INSERT INTO tbl
         SELECT * FROM rows;

  >   I noticed that the postgres process was growing without bounds, up to
  >   the point that it consumed almost all of swap space and I had to kill
  >   the command. Any idea why that happens?

  Also note that my settings for work_mem, temp_buffers, shared_buffers etc
  are all well below the RAM size and postgres has never shown unbound
  growth again. Postgres version is 15.2.


  Dimitris





--
Domenico L.

to impress for half an hour, a history book is enough;
I tried to learn the Treccani by heart... [F.d.A.]




Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou

On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:


+ First I chose the method to DELETE...INSERT everything.


Just to clarify, the query looked more like

  WITH rows AS ( DELETE FROM tbl_legacy RETURNING * )
INSERT INTO tbl
  SELECT * FROM rows;


  I noticed that the postgres process was growing without bounds, up to
  the point that it consumed almost all of swap space and I had to kill
  the command. Any idea why that happens?


Also note that my settings for work_mem, temp_buffers, shared_buffers etc
are all well below the RAM size and postgres has never shown unbound
growth again. Postgres version is 15.2.


Dimitris





Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou

I have a huge (10G rows) table "tbl_legacy" (real name: test_runs) that I want
to PARTITION BY RANGE on an indexed integer column.

I have created the new partitioned table "tbl" and 1000 partitions in it,
so that it can take my existing data and have the partitions ready for the
next year (so most of the new partitions will be empty for now).
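
For reference, the setup looks roughly like this (column names and range
boundaries are simplified here):

  CREATE TABLE tbl (
      partition_key_column integer NOT NULL
      -- ... the remaining columns ...
  ) PARTITION BY RANGE (partition_key_column);

  CREATE TABLE tbl_p0 PARTITION OF tbl FOR VALUES FROM (0) TO (100000);
  CREATE TABLE tbl_p1 PARTITION OF tbl FOR VALUES FROM (100000) TO (200000);
  -- ... and so on, up to ~1000 partitions ...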

On the topic of bulk-transferring the data from tbl_legacy to tbl:

+ First I chose the method to DELETE...INSERT everything.

  I noticed that the postgres process was growing without bounds, up to
  the point that it consumed almost all of swap space and I had to kill
  the command. Any idea why that happens?

+ Then I broke the command above in many small chunks

  WITH rows AS (
DELETE FROM tbl_legacy AS t
WHERE (partition_key_column < $1)
RETURNING t.*
) INSERT INTO tbl SELECT * FROM rows;
  COMMIT;

  I increase the parameter $1 and keep going in a loop (a sketch of such a
  loop is included after this list).  At first this goes OK; after one day,
  though, I notice that it has slowed down significantly. My investigation
  shows that the simple query

SELECT min(partition_key_column) from tbl_legacy;

  takes 38s, despite having an index on the column! A VACUUM fixes that,
  so I guess the index has a ton of dead tuples. I guess autovacuum does
  not do its job because the table is constantly busy.

  Unfortunately VACUUM takes a long time (several hours) on this huge table,
  so I can't add it to the loop after the DELETE command.

  Is there a better way to avoid the bloat in the loop above?  Why can't
  the DELETE command update the index by pointing the minimum element
  beyond the dead tuples?

+ Finally I resorted to just copying the data:

INSERT INTO tbl SELECT * FROM tbl_legacy;

  This took half a day or so but worked without blowing up the memory
  usage like the 1st query.  Why is that?

  By the way, I don't consider the last solution ideal, since the data is
  duplicated between the two tables and doing it without taking downtime
  requires extra workarounds. But it worked for me this time. Any other
  ideas for partitioning an existing huge table?
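
For completeness, the chunked loop from the second method above had roughly
this shape (a sketch only; the real step size and upper bound were different):

  DO $$
  DECLARE
      cutoff integer := 0;
  BEGIN
      WHILE cutoff < 10000000 LOOP
          cutoff := cutoff + 100000;
          WITH rows AS (
              DELETE FROM tbl_legacy AS t
              WHERE t.partition_key_column < cutoff
              RETURNING t.*
          )
          INSERT INTO tbl SELECT * FROM rows;
          COMMIT;  -- works on recent versions, as long as the DO block
                   -- itself is not run inside an outer transaction
      END LOOP;
  END
  $$;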


Thanks in advance,
Dimitris




VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread Dimitrios Apostolou

Thanks everyone for the help.

I have now run a plain VACUUM but it didn't change anything; the simple
SELECT is still slow.

My understanding from reading the docs is that it should reclaim all unused
space, just not return it to the OS. Which is fine by me. Any idea why it
failed to reclaim the space in my case?

I'm now running VACUUM FULL as everyone suggested. I had tried plain
VACUUM first because I was curious whether it would work, and because it
doesn't lock the table for the hours such an operation takes on my setup.
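
Something like the following should show whether the rewrite actually shrinks
the on-disk footprint (just a sketch; relation name as in the earlier posts):

  SELECT pg_size_pretty(pg_table_size('test_runs_raw'));
  VACUUM (FULL, VERBOSE) test_runs_raw;
  SELECT pg_size_pretty(pg_table_size('test_runs_raw'));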


Thanks,
Dimitris





Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-02-01 Thread Dimitrios Apostolou

On Tue, 31 Jan 2023, David G. Johnston wrote:


It feels like there is room for improvement here using table statistics and the 
visibility map to significantly reduce the number of pages retrieved that turn 
out to be all dead. 


Yes, I too wish postgres was more clever for a simple unordered query. I
should also mention that this table has two btree indices, one primary key
index and one on another column. Would it be feasible to "descend" an
index in order to immediately find some live data?


A limit without an order is not a query to optimize for of course...


A good reason is that the web interface I'm using to browse my tables (and
I guess many others too) does a simple limited+unordered SELECT when you
click on a table. Timing out was what made me look deeper. I couldn't
accept that, no matter how slow my system is, it has trouble returning a
few rows.

It is also the first query somebody will type in psql to interactively
get a clue about the contents of a table.


Regards,
Dimitris


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

Thanks for the insight on the internals. Regarding your questions:

On Tue, 31 Jan 2023, Tom Lane wrote:


Do you get the same 10 rows when you repeat the command?


Yes. Just tested with both cold and hot caches. The first 10 rows are
exactly the same, whether they return slowly or immediately.


If turning synchronize_seqscans off changes the behavior, that'd
be a good clue that this is the right theory.


Turning it off makes the query slow no matter how many times I re-run it.
The system is doing lots of read I/O with both hot and cold caches. Here
is the EXPLAIN output from the hot cache run (that previously had only 14
hits and no reads):

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 
rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=64 read=2334462
   I/O Timings: shared/local read=567846.559
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=620510.800..620510.804 rows=10 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared hit=64 read=2334462
 I/O Timings: shared/local read=567846.559
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning Time: 0.099 ms
 Execution Time: 620510.855 ms


After reading the docs, I'm surprised this setting affects my case given
that I have no parallelism in my setup.



As for a real fix,
it might be time for a VACUUM FULL or CLUSTER on that table.


Regarding CLUSTER, would it help with a seqscan on a bloated table?
Furthermore, given that the table is expected to grow every day by a few
million rows, do you suggest running CLUSTER every night? Will postgres
remember that the rows up to N are clustered, even after appending more
rows?
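
For reference, what I would be running nightly is something like the following
(index name taken from my schema; whether the physical ordering survives the
subsequent inserts is exactly what I'm asking above):

  CLUSTER test_runs_raw USING test_runs_raw_idx_workitem_n;
  ANALYZE test_runs_raw;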


Dimitris




Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

On Tue, 31 Jan 2023, David G. Johnston wrote:


On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou  wrote:
 
      ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 
rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
            Output: run_n, test_name_n, workitem_n, started_on, 
duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
            Buffers: shared read=2334526
            I/O Timings: shared/local read=691137.029


The system has to return 10 live rows to you.  If it needs to search through 
that many buffers to find 10 live rows you most likely have a large bloating 
problem going on.  Seems like it is time to vacuum full.


I looked up how to measure bloat, so I ran the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;
 current_database | schemaname |   tablename   | tbloat | wastedbytes |            iname             | ibloat | wastedibytes
------------------+------------+---------------+--------+-------------+------------------------------+--------+--------------
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey           |    1.0 |            0
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n |    0.3 |            0
(2 rows)

Is this bloat even affecting queries that do not use the index?

It seems I have to add VACUUM FULL to the nightly maintenance. I had run some
schema restructuring (several ADD COLUMN followed by UPDATE SET on all
rows) some days ago, and I was not aware this had degraded the table.
Thanks for the useful info!


Dimitris


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

I managed to double-paste a different subset of my settings. FWIW all of
them apply to all the experiments. Sorry for the confusion.


On Tue, 31 Jan 2023, Dimitrios Apostolou wrote:


Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min


[...]


Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd





SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou



Hello list,

I run a very simple SELECT on a huge table (350M rows, 10 columns of type
integer or timestamp). The server is an old Mac Mini with 4GB RAM and a
slow HDD. All caches are cold either via reboot, or by restarting postgres
and using the `purge` command to flush the OS cache.

The question is why this simple query is taking so long to complete.
Most of the time I see constant 30MB/s read I/O. The memory usage is low
and the system is not swapping or otherwise unhealthy.  Postgres is
not serving any other requests.

Is postgres reading the whole table sequentially, just to return 10 rows in
no particular order?  I'm trying to understand what PostgreSQL is doing
and the data access patterns involved.  All feedback is appreciated.


# SELECT version();

 PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 
12.0.0 (clang-1200.0.32.29), 64-bit

# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 10;

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 
rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared read=2334526
   I/O Timings: shared/local read=691137.029
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared read=2334526
 I/O Timings: shared/local read=691137.029
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning:
   Buffers: shared hit=69 read=18
   I/O Timings: shared/local read=265.550
 Planning Time: 271.719 ms
 Execution Time: 746057.424 ms
(14 rows)


Repeating the query returns instantaneously, and EXPLAIN shows shared
hit=14 and no reads. Changing the LIMIT to 100 takes another 3min to
respond, with constant read I/O while it runs:


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 100;

 Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 
rows=100 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=14 read=772427
   I/O Timings: shared/local read=171484.072
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=0.029..184692.664 rows=100 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared hit=14 read=772427
 I/O Timings: shared/local read=171484.072
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning Time: 0.083 ms
 Execution Time: 184692.748 ms
(11 rows)


Rerunning this is instantaneous again.
Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min


Bonus question: some time ago the server crashed and rebooted, and after a
long WAL recovery and some hours of idle time, I ran the same query.  It
took even longer to respond, and besides reading, there was also a lot of
writing involved.  What kind of writes was postgres doing?


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 10;

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual 
time=1284114.810..1284114.819 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
   I/O Timings: shared/local read=104.264 write=107258.150
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
 I/O Timings: shared/local read=104.264 write=107258.150
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning Time: 0.545 ms
 Execution Time: 1284114.972 ms
(11 rows)


Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd


Thanks in advance,
Dimitris





Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-20 Thread Dimitrios Apostolou

Hello again, I am back with new experiments.

First of all, I have a concrete set of steps that replicate the
slowness of the hash join that I described. If you have a system with
spinning disks lying around, I would appreciate it if you could verify the
scenario. Can you also replicate it on different kinds of systems?


CREATE TABLE descriptions (description_id serial PRIMARY KEY, description text);
INSERT INTO  descriptions (description_id, description)
   SELECT s, repeat(encode(sha512(s::text::bytea), 'hex'), 4)
   FROM generate_series(0,1200300) AS s;
CREATE TABLE descriptions_in_books (description_id integer REFERENCES 
descriptions(description_id), book_id integer);
INSERT INTO  descriptions_in_books (description_id, book_id)
   SELECT s % 1200300, s
   FROM generate_series(0,5200300) AS s;

SET work_mem TO '1MB';
SET hash_mem_multiplier = 1.0;
SET track_io_timing TO on;

EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM descriptions NATURAL 
JOIN descriptions_in_books;
SET enable_hashjoin TO off;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM descriptions NATURAL 
JOIN descriptions_in_books;



The first JOIN query uses a hash join with Batches: 1024 and takes 622s!
For most of that time I can see the disk writing at around 1-2 MB/s, so I
have to assume it's not writing sequentially.

The second, identical JOIN uses a merge join and completes in 14s. The I/O
happens at a much higher rate (10x maybe), so I'm assuming it's mostly
sequential.

Another observation is that the hash join deteriorates as the length of
the TEXT column grows. In fact, if I fill it with strings of only 32
characters, then the hash join is split into only 128 batches, and it
completes almost as fast as the merge join. Could it be that the cost
estimation underestimates the I/O pattern caused by splitting into batches?
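
The 32-character case can be reproduced along these lines (a sketch, not the
exact statements I ran; md5() is used only because it yields 32 hex characters):

  TRUNCATE descriptions_in_books, descriptions;
  INSERT INTO  descriptions (description_id, description)
     SELECT s, md5(s::text)
     FROM generate_series(0,1200300) AS s;
  INSERT INTO  descriptions_in_books (description_id, book_id)
     SELECT s % 1200300, s
     FROM generate_series(0,5200300) AS s;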


Here are the measurements:


 Hash Join  (cost=192450.84..401456.02 rows=5200486 width=524) (actual 
time=344516.004..621725.562 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, 
descriptions_in_books.book_id
   Inner Unique: true
   Hash Cond: (descriptions_in_books.description_id = 
descriptions.description_id)
   Buffers: shared hit=15586 read=93161, temp read=97829 written=97829
   I/O Timings: shared/local read=1402.597, temp read=229252.170 
write=371508.313
   ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75015.86 
rows=5200486 width=8) (actual time=0.068..1819.629 rows=5200301 loops=1)
 Output: descriptions_in_books.book_id, 
descriptions_in_books.description_id
 Buffers: shared hit=32 read=22979
 I/O Timings: shared/local read=249.910
   ->  Hash  (cost=97739.04..97739.04 rows=1200304 width=520) (actual 
time=343268.470..343268.471 rows=1200301 loops=1)
 Output: descriptions.description_id, descriptions.description
 Buckets: 2048  Batches: 1024  Memory Usage: 686kB
 Buffers: shared hit=15554 read=70182, temp written=78538
 I/O Timings: shared/local read=1152.687, temp write=338883.205
 ->  Seq Scan on public.descriptions  (cost=0.00..97739.04 rows=1200304 
width=520) (actual time=0.028..2278.791 rows=1200301 loops=1)
   Output: descriptions.description_id, descriptions.description
   Buffers: shared hit=15554 read=70182
   I/O Timings: shared/local read=1152.687
 Settings: hash_mem_multiplier = '1', work_mem = '1MB'
 Planning Time: 0.303 ms
 Execution Time: 622495.279 ms
(22 rows)


SET enable_hashjoin TO off;


 Merge Join  (cost=868411.87..1079330.96 rows=5200301 width=524) (actual 
time=6091.932..13304.924 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, 
descriptions_in_books.book_id
   Merge Cond: (descriptions.description_id = 
descriptions_in_books.description_id)
   Buffers: shared hit=67 read=111962 written=1, temp read=45806 written=46189
   I/O Timings: shared/local read=1007.043 write=28.575, temp read=344.937 
write=794.483
   ->  Index Scan using descriptions_pkey on public.descriptions
(cost=0.43..116919.99 rows=1200304 width=520) (actual time=0.028..1596.387 
rows=1200301 loops=1)
 Output: descriptions.description_id, descriptions.description
 Buffers: shared hit=3 read=89015 written=1
 I/O Timings: shared/local read=834.732 write=28.575
   ->  Materialize  (cost=868408.84..894410.35 rows=5200301 width=8) (actual 
time=6091.892..9171.796 rows=5200301 loops=1)
 Output: descriptions_in_books.book_id, 
descriptions_in_books.description_id
 Buffers: shared hit=64 read=22947, temp read=45806 written=46189
 I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
 ->  Sort  (cost=868408.84..881409.60 rows=5200301 width=8) (actual 
time=6091.885..7392.828 rows=5200301 loops=1)
   Output: descriptions_in_books.book_id, 
descriptions_in_books.description_id
   Sort Key: 

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-16 Thread Dimitrios Apostolou

On Sat, 14 Jan 2023, Tom Lane wrote:


Dimitrios Apostolou  writes:

Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
is how I understand things according to posts I've read, and classical
algorithms:



+ The Hash Join is fastest when one side fits in work_mem. Then on one
   hand you have a hash table lookup (amortized O(1)) and on the other
   hand, if the table has M rows that do not fit in memory, you have
   sequential reads from the disk (given low fragmentation of the table or
   index files): for every line you read from the disk, you look up the key
   in the hash table.



   If the hash table does not fit in RAM then the cost becomes prohibitive.
   Every lookup is a random access possibly hitting the disk. The total
   cost should be random_page_cost * M.


That would be true of a simple hash join, but Postgres uses batched
hash joins: we split up the hash key space into subsets, where hopefully
each subset includes few enough inner-side rows to fit into work_mem.
While this can go wrong given pathological distribution of the inner-side
keys, it does mean that the join can perform well even when the inner
side is much larger than work_mem.  So it's not the case that the planner
will simply disregard hash joins beyond work_mem.  It will apply a cost
penalty for the predicted batching overhead;


Thanks for this; I found a page [1] that describes the hash join, and
now I understand a bit more.

[1] https://www.interdb.jp/pg/pgsql03.html

I'm not sure whether the key distribution is pathological in my case.
The join condition is:

  Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)

and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and PRIMARY
KEY. The TABLE workitem_ids has 1.7M rows, and the other table has 3.7M
rows. Neither of them fits in work_mem.

In my (simplified and pathological) case of work_mem == 1MB, the hash join
does 512 batches (Buckets: 4,096 Batches: 512 Memory Usage: 759kB). I'm
not sure which hash-merge strategy is followed, but based on that
document, it should be the "hybrid hash join with skew". I don't quite
follow the I/O requirements of this algorithm, yet. :-)


but that can still come out
cheaper than merge join, because the sorting needed for merge is generally
also far from cheap.


I was under the impression that on-disk merge-sort is a relatively cheap
(logN) operation, regarding random I/O.




So I would expect an increased random_page_cost to benefit the Merge Join
algorithm. And since my setup involves spinning disks, it does make sense
to increase it.


What is probably really happening is that random_page_cost affects the
estimated cost of performing the sort using an index scan instead of
a bespoke sort step.  AFAIR, cost_sort doesn't consider random_page_cost
at all, and neither does cost_hashjoin.


On the last EXPLAIN I posted for the forced merge join, I see that it uses
an index scan on the "small" table. That makes sense, since the join happens
on the primary key of the table. On the large table it does not use an
index scan, because no index exists on that column; it sorts the
3.7M rows of the table (and FWIW that table only has two integer columns).
That is, if I understood correctly what you meant by "performing the sort
using an index scan".


The problem I see is that the estimated cost of the sort operation is
609,372.91..618,630.40. It's already way above the whole hash-join cost
(121,222.68..257,633.01). However the real timings are very different.
Actual time for Sort is 4,602.569..5,414.072 ms while for the whole hash
join it is 145,641.295..349,682.387 ms.

Am I missing some configuration knobs that would knock some sense into the planner?


Thanks,
Dimitris






Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-14 Thread Dimitrios Apostolou



On Fri, 13 Jan 2023, David Rowley wrote:


I'd expect reducing random_page_cost to make the Merge Join cheaper as
that's where the Index Scan is. I'm not quite sure where you think the
random I/O is coming from in a batched hash join.


Thanks for the feedback, indeed you are right! Decreasing random_page_cost
to values way below the default makes the planner prefer the merge join!
This seems strange to me.

Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
is how I understand things according to posts I've read, and classical
algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one
  hand you have a hash table lookup (amortized O(1)) and on the other
  hand, if the table has M rows that do not fit in memory, you have
  sequential reads from the disk (given low fragmentation of the table or
  index files): for every line you read from the disk, you look up the key
  in the hash table.

  If the hash table does not fit in RAM then the cost becomes prohibitive.
  Every lookup is a random access possibly hitting the disk. The total
  cost should be random_page_cost * M.

+ The Merge Join involves mostly sequential accesses if the disk files are
  not fragmented. It reads sequentially and in parallel from both tables,
  merging the results where the key matches.

  It requires on-disk sorting (because tables don't fit in work_mem), but
  even this operation requires little disk seeking. A merge-sort algorithm
  might have a random access cost of logN * random_page_cost.

So I would expect an increased random_page_cost to benefit the Merge Join
algorithm. And since my setup involves spinning disks, it does make sense
to increase it.
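
For completeness, this is the kind of per-session experiment I mean (the value
below is only an example; the default random_page_cost is 4.0):

  SET random_page_cost = 8.0;
  EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS)
      SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
  RESET random_page_cost;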



It would be interesting to see the same plans with SET track_io_timing
= on; set.  It's possible that there's less *actual* I/O going on with
the Merge Join plan vs the Hash Join plan.  Since we do buffered I/O,
without track_io_timing, we don't know if the read buffers resulted in
an actual disk read or a read from the kernel buffers.



The database has been VACUUM ANALYZEd first and is otherwise idle.
Every query has been run twice, and I paste here only the 2nd run.


Slow Hash Join:

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems 
NATURAL JOIN workitem_ids;

 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual 
time=145641.295..349682.387 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, 
workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=12121 read=50381, temp read=56309 written=56309
   I/O Timings: shared/local read=745.925, temp read=162199.307 write=172758.699
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 
width=8) (actual time=0.114..1401.896 rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
 Buffers: shared hit=65 read=16394
 I/O Timings: shared/local read=183.959
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual 
time=145344.555..145344.557 rows=1373737 loops=1)
 Output: workitem_ids.workitem_id, workitem_ids.workitem_n
 Buckets: 4096  Batches: 512  Memory Usage: 759kB
 Buffers: shared hit=12056 read=33987, temp written=43092
 I/O Timings: shared/local read=561.966, temp write=142221.740
 ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 
width=237) (actual time=0.033..1493.652 rows=1373737 loops=1)
   Output: workitem_ids.workitem_id, workitem_ids.workitem_n
   Buffers: shared hit=12056 read=33987
   I/O Timings: shared/local read=561.966
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', 
hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = 
'1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.693 ms
 Execution Time: 350290.496 ms
(24 rows)


Fast Merge Join:

# SET enable_hashjoin TO off;
SET

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems 
NATURAL JOIN workitem_ids;

 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual 
time=4602.623..9700.435 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, 
workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5310 read=66086, temp read=32621 written=32894
   I/O Timings: shared/local read=566.121, temp read=228.063 write=526.739
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  
(cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.034..1080.800 
rows=1373737 loops=1)
 Output: workitem_ids.workitem_n, workitem_ids.workitem_id
 Buffers: shared hit=5310 read=49627
 I/O Timings: shared/local read=448.952
   ->  Materialize  

Why is a hash join preferred when it does not fit in work_mem

2023-01-12 Thread Dimitrios Apostolou

Hello list,

I have a very simple NATURAL JOIN whose working set does not fit in work_mem.
Why does the query planner prefer a hash join that needs 361s, when with a
sort and a merge join it takes only 13s?

The server is an old Mac Mini with hard disk drive and only 4GB RAM.
Postgres version info:

 PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang
version 12.0.0 (clang-1200.0.32.29), 64-bit

The low work_mem and the disabled memoization are set on purpose, in order
to simplify a complex query while reproducing the same problem I experienced
there. This is the simplest query I could reduce it to where the optimizer
does not go for the faster merge join.

From my point of view a merge join is clearly faster, because the hash
table does not fit in memory and I expect a hash join to do a lot of
random I/O. But the query planner does not see that, and increasing
random_page_cost does not help either. In fact the opposite happens: the
cost gap between the merge join and the hash join grows as I increase
the random page cost!



# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM
tasks_mm_workitems NATURAL JOIN workitem_ids;

QUERY PLAN
--
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual
time=184498.464..360606.257 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n,
workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=15068 read=47434, temp read=56309 written=56309
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94
rows=3702994 width=8) (actual time=0.040..1376.084 rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
 Buffers: shared read=16459
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual
time=184361.874..184361.875 rows=1373737 loops=1)
 Output: workitem_ids.workitem_id, workitem_ids.workitem_n
 Buckets: 4096  Batches: 512  Memory Usage: 759kB
 Buffers: shared hit=15068 read=30975, temp written=43092
 ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19
rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)
   Output: workitem_ids.workitem_id, workitem_ids.workitem_n
   Buffers: shared hit=15068 read=30975
 Settings: effective_cache_size = '500MB', enable_memoize = 'off',
hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem
= '1MB'
 Planning:
   Buffers: shared hit=2 read=6
 Planning Time: 0.568 ms
 Execution Time: 361106.876 ms
(20 rows)


# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM
tasks_mm_workitems NATURAL JOIN workitem_ids;

QUERY PLAN
---
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual 
time=5062.513..10866.313 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, 
workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5343 read=66053, temp read=32621 written=32894
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids
(cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 
rows=1373737 loops=1)
 Output: workitem_ids.workitem_n, workitem_ids.workitem_id
 Buffers: shared hit=5310 read=49627
   ->  Materialize  (cost=609372.91..627887.88 rows=3702994 width=8) (actual 
time=5062.389..7392.640 rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
 Buffers: shared hit=33 read=16426, temp read=32621 written=32894
 ->  Sort  (cost=609372.91..618630.40 rows=3702994 width=8) (actual 
time=5062.378..6068.703 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
   Sort Key: tasks_mm_workitems.workitem_n
   Sort Method: external merge  Disk: 65256kB
   Buffers: shared hit=33 read=16426, temp read=32621 written=32894
   ->  Seq Scan on public.tasks_mm_workitems
(cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.045..1177.202 
rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, 
tasks_mm_workitems.task_n
 Buffers: shared hit=33 read=16426
 Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', 
enable_memoize = 'off', hash_mem_multiplier = '1', 
max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.677 ms
 Execution Time: 13364.545 ms
(23 rows)


Thank you in advance,