Re: Query on partitioned table needs memory n_partitions * work_mem
On Thu, 11 Jul 2024, Tom Lane wrote:
> Dimitrios Apostolou writes:
>> The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n).
>
> So don't do that. Adding partitions is not cost-free.

I understand that; they also add an administrative cost that I'd rather avoid. But I ended up adding all these partitions because of performance issues on a multi-billion-row table. There is probably a message from me about that on this list from a couple of years ago.

At the moment I have a workaround. I'm thankful that everyone is willing to provide workarounds for all the potential issues/bugs I have presented, but unfortunately workarounds are not fixes; one will hit the same wall again at some point.

My current concern is **reporting my findings responsibly**. I want to provide as much data as needed to pinpoint the issue, so that the developers know exactly what's going on. Having the right data is half the fix.

A way to track the issue would be nice. I might revisit it and even try to submit a patch. I wonder how the postgres development community tracks all these issues; I've even started forgetting the ones I have found, and I'm sure I have previously reported (on this list) a couple of should-be-easy issues that would be ideal for beginners.

Regards,
Dimitris
Re: Query on partitioned table needs memory n_partitions * work_mem
Thank you for the feedback. So I've managed to reduce the query to a rather simple one:

SELECT workitem_n, test_executable_n,
       bool_or(test_resulttype_n IN (2,3))
  FROM test_runs_raw
 GROUP BY workitem_n, test_executable_n
 LIMIT 10;

The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). All the columns are various integer types. There is an index on workitem_n.

On Thu, 11 Jul 2024, David Rowley wrote:
> On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou wrote:
>> I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow insanely very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to allocate at least 1000 * work_mem.
>>
>>   ->  Append
>>         ->  HashAggregate
>>               ->  Seq Scan
>>         ->  ... 1000 more hashagg+seqscans
>>
>> Is this allocation pattern (work_mem * n_partitions) expected under any scenario? I can't find it documented. AFAIU the backend should allocate up to (depth_of_execution_plan * work_mem), putting aside hash_mem_multiplier and the parallel workers.
>
> Not depth of execution plan. It relates to the number of nodes in the plan which allocate work_mem or work_mem * hash_mem_multiplier. There is some documentation in [1]:
>
> "Note that a complex query might perform several sort and hash operations at the same time"

The latest query is not complex at all, and I don't see it doing thousands of operations at the same time. By "number of nodes", would you add up all the HashAggregate nodes under an Append node?
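For anyone following along, the per-node budget David describes can be checked directly; this is a sketch against the thread's table (work_mem and hash_mem_multiplier are standard GUCs, and each work_mem-consuming node in the plan gets its own budget):

```sql
-- Show the per-node budget and the multiplier that applies to
-- hash-based nodes such as HashAggregate.
SHOW work_mem;
SHOW hash_mem_multiplier;

-- With enable_partitionwise_aggregate=on, EXPLAIN shows one HashAggregate
-- per partition under the Append; each one may use up to
-- work_mem * hash_mem_multiplier.
EXPLAIN (COSTS OFF)
SELECT workitem_n, test_executable_n,
       bool_or(test_resulttype_n IN (2,3))
  FROM test_runs_raw
 GROUP BY workitem_n, test_executable_n
 LIMIT 10;
```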
Here is part of the EXPLAIN ANALYZE output:

 Limit  (cost=0.01..28.00 rows=10 width=7) (actual time=43120.466..43292.246 rows=10 loops=1)
   Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, (bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[]))))
   Buffers: shared hit=96 read=883975
   I/O Timings: shared read=16284.731
   ->  Append  (cost=0.01..3416299633.71 rows=1220556171 width=7) (actual time=42968.794..43139.855 rows=10 loops=1)
         Buffers: shared hit=96 read=883975
         I/O Timings: shared read=16284.731
         ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=10.662..10.663 rows=0 loops=1)
               Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[])))
               Group Key: test_runs_raw.workitem_n, test_runs_raw.test_executable_n
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.test_runs_raw__part_max20k test_runs_raw  (cost=0.00..0.00 rows=1 width=8) (actual time=9.960..9.961 rows=0 loops=1)
                     Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, test_runs_raw.test_resulttype_n
         ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual time=1.913..1.914 rows=0 loops=1)
               Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, bool_or((test_runs_raw_1.test_resulttype_n = ANY ('{2,3}'::integer[])))
               Group Key: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n
               Batches: 1  Memory Usage: 24kB
               ->  Seq Scan on public.test_runs_raw__part_max40k test_runs_raw_1  (cost=0.00..0.00 rows=1 width=8) (actual time=1.031..1.031 rows=0 loops=1)
                     Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, test_runs_raw_1.test_resulttype_n
 [ 1000s of similar HashAggregate nodes ...
]

 Settings: temp_buffers = '32MB', work_mem = '32MB', effective_io_concurrency = '300', max_parallel_workers_per_gather = '0', enable_hashjoin = 'off', enable_partitionwise_join = 'on', enable_partitionwise_aggregate = 'on', random_page_cost = '1.1', effective_cache_size = '6GB', from_collapse_limit = '24', join_collapse_limit = '24'
 Planning:
   Buffers: shared hit=377
 Planning Time: 1503.800 ms
 Execution Time: 56515.185 ms
(5382 rows)

Memory usage on each HashAggregate is logged as 24kB (many HashAggregates are missing that info though). I guess the EXPLAIN output is missing some important part of the allocations here, since I'm seeing MBs of allocations per node. I can't help but see this as a bug. I see many issues:

* Postgres is not reading from the partitions in parallel, but one after the other. It shouldn't need all this memory simultaneously.
* The memory is unnecessarily allocated early on, before any partitions are actually aggregated. I know this because I/O is slow on this device and the table sizes are huge; it's simply not possible that postgres went through all partitions and blew up the memory. That would take hours, but the OOM happens seconds after I start the query.
* The memory is not only allocated by the planner, but it's actually accessed. Libc's malloc() has no problem allocating gigabytes more than wha
Query on partitioned table needs memory n_partitions * work_mem
Hello list,

I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow insanely very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to allocate at least 1000 * work_mem. If I reduce work_mem, I can control the outcome and avoid the crash, but this is suboptimal. I have parallel plans disabled (max_parallel_workers_per_gather = 0).

To add a bit more info on the execution plan, I believe the relevant part is the 1000 HashAggregate nodes under Append:

  ->  Append
        ->  HashAggregate
              ->  Seq Scan
        ->  ... 1000 more hashagg+seqscans

Is this allocation pattern (work_mem * n_partitions) expected under any scenario? I can't find it documented. AFAIU the backend should allocate up to (depth_of_execution_plan * work_mem), putting aside hash_mem_multiplier and the parallel workers.

NOTE: after having written the above message, it occurred to me that I have enable_partitionwise_aggregate=on. Turning it off fixes the issue and makes the query faster too! Expected behaviour or bug?

Thank you in advance,
Dimitris

P.S. In the meantime I'm trying to reduce the query and the table schema, in order to submit a precise bug report with repro instructions.
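For anyone hitting the same wall, the mitigations that came out of this thread can be sketched at session level like this (the worst case is roughly n_partitions * work_mem, so 1000 partitions at work_mem = '32MB' can approach 32GB):

```sql
-- Sketch of the session-level mitigations discussed in this thread;
-- table and column names are the ones from this report.

-- Option 1: stop planning one HashAggregate per partition.
SET enable_partitionwise_aggregate = off;

-- Option 2: shrink the per-node budget, since the worst case is roughly
-- n_partitions * work_mem (times hash_mem_multiplier for hash nodes).
SET work_mem = '4MB';

SELECT workitem_n, test_executable_n,
       bool_or(test_resulttype_n IN (2,3))
  FROM test_runs_raw
 GROUP BY workitem_n, test_executable_n
 LIMIT 10;
```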
Re: pg_stat_io clarifications: background worker, writes and reads
Hi Kashif, just to clarify my question: are you saying that the I/O from parallel workers is recorded in the "background worker" entry of pg_stat_io?

Thanks,
Dimitris

On Wed, 15 May 2024, Kashif Zeeshan wrote:
> Hi, parallel workers are used for the parallel execution of queries; you can find help at the link below.
> https://www.postgresql.org/docs/current/how-parallel-query-works.html
>
> It's controlled by the following parameters:
>
>   max_worker_processes = 6
>   max_parallel_workers_per_gather = 3
>   max_parallel_workers = 6
>
> The limit of concurrent parallel workers for the whole cluster is max_parallel_workers, which must be ≤ max_worker_processes. The limit of parallel workers per query is max_parallel_workers_per_gather.
>
> Thanks
> Kashif Zeeshan
> Bitnine Global
>
> On Wed, May 15, 2024 at 5:59 PM Dimitrios Apostolou wrote:
>> So what is this particular "background worker" I'm seeing, given that I have no replication or extensions? Searching the logs I found entries like the following:
>>
>> LOG:  background worker "parallel worker" (PID 93384) exited with exit code 1
>>
>> This got logged when I killed a simple SELECT query that took too long doing parallel seqscans. Could it be that the entry in pg_stat_io named "background worker" also includes the parallel workers from a SELECT query?
>>
>> Thanks,
>> Dimitris
Re: pg_stat_io clarifications: background worker, writes and reads
So what is this particular "background worker" I'm seeing, given that I have no replication or extensions? Searching the logs I found entries like the following:

LOG:  background worker "parallel worker" (PID 93384) exited with exit code 1

This got logged when I killed a simple SELECT query that took too long doing parallel seqscans. Could it be that the entry in pg_stat_io named "background worker" also includes the parallel workers from a SELECT query?

Thanks,
Dimitris

On Wed, 15 May 2024, Muhammad Imtiaz wrote:
> Hi,
>
> In PostgreSQL, the pg_stat_io view provides detailed statistics on I/O operations. Background processes perform maintenance tasks and other background operations essential to the functioning of the PostgreSQL database. They include processes such as:
>
> 1. Autovacuum Workers
> 2. WAL Writer
> 3. Background Writer
> 4. Logical Replication Workers
> 5. Custom Background Workers
>
> In the pg_stat_io view, statistics related to I/O operations performed by these background workers are recorded.
>
> Regards,
> Imtiaz

On Wed, 15 May 2024, 01:26 Dimitrios Apostolou wrote:
> Hello list, what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out, knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with a registered background worker. Additionally, how can it be evictions > writes? I would expect every eviction to cause a write. Finally, about "hits": I understand they are reads found in shared_buffers, so they never registered in the "reads" counter. So is "reads" in pg_stat_io equivalent to misses, i.e. the opposite of "hits": the read attempts not found in shared_buffers that needed to be fetched from disk (or the OS buffer cache)?
 backend_type      | object   | context | reads   | read_time   | writes | write_time | writebacks | writeback_time | extends | extend_time | op_bytes | hits      | evictions | reuses | fsyncs | fsync_time | stats_reset
-------------------+----------+---------+---------+-------------+--------+------------+------------+----------------+---------+-------------+----------+-----------+-----------+--------+--------+------------+-------------------------------
 background worker | relation | normal  | 5139575 | 2196288.011 | 63277  | 1766.94    | 0          | 0              | 0       | 0           | 8192     | 876913705 | 5139653   |        | 0      | 0          | 2024-04-08 08:50:02.971192+00

Thank you in advance,
Dimitris
pg_stat_io clarifications: background worker, writes and reads
Hello list,

what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out, knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with a registered background worker.

Additionally, how can it be evictions > writes? I would expect every eviction to cause a write.

Finally, about "hits": I understand they are reads found in shared_buffers, so they never registered in the "reads" counter. So is "reads" in pg_stat_io equivalent to misses, i.e. the opposite of "hits": the read attempts not found in shared_buffers that needed to be fetched from disk (or the OS buffer cache)?

 backend_type      | object   | context | reads   | read_time   | writes | write_time | writebacks | writeback_time | extends | extend_time | op_bytes | hits      | evictions | reuses | fsyncs | fsync_time | stats_reset
-------------------+----------+---------+---------+-------------+--------+------------+------------+----------------+---------+-------------+----------+-----------+-----------+--------+--------+------------+-------------------------------
 background worker | relation | normal  | 5139575 | 2196288.011 | 63277  | 1766.94    | 0          | 0              | 0       | 0           | 8192     | 876913705 | 5139653   |        | 0      | 0          | 2024-04-08 08:50:02.971192+00

Thank you in advance,
Dimitris
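The row above can be pulled out of the view directly; a minimal sketch using the standard pg_stat_io columns (the view exists since PostgreSQL 16):

```sql
-- Inspect only the "background worker" entry discussed above.
SELECT backend_type, object, context,
       reads, hits, evictions, reuses, writes
  FROM pg_stat_io
 WHERE backend_type = 'background worker'
   AND object = 'relation';
```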
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
I have forgotten to mention that I have enable_partitionwise_aggregate=on in the global settings since the beginning. According to the docs:

> Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows grouping or aggregation on partitioned tables to be performed separately for each partition.

Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every partition, and then it would be up to the independent plans to decide whether each partition follows a parallel or a serial plan. Not sure if this plan was checked but rejected because of cost.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote:
> If you were to put the n_distinct_inherited estimate back to 200 and disable sort, you should see the costs are higher for the index plan. If that's not the case then there might be a bug. It seems more likely that due to the n_distinct estimate being so low that the planner thought that a large enough fraction of the rows needed to be read and that made the non-index plan appear cheaper. I'd be interested in seeing what the costs are for the index plan. I think the following will give you that (untested):
>
> alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=200);
> analyze test_runs_raw;

I had to stop this step because it was taking too long going through all partitions again. But it seems it had the desired effect.

> set enable_sort=0;
> explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

It chooses the non-parallel index plan:

 Limit  (cost=365.17..1135517462.36 rows=10 width=4)
   ->  Unique  (cost=365.17..22710342308.83 rows=200 width=4)
         ->  Append  (cost=365.17..22546660655.46 rows=65472661350 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
 [... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, which even contains a sort! But it seems to me that the extra upper-level HashAggregate step raises the cost by an order of magnitude, from 800M to 10G, in comparison to doing (Unique->Sort) - see the plan in the next paragraph.
 Limit  (cost=10857220388.76..10857220389.01 rows=100 width=4)
   ->  Sort  (cost=10857220388.76..10857220389.26 rows=200 width=4)
         Sort Key: test_runs_raw.workitem_n DESC
         ->  HashAggregate  (cost=857220379.12..857220381.12 rows=200 width=4)
               Group Key: test_runs_raw.workitem_n
               ->  Gather  (cost=857220295.12..857220377.12 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=857219295.12..857219297.12 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..816295259.21 rows=16369614363 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
 [...
only seqscans follow]

If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 10:

SET SESSION enable_sort TO TRUE;
EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

 Limit  (cost=857166256.39..857166256.59 rows=10 width=4)
   ->  Unique  (cost=857166256.39..857166260.39 rows=200 width=4)
         ->  Sort  (cost=857166256.39..857166258.39 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=857166135.82..857166217.82 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=857165135.82..857165137.82 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..816243567.24 rows=16368627432 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 rows=64122388 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 rows=61611136 width=4)
 [... only seqscans follow]

So in order of higher to lower cost, we have the following alternatives:

1. non-parallel index scan (800M)
2. parallel seqscan with sort (1.3G)
3. parallel seqscan
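For anyone repeating this experiment, the n_distinct_inherited override from David's suggestion can be removed again; a sketch using the standard ALTER TABLE ... RESET syntax:

```sql
-- Undo the experimental override so the planner goes back to the
-- ANALYZE-computed ndistinct estimate, then rebuild the statistics.
ALTER TABLE test_runs_raw ALTER COLUMN workitem_n RESET (n_distinct_inherited);
ANALYZE test_runs_raw;

-- And restore the sort toggle to its default.
SET enable_sort TO DEFAULT;
```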
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, Dimitrios Apostolou wrote:
> It took long but it finished:
>
> ANALYZE
> Time: 19177398.025 ms (05:19:37.398)

I see now that default_statistics_target is globally set to 500, so this is probably the reason it took so long. I guess with the default of 100 it would take approximately one hour. This is much better to have in a cron job. :-)

Dimitris
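If the global default_statistics_target of 500 is what makes ANALYZE so slow, one option (a sketch; the per-column override is standard PostgreSQL) is to lower the target only for the column that matters here, without touching the global setting:

```sql
-- Lower the statistics target for this one column, leaving the global
-- default_statistics_target = 500 untouched, then re-ANALYZE.
ALTER TABLE test_runs_raw ALTER COLUMN workitem_n SET STATISTICS 100;
ANALYZE test_runs_raw;
```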
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Tom Lane wrote:
> Dimitrios Apostolou writes:
>> Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a much higher cost:
>>
>>  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>>    ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>>          ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of just 200 distinct rows. That means the planner expects to have to read about 5% (10/200) of the tables to get the result, and that's making fast-start plans look bad. Possibly an explicit ANALYZE on the partitioned table would help.

It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)

And it made a difference indeed; the serial plan is chosen now:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

 Limit  (cost=364.29..1835512.29 rows=10 width=4)
   ->  Unique  (cost=364.29..22701882164.56 rows=123706 width=4)
         ->  Append  (cost=364.29..22538472401.60 rows=65363905182 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
 [...]
               ->  Index Only Scan Backward using test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k test_runs_raw_623  (cost=0.57..12329614.53 rows=121368496 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k test_runs_raw_622  (cost=0.57..5180832.16 rows=184927264 width=4)
               ->  Index Only Scan Backward using test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k test_runs_raw_621  (cost=0.57..4544964.21 rows=82018824 width=4)
 [...]

Overall I think there are two issues that postgres could handle better here:

1. Avoid the need for a manual ANALYZE on the partitioned table.
2. Create a different parallel plan, one that can exit early, when the LIMIT is proportionally low. I feel the partitions could be parallel-scanned in order, so that the whole thing stops when one partition has been read.

Thank you!
Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote:
> That assumes the Append won't ever use > 1 worker per subnode, but that's not the case for your plan as the subnodes are "Parallel". That means all the workers could be working on the same subnode, which could result in one group being split between 2 or more workers.

Didn't think of that, makes sense!

> Parallel Append can also run in a way that the Append child nodes will only get 1 worker each.

How can I tell which case it is, from the EXPLAIN output (for example the output at [1])?

[1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote:
> On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou wrote:
>> On Sat, 11 May 2024, David Rowley wrote:
>>> It will. It's just that Sorting requires fetching everything from its subnode.
>>
>> Isn't it plain wrong to have a sort step in the plan then? The different partitions contain different value ranges with no overlap, and the last query I posted doesn't even contain an ORDER BY clause, just a DISTINCT clause on an indexed column.
>
> The query does contain an ORDER BY, so if the index is not chosen to provide pre-sorted input, then something has to put the results in the correct order before the LIMIT is applied.

The last query I tried was:

SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

See my message at [1].

[1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Will re-check things and report back later today with the further debugging info you asked for.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Mon, 13 May 2024, Dimitrios Apostolou wrote:
> On Sat, 11 May 2024, David Rowley wrote:
>> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote:
>>> Indeed that's an awful estimate; the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuumed or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs.
>>
>> It's not the partitions getting analyzed you need to worry about for an ndistinct estimate on the partitioned table. It's auto-analyze or ANALYZE on the partitioned table itself that you should care about. If you look at [1], it says "Tuples changed in partitions and inheritance children do not trigger analyze on the parent table."
>
> Thanks

Do I read that correctly, that I have to set up cron jobs to manually ANALYZE partitioned tables?

Dimitris
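Since autovacuum does not analyze the partitioned parent itself, a periodic job is the usual answer; the statement is standard, and the scheduling mechanism (cron, pg_cron, etc.) is up to the operator:

```sql
-- Run periodically: analyzing the partitioned parent computes the
-- inherited statistics (including ndistinct) that autovacuum does not
-- maintain for it. The leaf partitions are still auto-analyzed as usual.
ANALYZE test_runs_raw;
```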
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Sat, 11 May 2024, David Rowley wrote:
> On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote:
>> Indeed that's an awful estimate; the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuumed or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs.
>
> It's not the partitions getting analyzed you need to worry about for an ndistinct estimate on the partitioned table. It's auto-analyze or ANALYZE on the partitioned table itself that you should care about. If you look at [1], it says "Tuples changed in partitions and inheritance children do not trigger analyze on the parent table."

Thanks

>> In any case, even after the planner decides to execute the terrible plan with the parallel seqscans, why doesn't it finish right when it finds 10 distinct values?
>
> It will. It's just that Sorting requires fetching everything from its subnode.

Isn't it plain wrong to have a sort step in the plan then? The different partitions contain different value ranges with no overlap, and the last query I posted doesn't even contain an ORDER BY clause, just a DISTINCT clause on an indexed column. Even with bad estimates, even with a seq scan instead of an index scan, the plan should be such that it concludes all parallel work as soon as it finds the 10 distinct values. And this is actually achieved if I disable parallel plans. Could it be a bug in the parallel plan generation?

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Sat, 11 May 2024, David Rowley wrote:
> On Sat, 11 May 2024 at 13:33, Tom Lane wrote:
>> I do kind of wonder why it's producing both a hashagg and a Unique step --- seems like it should do one or the other.
>
> It still needs to make the duplicate groups from parallel workers unique.

Range partitioning of the table guarantees that, since the ranges are not overlapping.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Tom Lane wrote:
> Dimitrios Apostolou writes:
>> Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a much higher cost:
>>
>>  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
>>    ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
>>          ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of just 200 distinct rows. That means the planner expects to have to read about 5% (10/200) of the tables to get the result, and that's making fast-start plans look bad.

Indeed that's an awful estimate; the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuumed or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs.

In any case, even after the planner decides to execute the terrible plan with the parallel seqscans, why doesn't it finish right when it finds 10 distinct values?

> Possibly an explicit ANALYZE on the partitioned table would help.

Thanks, I'll save the ANALYZE as the last step; I feel it's a good opportunity to figure out more details about how postgres works. Plus I expect ANALYZE to last a couple of days, so I should first find quiet time for that. :-)

Dimitris
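The "were the partitions ever (auto)vacuumed/analyzed" question above can be answered with a query like the following sketch, using the standard pg_stat_user_tables timestamp columns; note the partitioned parent has its own row there too:

```sql
-- Check when the parent and its partitions were last (auto)vacuumed
-- and (auto)analyzed; NULLs mean "never since stats reset".
SELECT relname, last_vacuum, last_autovacuum,
       last_analyze, last_autoanalyze
  FROM pg_stat_user_tables
 WHERE relname LIKE 'test_runs_raw%'
 ORDER BY relname;
```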
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Dimitrios Apostolou wrote:
> On Fri, 10 May 2024, Dimitrios Apostolou wrote:
>> Update: even the simplest SELECT DISTINCT query shows similar behaviour:

Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a much higher cost:

 Limit  (cost=363.84..1134528847.47 rows=10 width=4)
   ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
         ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
               ->  Index Only Scan using test_runs_raw__part_max20k_pkey on test_runs_raw__part_max20k test_runs_raw_1  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan using test_runs_raw__part_max40k_pkey on test_runs_raw__part_max40k test_runs_raw_2  (cost=0.12..2.34 rows=1 width=4)
 [...]
               ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on test_runs_raw__part_max1780k test_runs_raw_89  (cost=0.57..53587294.65 rows=106088160 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on test_runs_raw__part_max1800k test_runs_raw_90  (cost=0.57..98943539.74 rows=96214080 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on test_runs_raw__part_max1820k test_runs_raw_91  (cost=0.57..97495653.34 rows=193248960 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on test_runs_raw__part_max1840k test_runs_raw_92  (cost=0.57..110205205.07 rows=218440928 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on test_runs_raw__part_max1860k test_runs_raw_93  (cost=0.57..50164056.28 rows=99431760 width=4)
 [...]

The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Dimitrios Apostolou wrote:
> I noticed that the following query is very very slow (too long to wait for it to finish):
>
> SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

 Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
   ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
         ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)
               Sort Key: test_runs_raw.workitem_n
               ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=724517732.37..724517734.37 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..704131546.90 rows=8154474186 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..2717185.06 rows=32060806 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2610814.95 rows=30806095 width=4)

It also takes ages to return, so I have to interrupt it. I believe it should exit early, as soon as it finds 10 distinct values (which should be rather easy even with parallel seqscans, given the pattern followed when inserting the data).

Thanks,
Dimitris
SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions. Around half of the partitions are full and the rest are empty, created in advance ready to receive future incoming data. Postgres is 16.2. Here are the relevant parts of the schema:

\d test_runs_raw
                        Partitioned table "public.test_runs_raw"
      Column       |   Type   | Collation | Nullable |             Default
-------------------+----------+-----------+----------+----------------------------------
 run_n             | bigint   |           | not null | generated by default as identity
 test_case_n       | smallint |           | not null |
 workitem_n        | integer  |           | not null |
 test_resulttype_n | smallint |           |          |
Partition key: RANGE (workitem_n)
Indexes:
    "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in the ranges (0,20k), (20k,40k) and so on (k = kilo), up to 20M.

PROBLEM

I noticed that the following query is very very slow (too long to wait for it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that in 998 out of the 1000 table scans it involves, the planner does not use the index. Instead it chooses a sequential scan.
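For readers who want to reproduce a similar layout, the partitioning scheme described above would have been created with DDL along these lines; this is a hypothetical reconstruction from the partition names appearing in the plans (test_runs_raw__part_maxNNNk, 20k-wide ranges), not the author's actual script:

```sql
-- Hypothetical reconstruction of the first two range partitions.
CREATE TABLE test_runs_raw__part_max20k PARTITION OF test_runs_raw
    FOR VALUES FROM (0) TO (20000);
CREATE TABLE test_runs_raw__part_max40k PARTITION OF test_runs_raw
    FOR VALUES FROM (20000) TO (40000);
-- ... and so on, one partition per 20k range of workitem_n.
```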
Here is the output from EXPLAIN:

 Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
   ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
         ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=853890488.22..853890490.22 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..813118117.30 rows=16308948365 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037793.12 rows=64121612 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918875.90 rows=61612190 width=4)
                                 [ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs instantaneously:

 Limit  (cost=363.84..367.30 rows=10 width=4)
   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k test_runs_raw_997  (cost=0.12..2.34 rows=1 width=4)
         [ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the partitions are scanned sequentially, *in proper order*, so that plan execution stops after reading the first 10 rows of the first non-empty partition. Why can't the same be done with DISTINCT?

Please note that the workitem_n values are well spread in the range (0,13M), and the table has been gradually filled within one year, so I'm assuming the vacuum worker has run long enough to build sane statistics (not sure how to verify that).

REMARKS

1. I tried reproducing the problem on an artificial table with few partitions and few values, but I couldn't. Both queries execute fast, and the planner always chooses a non-parallel index-only scan.

2. While testing changes to various settings, I noticed that setting max_parallel_workers_per_gather to 0 (from the original value of 4) fixes the issue! On the original huge table, disabling parallelism actually makes the query infinitely faster, and it returns within 1s!

Is this a bug in the planner?

Thank you,
Dimitris
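A common client-side workaround for slow SELECT DISTINCT on a btree-indexed column is a "loose index scan" (a.k.a. skip scan): fetch the next distinct value with an indexed min()/max() probe, which the planner turns into a cheap index lookup, and repeat. Below is a minimal sketch of the idea in Python over an in-memory SQLite table (table and column names mirror the thread; against PostgreSQL one would issue the same max() queries, or use a recursive CTE server-side):

```python
import sqlite3

# Hypothetical miniature stand-in for test_runs_raw: many rows,
# few distinct workitem_n values, with an index on workitem_n.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_runs_raw (workitem_n INTEGER NOT NULL)")
conn.execute("CREATE INDEX test_runs_raw_idx ON test_runs_raw (workitem_n)")
conn.executemany("INSERT INTO test_runs_raw VALUES (?)",
                 [(n,) for n in range(50) for _ in range(100)])

def distinct_desc(conn, limit):
    """Emulate a loose index scan: each iteration fetches the next
    distinct value with an indexed max() lookup instead of scanning
    and aggregating every row."""
    results = []
    bound = None
    while len(results) < limit:
        if bound is None:
            row = conn.execute(
                "SELECT max(workitem_n) FROM test_runs_raw").fetchone()
        else:
            row = conn.execute(
                "SELECT max(workitem_n) FROM test_runs_raw"
                " WHERE workitem_n < ?", (bound,)).fetchone()
        if row[0] is None:          # no more distinct values
            break
        results.append(row[0])
        bound = row[0]
    return results

print(distinct_desc(conn, 10))  # → [49, 48, 47, 46, 45, 44, 43, 42, 41, 40]
```

The loop issues `limit` index probes instead of one full aggregate over billions of rows, which is exactly the early exit the quoted plan fails to perform.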
Re: Orphan files filling root partition after crash
Thanks for the feedback Laurenz,

On Wed, 28 Feb 2024, Laurenz Albe wrote:

On Wed, 2024-02-28 at 12:16 +0100, Dimitrios Apostolou wrote:
I ended up doing some risky actions to remediate the problem: Find the filenames that have no identically named "oid" in pg_class, and delete (move to backup) the biggest ones while the database is stopped. Fortunately the database started up fine after that!

Lucky you. It should have been "relfilenode" rather than "oid", and some catalog tables don't have their files listed in the catalog, because they are needed *before* the database can access tables.

I actually double-checked that the filenames don't appear anywhere in SELECT * FROM pg_class, and that the files were multi-GB in size including all the 1GB pieces. But luck was definitely a part of the equation. I didn't know that the files might be accessed before tables (at db startup?), or that "relfilenode" would be more appropriate. Why is that, and where can I read more? I see that many (but not all) rows in pg_class have oid = relfilenode, but for many rows relfilenode = 0, which is meaningless as a filename.

So what is the moral of the story? How to guard against this?

Monitor disk usage ...

It happened *fast*. And it was quite a big surprise coming from "just" a disk-full situation. A couple of suggestions; wouldn't it make sense:

+ for the index to be written by default to the table's tablespace?
+ for postgres to refuse to write non-WAL files, if it's on the same device as the WAL and less than max_wal_size bytes are free?

The root of the problem is that you created the index in the default tablespace. You should have ALTER TABLE x ADD UNIQUE ... USING INDEX TABLESPACE bigtblspc;

Thank you, I was reading the docs but didn't realize this syntax is valid. I thought it was only for CREATE/ALTER INDEX.

Needless to say, I would have hoped the database cleaned up after itself even after an uncontrolled crash, or that it provided some tools for the job.
(I tried VACUUM FULL on the table, but the orphaned files did not go away). That is not so simple... Also, it would slow down crash recovery. But I agree that it would be nice to have a tool that reports or cleans up orphaned files. Yours, Laurenz Albe
Re: Orphan files filling root partition after crash
Hi Sergey, On Wed, 28 Feb 2024, Sergey Fukanchik wrote: Hi Dimitrios, Do you have wal archiving enabled? $PGDATA/pg_wal/ is a bit different from tablespaces (including "default" one). It stores transaction journal. I don't think I have WAL archiving enabled. Here are the relevant WAL options in my config: wal_compression = zstd max_wal_size = 8GB min_wal_size = 1GB track_wal_io_timing = on wal_writer_flush_after = 8MB The issue happens because the WAL directory is by default in the same filesystem with the default tablespace (root partition for Ubuntu). So when the default tablespace filled up because of my ALTER TABLE operation, there was no space for WAL either. Thanks, Dimitris
Orphan files filling root partition after crash
Hello list, yesterday I was doing: ALTER TABLE x ADD UNIQUE ... , ADD PRIMARY KEY ...; The table is almost a billion rows long but lies in its own TABLESPACE that has plenty of space. But apparently the ALTER TABLE command is writing a lot to the default tablespace (not the temp_tablespace, that is already moved to a different partition). That quickly filled up the 50GB free space in my root partition: 20:18:04.222 UTC [94144] PANIC: could not write to file "pg_wal/xlogtemp.94144": No space left on device [...] 20:19:11.578 UTC [94140] LOG: WAL writer process (PID 94144) was terminated by signal 6: Aborted 20:19:11.578 UTC [94140] LOG: terminating any other active server processes After postgresql crashed and restarted, the disk space in the root partition was still not freed! I believe this is because of "orphaned files" as discussed in mailing list thread [1]. [1] https://www.postgresql.org/message-id/CAN-RpxDBA7HbTsJPq4t4VznmRFJkssP2SNEMuG%3DoNJ%2B%3DsxLQew%40mail.gmail.com I ended up doing some risky actions to remediate the problem: Find the filenames that have no identically named "oid" in pg_class, and delete (move to backup) the biggest ones while the database is stopped. Fortunately the database started up fine after that! So what is the moral of the story? How to guard against this? Why did the database write so much to the default tablespace, even when both the table and the temp tablespace are elsewhere? Also should one always keep the default tablespace away from the wal partition? (I guess it would have helped to avoid the crash, but the ALTER TABLE command would have still run out of space, so I'm not sure if the orphan files would have been avoided)? Needless to say, I would have hoped the database cleaned-up after itself even after an uncontrolled crash, or that it provided some tools for the job. (I tried VACUUM FULL on the table, but the orphaned files did not go away). My postgresql version is 16.2 installed on Ubuntu. Thank you, Dimitris
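For the record, the comparison I did can be sketched as below. This is an illustration, not a safe tool: as pointed out later in the thread, the right column to compare against is pg_class.relfilenode (not oid), some catalog relations' files are not listed in the catalog at all, and relation files come with 1GB segment suffixes and extra forks. So treat any hits only as *candidates* for manual inspection, never for automatic deletion. The function name and directory layout are hypothetical:

```python
import os
import re
import tempfile

def orphan_candidates(datadir, known_relfilenodes):
    """Report base-directory files whose numeric prefix matches no known
    relfilenode. Recognizes 1GB segment suffixes (e.g. '16384.1') and
    fork suffixes ('_fsm', '_vm', '_init'). Candidates only -- catalog
    files may legitimately be absent from the known set."""
    pat = re.compile(r"^(\d+)(?:\.\d+|_fsm|_vm|_init)?$")
    orphans = []
    for name in os.listdir(datadir):
        m = pat.match(name)
        if m and int(m.group(1)) not in known_relfilenodes:
            orphans.append(name)
    return sorted(orphans)

# Demo on a throwaway directory with fake relation files.
with tempfile.TemporaryDirectory() as d:
    for f in ["16384", "16384.1", "16385_fsm", "99999", "99999.1"]:
        open(os.path.join(d, f), "w").close()
    print(orphan_candidates(d, {16384, 16385}))  # → ['99999', '99999.1']
```

The `known_relfilenodes` set would come from something like `SELECT relfilenode FROM pg_class WHERE relfilenode <> 0`, queried while the server is still up.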
Re: Is "DISTINCT" making a difference in: SELECT [DISTINCT] ... EXCEPT
On Wed, 15 Nov 2023, Erik Wienhold wrote: On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote: SELECT [DISTINCT] ... EXCEPT ... In this query I get the same results regardless of including DISTINCT or not. But I get different query plans, I get an extra HashAggregate node in the case of SELECT DISTINCT. Any idea why? As Tom Lane recently wrote[1] EXCEPT is not optimized and will operate on the subqueries which are planned independently. [1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us Heh, as he wrote to me even. :-) I just wanted to make sure that this is indeed a missing optimisation of the planner, and that the queries are effectively the same. Thank you for clarifying. As mentioned, the docs don't make it clear if the SELECT DISTINCT part is implied or not, only the EXCEPT DISTINCT part is clearly on by default. Dimitris
Is "DISTINCT" making a difference in: SELECT [DISTINCT] ... EXCEPT
Hello list,

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or not. But I get different query plans: there is an extra HashAggregate node in the case of SELECT DISTINCT. Any idea why? Reading the docs, I understand that postgres does EXCEPT DISTINCT by default, but I don't see the same stated for the 1st clause, SELECT [DISTINCT].

Steps to reproduce:

CREATE TABLE t1(i INTEGER NOT NULL);
CREATE TABLE t2(i INTEGER NOT NULL);
INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
INSERT INTO t2 VALUES (3);

SELECT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

EXPLAIN SELECT * FROM t1 EXCEPT SELECT * FROM t2;
                                QUERY PLAN
---------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..160.25 rows=200 width=8)
   ->  Append  (cost=0.00..147.50 rows=5100 width=8)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

EXPLAIN SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
                                QUERY PLAN
---------------------------------------------------------------------------
 HashSetOp Except  (cost=41.88..127.50 rows=200 width=8)
   ->  Append  (cost=41.88..120.62 rows=2750 width=8)
         ->  Subquery Scan on "*SELECT* 1"  (cost=41.88..45.88 rows=200 width=8)
               ->  HashAggregate  (cost=41.88..43.88 rows=200 width=4)
                     Group Key: t1.i
                     ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(8 rows)

Regards,
Dimitris
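The equal results are no accident: EXCEPT (without ALL) is defined by the SQL standard as a set operation returning distinct rows, so the inner DISTINCT can change the plan but never the result. A quick sanity check on the same data, run here through Python's built-in sqlite3 since the semantics are standard SQL rather than Postgres-specific:

```python
import sqlite3

# Same toy tables as in the reproduction steps above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1(i INTEGER NOT NULL);
    CREATE TABLE t2(i INTEGER NOT NULL);
    INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
    INSERT INTO t2 VALUES (3);
""")

plain = conn.execute("SELECT * FROM t1 EXCEPT SELECT * FROM t2").fetchall()
dedup = conn.execute("SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2").fetchall()

# Both return each surviving value exactly once: EXCEPT deduplicates anyway.
print(sorted(plain), sorted(dedup))  # → [(1,), (2,)] [(1,), (2,)]
```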
Re: Inefficient query plan for SELECT ... EXCEPT ...
Thank you all for the answers, they covered me well.

Is this worth a bug report? I can file one if the issue is not known.

No. It's just a missing optimisation. We know about it.

It's good I shot an email first then. FWIW my usual way in other projects would be to check the bug tracker, and just "follow" the relevant issue if it's minor like a missing optimisation. I didn't find a way to search for "known issues" in the PostgreSQL project.

Dimitris
Inefficient query plan for SELECT ... EXCEPT ...
Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query, where the left side is a very short table (even zero-length sometimes, though rarely it can be as long as 200K rows), and the right side is a table with 10M UNIQUE NOT NULL rows:

\d test_datatags
                    Table "public.test_datatags"
     Column     |  Type   | Collation | Nullable |             Default
----------------+---------+-----------+----------+----------------------------------
 test_datatag_n | integer |           | not null | generated by default as identity
 test_datatag   | text    |           | not null |
Indexes:
    "test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
    "test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)

What follows is a simplified and pathological case, which takes 1.7s even though the left side is empty:

BEGIN;
CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
ANALYZE VERBOSE tmp_table;
INFO:  analyzing "pg_temp_9.tmp_table"
INFO:  "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows

EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)
    SELECT DISTINCT d1 FROM tmp_table WHERE d1 IS NOT NULL
    EXCEPT
    SELECT test_datatag FROM test_datatags;

 HashSetOp Except  (cost=0.00..299854.08 rows=1 width=36) (actual time=1726.470..1726.472 rows=0 loops=1)
   Output: "*SELECT* 1".d1, (0)
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Append  (cost=0.00..278054.53 rows=8719821 width=36) (actual time=3.754..1287.901 rows=8702840 loops=1)
         Buffers: shared hit=1259 read=58800
         I/O Timings: shared/local read=77.713
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=0 loops=1)
               Output: "*SELECT* 1".d1, 0
               ->  HashAggregate  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=1)
                     Output: tmp_table.d1
                     Group Key: tmp_table.d1
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on pg_temp.tmp_table  (cost=0.00..0.00 rows=1 width=32) (actual time=0.001..0.001 rows=0 loops=1)
                           Output: tmp_table.d1
                           Filter: (tmp_table.d1 IS NOT NULL)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..234455.40 rows=8719820 width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
               Output: "*SELECT* 2".test_datatag, 1
               Buffers: shared hit=1259 read=58800
               I/O Timings: shared/local read=77.713
               ->  Seq Scan on public.test_datatags  (cost=0.00..147257.20 rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
                     Output: test_datatags.test_datatag
                     Buffers: shared hit=1259 read=58800
                     I/O Timings: shared/local read=77.713
 Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1', maintenance_io_concurrency = '0', max_parallel_workers = '12', max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.055 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms, Emission 3.542 ms, Total 4.037 ms
 Execution Time: 1726.835 ms
(33 rows)

I'm wondering why the planner doesn't see that the left table is very small and follow a different path. From an abstract computer science POV, I would:

1. Sort the left table (the right one is already indexed).
2. "Merge" the two tables, by walking them in order in parallel and excluding the matches.
3. Stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known. Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN, which performs much better and I believe is equivalent. I find it less readable than the query in question though. Plus, I have a bunch of SELECT-EXCEPT queries (with smaller right-side tables) in my application, and I would hate to change them all to the uglier equivalent. Under what conditions would the above query plan perform well?

Thanks in advance,
Dimitris
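For reference, the LEFT OUTER JOIN rewrite mentioned above can be sketched as follows. It is equivalent to EXCEPT here because the right-hand column is UNIQUE NOT NULL; with nullable or duplicated right-hand values, NOT EXISTS is the safer anti-join form. A toy version of the two tables, run through Python's sqlite3 (the exact production query isn't shown in the thread, so this is an illustration):

```python
import sqlite3

# Miniature stand-ins for the two tables from the thread.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test_datatags(test_datatag TEXT NOT NULL UNIQUE);
    CREATE TABLE tmp_table(d1 TEXT);
    INSERT INTO test_datatags VALUES ('a'),('b'),('c');
    INSERT INTO tmp_table VALUES ('b'),('x'),(NULL),('b');
""")

except_q = """
    SELECT DISTINCT d1 FROM tmp_table WHERE d1 IS NOT NULL
    EXCEPT
    SELECT test_datatag FROM test_datatags
"""
# Anti-join rewrite: keep left rows that found no partner on the right.
join_q = """
    SELECT DISTINCT t.d1
    FROM tmp_table t
    LEFT OUTER JOIN test_datatags r ON r.test_datatag = t.d1
    WHERE t.d1 IS NOT NULL AND r.test_datatag IS NULL
"""

print(conn.execute(except_q).fetchall())  # → [('x',)]
print(conn.execute(join_q).fetchall())    # → [('x',)]
```

The anti-join form lets the planner probe the unique index on the right side once per left row, which is why it stays fast when the left side is tiny.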
Re: BRIN index maintenance on table without primary key
On Fri, 27 Oct 2023, Dimitrios Apostolou wrote: So the question is: how to maintain the physical order of the tuples? Answering to myself, there doesn't seem to be any way to run pg_repack on a table without a UNIQUE key. To run CLUSTER, the only way I see is to: 1. Create a btree index on the same column that has the BRIN index 2. CLUSTER 3. Drop the index This should take very long on my huge table, and keeps the table exclusively locked. The disk space needed would also be huge, if the table isn't partitioned. I wonder why CLUSTER can't run based on a BRIN-indexed column. Is it theoretically impossible, or is it just not implemented yet? My understanding so far is that CLUSTER only rewrites the table *according to the index order* and does not touch the index itself. For a BRIN index though it would need to rewrite the table *ignoring the index* and then rewrite the index too, in order to keep the ranges fully optimized. So the logic is very different, and maybe closer to what VACUUM FULL does. Thanks, Dimitris
BRIN index maintenance on table without primary key
Hello list. Key characteristics of my case: + HUGE table with 20G rows, ca 2TB + May be partitioned (have both versions on two test clusters ATM) + Plenty of inserts, no updates + No primary key - we had one IDENTITY bigint column until recently, but it proved useless and inefficient (index too big for mem) so we deleted it. + All queries are filtering on a not-unique not-null integer column. + On this column we have a BRIN INDEX since insert order is /mostly/ incremental. So the question is: how to maintain the physical order of the tuples? Even though the insertions populate the index column /mostly/ incrementally, there are outliers and it's certain that over time the BRIN index will deteriorate. I'm monitoring the "correlation" and I want to run a command to "defrag" it when it drops below 0.9. + Can't run CLUSTER: ERROR: cannot cluster on index "tst_brin_idx" because access method does not support clustering + Can't run pg_repack, from [1]: > Target table must have a PRIMARY KEY, or at least a UNIQUE total index on a NOT NULL column [1] https://reorg.github.io/pg_repack/ Any ideas? Thank you in advance, Dimitris
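The "correlation" being monitored is presumably pg_stats.correlation for the indexed column (refreshed by ANALYZE). As a self-contained illustration of what that number measures, here is the plain Pearson correlation between each value and its physical row position, in Python: 1.0 means perfectly ordered rows (tight BRIN ranges), values near 0 mean the ranges are badly interleaved. The function is an assumption about the statistic's definition, not Postgres code:

```python
from math import sqrt

def correlation(values):
    """Pearson correlation between a column's values and their physical
    row order -- the quantity pg_stats.correlation estimates."""
    n = len(values)
    mx = (n - 1) / 2                 # mean of positions 0..n-1
    my = sum(values) / n
    cov = sum((x - mx) * (y - my) for x, y in enumerate(values))
    vx = sum((x - mx) ** 2 for x in range(n))
    vy = sum((y - my) ** 2 for y in values)
    return cov / sqrt(vx * vy)

print(correlation(list(range(1000))))         # perfectly ordered: 1.0
print(correlation([5, 1, 9, 2, 8, 3, 7, 4]))  # interleaved: near 0
```

A monitoring job could poll `SELECT correlation FROM pg_stats WHERE tablename = '...' AND attname = '...'` and alert when it drops below the 0.9 threshold mentioned above.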
Re: Partitionwise JOIN scanning all partitions, even unneeded ones
Thank you for the clear explanation, and I hope the missing optimisation gets implemented sooner rather than later. Maybe the query planner should consider the missing optimisation and ban *hash* partitionwise joins. Indeed I verified that disabling hash join fixed the situation, with both queries taking almost the same time.

Great! In the meantime, I'm considering disabling hash joins globally, as I've had issues with them before (excessive I/O causing much slower execution than a merge join; see [1] for your answer to my question then :). Do you think that would save me from other problems I'm not aware of, given the huge size of the table I'm querying?

[1] https://www.postgresql.org/message-id/caaphdvppvydonkeqlybsbjwq8kq8m7ywdka44rtea2mnao3...@mail.gmail.com

Regards,
Dimitris
Re: Moving data from huge table slow, min() query on indexed column taking 38s
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + Then I broke the command above in many small chunks WITH rows AS ( DELETE FROM tbl_legacy AS t WHERE (partition_key_column < $1) RETURNING t.* ) INSERT INTO tbl SELECT * FROM rows; COMMIT; I increase the parameter $1 and keep going in a loop. At first this goes OK, after one day though I notice that it has slowed down significantly. My investigation shows that the simple query SELECT min(partition_key_column) from tbl_legacy; takes 38s, despite having an index on the column! A VACUUM fixes that, so I guess the index has a ton of dead tuples. I guess autovacuum does not do its job because the table is constantly busy. Unfortunately VACUUM takes long (several hours) on this huge table, so I can't add in the loop after the DELETE command. Is there a better way to avoid the bloat in the loop above? Why can't the DELETE command update the index by pointing the minimum element beyond the dead tuples? Any comments on this one? It annoys me that a simple loop deteriorated so much and kept filling the table with bloat. What is that VACUUM does that DELETE can't do, to keep the index fresh?
Re: Moving data from huge table slow, min() query on indexed column taking 38s
Thanks! I have completed the transfer by taking down the database and working exclusively on it, but I still wonder how one could do it in production without exclusive locks. The loop with small DELETE...INSERT transactions I posted on the parent post bloats the table fast. The link you posted contains very useful info, I was not expecting that the constraints could blow up the memory like that. Comments from me: Disabling and then re-enabling the foreign key constraints is easily done with ALTER TABLE. Unfortunately it doesn't seem to be the same case for indices. One has to create the table without indices, and then create the indices separately. With such a process there is a risk of ending up with non-identical table schemas. By the way, with COPY one must use an intermediate file, right? There is no way to COPY from table to table directly? Thanks, Dimitris On Thu, 6 Jul 2023, Lorusso Domenico wrote: Hello, this is a standard problem during bulk copy. here some suggestions; for example disable indexes. The main issue is related to index, lock escalation and log writing. In other dbms you should set log off on the table, but postgresql does not seem to have this feature. Anyway, using an explicit lock table exclusive should prevent lock escalation. So: disable indexes in target table lock exclusive both table insert data truncate old table If this doesn't work you can consider using the copy command. Il giorno gio 6 lug 2023 alle ore 18:12 Dimitrios Apostolou ha scritto: On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: > + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like WITH rows AS ( DELETE FROM tbl_legacy RETURNING * ) INSERT INTO tbl SELECT * FROM rows; > I noticed that the postgres process was growing without bounds, up to > the point that it consumed almost all of swap space and I had to kill > the command. Any idea why that happens? 
Also note that my settings for work_mem, temp_buffers, shared_buffers etc are all well below the RAM size and postgres has never shown unbound growth again. Postgres version is 15.2. Dimitris -- Domenico L. per stupire mezz'ora basta un libro di storia, io cercai di imparare la Treccani a memoria... [F.d.A.]
Re: Moving data from huge table slow, min() query on indexed column taking 38s
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote: + First I chose the method to DELETE...INSERT everything. Just to clarify, the query looked more like WITH rows AS ( DELETE FROM tbl_legacy RETURNING * ) INSERT INTO tbl SELECT * FROM rows; I noticed that the postgres process was growing without bounds, up to the point that it consumed almost all of swap space and I had to kill the command. Any idea why that happens? Also note that my settings for work_mem, temp_buffers, shared_buffers etc are all well below the RAM size and postgres has never shown unbound growth again. Postgres version is 15.2. Dimitris
Moving data from huge table slow, min() query on indexed column taking 38s
I have a huge (10G rows) table, test_runs (referred to as "tbl_legacy" below), that I want to PARTITION BY RANGE on an indexed integer column. I have created the new partitioned table "tbl" and 1000 partitions in it, so that it can take my existing data and have the partitions ready for the next year (so most of the new partitions will be empty for now).

On the topic of bulk-transferring the data from tbl_legacy to tbl:

+ First I chose the method to DELETE...INSERT everything. I noticed that the postgres process was growing without bounds, up to the point that it consumed almost all of swap space and I had to kill the command. Any idea why that happens?

+ Then I broke the command above into many small chunks:

  WITH rows AS (
      DELETE FROM tbl_legacy AS t
          WHERE (partition_key_column < $1)
          RETURNING t.*
  )  INSERT INTO tbl SELECT * FROM rows;
  COMMIT;

  I increase the parameter $1 and keep going in a loop. At first this goes OK, but after one day I notice that it has slowed down significantly. My investigation shows that the simple query

  SELECT min(partition_key_column) FROM tbl_legacy;

  takes 38s, despite there being an index on the column! A VACUUM fixes that, so I guess the index has a ton of dead tuples. I guess autovacuum does not do its job because the table is constantly busy. Unfortunately VACUUM takes long (several hours) on this huge table, so I can't add it to the loop after the DELETE command. Is there a better way to avoid the bloat in the loop above? Why can't the DELETE command update the index by pointing the minimum element beyond the dead tuples?

+ Finally I resorted to just copying the data:

  INSERT INTO tbl SELECT * FROM tbl_legacy;

  This took half a day or so but worked without blowing up the memory usage like the 1st query. Why is that?

By the way, I don't consider the last solution ideal, since the data is duplicated between the two tables and it's harder to implement workarounds to do it without taking downtime. But it worked for me this time.
Any other ideas for partitioning an existing huge table? Thanks in advance, Dimitris
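The chunked DELETE...INSERT loop from the second bullet, sketched as a client-side program. Python over SQLite serves as a stand-in here (SQLite lacks the `WITH ... DELETE ... RETURNING` CTE, so the sketch copies then deletes the same key range inside one transaction per chunk); table names and the chunk size are illustrative. The thread's bloat problem is exactly what this loop leaves behind: each deleted chunk becomes dead tuples that only vacuum reclaims:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tbl_legacy(partition_key_column INTEGER NOT NULL, payload TEXT);
    CREATE INDEX tbl_legacy_key ON tbl_legacy(partition_key_column);
    CREATE TABLE tbl(partition_key_column INTEGER NOT NULL, payload TEXT);
""")
conn.executemany("INSERT INTO tbl_legacy VALUES (?, ?)",
                 [(k, "row%d" % k) for k in range(1000)])
conn.commit()

chunk, upper = 100, 0
while conn.execute("SELECT count(*) FROM tbl_legacy").fetchone()[0]:
    upper += chunk
    # One transaction per chunk: copy the key range, then delete it.
    # (In Postgres the DELETE...RETURNING CTE does both steps atomically.)
    conn.execute("INSERT INTO tbl SELECT * FROM tbl_legacy"
                 " WHERE partition_key_column < ?", (upper,))
    conn.execute("DELETE FROM tbl_legacy WHERE partition_key_column < ?",
                 (upper,))
    conn.commit()

print(conn.execute("SELECT count(*) FROM tbl").fetchone()[0])  # → 1000
```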
VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)
Thanks everyone for the help. I have now run simple VACUUM but it didn't change anything, the simple SELECT is still slow. My understanding by reading the docs is that it should reclaim all unused space, just not return it to the OS. Which is fine by me. Any idea why it failed to reclaim the space in my case? I'm now running VACUUM FULL as everyone suggested. I just tried plain VACUUM as I was curious if it would work and because it doesn't lock the table with an operation that takes hours for my setup. Thanks, Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
On Tue, 31 Jan 2023, David G. Johnston wrote:

It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead.

Yes, I too wish postgres was more clever for a simple unordered query. I should also mention that this table has two btree indices, one primary-key index and one on another column. Would it be feasible to "descend" an index in order to immediately find some live data?

A limit without an order is not a query to optimize for, of course... A good reason is that the web interface I'm using to browse my tables (and I guess many others too) does a simple limited+unordered SELECT when you click on a table. Timing out was what made me look deeper. I couldn't accept that, no matter how slow my system is, it has trouble returning a few rows. It is also the first query somebody will type in psql to interactively get a clue about the contents of a table.

Regards,
Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
Thanks for the insight on the internals. Regarding your questions: On Tue, 31 Jan 2023, Tom Lane wrote: Do you get the same 10 rows when you repeat the command? Yes. Just tested with both cold and hot caches. The first 10 rows are exactly the same, either they return slowly or immediately. If turning synchronize_seqscans off changes the behavior, that'd be a good clue that this is the right theory. Turning it off makes the query slow no matter how many times I re-run it. The system is doing lots of read I/O with both hot and cold caches. Here is the EXPLAIN output from the hot cache run (that previously had only 14 hits and no reads): Limit (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared hit=64 read=2334462 I/O Timings: shared/local read=567846.559 -> Seq Scan on public.test_runs_raw (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=620510.800..620510.804 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared hit=64 read=2334462 I/O Timings: shared/local read=567846.559 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB' Planning Time: 0.099 ms Execution Time: 620510.855 ms After reading the docs, I'm surprised this setting affects my case given that I have no parallelism in my setup. As for a real fix, it might be time for a VACUUM FULL or CLUSTER on that table. Regarding CLUSTER, would it help with a seqscan on a bloated table? Furthermore, given that the table is expected to grow every day by a few million rows, do you suggest running CLUSTER every night? Will postgres remember that the rows up to N are clustered, even after appending more rows? Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
On Tue, 31 Jan 2023, David G. Johnston wrote:

On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou wrote:

   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared read=2334526
         I/O Timings: shared/local read=691137.029

The system has to return 10 live rows to you. If it needs to search through that many buffers to find 10 live rows you most likely have a large bloating problem going on. Seems like it is time to vacuum full.

I looked up how to measure bloat, so I ran the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;
 current_database | schemaname |   tablename   | tbloat | wastedbytes |            iname             | ibloat | wastedibytes
------------------+------------+---------------+--------+-------------+------------------------------+--------+--------------
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey           |    1.0 |            0
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n |    0.3 |            0
(2 rows)

Is this bloat even affecting queries that do not use the index? It seems I have to add VACUUM FULL to nightly maintenance. I had run some schema restructuring (several ADD COLUMN followed by UPDATE ... SET on all rows) some days ago, and I was not aware this degraded the table.

Thanks for the useful info!
Dimitris
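To see why bloat hurts even a query that never touches an index: a sequential scan must keep reading pages until it has accumulated enough live tuples to satisfy the LIMIT, so a long all-dead prefix (such as one left behind by the mass UPDATEs mentioned above) translates directly into extra I/O. A toy model of that mechanism (page layout and counts are made up, not measured from the thread's table):

```python
def pages_read_for_limit(live_per_page, rows_needed):
    """live_per_page: live-tuple count of each page, in physical order.
    Return how many pages a sequential scan reads before it has seen
    rows_needed live tuples (i.e. before LIMIT is satisfied)."""
    seen = 0
    for pages_read, live in enumerate(live_per_page, start=1):
        seen += live
        if seen >= rows_needed:
            return pages_read
    return len(live_per_page)

# Healthy table: the very first page already holds 10 live rows.
print(pages_read_for_limit([100] * 1000, 10))             # → 1
# Bloated table: 900 leading pages contain only dead tuples.
print(pages_read_for_limit([0] * 900 + [100] * 100, 10))  # → 901
```

Scaled up to the 2.3M pages read in the quoted plan, this is the same effect; VACUUM (FULL) shrinks the dead prefix and restores the one-page case.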
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
I managed to double-paste different subset of my settings. FWIW all of them are true for all the experiments. Sorry for the confusion. On Tue, 31 Jan 2023, Dimitrios Apostolou wrote: Other postgres settings: shared_buffers = 1024MB effective_io_concurrency = 0 wal_compression = zstd wal_recycle = on min_wal_size = 1GB max_wal_size = 10GB checkpoint_timeout = 15min [...] Relevant settings: synchronous_commit = off wal_recycle = on min_wal_size = 1GB max_wal_size = 10GB track_io_timing = on track_wal_io_timing = on wal_compression = zstd
SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
Hello list, I run a very simple SELECT on a huge table (350M rows, 10 columns of type integer or timestamp). The server is an old Mac Mini with 4GB RAM and a slow HDD. All caches are cold either via reboot, or by restarting postgres and using the `purge` command to flush the OS cache. The question is why this simple query is taking so long to complete. Most of the time I see constant 30MB/s read I/O. The memory usage is low and the system is not swapping or in other ways unhealthy. Postgres is not serving any other requests. Is postgres reading all the table sequentially, just to return 10 rows in no particular order? I'm trying to understand what PostgreSQL is doing and the data access patterns involved. All feedback is appreciated. # SELECT version(); PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit # EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS) SELECT * FROM test_runs_raw LIMIT 10; Limit (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared read=2334526 I/O Timings: shared/local read=691137.029 -> Seq Scan on public.test_runs_raw (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1) Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n Buffers: shared read=2334526 I/O Timings: shared/local read=691137.029 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB' Planning: Buffers: shared hit=69 read=18 I/O Timings: shared/local read=265.550 Planning Time: 271.719 ms Execution Time: 746057.424 ms (14 rows) Repeating the query returns instantaneously, and EXPLAIN shows shared hit=14 and no reads. 
Changing the LIMIT to 100 takes another 3min to respond, with constant read I/O while it runs:

# EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) SELECT * FROM test_runs_raw LIMIT 100;
 Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 rows=100 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=14 read=772427
   I/O Timings: shared/local read=171484.072
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=0.029..184692.664 rows=100 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=14 read=772427
         I/O Timings: shared/local read=171484.072
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.083 ms
 Execution Time: 184692.748 ms
(11 rows)

Rerunning this is instantaneous again.

Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min

Bonus question: some time ago the server crashed and rebooted, and after a long WAL recovery and some hours of idle time, I ran the same query. It took even longer to respond, and besides reading there was also a lot of writing involved. What kind of writes was postgres doing?
# EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) SELECT * FROM test_runs_raw LIMIT 10;
 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=1284114.810..1284114.819 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
   I/O Timings: shared/local read=104.264 write=107258.150
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
         I/O Timings: shared/local read=104.264 write=107258.150
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.545 ms
 Execution Time: 1284114.972 ms
(11 rows)

Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd

Thanks in advance,
Dimitris
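One plausible explanation for the writes (an educated guess, not confirmed against this server): the first scan after crash recovery sets tuple hint bits, which dirties nearly every page it touches, and those dirty pages then get written back. The plan's own counters show the scale a "read-only" query can write at:

```python
# Scale of the writes reported by the post-crash plan above:
# "Buffers: ... dirtied=1809974 written=1762484", at 8 kB per page.
block_size = 8192            # default PostgreSQL page size (an assumption)
dirtied = 1_809_974          # pages dirtied, from the plan
written = 1_762_484          # pages written back, from the plan

dirtied_gib = dirtied * block_size / 1024**3
written_gib = written * block_size / 1024**3
print(f"dirtied: {dirtied_gib:.1f} GiB")   # ~13.8 GiB
print(f"written: {written_gib:.1f} GiB")   # ~13.4 GiB
```

That is almost every page of the scan being rewritten once, which matches the hint-bit hypothesis better than, say, checkpoint activity.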
Re: Why is a hash join preferred when it does not fit in work_mem
Hello again, I am back with new experiments.

First of all, I have a concrete set of steps that replicates the slowness of the hash join I described. If you have a system with spinning disks lying around, I would appreciate it if you could verify the scenario. Can you also replicate it on different kinds of systems?

CREATE TABLE descriptions (
    description_id serial PRIMARY KEY,
    description    text);
INSERT INTO descriptions (description_id, description)
    SELECT s, repeat(encode(sha512(s::text::bytea), 'hex'), 4)
    FROM generate_series(0,1200300) AS s;
CREATE TABLE descriptions_in_books (
    description_id integer REFERENCES descriptions(description_id),
    book_id        integer);
INSERT INTO descriptions_in_books (description_id, book_id)
    SELECT s % 1200300, s
    FROM generate_series(0,5200300) AS s;

SET work_mem TO '1MB';
SET hash_mem_multiplier = 1.0;
SET track_io_timing TO on;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;
SET enable_hashjoin TO off;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;

The first JOIN query uses a hash join with Batches: 1024 and takes 622s! For the longest part, I can see the disk writing most of the time at around 1-2 MB/s, so I have to assume it's not writing sequentially. The second, identical JOIN uses a merge join that completes in 14s. The I/O happens at a much higher rate (maybe 10x), so I'm assuming it's mostly sequential.

Another observation is that the hash join deteriorates as the length of the TEXT column grows. In fact, if I fill it with only 32-char-long strings, the hash join is split into only 128 batches, and it completes almost as fast as the merge join. Could it be that the cost estimation underestimates the I/O pattern related to splitting into batches?
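The batch count tracking the TEXT length is consistent with how batching is sized: the inner side must be split until each batch's share fits in work_mem, and the batch count is rounded up to a power of two. A rough sketch with the numbers from this reproduction (row widths taken from the EXPLAIN output; the 24-byte per-tuple overhead figure is an assumption, and this is an approximation rather than the planner's exact formula):

```python
import math

def min_batches(rows, row_width, work_mem_bytes):
    """Smallest power-of-two batch count such that one batch of the
    inner side fits in work_mem (approximation, not PostgreSQL's code)."""
    needed = max(1, math.ceil(rows * row_width / work_mem_bytes))
    return 2 ** math.ceil(math.log2(needed))

work_mem = 1 * 1024 * 1024   # 1MB, as in the reproduction

# width=520 is what EXPLAIN reports for the wide sha512-based column:
print(min_batches(1_200_301, 520, work_mem))      # 1024, matching "Batches: 1024"

# with 32-char strings the row is ~40 bytes of data; adding an assumed
# ~24 bytes of per-tuple overhead lands on the observed batch count:
print(min_batches(1_200_301, 40 + 24, work_mem))  # 128, matching the observation
```

So the wide rows force 8x as many batches, and with them 8x as many partition files being appended to in round-robin fashion, which fits the non-sequential 1-2 MB/s write pattern seen on the disk.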
Here are the measurements:

 Hash Join  (cost=192450.84..401456.02 rows=5200486 width=524) (actual time=344516.004..621725.562 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Inner Unique: true
   Hash Cond: (descriptions_in_books.description_id = descriptions.description_id)
   Buffers: shared hit=15586 read=93161, temp read=97829 written=97829
   I/O Timings: shared/local read=1402.597, temp read=229252.170 write=371508.313
   ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75015.86 rows=5200486 width=8) (actual time=0.068..1819.629 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=32 read=22979
         I/O Timings: shared/local read=249.910
   ->  Hash  (cost=97739.04..97739.04 rows=1200304 width=520) (actual time=343268.470..343268.471 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buckets: 2048  Batches: 1024  Memory Usage: 686kB
         Buffers: shared hit=15554 read=70182, temp written=78538
         I/O Timings: shared/local read=1152.687, temp write=338883.205
         ->  Seq Scan on public.descriptions  (cost=0.00..97739.04 rows=1200304 width=520) (actual time=0.028..2278.791 rows=1200301 loops=1)
               Output: descriptions.description_id, descriptions.description
               Buffers: shared hit=15554 read=70182
               I/O Timings: shared/local read=1152.687
 Settings: hash_mem_multiplier = '1', work_mem = '1MB'
 Planning Time: 0.303 ms
 Execution Time: 622495.279 ms
(22 rows)

SET enable_hashjoin TO off;

 Merge Join  (cost=868411.87..1079330.96 rows=5200301 width=524) (actual time=6091.932..13304.924 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Merge Cond: (descriptions.description_id = descriptions_in_books.description_id)
   Buffers: shared hit=67 read=111962 written=1, temp read=45806 written=46189
   I/O Timings: shared/local read=1007.043 write=28.575, temp read=344.937 write=794.483
   ->  Index Scan using descriptions_pkey on public.descriptions  (cost=0.43..116919.99 rows=1200304 width=520) (actual time=0.028..1596.387 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buffers: shared hit=3 read=89015 written=1
         I/O Timings: shared/local read=834.732 write=28.575
   ->  Materialize  (cost=868408.84..894410.35 rows=5200301 width=8) (actual time=6091.892..9171.796 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=64 read=22947, temp read=45806 written=46189
         I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
         ->  Sort  (cost=868408.84..881409.60 rows=5200301 width=8) (actual time=6091.885..7392.828 rows=5200301 loops=1)
               Output: descriptions_in_books.book_id, descriptions_in_books.description_id
               Sort Key:
Re: Why is a hash join preferred when it does not fit in work_mem
On Sat, 14 Jan 2023, Tom Lane wrote:

Dimitrios Apostolou writes: Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here is how I understand things according to posts I've read, and classical algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one hand you have a hash table lookup (amortized O(1)) and on the other hand, if the table has M rows that do not fit in memory, you have sequential reads from the disk (given low fragmentation of the table or index files): for every line you read from the disk, you look up the key in the hash table. If the hash table does not fit in RAM then the cost becomes prohibitive. Every lookup is a random access possibly hitting the disk. The total cost should be random_page_cost * M.

That would be true of a simple hash join, but Postgres uses batched hash joins: we split up the hash key space into subsets, where hopefully each subset includes few enough inner-side rows to fit into work_mem. While this can go wrong given pathological distribution of the inner-side keys, it does mean that the join can perform well even when the inner side is much larger than work_mem. So it's not the case that the planner will simply disregard hash joins beyond work_mem. It will apply a cost penalty for the predicted batching overhead;

Thanks for this, I found a page [1] that describes the hash join and now I understand a bit more.

[1] https://www.interdb.jp/pg/pgsql03.html

I'm not sure whether the key distribution is pathological in my case. The join condition is:

Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)

and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and PRIMARY KEY. The TABLE workitem_ids has 1.7M rows, and the other table has 3.7M rows. Neither of them fits in work_mem. In my (simplified and pathological) case of work_mem == 1MB, the hash join does 512 batches (Buckets: 4,096 Batches: 512 Memory Usage: 759kB).
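Tom's description of batched hash joins can be sketched as follows (a toy in-memory illustration of the Grace-hash-join idea he describes, not PostgreSQL's implementation): both inputs are partitioned by a hash of the join key, and each inner partition, now small enough to fit in memory, is joined against the matching outer partition.

```python
def batched_hash_join(inner, outer, key, n_batches):
    """Toy Grace hash join: partition both sides by hash(key) % n_batches,
    then hash-join matching partitions one batch at a time. PostgreSQL
    spills the partitions to temp files on disk; here they are just lists."""
    inner_parts = [[] for _ in range(n_batches)]
    outer_parts = [[] for _ in range(n_batches)]
    for row in inner:
        inner_parts[hash(row[key]) % n_batches].append(row)
    for row in outer:
        outer_parts[hash(row[key]) % n_batches].append(row)

    results = []
    for ip, op in zip(inner_parts, outer_parts):
        # build a hash table on one (hopefully work_mem-sized) inner batch
        table = {}
        for row in ip:
            table.setdefault(row[key], []).append(row)
        # probe with the matching outer batch; rows in other partitions
        # cannot match, which is what keeps each pass small
        for row in op:
            for match in table.get(row[key], []):
                results.append({**match, **row})
    return results

# hypothetical sample data, loosely shaped like the tables in this thread
inner = [{"id": i, "name": f"item{i}"} for i in range(5)]
outer = [{"id": i % 5, "val": i} for i in range(10)]
print(len(batched_hash_join(inner, outer, "id", 4)))  # 10 joined rows
```

Note that each outer row is only ever compared within its own partition, so the join stays correct no matter how the batches are split, as long as both sides use the same hash function.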
I'm not sure which hash join strategy is followed, but based on that document it should be the "hybrid hash join with skew". I don't quite follow the I/O requirements of this algorithm yet. :-)

but that can still come out cheaper than merge join, because the sorting needed for merge is generally also far from cheap.

I was under the impression that an on-disk merge sort is a relatively cheap (logN) operation with regard to random I/O. So I would expect an increased random_page_cost to benefit the Merge Join algorithm. And since my setup involves spinning disks, it does make sense to increase it.

What is probably really happening is that random_page_cost affects the estimated cost of performing the sort using an index scan instead of a bespoke sort step. AFAIR, cost_sort doesn't consider random_page_cost at all, and neither does cost_hashjoin.

On the last EXPLAIN I posted for the forced merge join, I see that it uses an index scan on the "small" table. That makes sense, since the join happens on the primary key of the table. On the large table it does not use an index scan, because an index doesn't exist for that column. It sorts the 3.7M rows of the table (and FWIW that table only has two integer columns), if I understood correctly what you meant by "performing the sort using an index scan".

The problem I see is that the estimated cost of the sort operation is 609,372.91..618,630.40, already way above the cost of the whole hash join (121,222.68..257,633.01). The actual timings, however, are very different: the Sort takes 4,602.569..5,414.072 ms, while the whole hash join takes 145,641.295..349,682.387 ms. Am I missing some configuration knobs that would give the planner a better picture?

Thanks, Dimitris
Re: Why is a hash join preferred when it does not fit in work_mem
On Fri, 13 Jan 2023, David Rowley wrote:

I'd expect reducing random_page_cost to make the Merge Join cheaper, as that's where the Index Scan is. I'm not quite sure where you think the random I/O is coming from in a batched hash join.

Thanks for the feedback, indeed you are right! Decreasing random_page_cost to values way below the default makes the planner prefer the merge join!

This seems strange to me. Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here is how I understand things according to posts I've read, and classical algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one hand you have a hash table lookup (amortized O(1)) and on the other hand, if the table has M rows that do not fit in memory, you have sequential reads from the disk (given low fragmentation of the table or index files): for every line you read from the disk, you look up the key in the hash table. If the hash table does not fit in RAM then the cost becomes prohibitive. Every lookup is a random access possibly hitting the disk. The total cost should be random_page_cost * M.

+ The Merge Join involves mostly sequential accesses if the disk files are not fragmented. It reads sequentially and in parallel from both tables, merging the results where the key matches. It requires on-disk sorting (because the tables don't fit in work_mem), but even this operation requires little disk seeking. A merge-sort algorithm might have a random access cost of logN * random_page_cost.

So I would expect an increased random_page_cost to benefit the Merge Join algorithm. And since my setup involves spinning disks, it does make sense to increase it.

It would be interesting to see the same plans with SET track_io_timing = on; set. It's possible that there's less *actual* I/O going on with the Merge Join plan vs the Hash Join plan.
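The "mostly sequential" claim for the sort feeding a merge join can be illustrated with a minimal external merge sort (a sketch of the classical algorithm, not PostgreSQL's tuplesort): sorted runs are written sequentially to temp files, and the final k-way merge then consumes each run front to back, i.e. as k sequential read streams.

```python
import heapq
import tempfile

def external_sort(values, run_size):
    """Sort with bounded memory: write sorted runs of at most run_size
    items to temp files (sequential writes), then k-way merge them
    (one sequential read stream per run)."""
    run_files = []
    for i in range(0, len(values), run_size):
        run = sorted(values[i:i + run_size])   # one "work_mem"-sized run
        f = tempfile.TemporaryFile(mode="w+")
        f.writelines(f"{v}\n" for v in run)    # sequential write of the run
        f.seek(0)
        run_files.append(f)
    # k-way merge: each run file is read strictly front to back
    return list(heapq.merge(*(map(int, f) for f in run_files)))

print(external_sort([5, 3, 8, 1, 9, 2, 7], run_size=3))  # [1, 2, 3, 5, 7, 8, 9]
```

Seeking only happens when the merge switches between run files, which is a small number of streams, hence the expectation that an on-disk sort penalizes spinning disks far less than per-lookup random access would.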
Since we do buffered I/O, without track_io_timing we don't know whether the buffer reads resulted in actual disk reads or reads from the kernel buffers.

The database has been VACUUM ANALYZEd first and is otherwise idle. Every query has been run twice, and I paste here only the 2nd run.

Slow Hash Join:

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=145641.295..349682.387 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=12121 read=50381, temp read=56309 written=56309
   I/O Timings: shared/local read=745.925, temp read=162199.307 write=172758.699
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.114..1401.896 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=65 read=16394
         I/O Timings: shared/local read=183.959
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=145344.555..145344.557 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=12056 read=33987, temp written=43092
         I/O Timings: shared/local read=561.966, temp write=142221.740
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.033..1493.652 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=12056 read=33987
               I/O Timings: shared/local read=561.966
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.693 ms
 Execution Time: 350290.496 ms
(24 rows)

Fast Merge Join:

# SET enable_hashjoin TO off;
SET
# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=4602.623..9700.435 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5310 read=66086, temp read=32621 written=32894
   I/O Timings: shared/local read=566.121, temp read=228.063 write=526.739
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.034..1080.800 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
         I/O Timings: shared/local read=448.952
   ->  Materialize
Why is a hash join preferred when it does not fit in work_mem
Hello list,

I have a very simple NATURAL JOIN that does not fit in work_mem. Why does the query planner prefer a hash join that needs 361s, when with a sort operation and a merge join it takes only 13s?

The server is an old Mac Mini with a hard disk drive and only 4GB RAM. Postgres version info: PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

The low work_mem and the disabled memoization are set on purpose, in order to simplify a complex query while reproducing the same problem I experienced there. This is the simplest query I could reduce it to where the optimizer still does not go for the faster merge join.

From my point of view the merge join is clearly faster, because the hash table does not fit in memory and I expect a hash join to do a lot of random I/O. But the query planner does not see that, and increasing random_page_cost does not help either. In fact the opposite happens: the cost gap between the merge join and the hash join grows as I increase the random page cost!
# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
                                 QUERY PLAN
------------------------------------------------------------------------------
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=184498.464..360606.257 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=15068 read=47434, temp read=56309 written=56309
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.040..1376.084 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared read=16459
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=184361.874..184361.875 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=15068 read=30975, temp written=43092
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=15068 read=30975
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=2 read=6
 Planning Time: 0.568 ms
 Execution Time: 361106.876 ms
(20 rows)

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
                                 QUERY PLAN
------------------------------------------------------------------------------
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=5062.513..10866.313 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5343 read=66053, temp read=32621 written=32894
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
   ->  Materialize  (cost=609372.91..627887.88 rows=3702994 width=8) (actual time=5062.389..7392.640 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=33 read=16426, temp read=32621 written=32894
         ->  Sort  (cost=609372.91..618630.40 rows=3702994 width=8) (actual time=5062.378..6068.703 rows=3702994 loops=1)
               Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
               Sort Key: tasks_mm_workitems.workitem_n
               Sort Method: external merge  Disk: 65256kB
               Buffers: shared hit=33 read=16426, temp read=32621 written=32894
               ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.045..1177.202 rows=3702994 loops=1)
                     Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
                     Buffers: shared hit=33 read=16426
 Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.677 ms
 Execution Time: 13364.545 ms
(23 rows)

Thank you in advance,