Re: VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread Tom Lane
Christophe Pettus  writes:
>> On Feb 1, 2023, at 10:45, David G. Johnston  
>> wrote:
>> The system just isn't that intelligent for "sequential scan", instead it 
>> does literally what the label says, goes through the table one page at a 
>> time and returns any live rows it finds. 

> Although this does raise a question: Couldn't it skip pages that are marked 
> appropriately in the visibility map?

There's no bit in the visibility map for "contains no live tuples".
I think adding one would be optimizing for the wrong thing.

(You might be thinking of the free-space map, but we don't maintain
that rigorously enough to trust it as a guide to skipping pages.)
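
The bits the visibility map does track can be inspected with the
pg_visibility contrib extension, for anyone curious; a minimal sketch,
assuming the extension is available and using the table name from this
thread:

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Per-page all_visible / all_frozen flags; note there is no
-- "contains no live tuples" flag among them.
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('test_runs_raw')
LIMIT 10;

-- Summary counts for the whole relation.
SELECT * FROM pg_visibility_map_summary('test_runs_raw');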

regards, tom lane




Re: VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread Christophe Pettus



> On Feb 1, 2023, at 10:45, David G. Johnston  
> wrote:
> The system just isn't that intelligent for "sequential scan", instead it does 
> literally what the label says, goes through the table one page at a time and 
> returns any live rows it finds. 

Although this does raise a question: Couldn't it skip pages that are marked 
appropriately in the visibility map?



Re: VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread David G. Johnston
On Wed, Feb 1, 2023 at 11:27 AM Dimitrios Apostolou  wrote:

> I have now run simple VACUUM but it didn't change anything, the simple
> SELECT is still slow.
>
> My understanding by reading the docs is that it should reclaim all unused
> space, just not return it to the OS. Which is fine by me. Any idea why it
> failed to reclaim the space in my case?
>

The system just isn't that intelligent about a "sequential scan"; it does
literally what the label says: it goes through the table one page at a
time and returns any live rows it finds.  You already had lots of
completely dead pages at the start of the table, and after vacuum those
pages still exist (pages are sequentially numbered and gapless), just with
a different kind of dead contents.

In short, say you have 20 pages, 1-10 dead and 11-20 alive.  The only way
to change which pages exist and get rid of 1-10 is to rewrite the table,
putting the contents of 11-20 into the newly recreated pages 1-10.  VACUUM
doesn't do that - it just makes it so that newly written rows can be
placed into the abandoned 1-10 range.  If for some reason page 21 existed
and VACUUM saw it was all dead, it would remove page 21 from the end of the
relation, since that wouldn't affect "sequentially numbered and gapless".
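
To see that distinction in numbers, the pgstattuple contrib module reports
how much of the heap is live, dead, or reusable free space after a plain
VACUUM; a rough sketch, assuming the extension is available (note it scans
the whole table):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- dead_tuple_percent and free_percent show space that plain VACUUM marked
-- reusable but did not return to the operating system.
SELECT table_len, tuple_percent, dead_tuple_count,
       dead_tuple_percent, free_space, free_percent
FROM pgstattuple('test_runs_raw');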

David J.


VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread Dimitrios Apostolou

Thanks everyone for the help.

I have now run simple VACUUM but it didn't change anything, the simple
SELECT is still slow.

My understanding by reading the docs is that it should reclaim all unused
space, just not return it to the OS. Which is fine by me. Any idea why it
failed to reclaim the space in my case?

I'm now running VACUUM FULL as everyone suggested. I had tried plain
VACUUM first because I was curious whether it would work, and because it
doesn't lock the table for an operation that takes hours on my setup.
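
For reference, a VACUUM FULL invocation along these lines (VERBOSE and
ANALYZE are optional; the ACCESS EXCLUSIVE lock is held for the entire
rewrite):

-- Rewrites the table into new files and returns the freed space to the OS;
-- the table is locked against both reads and writes while this runs.
VACUUM (FULL, VERBOSE, ANALYZE) test_runs_raw;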


Thanks,
Dimitris





Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-02-01 Thread David G. Johnston
On Wed, Feb 1, 2023 at 11:15 AM Dimitrios Apostolou  wrote:

> On Tue, 31 Jan 2023, David G. Johnston wrote:
> >
> > It feels like there is room for improvement here using table statistics
> > and the visibility map to significantly reduce the number of pages
> > retrieved that turn out to be all dead.
>
> Yes, I too wish postgres was more clever for a simple unordered query. I
> should also mention that this table has two btree indices, one primary key
> index and one on another column. Would it be feasible to "descend" an
> index in order to immediately find some live data?
>

That isn't possible, since indexes don't carry tuple visibility information
within them; most of the dead tuples bloating the table also have dead
entries pointing to them in the index.
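
One way to see this is to force an index-ordered plan and watch the heap
accesses in the BUFFERS output; a sketch, assuming run_n is the primary-key
column mentioned elsewhere in the thread:

-- Index entries for dead tuples still lead to heap pages that must be read
-- and rejected; an index-only scan can skip the heap only for pages marked
-- all-visible, which a freshly bloated table mostly lacks.
EXPLAIN (ANALYZE, BUFFERS)
SELECT run_n FROM test_runs_raw ORDER BY run_n LIMIT 10;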

David J.


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-02-01 Thread Dimitrios Apostolou

On Tue, 31 Jan 2023, David G. Johnston wrote:


> It feels like there is room for improvement here using table statistics and the
> visibility map to significantly reduce the number of pages retrieved that turn
> out to be all dead.


Yes, I too wish postgres was more clever for a simple unordered query. I
should also mention that this table has two btree indices, one primary key
index and one on another column. Would it be feasible to "descend" an
index in order to immediately find some live data?


> A limit without an order is not a query to optimize for of course...


A good reason is that the web interface I'm using to browse my tables (and
I guess many others too) does a simple limited+unordered SELECT when you
click on a table. Timing out was what made me look deeper. I couldn't
accept that, no matter how slow my system is, it has trouble returning a
few rows.

It is also the first query somebody will type in psql to interactively
get a clue about the contents of a table.


Regards,
Dimitris


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Tom Lane
Dimitrios Apostolou  writes:
> On Tue, 31 Jan 2023, Tom Lane wrote:
>> Do you get the same 10 rows when you repeat the command?

> Yes. Just tested with both cold and hot caches. The first 10 rows are
> exactly the same, either they return slowly or immediately.

Hm.  I don't recall exactly how synchronize_seqscans works --- it
definitely changes the point in the table where a seqscan begins
scanning, but I don't remember where/when that point gets updated
(and I'm too lazy to go look).  This result suggests that the first
query advances the table's sync start point to someplace a bit before
the first live tuple, not after all the returned tuples as I'd been
thinking.  It might be that the start point is always the beginning
of a block, so that if you fetch only a few tuples that are all in
the same block then the next attempt will re-fetch them.  If you
increase the LIMIT to say 1000, does the behavior (with
synchronize_seqscans on) change?
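
A sketch of that experiment, using the table name from earlier in the
thread:

SET synchronize_seqscans = on;   -- the default
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_runs_raw LIMIT 1000;
-- Repeat it: the question above is whether the second run stays fast
-- (start point still at the same block) or becomes slow again (start
-- point advanced past the returned rows).
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_runs_raw LIMIT 1000;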

>> If turning synchronize_seqscans off changes the behavior, that'd
>> be a good clue that this is the right theory.

> Turning it off makes the query slow no matter how many times I re-run it.

OK, well that's enough of a smoking gun to say that it is the sync
start point that's the relevant state, and not the contents of your
data caches at all.

Anyway, the follow-on discussion makes it clear that you do indeed
have a badly bloated table, and fixing that situation should fix
this performance issue.

regards, tom lane




Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

Thanks for the insight on the internals. Regarding your questions:

On Tue, 31 Jan 2023, Tom Lane wrote:


> Do you get the same 10 rows when you repeat the command?


Yes. Just tested with both cold and hot caches. The first 10 rows are
exactly the same, either they return slowly or immediately.


> If turning synchronize_seqscans off changes the behavior, that'd
> be a good clue that this is the right theory.


Turning it off makes the query slow no matter how many times I re-run it.
The system is doing lots of read I/O with both hot and cold caches. Here
is the EXPLAIN output from the hot cache run (that previously had only 14
hits and no reads):

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=64 read=2334462
   I/O Timings: shared/local read=567846.559
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=620510.800..620510.804 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=64 read=2334462
         I/O Timings: shared/local read=567846.559
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.099 ms
 Execution Time: 620510.855 ms


After reading the docs, I'm surprised this setting affects my case given
that I have no parallelism in my setup.



> As for a real fix,
> it might be time for a VACUUM FULL or CLUSTER on that table.


Regarding CLUSTER, would it help with a seqscan on a bloated table?
Furthermore, given that the table is expected to grow every day by a few
million rows, do you suggest running CLUSTER every night? Will postgres
remember that the rows up to N are clustered, even after appending more
rows?
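
For reference, the commands in question would be roughly the following
(index name as reported by the bloat query elsewhere in the thread):

-- One-off: rewrite the table in index order; ACCESS EXCLUSIVE lock for the
-- duration, like VACUUM FULL.
CLUSTER test_runs_raw USING test_runs_raw_pkey;

-- Later runs can reuse the index remembered from the previous CLUSTER:
CLUSTER test_runs_raw;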


Dimitris




Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread David G. Johnston
On Tue, Jan 31, 2023 at 8:28 AM Tom Lane  wrote:

> Dimitrios Apostolou  writes:
> > The question is why this simple query is taking so long to complete.
>
> Do you get the same 10 rows when you repeat the command?
>
> On the basis of the limited info you provided, I'm going to guess that
>
> (1) there are huge subranges of the table containing no live rows,
> so that a seqscan might have to pass over many blocks before it finds
> some to return;
>

It feels like there is room for improvement here using table statistics and
the visibility map to significantly reduce the number of pages retrieved
that turn out to be all dead.  A limit without an order is not a query to
optimize for of course...


> (2) once you do reach an area having live rows, the next SELECT picks
> up scanning in that same area due to the effects of
> "synchronize_seqscans", so you get immediate answers until you reach
> the next desert of dead tuples.
>

Interesting; I was assuming that the behavior on repeated calls was more
due to intelligent buffer pool usage.  When doing a sequential scan (I
suppose the key word being sequential...), the system decides which pages to
fetch before checking to see whether they are already in shared buffers (as
opposed to returning the ones in shared buffers first and then fetching more
if needed)?

David J.


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Christophe Pettus



> On Jan 31, 2023, at 07:40, Dimitrios Apostolou  wrote:
> Is this bloat even affecting queries that do not use the index?

No, but a bloated index often (although not always) goes along with a bloated 
table.

> It seems I have to add VACUUM FULL to nightly maintenance.

I wouldn't go that far; that's basically changing your oil every time you get 
gas.  However, monitoring bloat and either rebuilding the indexes (if they're 
all that's bloated) or using pg_repack [1] periodically is a good practice.

[1] https://github.com/reorg/pg_repack
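
If only the indexes are bloated, a sketch of the rebuild-in-place route
(the CONCURRENTLY form avoids blocking writes, at the cost of temporarily
holding a second copy of the index; index names as reported elsewhere in
the thread):

REINDEX INDEX CONCURRENTLY test_runs_raw_pkey;
REINDEX INDEX CONCURRENTLY test_runs_raw_idx_workitem_n;
-- or rebuild everything on the table at once:
REINDEX TABLE CONCURRENTLY test_runs_raw;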



Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

On Tue, 31 Jan 2023, David G. Johnston wrote:


> On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou  wrote:
>
> >       ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
> >             Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
> >             Buffers: shared read=2334526
> >             I/O Timings: shared/local read=691137.029
>
> The system has to return 10 live rows to you.  If it needs to search through
> that many buffers to find 10 live rows you most likely have a large bloating
> problem going on.  Seems like it is time to vacuum full.


I looked up how to measure bloat, so I ran the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;
 current_database | schemaname |   tablename   | tbloat | wastedbytes |            iname             | ibloat | wastedibytes
------------------+------------+---------------+--------+-------------+------------------------------+--------+--------------
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey           |    1.0 |            0
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n |    0.3 |            0
(2 rows)
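
A quick cross-check that doesn't rely on the wiki query's estimates is to
look at the physical sizes directly; a sketch:

-- For scale: the seq scan quoted above read ~2.33M pages, i.e. roughly
-- 19 GB of heap at 8 kB per page.
SELECT pg_size_pretty(pg_relation_size('test_runs_raw'))       AS heap,
       pg_size_pretty(pg_indexes_size('test_runs_raw'))        AS indexes,
       pg_size_pretty(pg_total_relation_size('test_runs_raw')) AS total;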

Is this bloat even affecting queries that do not use the index?

It seems I have to add VACUUM FULL to nightly maintenance. I had run some
schema restructuring (several ADD COLUMN followed by UPDATE SET on all
rows) some days ago, and I was not aware this degraded the table.
Thanks for the useful info!


Dimitris


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Tom Lane
Dimitrios Apostolou  writes:
> The question is why this simple query is taking so long to complete.

Do you get the same 10 rows when you repeat the command?

On the basis of the limited info you provided, I'm going to guess that

(1) there are huge subranges of the table containing no live rows,
so that a seqscan might have to pass over many blocks before it finds
some to return;

(2) once you do reach an area having live rows, the next SELECT picks
up scanning in that same area due to the effects of
"synchronize_seqscans", so you get immediate answers until you reach
the next desert of dead tuples.

If turning synchronize_seqscans off changes the behavior, that'd
be a good clue that this is the right theory.  As for a real fix,
it might be time for a VACUUM FULL or CLUSTER on that table.

regards, tom lane




Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread David G. Johnston
On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou  wrote:


>   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
>         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
>         Buffers: shared read=2334526
>         I/O Timings: shared/local read=691137.029
>

The system has to return 10 live rows to you.  If it needs to search
through that many buffers to find 10 live rows you most likely have a large
bloating problem going on.  Seems like it is time to vacuum full.
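
As a cheap first check before a full rewrite, the cumulative statistics
views give a rough dead-tuple count without scanning the table; a sketch:

SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'test_runs_raw';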

David J.


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

I managed to paste two different subsets of my settings. FWIW, all of
them were in effect for all of the experiments. Sorry for the confusion.


On Tue, 31 Jan 2023, Dimitrios Apostolou wrote:


> Other postgres settings:
>
> shared_buffers = 1024MB
> effective_io_concurrency = 0
> wal_compression = zstd
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> checkpoint_timeout = 15min
>
> [...]
>
> Relevant settings:
>
> synchronous_commit = off
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> track_io_timing = on
> track_wal_io_timing = on
> wal_compression = zstd





SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou



Hello list,

I'm running a very simple SELECT on a huge table (350M rows, 10 columns of
type integer or timestamp). The server is an old Mac Mini with 4GB RAM and
a slow HDD. All caches are cold, either via reboot, or by restarting
postgres and using the `purge` command to flush the OS cache.

The question is why this simple query is taking so long to complete.
Most of the time I see constant 30MB/s read I/O. The memory usage is low
and the system is not swapping or in other ways unhealthy.  Postgres is
not serving any other requests.

Is postgres reading the whole table sequentially, just to return 10 rows in
no particular order?  I'm trying to understand what PostgreSQL is doing
and the data access patterns involved.  All feedback is appreciated.


# SELECT version();

 PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 10;

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared read=2334526
   I/O Timings: shared/local read=691137.029
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared read=2334526
         I/O Timings: shared/local read=691137.029
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning:
   Buffers: shared hit=69 read=18
   I/O Timings: shared/local read=265.550
 Planning Time: 271.719 ms
 Execution Time: 746057.424 ms
(14 rows)


Repeating the query returns instantaneously, and EXPLAIN shows shared
hit=14 and no reads. Changing the LIMIT to 100 takes another 3min to
respond, with constant read I/O while it runs:


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 100;

 Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 rows=100 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=14 read=772427
   I/O Timings: shared/local read=171484.072
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=0.029..184692.664 rows=100 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=14 read=772427
         I/O Timings: shared/local read=171484.072
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.083 ms
 Execution Time: 184692.748 ms
(11 rows)


Rerunning this is instantaneous again.
Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min


Bonus question: some time ago the server crashed and rebooted, and after a
long WAL recovery and some hours of idle time, I ran the same query.  It
took even longer to respond, and besides reading, there was also a lot of
writing involved.  What kind of writes was postgres doing?


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 10;

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=1284114.810..1284114.819 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
   I/O Timings: shared/local read=104.264 write=107258.150
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
         I/O Timings: shared/local read=104.264 write=107258.150
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.545 ms
 Execution Time: 1284114.972 ms
(11 rows)


Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd


Thanks in advance,
Dimitris