Hi Nazir,

Thank you for your review. I comment below.


On 05/03/2024 12:07, Nazir Bilal Yavuz wrote:
> 2. The second one does implement smgrprefetch with range and loops by
> default per segment to still have a check for interrupts.
> It looks good codewise but RELSEG_SIZE is too big to prefetch. Man
> page of posix_fadvise [1] states that: "The amount of data read may be
> decreased by the kernel depending on virtual memory load. (A few
> megabytes will usually be fully satisfied, and more is rarely
> useful.)". It is trying to prefetch 1GB data now. That could explain
> your observation about differences between nr_cache numbers.

From a sysadmin point of view, I find it beneficial to issue a single syscall per file, respecting the logic and behavior of the underlying system call.

The behavior is 100% OK, and in fact it might be a bad idea to prefetch block by block, as the result is just to put more pressure on a system that is already under pressure.

Though there are use cases for it, and it's nice to be able to do that too, at this per-page level.

About [1]: it's a very old statement about resources, and Linux manages part of the problem for us here, I think [2]:

```
/*
 * Chunk the readahead into 2 megabyte units, so that we don't pin too much
 * memory at once.
 */
void force_page_cache_ra(....)
```

Q: posix_fadvise may not work exactly the way you think it does, or does it?


In details, and for the question:

However, if instead you provide a real range, or the magic len=0, to
posix_fadvise, then blocks are loaded more in accordance with effective
VM pressure (which is not the case in the previous example).
As a result, only a small part of the relation might be loaded, and this
is probably not what end users expect, despite probably being a good
choice (you can still free the cache beforehand to help the kernel).

I think it's a matter of documenting the feature well and, if at all possible, as usual, not letting users be negatively impacted by default.


An example: below I'm using vm_relation_cachestat(), which provides Linux
cachestat output, vm_relation_fadvise() to unload the cache, and
pg_prewarm for the demo.

# clear cache (nr_cache is the number of file-system pages in cache,
not PostgreSQL blocks):

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
            0 |       32768 |    65536 |        0
        32768 |       32768 |    65536 |        0
        65536 |       32768 |    65536 |        0
        98304 |       32768 |    65536 |        0
       131072 |        1672 |     3344 |        0
```

# load full relation with pg_prewarm (patched)

```
postgres=# select pg_prewarm('foo','prefetch');
pg_prewarm
------------
      132744
(1 row)
```

# Checking results:

```
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
            0 |       32768 |    65536 |      320
        32768 |       32768 |    65536 |        0
        65536 |       32768 |    65536 |        0
        98304 |       32768 |    65536 |        0
       131072 |        1672 |     3344 |      320  <-- segment 1

```

# Load block by block and check:

```
postgres=# select from generate_series(0, 132743) g(n), lateral
pg_prewarm('foo','prefetch', 'main', n, n);
postgres=# select block_start, block_count, nr_pages, nr_cache from
vm_relation_cachestat('foo',range:=1024*32);
block_start | block_count | nr_pages | nr_cache
-------------+-------------+----------+----------
            0 |       32768 |    65536 |    65536
        32768 |       32768 |    65536 |    65536
        65536 |       32768 |    65536 |    65536
        98304 |       32768 |    65536 |    65536
       131072 |        1672 |     3344 |     3344

```

The duration of the last example is also really significant: the full
relation takes 0.3 ms and block by block takes 1550 ms!
You might think it's because of generate_series or whatever, but I see
the exact same behavior with pgfincore.
I can compare loading and unloading durations for similar "async" work;
here each call starts at block 0 with a len of 132744 and a range of 1 block
(i.e. posix_fadvise on 8 kB at a time).
So they issue exactly the same number of DONTNEED or WILLNEED
operations, but with distinct durations on the first "load":

```

postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_DONTNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 25.202 ms
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 1523.636 ms (00:01.524) <----- not free !
postgres=# select * from
vm_relation_fadvise('foo','main',0,132744,1,'POSIX_FADV_WILLNEED');
vm_relation_fadvise
---------------------

(1 row)

Time: 24.967 ms
```
> I confirm that there is a time difference between calling pg_prewarm
> by full relation and block by block, but IMO this is expected. When
> pg_prewarm is called by full relation, it does the initialization part
> just once but when it is called block by block, it does initialization
> for each call, right?


Not sure what the initialization is here exactly. In my example with WILLNEED/DONTNEED, the code pattern and syscall requests are exactly the same and only the flag differs, so the initialization costs should be very similar. I'll try to move forward on those vm_relation functions in pgfincore so it'll be easier to run similar tests and compare.



> I run 'select pg_prewarm('foo','prefetch', 'main', n, n) FROM
> generate_series(0, 132744)n;' a couple of times consecutively but I
> could not see the time difference between first run (first load) and
> the consecutive runs. Am I doing something wrong?


Maybe the system is overloaded, so by the time you're done prefetching the tail blocks, the head ones have already been dropped; looping over that then leads to similar durations. If the data is already in cache and not evicted from it, execution time is stable. This point (in cache or not) is hard to guess until you actually check the status, or you make sure to clear the cache first.
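Such a residency check can be sketched like this (my own illustration, not the vm_relation_cachestat() implementation; it uses the classic mincore(2) interface, while recent Linux also offers a cachestat(2) syscall):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count how many file-system pages of `path` are currently resident
 * in the page cache; returns -1 on error. */
long count_cached_pages(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {
        close(fd);
        return 0;
    }

    long pagesize = sysconf(_SC_PAGESIZE);
    size_t npages = ((size_t) st.st_size + pagesize - 1) / pagesize;

    /* Map the file, then ask the kernel which pages are resident. */
    void *map = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    long cached = -1;
    if (vec != NULL && mincore(map, (size_t) st.st_size, vec) == 0) {
        cached = 0;
        for (size_t i = 0; i < npages; i++)
            cached += vec[i] & 1;   /* low bit = page is resident */
    }
    free(vec);
    munmap(map, (size_t) st.st_size);
    return cached;
}
```

Checking before and after a WILLNEED call (or after clearing with DONTNEED) is what makes the timing comparisons above meaningful.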

[1] https://man7.org/linux/man-pages/man2/posix_fadvise.2.html#DESCRIPTION

[2] https://elixir.bootlin.com/linux/latest/source/mm/readahead.c#L303

My apologies about the email address with a sub-address, which led to undelivered email. Please update to the current one.

---
Cédric Villemain +33 (0)6 20 30 22 52
https://Data-Bene.io
PostgreSQL Expertise, Support, Training, R&D


