Hi,

On 2021-02-23 14:58:32 -0500, Greg Stark wrote:
> So firstly this is all just awesome work and I have questions but I
> don't want them to come across in any way as criticism or as a demand
> for more work.

I posted it to get argued with ;).


> The callbacks make me curious about two questions:
>
> 1) Is there a chance that a backend issues i/o, the i/o completes in
> some other backend and by the time this backend gets around to looking
> at the buffer it's already been overwritten again? Do we have to
> initiate I/O again or have you found a way to arrange that this
> backend has the buffer pinned from the time the i/o starts even though
> it doesn't handle the completion?

The initiator of the IO can just keep a pin on the buffer to prevent
that.

There's a lot of complexity around how to handle pinning and locking
around asynchronous buffer IO. I plan to send a separate email with
more details.

In short: I made it so that shared buffer IO holds a separate refcount
for the duration of the IO - that way the issuer can release its own pin
without causing a problem (consider e.g. an error while an IO is in
flight). The pin held by the IO gets released in the completion
callback.  There's similar trickery with locks - but more on that later.


> 2) Have you made (or considered making) things like sequential scans
> (or more likely bitmap index scans) asynchronous at a higher level.
> That is, issue a bunch of asynchronous i/o and then handle the pages
> and return the tuples as the pages arrive. Since sequential scans and
> bitmap scans don't guarantee to read the pages in order they're
> generally free to return tuples from any page in any order. I'm not
> sure how much of a win that would actually be since all the same i/o
> would be getting executed and the savings in shared buffers would be
> small but if there are mostly hot pages you could imagine interleaving
> a lot of in-memory pages with the few i/os instead of sitting idle
> waiting for the async i/o to return.

I have not. Mostly because it'd basically break the entire regression
test suite. And probably a lot of user expectations (yes,
synchronize_seqscans exists, but it pretty rarely triggers).

I'm not sure how big the win would be - the readahead for heapam.c that
is in the patch set tries to keep ahead of the "current scan position"
by a certain amount - so it'll issue the reads for the "missing" pages
before they're needed. Hopefully early enough to avoid unnecessary
stalls. But I'm pretty sure there'll be cases where that'd require a
prohibitively long "readahead distance".

I think there are a lot of interesting things along these lines that we
could tackle, but since they involve changing results and/or larger
changes to avoid those (e.g. detecting when sequential scan order isn't
visible to the user) I think it'd make sense to separate them from the
aio patchset itself.

If we had the infrastructure to detect whether seqscan order matters, we
could also switch the tuple-iteration order to "backwards" in
heapgetpage() - right now iterating forward in "itemid order" causes a
lot of cache misses because the hardware prefetcher doesn't predict that
the tuples are laid out in "decreasing pointer order".


> > ## Stats
> >
> > There are two new views: pg_stat_aios showing AIOs that are currently
> > in-progress, pg_stat_aio_backends showing per-backend statistics about AIO.
>
> This is impressive. How easy is it to correlate with system aio stats?

Could you say a bit more about what you are trying to correlate?

Here's some example IOs from pg_stat_aios.

─[ RECORD 1 ]──────────────────────────────────────────────────────────────────────────────────────
backend_type │ client backend
id           │ 98
gen          │ 13736
op           │ write
scb          │ sb
flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT
ring         │ 5
owner_pid    │ 1866051
merge_with   │ (null)
result       │ 0
desc         │ fd: 38, offset: 329588736, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 238576
─[ RECORD 24 ]─────────────────────────────────────────────────────────────────────────────────────
backend_type │ checkpointer
id           │ 1501
gen          │ 15029
op           │ write
scb          │ sb
flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT
ring         │ 3
owner_pid    │ 1865275
merge_with   │ 1288
result       │ 0
desc         │ fd: 24, offset: 105136128, nbytes: 8192, already_done: 0, release_lock: 1, buffid: 202705
─[ RECORD 31 ]─────────────────────────────────────────────────────────────────────────────────────
backend_type │ walwriter
id           │ 90
gen          │ 26498
op           │ write
scb          │ wal
flags        │ PGAIOIP_IN_PROGRESS | PGAIOIP_INFLIGHT
ring         │ 5
owner_pid    │ 1865281
merge_with   │ 181
result       │ 0
desc         │ write_no: 17, fd: 12, offset: 6307840, nbytes: 1048576, already_done: 0, bufdata: 0x7f94dd670000


And the per-backend AIO view shows information like this (too wide to
display all cols):

SELECT
    pid, last_context, backend_type,
    executed_total_count, issued_total_count, submissions_total_count,
    inflight_count
FROM pg_stat_aio_backends sab JOIN pg_stat_activity USING (pid);
┌─────────┬──────────────┬──────────────────────────────┬──────────────────────┬────────────────────┬─────────────────────────┬────────────────┐
│   pid   │ last_context │         backend_type         │ executed_total_count │ issued_total_count │ submissions_total_count │ inflight_count │
├─────────┼──────────────┼──────────────────────────────┼──────────────────────┼────────────────────┼─────────────────────────┼────────────────┤
│ 1865291 │            3 │ logical replication launcher │                    0 │                  0 │                       0 │              0 │
│ 1865296 │            0 │ client backend               │                   85 │                 85 │                      85 │              0 │
│ 1865341 │            7 │ client backend               │              9574416 │            2905321 │                  345642 │              0 │
...
│ 1873501 │            3 │ client backend               │                 3565 │               3565 │                     467 │             10 │
...
│ 1865277 │            0 │ background writer            │               695402 │             575906 │                   13513 │              0 │
│ 1865275 │            3 │ checkpointer                 │              4664110 │            3488530 │                 1399896 │              0 │
│ 1865281 │            3 │ walwriter                    │                77203 │               7759 │                    7747 │              3 │
└─────────┴──────────────┴──────────────────────────────┴──────────────────────┴────────────────────┴─────────────────────────┴────────────────┘

It's not super obvious at this point, but executed_total_count /
issued_total_count shows how heavily IOs have been merged, and
issued_total_count / submissions_total_count shows how many (already
merged) IOs were submitted together in one "submission batch" (for
io_method=io_uring, one io_uring_enter() call).
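
For instance, the ratios could be eyeballed directly with something like
this (just a sketch against the columns shown above; the derived column
names merge_ratio and ios_per_submission are made up here):

SELECT
    pid, backend_type,
    -- how many individual IOs ended up in one IO issued to the kernel
    round(executed_total_count::numeric / nullif(issued_total_count, 0), 2) AS merge_ratio,
    -- how many (already merged) IOs went out per submission batch
    round(issued_total_count::numeric / nullif(submissions_total_count, 0), 2) AS ios_per_submission
FROM pg_stat_aio_backends JOIN pg_stat_activity USING (pid)
ORDER BY executed_total_count DESC;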


So pg_stat_aios should make it possible to enrich some system stats -
e.g. by being able to split out WAL writes from data file writes. And -
although the current callbacks for that aren't great - it should even
allow splitting the IO by different relfilenodes etc.


I assume pg_stat_aio_backends can also be helpful, e.g. by showing which
backends currently have the deep IO queues that cause latency delays in
other backends, and which backends do a lot of sequential IO (high
executed_total_count / issued_total_count) and which do only random
IO...
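
Something like this could surface both at a glance (once more just a
sketch over the columns shown earlier):

SELECT pid, backend_type, inflight_count,
       executed_total_count, issued_total_count
FROM pg_stat_aio_backends JOIN pg_stat_activity USING (pid)
ORDER BY inflight_count DESC   -- deepest IO queues first
LIMIT 10;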

Greetings,

Andres Freund

